How do I handle file downloads with MechanicalSoup?
MechanicalSoup is a powerful Python library that simplifies web scraping and browser automation tasks. When it comes to downloading files from web pages, MechanicalSoup provides several approaches depending on your specific needs. This guide will walk you through various methods to handle file downloads effectively using MechanicalSoup.
Understanding File Downloads in MechanicalSoup
MechanicalSoup is built on top of the requests library and BeautifulSoup, which means it inherits robust HTTP handling capabilities. File downloads typically involve making HTTP requests to specific URLs and saving the response content to local files.
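Because the underlying requests.Session is exposed as browser.session, anything you configure there (headers, cookies, proxies) applies to every request the browser makes. A quick way to see this:

import mechanicalsoup

browser = mechanicalsoup.StatefulBrowser()

# browser.session is a plain requests.Session, so session-level
# configuration carries over to every subsequent download
browser.session.headers.update({'User-Agent': 'Docs-Example/1.0'})
print(type(browser.session))  # <class 'requests.sessions.Session'>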
Basic File Download
The simplest way to download a file with MechanicalSoup is to navigate to the download URL and save the response content:
import mechanicalsoup
import os
# Create a browser instance
browser = mechanicalsoup.StatefulBrowser()
# Navigate to the file URL
response = browser.get("https://example.com/path/to/file.pdf")
# Save the file
with open("downloaded_file.pdf", "wb") as file:
file.write(response.content)
print("File downloaded successfully!")
Downloading Files from Forms
Many websites require form submission or authentication before allowing file downloads. MechanicalSoup excels at handling these scenarios:
import mechanicalsoup
# Create browser instance with session management
browser = mechanicalsoup.StatefulBrowser(
    user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
)
# Navigate to the page with the download form
browser.open("https://example.com/download-page")
# Find and fill the form
browser.select_form('form[action="/download"]')
browser["username"] = "your_username"
browser["password"] = "your_password"
# Submit the form
response = browser.submit_selected()
# Check if the response contains a file
if response.headers.get('content-type', '').startswith('application/'):
    filename = response.headers.get('content-disposition', '').split('filename=')[-1].strip('"')
    with open(filename or "downloaded_file", "wb") as file:
        file.write(response.content)
    print(f"Downloaded: {filename}")
else:
    print("No file found in response")
Handling Large File Downloads with Streaming
For large files, it's important to use streaming to avoid loading the entire file into memory:
import mechanicalsoup
import os
from urllib.parse import urlparse
def download_large_file(browser, url, local_filename=None):
    """Download large files using streaming to manage memory efficiently."""
    # Go through browser.session directly: browser.get() may read the
    # response body to decide whether to parse it as HTML, which defeats
    # streaming. browser.session is the underlying requests.Session.
    response = browser.session.get(url, stream=True)
    response.raise_for_status()
    # Determine filename
    if not local_filename:
        # Try to get the filename from the Content-Disposition header
        content_disposition = response.headers.get('content-disposition', '')
        if 'filename=' in content_disposition:
            local_filename = content_disposition.split('filename=')[-1].strip('"')
        else:
            # Fall back to the last component of the URL path
            local_filename = os.path.basename(urlparse(url).path) or "downloaded_file"
    # Download in chunks
    total_size = int(response.headers.get('content-length', 0))
    downloaded_size = 0
    with open(local_filename, 'wb') as file:
        for chunk in response.iter_content(chunk_size=8192):
            if chunk:
                file.write(chunk)
                downloaded_size += len(chunk)
                # Progress indicator
                if total_size > 0:
                    progress = (downloaded_size / total_size) * 100
                    print(f"\rDownloading: {progress:.1f}%", end="", flush=True)
    print(f"\nDownload completed: {local_filename}")
    return local_filename
# Usage example
browser = mechanicalsoup.StatefulBrowser()
download_large_file(browser, "https://example.com/large-file.zip")
Downloading Multiple Files
When downloading multiple files, it's efficient to reuse the browser session and handle errors gracefully:
import mechanicalsoup
import os
import time
from urllib.parse import urljoin
def download_multiple_files(base_url, file_urls, download_dir="downloads"):
    """Download multiple files with error handling and rate limiting."""
    # Create download directory
    os.makedirs(download_dir, exist_ok=True)
    # Initialize browser with session persistence
    browser = mechanicalsoup.StatefulBrowser()
    browser.session.headers.update({
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    })
    downloaded_files = []
    failed_downloads = []
    for file_url in file_urls:
        try:
            # urljoin leaves absolute URLs untouched, so no special-casing is needed
            full_url = urljoin(base_url, file_url)
            print(f"Downloading: {full_url}")
            response = browser.get(full_url)
            response.raise_for_status()
            # Generate filename
            filename = os.path.basename(file_url) or f"file_{len(downloaded_files)}"
            filepath = os.path.join(download_dir, filename)
            # Save file
            with open(filepath, 'wb') as file:
                file.write(response.content)
            downloaded_files.append(filepath)
            print(f"✓ Downloaded: {filename}")
            # Rate limiting to be respectful
            time.sleep(1)
        except Exception as e:
            print(f"✗ Failed to download {file_url}: {e}")
            failed_downloads.append(file_url)
    return downloaded_files, failed_downloads
# Usage example
file_list = [
    "/downloads/document1.pdf",
    "/downloads/image1.jpg",
    "/downloads/data.csv"
]
success, failed = download_multiple_files("https://example.com", file_list)
print(f"Successfully downloaded: {len(success)} files")
print(f"Failed downloads: {len(failed)} files")
Handling Authentication and Cookies
For websites requiring authentication, MechanicalSoup maintains session state automatically:
import mechanicalsoup
def download_authenticated_file(login_url, download_url, username, password):
    """Download files from authenticated areas."""
    browser = mechanicalsoup.StatefulBrowser()
    # Log in first
    browser.open(login_url)
    # Find and fill the login form (select_form() picks the first form)
    browser.select_form()
    browser["username"] = username  # Adjust field names to match the site's form
    browser["password"] = password
    # Submit login
    login_response = browser.submit_selected()
    # Check if login was successful (heuristics; adjust for your site)
    if "dashboard" in login_response.url or "welcome" in browser.get_current_page().text.lower():
        print("Login successful")
        # Now download the file
        download_response = browser.get(download_url)
        if download_response.headers.get('content-type', '').startswith('application/'):
            filename = "authenticated_download.pdf"
            with open(filename, 'wb') as file:
                file.write(download_response.content)
            print(f"Downloaded: {filename}")
        else:
            print("Download failed or file not found")
    else:
        print("Login failed")
# Usage
download_authenticated_file(
    "https://example.com/login",
    "https://example.com/secure/document.pdf",
    "your_username",
    "your_password"
)
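On the cookie side, the authenticated session lives in browser.session.cookies, a standard requests cookie jar, so it can be saved and restored between runs to skip repeated logins. A minimal sketch using requests' helpers (the file name is arbitrary, and note that dict_from_cookiejar flattens domain and path attributes):

import json
import mechanicalsoup
import requests.utils

browser = mechanicalsoup.StatefulBrowser()
# ... log in as shown above ...

# Save cookies after logging in (name/value pairs only)
with open("cookies.json", "w") as f:
    json.dump(requests.utils.dict_from_cookiejar(browser.session.cookies), f)

# Restore cookies in a later run to reuse the session
with open("cookies.json") as f:
    browser.session.cookies = requests.utils.cookiejar_from_dict(json.load(f))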
Advanced Download Features
Handling Different Content Types
import mechanicalsoup
import mimetypes
import time

def smart_download(browser, url):
    """Download a file with automatic type detection and naming."""
    response = browser.get(url)
    response.raise_for_status()
    # Get the content type, ignoring any charset parameter
    content_type = response.headers.get('content-type', '').split(';')[0]
    # Determine file extension
    extension = mimetypes.guess_extension(content_type)
    if not extension:
        extension = ".bin"  # fallback for unknown types
    # Generate a timestamp-based filename
    filename = f"download_{int(time.time())}{extension}"
    # Save file
    with open(filename, 'wb') as file:
        file.write(response.content)
    print(f"Downloaded {content_type} file as: {filename}")
    return filename
Download Verification
import hashlib

def download_with_verification(browser, url, expected_hash=None):
    """Download a file and verify its integrity."""
    response = browser.get(url)
    response.raise_for_status()
    content = response.content
    # Calculate hash (MD5 is fine for detecting corruption; use SHA-256
    # if the checksum also needs to be tamper-resistant)
    file_hash = hashlib.md5(content).hexdigest()
    if expected_hash and file_hash != expected_hash:
        raise ValueError(f"File integrity check failed. Expected: {expected_hash}, Got: {file_hash}")
    filename = "verified_download.bin"
    with open(filename, 'wb') as file:
        file.write(content)
    print(f"Downloaded and verified: {filename} (Hash: {file_hash})")
    return filename, file_hash
Best Practices for File Downloads
1. Always Handle Errors
import requests

try:
    response = browser.get(download_url)
    response.raise_for_status()
except requests.exceptions.RequestException as e:
    print(f"Download failed: {e}")
2. Respect Rate Limits
import time
# Add delays between requests
time.sleep(1) # Wait 1 second between downloads
3. Use Appropriate Headers
browser.session.headers.update({
    'User-Agent': 'Your-App/1.0',
    'Accept': 'application/octet-stream, */*'
})
4. Validate File Types
allowed_types = ['application/pdf', 'image/jpeg', 'application/zip']
# Strip any parameters (e.g. "; charset=...") before comparing
content_type = response.headers.get('content-type', '').split(';')[0]
if content_type not in allowed_types:
    raise ValueError("Unsupported file type")
Integration with Other Tools
MechanicalSoup works well with other Python libraries for enhanced functionality. For more complex scenarios involving JavaScript-heavy sites, you might want to consider how to handle file downloads in Puppeteer, which provides more advanced browser automation capabilities.
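For example, pairing the streaming approach from earlier with the tqdm library (assuming it is installed; the URL and filename are placeholders) gives a proper progress bar:

import mechanicalsoup
from tqdm import tqdm

browser = mechanicalsoup.StatefulBrowser()
# Go through browser.session so the body is streamed, not parsed
response = browser.session.get("https://example.com/large-file.zip", stream=True)
response.raise_for_status()

total = int(response.headers.get('content-length', 0))
with open("large-file.zip", "wb") as file, tqdm(
        total=total, unit="B", unit_scale=True, desc="Downloading") as bar:
    for chunk in response.iter_content(chunk_size=8192):
        file.write(chunk)
        bar.update(len(chunk))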
Troubleshooting Common Issues
Issue: Downloads Fail with 403 Forbidden
Solution: Add proper headers and ensure you're authenticated:
browser.session.headers.update({
    'Referer': 'https://example.com/page-with-download-link',
    'User-Agent': 'Mozilla/5.0 (compatible browser string)'
})
Issue: Large Files Cause Memory Issues
Solution: Use streaming downloads as shown in the large file example above.
Issue: Download Links are JavaScript-Generated
Solution: MechanicalSoup cannot execute JavaScript. Consider using browser automation tools for such cases.
Conclusion
MechanicalSoup provides excellent capabilities for downloading files from web pages, especially when dealing with forms, authentication, and session management. Its integration with the requests library makes it powerful for HTTP-based downloads, while BeautifulSoup integration helps with parsing and navigation.
For most web scraping scenarios involving file downloads, MechanicalSoup offers the right balance of simplicity and functionality. Remember to always respect robots.txt files, implement proper error handling, and be mindful of rate limiting to maintain good relationships with web servers.
Whether you're downloading single files, handling bulk downloads, or working with authenticated areas, MechanicalSoup's session management and form handling capabilities make it an excellent choice for Python developers working on web scraping projects.