Can I use MechanicalSoup to handle file uploads?
Yes. MechanicalSoup handles file uploads through its support for HTML forms, including the multipart/form-data encoding that file uploads require. You select the form, assign an open file object to the file input, and submit; MechanicalSoup, via the underlying requests library, takes care of encoding and sending the request. That makes it a good fit for automating uploads through standard HTML forms.
Understanding File Upload Forms
Before diving into implementation, it's important to understand how file uploads work in web forms. File uploads typically use HTML forms with enctype="multipart/form-data" and method="POST". These forms contain <input type="file"> elements that allow users to select files from their local system.
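As a quick illustration of those attributes, the following sketch uses only Python's standard-library HTML parser to list the forms on a page that can accept file uploads. The sample markup, the class name UploadFormFinder, and the field names are made up for the example; MechanicalSoup itself parses pages with BeautifulSoup, so this is not how the library works internally.

```python
from html.parser import HTMLParser


class UploadFormFinder(HTMLParser):
    """Collect forms that contain at least one <input type="file">."""

    def __init__(self):
        super().__init__()
        self.upload_forms = []   # one dict per form with a file input
        self._current = None     # form currently being parsed, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {
                "action": attrs.get("action") or "",
                "method": (attrs.get("method") or "get").upper(),
                "enctype": attrs.get("enctype") or "application/x-www-form-urlencoded",
                "file_fields": [],
            }
        elif tag == "input" and self._current is not None:
            if (attrs.get("type") or "").lower() == "file":
                self._current["file_fields"].append(attrs.get("name"))

    def handle_endtag(self, tag):
        if tag == "form" and self._current is not None:
            if self._current["file_fields"]:
                self.upload_forms.append(self._current)
            self._current = None


# Hypothetical page: one search form and one upload form
SAMPLE_PAGE = """
<form action="/search" method="get"><input type="text" name="q"></form>
<form action="/upload" method="post" enctype="multipart/form-data">
  <input type="file" name="file_field">
  <input type="submit" value="Upload">
</form>
"""

finder = UploadFormFinder()
finder.feed(SAMPLE_PAGE)
for form in finder.upload_forms:
    print(form["method"], form["action"], form["enctype"], form["file_fields"])
```

Only the second form is reported, because it is the only one with a file input.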
Basic File Upload with MechanicalSoup
Here's a simple example of uploading a file using MechanicalSoup:
import mechanicalsoup

# Create a browser instance
browser = mechanicalsoup.StatefulBrowser()

# Navigate to the page with the upload form
browser.open("https://example.com/upload")

# Find the form (assumes there's only one form, or it's the first)
form = browser.select_form()

# For pages with multiple forms, you can select one with a CSS selector
# form = browser.select_form('form[action="/upload"]')

# Set the file input field
# The key should match the 'name' attribute of the file input
# (MechanicalSoup 1.3.0 and later expect an open file object, not a path string)
browser['file_field'] = open('/path/to/your/file.txt', 'rb')

# Submit the form
response = browser.submit_selected()

# Check if upload was successful
if response.status_code == 200:
    print("File uploaded successfully!")
else:
    print(f"Upload failed with status code: {response.status_code}")
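To make the encoding step less opaque, here is a small offline sketch of what a multipart/form-data request looks like. MechanicalSoup sends its requests through the requests library, so the example prepares (but never sends) a comparable request with requests directly. The URL, the field name file_field, the extra title field, and the file contents are placeholders.

```python
import io

import requests

# In-memory stand-in for open('/path/to/your/file.txt', 'rb')
fake_file = io.BytesIO(b"hello upload\n")

# Build the request without sending it, so no network access is needed
request = requests.Request(
    "POST",
    "https://example.com/upload",
    files={"file_field": ("file.txt", fake_file, "text/plain")},
    data={"title": "My Document"},
)
prepared = request.prepare()

# The header carries the boundary that separates the parts of the body
content_type = prepared.headers["Content-Type"]
print(content_type)

# The body holds one part per form field, including the file contents
print(prepared.body.decode("utf-8", errors="replace"))
```

Each part of the printed body starts with a Content-Disposition header naming the field; the file part also carries the file name and its content type.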
Advanced File Upload Scenarios
Multiple File Uploads
Some forms allow multiple file uploads. Here's how to handle them:
import mechanicalsoup
browser = mechanicalsoup.StatefulBrowser()
browser.open("https://example.com/multi-upload")
# Select the form
form = browser.select_form()
# Upload to different file fields, one open file object per field
browser['file1'] = open('/path/to/document.pdf', 'rb')
browser['file2'] = open('/path/to/image.jpg', 'rb')
# Note: each file input takes a single open file object; assigning a list of
# files to one field (such as 'files[]') is not supported by MechanicalSoup.
# Sending several files under one field name means building the request
# yourself, for example with requests.
response = browser.submit_selected()
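When one field has to carry several files (an input declared with the multiple attribute, often named like files[]), one option is to build that request with requests directly, since MechanicalSoup is built on it. The sketch below only prepares the request and inspects the body, so it runs offline; the URL, field name, and file contents are placeholders.

```python
import io

import requests

# In-memory stand-ins for three files opened in binary mode
files = [
    ("files[]", ("file1.txt", io.BytesIO(b"first"), "text/plain")),
    ("files[]", ("file2.txt", io.BytesIO(b"second"), "text/plain")),
    ("files[]", ("file3.txt", io.BytesIO(b"third"), "text/plain")),
]

# Prepared only, never sent. To actually upload after navigating with
# MechanicalSoup, the same files argument could be passed to
# browser.session.post(...), which reuses the browser's cookies.
prepared = requests.Request(
    "POST", "https://example.com/multi-upload", files=files
).prepare()

# Count how many parts of the body use the shared field name
parts = prepared.body.count(b'name="files[]"')
print(f"{parts} file parts share the field name files[]")
```

Passing a list of (field, file) tuples rather than a dict is what allows the same field name to appear more than once.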
Form with Additional Fields
Real-world upload forms often include additional fields like descriptions, categories, or metadata:
import mechanicalsoup
browser = mechanicalsoup.StatefulBrowser()
browser.open("https://example.com/upload-with-metadata")
# Select the form
form = browser.select_form()
# Fill in text fields
browser['title'] = "My Document"
browser['description'] = "Important project documentation"
browser['category'] = "documents"
# Upload the file
browser['document'] = open('/path/to/document.pdf', 'rb')
# Submit with all fields
response = browser.submit_selected()
Handling Different File Types
MechanicalSoup uploads every file type the same way, so the format only matters for any metadata fields the form expects. Here's an example that sets such a field based on the file extension:
import mechanicalsoup
import os


def upload_file(file_path, upload_url):
    browser = mechanicalsoup.StatefulBrowser()
    browser.open(upload_url)

    # Get file information
    filename = os.path.basename(file_path)
    file_size = os.path.getsize(file_path)
    print(f"Uploading {filename} ({file_size} bytes)")

    form = browser.select_form()

    # Set additional metadata based on file type
    # (assumes the form has a 'file_type' input; extensions compared case-insensitively)
    lowered = filename.lower()
    if lowered.endswith('.pdf'):
        browser['file_type'] = 'document'
    elif lowered.endswith(('.jpg', '.png', '.gif')):
        browser['file_type'] = 'image'
    elif lowered.endswith(('.mp4', '.avi', '.mov')):
        browser['file_type'] = 'video'

    # Upload the file, keeping it open until the form has been submitted
    with open(file_path, 'rb') as file:
        browser['file'] = file
        response = browser.submit_selected()
    return response


# Usage examples
upload_file('/path/to/report.pdf', 'https://example.com/upload')
upload_file('/path/to/photo.jpg', 'https://example.com/upload')
upload_file('/path/to/video.mp4', 'https://example.com/upload')
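The extension checks above can also be driven by Python's mimetypes module, which avoids maintaining lists of suffixes by hand. This is a minimal sketch; the function name classify_file and the category labels are assumptions chosen to match the example form, not anything MechanicalSoup defines.

```python
import mimetypes


def classify_file(file_path):
    """Return 'document', 'image', 'video', or 'other' for a file path."""
    # guess_type looks only at the file name, not at the file contents
    mime_type, _ = mimetypes.guess_type(file_path.lower())
    if mime_type is None:
        return "other"
    if mime_type == "application/pdf":
        return "document"
    major = mime_type.split("/", 1)[0]
    if major in ("image", "video"):
        return major
    return "other"


for path in ["/path/to/report.pdf", "/path/to/PHOTO.JPG", "/path/to/video.mp4"]:
    print(path, "->", classify_file(path))
```

The returned label could then be assigned to the form's metadata field in place of the if/elif chain.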
Error Handling and Validation
Robust file upload implementations should include proper error handling:
import mechanicalsoup
import os


def safe_file_upload(file_path, upload_url, max_size_mb=10):
    """
    Safely upload a file with error handling and validation
    """
    try:
        # Validate file exists and size
        if not os.path.exists(file_path):
            raise FileNotFoundError(f"File not found: {file_path}")
        file_size = os.path.getsize(file_path)
        max_size_bytes = max_size_mb * 1024 * 1024
        if file_size > max_size_bytes:
            raise ValueError(f"File too large: {file_size} bytes (max: {max_size_bytes})")

        # Create browser and navigate
        browser = mechanicalsoup.StatefulBrowser()
        browser.set_user_agent('MechanicalSoup File Uploader 1.0')
        response = browser.open(upload_url)
        if response.status_code != 200:
            raise Exception(f"Failed to access upload page: {response.status_code}")

        # Find and select form
        forms = browser.page.find_all('form')
        if not forms:
            raise Exception("No forms found on the page")

        # Look for file input in forms
        upload_form = None
        for form in forms:
            if form.find('input', {'type': 'file'}):
                upload_form = form
                break
        if not upload_form:
            raise Exception("No file upload form found")
        browser.select_form(upload_form)

        # Find file input field name
        file_input = upload_form.find('input', {'type': 'file'})
        field_name = file_input.get('name', 'file')

        # Upload file; submit inside the with block so the file is still open
        with open(file_path, 'rb') as file:
            browser[field_name] = file
            response = browser.submit_selected()

        # Check response
        if response.status_code in [200, 201, 302]:
            print(f"File uploaded successfully: {os.path.basename(file_path)}")
            return True
        else:
            print(f"Upload failed with status: {response.status_code}")
            return False
    except Exception as e:
        print(f"Upload error: {str(e)}")
        return False


# Usage
success = safe_file_upload('/path/to/file.txt', 'https://example.com/upload')
Working with Authentication
Many file upload services require authentication. Here's how to handle login before uploading:
import mechanicalsoup


class AuthenticatedUploader:
    def __init__(self):
        self.browser = mechanicalsoup.StatefulBrowser()
        self.logged_in = False

    def login(self, login_url, username, password):
        """Login to the service"""
        self.browser.open(login_url)

        # Find login form
        form = self.browser.select_form()

        # Fill credentials (field names may vary)
        self.browser['username'] = username  # or 'email', 'login', etc.
        self.browser['password'] = password
        response = self.browser.submit_selected()

        # Check if login was successful (site-specific heuristic)
        if 'dashboard' in response.url or 'welcome' in response.text.lower():
            self.logged_in = True
            print("Login successful")
        else:
            print("Login failed")
        return self.logged_in

    def upload_file(self, upload_url, file_path, **form_data):
        """Upload file after authentication"""
        if not self.logged_in:
            raise Exception("Must login first")
        self.browser.open(upload_url)
        form = self.browser.select_form()

        # Fill additional form data
        for field, value in form_data.items():
            self.browser[field] = value

        # Upload file; the session cookies from login() are sent automatically
        with open(file_path, 'rb') as file:
            self.browser['file'] = file
            response = self.browser.submit_selected()
        return response


# Usage
uploader = AuthenticatedUploader()
uploader.login('https://example.com/login', 'user@example.com', 'password')
uploader.upload_file(
    'https://example.com/upload',
    '/path/to/file.txt',
    title="My Document",
    category="reports"
)
Monitoring Upload Progress
For large files, you might want to provide progress feedback:
import mechanicalsoup
import os
from tqdm import tqdm


class ProgressUploader:
    def __init__(self):
        self.browser = mechanicalsoup.StatefulBrowser()

    def upload_with_progress(self, upload_url, file_path):
        """Upload file with progress monitoring"""
        file_size = os.path.getsize(file_path)
        filename = os.path.basename(file_path)
        print(f"Uploading {filename} ({file_size:,} bytes)")
        self.browser.open(upload_url)
        form = self.browser.select_form()

        # Create progress bar
        with tqdm(total=file_size, unit='B', unit_scale=True, desc=filename) as pbar:
            # Note: MechanicalSoup doesn't provide built-in progress callbacks
            # This is a simplified example - real progress tracking would require
            # custom request handling or switching to requests with custom adapters
            with open(file_path, 'rb') as file:
                self.browser['file'] = file
                response = self.browser.submit_selected()
            pbar.update(file_size)  # Simplified - jumps to 100% once the upload returns
        return response


# For actual progress tracking during upload, consider using requests directly:
import os
import requests
from requests_toolbelt.multipart.encoder import MultipartEncoder, MultipartEncoderMonitor


def upload_with_real_progress(upload_url, file_path):
    """Upload with real-time progress using requests"""
    def progress_callback(monitor):
        progress = (monitor.bytes_read / monitor.len) * 100
        print(f"\rProgress: {progress:.1f}%", end='', flush=True)

    with open(file_path, 'rb') as file:
        encoder = MultipartEncoder(
            fields={'file': (os.path.basename(file_path), file, 'application/octet-stream')}
        )
        monitor = MultipartEncoderMonitor(encoder, progress_callback)
        response = requests.post(
            upload_url,
            data=monitor,
            headers={'Content-Type': monitor.content_type}
        )
    print(f"\nUpload completed with status: {response.status_code}")
    return response
Best Practices for File Uploads
1. Always Use Context Managers
# Good: Using context manager, submitting while the file is still open
with open(file_path, 'rb') as file:
    browser['file'] = file
    response = browser.submit_selected()

# Avoid: Leaving files open
browser['file'] = open(file_path, 'rb')  # File might not be closed properly
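When a form has several file inputs, nesting one with block per file gets awkward. A possible pattern, sketched here with contextlib.ExitStack from the standard library, opens all files up front and closes them together. The helper name open_all and the temporary files are illustrative; in a real script the commented lines are where the browser fields would be assigned and the form submitted.

```python
import os
import tempfile
from contextlib import ExitStack


def open_all(stack, paths_by_field):
    """Open every path in binary mode and register it with the ExitStack."""
    return {
        field: stack.enter_context(open(path, "rb"))
        for field, path in paths_by_field.items()
    }


# Create two throwaway files so the sketch runs anywhere
tmp_dir = tempfile.mkdtemp()
paths = {}
for field, content in {"file1": b"first", "file2": b"second"}.items():
    paths[field] = os.path.join(tmp_dir, field + ".bin")
    with open(paths[field], "wb") as out:
        out.write(content)

with ExitStack() as stack:
    handles = open_all(stack, paths)
    # In a real upload: for field, fh in handles.items(): browser[field] = fh
    # followed by browser.submit_selected() while the files are still open.
    first_bytes = {field: fh.read() for field, fh in handles.items()}

# Leaving the with block closes every registered file
print({field: fh.closed for field, fh in handles.items()})
```

The stack also closes the files that were already opened if a later open() call fails, which is harder to get right with manual bookkeeping.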
2. Validate Files Before Upload
import mimetypes
import os


def validate_file(file_path, allowed_types=None, max_size_mb=10):
    """Validate file before upload"""
    if not os.path.exists(file_path):
        return False, "File does not exist"

    # Check file size
    size_mb = os.path.getsize(file_path) / (1024 * 1024)
    if size_mb > max_size_mb:
        return False, f"File too large: {size_mb:.1f}MB (max: {max_size_mb}MB)"

    # Check file type (guessed from the file name, not the contents)
    if allowed_types:
        mime_type, _ = mimetypes.guess_type(file_path)
        if mime_type not in allowed_types:
            return False, f"File type not allowed: {mime_type}"
    return True, "File is valid"


# Usage
allowed_types = ['image/jpeg', 'image/png', 'application/pdf']
is_valid, message = validate_file('/path/to/file.jpg', allowed_types)
if is_valid:
    # Proceed with upload
    pass
else:
    print(f"Validation failed: {message}")
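Because mimetypes.guess_type only looks at the file name, a renamed file passes the type check above. If that matters, one hedged option is to compare the first bytes of the file against known signatures. The table below covers just three formats and the function names are illustrative; a dedicated detection library would be more thorough.

```python
# Leading bytes ("magic numbers") for a few common formats
SIGNATURES = {
    b"%PDF-": "application/pdf",
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"\xff\xd8\xff": "image/jpeg",
}


def sniff_mime_type(header):
    """Guess a MIME type from the first bytes of a file, or return None."""
    for signature, mime_type in SIGNATURES.items():
        if header.startswith(signature):
            return mime_type
    return None


def sniff_file(file_path):
    """Read just enough of the file to compare against the signatures."""
    with open(file_path, "rb") as fh:
        return sniff_mime_type(fh.read(16))


# PDF data is recognised from its contents, whatever the file is called
print(sniff_mime_type(b"%PDF-1.7\n..."))
print(sniff_mime_type(b"plain text"))
```

A validation step could call sniff_file() and reject the upload when the result disagrees with the type guessed from the extension.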
3. Handle Network Issues
import time
import random


def upload_with_retry(browser, file_path, max_retries=3):
    """Upload with retry logic for network issues"""
    # Assumes the upload form is currently selected on `browser`. After a
    # completed submission the browser moves to the response page, so re-open
    # the upload page and re-select the form before retrying in that case.
    for attempt in range(max_retries):
        try:
            with open(file_path, 'rb') as file:
                browser['file'] = file
                response = browser.submit_selected()
            if response.status_code in [200, 201]:
                return response
            else:
                raise Exception(f"HTTP {response.status_code}")
        except Exception as e:
            if attempt < max_retries - 1:
                # Exponential backoff with jitter
                wait_time = (2 ** attempt) + random.uniform(0, 1)
                print(f"Attempt {attempt + 1} failed: {e}. Retrying in {wait_time:.1f}s...")
                time.sleep(wait_time)
            else:
                print(f"All {max_retries} attempts failed: {e}")
                raise
    return None
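The retry loop above is tied to one browser and one field name. A more reusable variant, sketched below under the assumption that the whole upload (open page, select form, attach file, submit) is wrapped in a single callable, separates the backoff logic from the upload itself. The names retry_with_backoff and flaky_upload are hypothetical, and the sleep function is injectable so the demo runs instantly.

```python
import random
import time


def retry_with_backoff(action, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call action() until it succeeds or max_retries attempts have failed."""
    for attempt in range(max_retries):
        try:
            return action()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the last error
            # Exponential backoff (base, 2*base, 4*base, ...) plus jitter
            sleep(base_delay * (2 ** attempt) + random.uniform(0, 1))


# Demo with a fake action that fails twice, then succeeds
calls = {"count": 0}
delays = []


def flaky_upload():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated network failure")
    return "uploaded"


# Record the delays instead of actually sleeping
result = retry_with_backoff(flaky_upload, sleep=delays.append)
print(result, "after", calls["count"], "attempts")
```

Because each attempt calls the full action again, the form is freshly loaded and selected every time, which sidesteps stale browser state between retries.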
Comparison with Other Tools
While MechanicalSoup excels at handling forms and file uploads, other tools might be better suited for specific scenarios:
- For JavaScript-heavy upload interfaces: Consider using browser automation tools that can handle dynamic content
- For REST API uploads: Use the requests library directly with multipart encoding
- For large file uploads: Consider streaming uploads with progress tracking using requests and requests-toolbelt
Troubleshooting Common Issues
Issue 1: Form Not Found
# Check if forms exist
forms = browser.page.find_all('form')
print(f"Found {len(forms)} forms on the page")
for i, form in enumerate(forms):
    print(f"Form {i}: {form.get('action', 'No action')} - {form.get('method', 'GET')}")
    file_inputs = form.find_all('input', {'type': 'file'})
    print(f"  File inputs: {len(file_inputs)}")
Issue 2: Wrong Field Name
# Find all file input fields
file_inputs = browser.page.find_all('input', {'type': 'file'})
for inp in file_inputs:
    print(f"Field name: {inp.get('name')}, ID: {inp.get('id')}")
Issue 3: File Size Limits
Always check the form for hidden fields that might indicate size limits:
# Check for hidden size limit fields
hidden_inputs = browser.page.find_all('input', {'type': 'hidden'})
for inp in hidden_inputs:
    name = inp.get('name', '')
    if 'size' in name.lower() or 'max' in name.lower():
        print(f"Size limit field: {name} = {inp.get('value')}")
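One concrete case is the MAX_FILE_SIZE hidden field that PHP-based upload forms often include, with the limit given in bytes. The sketch below assumes such a field is present and uses the standard-library parser on made-up markup so that it runs offline; with MechanicalSoup the same value could be read from browser.page instead. Such a field is only a hint, and the server applies its own limit regardless.

```python
from html.parser import HTMLParser


class HiddenFieldCollector(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and (attrs.get("type") or "").lower() == "hidden":
            if attrs.get("name"):
                self.fields[attrs["name"]] = attrs.get("value") or ""


def declared_limit(fields, name="MAX_FILE_SIZE"):
    """Return the declared size limit in bytes, or None if absent or invalid."""
    value = fields.get(name, "")
    return int(value) if value.isdigit() else None


# Hypothetical upload form declaring a 2 MiB limit
SAMPLE_FORM = """
<form action="/upload" method="post" enctype="multipart/form-data">
  <input type="hidden" name="MAX_FILE_SIZE" value="2097152">
  <input type="file" name="file">
</form>
"""

collector = HiddenFieldCollector()
collector.feed(SAMPLE_FORM)
limit = declared_limit(collector.fields)

file_size = 5 * 1024 * 1024  # pretend the local file is 5 MiB
if limit is not None and file_size > limit:
    print(f"File ({file_size} bytes) exceeds the declared limit of {limit} bytes")
```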
Conclusion
MechanicalSoup provides a robust and Python-friendly way to handle file uploads through web forms. Its automatic form detection and handling make it ideal for scenarios where you need to interact with standard HTML upload forms. Combined with proper error handling, validation, and retry logic, MechanicalSoup can reliably automate file upload workflows in your web scraping and automation projects.
The key advantages of using MechanicalSoup for file uploads include its simplicity, automatic form handling, and seamless integration with session management and authentication. However, for more complex scenarios involving JavaScript-heavy interfaces or real-time progress tracking, you might need to consider alternative approaches or combine MechanicalSoup with other tools.