
Can I use MechanicalSoup to handle file uploads?

Yes. MechanicalSoup supports file uploads through its HTML form handling: it selects the form, encodes the request as multipart/form-data, and submits it through the underlying requests session. Since version 1.3.0 you attach a file by assigning an open file object (typically opened in binary mode) to the file input's field. Earlier versions accepted a path string; that behavior was removed to fix a vulnerability (CVE-2023-34457) that let a malicious server read arbitrary files from the client.

Understanding File Upload Forms

Before diving into implementation, it's important to understand how file uploads work in web forms. File uploads typically use HTML forms with enctype="multipart/form-data" and method="POST". These forms contain <input type="file"> elements that allow users to select files from their local system.
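To make these requirements concrete, here is a minimal sketch that checks whether a form in a page is set up for file uploads. It uses only Python's standard-library HTMLParser; the names UploadFormInspector and is_upload_ready and the sample HTML are illustrative assumptions, not part of MechanicalSoup.

```python
from html.parser import HTMLParser


class UploadFormInspector(HTMLParser):
    """Collect each form's method, enctype, and file input names."""

    def __init__(self):
        super().__init__()
        self.forms = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.forms.append({
                # Browsers default to GET and URL-encoded bodies.
                "method": (attrs.get("method") or "get").lower(),
                "enctype": (attrs.get("enctype")
                            or "application/x-www-form-urlencoded").lower(),
                "file_inputs": [],
            })
        elif (tag == "input" and self.forms
              and (attrs.get("type") or "").lower() == "file"):
            self.forms[-1]["file_inputs"].append(attrs.get("name"))


def is_upload_ready(form):
    """A form can carry file contents only if it POSTs multipart/form-data."""
    return (
        form["method"] == "post"
        and form["enctype"] == "multipart/form-data"
        and bool(form["file_inputs"])
    )


# Illustrative page with one upload form and one ordinary search form.
SAMPLE_HTML = """
<form action="/upload" method="POST" enctype="multipart/form-data">
  <input type="text" name="title">
  <input type="file" name="file_field">
</form>
<form action="/search" method="GET">
  <input type="text" name="q">
</form>
"""

inspector = UploadFormInspector()
inspector.feed(SAMPLE_HTML)
for form in inspector.forms:
    print(form["file_inputs"], is_upload_ready(form))
```

The names collected in file_inputs are the keys you would later assign to when attaching files.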

Basic File Upload with MechanicalSoup

Here's a simple example of uploading a file using MechanicalSoup:

import mechanicalsoup

# Create a browser instance
browser = mechanicalsoup.StatefulBrowser()

# Navigate to the page with the upload form
browser.open("https://example.com/upload")

# Find the form (assumes there's only one form, or it's the first)
form = browser.select_form()

# For pages with multiple forms, you can select one with a CSS selector
# form = browser.select_form('form[action="/upload"]')

# Set the file input field
# The key should match the 'name' attribute of the file input
browser['file_field'] = open('/path/to/your/file.txt', 'rb')

# Submit the form
response = browser.submit_selected()

# Check if upload was successful
if response.status_code == 200:
    print("File uploaded successfully!")
else:
    print(f"Upload failed with status code: {response.status_code}")
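A 200 status only shows that the server answered. Some upload endpoints redirect on success, and others return 200 with an error message in the page. The sketch below is one possible heuristic, not a MechanicalSoup feature: it relies on the ok, text, history, status_code, and url attributes of the requests Response that submit_selected() returns. The error phrases are hypothetical, and stand-in objects replace real responses so the example runs offline.

```python
from types import SimpleNamespace

# Hypothetical error phrases; adjust them to the site you are automating.
ERROR_MARKERS = ("upload failed", "file too large", "invalid file")


def upload_succeeded(response, error_markers=ERROR_MARKERS):
    """Heuristic success check for the object returned by submit_selected()."""
    if not response.ok:  # ok is True for status codes below 400
        return False
    body = response.text.lower()
    return not any(marker in body for marker in error_markers)


def describe(response):
    """Summarise the final status and any redirects that were followed."""
    hops = [r.status_code for r in response.history]
    return f"final={response.status_code} redirects={hops} url={response.url}"


# Stand-in objects that mimic the few attributes used above.
good = SimpleNamespace(ok=True, status_code=200, text="<p>Thanks!</p>",
                       history=[SimpleNamespace(status_code=302)],
                       url="https://example.com/done")
bad = SimpleNamespace(ok=True, status_code=200, text="<p>Upload failed</p>",
                      history=[], url="https://example.com/upload")

print(upload_succeeded(good), describe(good))
print(upload_succeeded(bad), describe(bad))
```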

Advanced File Upload Scenarios

Multiple File Uploads

MechanicalSoup accepts one open file object per file input, so give each file its own named field. Here's how to handle a form with several file inputs:

import mechanicalsoup

browser = mechanicalsoup.StatefulBrowser()
browser.open("https://example.com/multi-upload")

# Select the form
form = browser.select_form()

# Assigning a list of files to a single field, e.g.
#   browser['files[]'] = [open(...), open(...)]
# is rejected since MechanicalSoup 1.3.0 (ValueError): each file input
# takes exactly one open file object.

# Upload one file per field
browser['file1'] = open('/path/to/document.pdf', 'rb')
browser['file2'] = open('/path/to/image.jpg', 'rb')

response = browser.submit_selected()
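When several files go to separate fields, contextlib.ExitStack can keep all of them open until the form is submitted and then close every one. The following is a sketch under stated assumptions: upload_files is a hypothetical helper, and FakeBrowser is a stand-in that mimics only item assignment and submit_selected() so the example runs without a server. With a real StatefulBrowser you would select the form first and pass the browser in.

```python
import os
import tempfile
from contextlib import ExitStack


def upload_files(browser, paths_by_field):
    """Assign one open file per field, submit, then close every file."""
    with ExitStack() as stack:
        for field, path in paths_by_field.items():
            browser[field] = stack.enter_context(open(path, "rb"))
        # Submit while the files are still open.
        return browser.submit_selected()


class FakeBrowser:
    """Stand-in for a StatefulBrowser with a form already selected."""

    def __init__(self):
        self.fields = {}

    def __setitem__(self, name, value):
        self.fields[name] = value

    def submit_selected(self):
        # Read the files the way a real submission would need to.
        return {name: f.read() for name, f in self.fields.items()}


with tempfile.TemporaryDirectory() as tmp:
    paths = {}
    for field, content in (("file1", b"first"), ("file2", b"second")):
        paths[field] = os.path.join(tmp, field + ".txt")
        with open(paths[field], "wb") as f:
            f.write(content)

    browser = FakeBrowser()
    sent = upload_files(browser, paths)
    print(sent)
    print(all(f.closed for f in browser.fields.values()))
```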

Form with Additional Fields

Real-world upload forms often include additional fields like descriptions, categories, or metadata:

import mechanicalsoup

browser = mechanicalsoup.StatefulBrowser()
browser.open("https://example.com/upload-with-metadata")

# Select the form
form = browser.select_form()

# Fill in text fields
browser['title'] = "My Document"
browser['description'] = "Important project documentation"
browser['category'] = "documents"

# Upload the file
browser['document'] = open('/path/to/document.pdf', 'rb')

# Submit with all fields
response = browser.submit_selected()

Handling Different File Types

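MechanicalSoup sends file contents as raw bytes, so the file type does not change how you upload. What can differ is the metadata a form expects alongside the file. Here's an example that sets a type field based on the file extension: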

import mechanicalsoup
import os

def upload_file(file_path, upload_url):
    browser = mechanicalsoup.StatefulBrowser()
    browser.open(upload_url)

    # Get file information
    filename = os.path.basename(file_path)
    file_size = os.path.getsize(file_path)

    print(f"Uploading {filename} ({file_size} bytes)")

    form = browser.select_form()

    # Set additional metadata based on file type
    if filename.endswith('.pdf'):
        browser['file_type'] = 'document'
    elif filename.endswith(('.jpg', '.png', '.gif')):
        browser['file_type'] = 'image'
    elif filename.endswith(('.mp4', '.avi', '.mov')):
        browser['file_type'] = 'video'

    # Upload the file; keep it open until the form has been submitted
    with open(file_path, 'rb') as file:
        browser['file'] = file
        response = browser.submit_selected()

    return response

# Usage examples
upload_file('/path/to/report.pdf', 'https://example.com/upload')
upload_file('/path/to/photo.jpg', 'https://example.com/upload')
upload_file('/path/to/video.mp4', 'https://example.com/upload')
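The endswith chain above is case-sensitive, so a file named report.PDF would get no file_type at all. One alternative, sketched here with a hypothetical CATEGORY_BY_EXTENSION table and category_for helper, is to normalise the extension and look it up in a dictionary. The category values are assumptions about what the target form expects.

```python
import os

# Hypothetical mapping from file extension to the form's 'file_type' value.
CATEGORY_BY_EXTENSION = {
    ".pdf": "document",
    ".jpg": "image", ".jpeg": "image", ".png": "image", ".gif": "image",
    ".mp4": "video", ".avi": "video", ".mov": "video",
}


def category_for(file_path, default=None):
    """Return the category for a path, ignoring the extension's case."""
    ext = os.path.splitext(file_path)[1].lower()
    return CATEGORY_BY_EXTENSION.get(ext, default)


for name in ("report.PDF", "photo.jpeg", "clip.mov", "notes.txt"):
    print(name, "->", category_for(name))
```

Inside upload_file, the three branches could then become a single lookup, setting browser['file_type'] only when category_for returns a value.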

Error Handling and Validation

Robust file upload implementations should include proper error handling:

import mechanicalsoup
import os

def safe_file_upload(file_path, upload_url, max_size_mb=10):
    """
    Safely upload a file with error handling and validation
    """
    try:
        # Validate file exists and size
        if not os.path.exists(file_path):
            raise FileNotFoundError(f"File not found: {file_path}")

        file_size = os.path.getsize(file_path)
        max_size_bytes = max_size_mb * 1024 * 1024

        if file_size > max_size_bytes:
            raise ValueError(f"File too large: {file_size} bytes (max: {max_size_bytes})")

        # Create browser and navigate
        browser = mechanicalsoup.StatefulBrowser()
        browser.set_user_agent('MechanicalSoup File Uploader 1.0')

        response = browser.open(upload_url)
        if response.status_code != 200:
            raise Exception(f"Failed to access upload page: {response.status_code}")

        # Find and select form
        forms = browser.page.find_all('form')
        if not forms:
            raise Exception("No forms found on the page")

        # Look for file input in forms
        upload_form = None
        for form in forms:
            if form.find('input', {'type': 'file'}):
                upload_form = form
                break

        if not upload_form:
            raise Exception("No file upload form found")

        browser.select_form(upload_form)

        # Find file input field name
        file_input = upload_form.find('input', {'type': 'file'})
        field_name = file_input.get('name', 'file')

        # Upload file
        with open(file_path, 'rb') as file:
            browser[field_name] = file
            response = browser.submit_selected()

        # Check response
        if response.status_code in [200, 201, 302]:
            print(f"File uploaded successfully: {os.path.basename(file_path)}")
            return True
        else:
            print(f"Upload failed with status: {response.status_code}")
            return False

    except Exception as e:
        print(f"Upload error: {str(e)}")
        return False

# Usage
success = safe_file_upload('/path/to/file.txt', 'https://example.com/upload')

Working with Authentication

Many file upload services require authentication. Here's how to handle login before uploading:

import mechanicalsoup

class AuthenticatedUploader:
    def __init__(self):
        self.browser = mechanicalsoup.StatefulBrowser()
        self.logged_in = False

    def login(self, login_url, username, password):
        """Login to the service"""
        self.browser.open(login_url)

        # Find login form
        form = self.browser.select_form()

        # Fill credentials (field names may vary)
        self.browser['username'] = username  # or 'email', 'login', etc.
        self.browser['password'] = password

        response = self.browser.submit_selected()

        # Check if login was successful
        if 'dashboard' in response.url or 'welcome' in response.text.lower():
            self.logged_in = True
            print("Login successful")
        else:
            print("Login failed")

        return self.logged_in

    def upload_file(self, upload_url, file_path, **form_data):
        """Upload file after authentication"""
        if not self.logged_in:
            raise Exception("Must login first")

        self.browser.open(upload_url)
        form = self.browser.select_form()

        # Fill additional form data
        for field, value in form_data.items():
            self.browser[field] = value

        # Upload file; keep it open until the form has been submitted
        with open(file_path, 'rb') as file:
            self.browser['file'] = file
            response = self.browser.submit_selected()

        return response

# Usage
uploader = AuthenticatedUploader()
uploader.login('https://example.com/login', 'user@example.com', 'password')
uploader.upload_file(
    'https://example.com/upload', 
    '/path/to/file.txt',
    title="My Document",
    category="reports"
)
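Checking the response URL or page text for words like "dashboard" is fragile. Where the site sets a session cookie on login, inspecting the cookie jar can be a more direct signal. The sketch below assumes a cookie named sessionid, which is hypothetical and varies by site. It reads browser.session.cookies, the requests cookie jar that a MechanicalSoup browser exposes, and uses a stand-in object so it runs offline.

```python
from types import SimpleNamespace


def has_session_cookie(browser, cookie_name="sessionid"):
    """Return True if the browser's session holds a non-empty cookie.

    cookie_name is an assumption; inspect the site's Set-Cookie headers
    to find the real name.
    """
    return bool(browser.session.cookies.get(cookie_name))


# Stand-ins: a plain dict offers the same .get(name) lookup used above.
logged_in = SimpleNamespace(
    session=SimpleNamespace(cookies={"sessionid": "abc123"}))
anonymous = SimpleNamespace(session=SimpleNamespace(cookies={}))

print(has_session_cookie(logged_in))
print(has_session_cookie(anonymous))
```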

Monitoring Upload Progress

For large files, you might want to provide progress feedback:

import mechanicalsoup
import os
from tqdm import tqdm

class ProgressUploader:
    def __init__(self):
        self.browser = mechanicalsoup.StatefulBrowser()

    def upload_with_progress(self, upload_url, file_path):
        """Upload file with progress monitoring"""
        file_size = os.path.getsize(file_path)
        filename = os.path.basename(file_path)

        print(f"Uploading {filename} ({file_size:,} bytes)")

        self.browser.open(upload_url)
        form = self.browser.select_form()

        # Create progress bar
        with tqdm(total=file_size, unit='B', unit_scale=True, desc=filename) as pbar:
            # Note: MechanicalSoup doesn't provide built-in progress callbacks
            # This is a simplified example - real progress tracking would require
            # custom request handling or switching to requests with custom adapters

            with open(file_path, 'rb') as file:
                self.browser['file'] = file
                pbar.update(file_size)  # Simplified - shows completion
                response = self.browser.submit_selected()

        return response

# For actual progress tracking during upload, consider using requests directly:
import requests
from requests_toolbelt.multipart.encoder import MultipartEncoder, MultipartEncoderMonitor

def upload_with_real_progress(upload_url, file_path):
    """Upload with real-time progress using requests"""

    def progress_callback(monitor):
        progress = (monitor.bytes_read / monitor.len) * 100
        print(f"\rProgress: {progress:.1f}%", end='', flush=True)

    with open(file_path, 'rb') as file:
        encoder = MultipartEncoder(
            fields={'file': (os.path.basename(file_path), file, 'application/octet-stream')}
        )

        monitor = MultipartEncoderMonitor(encoder, progress_callback)

        response = requests.post(
            upload_url,
            data=monitor,
            headers={'Content-Type': monitor.content_type}
        )

    print(f"\nUpload completed with status: {response.status_code}")
    return response

Best Practices for File Uploads

1. Always Use Context Managers

# Good: Using context manager
with open(file_path, 'rb') as file:
    browser['file'] = file
    response = browser.submit_selected()

# Avoid: Leaving files open
browser['file'] = open(file_path, 'rb')  # File might not be closed properly
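The submit call has to stay inside the with block because the file is read at submission time, not when it is assigned to the field. This short standard-library sketch shows what happens when the order is wrong. The pending variable stands in for the form field, and no MechanicalSoup call is involved.

```python
import os
import tempfile

# Create a small throwaway file to stand in for the upload.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(b"hello")

# Wrong order: the with block ends, closing the file, before anything reads it.
with open(path, "rb") as file:
    pending = file  # analogous to browser['file'] = file

try:
    pending.read()  # analogous to calling submit_selected() too late
    outcome = "read succeeded"
except ValueError as exc:
    outcome = f"ValueError: {exc}"

print(outcome)
os.remove(path)
```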

2. Validate Files Before Upload

import mimetypes
import os

def validate_file(file_path, allowed_types=None, max_size_mb=10):
    """Validate file before upload"""
    if not os.path.exists(file_path):
        return False, "File does not exist"

    # Check file size
    size_mb = os.path.getsize(file_path) / (1024 * 1024)
    if size_mb > max_size_mb:
        return False, f"File too large: {size_mb:.1f}MB (max: {max_size_mb}MB)"

    # Check file type
    if allowed_types:
        mime_type, _ = mimetypes.guess_type(file_path)
        if mime_type not in allowed_types:
            return False, f"File type not allowed: {mime_type}"

    return True, "File is valid"

# Usage
allowed_types = ['image/jpeg', 'image/png', 'application/pdf']
is_valid, message = validate_file('/path/to/file.jpg', allowed_types)
if is_valid:
    # Proceed with upload
    pass
else:
    print(f"Validation failed: {message}")
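Note that mimetypes.guess_type looks only at the file name, so a renamed file passes the check above. If that matters, a complementary check is to compare the file's leading bytes against known signatures. The sketch below covers three formats only; sniff_type and the SIGNATURES table are illustrative names, not part of any library.

```python
import os
import tempfile

# Leading bytes ("magic numbers") of a few common formats.
SIGNATURES = {
    "application/pdf": b"%PDF-",
    "image/png": b"\x89PNG\r\n\x1a\n",
    "image/jpeg": b"\xff\xd8\xff",
}


def sniff_type(file_path):
    """Return the MIME type whose signature matches the file, or None."""
    with open(file_path, "rb") as f:
        head = f.read(16)
    for mime_type, signature in SIGNATURES.items():
        if head.startswith(signature):
            return mime_type
    return None


# Demo: a PNG header saved under a misleading .pdf name.
fd, path = tempfile.mkstemp(suffix=".pdf")
with os.fdopen(fd, "wb") as f:
    f.write(b"\x89PNG\r\n\x1a\n" + b"\x00" * 8)

detected = sniff_type(path)
print(detected)
os.remove(path)
```

A stricter validate_file could reject the upload when sniff_type and the extension-based guess disagree.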

3. Handle Network Issues

import time
import random

def upload_with_retry(browser, upload_url, file_path, max_retries=3):
    """Upload with retry logic for network issues"""
    for attempt in range(max_retries):
        try:
            # Reload and reselect the form on every attempt: a completed
            # submit_selected() replaces the browser's page and selected form.
            browser.open(upload_url)
            browser.select_form()

            with open(file_path, 'rb') as file:
                browser['file'] = file
                response = browser.submit_selected()

            if response.status_code in [200, 201]:
                return response
            raise Exception(f"HTTP {response.status_code}")

        except Exception as e:
            if attempt < max_retries - 1:
                # Exponential backoff with jitter
                wait_time = (2 ** attempt) + random.uniform(0, 1)
                print(f"Attempt {attempt + 1} failed: {e}. Retrying in {wait_time:.1f}s...")
                time.sleep(wait_time)
            else:
                print(f"All {max_retries} attempts failed: {e}")
                raise

    return None
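Retrying on every failure can waste time on errors that will never succeed, such as a 413 for an oversized file. A possible refinement, sketched below, separates transient statuses from permanent ones and makes the backoff schedule explicit. The set of retryable codes is an assumption to tune per service, and is_retryable and backoff_delays are hypothetical helpers. Jitter is a fixed parameter here so the output is reproducible; in real use you would randomise it.

```python
# Assumed set of transient statuses; adjust for the service you target.
RETRYABLE_STATUSES = {429, 500, 502, 503, 504}


def is_retryable(status_code):
    """Return True for statuses that may succeed on a later attempt."""
    return status_code in RETRYABLE_STATUSES


def backoff_delays(max_retries, base=1.0, jitter=0.0):
    """Delay before each retry: base * 2**attempt plus jitter.

    There is one fewer delay than attempts, because nothing follows
    the final attempt.
    """
    return [base * (2 ** attempt) + jitter for attempt in range(max_retries - 1)]


print(is_retryable(503), is_retryable(413))
print(backoff_delays(4))
```

Keep in mind that an upload is usually not idempotent: if a request times out after the server has stored the file, a retry may create a duplicate.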

Comparison with Other Tools

While MechanicalSoup excels at handling forms and file uploads, other tools might be better suited for specific scenarios:

  • For JavaScript-heavy upload interfaces: Consider using browser automation tools that can handle dynamic content
  • For REST API uploads: Use requests library directly with multipart encoding
  • For large file uploads: Consider streaming uploads with progress tracking using requests and requests-toolbelt

Troubleshooting Common Issues

Issue 1: Form Not Found

# Check if forms exist
forms = browser.page.find_all('form')
print(f"Found {len(forms)} forms on the page")

for i, form in enumerate(forms):
    print(f"Form {i}: {form.get('action', 'No action')} - {form.get('method', 'GET')}")
    file_inputs = form.find_all('input', {'type': 'file'})
    print(f"  File inputs: {len(file_inputs)}")

Issue 2: Wrong Field Name

# Find all file input fields
file_inputs = browser.page.find_all('input', {'type': 'file'})
for inp in file_inputs:
    print(f"Field name: {inp.get('name')}, ID: {inp.get('id')}")

Issue 3: File Size Limits

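Some forms advertise a size limit in a hidden field (PHP applications conventionally name it MAX_FILE_SIZE). The server enforces the real limit, but such fields are a useful hint when an upload is rejected: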

# Check for hidden size limit fields
hidden_inputs = browser.page.find_all('input', {'type': 'hidden'})
for inp in hidden_inputs:
    name = inp.get('name', '')
    if 'size' in name.lower() or 'max' in name.lower():
        print(f"Size limit field: {name} = {inp.get('value')}")

Conclusion

MechanicalSoup provides a robust and Python-friendly way to handle file uploads through web forms. Its automatic form detection and handling make it ideal for scenarios where you need to interact with standard HTML upload forms. Combined with proper error handling, validation, and retry logic, MechanicalSoup can reliably automate file upload workflows in your web scraping and automation projects.

The key advantages of using MechanicalSoup for file uploads include its simplicity, automatic form handling, and seamless integration with session management and authentication. However, for more complex scenarios involving JavaScript-heavy interfaces or real-time progress tracking, you might need to consider alternative approaches or combine MechanicalSoup with other tools.
