How do I handle file downloads with MechanicalSoup?
MechanicalSoup is a powerful Python library that simplifies web scraping and browser automation tasks. When it comes to downloading files from web pages, MechanicalSoup provides several approaches depending on your specific needs. This guide will walk you through various methods to handle file downloads effectively using MechanicalSoup.
Understanding File Downloads in MechanicalSoup
MechanicalSoup is built on top of the requests library and BeautifulSoup, which means it inherits robust HTTP handling capabilities. File downloads typically involve making HTTP requests to specific URLs and saving the response content to local files.
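Because the underlying requests.Session is exposed as browser.session, anything you configure there (headers, cookies, proxies) applies to every request the browser makes. A quick way to see this:

import mechanicalsoup

browser = mechanicalsoup.StatefulBrowser()

# browser.session is a plain requests.Session, so session-level
# configuration carries over to every subsequent download
browser.session.headers.update({'User-Agent': 'Docs-Example/1.0'})
print(type(browser.session))  # <class 'requests.sessions.Session'>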
Basic File Download
The simplest way to download a file with MechanicalSoup is to navigate to the download URL and save the response content:
import mechanicalsoup
import os
# Create a browser instance
browser = mechanicalsoup.StatefulBrowser()
# Navigate to the file URL
response = browser.get("https://example.com/path/to/file.pdf")
# Save the file
with open("downloaded_file.pdf", "wb") as file:
file.write(response.content)
print("File downloaded successfully!")
Downloading Files from Forms
Many websites require form submission or authentication before allowing file downloads. MechanicalSoup excels at handling these scenarios:
import mechanicalsoup
# Create browser instance with session management
browser = mechanicalsoup.StatefulBrowser(
    user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
)
# Navigate to the page with the download form
browser.open("https://example.com/download-page")
# Find and fill the form
browser.select_form('form[action="/download"]')
browser["username"] = "your_username"
browser["password"] = "your_password"
# Submit the form
response = browser.submit_selected()
# Check if the response contains a file
if response.headers.get('content-type', '').startswith('application/'):
    filename = response.headers.get('content-disposition', '').split('filename=')[-1].strip('"')
    with open(filename or "downloaded_file", "wb") as file:
        file.write(response.content)
    print(f"Downloaded: {filename}")
else:
    print("No file found in response")
Handling Large File Downloads with Streaming
For large files, it's important to use streaming to avoid loading the entire file into memory:
import mechanicalsoup
import os
from urllib.parse import urlparse
def download_large_file(browser, url, local_filename=None):
    """Download large files using streaming to manage memory efficiently."""
    # Go through browser.session directly: browser.get() may read the
    # response body to decide whether to parse it as HTML, which defeats
    # streaming. browser.session is the underlying requests.Session.
    response = browser.session.get(url, stream=True)
    response.raise_for_status()
    # Determine filename
    if not local_filename:
        # Try to get the filename from the Content-Disposition header
        content_disposition = response.headers.get('content-disposition', '')
        if 'filename=' in content_disposition:
            local_filename = content_disposition.split('filename=')[-1].strip('"')
        else:
            # Fall back to the last component of the URL path
            local_filename = os.path.basename(urlparse(url).path) or "downloaded_file"
    # Download in chunks
    total_size = int(response.headers.get('content-length', 0))
    downloaded_size = 0
    with open(local_filename, 'wb') as file:
        for chunk in response.iter_content(chunk_size=8192):
            if chunk:
                file.write(chunk)
                downloaded_size += len(chunk)
                # Progress indicator
                if total_size > 0:
                    progress = (downloaded_size / total_size) * 100
                    print(f"\rDownloading: {progress:.1f}%", end="", flush=True)
    print(f"\nDownload completed: {local_filename}")
    return local_filename
# Usage example
browser = mechanicalsoup.StatefulBrowser()
download_large_file(browser, "https://example.com/large-file.zip")
Downloading Multiple Files
When downloading multiple files, it's efficient to reuse the browser session and handle errors gracefully:
import mechanicalsoup
import os
import time
from urllib.parse import urljoin
def download_multiple_files(base_url, file_urls, download_dir="downloads"):
    """Download multiple files with error handling and rate limiting."""
    # Create download directory
    os.makedirs(download_dir, exist_ok=True)
    # Initialize browser with session persistence
    browser = mechanicalsoup.StatefulBrowser()
    browser.session.headers.update({
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    })
    downloaded_files = []
    failed_downloads = []
    for file_url in file_urls:
        try:
            # urljoin leaves absolute URLs untouched, so no special-casing is needed
            full_url = urljoin(base_url, file_url)
            print(f"Downloading: {full_url}")
            response = browser.get(full_url)
            response.raise_for_status()
            # Generate filename
            filename = os.path.basename(file_url) or f"file_{len(downloaded_files)}"
            filepath = os.path.join(download_dir, filename)
            # Save file
            with open(filepath, 'wb') as file:
                file.write(response.content)
            downloaded_files.append(filepath)
            print(f"✓ Downloaded: {filename}")
            # Rate limiting to be respectful
            time.sleep(1)
        except Exception as e:
            print(f"✗ Failed to download {file_url}: {e}")
            failed_downloads.append(file_url)
    return downloaded_files, failed_downloads
# Usage example
file_list = [
    "/downloads/document1.pdf",
    "/downloads/image1.jpg",
    "/downloads/data.csv"
]
success, failed = download_multiple_files("https://example.com", file_list)
print(f"Successfully downloaded: {len(success)} files")
print(f"Failed downloads: {len(failed)} files")
Handling Authentication and Cookies
For websites requiring authentication, MechanicalSoup maintains session state automatically:
import mechanicalsoup
def download_authenticated_file(login_url, download_url, username, password):
    """Download files from authenticated areas."""
    browser = mechanicalsoup.StatefulBrowser()
    # Log in first
    browser.open(login_url)
    # Find and fill the login form (select_form() picks the first form)
    browser.select_form()
    browser["username"] = username  # Adjust field names to match the site's form
    browser["password"] = password
    # Submit login
    login_response = browser.submit_selected()
    # Check if login was successful (heuristics; adjust for your site)
    if "dashboard" in login_response.url or "welcome" in browser.get_current_page().text.lower():
        print("Login successful")
        # Now download the file
        download_response = browser.get(download_url)
        if download_response.headers.get('content-type', '').startswith('application/'):
            filename = "authenticated_download.pdf"
            with open(filename, 'wb') as file:
                file.write(download_response.content)
            print(f"Downloaded: {filename}")
        else:
            print("Download failed or file not found")
    else:
        print("Login failed")
# Usage
download_authenticated_file(
    "https://example.com/login",
    "https://example.com/secure/document.pdf",
    "your_username",
    "your_password"
)
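On the cookie side, the authenticated session lives in browser.session.cookies, a standard requests cookie jar, so it can be saved and restored between runs to skip repeated logins. A minimal sketch using requests' helpers (the file name is arbitrary, and note that dict_from_cookiejar flattens domain and path attributes):

import json
import mechanicalsoup
import requests.utils

browser = mechanicalsoup.StatefulBrowser()
# ... log in as shown above ...

# Save cookies after logging in (name/value pairs only)
with open("cookies.json", "w") as f:
    json.dump(requests.utils.dict_from_cookiejar(browser.session.cookies), f)

# Restore cookies in a later run to reuse the session
with open("cookies.json") as f:
    browser.session.cookies = requests.utils.cookiejar_from_dict(json.load(f))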
Advanced Download Features
Handling Different Content Types
import mechanicalsoup
import mimetypes
import time

def smart_download(browser, url):
    """Download a file with automatic type detection and naming."""
    response = browser.get(url)
    response.raise_for_status()
    # Get the content type, ignoring any charset parameter
    content_type = response.headers.get('content-type', '').split(';')[0]
    # Determine file extension
    extension = mimetypes.guess_extension(content_type)
    if not extension:
        extension = ".bin"  # fallback for unknown types
    # Generate a timestamp-based filename
    filename = f"download_{int(time.time())}{extension}"
    # Save file
    with open(filename, 'wb') as file:
        file.write(response.content)
    print(f"Downloaded {content_type} file as: {filename}")
    return filename
Download Verification
import hashlib

def download_with_verification(browser, url, expected_hash=None):
    """Download a file and verify its integrity."""
    response = browser.get(url)
    response.raise_for_status()
    content = response.content
    # Calculate hash (MD5 is fine for detecting corruption; use SHA-256
    # if the checksum also needs to be tamper-resistant)
    file_hash = hashlib.md5(content).hexdigest()
    if expected_hash and file_hash != expected_hash:
        raise ValueError(f"File integrity check failed. Expected: {expected_hash}, Got: {file_hash}")
    filename = "verified_download.bin"
    with open(filename, 'wb') as file:
        file.write(content)
    print(f"Downloaded and verified: {filename} (Hash: {file_hash})")
    return filename, file_hash
Best Practices for File Downloads
1. Always Handle Errors
import requests

try:
    response = browser.get(download_url)
    response.raise_for_status()
except requests.exceptions.RequestException as e:
    print(f"Download failed: {e}")
2. Respect Rate Limits
import time
# Add delays between requests
time.sleep(1) # Wait 1 second between downloads
3. Use Appropriate Headers
browser.session.headers.update({
    'User-Agent': 'Your-App/1.0',
    'Accept': 'application/octet-stream, */*'
})
4. Validate File Types
allowed_types = ['application/pdf', 'image/jpeg', 'application/zip']
# Strip any parameters (e.g. "; charset=...") before comparing
content_type = response.headers.get('content-type', '').split(';')[0]
if content_type not in allowed_types:
    raise ValueError("Unsupported file type")
Integration with Other Tools
MechanicalSoup works well with other Python libraries for enhanced functionality. For more complex scenarios involving JavaScript-heavy sites, you might want to consider how to handle file downloads in Puppeteer, which provides more advanced browser automation capabilities.
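For example, pairing the streaming approach from earlier with the tqdm library (assuming it is installed; the URL and filename are placeholders) gives a proper progress bar:

import mechanicalsoup
from tqdm import tqdm

browser = mechanicalsoup.StatefulBrowser()
# Go through browser.session so the body is streamed, not parsed
response = browser.session.get("https://example.com/large-file.zip", stream=True)
response.raise_for_status()

total = int(response.headers.get('content-length', 0))
with open("large-file.zip", "wb") as file, tqdm(
        total=total, unit="B", unit_scale=True, desc="Downloading") as bar:
    for chunk in response.iter_content(chunk_size=8192):
        file.write(chunk)
        bar.update(len(chunk))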
Troubleshooting Common Issues
Issue: Downloads Fail with 403 Forbidden
Solution: Add proper headers and ensure you're authenticated:
browser.session.headers.update({
    'Referer': 'https://example.com/page-with-download-link',
    'User-Agent': 'Mozilla/5.0 (compatible browser string)'
})
Issue: Large Files Cause Memory Issues
Solution: Use streaming downloads as shown in the large file example above.
Issue: Download Links are JavaScript-Generated
Solution: MechanicalSoup cannot execute JavaScript. Consider using browser automation tools for such cases.
Conclusion
MechanicalSoup provides excellent capabilities for downloading files from web pages, especially when dealing with forms, authentication, and session management. Its integration with the requests library makes it powerful for HTTP-based downloads, while BeautifulSoup integration helps with parsing and navigation.
For most web scraping scenarios involving file downloads, MechanicalSoup offers the right balance of simplicity and functionality. Remember to always respect robots.txt files, implement proper error handling, and be mindful of rate limiting to maintain good relationships with web servers.
Whether you're downloading single files, handling bulk downloads, or working with authenticated areas, MechanicalSoup's session management and form handling capabilities make it an excellent choice for Python developers working on web scraping projects.