How do I handle file downloads with Selenium WebDriver?

File downloads are a common requirement in web automation and testing scenarios. Selenium WebDriver provides several approaches to handle file downloads, from configuring browser-specific download preferences to monitoring download completion. This guide covers browser configuration, completion detection, and best practices for managing downloads in Chrome and Firefox.

Understanding File Download Challenges

When automating file downloads with Selenium, you encounter several challenges:

  • Browser security restrictions: Browsers may prompt for or block downloads that were not started by a user gesture, or when a page triggers several downloads in a row
  • Download dialogs: Some browsers show download confirmation dialogs
  • Asynchronous nature: Downloads happen in the background, making it difficult to know when they complete
  • Browser-specific configurations: Each browser requires different setup approaches

Configuring Chrome for File Downloads

Chrome is configured through the "prefs" experimental option, which sets the download directory and suppresses the save dialog:

Python Example with Chrome

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import os
import time

def setup_chrome_for_downloads():
    download_dir = os.path.abspath("downloads")
    os.makedirs(download_dir, exist_ok=True)

    chrome_options = Options()
    chrome_options.add_experimental_option("prefs", {
        "download.default_directory": download_dir,
        "download.prompt_for_download": False,
        "download.directory_upgrade": True,
        "safebrowsing.enabled": True
    })

    driver = webdriver.Chrome(options=chrome_options)
    return driver, download_dir

def download_file_and_wait(driver, download_dir, download_link_xpath):
    # Snapshot the directory before clicking so a fast download is not missed
    files_before = set(os.listdir(download_dir))

    # Click the download link
    download_link = WebDriverWait(driver, 10).until(
        EC.element_to_be_clickable((By.XPATH, download_link_xpath))
    )
    download_link.click()

    # Wait for download to complete
    return wait_for_download_complete(download_dir, files_before)

def wait_for_download_complete(download_dir, files_before=None, timeout=30):
    # Fall back to a snapshot taken now if the caller did not supply one
    if files_before is None:
        files_before = set(os.listdir(download_dir))

    deadline = time.time() + timeout
    while time.time() < deadline:
        time.sleep(1)
        new_files = set(os.listdir(download_dir)) - files_before

        # Chrome keeps the .crdownload suffix until the download finishes
        downloading_files = [f for f in new_files if f.endswith('.crdownload')]
        if new_files and not downloading_files:
            return sorted(new_files)[0]

    raise TimeoutError("Download did not complete within timeout period")

# Usage example
driver, download_dir = setup_chrome_for_downloads()
try:
    driver.get("https://example.com/download-page")
    downloaded_file = download_file_and_wait(driver, download_dir, "//a[@id='download-link']")
    print(f"Downloaded file: {downloaded_file}")
finally:
    driver.quit()

JavaScript Example with Chrome

const { Builder } = require('selenium-webdriver');
const chrome = require('selenium-webdriver/chrome');
const fs = require('fs');
const path = require('path');

async function setupChromeForDownloads() {
    const downloadDir = path.resolve('./downloads');

    // Create download directory if it doesn't exist
    if (!fs.existsSync(downloadDir)) {
        fs.mkdirSync(downloadDir, { recursive: true });
    }

    const options = new chrome.Options();
    options.setUserPreferences({
        'download.default_directory': downloadDir,
        'download.prompt_for_download': false,
        'download.directory_upgrade': true,
        'safebrowsing.enabled': true
    });

    const driver = await new Builder()
        .forBrowser('chrome')
        .setChromeOptions(options)
        .build();

    return { driver, downloadDir };
}

async function waitForDownloadComplete(downloadDir, filesBefore, timeout = 30000) {
    const startTime = Date.now();

    while (Date.now() - startTime < timeout) {
        await new Promise(resolve => setTimeout(resolve, 1000));

        const currentFiles = fs.readdirSync(downloadDir);
        const newFiles = currentFiles.filter(file => !filesBefore.includes(file));

        if (newFiles.length > 0) {
            // Chrome keeps the .crdownload suffix until the download finishes
            const downloadingFiles = newFiles.filter(file => file.endsWith('.crdownload'));
            if (downloadingFiles.length === 0) {
                return newFiles[0];
            }
        }
    }

    throw new Error('Download did not complete within timeout period');
}

// Usage example
async function downloadFile() {
    const { driver, downloadDir } = await setupChromeForDownloads();

    try {
        await driver.get('https://example.com/download-page');

        // Snapshot the directory before clicking so a fast download is not missed
        const filesBefore = fs.readdirSync(downloadDir);

        const downloadLink = await driver.findElement({ id: 'download-link' });
        await downloadLink.click();

        const downloadedFile = await waitForDownloadComplete(downloadDir, filesBefore);
        console.log(`Downloaded file: ${downloadedFile}`);
    } finally {
        await driver.quit();
    }
}

// Report failures instead of leaving an unhandled promise rejection
downloadFile().catch(console.error);

Configuring Firefox for File Downloads

Firefox is configured through profile preferences rather than Chrome's "prefs" dictionary:

Python Example with Firefox

from selenium import webdriver
from selenium.webdriver.firefox.options import Options
import os

def setup_firefox_for_downloads():
    download_dir = os.path.abspath("downloads")
    os.makedirs(download_dir, exist_ok=True)

    firefox_options = Options()
    # folderList: 0 = desktop, 1 = default Downloads folder, 2 = custom directory
    firefox_options.set_preference("browser.download.folderList", 2)
    firefox_options.set_preference("browser.download.dir", download_dir)
    firefox_options.set_preference("browser.download.useDownloadDir", True)

    # Specify MIME types to download automatically
    firefox_options.set_preference(
        "browser.helperApps.neverAsk.saveToDisk",
        "application/pdf,application/octet-stream,text/csv,application/zip"
    )

    # Disable download manager
    firefox_options.set_preference("browser.download.manager.showWhenStarting", False)
    firefox_options.set_preference("pdfjs.disabled", True)  # Disable PDF preview

    driver = webdriver.Firefox(options=firefox_options)
    return driver, download_dir
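
The Chrome wait helper shown earlier only looks for .crdownload files, while Firefox marks in-progress downloads with a .part suffix. The following is a minimal sketch of a wait that handles both, assuming those two suffixes are the only temporary markers involved. The names wait_for_new_file, PARTIAL_SUFFIXES, and the files_before parameter are illustrative and not part of Selenium.

```python
import os
import time

# Suffixes used for in-progress downloads (assumption: Chrome and Firefox only)
PARTIAL_SUFFIXES = (".crdownload", ".part")

def wait_for_new_file(download_dir, files_before, timeout=30, poll_interval=0.5):
    """Return the name of a finished file that was not in files_before."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        new_files = set(os.listdir(download_dir)) - set(files_before)
        partial = [f for f in new_files if f.endswith(PARTIAL_SUFFIXES)]
        # Done only when something new exists and no partial file remains
        if new_files and not partial:
            return sorted(new_files)[0]
        time.sleep(poll_interval)
    raise TimeoutError(f"No completed download in {download_dir} after {timeout}s")
```

To use it, capture files_before = set(os.listdir(download_dir)) before clicking the download link, then call wait_for_new_file(download_dir, files_before) after the click.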

Advanced Download Handling Techniques

Using Chrome DevTools Protocol

For more advanced control over downloads, you can use Chrome DevTools Protocol:

import os
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

def enable_download_headless(driver, download_dir):
    """Enable downloads in headless Chrome"""
    # execute_cdp_cmd is available on Chromium-based drivers in Selenium 4
    driver.execute_cdp_cmd("Page.setDownloadBehavior", {
        "behavior": "allow",
        "downloadPath": download_dir
    })

# Usage in headless mode
download_dir = os.path.abspath("downloads")
os.makedirs(download_dir, exist_ok=True)

chrome_options = Options()
chrome_options.add_argument("--headless")
driver = webdriver.Chrome(options=chrome_options)
enable_download_headless(driver, download_dir)

Monitoring Download Progress

You can monitor download progress by polling the size of the in-progress file:

import os
import time

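def monitor_download_progress(download_dir, expected_filename=None, timeout=60,
                              files_before=None):
    """Monitor download progress with detailed feedback.

    Pass files_before (a set of names captured before the download was
    triggered) so files left over from earlier runs are not reported.
    """
    start_time = time.time()
    if files_before is None:
        files_before = set(os.listdir(download_dir))

    while time.time() - start_time < timeout:
        files = os.listdir(download_dir)

        # Split the listing into in-progress and finished files
        downloading_files = [f for f in files if f.endswith('.crdownload')]
        completed_files = [f for f in files if not f.endswith('.crdownload')]

        if downloading_files:
            download_file = downloading_files[0]
            file_path = os.path.join(download_dir, download_file)
            try:
                file_size = os.path.getsize(file_path)
                print(f"Downloading: {download_file}, Size: {file_size} bytes")
            except OSError:
                pass  # The file was renamed between listdir() and getsize()

        if expected_filename:
            if expected_filename in completed_files:
                return expected_filename
        elif not downloading_files:
            # Without a known name, accept only files that are new
            new_files = sorted(set(completed_files) - set(files_before))
            if new_files:
                return new_files[0]

        time.sleep(1)

    raise TimeoutError("Download did not complete within timeout")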

Handling Different File Types

Different file types may require specific handling approaches:

PDF Files

def setup_pdf_download(chrome_options, download_dir):
    """Configure Chrome to download PDFs instead of displaying them"""
    chrome_options.add_experimental_option("prefs", {
        "plugins.always_open_pdf_externally": True,
        "download.default_directory": download_dir,
        "download.prompt_for_download": False,
    })

ZIP and Archive Files

def handle_zip_download(driver, download_dir, download_link_selector):
    """Handle ZIP file downloads with proper waiting"""
    download_link = driver.find_element(By.CSS_SELECTOR, download_link_selector)

    # Get expected filename from link attributes
    expected_filename = download_link.get_attribute("download")
    if not expected_filename:
        # Fall back to the last path segment of the URL, without any query string
        href = download_link.get_attribute("href")
        expected_filename = href.split("?")[0].split("/")[-1]

    download_link.click()

    return wait_for_specific_file(download_dir, expected_filename)

def wait_for_specific_file(download_dir, filename, timeout=30):
    """Wait for a specific file to be downloaded"""
    file_path = os.path.join(download_dir, filename)

    for _ in range(timeout):
        if os.path.exists(file_path):
            # Ensure file is completely downloaded
            initial_size = os.path.getsize(file_path)
            time.sleep(1)
            final_size = os.path.getsize(file_path)

            if initial_size == final_size:
                return filename

        time.sleep(1)

    raise TimeoutError(f"File {filename} was not downloaded within timeout")

Best Practices and Troubleshooting

Error Handling and Validation

def validate_download(download_dir, expected_filename, min_size=None):
    """Validate downloaded file"""
    file_path = os.path.join(download_dir, expected_filename)

    if not os.path.exists(file_path):
        raise FileNotFoundError(f"Downloaded file not found: {expected_filename}")

    file_size = os.path.getsize(file_path)
    if min_size and file_size < min_size:
        raise ValueError(f"Downloaded file is too small: {file_size} bytes")

    print(f"Download validated: {expected_filename} ({file_size} bytes)")
    return True
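
A size check catches truncated files but not corrupted ones. When the site publishes a checksum for the file, comparing hashes is a stronger test. The following is a sketch using the standard library's hashlib. The helper names sha256_of_file and verify_checksum are hypothetical, and the expected hash is assumed to come from the download page or another trusted source.

```python
import hashlib

def sha256_of_file(file_path, chunk_size=65536):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(file_path, "rb") as f:
        # Read fixed-size chunks so large downloads are not loaded into memory
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_checksum(file_path, expected_sha256):
    """Raise ValueError if the file's SHA-256 does not match the expected value."""
    actual = sha256_of_file(file_path)
    if actual != expected_sha256.lower():
        raise ValueError(f"Checksum mismatch for {file_path}: got {actual}")
    return True
```

This could run right after validate_download, once the file is known to exist.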

Cleanup and Management

def cleanup_downloads(download_dir, keep_latest=5):
    """Clean up old downloads, keeping only the latest files"""
    files = []
    for filename in os.listdir(download_dir):
        file_path = os.path.join(download_dir, filename)
        if os.path.isfile(file_path):
            files.append((filename, os.path.getmtime(file_path)))

    # Sort by modification time (newest first)
    files.sort(key=lambda x: x[1], reverse=True)

    # Remove old files
    for filename, _ in files[keep_latest:]:
        file_path = os.path.join(download_dir, filename)
        os.remove(file_path)
        print(f"Removed old download: {filename}")
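
An alternative to pruning a shared folder is to give each test run its own directory and delete it afterwards. The following is a minimal sketch using tempfile. The name isolated_download_dir is a hypothetical helper, and the yielded path is assumed to be passed wherever this guide uses download_dir.

```python
import contextlib
import shutil
import tempfile

@contextlib.contextmanager
def isolated_download_dir(prefix="selenium-downloads-"):
    """Yield a fresh, empty directory and remove it when the block exits."""
    path = tempfile.mkdtemp(prefix=prefix)
    try:
        yield path
    finally:
        # ignore_errors avoids failing a test run over a file that is still locked
        shutil.rmtree(path, ignore_errors=True)
```

Because every run starts with an empty directory, any file that appears in it must come from the current test, which also simplifies completion checks.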

Integration with Testing Frameworks

When integrating file downloads with testing frameworks, consider organizing your download handling into reusable utilities:

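import os
import time

from selenium import webdriver
from selenium.webdriver.common.by import By

class DownloadManager:
    def __init__(self, download_dir="downloads"):
        self.download_dir = os.path.abspath(download_dir)
        os.makedirs(self.download_dir, exist_ok=True)

    def setup_driver(self, browser="chrome"):
        """Setup driver with download configuration"""
        if browser == "chrome":
            return self._setup_chrome()
        elif browser == "firefox":
            return self._setup_firefox()
        else:
            raise ValueError(f"Unsupported browser: {browser}")

    def _setup_chrome(self):
        options = webdriver.ChromeOptions()
        options.add_experimental_option("prefs", {
            "download.default_directory": self.download_dir,
            "download.prompt_for_download": False,
            "download.directory_upgrade": True,
            "safebrowsing.enabled": True
        })
        return webdriver.Chrome(options=options)

    def _setup_firefox(self):
        options = webdriver.FirefoxOptions()
        options.set_preference("browser.download.folderList", 2)
        options.set_preference("browser.download.dir", self.download_dir)
        options.set_preference(
            "browser.helperApps.neverAsk.saveToDisk",
            "application/pdf,application/octet-stream,text/csv,application/zip"
        )
        return webdriver.Firefox(options=options)

    def wait_for_download(self, files_before, timeout=30):
        """Return the first new file once no partial downloads remain"""
        deadline = time.time() + timeout
        while time.time() < deadline:
            new_files = set(os.listdir(self.download_dir)) - files_before
            # .crdownload (Chrome) and .part (Firefox) mark in-progress files
            partial = [f for f in new_files if f.endswith((".crdownload", ".part"))]
            if new_files and not partial:
                return sorted(new_files)[0]
            time.sleep(1)
        raise TimeoutError("Download did not complete within timeout period")

    def download_and_verify(self, driver, selector, expected_filename=None):
        """Download file and verify completion"""
        # Snapshot the directory before clicking so a fast download is not missed
        files_before = set(os.listdir(self.download_dir))

        element = driver.find_element(By.CSS_SELECTOR, selector)
        element.click()

        downloaded_file = self.wait_for_download(files_before)

        if expected_filename and downloaded_file != expected_filename:
            raise ValueError(f"Expected {expected_filename}, got {downloaded_file}")

        return downloaded_file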

Alternative Approaches

While Selenium handles file downloads effectively, consider these alternatives for specific scenarios:

  • Direct API calls: If the download URL is accessible, use HTTP clients for faster downloads
  • Puppeteer file downloads: For JavaScript-heavy applications, Puppeteer might offer better control
  • WebScraping.AI: For complex scraping scenarios that include file downloads, consider using specialized APIs that handle browser automation and downloads seamlessly
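
The direct-request option can be sketched with the standard library alone. This is an illustration under assumptions, not a drop-in solution: it assumes the file is served to any client that presents the browser session's cookies, and the helper names cookie_header and download_via_http are hypothetical. The cookie dictionaries are in the shape returned by Selenium's driver.get_cookies().

```python
import shutil
import urllib.request

def cookie_header(selenium_cookies):
    """Build a Cookie header value from the dicts returned by driver.get_cookies()."""
    return "; ".join(f"{c['name']}={c['value']}" for c in selenium_cookies)

def download_via_http(url, dest_path, selenium_cookies=(), timeout=30):
    """Stream url to dest_path, forwarding the browser session's cookies."""
    headers = {}
    if selenium_cookies:
        headers["Cookie"] = cookie_header(selenium_cookies)
    request = urllib.request.Request(url, headers=headers)
    # Copy the response body to disk in chunks rather than reading it all at once
    with urllib.request.urlopen(request, timeout=timeout) as response, \
            open(dest_path, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest_path
```

A typical call would pass the link's href, a destination path inside the download directory, and driver.get_cookies(). Sites that also check headers such as the user agent or a CSRF token may reject the request, in which case the browser-driven download remains the safer route.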

Console Commands for Testing

Test your download configuration with these useful commands:

# Check download directory permissions
ls -la downloads/

# Monitor download directory in real-time
watch -n 1 'ls -la downloads/'

# Remove partial downloads left behind by interrupted runs
rm -f downloads/*.crdownload

# Check that the downloaded file has the expected type
file downloads/example.pdf

Conclusion

Handling file downloads with Selenium WebDriver requires proper browser configuration, download monitoring, and error handling. By following the patterns and examples in this guide, you can create robust download automation that works across different browsers and file types. Remember to always validate your downloads and implement proper cleanup mechanisms to maintain a clean testing environment.

The key to successful file download automation is understanding each browser's specific requirements and implementing appropriate waiting strategies to ensure downloads complete successfully before proceeding with your automation workflow.
