How do I handle file downloads during web scraping with Symfony Panther?

Symfony Panther is a PHP library that provides browser automation capabilities through the WebDriver protocol. While it doesn't have built-in file download methods, you can configure the browser to automatically download files and handle them programmatically.

Configuration Methods

Chrome Browser Configuration

Configure Chrome to automatically download files without user prompts:

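use Facebook\WebDriver\Chrome\ChromeOptions;
use Symfony\Component\Panther\Client;

$downloadPath = sys_get_temp_dir() . '/panther_downloads';
if (!is_dir($downloadPath)) {
    mkdir($downloadPath, 0755, true);
}

// Panther does not read a top-level 'prefs' option. Chrome preferences are set
// on a ChromeOptions object, which is passed through the 'capabilities' option.
$chromeOptions = new ChromeOptions();
$chromeOptions->setExperimentalOption('prefs', [
    'download.default_directory' => $downloadPath,
    'download.prompt_for_download' => false,
    'download.directory_upgrade' => true,
    // Disabling Safe Browsing stops Chrome from blocking files it flags as
    // dangerous; leave it enabled unless you trust the source.
    'safebrowsing.enabled' => false,
]);

// Passing null as the second argument keeps Panther's default Chrome arguments.
// Flags such as --disable-web-security are not needed for downloads and weaken
// browser security, so they are omitted here.
$client = Client::createChromeClient(null, null, [
    'capabilities' => [ChromeOptions::CAPABILITY => $chromeOptions],
]);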

Firefox Browser Configuration

use Symfony\Component\Panther\Client;

$downloadPath = sys_get_temp_dir() . '/panther_downloads';
if (!is_dir($downloadPath)) {
    mkdir($downloadPath, 0755, true);
}

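// Panther does not read a top-level 'profile' option. Firefox preferences belong
// in the 'prefs' entry of the 'moz:firefoxOptions' capability. This sketch assumes
// Panther forwards entries in 'capabilities' unchanged; because the capability
// may replace the one Panther builds itself, headless mode is requested explicitly.
$client = Client::createFirefoxClient(null, null, [
    'capabilities' => [
        'moz:firefoxOptions' => [
            'args' => ['-headless'],
            'prefs' => [
                'browser.download.dir' => $downloadPath,
                'browser.download.folderList' => 2, // 2 = use browser.download.dir
                'browser.download.useDownloadDir' => true,
                'browser.helperApps.neverAsk.saveToDisk' => 'application/pdf,application/zip,text/csv',
            ],
        ],
    ],
]);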

Complete Download Implementation

Basic Download Handler Class

<?php

use Symfony\Component\Panther\Client;
use Symfony\Component\DomCrawler\Crawler;

class FileDownloadHandler
{
    private Client $client;
    private string $downloadPath;
    private int $timeout;

    public function __construct(?string $downloadPath = null, int $timeout = 30)
    {
        $this->downloadPath = $downloadPath ?: sys_get_temp_dir() . '/panther_downloads';
        $this->timeout = $timeout;
        $this->setupDownloadDirectory();
        $this->initializeClient();
    }

    private function setupDownloadDirectory(): void
    {
        if (!is_dir($this->downloadPath)) {
            mkdir($this->downloadPath, 0755, true);
        }

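        // Clean files left over from earlier runs; glob() can return false,
        // and subdirectories must not be passed to unlink()
        array_map('unlink', array_filter(glob($this->downloadPath . '/*') ?: [], 'is_file'));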
    }

    private function initializeClient(): void
    {
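        // Panther does not read a top-level 'prefs' option, so the preferences
        // are set on a ChromeOptions object and passed via 'capabilities'.
        $chromeOptions = new \Facebook\WebDriver\Chrome\ChromeOptions();
        $chromeOptions->setExperimentalOption('prefs', [
            'download.default_directory' => $this->downloadPath,
            'download.prompt_for_download' => false,
            'download.directory_upgrade' => true,
            'safebrowsing.enabled' => false,
            'plugins.always_open_pdf_externally' => true,
        ]);

        // Note: an explicit argument list replaces Panther's default arguments.
        $this->client = Client::createChromeClient(null, [
            '--no-sandbox',
            '--disable-dev-shm-usage',
        ], [
            'capabilities' => [
                \Facebook\WebDriver\Chrome\ChromeOptions::CAPABILITY => $chromeOptions,
            ],
        ]);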
    }

    public function downloadFile(string $url, ?string $selector = null): string
    {
        // Snapshot the directory before triggering the download so that a file
        // which finishes quickly is still detected as new.
        $initialFiles = $this->getDownloadedFiles();

        try {
            $crawler = $this->client->request('GET', $url);

            if ($selector !== null) {
                // Click download link/button
                $downloadElement = $crawler->filter($selector);
                if ($downloadElement->count() === 0) {
                    throw new \RuntimeException("Download element not found: {$selector}");
                }
                $downloadElement->click();
            }

            // Wait for download to complete
            return $this->waitForDownload($initialFiles);
        } catch (\Exception $e) {
            // Keep the original exception as "previous" to preserve its stack trace
            throw new \RuntimeException('Download failed: ' . $e->getMessage(), 0, $e);
        }
    }

    private function waitForDownload(array $initialFiles): string
    {
        $startTime = time();

        while ((time() - $startTime) < $this->timeout) {
            sleep(1);
            clearstatcache(); // filesize() results are cached between calls

            // New, non-empty files, ignoring in-progress artifacts (.crdownload, .tmp)
            $newFiles = array_filter(
                array_diff($this->getDownloadedFiles(), $initialFiles),
                fn (string $f) => is_file($f)
                    && !str_ends_with($f, '.crdownload')
                    && !str_ends_with($f, '.tmp')
                    && filesize($f) > 0
            );

            if (!empty($newFiles)) {
                return reset($newFiles);
            }
        }

        throw new \RuntimeException("Download timeout after {$this->timeout} seconds");
    }

    private function getDownloadedFiles(): array
    {
        // glob() returns false on error; normalize to an empty list
        return glob($this->downloadPath . '/*') ?: [];
    }

    public function getDownloadPath(): string
    {
        return $this->downloadPath;
    }

    public function __destruct()
    {
        // isset() avoids an Error when the typed property was never initialized
        if (isset($this->client)) {
            $this->client->quit();
        }
    }
}
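
The handler above only skips Chrome's `.crdownload` files. Firefox writes in-progress downloads to `.part` files, so a handler shared between browsers may want a broader check. The following is a minimal sketch, assuming two hypothetical helpers named `isPartialDownload()` and `completedDownloads()`; neither is part of Panther, and the suffix list is an assumption you may need to extend.

```php
<?php

/**
 * Hypothetical helper: reports whether a path looks like an unfinished download.
 * The suffix list is an assumption covering Chrome (.crdownload), Firefox (.part)
 * and generic temporary files (.tmp).
 */
function isPartialDownload(string $path): bool
{
    $partialSuffixes = ['.crdownload', '.part', '.tmp'];
    $name = strtolower(basename($path));

    foreach ($partialSuffixes as $suffix) {
        if (str_ends_with($name, $suffix)) {
            return true;
        }
    }

    return false;
}

/** Keeps only the paths that do not look like in-progress downloads. */
function completedDownloads(array $paths): array
{
    return array_values(array_filter($paths, fn (string $p) => !isPartialDownload($p)));
}
```

A polling loop could pass the result of `glob()` through `completedDownloads()` instead of repeating the suffix checks inline.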

Usage Examples

Download by Direct Link

$downloader = new FileDownloadHandler('/tmp/downloads');

try {
    $filePath = $downloader->downloadFile('https://example.com/report.pdf');
    echo "Downloaded: " . basename($filePath) . "\n";

    // Process the file
    $fileSize = filesize($filePath);
    $mimeType = mime_content_type($filePath);
    echo "File size: {$fileSize} bytes, MIME: {$mimeType}\n";

} catch (Exception $e) {
    echo "Error: " . $e->getMessage() . "\n";
}

Download by Button Click

$downloader = new FileDownloadHandler();

try {
    // Navigate to page and click download button
    $filePath = $downloader->downloadFile(
        'https://example.com/documents', 
        'button[data-download="report"]'
    );

    // Move file to permanent location
    $newLocation = '/var/www/uploads/' . basename($filePath);
    rename($filePath, $newLocation);

} catch (Exception $e) {
    echo "Download failed: " . $e->getMessage() . "\n";
}
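
Note that `rename()` silently replaces a file that already exists at the target path. If downloads with the same name should be kept side by side, a collision check can be added before moving. The following is a minimal sketch, assuming a hypothetical helper named `moveWithoutOverwrite()`; the numeric suffix scheme is an illustration, and the existence check is not safe against concurrent writers.

```php
<?php

/**
 * Hypothetical helper: moves $source into $targetDir without overwriting an
 * existing file. On a name collision a numeric suffix is appended, for
 * example "report-1.pdf". Returns the final path.
 */
function moveWithoutOverwrite(string $source, string $targetDir): string
{
    if (!is_dir($targetDir) && !mkdir($targetDir, 0755, true) && !is_dir($targetDir)) {
        throw new RuntimeException("Cannot create target directory: {$targetDir}");
    }

    $info = pathinfo($source);
    $extension = isset($info['extension']) ? '.' . $info['extension'] : '';
    $base = rtrim($targetDir, '/') . '/';
    $candidate = $base . $info['basename'];

    // Try report.pdf, then report-1.pdf, report-2.pdf, ... until a free name is found
    for ($i = 1; file_exists($candidate); $i++) {
        $candidate = $base . $info['filename'] . '-' . $i . $extension;
    }

    if (!rename($source, $candidate)) {
        throw new RuntimeException("Cannot move {$source} to {$candidate}");
    }

    return $candidate;
}
```

In the example above, the call `rename($filePath, $newLocation)` could then be replaced with `moveWithoutOverwrite($filePath, '/var/www/uploads')`.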

Advanced Download with Form Interaction

use Symfony\Component\Panther\Client;

class AdvancedDownloader
{
    private Client $client;

    public function downloadWithAuthentication(string $loginUrl, string $username, string $password, string $downloadUrl): string
    {
        // Setup client with download configuration
        $downloadPath = sys_get_temp_dir() . '/secure_downloads';
        if (!is_dir($downloadPath)) {
            mkdir($downloadPath, 0755, true);
        }

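        // Panther does not read a top-level 'prefs' option, so the preferences
        // are set on a ChromeOptions object and passed via 'capabilities'.
        $chromeOptions = new \Facebook\WebDriver\Chrome\ChromeOptions();
        $chromeOptions->setExperimentalOption('prefs', [
            'download.default_directory' => $downloadPath,
            'download.prompt_for_download' => false,
        ]);

        $this->client = Client::createChromeClient(null, null, [
            'capabilities' => [
                \Facebook\WebDriver\Chrome\ChromeOptions::CAPABILITY => $chromeOptions,
            ],
        ]);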

        // Login first
        $crawler = $this->client->request('GET', $loginUrl);
        $form = $crawler->selectButton('Login')->form([
            'username' => $username,
            'password' => $password
        ]);
        $this->client->submit($form);

        // Wait for login redirect
        $this->client->waitFor('#dashboard');

        // Navigate to download page
        $this->client->request('GET', $downloadUrl);

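        // Snapshot existing files so that leftovers from earlier runs are not
        // mistaken for the new download
        $existingFiles = glob($downloadPath . '/*') ?: [];

        // Click download
        $this->client->clickLink('Download Report');

        // Wait for download
        return $this->waitForFile($downloadPath, $existingFiles);
    }

    private function waitForFile(string $path, array $existingFiles = [], int $timeout = 30): string
    {
        $startTime = time();
        while ((time() - $startTime) < $timeout) {
            // Only consider files that appeared after the snapshot was taken
            $files = array_diff(glob($path . '/*') ?: [], $existingFiles);
            $files = array_filter($files, fn($f) => !str_ends_with($f, '.crdownload'));

            if (!empty($files)) {
                return reset($files);
            }
            sleep(1);
        }
        throw new \Exception('Download timeout');
    }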
}

Best Practices

1. Robust Wait Mechanisms

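private function waitForCompleteDownload(?string $expectedFilename = null): string
{
    $startTime = time();
    $lastSize = 0;
    $stableCount = 0;

    while ((time() - $startTime) < $this->timeout) {
        clearstatcache(); // otherwise filesize() may return a cached value

        $files = glob($this->downloadPath . '/*') ?: [];
        $files = array_filter($files, function ($file) use ($expectedFilename) {
            return !str_ends_with($file, '.crdownload') &&
                   !str_ends_with($file, '.tmp') &&
                   // When a filename is expected, ignore every other file
                   ($expectedFilename === null || basename($file) === $expectedFilename) &&
                   filesize($file) > 0;
        });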

        if (!empty($files)) {
            $file = reset($files);
            $currentSize = filesize($file);

            // Check if file size is stable (download complete)
            if ($currentSize === $lastSize && $currentSize > 0) {
                $stableCount++;
                if ($stableCount >= 3) { // 3 consecutive checks with same size
                    return $file;
                }
            } else {
                $stableCount = 0;
            }

            $lastSize = $currentSize;
        }

        sleep(1);
    }

    throw new \Exception("Download did not complete within {$this->timeout} seconds");
}

2. Error Handling and Cleanup

public function downloadWithCleanup(string $url, ?string $selector = null): array
{
    $tempFiles = [];

    try {
        $filePath = $this->downloadFile($url, $selector);
        $tempFiles[] = $filePath;

        // Validate file
        if (!$this->validateDownload($filePath)) {
            throw new \Exception('Downloaded file validation failed');
        }

        return [
            'success' => true,
            'file' => $filePath,
            'size' => filesize($filePath),
            'mime' => mime_content_type($filePath)
        ];

    } catch (\Exception $e) {
        // Clean up any partial downloads
        foreach ($tempFiles as $file) {
            if (file_exists($file)) {
                unlink($file);
            }
        }

        return [
            'success' => false,
            'error' => $e->getMessage()
        ];
    }
}

private function validateDownload(string $filePath): bool
{
    // Check file exists and has content
    if (!file_exists($filePath) || filesize($filePath) === 0) {
        return false;
    }

    // Check for common error patterns in the first 1 KB. This is a heuristic:
    // it also rejects downloads that are legitimately HTML.
    $content = file_get_contents($filePath, false, null, 0, 1024);
    $errorPatterns = ['<!DOCTYPE html', '<html', 'Error 404', 'Access Denied'];

    foreach ($errorPatterns as $pattern) {
        if (stripos($content, $pattern) !== false) {
            return false;
        }
    }

    return true;
}
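
Searching for HTML markers catches error pages, but it does not confirm that the file is of the expected type. One complementary check is to compare the first bytes of the file against a known signature. The following is a minimal sketch, assuming a hypothetical helper named `hasExpectedSignature()`; only the PDF and ZIP signatures are listed, as an illustration.

```php
<?php

/**
 * Hypothetical helper: checks whether a file starts with the byte signature
 * expected for the given type. Unknown types and unreadable files fail the check.
 */
function hasExpectedSignature(string $filePath, string $type): bool
{
    $signatures = [
        'pdf' => "%PDF-",        // PDF header
        'zip' => "PK\x03\x04",   // ZIP local file header (non-empty archives)
    ];

    if (!isset($signatures[$type]) || !is_readable($filePath)) {
        return false;
    }

    $signature = $signatures[$type];

    // Read only as many bytes as the signature needs
    $header = file_get_contents($filePath, false, null, 0, strlen($signature));

    return $header === $signature;
}
```

A validation step could call `hasExpectedSignature($filePath, 'pdf')` in addition to the error-pattern scan when the expected file type is known in advance.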

Common Issues and Solutions

  • Download path permissions: Ensure PHP has write access to the download directory
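  • Browser security: Set download preferences through browser capabilities rather than disabling security features; flags such as --disable-web-security are not required for downloads
  • File validation: Always verify downloaded files aren't error pages in disguise
  • Timeout handling: Scale the timeout to the expected file size, and wait until the file size stops changing before reading the file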
  • Cleanup: Remove temporary files to prevent disk space issues
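
The permission and cleanup points above can be combined into one maintenance step. The following is a minimal sketch, assuming a hypothetical helper named `pruneOldDownloads()` and an age threshold chosen by the caller; it is not part of Panther.

```php
<?php

/**
 * Hypothetical helper: deletes regular files in $dir whose modification time is
 * older than $maxAgeSeconds and returns the number of files removed.
 * Throws if the directory is missing or not writable.
 */
function pruneOldDownloads(string $dir, int $maxAgeSeconds): int
{
    if (!is_dir($dir) || !is_writable($dir)) {
        throw new RuntimeException("Download directory is missing or not writable: {$dir}");
    }

    clearstatcache(); // make sure filemtime() reads fresh values
    $cutoff = time() - $maxAgeSeconds;
    $removed = 0;

    foreach (glob(rtrim($dir, '/') . '/*') ?: [] as $file) {
        // Subdirectories are skipped; only stale regular files are deleted
        if (is_file($file) && filemtime($file) < $cutoff && unlink($file)) {
            $removed++;
        }
    }

    return $removed;
}
```

Calling `pruneOldDownloads($downloadPath, 3600)` before each run would, under these assumptions, remove files older than one hour and fail early if PHP cannot write to the directory.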

This approach provides a reliable way to handle file downloads in Symfony Panther while maintaining proper error handling and resource management.
