How do I create a web scraping client using Symfony Panther?

Symfony Panther is a powerful browser testing and web scraping library for PHP that leverages the WebDriver protocol. It provides real browser automation capabilities, enabling JavaScript execution and user interaction simulation, making it ideal for scraping modern web applications.

Prerequisites

Before getting started, ensure you have:

  • PHP 8.0 or higher (the examples below use PHP 8 features such as the mixed return type)
  • Composer installed
  • Chrome or Firefox browser
  • Basic knowledge of CSS selectors

Step 1: Installation

Install Symfony Panther using Composer:

composer require symfony/panther

For development environments, you may also want the Browser Driver Installer (bdi), which detects your installed browser and downloads a matching driver such as ChromeDriver:

composer require --dev dbrekelmans/bdi
vendor/bin/bdi detect drivers

Step 2: Basic Web Scraping Client

Here's a comprehensive example of a web scraping client:

<?php

require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\Panther\Client;
use Symfony\Component\DomCrawler\Crawler;

class WebScrapingClient
{
    private Client $client;

    public function __construct(array $options = [])
    {
        // Create client with custom options
        $this->client = Client::createChromeClient(null, [
            '--headless',
            '--no-sandbox',
            '--disable-dev-shm-usage',
            '--window-size=1920,1080'
        ], $options);
    }

    public function scrapeWebsite(string $url): array
    {
        try {
            // Navigate to the URL
            $crawler = $this->client->request('GET', $url);

            // Wait for page to load
            $this->client->waitFor('title');

            // Extract basic page information
            $data = [
                'title' => $this->getPageTitle($crawler),
                'meta_description' => $this->getMetaDescription($crawler),
                'headings' => $this->getHeadings($crawler),
                'links' => $this->getLinks($crawler),
                'images' => $this->getImages($crawler)
            ];

            return $data;

        } catch (\Exception $e) {
            // Chain the original exception so the root cause remains available for debugging
            throw new \RuntimeException("Scraping failed: " . $e->getMessage(), 0, $e);
        }
    }

    private function getPageTitle(Crawler $crawler): string
    {
        return $crawler->filter('title')->count() > 0 
            ? $crawler->filter('title')->text() 
            : '';
    }

    private function getMetaDescription(Crawler $crawler): string
    {
        return $crawler->filter('meta[name="description"]')->count() > 0
            ? $crawler->filter('meta[name="description"]')->attr('content')
            : '';
    }

    private function getHeadings(Crawler $crawler): array
    {
        $headings = [];
        $crawler->filter('h1, h2, h3, h4, h5, h6')->each(function (Crawler $node) use (&$headings) {
            $headings[] = [
                'tag' => $node->nodeName(),
                'text' => trim($node->text())
            ];
        });
        return $headings;
    }

    private function getLinks(Crawler $crawler): array
    {
        $links = [];
        $crawler->filter('a[href]')->each(function (Crawler $node) use (&$links) {
            $href = $node->attr('href');
            if (!empty($href)) {
                $links[] = [
                    'url' => $href,
                    'text' => trim($node->text())
                ];
            }
        });
        return $links;
    }

    private function getImages(Crawler $crawler): array
    {
        $images = [];
        $crawler->filter('img[src]')->each(function (Crawler $node) use (&$images) {
            $images[] = [
                'src' => $node->attr('src'),
                'alt' => $node->attr('alt') ?? ''
            ];
        });
        return $images;
    }

    public function __destruct()
    {
        $this->client->quit();
    }
}

Step 3: Advanced Features

Form Interaction

Handle forms and user interactions:

public function loginAndScrape(string $loginUrl, string $username, string $password): array
{
    $crawler = $this->client->request('GET', $loginUrl);

    // Fill login form
    $form = $crawler->selectButton('Login')->form();
    $form['username'] = $username;
    $form['password'] = $password;

    // Submit form
    $crawler = $this->client->submit($form);

    // Wait for redirect
    $this->client->waitFor('.dashboard');

    // Now scrape protected content
    return $this->scrapeWebsite($this->client->getCurrentURL());
}
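
Note that selectButton('Login') locates the button by its text content (it also matches id, value, or name attributes). Calling the method might look like this; the URL, credentials, and the '.dashboard' selector above are placeholders for your target site:

$scraper = new WebScrapingClient();
$data = $scraper->loginAndScrape('https://example.com/login', 'my-user', 'my-password');
print_r($data);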

JavaScript Execution

Execute custom JavaScript:

public function executeJavaScript(string $script): mixed
{
    return $this->client->executeScript($script);
}

public function scrollToBottom(): void
{
    $this->client->executeScript('window.scrollTo(0, document.body.scrollHeight);');

    // Give dynamic content time to load. Panther's wait() returns a WebDriverWait
    // object rather than pausing, so use a fixed delay here (or, better, waitFor()
    // on a selector that appears once the new content renders)
    sleep(2);
}
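
For infinite-scroll pages, a minimal sketch is to repeat the scroll until the page height stops growing; the iteration cap below is an arbitrary safety limit, not part of Panther's API:

public function scrollToEnd(int $maxScrolls = 10): void
{
    $previousHeight = 0;
    for ($i = 0; $i < $maxScrolls; $i++) {
        $height = (int) $this->client->executeScript('return document.body.scrollHeight;');
        if ($height === $previousHeight) {
            break; // No new content appeared, assume we reached the end
        }
        $previousHeight = $height;
        $this->scrollToBottom();
    }
}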

Screenshot Capture

Take screenshots for debugging:

public function takeScreenshot(?string $filename = null): string
{
    $filename = $filename ?? 'screenshot_' . date('Y-m-d_H-i-s') . '.png';
    $this->client->takeScreenshot($filename);
    return $filename;
}

Wait Strategies

Implement various waiting strategies:

public function waitForElement(string $selector, int $timeout = 10): void
{
    $this->client->waitFor($selector, $timeout);
}

public function waitForText(string $selector, string $text, int $timeout = 10): void
{
    // Panther has no waitForText() method; waitForElementToContain() waits until
    // the element matched by $selector contains the given text
    $this->client->waitForElementToContain($selector, $text, $timeout);
}

public function waitForInvisibility(string $selector, int $timeout = 10): void
{
    $this->client->waitForInvisibility($selector, $timeout);
}
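
As a usage example, here is a sketch of a method that waits for a loading indicator to disappear before reading results. The '.spinner' and '.result' selectors are placeholders for your target page, and waitForInvisibility() throws if the timeout expires first:

public function scrapeSearchResults(string $url): array
{
    $this->client->request('GET', $url);

    // Block until the (hypothetical) spinner is gone, then collect result texts
    $this->waitForInvisibility('.spinner', 15);

    return $this->client->getCrawler()->filter('.result')->each(
        fn (Crawler $node) => trim($node->text())
    );
}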

Step 4: Usage Example

// Create scraper instance
$scraper = new WebScrapingClient();

// Basic scraping
$data = $scraper->scrapeWebsite('https://example.com');
print_r($data);

// Take screenshot
$screenshot = $scraper->takeScreenshot();
echo "Screenshot saved: " . $screenshot . "\n";

// Execute JavaScript
$pageHeight = $scraper->executeJavaScript('return document.body.scrollHeight;');
echo "Page height: " . $pageHeight . "px\n";

Configuration Options

Client Options

$options = [
    'connection_timeout_in_ms' => 5000,
    'request_timeout_in_ms' => 60000,
];

$client = Client::createChromeClient(null, [
    '--headless',
    '--no-sandbox',
    '--disable-gpu',
    '--window-size=1920,1080',
    '--user-agent=Mozilla/5.0 (compatible; WebScraper/1.0)'
], $options);

Environment Variables

Set environment variables for browser paths:

export PANTHER_CHROME_BINARY=/usr/bin/google-chrome
export PANTHER_CHROME_DRIVER_BINARY=/usr/bin/chromedriver

Best Practices

  1. Resource Management: Always call quit() on the client to free resources
  2. Error Handling: Implement comprehensive exception handling
  3. Rate Limiting: Add delays between requests to avoid overloading servers (see the sketch after this list)
  4. Headless Mode: Use headless browsers in production for better performance
  5. User Agents: Rotate user agents to avoid detection
  6. Timeouts: Set appropriate timeouts for different operations
  7. Cleanup: Close browser instances properly to prevent memory leaks
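
As referenced in item 3, here is a minimal rate-limited scraping loop, assuming the WebScrapingClient class from Step 2 (the URL list and one-second delay are illustrative):

$scraper = new WebScrapingClient();

$urls = [
    'https://example.com/page/1',
    'https://example.com/page/2',
];

$results = [];
foreach ($urls as $url) {
    try {
        $results[$url] = $scraper->scrapeWebsite($url);
    } catch (\RuntimeException $e) {
        // Log the failure and continue instead of aborting the whole run
        error_log($e->getMessage());
    }

    // Fixed delay between requests; tune it to the target site's tolerance
    sleep(1);
}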

Common Issues and Solutions

  • Browser not found: Install Chrome/Firefox or set correct binary paths
  • Timeout errors: Increase timeout values or improve wait strategies
  • Memory issues: Use headless mode and clean up resources properly
  • Stale elements: Re-query elements after page navigation (see the snippet after this list)
  • CAPTCHA: Consider using CAPTCHA solving services or alternative approaches
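
For the stale-element issue, work from a fresh crawler after each navigation instead of reusing nodes from the previous page. A minimal sketch, assuming a Panther Client $client created as in Step 2:

$crawler = $client->request('GET', 'https://example.com');
$link = $crawler->filter('a[href]')->first()->link();

// Navigating invalidates element references held by the old crawler;
// click() returns a fresh crawler for the new page
$crawler = $client->click($link);

echo $crawler->filter('title')->text();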

Symfony Panther provides a robust foundation for web scraping with full browser automation capabilities, making it particularly effective for JavaScript-heavy applications and complex user interactions.
