Symfony Panther is a powerful browser testing and web scraping library for PHP that leverages the WebDriver protocol. It provides real browser automation capabilities, enabling JavaScript execution and user interaction simulation, making it ideal for scraping modern web applications.
Prerequisites
Before getting started, ensure you have:
- PHP 8.0 or higher (the examples below use the `mixed` return type, which requires PHP 8.0)
- Composer installed
- Chrome or Firefox browser
- Basic knowledge of CSS selectors
Step 1: Installation
Install Symfony Panther using Composer:
composer require symfony/panther
For development environments, you may also want to install ChromeDriver:
composer require --dev dbrekelmans/bdi
vendor/bin/bdi detect drivers
Step 2: Basic Web Scraping Client
Here's a comprehensive example of a web scraping client:
<?php

require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;
use Symfony\Component\Panther\Client;

class WebScrapingClient
{
    private Client $client;

    public function __construct(array $options = [])
    {
        // Create a Chrome client with custom browser arguments
        $this->client = Client::createChromeClient(null, [
            '--headless',
            '--no-sandbox',
            '--disable-dev-shm-usage',
            '--window-size=1920,1080',
        ], $options);
    }

    public function scrapeWebsite(string $url): array
    {
        try {
            // Navigate to the URL
            $crawler = $this->client->request('GET', $url);

            // Wait for the document body to be present before extracting data
            $this->client->waitFor('body');

            // Extract basic page information
            return [
                'title' => $this->getPageTitle($crawler),
                'meta_description' => $this->getMetaDescription($crawler),
                'headings' => $this->getHeadings($crawler),
                'links' => $this->getLinks($crawler),
                'images' => $this->getImages($crawler),
            ];
        } catch (\Exception $e) {
            // Chain the original exception so the root cause is preserved
            throw new \RuntimeException('Scraping failed: ' . $e->getMessage(), 0, $e);
        }
    }

    private function getPageTitle(Crawler $crawler): string
    {
        return $crawler->filter('title')->count() > 0
            ? $crawler->filter('title')->text()
            : '';
    }

    private function getMetaDescription(Crawler $crawler): string
    {
        // attr() can return null, so fall back to an empty string
        return $crawler->filter('meta[name="description"]')->count() > 0
            ? ($crawler->filter('meta[name="description"]')->attr('content') ?? '')
            : '';
    }

    private function getHeadings(Crawler $crawler): array
    {
        $headings = [];
        $crawler->filter('h1, h2, h3, h4, h5, h6')->each(function (Crawler $node) use (&$headings) {
            $headings[] = [
                'tag' => $node->nodeName(),
                'text' => trim($node->text()),
            ];
        });

        return $headings;
    }

    private function getLinks(Crawler $crawler): array
    {
        $links = [];
        $crawler->filter('a[href]')->each(function (Crawler $node) use (&$links) {
            $href = $node->attr('href');
            if (!empty($href)) {
                $links[] = [
                    'url' => $href,
                    'text' => trim($node->text()),
                ];
            }
        });

        return $links;
    }

    private function getImages(Crawler $crawler): array
    {
        $images = [];
        $crawler->filter('img[src]')->each(function (Crawler $node) use (&$images) {
            $images[] = [
                'src' => $node->attr('src'),
                'alt' => $node->attr('alt') ?? '',
            ];
        });

        return $images;
    }

    public function __destruct()
    {
        // Quit the browser to free the WebDriver session
        $this->client->quit();
    }
}
Step 3: Advanced Features
Form Interaction
Handle forms and user interactions:
public function loginAndScrape(string $loginUrl, string $username, string $password): array
{
    $crawler = $this->client->request('GET', $loginUrl);

    // Fill the login form
    $form = $crawler->selectButton('Login')->form();
    $form['username'] = $username;
    $form['password'] = $password;

    // Submit the form
    $crawler = $this->client->submit($form);

    // Wait for the post-login page to render
    $this->client->waitFor('.dashboard');

    // Now scrape the protected content
    return $this->scrapeWebsite($this->client->getCurrentURL());
}
JavaScript Execution
Execute custom JavaScript:
public function executeJavaScript(string $script): mixed
{
    return $this->client->executeScript($script);
}

public function scrollToBottom(): void
{
    $this->client->executeScript('window.scrollTo(0, document.body.scrollHeight);');

    // Give dynamic content time to load. Note that $client->wait() returns a
    // WebDriverWait object rather than pausing execution, so use sleep() for a
    // fixed delay, or (better) waitFor() a selector that appears once the
    // new content has loaded.
    sleep(2);
}
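For pages that load content as you scroll (infinite scroll), a single scroll is often not enough. Here is a minimal sketch of a repeat-until-stable loop building on the `scrollToBottom()` idea above; the method name and loop bounds are illustrative, not part of Panther's API:

```php
public function scrollUntilStable(int $maxScrolls = 10): void
{
    $previousHeight = 0;

    for ($i = 0; $i < $maxScrolls; $i++) {
        // Measure the page; if no new content appeared since the last
        // scroll, assume we have reached the end
        $height = $this->client->executeScript('return document.body.scrollHeight;');
        if ($height === $previousHeight) {
            break;
        }

        $previousHeight = $height;
        $this->client->executeScript('window.scrollTo(0, document.body.scrollHeight);');
        sleep(1); // crude fixed delay; prefer waitFor() on a known selector
    }
}
```

The `$maxScrolls` cap prevents the loop from running forever on pages that keep loading content indefinitely.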
Screenshot Capture
Take screenshots for debugging:
public function takeScreenshot(?string $filename = null): string
{
    // Note the explicit ?string: implicitly nullable parameters
    // (string $filename = null) are deprecated as of PHP 8.4
    $filename = $filename ?? 'screenshot_' . date('Y-m-d_H-i-s') . '.png';
    $this->client->takeScreenshot($filename);

    return $filename;
}
Wait Strategies
Implement various waiting strategies:
public function waitForElement(string $selector, int $timeout = 10): void
{
    $this->client->waitFor($selector, $timeout);
}

public function waitForText(string $selector, string $text, int $timeout = 10): void
{
    // Panther has no waitForText() method; waitForElementToContain()
    // waits until the given element contains the expected text
    $this->client->waitForElementToContain($selector, $text, $timeout);
}

public function waitForInvisibility(string $selector, int $timeout = 10): void
{
    $this->client->waitForInvisibility($selector, $timeout);
}
Step 4: Usage Example
// Create scraper instance
$scraper = new WebScrapingClient();
// Basic scraping
$data = $scraper->scrapeWebsite('https://example.com');
print_r($data);
// Take screenshot
$screenshot = $scraper->takeScreenshot();
echo "Screenshot saved: " . $screenshot . "\n";
// Execute JavaScript
$pageHeight = $scraper->executeJavaScript('return document.body.scrollHeight;');
echo "Page height: " . $pageHeight . "px\n";
Configuration Options
Client Options
$options = [
    'connection_timeout_in_ms' => 5000,
    'request_timeout_in_ms' => 60000,
];

$client = Client::createChromeClient(null, [
    '--headless',
    '--no-sandbox',
    '--disable-gpu',
    '--window-size=1920,1080',
    '--user-agent=Mozilla/5.0 (compatible; WebScraper/1.0)',
], $options);
Environment Variables
Set environment variables for browser paths:
export PANTHER_CHROME_BINARY=/usr/bin/google-chrome
export PANTHER_CHROME_DRIVER_BINARY=/usr/bin/chromedriver
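Panther supports Firefox as well as Chrome. A Firefox-based client can be created with the `createFirefoxClient()` factory; geckodriver must be available on the machine (the `bdi` tool from Step 1 can detect it). The browser arguments shown here are illustrative:

```php
use Symfony\Component\Panther\Client;

// Same factory pattern as createChromeClient(), but driving Firefox
$client = Client::createFirefoxClient(null, [
    '--headless',
    '--width=1920',
    '--height=1080',
]);
```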
Best Practices
- Resource Management: Always call quit() on the client to free resources
- Error Handling: Implement comprehensive exception handling
- Rate Limiting: Add delays between requests to avoid overloading servers
- Headless Mode: Use headless browsers in production for better performance
- User Agents: Rotate user agents to avoid detection
- Timeouts: Set appropriate timeouts for different operations
- Cleanup: Close browser instances properly to prevent memory leaks
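The rate-limiting advice above can be sketched as a simple throttle between requests. The `scrapeMany` helper and the delay value are illustrative:

```php
public function scrapeMany(array $urls, int $delaySeconds = 2): array
{
    $results = [];

    foreach ($urls as $url) {
        $results[$url] = $this->scrapeWebsite($url);

        // Throttle: pause between requests to avoid overloading the server
        sleep($delaySeconds);
    }

    return $results;
}
```

For production use, a randomized delay (e.g. `sleep(random_int(1, 4))`) looks less mechanical than a fixed interval.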
Common Issues and Solutions
- Browser not found: Install Chrome/Firefox or set correct binary paths
- Timeout errors: Increase timeout values or improve wait strategies
- Memory issues: Use headless mode and clean up resources properly
- Stale elements: Re-query elements after page navigation
- CAPTCHA: Consider using CAPTCHA solving services or alternative approaches
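For the stale-elements issue above, the fix is to re-query after any navigation rather than reusing references captured on the previous page. A sketch, with illustrative URL and selectors:

```php
$crawler = $this->client->request('GET', 'https://example.com/page/1');
$next = $crawler->selectLink('Next')->link();

// click() returns a fresh Crawler bound to the new page
$crawler = $this->client->click($next);

// Re-query on the fresh crawler; elements captured before the click
// are stale and would throw a StaleElementReferenceException
$title = $crawler->filter('h1')->text();
```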
Symfony Panther provides a robust foundation for web scraping with full browser automation capabilities, making it particularly effective for JavaScript-heavy applications and complex user interactions.