Can I use CSS selectors and XPath expressions with Symfony Panther?
Yes. Symfony Panther supports both CSS selectors and XPath expressions for element selection and data extraction. It is built on the php-webdriver library (originally developed at Facebook) and implements Symfony's DomCrawler API, so queries are evaluated against the live DOM of a real browser.
CSS Selectors in Symfony Panther
CSS selectors are passed to the crawler's filter() method, which Panther exposes through the same API as Symfony's DomCrawler component.
Basic CSS Selector Usage
<?php
use Symfony\Component\Panther\PantherTestCase;

class WebScrapingTest extends PantherTestCase
{
    public function testCssSelectors()
    {
        $client = static::createPantherClient();
        $crawler = $client->request('GET', 'https://example.com');

        // Select by element tag
        $title = $crawler->filter('h1')->text();
        // Select by class
        $navigation = $crawler->filter('.nav-menu');
        // Select by ID
        $content = $crawler->filter('#main-content')->text();
        // Select by attribute
        $submitButton = $crawler->filter('input[type="submit"]');
        // Complex selectors
        $firstArticle = $crawler->filter('article.post:first-child');
        $links = $crawler->filter('nav ul li a');
    }
}
Advanced CSS Selector Patterns
// Descendant selectors
$menuItems = $crawler->filter('.navbar .dropdown-menu li');
// Child selectors
$directChildren = $crawler->filter('.container > div');
// Pseudo-selectors
$firstItem = $crawler->filter('ul li:first-child');
$lastItem = $crawler->filter('ul li:last-child');
$evenRows = $crawler->filter('table tr:nth-child(even)');
// Attribute selectors
$externalLinks = $crawler->filter('a[href^="http"]');
$downloadLinks = $crawler->filter('a[href$=".pdf"]');
$requiredFields = $crawler->filter('input[required]');
// Multiple class selection
$activeButtons = $crawler->filter('.btn.active');
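One caveat about the a[href^="http"] selector above: it matches every absolute URL, including links that point back to the same site. If "external" should mean "different host", a small post-filter on the extracted href values can help. The sketch below is a plain PHP helper; the function name isExternalLink() and its parameters are hypothetical and not part of Panther.

```php
<?php

/**
 * Returns true when $href points to a different host than $pageUrl.
 * Relative URLs and same-host absolute URLs count as internal.
 * (Hypothetical helper, not a Panther API.)
 */
function isExternalLink(string $href, string $pageUrl): bool
{
    $linkHost = parse_url($href, PHP_URL_HOST);
    $pageHost = parse_url($pageUrl, PHP_URL_HOST);

    // Relative links ("/about", "#top") have no host component,
    // and malformed URLs make parse_url() return false.
    if (!is_string($linkHost) || !is_string($pageHost)) {
        return false;
    }

    // Host names are case-insensitive.
    return strcasecmp($linkHost, $pageHost) !== 0;
}
```

It could be applied to the values returned by $crawler->filter('a[href]')->extract(['href']), for example with array_filter(). The host comparison is deliberately simple: subdomains such as cdn.example.com are treated as external.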
XPath Expressions in Symfony Panther
For more complex queries, Symfony Panther supports XPath expressions through the filterXPath() method.
Basic XPath Usage
<?php
use Symfony\Component\Panther\PantherTestCase;

class XPathScrapingTest extends PantherTestCase
{
    public function testXPathSelectors()
    {
        $client = static::createPantherClient();
        $crawler = $client->request('GET', 'https://example.com');

        // Select by element name
        $headings = $crawler->filterXPath('//h1 | //h2 | //h3');
        // Select by attribute value
        $submitButtons = $crawler->filterXPath('//input[@type="submit"]');
        // Select by text content
        $loginLink = $crawler->filterXPath('//a[text()="Login"]');
        // Select by partial text
        $searchResults = $crawler->filterXPath('//div[contains(text(), "Search")]');
        // Select parent elements
        $parentDiv = $crawler->filterXPath('//span[@class="error"]/..');
    }
}
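Matching by text works as long as the text is a fixed string. When the text comes from a variable, quotes inside it can break the expression, because XPath 1.0 has no escape character. The sketch below shows one common way to build a safe string literal. The standalone function is only an illustration of the idea; DomCrawler's Crawler class ships a static xpathLiteral() helper for the same purpose, which is usually the better choice in a Symfony project.

```php
<?php

/**
 * Quotes an arbitrary string as an XPath 1.0 string literal.
 * A value containing both quote types is assembled with concat().
 */
function xpathLiteral(string $value): string
{
    // No double quote inside: wrap in double quotes.
    if (strpos($value, '"') === false) {
        return '"' . $value . '"';
    }
    // No single quote inside: wrap in single quotes.
    if (strpos($value, "'") === false) {
        return "'" . $value . "'";
    }

    // Both quote types present: split on double quotes and join the
    // pieces with a single-quoted '"' between them.
    $parts = array_map(
        static function (string $part): string {
            return '"' . $part . '"';
        },
        explode('"', $value)
    );

    return 'concat(' . implode(", '\"', ", $parts) . ')';
}

// Illustrative usage: match a link by its exact visible text.
$linkText = "Don't miss \"this\"";
$xpath = '//a[normalize-space()=' . xpathLiteral($linkText) . ']';
```

The resulting $xpath string can then be passed to filterXPath() like any of the hard-coded expressions above.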
Advanced XPath Expressions
// Select elements with specific position
$thirdListItem = $crawler->filterXPath('//ul/li[3]');
$lastTableRow = $crawler->filterXPath('//table/tbody/tr[last()]');
// Conditional selections
$checkedCheckboxes = $crawler->filterXPath('//input[@type="checkbox" and @checked]');
$emptyFields = $crawler->filterXPath('//input[@value="" or not(@value)]');
// Text-based selections
$priceElements = $crawler->filterXPath('//span[contains(@class, "price")]');
$headingsContainingApi = $crawler->filterXPath('//h2[contains(text(), "API")]');
// Sibling selections
$nextSibling = $crawler->filterXPath('//div[@id="content"]/following-sibling::div[1]');
$previousSibling = $crawler->filterXPath('//h2[text()="Overview"]/preceding-sibling::h1');
// Ancestor selections
$formContainer = $crawler->filterXPath('//input[@name="username"]/ancestor::form');
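Note that contains(@class, "price") is a substring test, so it also matches classes such as "price-old" or "unit-price". When an exact class token is needed (the behaviour of the CSS selector .price), a common XPath 1.0 idiom pads the attribute with spaces before comparing. The helper below is a sketch of that idiom; the name xpathHasClass() is hypothetical.

```php
<?php

/**
 * Builds an XPath predicate that matches a whole class token,
 * roughly the XPath 1.0 equivalent of the CSS selector ".name".
 * (Hypothetical helper, not a Panther or DomCrawler API.)
 */
function xpathHasClass(string $class): string
{
    // Restrict input to simple class names so no quoting is needed.
    if (!preg_match('/^[A-Za-z0-9_-]+$/', $class)) {
        throw new InvalidArgumentException('Unsupported class name: ' . $class);
    }

    // Pad the normalized class list with spaces, then look for " name ".
    return sprintf('contains(concat(" ", normalize-space(@class), " "), " %s ")', $class);
}

// Illustrative usage: spans whose class list contains the token "price".
$xpath = '//span[' . xpathHasClass('price') . ']';
```

In most cases the CSS form $crawler->filter('span.price') is simpler; the XPath version is mainly useful when the class test has to be combined with axes or text predicates.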
Practical Web Scraping Examples
Extracting Data from Tables
<?php
use Symfony\Component\Panther\PantherTestCase;

class TableScrapingExample extends PantherTestCase
{
    public function scrapeProductTable()
    {
        $client = static::createPantherClient();
        $crawler = $client->request('GET', 'https://example-shop.com/products');

        // Using CSS selectors
        $products = [];
        $crawler->filter('table.products tbody tr')->each(function ($row) use (&$products) {
            $name = $row->filter('td:nth-child(1)')->text();
            $price = $row->filter('td.price')->text();
            $stock = $row->filter('td[data-stock]')->attr('data-stock');
            $products[] = [
                'name' => $name,
                'price' => $price,
                'stock' => (int) $stock
            ];
        });

        // Using XPath for more complex selections
        $featuredProducts = $crawler->filterXPath('//tr[contains(@class, "featured")]')->each(function ($row) {
            return [
                'name' => $row->filterXPath('.//td[1]')->text(),
                'price' => $row->filterXPath('.//td[contains(@class, "price")]')->text(),
                // The XPath must resolve to an element; read the attribute with attr()
                'rating' => $row->filterXPath('.//span[@class="stars"]')->attr('data-rating')
            ];
        });

        // Return both result sets so the featured rows are not discarded
        return ['products' => $products, 'featured' => $featuredProducts];
    }
}
Form Interaction and Data Extraction
<?php
use Symfony\Component\Panther\PantherTestCase;

class FormInteractionExample extends PantherTestCase
{
    public function interactWithSearchForm()
    {
        $client = static::createPantherClient();
        $crawler = $client->request('GET', 'https://example.com/search');

        // Locate the form through its submit button and fill in the fields
        $form = $crawler->selectButton('Search')->form();
        $form['query'] = 'web scraping';
        $form['category'] = 'technology';

        // Submit form and get results
        $resultsCrawler = $client->submit($form);

        // Extract search results using XPath
        $results = $resultsCrawler->filterXPath('//div[@class="search-result"]')->each(function ($result) {
            $link = $result->filterXPath('.//h3/a');
            $time = $result->filterXPath('.//time');
            return [
                'title' => $link->text(),
                // Select the element, then read its attribute with attr()
                'url' => $link->attr('href'),
                'snippet' => $result->filterXPath('.//p[@class="snippet"]')->text(),
                // The <time> element is optional, so check that it exists first
                'date' => $time->count() > 0 ? $time->attr('datetime') : null
            ];
        });

        return $results;
    }
}
Handling Dynamic Content
JavaScript-heavy pages often load content after the initial navigation, so a scraper has to wait for that content before querying it. Because Panther drives a real browser, it can wait for elements to appear:
<?php
use Symfony\Component\Panther\PantherTestCase;

class DynamicContentExample extends PantherTestCase
{
    public function scrapeDynamicContent()
    {
        $client = static::createPantherClient();
        $client->request('GET', 'https://spa-example.com');

        // Wait for dynamic content to load; waitFor() returns a refreshed crawler
        $crawler = $client->waitFor('.dynamic-content');
        // Or wait until an element contains specific text
        $client->waitForElementToContain('.dynamic-content', 'Loading complete');

        // Now extract data from dynamically loaded elements
        $dynamicData = $crawler->filter('.ajax-loaded-content')->each(function ($element) {
            return [
                'id' => $element->attr('data-id'),
                'content' => $element->filter('.content')->text(),
                // Select the <time> element, then read its attribute with attr()
                'timestamp' => $element->filterXPath('.//time')->attr('data-timestamp')
            ];
        });

        return $dynamicData;
    }
}
JavaScript Execution with Element Selection
Symfony Panther allows you to combine JavaScript execution with element selection:
<?php
use Symfony\Component\Panther\PantherTestCase;

class JavaScriptSelectionExample extends PantherTestCase
{
    public function executeJavaScriptWithSelectors()
    {
        $client = static::createPantherClient();
        $client->request('GET', 'https://example.com');

        // Execute JavaScript to modify page state
        $client->executeScript('
            document.querySelector(".hidden-section").style.display = "block";
            document.querySelector("#load-more").click();
        ');

        // Wait for changes to take effect and use the refreshed crawler
        $crawler = $client->waitFor('.newly-loaded-content');

        // Now extract data that wasn't initially visible
        $hiddenData = $crawler->filter('.hidden-section .data-item')->each(function ($item) {
            return [
                'value' => $item->attr('data-value'),
                'text' => $item->text()
            ];
        });

        return $hiddenData;
    }
}
Error Handling and Best Practices
Robust Element Selection
<?php
use Symfony\Component\Panther\PantherTestCase;

class RobustScrapingExample extends PantherTestCase
{
    public function robustElementSelection()
    {
        $client = static::createPantherClient();
        $crawler = $client->request('GET', 'https://example.com');

        try {
            // Check if element exists before extracting data
            if ($crawler->filter('.price')->count() > 0) {
                $price = $crawler->filter('.price')->text();
            } else {
                // Fallback to XPath selector
                $priceNode = $crawler->filterXPath('//span[contains(@class, "cost")]');
                $price = $priceNode->count() > 0 ? $priceNode->text() : 'N/A';
            }

            // Multiple selector strategy: the first selector that matches wins
            $title = $this->getTextBySelectorPriority($crawler, [
                'h1.page-title',
                '.title',
                '//h1',
                '.header h2'
            ]);

            return ['price' => $price, 'title' => $title];
        } catch (\Exception $e) {
            // Handle scraping errors gracefully
            error_log("Failed to extract data: " . $e->getMessage());
            return null;
        }
    }

    private function getTextBySelectorPriority($crawler, array $selectors)
    {
        foreach ($selectors as $selector) {
            try {
                if (strpos($selector, '//') === 0) {
                    // XPath selector
                    $elements = $crawler->filterXPath($selector);
                } else {
                    // CSS selector
                    $elements = $crawler->filter($selector);
                }
                if ($elements->count() > 0) {
                    return $elements->text();
                }
            } catch (\Exception $e) {
                // Invalid or unsupported selector: try the next one
                continue;
            }
        }

        return null;
    }
}
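The strpos($selector, '//') === 0 check above only recognises XPath expressions that begin with a double slash. Relative expressions such as .//td[1] or grouped ones such as (//li)[1] would be sent to filter() and fail as CSS. The sketch below shows a slightly broader heuristic; the function name isXPathSelector() is hypothetical, and the rule (leading slash, ./, ../ or opening parenthesis) is an assumption that happens to hold because no valid CSS selector starts with those characters.

```php
<?php

/**
 * Heuristically decides whether a selector string is XPath rather than CSS.
 * Treats "/", "./", "../" and "(" at the start as XPath markers.
 * (Hypothetical helper, not a Panther API.)
 */
function isXPathSelector(string $selector): bool
{
    // Ignore leading whitespace so padded strings are classified the same way.
    $selector = ltrim($selector);

    return preg_match('#^(?:\.{0,2}/|\()#', $selector) === 1;
}
```

Inside getTextBySelectorPriority(), the condition could then read isXPathSelector($selector) instead of the strpos() comparison. It remains a heuristic: an XPath expression that starts with a bare element name, such as html/body, would still be classified as CSS.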
Performance Optimization
For better performance when working with large pages or multiple selectors, consider these strategies:
<?php
use Symfony\Component\Panther\PantherTestCase;

class OptimizedScrapingExample extends PantherTestCase
{
    public function optimizedExtraction()
    {
        $client = static::createPantherClient();
        $crawler = $client->request('GET', 'https://example.com');

        // Look up the container once and reuse it for the nested queries
        $mainContent = $crawler->filter('#main-content');

        // Extract multiple pieces of data from the cached element
        $data = [
            'title' => $mainContent->filter('h1')->text(),
            'subtitle' => $mainContent->filter('.subtitle')->text(),
            'content' => $mainContent->filter('.content')->text(),
            // <meta> tags normally live in <head>, so query them from the document root
            'metadata' => $crawler->filterXPath('//meta[@name]')->each(function ($meta) {
                return [
                    'name' => $meta->attr('name'),
                    'content' => $meta->attr('content')
                ];
            })
        ];

        return $data;
    }
}
Console Commands for Batch Processing
You can also use Symfony Panther in console commands for batch web scraping:
# Create a console command
bin/console make:command scrape:products
# Run the scraping command
bin/console scrape:products --url="https://example.com/products" --output="products.json"
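<?php
// src/Command/ScrapeProductsCommand.php
namespace App\Command;

use Symfony\Component\Console\Attribute\AsCommand;
use Symfony\Component\Console\Command\Command;
use Symfony\Component\Console\Input\InputInterface;
use Symfony\Component\Console\Input\InputOption;
use Symfony\Component\Console\Output\OutputInterface;
use Symfony\Component\Panther\Client;

#[AsCommand(name: 'scrape:products')]
class ScrapeProductsCommand extends Command
{
    protected function configure(): void
    {
        // Declare the options used on the command line above
        $this->addOption('url', null, InputOption::VALUE_REQUIRED, 'Page to scrape');
        $this->addOption('output', null, InputOption::VALUE_REQUIRED, 'Output file', 'products.json');
    }

    protected function execute(InputInterface $input, OutputInterface $output): int
    {
        // createPantherClient() is a PHPUnit test helper,
        // so a command creates the browser client directly
        $client = Client::createChromeClient();
        $crawler = $client->request('GET', $input->getOption('url'));

        // Use both CSS selectors and XPath for comprehensive data extraction
        $products = $crawler->filter('.product-card')->each(function ($card) {
            return [
                'name' => $card->filter('.product-name')->text(),
                'price' => $card->filterXPath('.//span[contains(@class, "price")]')->text(),
                'image' => $card->filter('img')->attr('src'),
                'availability' => $card->filterXPath('.//span[@data-availability]')->attr('data-availability')
            ];
        });

        file_put_contents($input->getOption('output'), json_encode($products, JSON_PRETTY_PRINT));
        // Close the browser once the data has been written
        $client->quit();

        return Command::SUCCESS;
    }
}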
Combining with Other Tools
The selector strings themselves are not specific to Panther. The same CSS selectors and XPath expressions can be reused when interacting with DOM elements in Puppeteer or in other browser automation frameworks, although the surrounding API calls differ.
Both CSS selectors and XPath expressions in Symfony Panther provide powerful ways to target and extract data from web pages. CSS selectors are generally more readable and faster for simple selections, while XPath offers more advanced querying capabilities for complex scenarios. The choice between them often depends on the specific requirements of your web scraping project and your team's familiarity with each syntax.