Symfony Panther provides a powerful API for filtering HTML elements and extracting their text using both CSS selectors and XPath expressions. This guide covers the essential methods and patterns for efficient text extraction.
Installation
Install Symfony Panther via Composer:
composer require symfony/panther
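Panther drives a real browser, so a matching browser driver (ChromeDriver or geckodriver) must also be available. One common approach, assuming you use the dbrekelmans/bdi package, is to let it detect your installed browser and download the right driver:
composer require --dev dbrekelmans/bdi
vendor/bin/bdi detect drivers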
Basic Text Extraction Syntax
CSS Selectors
use Symfony\Component\Panther\PantherTestCase;

class TextExtractionExample extends PantherTestCase
{
    public function testExtractText(): void
    {
        $client = static::createPantherClient();
        $crawler = $client->request('GET', 'https://example.com');

        // Extract text from a single element
        $title = $crawler->filter('h1')->text();

        // Extract text from an element with a class
        $content = $crawler->filter('.content')->text();

        // Extract text from an element with an ID
        $header = $crawler->filter('#header')->text();

        // Extract text from nested elements
        $menuItem = $crawler->filter('nav ul li a')->text();
    }
}
XPath Expressions
// XPath for more complex selections
$titleText = $crawler->filterXPath('//h1[@class="main-title"]')->text();
$linkText = $crawler->filterXPath('//a[contains(@href, "contact")]')->text();
$tableData = $crawler->filterXPath('//table//td[position()=2]')->text();
Multiple Elements Extraction
Extract All Matching Elements
// Get text from all matching elements
$allHeadings = $crawler->filter('h2')->each(function ($node) {
    return $node->text();
});

// Extract links and their text
$allLinks = $crawler->filter('a')->each(function ($node) {
    return [
        'text' => $node->text(),
        'href' => $node->attr('href'),
    ];
});

// Extract list items
$listItems = $crawler->filter('ul li')->each(function ($node) {
    return trim($node->text());
});
Advanced Multiple Element Processing
// Extract table data with structure
$tableRows = $crawler->filter('table tbody tr')->each(function ($row) {
    return $row->filter('td')->each(function ($cell) {
        return $cell->text();
    });
});

// Extract cards with multiple data points
$productCards = $crawler->filter('.product-card')->each(function ($card) {
    return [
        'name' => $card->filter('.product-name')->text(),
        'price' => $card->filter('.price')->text(),
        'description' => $card->filter('.description')->text(),
    ];
});
Attribute Extraction
// Extract attributes along with text
$imageInfo = $crawler->filter('img')->each(function ($img) {
    return [
        'alt' => $img->attr('alt'),
        'src' => $img->attr('src'),
        'title' => $img->attr('title'),
    ];
});

// Extract form data
$formFields = $crawler->filter('input')->each(function ($input) {
    return [
        'name' => $input->attr('name'),
        'value' => $input->attr('value'),
        'type' => $input->attr('type'),
    ];
});
Error Handling and Safety
public function testSafeTextExtraction(): void
{
    $client = static::createPantherClient();
    $crawler = $client->request('GET', 'https://example.com');

    // Check that the element exists before extracting
    $titleFilter = $crawler->filter('h1');
    $title = $titleFilter->count() > 0 ? $titleFilter->text() : 'No title found';

    // each() only visits nodes that actually matched, so no extra
    // existence check is needed inside the callback
    $descriptions = $crawler->filter('.description')->each(function ($node) {
        return trim($node->text());
    });

    // Filter out empty results
    $descriptions = array_filter($descriptions, function ($desc) {
        return $desc !== '';
    });
}
Advanced Filtering Patterns
Combining Selectors
// Descendant selectors
$articleText = $crawler->filter('article p')->text();
// Child selectors
$directChildren = $crawler->filter('div > p')->text();
// Sibling selectors
$nextElement = $crawler->filter('h2 + p')->text();
// Attribute selectors
$externalLinks = $crawler->filter('a[target="_blank"]')->each(function ($node) {
    return $node->text();
});
Complex XPath Queries
// Text contains
$specificText = $crawler->filterXPath('//p[contains(text(), "specific phrase")]')->text();
// Multiple conditions
$complexSelector = $crawler->filterXPath('//div[@class="content" and @data-type="article"]//p')->text();
// Position-based selection
$secondParagraph = $crawler->filterXPath('//p[position()=2]')->text();
// Parent-child relationships: select the parent of a matching child
$parentTitle = $crawler->filterXPath('//li[contains(@class, "active")]/..')->attr('title');
Performance Tips
// Reuse crawler for multiple extractions
$crawler = $client->request('GET', 'https://example.com');
// Extract multiple pieces of data efficiently
$pageData = [
    'title' => $crawler->filter('title')->text(),
    'headings' => $crawler->filter('h1, h2, h3')->each(function ($node) {
        return $node->text();
    }),
    'links' => $crawler->filter('a[href]')->each(function ($node) {
        return [
            'text' => $node->text(),
            'url' => $node->attr('href'),
        ];
    }),
];
Best Practices
- Always check element existence before extracting text to avoid exceptions
- Use specific selectors to avoid extracting unwanted content
- Trim whitespace from extracted text for cleaner results
- Handle empty results gracefully in your application logic
- Combine CSS and XPath based on the complexity of your selection needs
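A short sketch combining the practices above (the URL and selectors are placeholders):
$client = static::createPantherClient();
$crawler = $client->request('GET', 'https://example.com');

// Specific selector, existence check, trimmed text, and a graceful fallback
$priceNode = $crawler->filter('.product-card .price');
$price = $priceNode->count() > 0 ? trim($priceNode->text()) : null;

// Bulk extraction with empty results dropped
$tags = array_filter(
    $crawler->filter('.tag')->each(function ($node) {
        return trim($node->text());
    }),
    function ($tag) {
        return $tag !== '';
    }
);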
Common Pitfalls
- Empty results: Always verify elements exist before calling text()
- Whitespace: Use trim() to clean extracted text
- First vs All: text() returns only the text of the first match; use each() for all matches
- Dynamic content: Ensure JavaScript has loaded before extracting text (see the sketch after this list)
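For dynamic pages, Panther's client can wait for an element to appear before you extract from it. A minimal sketch, assuming the page renders its results into a hypothetical .results list:
$client = static::createPantherClient();
$client->request('GET', 'https://example.com/search');

// Wait until the element is present in the DOM (default timeout applies)
$client->waitFor('.results');

$results = $client->getCrawler()->filter('.results li')->each(function ($node) {
    return trim($node->text());
});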
Remember to respect website terms of service and implement appropriate delays between requests when scraping multiple pages.
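For example, a simple fixed delay between page requests might look like this (the URL list is a placeholder):
$client = static::createPantherClient();

$titles = [];
foreach (['https://example.com/page/1', 'https://example.com/page/2'] as $url) {
    $crawler = $client->request('GET', $url);
    $titleNode = $crawler->filter('h1');
    $titles[] = $titleNode->count() > 0 ? trim($titleNode->text()) : null;

    // Pause between requests to avoid overloading the server
    sleep(2);
}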