What is the syntax for filtering and extracting text from HTML elements using Symfony Panther?

Symfony Panther implements the Symfony DomCrawler API on top of a real browser. You select elements with CSS selectors through filter() or with XPath expressions through filterXPath(), then read their content with text() and attr(). This guide covers the main methods and patterns for text extraction.

Installation

Install Symfony Panther via Composer:

composer require symfony/panther

Basic Text Extraction Syntax

CSS Selectors

use Symfony\Component\Panther\PantherTestCase;

class TextExtractionExample extends PantherTestCase
{
    public function extractTextExample()
    {
        $client = static::createPantherClient();
        $crawler = $client->request('GET', 'https://example.com');

        // Extract text from single element
        $title = $crawler->filter('h1')->text();

        // Extract text from element with class
        $content = $crawler->filter('.content')->text();

        // Extract text from element with ID
        $header = $crawler->filter('#header')->text();

        // Extract text from nested elements
        $menuItem = $crawler->filter('nav ul li a')->text();
    }
}
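
The class above runs inside PHPUnit. For a plain scraping script, a standalone client can be created instead. The following is a minimal sketch, assuming Chrome and a matching ChromeDriver are installed and Composer's autoloader has already been loaded; fetchFirstHeading() is a hypothetical helper name used for illustration, not part of Panther.

```php
<?php

use Symfony\Component\Panther\Client;

/**
 * Open $url in Chrome and return the text of the first <h1>.
 * fetchFirstHeading() is a hypothetical helper, not a Panther function.
 */
function fetchFirstHeading(string $url): string
{
    // Starts ChromeDriver and a Chrome instance controlled by Panther
    $client = Client::createChromeClient();

    try {
        $crawler = $client->request('GET', $url);

        // text() reads the first matching element
        return $crawler->filter('h1')->text();
    } finally {
        // Always shut the browser down, even if extraction throws
        $client->quit();
    }
}
```

After requiring vendor/autoload.php, it could be called as fetchFirstHeading('https://example.com').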

XPath Expressions

// XPath for more complex selections (text() returns the first match)
$titleText = $crawler->filterXPath('//h1[@class="main-title"]')->text();
$linkText = $crawler->filterXPath('//a[contains(@href, "contact")]')->text();
// td elements that are the second cell of their row
$tableData = $crawler->filterXPath('//table//td[position()=2]')->text();

Multiple Elements Extraction

Extract All Matching Elements

// Get text from all matching elements
$allHeadings = $crawler->filter('h2')->each(function ($node) {
    return $node->text();
});

// Extract links and their text
$allLinks = $crawler->filter('a')->each(function ($node) {
    return [
        'text' => $node->text(),
        'href' => $node->attr('href')
    ];
});

// Extract list items
$listItems = $crawler->filter('ul li')->each(function ($node) {
    return trim($node->text());
});

Advanced Multiple Element Processing

// Extract table data with structure
$tableRows = $crawler->filter('table tbody tr')->each(function ($row) {
    $cells = $row->filter('td')->each(function ($cell) {
        return $cell->text();
    });
    return $cells;
});

// Extract cards with multiple data points
$productCards = $crawler->filter('.product-card')->each(function ($card) {
    return [
        'name' => $card->filter('.product-name')->text(),
        'price' => $card->filter('.price')->text(),
        'description' => $card->filter('.description')->text()
    ];
});

Attribute Extraction

// Extract attributes along with text
$imageInfo = $crawler->filter('img')->each(function ($img) {
    return [
        'alt' => $img->attr('alt'),
        'src' => $img->attr('src'),
        'title' => $img->attr('title')
    ];
});

// Extract form data
$formFields = $crawler->filter('input')->each(function ($input) {
    return [
        'name' => $input->attr('name'),
        'value' => $input->attr('value'),
        'type' => $input->attr('type')
    ];
});

Error Handling and Safety

public function safeTextExtraction()
{
    $client = static::createPantherClient();
    $crawler = $client->request('GET', 'https://example.com');

    // Check if element exists before extracting
    $titleFilter = $crawler->filter('h1');
    $title = $titleFilter->count() > 0 ? $titleFilter->text() : 'No title found';

    // Handle multiple elements safely: each() only visits elements that matched,
    // so no existence check is needed inside the callback
    $descriptions = $crawler->filter('.description')->each(function ($node) {
        return trim($node->text());
    });

    // Filter out empty results (a strict comparison keeps values such as "0")
    $descriptions = array_filter($descriptions, function ($desc) {
        return $desc !== '';
    });
}
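
The existence check can be wrapped in a small helper so that optional child elements do not abort a whole extraction. This is a sketch only: textOrDefault() and extractProductCards() are hypothetical names, and the .product-card selectors are assumed from the earlier example rather than taken from a real page.

```php
<?php

use Symfony\Component\DomCrawler\Crawler;

/**
 * Return the trimmed text of the first match, or $default when nothing matches.
 * textOrDefault() is a hypothetical helper, not part of Panther or DomCrawler.
 */
function textOrDefault(Crawler $crawler, string $selector, string $default = ''): string
{
    $nodes = $crawler->filter($selector);

    // count() is 0 for an empty match, in which case text() would throw
    return $nodes->count() > 0 ? trim($nodes->text()) : $default;
}

/**
 * Extract product cards without failing when a card lacks one of its fields.
 * The selectors are assumptions for illustration.
 */
function extractProductCards(Crawler $crawler): array
{
    return $crawler->filter('.product-card')->each(function (Crawler $card): array {
        return [
            'name' => textOrDefault($card, '.product-name', 'Unknown'),
            'price' => textOrDefault($card, '.price'),
            'description' => textOrDefault($card, '.description'),
        ];
    });
}
```

Panther's crawler extends the DomCrawler class, so the Crawler type hint accepts the object returned by $client->request().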

Advanced Filtering Patterns

Combining Selectors

// Descendant selectors
$articleText = $crawler->filter('article p')->text();

// Child selectors
$directChildren = $crawler->filter('div > p')->text();

// Sibling selectors
$nextElement = $crawler->filter('h2 + p')->text();

// Attribute selectors
$externalLinks = $crawler->filter('a[target="_blank"]')->each(function ($node) {
    return $node->text();
});

Complex XPath Queries

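// Text contains (contains(text(), ...) checks only the first text node of each p;
// use contains(., "...") to also match text inside child elements)
$specificText = $crawler->filterXPath('//p[contains(text(), "specific phrase")]')->text();

// Multiple conditions
$complexSelector = $crawler->filterXPath('//div[@class="content" and @data-type="article"]//p')->text();

// Position-based selection: the second p in the whole document.
// '//p[position()=2]' would instead match every p that is the second p child of its parent.
$secondParagraph = $crawler->filterXPath('(//p)[2]')->text();

// Parent-child relationships: read the title attribute of the active item's parent.
// The XPath has to select an element, so the attribute is read with attr().
$parentTitle = $crawler->filterXPath('//li[contains(@class, "active")]/..')->attr('title');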

Performance Tips

// Reuse crawler for multiple extractions
$crawler = $client->request('GET', 'https://example.com');

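// Extract multiple pieces of data efficiently
$pageData = [
    // The <title> element is not rendered, so text() on it may come back empty
    // in a real browser; the client's getTitle() reads the page title directly
    'title' => $client->getTitle(),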
    'headings' => $crawler->filter('h1, h2, h3')->each(function ($node) {
        return $node->text();
    }),
    'links' => $crawler->filter('a[href]')->each(function ($node) {
        return [
            'text' => $node->text(),
            'url' => $node->attr('href')
        ];
    })
];

Best Practices

  1. Check that an element exists (count() > 0) before extracting text; calling text() on an empty match throws an exception
  2. Use specific selectors to avoid extracting unwanted content
  3. Trim whitespace from extracted text for cleaner results
  4. Handle empty results gracefully in your application logic
  5. Combine CSS and XPath based on the complexity of your selection needs
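
Practices 3 and 4 can be combined in two small pure-PHP helpers. This is a sketch under the assumption that extracted strings are UTF-8; normalizeText() and cleanTextList() are hypothetical names, not Panther functions.

```php
<?php

/**
 * Collapse whitespace runs (spaces, tabs, newlines, non-breaking spaces)
 * into single spaces and strip both ends.
 */
function normalizeText(string $text): string
{
    // preg_replace() returns null on invalid UTF-8, so fall back to the input
    $collapsed = preg_replace('/[\s\x{00A0}]+/u', ' ', $text) ?? $text;

    return trim($collapsed);
}

/**
 * Normalize every string and drop the ones that end up empty.
 * The strict comparison keeps values such as "0", which empty() would discard.
 */
function cleanTextList(array $texts): array
{
    $normalized = array_map('normalizeText', $texts);

    return array_values(array_filter($normalized, function (string $text): bool {
        return $text !== '';
    }));
}
```

The array returned by each() can be passed straight to cleanTextList(), for example the result of $crawler->filter('ul li')->each(...) from the earlier list example.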

Common Pitfalls

  • Empty results: Verify that elements exist (count() > 0) before calling text()
  • Whitespace: Use trim() to clean extracted text
  • First vs All: text() returns only the first match; use each() for all matches
  • Dynamic content: Make sure JavaScript-rendered elements are present before extracting text, for example by waiting for them with the client's waitFor() method
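
For the dynamic-content pitfall, Panther's client provides waitFor(), which blocks until an element matching the selector is present in the DOM. The sketch below assumes you know a selector for the JavaScript-rendered elements; extractAfterRender() is a hypothetical helper name and the 10-second timeout is an arbitrary choice.

```php
<?php

use Symfony\Component\Panther\Client;

/**
 * Load $url, wait until $selector appears, then return the trimmed text
 * of every matching element. extractAfterRender() is a hypothetical helper.
 */
function extractAfterRender(Client $client, string $url, string $selector, int $timeoutSeconds = 10): array
{
    $client->request('GET', $url);

    // Polls the page until the selector matches; a timeout exception is
    // thrown if nothing appears within $timeoutSeconds
    $crawler = $client->waitFor($selector, $timeoutSeconds);

    return $crawler->filter($selector)->each(function ($node) {
        return trim($node->text());
    });
}
```

A call such as extractAfterRender($client, 'https://example.com', '.item') would then only read elements once they have been inserted into the page.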

Remember to respect website terms of service and implement appropriate delays between requests when scraping multiple pages.
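
A delay between page loads can be as simple as a pause inside the loop. The sketch below is one possible shape, not a recommended rate: scrapeHeadings() is a hypothetical helper, and the one-second default is an assumption that should be adjusted to the target site's policies.

```php
<?php

use Symfony\Component\Panther\Client;

/**
 * Visit each URL in turn, pausing $delayMs milliseconds between requests,
 * and collect the first <h1> of every page (null when a page has none).
 * scrapeHeadings() is a hypothetical helper, not part of Panther.
 */
function scrapeHeadings(Client $client, array $urls, int $delayMs = 1000): array
{
    $headings = [];
    $isFirst = true;

    foreach ($urls as $url) {
        // Pause before every request except the first one
        if (!$isFirst) {
            usleep($delayMs * 1000);
        }
        $isFirst = false;

        $h1 = $client->request('GET', $url)->filter('h1');
        $headings[$url] = $h1->count() > 0 ? trim($h1->text()) : null;
    }

    return $headings;
}
```

Reusing one client for all URLs keeps a single browser session open instead of starting a new one per page.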
