How do I extract meta tags and page metadata using Symfony Panther?
Symfony Panther is a PHP library that drives a real browser (Chrome/Chromium or Firefox) while exposing the Symfony DomCrawler API, so it can read meta tags and other page metadata even when they are injected by JavaScript. This guide shows how to extract the most common types of metadata with it.
What is Symfony Panther?
Symfony Panther is a browser testing and web scraping library built on top of the php-webdriver library (originally developed at Facebook), which speaks the W3C WebDriver protocol. It implements the same BrowserKit and DomCrawler APIs that Symfony uses for static HTML parsing, but executes them in a real browser, so the same code also works for applications that rely on client-side rendering.
Installation and Setup
First, install Symfony Panther via Composer:
composer require symfony/panther
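You'll also need Chrome or Chromium installed on your system, plus a matching ChromeDriver binary. Panther looks for ChromeDriver in the project's drivers/ directory and on your PATH; the Panther documentation suggests installing it with the dbrekelmans/bdi package (composer require --dev dbrekelmans/bdi, then vendor/bin/bdi detect drivers).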
Basic Meta Tag Extraction
Here's how to extract common meta tags from a webpage:
<?php
use Symfony\Component\Panther\Client;
// Create a new Panther client
$client = Client::createChromeClient();
// Navigate to the target page
$crawler = $client->request('GET', 'https://example.com');
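// Extract the page title. WebDriver's text() only returns rendered text and <title> is never
// rendered, so ask the browser for the title instead of reading it through the crawler.
$title = $client->getTitle();
echo "Page Title: " . $title . "\n";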
// Extract meta description
$metaDescription = $crawler->filter('meta[name="description"]')->attr('content');
echo "Meta Description: " . $metaDescription . "\n";
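// Extract meta keywords (often absent; attr() throws on an empty node list, so check first)
$keywordsNode = $crawler->filter('meta[name="keywords"]');
$metaKeywords = $keywordsNode->count() > 0 ? $keywordsNode->attr('content') : '';
echo "Meta Keywords: " . $metaKeywords . "\n";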
// Extract charset
$charset = $crawler->filter('meta[charset]')->attr('charset');
echo "Charset: " . $charset . "\n";
// Close the client
$client->quit();
Extracting Open Graph Meta Tags
Open Graph meta tags are essential for social media sharing. Here's how to extract them:
<?php
use Symfony\Component\Panther\Client;
$client = Client::createChromeClient();
$crawler = $client->request('GET', 'https://example.com');
// Extract Open Graph meta tags
$ogTags = [];
// Common Open Graph tags
$ogSelectors = [
'og:title' => 'meta[property="og:title"]',
'og:description' => 'meta[property="og:description"]',
'og:image' => 'meta[property="og:image"]',
'og:url' => 'meta[property="og:url"]',
'og:type' => 'meta[property="og:type"]',
'og:site_name' => 'meta[property="og:site_name"]'
];
foreach ($ogSelectors as $property => $selector) {
$element = $crawler->filter($selector);
if ($element->count() > 0) {
$ogTags[$property] = $element->attr('content');
}
}
print_r($ogTags);
$client->quit();
Extracting Twitter Card Meta Tags
Twitter Cards provide rich media attachments for tweets. Extract them like this:
<?php
use Symfony\Component\Panther\Client;
$client = Client::createChromeClient();
$crawler = $client->request('GET', 'https://example.com');
// Extract Twitter Card meta tags
$twitterTags = [];
$twitterSelectors = [
'twitter:card' => 'meta[name="twitter:card"]',
'twitter:site' => 'meta[name="twitter:site"]',
'twitter:creator' => 'meta[name="twitter:creator"]',
'twitter:title' => 'meta[name="twitter:title"]',
'twitter:description' => 'meta[name="twitter:description"]',
'twitter:image' => 'meta[name="twitter:image"]'
];
foreach ($twitterSelectors as $property => $selector) {
$element = $crawler->filter($selector);
if ($element->count() > 0) {
$twitterTags[$property] = $element->attr('content');
}
}
print_r($twitterTags);
$client->quit();
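The Open Graph and Twitter Card examples above list every tag by hand. A prefix match can collect whatever tags a page happens to define. The sketch below illustrates the idea on a static HTML string using PHP's DOM extension rather than Panther, so it runs without a browser; collectMetaByPrefix is a hypothetical helper name and the sample markup is made up. With Panther, a CSS attribute-prefix selector such as meta[property^="og:"] should express the same idea.

```php
<?php
// Sketch: collect every <meta> tag whose key attribute starts with a prefix.
// Uses PHP's DOM extension on a static HTML string, not Panther.

function collectMetaByPrefix(string $html, string $attribute, string $prefix): array
{
    $dom = new DOMDocument();
    // Suppress warnings triggered by imperfect real-world markup
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $tags = [];
    foreach ($dom->getElementsByTagName('meta') as $meta) {
        $key = $meta->getAttribute($attribute);
        // Keep only keys such as "og:title" or "twitter:card"
        if ($key !== '' && strncmp($key, $prefix, strlen($prefix)) === 0) {
            $tags[$key] = $meta->getAttribute('content');
        }
    }

    return $tags;
}

// Made-up sample markup
$html = '<html><head>'
    . '<meta property="og:title" content="Example">'
    . '<meta property="og:type" content="website">'
    . '<meta name="twitter:card" content="summary">'
    . '</head><body></body></html>';

$og = collectMetaByPrefix($html, 'property', 'og:');
$twitter = collectMetaByPrefix($html, 'name', 'twitter:');

print_r($og);
print_r($twitter);
```

The trade-off is that a prefix match returns whatever the page defines, including non-standard tags, while an explicit list keeps the result shape predictable.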
Comprehensive Metadata Extraction Class
Here's a complete class that extracts all types of metadata:
<?php
use Symfony\Component\Panther\Client;
use Symfony\Component\DomCrawler\Crawler;
class MetadataExtractor
{
private $client;
public function __construct()
{
$this->client = Client::createChromeClient();
}
public function extractMetadata(string $url): array
{
$crawler = $this->client->request('GET', $url);
// Wait for page to fully load
$this->client->waitFor('title');
return [
'basic' => $this->extractBasicMetadata($crawler),
'open_graph' => $this->extractOpenGraphTags($crawler),
'twitter' => $this->extractTwitterTags($crawler),
'seo' => $this->extractSEOTags($crawler),
'technical' => $this->extractTechnicalTags($crawler)
];
}
private function extractBasicMetadata(Crawler $crawler): array
{
return [
// <title> is never rendered, so WebDriver's text() may return an empty string; ask the browser instead
'title' => $this->client->getTitle() ?: null,
'description' => $this->getElementAttribute($crawler, 'meta[name="description"]', 'content'),
'keywords' => $this->getElementAttribute($crawler, 'meta[name="keywords"]', 'content'),
'author' => $this->getElementAttribute($crawler, 'meta[name="author"]', 'content'),
'charset' => $this->getElementAttribute($crawler, 'meta[charset]', 'charset'),
'viewport' => $this->getElementAttribute($crawler, 'meta[name="viewport"]', 'content')
];
}
private function extractOpenGraphTags(Crawler $crawler): array
{
$ogProperties = [
'title', 'description', 'image', 'url', 'type', 'site_name',
'locale', 'video', 'audio', 'determiner', 'updated_time'
];
$ogTags = [];
foreach ($ogProperties as $property) {
$value = $this->getElementAttribute(
$crawler,
'meta[property="og:' . $property . '"]',
'content'
);
if ($value) {
$ogTags['og:' . $property] = $value;
}
}
return $ogTags;
}
private function extractTwitterTags(Crawler $crawler): array
{
$twitterProperties = [
'card', 'site', 'creator', 'title', 'description',
'image', 'image:alt', 'player', 'app:name:iphone'
];
$twitterTags = [];
foreach ($twitterProperties as $property) {
$value = $this->getElementAttribute(
$crawler,
'meta[name="twitter:' . $property . '"]',
'content'
);
if ($value) {
$twitterTags['twitter:' . $property] = $value;
}
}
return $twitterTags;
}
private function extractSEOTags(Crawler $crawler): array
{
return [
'canonical' => $this->getElementAttribute($crawler, 'link[rel="canonical"]', 'href'),
'robots' => $this->getElementAttribute($crawler, 'meta[name="robots"]', 'content'),
'googlebot' => $this->getElementAttribute($crawler, 'meta[name="googlebot"]', 'content'),
'generator' => $this->getElementAttribute($crawler, 'meta[name="generator"]', 'content'),
'theme_color' => $this->getElementAttribute($crawler, 'meta[name="theme-color"]', 'content'),
'manifest' => $this->getElementAttribute($crawler, 'link[rel="manifest"]', 'href')
];
}
private function extractTechnicalTags(Crawler $crawler): array
{
return [
'content_type' => $this->getElementAttribute($crawler, 'meta[http-equiv="Content-Type"]', 'content'),
'refresh' => $this->getElementAttribute($crawler, 'meta[http-equiv="refresh"]', 'content'),
'cache_control' => $this->getElementAttribute($crawler, 'meta[http-equiv="Cache-Control"]', 'content'),
'pragma' => $this->getElementAttribute($crawler, 'meta[http-equiv="Pragma"]', 'content')
];
}
private function getElementText(Crawler $crawler, string $selector): ?string
{
$element = $crawler->filter($selector);
return $element->count() > 0 ? trim($element->text()) : null;
}
private function getElementAttribute(Crawler $crawler, string $selector, string $attribute): ?string
{
$element = $crawler->filter($selector);
return $element->count() > 0 ? $element->attr($attribute) : null;
}
public function __destruct()
{
if ($this->client) {
$this->client->quit();
}
}
}
// Usage example
$extractor = new MetadataExtractor();
$metadata = $extractor->extractMetadata('https://example.com');
print_r($metadata);
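For SEO analysis, the nested array returned by extractMetadata() is usually checked for gaps. The following is a minimal sketch of such a check; findMissingMetadata, the $required list, and the sample values are all hypothetical and only mirror the shape of the result above.

```php
<?php
// Hypothetical helper: list the fields that are absent or empty in a
// result shaped like the array returned by extractMetadata().

function findMissingMetadata(array $metadata, array $required): array
{
    $missing = [];
    foreach ($required as $group => $fields) {
        foreach ($fields as $field) {
            $value = $metadata[$group][$field] ?? null;
            // Treat null and whitespace-only strings as missing
            if ($value === null || trim($value) === '') {
                $missing[] = $group . '.' . $field;
            }
        }
    }

    return $missing;
}

// Made-up sample data in the same shape as the extractor's output
$metadata = [
    'basic' => ['title' => 'Example Domain', 'description' => null],
    'open_graph' => ['og:title' => 'Example Domain'],
    'seo' => ['canonical' => 'https://example.com/'],
];

// Fields this (assumed) audit considers mandatory
$required = [
    'basic' => ['title', 'description'],
    'open_graph' => ['og:title', 'og:image'],
    'seo' => ['canonical'],
];

$missing = findMissingMetadata($metadata, $required);
print_r($missing);
```

For the sample data, the helper reports basic.description and open_graph.og:image as missing.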
Handling Dynamic Content
For JavaScript-heavy sites where metadata is loaded dynamically, you may need to wait for specific elements or use timeouts. Similar to how you handle AJAX requests using Puppeteer, Symfony Panther provides waiting mechanisms:
<?php
use Symfony\Component\Panther\Client;
$client = Client::createChromeClient();
$crawler = $client->request('GET', 'https://spa-example.com');
// Wait up to 10 seconds for a specific meta tag to appear (the second argument is a timeout in seconds)
$crawler = $client->waitFor('meta[name="description"]', 10);
// waitFor() expects a selector, not a duration; for a fixed pause use PHP's sleep() instead
// sleep(3);
// Extract metadata from the crawler returned by waitFor()
$description = $crawler->filter('meta[name="description"]')->attr('content');
$client->quit();
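Conceptually, an explicit wait polls a condition until it becomes truthy or a timeout expires. The sketch below illustrates that idea in plain PHP with a simulated condition; waitUntil is a hypothetical helper written for this guide and is not Panther's actual implementation.

```php
<?php
// Illustrative polling loop: call $condition repeatedly until it returns a
// truthy value or the timeout expires. Not part of Panther.

function waitUntil(callable $condition, float $timeoutSeconds = 5.0, int $intervalMs = 250)
{
    $deadline = microtime(true) + $timeoutSeconds;

    do {
        $result = $condition();
        if ($result) {
            return $result;
        }
        // Pause between checks to avoid a busy loop
        usleep($intervalMs * 1000);
    } while (microtime(true) < $deadline);

    throw new RuntimeException('Condition not met within ' . $timeoutSeconds . ' seconds');
}

// Simulated condition that becomes truthy on the third check,
// standing in for "the meta tag has appeared in the DOM"
$attempts = 0;
$value = waitUntil(function () use (&$attempts) {
    $attempts++;
    return $attempts >= 3 ? 'Loaded description' : null;
}, 2.0, 10);

echo $value . ' after ' . $attempts . " checks\n";
```

A short polling interval reacts faster but issues more checks; with a real browser every check is a round trip to the driver, so very small intervals rarely pay off.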
Batch Processing Multiple URLs
When extracting metadata from multiple pages, it's efficient to reuse the same client instance:
<?php
use Symfony\Component\Panther\Client;
class BatchMetadataExtractor
{
private $client;
public function __construct()
{
$this->client = Client::createChromeClient();
}
public function extractFromUrls(array $urls): array
{
$results = [];
foreach ($urls as $url) {
try {
$crawler = $this->client->request('GET', $url);
$this->client->waitFor('title');
$results[$url] = [
'title' => $this->getTitle($crawler),
'description' => $this->getDescription($crawler),
'og_image' => $this->getOgImage($crawler)
];
// Add delay between requests to be respectful
sleep(1);
} catch (\Exception $e) {
$results[$url] = ['error' => $e->getMessage()];
}
}
return $results;
}
private function getTitle($crawler): ?string
{
// <title> is never rendered, so WebDriver's text() may return an empty string; ask the browser instead
$title = $this->client->getTitle();
return $title !== '' ? $title : null;
}
private function getDescription($crawler): ?string
{
$element = $crawler->filter('meta[name="description"]');
return $element->count() > 0 ? $element->attr('content') : null;
}
private function getOgImage($crawler): ?string
{
$element = $crawler->filter('meta[property="og:image"]');
return $element->count() > 0 ? $element->attr('content') : null;
}
public function __destruct()
{
if ($this->client) {
$this->client->quit();
}
}
}
// Usage
$extractor = new BatchMetadataExtractor();
$urls = [
'https://example1.com',
'https://example2.com',
'https://example3.com'
];
$results = $extractor->extractFromUrls($urls);
print_r($results);
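Because extractFromUrls() stores either the metadata or an error entry under each URL, a follow-up step usually separates the two, for example to retry the failures later. The following is a minimal sketch of that step; partitionResults is a hypothetical helper and the sample results are made up.

```php
<?php
// Hypothetical helper: split batch results into successful extractions and
// failures, based on the 'error' key used by extractFromUrls().

function partitionResults(array $results): array
{
    $ok = [];
    $failed = [];

    foreach ($results as $url => $data) {
        if (isset($data['error'])) {
            $failed[$url] = $data['error'];
        } else {
            $ok[$url] = $data;
        }
    }

    return ['ok' => $ok, 'failed' => $failed];
}

// Made-up results in the same shape as the batch extractor's output
$results = [
    'https://example1.com' => ['title' => 'One', 'description' => 'First page', 'og_image' => null],
    'https://example2.com' => ['error' => 'Connection timed out'],
];

$partitioned = partitionResults($results);
print_r(array_keys($partitioned['ok']));
print_r($partitioned['failed']);
```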
Error Handling and Best Practices
Always implement proper error handling when extracting metadata:
<?php
use Symfony\Component\Panther\Client;
// Panther relies on php-webdriver, which defines this exception
use Facebook\WebDriver\Exception\NoSuchElementException;
try {
$client = Client::createChromeClient();
$crawler = $client->request('GET', 'https://example.com');
// <title> is never rendered, so text() may come back empty; read the title from the browser
$title = $client->getTitle();
if ($title === '') {
$title = 'No title found';
}
echo "Title: " . $title . "\n";
} catch (NoSuchElementException $e) {
echo "Element not found: " . $e->getMessage() . "\n";
} catch (\Exception $e) {
echo "Error: " . $e->getMessage() . "\n";
} finally {
if (isset($client)) {
$client->quit();
}
}
Performance Optimization
To improve performance when extracting metadata:
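- Skip image loading if you only need metadata. Browser arguments are the second parameter of createChromeClient(), and passing them replaces Panther's default arguments, so include --headless explicitly:
$client = Client::createChromeClient(null, [
'--headless',
'--disable-gpu',
'--blink-settings=imagesEnabled=false', // Chrome switch commonly used to skip image loading
]);
// --disable-images, --disable-css, and --disable-javascript are not documented Chrome switches.
// If the metadata is already in the static HTML, a plain HTTP client such as Symfony's
// HttpBrowser avoids starting a browser at all.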
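- Set a page load timeout to avoid hanging requests:
$client = Client::createChromeClient();
$client->manage()->timeouts()->pageLoadTimeout(30); // Abort page loads that take longer than 30 seconds
// Avoid a long implicit wait here: it would delay every lookup of a meta tag that does not exist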
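- Keep headless mode enabled. Panther runs Chrome headless by default and only shows the browser window when the PANTHER_NO_HEADLESS environment variable is set, so no extra argument is needed:
$client = Client::createChromeClient();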
Conclusion
Symfony Panther can extract meta tags and page metadata from both static and dynamic web pages, and its ability to handle JavaScript-rendered content makes it particularly useful for modern web applications. For dynamic pages, reliable extraction depends on waiting for the right elements and on choosing sensible timeouts.
Remember to always implement proper error handling, respect website rate limits, and consider the performance implications of your scraping operations. With these techniques, you can effectively extract comprehensive metadata for SEO analysis, content management, or social media optimization.