How do I extract meta tags from a webpage using Simple HTML DOM?
Extracting meta tags from webpages is a fundamental task in web scraping, especially for SEO analysis, content aggregation, and social media preview generation. Simple HTML DOM Parser provides an efficient way to parse HTML documents and extract meta information in PHP. This guide will show you how to extract various types of meta tags including standard SEO tags, Open Graph tags, and Twitter Card metadata.
Understanding Meta Tags
Meta tags are HTML elements that provide metadata about a webpage. They're placed within the <head> section of an HTML document and contain information that helps search engines, social media platforms, and other applications understand the page content.
Common meta tags and related head elements include:
- Title tag: <title>Page Title</title>
- Description: <meta name="description" content="Page description">
- Keywords: <meta name="keywords" content="keyword1, keyword2">
- Open Graph tags: <meta property="og:title" content="Title">
- Twitter Cards: <meta name="twitter:card" content="summary">
- Viewport: <meta name="viewport" content="width=device-width">
Installing Simple HTML DOM Parser
First, you need to install the Simple HTML DOM Parser library. You can download it from the official source or use Composer:
composer require sunra/php-simple-html-dom-parser
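Alternatively, you can download the simple_html_dom.php file directly and include it in your project. The examples in this guide assume this second approach. The Composer package instead exposes the parser through a namespaced wrapper class (Sunra\PhpSimple\HtmlDomParser), so with it you would load vendor/autoload.php and call file_get_html() and str_get_html() as static methods on that class.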
Basic Setup and HTML Loading
Here's how to set up Simple HTML DOM Parser and load HTML content:
<?php
require_once 'simple_html_dom.php';
// Load HTML from URL
$html = file_get_html('https://example.com');
// Or load HTML from string
$html_string = '<html><head><title>Sample Page</title></head></html>';
$html = str_get_html($html_string);
// Or load HTML from file
$html = file_get_html('local_file.html');
?>
Extracting Basic Meta Tags
Title Tag Extraction
The title tag is one of the most important SEO elements:
<?php
$html = file_get_html('https://example.com');
// file_get_html() returns false when the page cannot be loaded
if (!$html) {
    exit("Failed to load page\n");
}
// Extract title tag
$title_element = $html->find('title', 0);
$title = $title_element ? $title_element->plaintext : 'No title found';
echo "Title: " . $title . "\n";
// Clean up
$html->clear();
?>
Meta Description and Keywords
Extract standard meta tags using the name attribute:
<?php
$html = file_get_html('https://example.com');
// Stop early if the page could not be loaded
if (!$html) {
    exit("Failed to load page\n");
}
// Extract meta description
$description_element = $html->find('meta[name="description"]', 0);
$description = $description_element ? $description_element->content : 'No description found';
// Extract meta keywords
$keywords_element = $html->find('meta[name="keywords"]', 0);
$keywords = $keywords_element ? $keywords_element->content : 'No keywords found';
echo "Description: " . $description . "\n";
echo "Keywords: " . $keywords . "\n";
$html->clear();
?>
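Values read from the content attribute can carry stray whitespace, line breaks, or HTML entities, depending on how the page was authored. The following is a minimal sketch of a normalization step, not part of Simple HTML DOM itself: cleanMetaValue() is a hypothetical helper name, and the code assumes UTF-8 input and PHP 7 or later.

```php
<?php
// Hypothetical helper (not part of Simple HTML DOM): normalizes a meta value.
// Assumes the input is UTF-8 text.
function cleanMetaValue($value) {
    // Decode entities such as &amp; that may survive in attribute values
    $decoded = html_entity_decode((string) $value, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // Collapse runs of whitespace (including newlines) into single spaces
    $collapsed = preg_replace('/\s+/u', ' ', $decoded);
    // preg_replace() returns null on invalid UTF-8, so fall back to the decoded text
    return trim($collapsed ?? $decoded);
}

echo cleanMetaValue("  Tom &amp; Jerry\n  fan page ") . "\n";
```

The helper could be applied to each value before it is stored or displayed, for example cleanMetaValue($description_element->content).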
Comprehensive Meta Tag Extraction Function
Here's a complete function that extracts multiple types of meta tags:
<?php
function extractMetaTags($url) {
    $html = file_get_html($url);
    if (!$html) {
        return false;
    }
    $meta_data = array();
    // Extract title
    $title_element = $html->find('title', 0);
    $meta_data['title'] = $title_element ? trim($title_element->plaintext) : '';
    // Extract standard meta tags
    $standard_meta = array(
        'description' => 'meta[name="description"]',
        'keywords' => 'meta[name="keywords"]',
        'author' => 'meta[name="author"]',
        'robots' => 'meta[name="robots"]',
        'viewport' => 'meta[name="viewport"]'
    );
    foreach ($standard_meta as $key => $selector) {
        $element = $html->find($selector, 0);
        $meta_data[$key] = $element ? $element->content : '';
    }
    // Extract Open Graph tags
    $og_tags = array(
        'og:title' => 'meta[property="og:title"]',
        'og:description' => 'meta[property="og:description"]',
        'og:image' => 'meta[property="og:image"]',
        'og:url' => 'meta[property="og:url"]',
        'og:type' => 'meta[property="og:type"]',
        'og:site_name' => 'meta[property="og:site_name"]'
    );
    foreach ($og_tags as $key => $selector) {
        $element = $html->find($selector, 0);
        $meta_data[$key] = $element ? $element->content : '';
    }
    // Extract Twitter Card tags
    $twitter_tags = array(
        'twitter:card' => 'meta[name="twitter:card"]',
        'twitter:site' => 'meta[name="twitter:site"]',
        'twitter:creator' => 'meta[name="twitter:creator"]',
        'twitter:title' => 'meta[name="twitter:title"]',
        'twitter:description' => 'meta[name="twitter:description"]',
        'twitter:image' => 'meta[name="twitter:image"]'
    );
    foreach ($twitter_tags as $key => $selector) {
        $element = $html->find($selector, 0);
        $meta_data[$key] = $element ? $element->content : '';
    }
    // Extract canonical URL
    $canonical_element = $html->find('link[rel="canonical"]', 0);
    $meta_data['canonical'] = $canonical_element ? $canonical_element->href : '';
    // Clean up
    $html->clear();
    return $meta_data;
}
// Usage
$meta_data = extractMetaTags('https://example.com');
if ($meta_data) {
    print_r($meta_data);
} else {
    echo "Failed to extract meta tags";
}
?>
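For social media preview generation, one way to use the array returned by extractMetaTags() is to pick the first non-empty value among several candidate keys. The sketch below assumes the key names used in the function above; firstNonEmpty() and buildPreview() are hypothetical helper names, and the sample array is hand-written rather than fetched from a real page.

```php
<?php
// Hypothetical helper: returns the first non-empty value among the given keys.
function firstNonEmpty(array $meta, array $keys) {
    foreach ($keys as $key) {
        if (isset($meta[$key]) && $meta[$key] !== '') {
            return $meta[$key];
        }
    }
    return '';
}

// Hypothetical helper: builds a link preview from an extractMetaTags()-style array,
// preferring Open Graph values, then Twitter Card values, then the standard tags.
function buildPreview(array $meta) {
    return array(
        'title'       => firstNonEmpty($meta, array('og:title', 'twitter:title', 'title')),
        'description' => firstNonEmpty($meta, array('og:description', 'twitter:description', 'description')),
        'image'       => firstNonEmpty($meta, array('og:image', 'twitter:image')),
    );
}

// Hand-written sample data standing in for a real extraction result
$sample = array(
    'title' => 'Sample Page',
    'description' => 'Plain description',
    'og:title' => '',
    'og:description' => 'OG description',
    'twitter:image' => 'https://example.com/card.png',
);
print_r(buildPreview($sample));
```

The fallback order here is only one reasonable choice; a different priority may suit other platforms better.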
Advanced Meta Tag Extraction Techniques
Extracting All Meta Tags Dynamically
Sometimes you want to extract all meta tags without knowing their names in advance:
<?php
function extractAllMetaTags($url) {
    $html = file_get_html($url);
    if (!$html) {
        return false;
    }
    $all_meta = array();
    // Find all meta tags
    $meta_tags = $html->find('meta');
    foreach ($meta_tags as $meta) {
        $key = '';
        $value = $meta->content;
        if ($meta->name) {
            // Check for name attribute
            $key = 'name:' . $meta->name;
        } elseif ($meta->property) {
            // Check for property attribute (Open Graph)
            $key = 'property:' . $meta->property;
        } elseif ($meta->getAttribute('http-equiv')) {
            // Check for http-equiv attribute
            $key = 'http-equiv:' . $meta->getAttribute('http-equiv');
        } elseif ($meta->charset) {
            // Check for charset
            $key = 'charset';
            $value = $meta->charset;
        }
        if ($key && $value) {
            $all_meta[$key] = $value;
        }
    }
    $html->clear();
    return $all_meta;
}
?>
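Because extractAllMetaTags() prefixes each key with its attribute type, the result can be narrowed to one family of tags afterwards. The following sketch assumes that key format ('name:...', 'property:...'); filterMetaByPrefix() is a hypothetical helper name, and the sample array is hand-written.

```php
<?php
// Hypothetical helper: keeps only entries whose key starts with $prefix
// and strips that prefix from the returned keys.
function filterMetaByPrefix(array $all_meta, $prefix) {
    $filtered = array();
    $prefix_length = strlen($prefix);
    foreach ($all_meta as $key => $value) {
        $key = (string) $key;
        // Require something after the prefix so the new key is never empty
        if (strlen($key) > $prefix_length && strpos($key, $prefix) === 0) {
            $filtered[substr($key, $prefix_length)] = $value;
        }
    }
    return $filtered;
}

// Hand-written sample in the key format produced by extractAllMetaTags()
$all_meta = array(
    'name:description' => 'Page description',
    'property:og:title' => 'Title',
    'property:og:type' => 'website',
    'charset' => 'utf-8',
);
print_r(filterMetaByPrefix($all_meta, 'property:og:'));
```

With the sample above, only the two Open Graph entries remain, keyed as 'title' and 'type'.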
Handling Multiple Meta Tags with Same Name
Some pages may have multiple meta tags with the same name. Here's how to handle them:
<?php
function extractMultipleMeta($url, $meta_name) {
    $html = file_get_html($url);
    if (!$html) {
        return false;
    }
    $meta_values = array();
    $meta_tags = $html->find('meta[name="' . $meta_name . '"]');
    foreach ($meta_tags as $meta) {
        if ($meta->content) {
            $meta_values[] = $meta->content;
        }
    }
    $html->clear();
    return $meta_values;
}
// Extract all keyword meta tags
$keywords = extractMultipleMeta('https://example.com', 'keywords');
?>
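When several keywords tags are found, each content value is itself a comma-separated list, so the values usually need to be split and merged. This is a minimal sketch of that step, assuming the array shape returned by extractMultipleMeta(); mergeKeywordValues() is a hypothetical helper name.

```php
<?php
// Hypothetical helper: merges several comma-separated content values
// into one de-duplicated keyword list.
// Comparison is case-insensitive for ASCII letters only (strtolower).
function mergeKeywordValues(array $meta_values) {
    $seen = array();
    $keywords = array();
    foreach ($meta_values as $value) {
        foreach (explode(',', $value) as $keyword) {
            $keyword = trim($keyword);
            $lookup = strtolower($keyword);
            // Skip empty fragments and keywords already collected
            if ($keyword === '' || isset($seen[$lookup])) {
                continue;
            }
            $seen[$lookup] = true;
            $keywords[] = $keyword;
        }
    }
    return $keywords;
}

print_r(mergeKeywordValues(array('php, scraping', 'Scraping, seo, ')));
```

The first spelling of each keyword is kept, so 'scraping' wins over 'Scraping' in the sample.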
Error Handling and Best Practices
Robust Error Handling
Always implement proper error handling when extracting meta tags:
<?php
function safeExtractMetaTags($url, $timeout = 30) {
    try {
        // Set user agent to avoid blocking
        $context = stream_context_create([
            'http' => [
                'timeout' => $timeout,
                'user_agent' => 'Mozilla/5.0 (compatible; MetaExtractor/1.0)'
            ]
        ]);
        $html = file_get_html($url, false, $context);
        if (!$html) {
            throw new Exception("Failed to load HTML from URL: " . $url);
        }
        $meta_data = array();
        // Safe title extraction
        $title_element = $html->find('title', 0);
        $meta_data['title'] = $title_element ? trim($title_element->plaintext) : '';
        // Safe meta description extraction
        $desc_element = $html->find('meta[name="description"]', 0);
        $meta_data['description'] = $desc_element ? trim($desc_element->content) : '';
        $html->clear();
        return $meta_data;
    } catch (Exception $e) {
        error_log("Meta extraction error: " . $e->getMessage());
        return false;
    }
}
?>
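Since file_get_html() also accepts local file paths, a function that receives URLs from user input may want to reject anything that is not an absolute http or https URL before fetching. A minimal sketch of such a guard, using only built-in PHP functions; isFetchableUrl() is a hypothetical helper name.

```php
<?php
// Hypothetical guard: accepts only absolute http(s) URLs.
function isFetchableUrl($url) {
    if (!is_string($url) || filter_var($url, FILTER_VALIDATE_URL) === false) {
        return false;
    }
    // Reject schemes such as file:// or ftp:// even if they are syntactically valid
    $scheme = strtolower((string) parse_url($url, PHP_URL_SCHEME));
    return $scheme === 'http' || $scheme === 'https';
}

var_dump(isFetchableUrl('https://example.com'));  // bool(true)
var_dump(isFetchableUrl('local_file.html'));      // bool(false)
var_dump(isFetchableUrl('file:///etc/passwd'));   // bool(false)
```

A caller could run this check at the top of safeExtractMetaTags() and return false early when it fails.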
Batch Processing and Rate Limiting
When processing multiple URLs, pause between requests so that you don't overwhelm the target server:
<?php
function batchExtractMeta($urls) {
    $results = array();
    foreach ($urls as $url) {
        $meta_data = extractMetaTags($url);
        $results[$url] = $meta_data;
        // Add delay to avoid overwhelming the server
        usleep(500000); // 0.5 second delay
    }
    return $results;
}
?>
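A variation on batchExtractMeta() can skip duplicate URLs and take the extraction function as a parameter, which makes the loop testable without network access. This is a sketch under those assumptions; batchExtractWith() and the stub extractor are hypothetical names introduced here for illustration.

```php
<?php
// Hypothetical variant of batchExtractMeta(): the extractor is passed in as a
// callable so the loop can be exercised without network access.
function batchExtractWith(array $urls, callable $extract, $delay_us = 500000) {
    $results = array();
    // Fetch each distinct URL only once
    $pending = array_values(array_unique($urls));
    $last = count($pending) - 1;
    foreach ($pending as $index => $url) {
        $results[$url] = $extract($url);
        // Pause between requests, but not after the final one
        if ($index < $last && $delay_us > 0) {
            usleep($delay_us);
        }
    }
    return $results;
}

// Stub extractor standing in for extractMetaTags()
$stub = function ($url) {
    return array('title' => 'Title for ' . $url);
};
$urls = array('https://example.com/a', 'https://example.com/a', 'https://example.com/b');
print_r(batchExtractWith($urls, $stub, 0));
```

In real use, the second argument would be 'extractMetaTags' and the delay would stay at its default or higher.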
Integration with Modern Web Scraping Workflows
Simple HTML DOM works well for static HTML, but it does not execute JavaScript. For sites that generate meta tags dynamically in the browser, you may need a headless browser tool such as Puppeteer to render the page first and then parse the resulting HTML. Robust error handling becomes even more important when dealing with complex web applications of this kind.
Complete Example: SEO Meta Analyzer
Here's a complete example that creates an SEO meta analyzer:
<?php
class SEOMetaAnalyzer {
    private $required_tags = array('title', 'description');
    private $recommended_tags = array('keywords', 'og:title', 'og:description');
    public function analyze($url) {
        $meta_data = $this->extractMetaTags($url);
        if (!$meta_data) {
            return array('error' => 'Failed to extract meta tags');
        }
        $analysis = array(
            'url' => $url,
            'meta_tags' => $meta_data,
            'seo_score' => $this->calculateSEOScore($meta_data),
            'recommendations' => $this->getRecommendations($meta_data)
        );
        return $analysis;
    }
    private function extractMetaTags($url) {
        // Use the comprehensive extraction function from above
        return extractMetaTags($url);
    }
    private function calculateSEOScore($meta_data) {
        $score = 0;
        $max_score = 100;
        // Lengths use mb_strlen() so multibyte characters count once each
        // (requires the mbstring extension; assumes UTF-8 content)
        // Title check (30 points)
        if (!empty($meta_data['title'])) {
            $title_length = mb_strlen($meta_data['title'], 'UTF-8');
            if ($title_length >= 30 && $title_length <= 60) {
                $score += 30;
            } elseif ($title_length > 0) {
                $score += 15;
            }
        }
        // Description check (30 points)
        if (!empty($meta_data['description'])) {
            $desc_length = mb_strlen($meta_data['description'], 'UTF-8');
            if ($desc_length >= 120 && $desc_length <= 160) {
                $score += 30;
            } elseif ($desc_length > 0) {
                $score += 15;
            }
        }
        // Open Graph tags (20 points)
        $og_tags = array('og:title', 'og:description', 'og:image');
        $og_count = 0;
        foreach ($og_tags as $tag) {
            if (!empty($meta_data[$tag])) {
                $og_count++;
            }
        }
        $score += ($og_count / count($og_tags)) * 20;
        // Other important tags (20 points)
        $other_tags = array('canonical', 'robots', 'viewport');
        $other_count = 0;
        foreach ($other_tags as $tag) {
            if (!empty($meta_data[$tag])) {
                $other_count++;
            }
        }
        $score += ($other_count / count($other_tags)) * 20;
        return round($score, 2);
    }
    private function getRecommendations($meta_data) {
        $recommendations = array();
        // mb_strlen() counts characters, not bytes (requires the mbstring extension)
        if (empty($meta_data['title'])) {
            $recommendations[] = "Add a title tag";
        } elseif (mb_strlen($meta_data['title'], 'UTF-8') > 60) {
            $recommendations[] = "Title tag is too long (over 60 characters)";
        }
        if (empty($meta_data['description'])) {
            $recommendations[] = "Add a meta description";
        } elseif (mb_strlen($meta_data['description'], 'UTF-8') > 160) {
            $recommendations[] = "Meta description is too long (over 160 characters)";
        }
        if (empty($meta_data['og:title'])) {
            $recommendations[] = "Add Open Graph title for better social media sharing";
        }
        if (empty($meta_data['canonical'])) {
            $recommendations[] = "Add canonical URL to avoid duplicate content issues";
        }
        return $recommendations;
    }
}
// Usage
$analyzer = new SEOMetaAnalyzer();
$analysis = $analyzer->analyze('https://example.com');
print_r($analysis);
?>
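The scoring weights can be checked by hand on a small sample. The sketch below is a standalone approximation of the weights in calculateSEOScore(), not the class method itself: scoreMeta(), lengthPoints(), and presentShare() are hypothetical names, and the sample values are made up so that each rule is exercised once.

```php
<?php
// Standalone sketch of the scoring weights; all function names are hypothetical.
// Lengths are counted with strlen() here, which equals the character count
// for the ASCII-only sample below.
function lengthPoints($text, $min, $max) {
    $length = strlen((string) $text);
    if ($length === 0) {
        return 0;
    }
    // Full 30 points inside the recommended range, 15 otherwise
    return ($length >= $min && $length <= $max) ? 30 : 15;
}

// Fraction of the listed tags that have a non-empty value
function presentShare(array $meta, array $tags) {
    $count = 0;
    foreach ($tags as $tag) {
        if (!empty($meta[$tag])) {
            $count++;
        }
    }
    return $count / count($tags);
}

function scoreMeta(array $meta) {
    $score = lengthPoints(isset($meta['title']) ? $meta['title'] : '', 30, 60);
    $score += lengthPoints(isset($meta['description']) ? $meta['description'] : '', 120, 160);
    $score += presentShare($meta, array('og:title', 'og:description', 'og:image')) * 20;
    $score += presentShare($meta, array('canonical', 'robots', 'viewport')) * 20;
    return round($score, 2);
}

$sample = array(
    'title' => str_repeat('t', 45),        // within 30-60: 30 points
    'description' => str_repeat('d', 90),  // outside 120-160: 15 points
    'og:title' => 'x',
    'og:description' => 'y',
    'og:image' => '',                      // 2 of 3 present: 13.33 points
    'canonical' => 'https://example.com/',
    'robots' => 'index, follow',
    'viewport' => 'width=device-width',    // 3 of 3 present: 20 points
);
echo scoreMeta($sample) . "\n"; // 30 + 15 + 13.33 + 20 = 78.33
```

Working through a sample like this is a quick way to confirm that a change to the weights behaves as intended before running the analyzer against live pages.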
Conclusion
Simple HTML DOM Parser provides a straightforward and efficient way to extract meta tags from webpages in PHP. Whether you're building an SEO analyzer, content aggregator, or social media preview generator, the techniques shown in this guide will help you reliably extract meta information from HTML documents.
Remember to implement proper error handling, respect website rate limits, and consider using more advanced tools like Puppeteer when dealing with JavaScript-heavy sites that dynamically generate meta tags. With these fundamentals, you'll be well-equipped to build robust meta tag extraction systems for your web scraping projects.