How do I extract meta tags from a webpage using Simple HTML DOM?

Extracting meta tags from webpages is a fundamental task in web scraping, especially for SEO analysis, content aggregation, and social media preview generation. Simple HTML DOM Parser provides an efficient way to parse HTML documents and extract meta information in PHP. This guide will show you how to extract various types of meta tags including standard SEO tags, Open Graph tags, and Twitter Card metadata.

Understanding Meta Tags

Meta tags are HTML elements that provide metadata about a webpage. They're placed within the <head> section of an HTML document and contain information that helps search engines, social media platforms, and other applications understand the page content.

Common meta tags include:

- Title tag: <title>Page Title</title>
- Description: <meta name="description" content="Page description">
- Keywords: <meta name="keywords" content="keyword1, keyword2">
- Open Graph tags: <meta property="og:title" content="Title">
- Twitter Cards: <meta name="twitter:card" content="summary">
- Viewport: <meta name="viewport" content="width=device-width">

Installing Simple HTML DOM Parser

First, you need to install the Simple HTML DOM Parser library. You can download it from the official source or use Composer:

composer require sunra/php-simple-html-dom-parser

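Alternatively, you can download the simple_html_dom.php file directly and include it in your project. The examples in this guide assume that standalone file, which defines the global file_get_html() and str_get_html() functions. Composer packages of this library typically wrap the parser in a namespaced class rather than defining these global functions, so check the package README for the exact calls.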

Basic Setup and HTML Loading

Here's how to set up Simple HTML DOM Parser and load HTML content:

<?php
require_once 'simple_html_dom.php';

// Load HTML from URL
$html = file_get_html('https://example.com');

// Or load HTML from string
$html_string = '<html><head><title>Sample Page</title></head></html>';
$html = str_get_html($html_string);

// Or load HTML from file
$html = file_get_html('local_file.html');
?>

Extracting Basic Meta Tags

Title Tag Extraction

The title tag is one of the most important SEO elements:

<?php
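$html = file_get_html('https://example.com');

// file_get_html() returns false when the page cannot be loaded
if (!$html) {
    exit("Failed to load page\n");
}

// Extract title tag
$title_element = $html->find('title', 0);
$title = $title_element ? trim($title_element->plaintext) : 'No title found';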

echo "Title: " . $title . "\n";

// Clean up
$html->clear();
?>

Meta Description and Keywords

Extract standard meta tags using the name attribute:

<?php
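$html = file_get_html('https://example.com');

// file_get_html() returns false when the page cannot be loaded
if (!$html) {
    exit("Failed to load page\n");
}

// Extract meta description (a tag without a content attribute counts as missing)
$description_element = $html->find('meta[name="description"]', 0);
$description = ($description_element && $description_element->content)
    ? $description_element->content : 'No description found';

// Extract meta keywords
$keywords_element = $html->find('meta[name="keywords"]', 0);
$keywords = ($keywords_element && $keywords_element->content)
    ? $keywords_element->content : 'No keywords found';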

echo "Description: " . $description . "\n";
echo "Keywords: " . $keywords . "\n";

$html->clear();
?>
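
Depending on the parser version, content values may come back exactly as they appear in the markup, so they can still contain HTML entities or stray whitespace. The following is a minimal cleanup sketch, not part of Simple HTML DOM: the helper name normalizeMetaValue() is hypothetical, and it assumes the text is UTF-8.

```php
<?php
// Hypothetical helper (not part of Simple HTML DOM): tidy a raw meta value.
// Assumes UTF-8 input; non-strings (e.g. false for a missing attribute) become ''.
function normalizeMetaValue($value) {
    if (!is_string($value)) {
        return '';
    }
    // Decode entities such as &amp; and &quot; into their characters
    $decoded = html_entity_decode($value, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // Collapse runs of whitespace (including newlines) into single spaces
    $collapsed = preg_replace('/\s+/u', ' ', $decoded);
    // preg_replace() returns null on invalid UTF-8; fall back to the decoded text
    return trim($collapsed === null ? $decoded : $collapsed);
}
```

You could then wrap each extracted value, for example normalizeMetaValue($description_element->content), before displaying or storing it.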

Comprehensive Meta Tag Extraction Function

Here's a complete function that extracts multiple types of meta tags:

<?php
function extractMetaTags($url) {
    $html = file_get_html($url);

    if (!$html) {
        return false;
    }

    $meta_data = array();

    // Extract title
    $title_element = $html->find('title', 0);
    $meta_data['title'] = $title_element ? trim($title_element->plaintext) : '';

    // Extract standard meta tags
    $standard_meta = array(
        'description' => 'meta[name="description"]',
        'keywords' => 'meta[name="keywords"]',
        'author' => 'meta[name="author"]',
        'robots' => 'meta[name="robots"]',
        'viewport' => 'meta[name="viewport"]'
    );

    foreach ($standard_meta as $key => $selector) {
        $element = $html->find($selector, 0);
        $meta_data[$key] = $element ? $element->content : '';
    }

    // Extract Open Graph tags
    $og_tags = array(
        'og:title' => 'meta[property="og:title"]',
        'og:description' => 'meta[property="og:description"]',
        'og:image' => 'meta[property="og:image"]',
        'og:url' => 'meta[property="og:url"]',
        'og:type' => 'meta[property="og:type"]',
        'og:site_name' => 'meta[property="og:site_name"]'
    );

    foreach ($og_tags as $key => $selector) {
        $element = $html->find($selector, 0);
        $meta_data[$key] = $element ? $element->content : '';
    }

    // Extract Twitter Card tags
    $twitter_tags = array(
        'twitter:card' => 'meta[name="twitter:card"]',
        'twitter:site' => 'meta[name="twitter:site"]',
        'twitter:creator' => 'meta[name="twitter:creator"]',
        'twitter:title' => 'meta[name="twitter:title"]',
        'twitter:description' => 'meta[name="twitter:description"]',
        'twitter:image' => 'meta[name="twitter:image"]'
    );

    foreach ($twitter_tags as $key => $selector) {
        $element = $html->find($selector, 0);
        $meta_data[$key] = $element ? $element->content : '';
    }

    // Extract canonical URL
    $canonical_element = $html->find('link[rel="canonical"]', 0);
    $meta_data['canonical'] = $canonical_element ? $canonical_element->href : '';

    // Clean up
    $html->clear();

    return $meta_data;
}

// Usage
$meta_data = extractMetaTags('https://example.com');
if ($meta_data) {
    print_r($meta_data);
} else {
    echo "Failed to extract meta tags";
}
?>
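
For social media preview generation, the array returned by extractMetaTags() can be reduced to a single title, description, and image. The sketch below assumes the key names used in that function. The helpers firstNonEmpty() and buildPreview() are hypothetical, and the fallback order (Open Graph, then Twitter Card, then standard tags) is an illustrative choice rather than a standard.

```php
<?php
// Hypothetical helper: return the first non-empty string among candidate keys.
function firstNonEmpty(array $meta, array $keys) {
    foreach ($keys as $key) {
        if (isset($meta[$key]) && is_string($meta[$key]) && trim($meta[$key]) !== '') {
            return trim($meta[$key]);
        }
    }
    return '';
}

// Build a link preview from the array shape returned by extractMetaTags().
function buildPreview(array $meta) {
    return array(
        'title'       => firstNonEmpty($meta, array('og:title', 'twitter:title', 'title')),
        'description' => firstNonEmpty($meta, array('og:description', 'twitter:description', 'description')),
        'image'       => firstNonEmpty($meta, array('og:image', 'twitter:image')),
    );
}
```

Calling buildPreview($meta_data) on the result from the usage example gives an array with 'title', 'description', and 'image' keys, each an empty string when no candidate tag was found.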

Advanced Meta Tag Extraction Techniques

Extracting All Meta Tags Dynamically

Sometimes you want to extract all meta tags without knowing their names in advance:

<?php
function extractAllMetaTags($url) {
    $html = file_get_html($url);

    if (!$html) {
        return false;
    }

    $all_meta = array();

    // Find all meta tags
    $meta_tags = $html->find('meta');

    foreach ($meta_tags as $meta) {
        $key = '';
        $value = $meta->content;

        // Check for name attribute
        if ($meta->name) {
            $key = 'name:' . $meta->name;
        }
        // Check for property attribute (Open Graph)
        elseif ($meta->property) {
            $key = 'property:' . $meta->property;
        }
        // Check for http-equiv attribute
        elseif ($meta->getAttribute('http-equiv')) {
            $key = 'http-equiv:' . $meta->getAttribute('http-equiv');
        }
        // Check for charset
        elseif ($meta->charset) {
            $key = 'charset';
            $value = $meta->charset;
        }

        if ($key && $value) {
            $all_meta[$key] = $value;
        }
    }

    $html->clear();
    return $all_meta;
}
?>
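
The flat keys produced by extractAllMetaTags() (for example 'property:og:title') can be awkward to query. As a sketch, the hypothetical helper below regroups them by attribute type; it assumes the "type:key" format used in the function above.

```php
<?php
// Hypothetical helper: regroup the flat "type:key" array from extractAllMetaTags()
// into nested arrays such as $grouped['property']['og:title'].
function groupMetaByType(array $all_meta) {
    $grouped = array();
    foreach ($all_meta as $key => $value) {
        // Split only at the first colon so "property:og:title" keeps "og:title"
        $parts = explode(':', $key, 2);
        if (count($parts) === 2) {
            $grouped[$parts[0]][$parts[1]] = $value;
        } else {
            // Keys without a prefix (for example "charset") stay at the top level
            $grouped[$key] = $value;
        }
    }
    return $grouped;
}
```

With this shape, all Open Graph values sit under $grouped['property'] and all name-based tags under $grouped['name'].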

Handling Multiple Meta Tags with Same Name

Some pages may have multiple meta tags with the same name. Here's how to handle them:

<?php
function extractMultipleMeta($url, $meta_name) {
    $html = file_get_html($url);

    if (!$html) {
        return false;
    }

    $meta_values = array();
    $meta_tags = $html->find('meta[name="' . $meta_name . '"]');

    foreach ($meta_tags as $meta) {
        if ($meta->content) {
            $meta_values[] = $meta->content;
        }
    }

    $html->clear();
    return $meta_values;
}

// Extract all keyword meta tags
$keywords = extractMultipleMeta('https://example.com', 'keywords');
?>
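
Each keywords tag usually holds a comma-separated string, so the array returned by extractMultipleMeta() often needs one more step to become a list of individual keywords. The helper below is a hypothetical sketch that splits, trims, and de-duplicates them; the case-insensitive comparison relies on strtolower(), which only handles ASCII letters reliably.

```php
<?php
// Hypothetical helper: turn the strings returned by extractMultipleMeta()
// into a de-duplicated list of individual keywords.
function splitKeywords(array $meta_values) {
    $keywords = array();
    foreach ($meta_values as $value) {
        foreach (explode(',', $value) as $keyword) {
            $keyword = trim($keyword);
            // Lowercased keyword is the array key, so repeats are skipped
            // and the first spelling seen is kept
            $lower = strtolower($keyword);
            if ($keyword !== '' && !isset($keywords[$lower])) {
                $keywords[$lower] = $keyword;
            }
        }
    }
    return array_values($keywords);
}
```

Because extractMultipleMeta() returns false on failure, check its result before passing it in, for example: $list = $keywords ? splitKeywords($keywords) : array();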

Error Handling and Best Practices

Robust Error Handling

file_get_html() returns false rather than throwing an exception when a page cannot be fetched, so check its result explicitly and set a timeout on the request:

<?php
function safeExtractMetaTags($url, $timeout = 30) {
    try {
        // Set a read timeout and a user agent; some servers block requests without one
        $context = stream_context_create([
            'http' => [
                'timeout' => $timeout,
                'user_agent' => 'Mozilla/5.0 (compatible; MetaExtractor/1.0)'
            ]
        ]);

        $html = file_get_html($url, false, $context);

        if (!$html) {
            throw new Exception("Failed to load HTML from URL: " . $url);
        }

        $meta_data = array();

        // Safe title extraction
        $title_element = $html->find('title', 0);
        $meta_data['title'] = $title_element ? trim($title_element->plaintext) : '';

        // Safe meta description extraction
        $desc_element = $html->find('meta[name="description"]', 0);
        $meta_data['description'] = $desc_element ? trim($desc_element->content) : '';

        $html->clear();
        return $meta_data;

    } catch (Exception $e) {
        error_log("Meta extraction error: " . $e->getMessage());
        return false;
    }
}
?>
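
Because file_get_html() also accepts local paths (as in the loading example earlier), it is worth validating URLs that come from user input before fetching them. The check below is a minimal sketch; isFetchableUrl() is a hypothetical helper, and restricting the scheme to http and https is an assumption about what your application should allow.

```php
<?php
// Hypothetical helper: accept only absolute http(s) URLs before fetching.
function isFetchableUrl($url) {
    if (!is_string($url) || filter_var($url, FILTER_VALIDATE_URL) === false) {
        return false;
    }
    // Reject file://, ftp:// and other schemes even if they are valid URLs
    $scheme = parse_url($url, PHP_URL_SCHEME);
    return is_string($scheme) && in_array(strtolower($scheme), array('http', 'https'), true);
}
```

A caller might return false early, before calling safeExtractMetaTags(), when isFetchableUrl($url) fails.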

Performance Optimization

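Batch processing is mostly about pacing requests rather than raw speed. When processing multiple URLs, pause between requests so you don't overwhelm the target server: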

<?php
function batchExtractMeta($urls) {
    $results = array();

    foreach ($urls as $url) {
        $meta_data = extractMetaTags($url);
        $results[$url] = $meta_data;

        // Add delay to avoid overwhelming the server
        usleep(500000); // 0.5 second delay
    }

    return $results;
}
?>
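
A variation on the batch loop, sketched under a few assumptions: the extractor is passed in as a callable (so any of the extraction functions in this guide can be used), duplicate URLs are fetched once, and the pause is skipped after the final URL. The function name batchExtractThrottled() and its parameters are hypothetical.

```php
<?php
// Sketch: run $extractor over each unique URL, pausing only between requests.
// $extractor is any callable that takes a URL, e.g. 'extractMetaTags'.
function batchExtractThrottled(array $urls, callable $extractor, $delay_us = 500000) {
    $results = array();
    $urls = array_values(array_unique($urls)); // fetch each URL once
    $last = count($urls) - 1;

    foreach ($urls as $i => $url) {
        $results[$url] = $extractor($url);
        // No need to wait after the final URL
        if ($i < $last && $delay_us > 0) {
            usleep($delay_us);
        }
    }
    return $results;
}
```

For example, batchExtractThrottled($urls, 'extractMetaTags') keeps the 0.5 second default delay, while a third argument of 0 disables the pause for local testing.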

Integration with Modern Web Scraping Workflows

Simple HTML DOM works well for basic HTML parsing, but it only sees the HTML the server returns and does not execute JavaScript. For JavaScript-heavy sites that generate meta tags dynamically, you may need a headless browser tool such as Puppeteer to render the page first, combined with careful error handling, since complex web applications fail in more ways than static pages do.

Complete Example: SEO Meta Analyzer

Here's a complete example that creates an SEO meta analyzer:

<?php
class SEOMetaAnalyzer {
    private $required_tags = array('title', 'description');
    private $recommended_tags = array('keywords', 'og:title', 'og:description');

    public function analyze($url) {
        $meta_data = $this->extractMetaTags($url);

        if (!$meta_data) {
            return array('error' => 'Failed to extract meta tags');
        }

        $analysis = array(
            'url' => $url,
            'meta_tags' => $meta_data,
            'seo_score' => $this->calculateSEOScore($meta_data),
            'recommendations' => $this->getRecommendations($meta_data)
        );

        return $analysis;
    }

    private function extractMetaTags($url) {
        // Use the comprehensive extraction function from above
        return extractMetaTags($url);
    }

    private function calculateSEOScore($meta_data) {
        $score = 0;
        // Maximum possible score: 100 (30 + 30 + 20 + 20)

        // Title check (30 points). mb_strlen() counts characters rather than
        // bytes; it needs the mbstring extension and assumes UTF-8 text.
        if (!empty($meta_data['title'])) {
            $title_length = mb_strlen($meta_data['title'], 'UTF-8');
            if ($title_length >= 30 && $title_length <= 60) {
                $score += 30;
            } else {
                $score += 15; // present, but outside the 30-60 character range
            }
        }

        // Description check (30 points)
        if (!empty($meta_data['description'])) {
            $desc_length = mb_strlen($meta_data['description'], 'UTF-8');
            if ($desc_length >= 120 && $desc_length <= 160) {
                $score += 30;
            } else {
                $score += 15; // present, but outside the 120-160 character range
            }
        }

        // Open Graph tags (20 points)
        $og_tags = array('og:title', 'og:description', 'og:image');
        $og_count = 0;
        foreach ($og_tags as $tag) {
            if (!empty($meta_data[$tag])) {
                $og_count++;
            }
        }
        $score += ($og_count / count($og_tags)) * 20;

        // Other important tags (20 points)
        $other_tags = array('canonical', 'robots', 'viewport');
        $other_count = 0;
        foreach ($other_tags as $tag) {
            if (!empty($meta_data[$tag])) {
                $other_count++;
            }
        }
        $score += ($other_count / count($other_tags)) * 20;

        return round($score, 2);
    }

    private function getRecommendations($meta_data) {
        $recommendations = array();

        // mb_strlen() counts characters, not bytes (assumes UTF-8 and mbstring)
        if (empty($meta_data['title'])) {
            $recommendations[] = "Add a title tag";
        } elseif (mb_strlen($meta_data['title'], 'UTF-8') > 60) {
            $recommendations[] = "Title tag is too long (over 60 characters)";
        }

        if (empty($meta_data['description'])) {
            $recommendations[] = "Add a meta description";
        } elseif (mb_strlen($meta_data['description'], 'UTF-8') > 160) {
            $recommendations[] = "Meta description is too long (over 160 characters)";
        }

        if (empty($meta_data['og:title'])) {
            $recommendations[] = "Add Open Graph title for better social media sharing";
        }

        if (empty($meta_data['canonical'])) {
            $recommendations[] = "Add canonical URL to avoid duplicate content issues";
        }

        return $recommendations;
    }
}

// Usage
$analyzer = new SEOMetaAnalyzer();
$analysis = $analyzer->analyze('https://example.com');
print_r($analysis);
?>

Conclusion

Simple HTML DOM Parser provides a straightforward and efficient way to extract meta tags from webpages in PHP. Whether you're building an SEO analyzer, content aggregator, or social media preview generator, the techniques shown in this guide will help you reliably extract meta information from HTML documents.

Remember to implement proper error handling, respect website rate limits, and consider using more advanced tools like Puppeteer when dealing with JavaScript-heavy sites that dynamically generate meta tags. With these fundamentals, you'll be well-equipped to build robust meta tag extraction systems for your web scraping projects.
