How do I extract meta tags from a webpage using Simple HTML DOM?

Extracting meta tags from webpages is a fundamental task in web scraping, especially for SEO analysis, content aggregation, and social media preview generation. Simple HTML DOM Parser provides an efficient way to parse HTML documents and extract meta information in PHP. This guide will show you how to extract various types of meta tags including standard SEO tags, Open Graph tags, and Twitter Card metadata.

Understanding Meta Tags

Meta tags are HTML elements that provide metadata about a webpage. They're placed within the <head> section of an HTML document and contain information that helps search engines, social media platforms, and other applications understand the page content.

Common meta tags include:

- Title tag: <title>Page Title</title>
- Description: <meta name="description" content="Page description">
- Keywords: <meta name="keywords" content="keyword1, keyword2">
- Open Graph tags: <meta property="og:title" content="Title">
- Twitter Cards: <meta name="twitter:card" content="summary">
- Viewport: <meta name="viewport" content="width=device-width">

Installing Simple HTML DOM Parser

First, you need to install the Simple HTML DOM Parser library. You can download it from the official source or use Composer:

composer require sunra/php-simple-html-dom-parser

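Alternatively, you can download the simple_html_dom.php file directly and include it in your project. The examples in this guide assume this direct include, which provides the global file_get_html() and str_get_html() functions. The Composer package wraps the same parser in the Sunra\PhpSimple\HtmlDomParser class, so with it you would call HtmlDomParser::file_get_html() and HtmlDomParser::str_get_html() instead.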

Basic Setup and HTML Loading

Here's how to set up Simple HTML DOM Parser and load HTML content:

<?php
require_once 'simple_html_dom.php';

// Load HTML from URL
$html = file_get_html('https://example.com');

// Or load HTML from string
$html_string = '<html><head><title>Sample Page</title></head></html>';
$html = str_get_html($html_string);

// Or load HTML from file
$html = file_get_html('local_file.html');
?>

Extracting Basic Meta Tags

Title Tag Extraction

The title tag is one of the most important SEO elements:

<?php
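require_once 'simple_html_dom.php';

$html = file_get_html('https://example.com');

// file_get_html() returns false when the page cannot be loaded
if (!$html) {
    exit("Failed to load https://example.com\n");
}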

// Extract title tag
$title_element = $html->find('title', 0);
$title = $title_element ? $title_element->plaintext : 'No title found';

echo "Title: " . $title . "\n";

// Clean up
$html->clear();
?>

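Depending on the page and the library version, the extracted title text may still contain HTML entities or line breaks. The following is a minimal sketch of a cleanup step; cleanTitle() is a hypothetical helper name introduced here for illustration and is not part of Simple HTML DOM.

```php
<?php
// Hypothetical helper (not part of Simple HTML DOM): tidy a raw title string.
function cleanTitle(string $raw): string
{
    // Decode entities such as &amp; or &#39; into their characters.
    $decoded = html_entity_decode($raw, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Collapse runs of whitespace (including line breaks) into single spaces.
    $collapsed = preg_replace('/\s+/u', ' ', $decoded);

    // preg_replace() returns null on invalid UTF-8; fall back to the decoded text.
    return trim($collapsed ?? $decoded);
}

echo cleanTitle("  Tom &amp; Jerry \n   Show ") . "\n"; // Tom & Jerry Show
```

In the example above you could pass $title_element->plaintext through cleanTitle() before printing it.
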
Meta Description and Keywords

Extract standard meta tags using the name attribute:

<?php
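require_once 'simple_html_dom.php';

$html = file_get_html('https://example.com');

// file_get_html() returns false when the page cannot be loaded
if (!$html) {
    exit("Failed to load https://example.com\n");
}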

// Extract meta description
$description_element = $html->find('meta[name="description"]', 0);
$description = $description_element ? $description_element->content : 'No description found';

// Extract meta keywords
$keywords_element = $html->find('meta[name="keywords"]', 0);
$keywords = $keywords_element ? $keywords_element->content : 'No keywords found';

echo "Description: " . $description . "\n";
echo "Keywords: " . $keywords . "\n";

$html->clear();
?>

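The keywords value is a single comma-separated string, so most uses need it split into a list. Here is one possible sketch; parseKeywords() is a hypothetical helper, and it assumes you pass the raw content value (or an empty string when the tag is missing) rather than a placeholder message such as 'No keywords found'.

```php
<?php
// Hypothetical helper: turn a keywords content string into a clean list.
function parseKeywords(string $content): array
{
    $keywords = array();

    foreach (explode(',', $content) as $part) {
        $keyword = trim($part);

        // Skip empty fragments left by stray or trailing commas.
        if ($keyword === '') {
            continue;
        }

        // Compare case-insensitively so "PHP" and "php" count once,
        // keeping the spelling that appeared first.
        $lookup = strtolower($keyword);
        if (!isset($keywords[$lookup])) {
            $keywords[$lookup] = $keyword;
        }
    }

    // Drop the lowercase lookup keys and return a plain list.
    return array_values($keywords);
}

print_r(parseKeywords('keyword1, keyword2, , Keyword1'));
```
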
Comprehensive Meta Tag Extraction Function

Here's a complete function that extracts multiple types of meta tags:

<?php
function extractMetaTags($url) {
    $html = file_get_html($url);

    if (!$html) {
        return false;
    }

    $meta_data = array();

    // Extract title
    $title_element = $html->find('title', 0);
    $meta_data['title'] = $title_element ? trim($title_element->plaintext) : '';

    // Extract standard meta tags
    $standard_meta = array(
        'description' => 'meta[name="description"]',
        'keywords' => 'meta[name="keywords"]',
        'author' => 'meta[name="author"]',
        'robots' => 'meta[name="robots"]',
        'viewport' => 'meta[name="viewport"]'
    );

    foreach ($standard_meta as $key => $selector) {
        $element = $html->find($selector, 0);
        $meta_data[$key] = $element ? $element->content : '';
    }

    // Extract Open Graph tags
    $og_tags = array(
        'og:title' => 'meta[property="og:title"]',
        'og:description' => 'meta[property="og:description"]',
        'og:image' => 'meta[property="og:image"]',
        'og:url' => 'meta[property="og:url"]',
        'og:type' => 'meta[property="og:type"]',
        'og:site_name' => 'meta[property="og:site_name"]'
    );

    foreach ($og_tags as $key => $selector) {
        $element = $html->find($selector, 0);
        $meta_data[$key] = $element ? $element->content : '';
    }

    // Extract Twitter Card tags
    $twitter_tags = array(
        'twitter:card' => 'meta[name="twitter:card"]',
        'twitter:site' => 'meta[name="twitter:site"]',
        'twitter:creator' => 'meta[name="twitter:creator"]',
        'twitter:title' => 'meta[name="twitter:title"]',
        'twitter:description' => 'meta[name="twitter:description"]',
        'twitter:image' => 'meta[name="twitter:image"]'
    );

    foreach ($twitter_tags as $key => $selector) {
        $element = $html->find($selector, 0);
        $meta_data[$key] = $element ? $element->content : '';
    }

    // Extract canonical URL
    $canonical_element = $html->find('link[rel="canonical"]', 0);
    $meta_data['canonical'] = $canonical_element ? $canonical_element->href : '';

    // Clean up
    $html->clear();

    return $meta_data;
}

// Usage
$meta_data = extractMetaTags('https://example.com');
if ($meta_data) {
    print_r($meta_data);
} else {
    echo "Failed to extract meta tags";
}
?>

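For social media preview generation, the array returned by extractMetaTags() usually needs a fallback order, since a page may define Open Graph tags, Twitter Card tags, both, or neither. The sketch below shows one way to do that; firstNonEmpty() and buildPreview() are hypothetical names, and the fallback order is an assumption rather than a rule taken from any platform.

```php
<?php
// Hypothetical helper: return the first non-empty value among the given keys.
function firstNonEmpty(array $meta_data, array $keys): string
{
    foreach ($keys as $key) {
        if (isset($meta_data[$key]) && trim((string) $meta_data[$key]) !== '') {
            return trim((string) $meta_data[$key]);
        }
    }

    return '';
}

// Build preview fields from the array shape returned by extractMetaTags().
// The order (Open Graph, then Twitter Card, then standard tags) is an assumption.
function buildPreview(array $meta_data): array
{
    return array(
        'title' => firstNonEmpty($meta_data, array('og:title', 'twitter:title', 'title')),
        'description' => firstNonEmpty($meta_data, array('og:description', 'twitter:description', 'description')),
        'image' => firstNonEmpty($meta_data, array('og:image', 'twitter:image')),
        'url' => firstNonEmpty($meta_data, array('og:url', 'canonical')),
    );
}
```

A typical call would be buildPreview(extractMetaTags($url)) after checking that extraction did not return false.
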
Advanced Meta Tag Extraction Techniques

Extracting All Meta Tags Dynamically

Sometimes you want to extract all meta tags without knowing their names in advance:

<?php
function extractAllMetaTags($url) {
    $html = file_get_html($url);

    if (!$html) {
        return false;
    }

    $all_meta = array();

    // Find all meta tags
    $meta_tags = $html->find('meta');

    foreach ($meta_tags as $meta) {
        $key = '';
        $value = $meta->content;

        // Check for name attribute
        if ($meta->name) {
            $key = 'name:' . $meta->name;
        }
        // Check for property attribute (Open Graph)
        elseif ($meta->property) {
            $key = 'property:' . $meta->property;
        }
        // Check for http-equiv attribute
        elseif ($meta->getAttribute('http-equiv')) {
            $key = 'http-equiv:' . $meta->getAttribute('http-equiv');
        }
        // Check for charset
        elseif ($meta->charset) {
            $key = 'charset';
            $value = $meta->charset;
        }

        if ($key && $value) {
            $all_meta[$key] = $value;
        }
    }

    $html->clear();
    return $all_meta;
}
?>

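The flat keys produced by extractAllMetaTags() ("name:description", "property:og:title", "charset", and so on) can be regrouped by their prefix when you want to inspect one family of tags at a time. This is a small sketch under that assumption; groupMetaBySource() and the "other" bucket name are hypothetical.

```php
<?php
// Hypothetical helper: regroup the flat array from extractAllMetaTags()
// into one sub-array per key prefix (name, property, http-equiv).
function groupMetaBySource(array $all_meta): array
{
    $grouped = array();

    foreach ($all_meta as $key => $value) {
        // Split on the first colon only, so "property:og:title" keeps "og:title".
        $parts = explode(':', (string) $key, 2);

        if (count($parts) === 2) {
            $grouped[$parts[0]][$parts[1]] = $value;
        } else {
            // Keys without a prefix, such as "charset", go in their own bucket.
            $grouped['other'][$key] = $value;
        }
    }

    return $grouped;
}
```
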
Handling Multiple Meta Tags with Same Name

Some pages may have multiple meta tags with the same name. Here's how to handle them:

<?php
function extractMultipleMeta($url, $meta_name) {
    $html = file_get_html($url);

    if (!$html) {
        return false;
    }

    $meta_values = array();
    $meta_tags = $html->find('meta[name="' . $meta_name . '"]');

    foreach ($meta_tags as $meta) {
        if ($meta->content) {
            $meta_values[] = $meta->content;
        }
    }

    $html->clear();
    return $meta_values;
}

// Extract all keyword meta tags
$keywords = extractMultipleMeta('https://example.com', 'keywords');
?>

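Note that extractMultipleMeta() matches only the name attribute, while Open Graph tags use property, and repeated tags such as og:image are common. The following sketch is a hypothetical variant, collectMetaContents(), that works on an already parsed document and accepts either attribute; it relies only on the find() method and content property used throughout this guide.

```php
<?php
// Hypothetical variant of extractMultipleMeta() that works on an already
// parsed document and can match either the name or the property attribute.
function collectMetaContents($html, string $attribute, string $value): array
{
    // Only allow the two attributes this guide uses to identify meta tags.
    if (!in_array($attribute, array('name', 'property'), true)) {
        throw new InvalidArgumentException('Unsupported attribute: ' . $attribute);
    }

    $contents = array();

    foreach ($html->find('meta[' . $attribute . '="' . $value . '"]') as $meta) {
        // Skip tags with a missing or empty content attribute.
        if ($meta->content) {
            $contents[] = $meta->content;
        }
    }

    return $contents;
}
```

With a document loaded through file_get_html(), calling collectMetaContents($html, 'property', 'og:image') would return every og:image value on the page. Parsing once and querying several times also avoids fetching the same URL repeatedly.
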
Error Handling and Best Practices

Robust Error Handling

Always implement proper error handling when extracting meta tags:

<?php
function safeExtractMetaTags($url, $timeout = 30) {
    try {
        // Set user agent to avoid blocking
        $context = stream_context_create([
            'http' => [
                'timeout' => $timeout,
                'user_agent' => 'Mozilla/5.0 (compatible; MetaExtractor/1.0)'
            ]
        ]);

        $html = file_get_html($url, false, $context);

        if (!$html) {
            throw new Exception("Failed to load HTML from URL: " . $url);
        }

        $meta_data = array();

        // Safe title extraction
        $title_element = $html->find('title', 0);
        $meta_data['title'] = $title_element ? trim($title_element->plaintext) : '';

        // Safe meta description extraction
        $desc_element = $html->find('meta[name="description"]', 0);
        $meta_data['description'] = $desc_element ? trim($desc_element->content) : '';

        $html->clear();
        return $meta_data;

    } catch (Exception $e) {
        error_log("Meta extraction error: " . $e->getMessage());
        return false;
    }
}
?>

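Because file_get_html() can also read local files, it is worth validating any URL that comes from user input before fetching it. The sketch below uses PHP's built-in filter_var() and parse_url(); isFetchableUrl() is a hypothetical helper name, and restricting to http and https is an assumption about what your application should allow.

```php
<?php
// Hypothetical helper: accept only well-formed http(s) URLs.
function isFetchableUrl(string $url): bool
{
    // Reject strings that are not syntactically valid URLs.
    if (filter_var($url, FILTER_VALIDATE_URL) === false) {
        return false;
    }

    // Reject schemes such as file:// or ftp:// that should not be fetched here.
    $scheme = strtolower((string) parse_url($url, PHP_URL_SCHEME));

    return in_array($scheme, array('http', 'https'), true);
}
```

You could call this at the top of safeExtractMetaTags() and throw an exception when it returns false.
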
Performance Optimization

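When processing multiple URLs, pause between requests so you do not overwhelm the target server. The delay is a politeness measure rather than a speed-up: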

<?php
function batchExtractMeta($urls) {
    $results = array();

    foreach ($urls as $url) {
        $meta_data = extractMetaTags($url);
        $results[$url] = $meta_data;

        // Add delay to avoid overwhelming the server
        usleep(500000); // 0.5 second delay
    }

    return $results;
}
?>

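A slightly more flexible variant is sketched below. It skips duplicate URLs, makes the delay configurable, and takes the extraction function as a parameter so it can be swapped or tested without network access. The name batchExtractMetaWith() and its parameters are hypothetical.

```php
<?php
// Hypothetical variant of batchExtractMeta() with de-duplication,
// a configurable delay, and an injectable extractor callable.
function batchExtractMetaWith(array $urls, callable $extractor, int $delay_us = 500000): array
{
    $results = array();

    // Remove duplicates so each URL is requested only once.
    $unique_urls = array_values(array_unique($urls));
    $last_index = count($unique_urls) - 1;

    foreach ($unique_urls as $index => $url) {
        $results[$url] = $extractor($url);

        // Pause between requests, but not after the final one.
        if ($delay_us > 0 && $index < $last_index) {
            usleep($delay_us);
        }
    }

    return $results;
}
```

With the function defined earlier in this guide, the call would be batchExtractMetaWith($urls, 'extractMetaTags').
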
Integration with Modern Web Scraping Workflows

Simple HTML DOM works well for static HTML, but it parses only the markup the server returns and does not execute JavaScript. For JavaScript-heavy sites that generate meta tags in the browser, you may need a headless browser tool such as Puppeteer to render the page first, and you should still apply the error handling shown above when dealing with complex web applications.

Complete Example: SEO Meta Analyzer

Here's a complete example that creates an SEO meta analyzer:

<?php
class SEOMetaAnalyzer {
    private $required_tags = array('title', 'description');
    private $recommended_tags = array('keywords', 'og:title', 'og:description');

    public function analyze($url) {
        $meta_data = $this->extractMetaTags($url);

        if (!$meta_data) {
            return array('error' => 'Failed to extract meta tags');
        }

        $analysis = array(
            'url' => $url,
            'meta_tags' => $meta_data,
            'seo_score' => $this->calculateSEOScore($meta_data),
            'recommendations' => $this->getRecommendations($meta_data)
        );

        return $analysis;
    }

    private function extractMetaTags($url) {
        // Use the comprehensive extraction function from above
        return extractMetaTags($url);
    }

    private function calculateSEOScore($meta_data) {
        $score = 0;
        $max_score = 100;

        // Title check (30 points)
        // mb_strlen() counts characters rather than bytes (assumes UTF-8 and the mbstring extension)
        if (!empty($meta_data['title'])) {
            $title_length = mb_strlen($meta_data['title'], 'UTF-8');
            if ($title_length >= 30 && $title_length <= 60) {
                $score += 30;
            } elseif ($title_length > 0) {
                $score += 15;
            }
        }

        // Description check (30 points)
        if (!empty($meta_data['description'])) {
            $desc_length = mb_strlen($meta_data['description'], 'UTF-8');
            if ($desc_length >= 120 && $desc_length <= 160) {
                $score += 30;
            } elseif ($desc_length > 0) {
                $score += 15;
            }
        }

        // Open Graph tags (20 points)
        $og_tags = array('og:title', 'og:description', 'og:image');
        $og_count = 0;
        foreach ($og_tags as $tag) {
            if (!empty($meta_data[$tag])) {
                $og_count++;
            }
        }
        $score += ($og_count / count($og_tags)) * 20;

        // Other important tags (20 points)
        $other_tags = array('canonical', 'robots', 'viewport');
        $other_count = 0;
        foreach ($other_tags as $tag) {
            if (!empty($meta_data[$tag])) {
                $other_count++;
            }
        }
        $score += ($other_count / count($other_tags)) * 20;

        return round($score, 2);
    }

    private function getRecommendations($meta_data) {
        $recommendations = array();

        // Lengths are counted in characters, not bytes (assumes UTF-8 and the mbstring extension)
        if (empty($meta_data['title'])) {
            $recommendations[] = "Add a title tag";
        } elseif (mb_strlen($meta_data['title'], 'UTF-8') > 60) {
            $recommendations[] = "Title tag is too long (over 60 characters)";
        }

        if (empty($meta_data['description'])) {
            $recommendations[] = "Add a meta description";
        } elseif (mb_strlen($meta_data['description'], 'UTF-8') > 160) {
            $recommendations[] = "Meta description is too long (over 160 characters)";
        }

        if (empty($meta_data['og:title'])) {
            $recommendations[] = "Add Open Graph title for better social media sharing";
        }

        if (empty($meta_data['canonical'])) {
            $recommendations[] = "Add canonical URL to avoid duplicate content issues";
        }

        return $recommendations;
    }
}

// Usage
$analyzer = new SEOMetaAnalyzer();
$analysis = $analyzer->analyze('https://example.com');
print_r($analysis);
?>

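The analyzer above flags titles and descriptions that are too long but says nothing when they are too short, even though short values also lose points in calculateSEOScore(). One way to cover both cases is sketched below; textLength() and rateLength() are hypothetical helpers, and the thresholds are passed in by the caller rather than fixed.

```php
<?php
// Hypothetical helper: count characters, falling back to bytes
// when the mbstring extension is not installed.
function textLength(string $text): int
{
    return function_exists('mb_strlen') ? mb_strlen($text, 'UTF-8') : strlen($text);
}

// Hypothetical helper: classify a text against a minimum and maximum length.
function rateLength(string $text, int $min, int $max): string
{
    $length = textLength(trim($text));

    if ($length === 0) {
        return 'missing';
    }
    if ($length < $min) {
        return 'too short';
    }
    if ($length > $max) {
        return 'too long';
    }

    return 'ok';
}
```

Using the ranges from the scoring method, rateLength($meta_data['title'], 30, 60) and rateLength($meta_data['description'], 120, 160) would give a status that getRecommendations() could turn into a message.
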
Conclusion

Simple HTML DOM Parser provides a straightforward and efficient way to extract meta tags from webpages in PHP. Whether you're building an SEO analyzer, content aggregator, or social media preview generator, the techniques shown in this guide will help you reliably extract meta information from HTML documents.

Remember to implement proper error handling, respect website rate limits, and consider using more advanced tools like Puppeteer when dealing with JavaScript-heavy sites that dynamically generate meta tags. With these fundamentals, you'll be well-equipped to build robust meta tag extraction systems for your web scraping projects.
