How do you extract metadata from HTML head tags using Cheerio?

Extracting metadata from HTML head tags is a fundamental web scraping task, essential for SEO analysis, social media optimization, and content management. Cheerio, a fast and lean implementation of core jQuery designed for the server, provides the parsing and selection tools needed to pull this information out of web pages.

Understanding HTML Metadata

HTML metadata resides within the <head> section of a web page and includes elements that describe the page's content, structure, and behavior. Common metadata includes the following (a short example appears after the list):

  • Title tags - The main page title
  • Meta descriptions - Page summaries for search engines
  • Open Graph tags - Social media sharing metadata
  • Twitter Card tags - Twitter-specific metadata
  • Canonical URLs - Preferred page URLs
  • Schema.org structured data - Rich snippets for search engines
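
For reference, here is roughly what these tags look like inside a page's <head>. The snippet below is a minimal sketch with made-up markup, loaded into Cheerio so the selectors used throughout this article have something concrete to match:

const cheerio = require('cheerio');

// Hypothetical <head> markup illustrating the metadata types listed above
const html = `
<head>
    <title>Example Article</title>
    <meta name="description" content="A short summary of the page.">
    <meta property="og:title" content="Example Article">
    <meta name="twitter:card" content="summary_large_image">
    <link rel="canonical" href="https://example.com/article">
</head>`;

const $ = cheerio.load(html);
console.log($('title').text());                              // "Example Article"
console.log($('meta[name="description"]').attr('content'));  // "A short summary of the page."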

Basic Setup and Installation

First, install Cheerio in your Node.js project:

npm install cheerio axios

Here's the basic setup for loading HTML content:

const cheerio = require('cheerio');
const axios = require('axios');

async function extractMetadata(url) {
    try {
        const response = await axios.get(url);
        const $ = cheerio.load(response.data);

        // Build the metadata object here using the extraction
        // functions covered in the sections below
        const metadata = {};

        return metadata;
    } catch (error) {
        console.error('Error fetching page:', error.message);
        return null;
    }
}

Extracting Basic Metadata

Title Tag Extraction

The page title is one of the most important metadata elements:

function extractTitle($) {
    // Primary method - get title tag content
    let title = $('title').text().trim();

    // Fallback to Open Graph title
    if (!title) {
        title = $('meta[property="og:title"]').attr('content');
    }

    // Fallback to Twitter title
    if (!title) {
        title = $('meta[name="twitter:title"]').attr('content');
    }

    return title || 'No title found';
}

Meta Description Extraction

Meta descriptions are crucial for SEO and social sharing:

function extractDescription($) {
    // Standard meta description
    let description = $('meta[name="description"]').attr('content');

    // Fallback to Open Graph description
    if (!description) {
        description = $('meta[property="og:description"]').attr('content');
    }

    // Fallback to Twitter description
    if (!description) {
        description = $('meta[name="twitter:description"]').attr('content');
    }

    return description ? description.trim() : null;
}

Advanced Metadata Extraction

Open Graph Tags

Open Graph tags control how content appears when shared on social platforms:

function extractOpenGraph($) {
    const ogData = {};

    // Extract all Open Graph properties
    $('meta[property^="og:"]').each((index, element) => {
        const property = $(element).attr('property');
        const content = $(element).attr('content');

        if (property && content) {
            // Convert og:property to camelCase key
            const key = property.replace('og:', '').replace(/-([a-z])/g, (g) => g[1].toUpperCase());
            ogData[key] = content.trim();
        }
    });

    return ogData;
}
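
As a quick usage sketch, assuming a page whose head contains the tags shown in the comments, the function returns a plain object keyed by each property name with the og: prefix removed:

// Given a head containing, for example:
//   <meta property="og:title" content="Example Article">
//   <meta property="og:site_name" content="Example Site">
const ogData = extractOpenGraph($);
console.log(ogData);
// { title: 'Example Article', site_name: 'Example Site' }
// Nested properties such as og:image:width keep the inner colon as 'image:width'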

Twitter Card Metadata

Twitter Cards provide rich media experiences when sharing content:

function extractTwitterCard($) {
    const twitterData = {};

    // Extract Twitter Card tags
    $('meta[name^="twitter:"]').each((index, element) => {
        const name = $(element).attr('name');
        const content = $(element).attr('content');

        if (name && content) {
            const key = name.replace('twitter:', '').replace(/-([a-z])/g, (g) => g[1].toUpperCase());
            twitterData[key] = content.trim();
        }
    });

    return twitterData;
}

Canonical URL and Links

Extract important link relationships:

function extractLinks($) {
    const links = {};

    // Canonical URL
    const canonical = $('link[rel="canonical"]').attr('href');
    if (canonical) {
        links.canonical = canonical;
    }

    // Alternative languages (hreflang)
    const alternates = [];
    $('link[rel="alternate"][hreflang]').each((index, element) => {
        alternates.push({
            href: $(element).attr('href'),
            hreflang: $(element).attr('hreflang')
        });
    });

    if (alternates.length > 0) {
        links.alternates = alternates;
    }

    // RSS/Atom feeds
    const feeds = [];
    $('link[type="application/rss+xml"], link[type="application/atom+xml"]').each((index, element) => {
        feeds.push({
            href: $(element).attr('href'),
            title: $(element).attr('title'),
            type: $(element).attr('type')
        });
    });

    if (feeds.length > 0) {
        links.feeds = feeds;
    }

    return links;
}
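
Note that href values in <link> tags can be relative. A small helper along these lines (the resolveLinks name is purely illustrative) can normalize them against the page URL using Node's built-in URL class:

// Illustrative helper: resolve possibly-relative hrefs against the page URL
function resolveLinks(links, baseUrl) {
    const resolve = (href) => (href ? new URL(href, baseUrl).href : href);

    if (links.canonical) {
        links.canonical = resolve(links.canonical);
    }
    (links.alternates || []).forEach((alt) => { alt.href = resolve(alt.href); });
    (links.feeds || []).forEach((feed) => { feed.href = resolve(feed.href); });

    return links;
}

// Usage: resolveLinks(extractLinks($), 'https://example.com/some/page');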

Structured Data Extraction

JSON-LD Schema Markup

Many websites use JSON-LD for structured data:

function extractJSONLD($) {
    const structuredData = [];

    $('script[type="application/ld+json"]').each((index, element) => {
        try {
            const content = $(element).html();
            const data = JSON.parse(content);
            structuredData.push(data);
        } catch (error) {
            console.warn('Invalid JSON-LD found:', error.message);
        }
    });

    return structuredData;
}
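
Once the JSON-LD blocks are parsed, you often want entries of a specific schema type. Here is a minimal sketch that works on the output of extractJSONLD above (the findSchemaType helper is illustrative, not part of Cheerio):

// Illustrative helper: pick out JSON-LD entities of a given @type
function findSchemaType(structuredData, type) {
    return structuredData
        // Some sites wrap multiple entities in an @graph array
        .flatMap((entry) => (entry['@graph'] ? entry['@graph'] : [entry]))
        // @type may be a single string or an array of strings
        .filter((entry) => [].concat(entry['@type'] || []).includes(type));
}

// Usage: const articles = findSchemaType(extractJSONLD($), 'Article');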

Microdata Extraction

Extract microdata attributes from HTML elements:

function extractMicrodata($) {
    const microdata = [];

    $('[itemscope]').each((index, element) => {
        const item = {
            type: $(element).attr('itemtype'),
            properties: {}
        };

        $(element).find('[itemprop]').each((i, prop) => {
            const propName = $(prop).attr('itemprop');
            let propValue;

            // Extract value based on element type
            if ($(prop).is('meta')) {
                propValue = $(prop).attr('content');
            } else if ($(prop).is('img')) {
                propValue = $(prop).attr('src');
            } else if ($(prop).is('a')) {
                propValue = $(prop).attr('href');
            } else {
                propValue = $(prop).text().trim();
            }

            item.properties[propName] = propValue;
        });

        microdata.push(item);
    });

    return microdata;
}

Complete Metadata Extraction Function

Here's a comprehensive function that combines all extraction methods:

async function extractAllMetadata(url) {
    try {
        const response = await axios.get(url, {
            headers: {
                'User-Agent': 'Mozilla/5.0 (compatible; MetadataBot/1.0)'
            }
        });

        const $ = cheerio.load(response.data);

        const metadata = {
            url: url,
            title: extractTitle($),
            description: extractDescription($),
            openGraph: extractOpenGraph($),
            twitterCard: extractTwitterCard($),
            links: extractLinks($),
            structuredData: {
                jsonLD: extractJSONLD($),
                microdata: extractMicrodata($)
            },
            meta: {}
        };

        // Extract additional meta tags
        $('meta').each((index, element) => {
            const name = $(element).attr('name') || $(element).attr('property');
            const content = $(element).attr('content');

            if (name && content && !name.startsWith('og:') && !name.startsWith('twitter:')) {
                metadata.meta[name] = content.trim();
            }
        });

        return metadata;
    } catch (error) {
        throw new Error(`Failed to extract metadata: ${error.message}`);
    }
}

Error Handling and Best Practices

Robust Error Handling

function safeExtractMetadata($, selector, attribute = 'content') {
    try {
        const element = $(selector);
        if (element.length === 0) return null;

        return attribute === 'text' 
            ? element.text().trim() 
            : element.attr(attribute)?.trim() || null;
    } catch (error) {
        console.warn(`Error extracting ${selector}:`, error.message);
        return null;
    }
}
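
As a usage sketch, this helper can stand in for the repeated attr() and text() calls shown earlier; the selectors below are the same ones used in the basic extraction functions:

// Usage examples for the helper above
const description = safeExtractMetadata($, 'meta[name="description"]');
const ogImage = safeExtractMetadata($, 'meta[property="og:image"]');
const pageTitle = safeExtractMetadata($, 'title', 'text');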

Performance Optimization

For large-scale scraping operations, consider these optimizations:

// Limit response size to prevent memory issues
const response = await axios.get(url, {
    maxContentLength: 5 * 1024 * 1024, // 5MB limit
    timeout: 10000, // 10 second timeout
    headers: {
        'Accept': 'text/html,application/xhtml+xml'
    }
});

// Only load the head section if possible
const headMatch = response.data.match(/<head[^>]*>([\s\S]*?)<\/head>/i);
if (headMatch) {
    const $ = cheerio.load(`<html><head>${headMatch[1]}</head></html>`);
    // Process only head content
}
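
When processing many URLs, it also helps to cap concurrency instead of firing every request at once. Here is a minimal sketch, assuming the extractAllMetadata function defined above and a caller-supplied list of URLs (the extractInBatches name and batch size are illustrative):

// Process URLs in small batches to limit concurrent requests
async function extractInBatches(urls, batchSize = 5) {
    const results = [];

    for (let i = 0; i < urls.length; i += batchSize) {
        const batch = urls.slice(i, i + batchSize);
        // allSettled keeps one failing URL from aborting the whole batch
        const settled = await Promise.allSettled(batch.map((u) => extractAllMetadata(u)));
        results.push(...settled.map((r) => (r.status === 'fulfilled' ? r.value : null)));
    }

    return results;
}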

Integration with Web Scraping APIs

When working with complex websites that require JavaScript rendering, you might need to combine Cheerio with tools like Puppeteer. For dynamic content that loads after the initial page load, consider using headless browser solutions or specialized web scraping APIs that can handle JavaScript-heavy sites.
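
Here is a minimal sketch of that combination, assuming Puppeteer is installed (npm install puppeteer): it renders the page in headless Chromium, then hands the fully rendered HTML to the same Cheerio-based extractors defined earlier.

const puppeteer = require('puppeteer');
const cheerio = require('cheerio');

async function extractMetadataWithRendering(url) {
    const browser = await puppeteer.launch();
    try {
        const page = await browser.newPage();
        await page.goto(url, { waitUntil: 'networkidle0' });

        // Grab the rendered HTML and reuse the Cheerio extractors from above
        const html = await page.content();
        const $ = cheerio.load(html);

        return {
            title: extractTitle($),
            description: extractDescription($),
            openGraph: extractOpenGraph($)
        };
    } finally {
        await browser.close();
    }
}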

Usage Example

// Example usage
extractAllMetadata('https://example.com')
    .then(metadata => {
        console.log('Page Title:', metadata.title);
        console.log('Description:', metadata.description);
        console.log('Open Graph Data:', metadata.openGraph);
        console.log('Structured Data:', metadata.structuredData);
    })
    .catch(error => {
        console.error('Extraction failed:', error.message);
    });

Conclusion

Cheerio provides an excellent foundation for extracting HTML metadata, offering jQuery-like syntax with server-side performance. By combining basic metadata extraction with advanced techniques for Open Graph tags, structured data, and error handling, you can build robust metadata extraction systems for SEO analysis, content management, and social media optimization.

For websites with complex JavaScript interactions or dynamic content loading, consider integrating these Cheerio techniques with browser automation tools for comprehensive metadata extraction capabilities.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
