How do I extract metadata from HTML head tags using Beautiful Soup?

Extracting metadata from HTML head tags is a fundamental task in web scraping, SEO analysis, and content analysis. Beautiful Soup provides powerful tools to parse and extract various types of metadata including title tags, meta descriptions, Open Graph tags, Twitter Cards, and schema.org structured data. This guide covers comprehensive techniques for extracting all types of HTML metadata using Beautiful Soup.

Understanding HTML Metadata

HTML metadata resides within the <head> section of web pages and provides information about the document. Common metadata includes (a short parsing example follows the list):

  • Title tag: The page title displayed in browser tabs and search results
  • Meta tags: Description, keywords, robots directives, viewport settings
  • Open Graph tags: Social media sharing metadata
  • Twitter Card tags: Twitter-specific sharing metadata
  • Schema.org markup: Structured data for search engines
  • Link tags: Canonical URLs, stylesheets, favicons
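
To make these concrete, here is a tiny, invented <head> parsed with Beautiful Soup (installation is covered in the next section); the tag names are standard, the values are placeholders:

from bs4 import BeautifulSoup

html = """
<head>
    <title>Example Page</title>
    <meta name="description" content="A sample page">
    <meta property="og:title" content="Example Page">
    <meta name="twitter:card" content="summary">
    <link rel="canonical" href="https://example.com/">
</head>
"""

soup = BeautifulSoup(html, 'html.parser')
print(soup.title.string)                                            # Example Page
print(soup.find('meta', attrs={'name': 'description'})['content'])  # A sample page
print(soup.find('meta', property='og:title')['content'])            # Example Page
print(soup.find('link', rel='canonical')['href'])                   # https://example.com/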

Basic Setup and Installation

First, ensure you have Beautiful Soup and requests installed:

pip install beautifulsoup4 requests lxml

Here's the basic setup for extracting metadata:

import requests
from bs4 import BeautifulSoup
import json
from urllib.parse import urljoin, urlparse

def get_page_soup(url):
    """Fetch and parse a webpage with Beautiful Soup"""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    }
    response = requests.get(url, headers=headers)
    response.raise_for_status()
    return BeautifulSoup(response.content, 'lxml')

# Example usage
url = "https://example.com"
soup = get_page_soup(url)

Extracting Basic Metadata

Title Tag Extraction

The title tag is the most fundamental metadata element:

def extract_title(soup):
    """Extract the page title"""
    title_tag = soup.find('title')
    if title_tag:
        return title_tag.get_text().strip()
    return None

# Usage
title = extract_title(soup)
print(f"Title: {title}")

Meta Tags Extraction

Extract common meta tags using various approaches:

def extract_meta_tags(soup):
    """Extract all meta tags into a dictionary"""
    meta_data = {}

    # Extract meta tags by name attribute
    meta_tags = soup.find_all('meta', attrs={'name': True})
    for tag in meta_tags:
        name = tag.get('name').lower()
        content = tag.get('content', '')
        meta_data[name] = content

    # Extract meta tags by property attribute (Open Graph, etc.)
    property_tags = soup.find_all('meta', attrs={'property': True})
    for tag in property_tags:
        property_name = tag.get('property').lower()
        content = tag.get('content', '')
        meta_data[property_name] = content

    # Extract meta tags by http-equiv attribute
    http_equiv_tags = soup.find_all('meta', attrs={'http-equiv': True})
    for tag in http_equiv_tags:
        equiv_name = tag.get('http-equiv').lower()
        content = tag.get('content', '')
        meta_data[f"http-equiv-{equiv_name}"] = content

    return meta_data

# Usage
meta_data = extract_meta_tags(soup)
print("Meta Description:", meta_data.get('description'))
print("Keywords:", meta_data.get('keywords'))
print("Robots:", meta_data.get('robots'))

Extracting Social Media Metadata

Open Graph Tags

Open Graph tags control how content appears when shared on social platforms:

def extract_open_graph(soup):
    """Extract Open Graph metadata"""
    og_data = {}

    og_tags = soup.find_all('meta', property=lambda x: x and x.startswith('og:'))
    for tag in og_tags:
        property_name = tag.get('property')
        content = tag.get('content', '')
        # Remove 'og:' prefix for cleaner keys
        key = property_name.replace('og:', '')
        og_data[key] = content

    return og_data

# Usage
og_data = extract_open_graph(soup)
print("OG Title:", og_data.get('title'))
print("OG Description:", og_data.get('description'))
print("OG Image:", og_data.get('image'))
print("OG URL:", og_data.get('url'))
print("OG Type:", og_data.get('type'))

Twitter Card Tags

Extract Twitter-specific metadata:

def extract_twitter_cards(soup):
    """Extract Twitter Card metadata"""
    twitter_data = {}

    twitter_tags = soup.find_all('meta', attrs={'name': lambda x: x and x.startswith('twitter:')})
    for tag in twitter_tags:
        name = tag.get('name')
        content = tag.get('content', '')
        # Remove 'twitter:' prefix for cleaner keys
        key = name.replace('twitter:', '')
        twitter_data[key] = content

    return twitter_data

# Usage
twitter_data = extract_twitter_cards(soup)
print("Twitter Card:", twitter_data.get('card'))
print("Twitter Site:", twitter_data.get('site'))
print("Twitter Creator:", twitter_data.get('creator'))

Extracting Link Tags and Resources

Link tags provide information about external resources and relationships:

def extract_link_tags(soup, base_url=None):
    """Extract link tag information"""
    links_data = {}

    # Extract canonical URL
    canonical = soup.find('link', rel='canonical')
    if canonical:
        canonical_url = canonical.get('href')
        if base_url and canonical_url:
            canonical_url = urljoin(base_url, canonical_url)
        links_data['canonical'] = canonical_url

    # Extract favicon
    favicon_selectors = [
        {'rel': 'icon'},
        {'rel': 'shortcut icon'},
        {'rel': 'apple-touch-icon'}
    ]

    favicons = []
    for selector in favicon_selectors:
        favicon_tags = soup.find_all('link', selector)
        for tag in favicon_tags:
            favicon_url = tag.get('href')
            if base_url and favicon_url:
                favicon_url = urljoin(base_url, favicon_url)
            favicons.append({
                'rel': tag.get('rel'),
                'href': favicon_url,
                'sizes': tag.get('sizes'),
                'type': tag.get('type')
            })

    links_data['favicons'] = favicons

    # Extract stylesheet links
    stylesheets = []
    css_links = soup.find_all('link', rel='stylesheet')
    for link in css_links:
        stylesheet_url = link.get('href')
        if base_url and stylesheet_url:
            stylesheet_url = urljoin(base_url, stylesheet_url)
        stylesheets.append(stylesheet_url)

    links_data['stylesheets'] = stylesheets

    return links_data

# Usage
links_data = extract_link_tags(soup, base_url=url)
print("Canonical URL:", links_data.get('canonical'))
print("Favicons:", links_data.get('favicons'))

Comprehensive Metadata Extraction Class

Here's a complete class that combines all metadata extraction methods:

class MetadataExtractor:
    def __init__(self, url_or_soup, base_url=None):
        if isinstance(url_or_soup, str):
            self.soup = get_page_soup(url_or_soup)
            self.base_url = url_or_soup
        else:
            self.soup = url_or_soup
            self.base_url = base_url

    def extract_all(self):
        """Extract all metadata into a structured dictionary"""
        metadata = {
            'basic': self._extract_basic(),
            'meta_tags': self._extract_meta_tags(),
            'open_graph': self._extract_open_graph(),
            'twitter': self._extract_twitter_cards(),
            'links': self._extract_link_tags(),
            'schema': self._extract_schema_org()
        }
        return metadata

    def _extract_basic(self):
        """Extract basic metadata"""
        title_tag = self.soup.find('title')
        title = title_tag.get_text().strip() if title_tag else None

        return {
            'title': title,
            'lang': self.soup.html.get('lang') if self.soup.html else None
        }

    def _extract_meta_tags(self):
        """Extract meta tags"""
        meta_data = {}

        # Name-based meta tags
        for tag in self.soup.find_all('meta', attrs={'name': True}):
            name = tag.get('name').lower()
            content = tag.get('content', '')
            meta_data[name] = content

        return meta_data

    def _extract_open_graph(self):
        """Extract Open Graph tags"""
        og_data = {}
        og_tags = self.soup.find_all('meta', property=lambda x: x and x.startswith('og:'))

        for tag in og_tags:
            property_name = tag.get('property').replace('og:', '')
            content = tag.get('content', '')
            og_data[property_name] = content

        return og_data

    def _extract_twitter_cards(self):
        """Extract Twitter Card tags"""
        twitter_data = {}
        twitter_tags = self.soup.find_all('meta', attrs={'name': lambda x: x and x.startswith('twitter:')})

        for tag in twitter_tags:
            name = tag.get('name').replace('twitter:', '')
            content = tag.get('content', '')
            twitter_data[name] = content

        return twitter_data

    def _extract_link_tags(self):
        """Extract link tags"""
        links_data = {}

        # Canonical URL
        canonical = self.soup.find('link', rel='canonical')
        if canonical:
            canonical_url = canonical.get('href')
            if self.base_url and canonical_url:
                canonical_url = urljoin(self.base_url, canonical_url)
            links_data['canonical'] = canonical_url

        return links_data

    def _extract_schema_org(self):
        """Extract Schema.org JSON-LD structured data"""
        schema_data = []

        # Find JSON-LD scripts
        scripts = self.soup.find_all('script', type='application/ld+json')
        for script in scripts:
            try:
                # script.string is None for empty script tags; json.loads(None) raises TypeError
                data = json.loads(script.string)
                schema_data.append(data)
            except (json.JSONDecodeError, TypeError):
                continue

        return schema_data

# Usage example
extractor = MetadataExtractor("https://example.com")
all_metadata = extractor.extract_all()

print("=== BASIC METADATA ===")
print(f"Title: {all_metadata['basic']['title']}")
print(f"Language: {all_metadata['basic']['lang']}")

print("\n=== META TAGS ===")
for key, value in all_metadata['meta_tags'].items():
    print(f"{key}: {value}")

print("\n=== OPEN GRAPH ===")
for key, value in all_metadata['open_graph'].items():
    print(f"{key}: {value}")

Advanced Extraction Techniques

Handling Multiple Values

Some meta tags can have multiple values or variations:

def extract_meta_variations(soup, name_variations):
    """Extract meta tag content checking multiple name variations"""
    for variation in name_variations:
        tag = soup.find('meta', attrs={'name': variation}) or \
              soup.find('meta', attrs={'property': variation})
        if tag and tag.get('content'):
            return tag.get('content').strip()
    return None

# Example: Extract description from various possible meta tags
description_variations = [
    'description', 'og:description', 'twitter:description'
]
description = extract_meta_variations(soup, description_variations)

Error Handling and Validation

Implement robust error handling for production use:

def safe_extract_metadata(url, timeout=10):
    """Safely extract metadata with error handling"""
    try:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        }
        response = requests.get(url, headers=headers, timeout=timeout)
        response.raise_for_status()

        soup = BeautifulSoup(response.content, 'lxml')
        extractor = MetadataExtractor(soup, base_url=url)
        return extractor.extract_all()

    except requests.RequestException as e:
        print(f"Request error: {e}")
        return None
    except Exception as e:
        print(f"Parsing error: {e}")
        return None
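
A minimal batch usage sketch (these URLs are placeholders): iterate over a list of pages, skip any where extraction failed, and report the titles:

# Batch extraction over several pages
urls = ["https://example.com", "https://example.org"]

for page_url in urls:
    metadata = safe_extract_metadata(page_url, timeout=10)
    if metadata is None:
        continue  # the error was already reported inside safe_extract_metadata
    print(page_url, "->", metadata['basic']['title'])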

Integration with Web Scraping Workflows

When working with dynamic content or complex websites, you may need to pair Beautiful Soup with a headless browser such as Puppeteer for JavaScript-rendered pages: let the browser handle AJAX requests and finish rendering, then extract metadata from the resulting HTML.

For comprehensive SEO analysis workflows, you can use Puppeteer for SEO auditing to gather additional performance metrics alongside the metadata extraction.
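
Whichever tool does the rendering, the hand-off to Beautiful Soup is identical: once the final HTML exists as a string, the extractor above works unchanged. A sketch, with a stand-in string where the browser output would go:

# Stand-in for HTML captured from a headless browser after JavaScript ran
rendered_html = "<html lang='en'><head><title>Rendered Page</title></head></html>"

soup = BeautifulSoup(rendered_html, 'lxml')
extractor = MetadataExtractor(soup, base_url="https://example.com")
print(extractor.extract_all()['basic']['title'])  # Rendered Page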

Best Practices and Performance Tips

1. Use Efficient Selectors

# Efficient: Use specific selectors
meta_description = soup.find('meta', attrs={'name': 'description'})

# Less efficient: Search all meta tags
all_meta = soup.find_all('meta')
for tag in all_meta:
    if tag.get('name') == 'description':
        meta_description = tag

2. Handle Missing Data Gracefully

def safe_get_content(tag):
    """Safely get content from a tag"""
    return tag.get('content', '').strip() if tag else None

3. Normalize and Clean Data

def clean_metadata(text):
    """Clean and normalize metadata text"""
    if not text:
        return None

    # str.split() splits on any whitespace (spaces, tabs, newlines, carriage
    # returns), so rejoining with single spaces normalizes and trims in one step
    return ' '.join(text.split())
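
Applied across a whole extracted dictionary, for example:

# Normalize every value from extract_meta_tags() in one pass
meta_data = extract_meta_tags(soup)
cleaned = {key: clean_metadata(value) for key, value in meta_data.items()}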

Conclusion

Beautiful Soup provides excellent capabilities for extracting HTML metadata, from basic title and meta tags to complex structured data. The techniques covered in this guide enable you to build robust metadata extraction systems for SEO analysis, content management, and web scraping applications.

Key takeaways:

  • Use specific selectors for efficient metadata extraction
  • Handle the different meta tag types (name, property, http-equiv)
  • Extract social media metadata (Open Graph, Twitter Cards)
  • Implement error handling for production use
  • Combine with other tools for JavaScript-heavy sites
  • Clean and validate extracted data

This comprehensive approach ensures you can extract all types of HTML metadata reliably and efficiently using Beautiful Soup in your Python applications.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
