How do I extract metadata from HTML head tags using Beautiful Soup?
Extracting metadata from HTML head tags is a common task in web scraping, SEO analysis, and content auditing. Beautiful Soup makes it straightforward to parse and extract title tags, meta descriptions, Open Graph tags, Twitter Cards, and schema.org structured data. This guide covers techniques for extracting each of these metadata types.
Understanding HTML Metadata
HTML metadata resides within the <head> section of web pages and provides information about the document. Common metadata includes:
- Title tag: The page title displayed in browser tabs and search results
- Meta tags: Description, keywords, robots directives, viewport settings
- Open Graph tags: Social media sharing metadata
- Twitter Card tags: Twitter-specific sharing metadata
- Schema.org markup: Structured data for search engines
- Link tags: Canonical URLs, stylesheets, favicons
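To make these categories concrete, here is a minimal sketch against a small, invented <head> snippet, showing where each element lives and how Beautiful Soup reaches it (the sample markup and values are hypothetical; the built-in 'html.parser' is used here so no extra parser is required):

```python
from bs4 import BeautifulSoup

# Hypothetical <head> containing one example of each common metadata type
html = """
<html lang="en"><head>
  <title>Example Page</title>
  <meta name="description" content="A sample page.">
  <meta property="og:title" content="Example Page">
  <link rel="canonical" href="https://example.com/">
</head><body></body></html>
"""

soup = BeautifulSoup(html, 'html.parser')
print(soup.title.get_text())                                        # Example Page
print(soup.find('meta', attrs={'name': 'description'})['content'])  # A sample page.
print(soup.find('meta', property='og:title')['content'])            # Example Page
print(soup.find('link', rel='canonical')['href'])                   # https://example.com/
```

The same lookups work identically on a soup built from a live page; the rest of this guide wraps them in reusable functions.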
Basic Setup and Installation
First, ensure you have Beautiful Soup and requests installed:
pip install beautifulsoup4 requests lxml
Here's the basic setup for extracting metadata:
import requests
from bs4 import BeautifulSoup
import json
from urllib.parse import urljoin, urlparse
def get_page_soup(url):
    """Fetch and parse a webpage with Beautiful Soup"""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    }
    response = requests.get(url, headers=headers)
    response.raise_for_status()
    return BeautifulSoup(response.content, 'lxml')

# Example usage
url = "https://example.com"
soup = get_page_soup(url)
Extracting Basic Metadata
Title Tag Extraction
The title tag is the most fundamental metadata element:
def extract_title(soup):
    """Extract the page title"""
    title_tag = soup.find('title')
    if title_tag:
        return title_tag.get_text().strip()
    return None

# Usage
title = extract_title(soup)
print(f"Title: {title}")
Meta Tags Extraction
Extract common meta tags using various approaches:
def extract_meta_tags(soup):
    """Extract all meta tags into a dictionary"""
    meta_data = {}

    # Meta tags keyed by the name attribute
    for tag in soup.find_all('meta', attrs={'name': True}):
        name = tag.get('name').lower()
        content = tag.get('content', '')
        meta_data[name] = content

    # Meta tags keyed by the property attribute (Open Graph, etc.)
    for tag in soup.find_all('meta', attrs={'property': True}):
        property_name = tag.get('property').lower()
        content = tag.get('content', '')
        meta_data[property_name] = content

    # Meta tags keyed by the http-equiv attribute
    for tag in soup.find_all('meta', attrs={'http-equiv': True}):
        equiv_name = tag.get('http-equiv').lower()
        content = tag.get('content', '')
        meta_data[f"http-equiv-{equiv_name}"] = content

    return meta_data

# Usage
meta_data = extract_meta_tags(soup)
print("Meta Description:", meta_data.get('description'))
print("Keywords:", meta_data.get('keywords'))
print("Robots:", meta_data.get('robots'))
Extracting Social Media Metadata
Open Graph Tags
Open Graph tags control how content appears when shared on social platforms:
def extract_open_graph(soup):
    """Extract Open Graph metadata"""
    og_data = {}
    og_tags = soup.find_all('meta', property=lambda x: x and x.startswith('og:'))
    for tag in og_tags:
        property_name = tag.get('property')
        content = tag.get('content', '')
        # Remove the 'og:' prefix for cleaner keys
        key = property_name.replace('og:', '')
        og_data[key] = content
    return og_data

# Usage
og_data = extract_open_graph(soup)
print("OG Title:", og_data.get('title'))
print("OG Description:", og_data.get('description'))
print("OG Image:", og_data.get('image'))
print("OG URL:", og_data.get('url'))
print("OG Type:", og_data.get('type'))
Twitter Card Tags
Extract Twitter-specific metadata:
def extract_twitter_cards(soup):
    """Extract Twitter Card metadata"""
    twitter_data = {}
    twitter_tags = soup.find_all('meta', attrs={'name': lambda x: x and x.startswith('twitter:')})
    for tag in twitter_tags:
        name = tag.get('name')
        content = tag.get('content', '')
        # Remove the 'twitter:' prefix for cleaner keys
        key = name.replace('twitter:', '')
        twitter_data[key] = content
    return twitter_data

# Usage
twitter_data = extract_twitter_cards(soup)
print("Twitter Card:", twitter_data.get('card'))
print("Twitter Site:", twitter_data.get('site'))
print("Twitter Creator:", twitter_data.get('creator'))
Extracting Link Tags and Resources
Link tags provide information about external resources and relationships:
def extract_link_tags(soup, base_url=None):
    """Extract link tag information"""
    links_data = {}

    # Extract canonical URL
    canonical = soup.find('link', rel='canonical')
    if canonical:
        canonical_url = canonical.get('href')
        if base_url and canonical_url:
            canonical_url = urljoin(base_url, canonical_url)
        links_data['canonical'] = canonical_url

    # Extract favicons
    favicon_selectors = [
        {'rel': 'icon'},
        {'rel': 'shortcut icon'},
        {'rel': 'apple-touch-icon'}
    ]
    favicons = []
    for selector in favicon_selectors:
        for tag in soup.find_all('link', selector):
            favicon_url = tag.get('href')
            if base_url and favicon_url:
                favicon_url = urljoin(base_url, favicon_url)
            favicons.append({
                'rel': tag.get('rel'),
                'href': favicon_url,
                'sizes': tag.get('sizes'),
                'type': tag.get('type')
            })
    links_data['favicons'] = favicons

    # Extract stylesheet links
    stylesheets = []
    for link in soup.find_all('link', rel='stylesheet'):
        stylesheet_url = link.get('href')
        if base_url and stylesheet_url:
            stylesheet_url = urljoin(base_url, stylesheet_url)
        stylesheets.append(stylesheet_url)
    links_data['stylesheets'] = stylesheets

    return links_data

# Usage
links_data = extract_link_tags(soup, base_url=url)
print("Canonical URL:", links_data.get('canonical'))
print("Favicons:", links_data.get('favicons'))
Comprehensive Metadata Extraction Class
Here's a complete class that combines all metadata extraction methods:
class MetadataExtractor:
    def __init__(self, url_or_soup, base_url=None):
        if isinstance(url_or_soup, str):
            self.soup = get_page_soup(url_or_soup)
            self.base_url = url_or_soup
        else:
            self.soup = url_or_soup
            self.base_url = base_url

    def extract_all(self):
        """Extract all metadata into a structured dictionary"""
        metadata = {
            'basic': self._extract_basic(),
            'meta_tags': self._extract_meta_tags(),
            'open_graph': self._extract_open_graph(),
            'twitter': self._extract_twitter_cards(),
            'links': self._extract_link_tags(),
            'schema': self._extract_schema_org()
        }
        return metadata

    def _extract_basic(self):
        """Extract basic metadata"""
        title_tag = self.soup.find('title')
        title = title_tag.get_text().strip() if title_tag else None
        return {
            'title': title,
            'lang': self.soup.html.get('lang') if self.soup.html else None
        }

    def _extract_meta_tags(self):
        """Extract name-based meta tags"""
        meta_data = {}
        for tag in self.soup.find_all('meta', attrs={'name': True}):
            name = tag.get('name').lower()
            content = tag.get('content', '')
            meta_data[name] = content
        return meta_data

    def _extract_open_graph(self):
        """Extract Open Graph tags"""
        og_data = {}
        og_tags = self.soup.find_all('meta', property=lambda x: x and x.startswith('og:'))
        for tag in og_tags:
            property_name = tag.get('property').replace('og:', '')
            content = tag.get('content', '')
            og_data[property_name] = content
        return og_data

    def _extract_twitter_cards(self):
        """Extract Twitter Card tags"""
        twitter_data = {}
        twitter_tags = self.soup.find_all('meta', attrs={'name': lambda x: x and x.startswith('twitter:')})
        for tag in twitter_tags:
            name = tag.get('name').replace('twitter:', '')
            content = tag.get('content', '')
            twitter_data[name] = content
        return twitter_data

    def _extract_link_tags(self):
        """Extract link tags"""
        links_data = {}
        # Canonical URL
        canonical = self.soup.find('link', rel='canonical')
        if canonical:
            canonical_url = canonical.get('href')
            if self.base_url and canonical_url:
                canonical_url = urljoin(self.base_url, canonical_url)
            links_data['canonical'] = canonical_url
        return links_data

    def _extract_schema_org(self):
        """Extract Schema.org JSON-LD structured data"""
        schema_data = []
        # Find JSON-LD scripts
        scripts = self.soup.find_all('script', type='application/ld+json')
        for script in scripts:
            try:
                data = json.loads(script.string)
                schema_data.append(data)
            except (json.JSONDecodeError, TypeError):
                # TypeError covers empty <script> tags, where .string is None
                continue
        return schema_data
# Usage example
extractor = MetadataExtractor("https://example.com")
all_metadata = extractor.extract_all()

print("=== BASIC METADATA ===")
print(f"Title: {all_metadata['basic']['title']}")
print(f"Language: {all_metadata['basic']['lang']}")

print("\n=== META TAGS ===")
for key, value in all_metadata['meta_tags'].items():
    print(f"{key}: {value}")

print("\n=== OPEN GRAPH ===")
for key, value in all_metadata['open_graph'].items():
    print(f"{key}: {value}")
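The JSON-LD extraction step can also be used on its own, without the full class. A minimal sketch against an inline snippet (the sample markup below is invented for illustration; note the deliberately malformed second block, which is skipped rather than failing the whole page):

```python
import json
from bs4 import BeautifulSoup

html = """
<head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Article", "headline": "Hello"}
</script>
<script type="application/ld+json">not valid json</script>
</head>
"""

soup = BeautifulSoup(html, 'html.parser')
schema_data = []
for script in soup.find_all('script', type='application/ld+json'):
    try:
        # script.string is None for empty tags; json.loads(None) raises TypeError
        schema_data.append(json.loads(script.string))
    except (json.JSONDecodeError, TypeError):
        continue  # skip malformed blocks

print(schema_data)  # [{'@context': 'https://schema.org', '@type': 'Article', 'headline': 'Hello'}]
```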
Advanced Extraction Techniques
Handling Multiple Values
Some meta tags can have multiple values or variations:
def extract_meta_variations(soup, name_variations):
    """Extract meta tag content, checking multiple name variations"""
    for variation in name_variations:
        tag = soup.find('meta', attrs={'name': variation}) or \
              soup.find('meta', attrs={'property': variation})
        if tag and tag.get('content'):
            return tag.get('content').strip()
    return None

# Example: extract a description from various possible meta tags
description_variations = [
    'description', 'og:description', 'twitter:description'
]
description = extract_meta_variations(soup, description_variations)
Error Handling and Validation
Implement robust error handling for production use:
def safe_extract_metadata(url, timeout=10):
    """Safely extract metadata with error handling"""
    try:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        }
        response = requests.get(url, headers=headers, timeout=timeout)
        response.raise_for_status()
        soup = BeautifulSoup(response.content, 'lxml')
        extractor = MetadataExtractor(soup, base_url=url)
        return extractor.extract_all()
    except requests.RequestException as e:
        print(f"Request error: {e}")
        return None
    except Exception as e:
        print(f"Parsing error: {e}")
        return None
Integration with Web Scraping Workflows
Beautiful Soup only sees the initial HTML response, not content injected by JavaScript after the page loads. For dynamic or JavaScript-rendered pages, render the page first with a headless browser such as Puppeteer, wait for any AJAX requests to finish, and then hand the rendered HTML to Beautiful Soup for metadata extraction. For comprehensive SEO analysis workflows, a headless browser can also gather performance metrics alongside the extracted metadata.
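Once the browser has rendered the page, the resulting HTML string can be handed straight to Beautiful Soup. A sketch of that handoff (the rendered_html string here is a hypothetical stand-in for real browser output, e.g. what Puppeteer's page.content() returns):

```python
from bs4 import BeautifulSoup

# Stand-in for fully rendered HTML returned by a headless browser
rendered_html = (
    "<html><head><title>Rendered Title</title>"
    "<meta name='description' content='Injected by JS'></head></html>"
)

soup = BeautifulSoup(rendered_html, 'html.parser')
print(soup.title.get_text())  # Rendered Title
```

The resulting soup object can then be passed to the MetadataExtractor class defined above, exactly as if it had come from requests.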
Best Practices and Performance Tips
1. Use Efficient Selectors
# Efficient: use a specific selector
meta_description = soup.find('meta', attrs={'name': 'description'})

# Less efficient: search all meta tags
all_meta = soup.find_all('meta')
for tag in all_meta:
    if tag.get('name') == 'description':
        meta_description = tag
2. Handle Missing Data Gracefully
def safe_get_content(tag):
    """Safely get content from a tag"""
    return tag.get('content', '').strip() if tag else None
3. Normalize and Clean Data
def clean_metadata(text):
    """Clean and normalize metadata text"""
    if not text:
        return None
    # Collapse newlines, tabs, and repeated whitespace into single spaces
    return ' '.join(text.split())
Conclusion
Beautiful Soup provides excellent capabilities for extracting HTML metadata, from basic title and meta tags to complex structured data. The techniques covered in this guide enable you to build robust metadata extraction systems for SEO analysis, content management, and web scraping applications.
Key takeaways:
- Use specific selectors for efficient metadata extraction
- Handle the different kinds of meta tags (name, property, http-equiv)
- Extract social media metadata (Open Graph, Twitter Cards)
- Implement error handling for production use
- Combine with other tools for JavaScript-heavy sites
- Clean and validate extracted data
This comprehensive approach ensures you can extract all types of HTML metadata reliably and efficiently using Beautiful Soup in your Python applications.