How do I extract metadata from HTML documents using Nokogiri?
HTML metadata extraction is a crucial skill for web scraping, SEO analysis, and content processing. Nokogiri, Ruby's premier HTML/XML parsing library, provides powerful tools for pulling metadata out of HTML documents. This guide shows how to extract page titles, meta tags, Open Graph data, Twitter Cards, and structured data.
Understanding HTML Metadata
HTML metadata includes information that describes the document but isn't displayed as part of the main content. Common metadata types include:
- Page title (the <title> tag)
- Meta tags (description, keywords, viewport, etc.)
- Open Graph metadata (for social media sharing)
- Twitter Card metadata
- Structured data (JSON-LD, microdata)
- Link tags (canonical URLs, favicons, stylesheets)
Basic Setup and Installation
First, ensure you have Nokogiri installed:
gem install nokogiri
For a Gemfile:
gem 'nokogiri'
Basic setup for parsing HTML:
require 'nokogiri'
require 'open-uri'
# Parse HTML from a string
html_content = '<html><head><title>Example</title></head></html>'
doc = Nokogiri::HTML(html_content)
# Parse HTML from a URL
doc = Nokogiri::HTML(URI.open('https://example.com'))
# Parse HTML from a file
doc = Nokogiri::HTML(File.open('page.html'))
Extracting Basic Metadata
Page Title
The page title is one of the most important metadata elements:
require 'nokogiri'
html = <<-HTML
<!DOCTYPE html>
<html>
<head>
<title>Complete Guide to Web Scraping with Ruby</title>
</head>
<body>
<h1>Content here</h1>
</body>
</html>
HTML
doc = Nokogiri::HTML(html)
# Extract the title
title = doc.at('title')&.text&.strip
puts "Page Title: #{title}"
# Output: Page Title: Complete Guide to Web Scraping with Ruby
# Alternative method using CSS selector
title = doc.css('title').first&.text&.strip
Meta Tags
Meta tags provide essential information about the document:
html = <<-HTML
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<meta name="description" content="Learn web scraping techniques using Ruby and Nokogiri">
<meta name="keywords" content="ruby, nokogiri, web scraping, html parsing">
<meta name="author" content="John Doe">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta name="robots" content="index, follow">
</head>
</html>
HTML
doc = Nokogiri::HTML(html)
# Extract all meta tags
meta_tags = {}
doc.css('meta').each do |meta|
  name = meta['name'] || meta['property'] || meta['http-equiv']
  content = meta['content']
  if name && content
    meta_tags[name] = content
  end
end
puts "Meta Tags:"
meta_tags.each { |name, content| puts " #{name}: #{content}" }
# Extract specific meta tags
description = doc.at('meta[name="description"]')&.[]('content')
keywords = doc.at('meta[name="keywords"]')&.[]('content')
author = doc.at('meta[name="author"]')&.[]('content')
puts "\nSpecific Meta Tags:"
puts "Description: #{description}"
puts "Keywords: #{keywords}"
puts "Author: #{author}"
Extracting Social Media Metadata
Open Graph Data
Open Graph metadata is used by Facebook, LinkedIn, and other social platforms:
html = <<-HTML
<!DOCTYPE html>
<html>
<head>
<meta property="og:title" content="Amazing Web Scraping Tutorial">
<meta property="og:description" content="Learn advanced web scraping techniques">
<meta property="og:image" content="https://example.com/image.jpg">
<meta property="og:url" content="https://example.com/tutorial">
<meta property="og:type" content="article">
<meta property="og:site_name" content="Web Scraping Hub">
<meta property="article:author" content="Jane Smith">
<meta property="article:published_time" content="2024-01-15T10:00:00Z">
</head>
</html>
HTML
doc = Nokogiri::HTML(html)
# Extract Open Graph data
open_graph = {}
doc.css('meta[property^="og:"]').each do |meta|
  property = meta['property']
  content = meta['content']
  if property && content
    # Remove 'og:' prefix for cleaner keys
    key = property.sub(/^og:/, '')
    open_graph[key] = content
  end
end
puts "Open Graph Data:"
open_graph.each { |key, value| puts " #{key}: #{value}" }
# Extract article-specific metadata
article_meta = {}
doc.css('meta[property^="article:"]').each do |meta|
  property = meta['property']
  content = meta['content']
  if property && content
    key = property.sub(/^article:/, '')
    article_meta[key] = content
  end
end
puts "\nArticle Metadata:"
article_meta.each { |key, value| puts " #{key}: #{value}" }
Twitter Card Data
Twitter uses its own metadata format:
html = <<-HTML
<!DOCTYPE html>
<html>
<head>
<meta name="twitter:card" content="summary_large_image">
<meta name="twitter:site" content="@webscraping">
<meta name="twitter:creator" content="@johndoe">
<meta name="twitter:title" content="Master Web Scraping with Nokogiri">
<meta name="twitter:description" content="Complete guide to HTML parsing">
<meta name="twitter:image" content="https://example.com/twitter-image.jpg">
</head>
</html>
HTML
doc = Nokogiri::HTML(html)
# Extract Twitter Card data
twitter_meta = {}
doc.css('meta[name^="twitter:"]').each do |meta|
  name = meta['name']
  content = meta['content']
  if name && content
    key = name.sub(/^twitter:/, '')
    twitter_meta[key] = content
  end
end
puts "Twitter Card Data:"
twitter_meta.each { |key, value| puts " #{key}: #{value}" }
Extracting Structured Data
JSON-LD Structured Data
JSON-LD is a popular format for structured data:
require 'json'
html = <<-HTML
<!DOCTYPE html>
<html>
<head>
<script type="application/ld+json">
{
"@context": "https://schema.org",
"@type": "Article",
"headline": "Web Scraping Best Practices",
"author": {
"@type": "Person",
"name": "John Doe"
},
"datePublished": "2024-01-15",
"dateModified": "2024-01-20",
"publisher": {
"@type": "Organization",
"name": "Tech Blog"
}
}
</script>
</head>
</html>
HTML
doc = Nokogiri::HTML(html)
# Extract JSON-LD structured data
json_ld_scripts = doc.css('script[type="application/ld+json"]')
structured_data = []
json_ld_scripts.each do |script|
  begin
    data = JSON.parse(script.content)
    structured_data << data
  rescue JSON::ParserError => e
    puts "Error parsing JSON-LD: #{e.message}"
  end
end
puts "Structured Data (JSON-LD):"
structured_data.each_with_index do |data, index|
  puts "Script #{index + 1}:"
  puts JSON.pretty_generate(data)
end
Microdata Extraction
Microdata uses HTML attributes to embed structured data:
html = <<-HTML
<!DOCTYPE html>
<html>
<body>
<div itemscope itemtype="https://schema.org/Person">
<span itemprop="name">John Doe</span>
<span itemprop="jobTitle">Web Developer</span>
<div itemprop="address" itemscope itemtype="https://schema.org/PostalAddress">
<span itemprop="streetAddress">123 Main St</span>
<span itemprop="addressLocality">New York</span>
</div>
</div>
</body>
</html>
HTML
doc = Nokogiri::HTML(html)
# Extract microdata
microdata = []
# Note: nested itemscopes (like the address here) appear both as their own
# items and flattened into their parent's properties; a full microdata parser
# would track nesting.
doc.css('[itemscope]').each do |item|
  item_data = {
    type: item['itemtype'],
    properties: {}
  }
  item.css('[itemprop]').each do |prop|
    prop_name = prop['itemprop']
    prop_value = prop.text.strip
    item_data[:properties][prop_name] = prop_value
  end
  microdata << item_data
end
puts "Microdata:"
microdata.each_with_index do |data, index|
  puts "Item #{index + 1}:"
  puts "  Type: #{data[:type]}"
  puts "  Properties:"
  data[:properties].each { |key, value| puts "    #{key}: #{value}" }
end
Advanced Metadata Extraction
Comprehensive Metadata Extractor Class
Here's a complete class for extracting various types of metadata:
require 'nokogiri'
require 'json'
class MetadataExtractor
  def initialize(html_content)
    @doc = Nokogiri::HTML(html_content)
  end

  def extract_all
    {
      title: extract_title,
      meta_tags: extract_meta_tags,
      open_graph: extract_open_graph,
      twitter_cards: extract_twitter_cards,
      canonical_url: extract_canonical_url,
      structured_data: extract_structured_data,
      links: extract_links,
      favicon: extract_favicon
    }
  end

  private

  def extract_title
    @doc.at('title')&.text&.strip
  end

  def extract_meta_tags
    meta_tags = {}
    @doc.css('meta').each do |meta|
      name = meta['name'] || meta['property'] || meta['http-equiv']
      content = meta['content']
      if name && content
        meta_tags[name] = content
      end
    end
    meta_tags
  end

  def extract_open_graph
    og_data = {}
    @doc.css('meta[property^="og:"]').each do |meta|
      property = meta['property']
      content = meta['content']
      if property && content
        key = property.sub(/^og:/, '')
        og_data[key] = content
      end
    end
    og_data
  end

  def extract_twitter_cards
    twitter_data = {}
    @doc.css('meta[name^="twitter:"]').each do |meta|
      name = meta['name']
      content = meta['content']
      if name && content
        key = name.sub(/^twitter:/, '')
        twitter_data[key] = content
      end
    end
    twitter_data
  end

  def extract_canonical_url
    @doc.at('link[rel="canonical"]')&.[]('href')
  end

  def extract_structured_data
    structured_data = []
    @doc.css('script[type="application/ld+json"]').each do |script|
      begin
        data = JSON.parse(script.content)
        structured_data << data
      rescue JSON::ParserError
        # Skip invalid JSON
      end
    end
    structured_data
  end

  def extract_links
    links = {}
    @doc.css('link').each do |link|
      rel = link['rel']
      href = link['href']
      if rel && href
        links[rel] ||= []
        links[rel] << {
          href: href,
          type: link['type'],
          title: link['title']
        }.compact
      end
    end
    links
  end

  def extract_favicon
    favicon_link = @doc.at('link[rel="icon"], link[rel="shortcut icon"]')
    favicon_link&.[]('href')
  end
end
# Usage example
html_content = File.read('webpage.html') # or fetch from URL
extractor = MetadataExtractor.new(html_content)
metadata = extractor.extract_all
puts JSON.pretty_generate(metadata)
Error Handling and Best Practices
When extracting metadata, always implement proper error handling:
def safe_extract_metadata(html_content)
  doc = Nokogiri::HTML(html_content)
  metadata = {
    title: doc.at('title')&.text&.strip || 'No title found',
    description: doc.at('meta[name="description"]')&.[]('content') || 'No description found'
  }
  # Check for empty or missing content
  metadata.each do |key, value|
    if value.nil? || value.empty?
      puts "Warning: #{key} is empty or missing"
    end
  end
  metadata
rescue => e
  puts "Error extracting metadata: #{e.message}"
  {}
end
Performance Optimization
For large-scale metadata extraction, consider these optimizations:
# Use xpath for faster selections on large documents
title = doc.xpath('//title').first&.text&.strip
# Limit parsing to head section for metadata-only extraction
head_html = html_content.match(/<head.*?<\/head>/mi)&.[](0)
if head_html
  head_doc = Nokogiri::HTML::DocumentFragment.parse(head_html)
  # Extract metadata from head_doc
end
# Batch process multiple documents
def extract_metadata_batch(html_documents)
  html_documents.map do |html|
    MetadataExtractor.new(html).extract_all
  rescue => e # block-level rescue requires Ruby 2.6+
    puts "Error processing document: #{e.message}"
    nil
  end.compact
end
Integration with Web Scraping Workflows
While Nokogiri excels at parsing static HTML, it never executes JavaScript. For sites that render content dynamically, such as single page applications or pages that load data via AJAX, you may need a browser automation tool like Puppeteer to render the page first, then hand the resulting HTML to Nokogiri.
Real-World Use Cases
SEO Analysis
Extract metadata for SEO auditing:
def analyze_seo_metadata(url)
  html = URI.open(url).read
  doc = Nokogiri::HTML(html)
  seo_data = {
    title: doc.at('title')&.text&.strip,
    title_length: doc.at('title')&.text&.strip&.length,
    description: doc.at('meta[name="description"]')&.[]('content'),
    h1_tags: doc.css('h1').map(&:text),
    canonical: doc.at('link[rel="canonical"]')&.[]('href'),
    robots: doc.at('meta[name="robots"]')&.[]('content'),
    og_image: doc.at('meta[property="og:image"]')&.[]('content')
  }
  # Check for SEO issues
  issues = []
  issues << "Missing title" if seo_data[:title].nil?
  issues << "Title too long" if seo_data[:title_length] && seo_data[:title_length] > 60
  issues << "Missing description" if seo_data[:description].nil?
  issues << "Missing canonical URL" if seo_data[:canonical].nil?
  { metadata: seo_data, issues: issues }
end
Content Management
Extract metadata for content cataloging:
def catalog_content(html_files)
  catalog = []
  html_files.each do |file_path|
    html = File.read(file_path)
    extractor = MetadataExtractor.new(html)
    metadata = extractor.extract_all
    catalog << {
      file: file_path,
      title: metadata[:title],
      description: metadata[:meta_tags]['description'],
      last_modified: File.mtime(file_path),
      word_count: Nokogiri::HTML(html).text.split.size
    }
  end
  catalog
end
Common Pitfalls and Solutions
Handling Missing Metadata
Always use safe navigation and provide fallbacks:
# Safe extraction with fallbacks
title = doc.at('title')&.text&.strip ||
        doc.at('meta[property="og:title"]')&.[]('content') ||
        'Untitled Document'

description = doc.at('meta[name="description"]')&.[]('content') ||
              doc.at('meta[property="og:description"]')&.[]('content') ||
              doc.css('p').first&.text&.strip&.[](0, 160)
Character Encoding Issues
Handle encoding properly:
def parse_with_encoding(html_content)
  # Detect encoding from meta tag
  encoding = html_content.match(/<meta[^>]+charset=["']?([^"'>]+)/i)&.[](1)
  if encoding
    # Re-encode to UTF-8, replacing bytes that are invalid in the declared encoding
    html_content = html_content.force_encoding(encoding)
                               .encode('UTF-8', invalid: :replace, undef: :replace)
  end
  Nokogiri::HTML(html_content)
rescue ArgumentError
  # Unknown encoding name in the meta tag; fall back to Nokogiri's own detection
  Nokogiri::HTML(html_content)
end
Conclusion
Nokogiri provides comprehensive tools for extracting metadata from HTML documents. Whether you need basic page titles and descriptions or complex structured data, Nokogiri's CSS selectors and XPath support make metadata extraction straightforward and efficient. Remember to implement proper error handling, validate extracted data, and consider performance implications when processing large volumes of documents.
The techniques covered in this guide will help you build robust metadata extraction systems for SEO analysis, content management, social media optimization, and data mining applications. Practice with different HTML structures to become proficient in handling various metadata formats and edge cases.