How Do I Extract Metadata from Web Pages Using Ruby?
Extracting metadata from web pages is a crucial skill for web scraping, SEO analysis, and content aggregation. Ruby provides excellent tools for parsing HTML and extracting various types of metadata including meta tags, Open Graph properties, Twitter Cards, and structured data. This comprehensive guide will show you how to efficiently extract metadata using Ruby's most popular HTML parsing library, Nokogiri.
What is Web Page Metadata?
Web page metadata consists of information about a webpage that is typically not visible to users but is essential for search engines, social media platforms, and other automated systems. Common types of metadata include the following (a short parsing sketch follows the list):
- HTML Meta Tags: Title, description, keywords, author
- Open Graph Tags: Social media sharing information
- Twitter Cards: Twitter-specific sharing metadata
- JSON-LD Structured Data: Schema.org markup
- Link Relations: Canonical URLs, alternate versions
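To make these concrete, here is a minimal sketch that parses a hand-written <head> fragment with Nokogiri (installed in the next section) and touches one example of each type; the HTML is invented purely for illustration:

require 'nokogiri'

html = <<~HTML
  <html lang="en">
    <head>
      <title>Sample Page</title>
      <meta name="description" content="A short summary of the page">
      <meta property="og:title" content="Sample Page">
      <meta name="twitter:card" content="summary">
      <script type="application/ld+json">{"@type": "Article"}</script>
      <link rel="canonical" href="https://example.com/sample">
    </head>
  </html>
HTML

doc = Nokogiri::HTML(html)
puts doc.at_css("meta[name='description']")['content']      # HTML meta tag
puts doc.at_css("meta[property='og:title']")['content']     # Open Graph
puts doc.at_css("meta[name='twitter:card']")['content']     # Twitter Card
puts doc.at_css("script[type='application/ld+json']").text  # JSON-LD
puts doc.at_css("link[rel='canonical']")['href']            # link relation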
Setting Up Your Ruby Environment
First, install Nokogiri for HTML parsing. The net/http, uri, and json libraries used throughout this guide ship with Ruby's standard library, so nothing extra is needed for them:
gem install nokogiri
Or add it to your Gemfile:
# Gemfile
gem 'nokogiri'
Basic HTML Meta Tag Extraction
Let's start with a simple example that extracts basic meta tags from a webpage:
require 'nokogiri'
require 'net/http'
require 'uri'

class MetadataExtractor
  def initialize(url)
    @url = url
    @doc = fetch_and_parse
  end

  def extract_basic_metadata
    {
      title: extract_title,
      description: extract_meta_content('description'),
      keywords: extract_meta_content('keywords'),
      author: extract_meta_content('author'),
      robots: extract_meta_content('robots'),
      viewport: extract_meta_content('viewport')
    }
  end

  private

  # Note: this simple fetch does not follow redirects; the robust
  # version later in this guide handles them.
  def fetch_and_parse
    uri = URI(@url)
    response = Net::HTTP.get_response(uri)
    Nokogiri::HTML(response.body)
  end

  def extract_title
    title_tag = @doc.at_css('title')
    title_tag ? title_tag.text.strip : nil
  end

  def extract_meta_content(name)
    meta_tag = @doc.at_css("meta[name='#{name}']")
    meta_tag ? meta_tag['content'] : nil
  end
end
# Usage
extractor = MetadataExtractor.new('https://example.com')
metadata = extractor.extract_basic_metadata
puts metadata
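Run against https://example.com, which serves little more than a title and a viewport tag at the time of writing, the output looks roughly like this (exact values may differ):

# {:title=>"Example Domain", :description=>nil, :keywords=>nil,
#  :author=>nil, :robots=>nil, :viewport=>"width=device-width, initial-scale=1"}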
Extracting Open Graph Metadata
Open Graph tags are essential for social media sharing. Here's how to extract them:
class MetadataExtractor
  def extract_open_graph
    og_data = {}
    @doc.css('meta[property^="og:"]').each do |meta|
      property = meta['property']
      content = meta['content']
      # Remove 'og:' prefix and use as key
      key = property.sub('og:', '').to_sym
      og_data[key] = content
    end
    og_data
  end

  def extract_specific_og_tags
    {
      og_title: extract_property_content('og:title'),
      og_description: extract_property_content('og:description'),
      og_image: extract_property_content('og:image'),
      og_url: extract_property_content('og:url'),
      og_type: extract_property_content('og:type'),
      og_site_name: extract_property_content('og:site_name')
    }
  end

  private

  def extract_property_content(property)
    meta_tag = @doc.at_css("meta[property='#{property}']")
    meta_tag ? meta_tag['content'] : nil
  end
end
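On a page that declares the usual Open Graph tags, extract_open_graph returns a hash keyed by the property suffix; a hypothetical result (values invented for illustration):

og = extractor.extract_open_graph
# => {:title=>"My Article", :type=>"article",
#     :image=>"https://example.com/cover.png", :site_name=>"Example Blog"}

Note that structured properties such as og:image:width come back under keys like :"image:width", since only the leading og: prefix is stripped.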
Extracting Twitter Card Metadata
Twitter Cards provide rich media experiences when URLs are shared on Twitter:
class MetadataExtractor
  def extract_twitter_cards
    twitter_data = {}
    @doc.css('meta[name^="twitter:"]').each do |meta|
      name = meta['name']
      content = meta['content']
      # Remove 'twitter:' prefix and use as key
      key = name.sub('twitter:', '').to_sym
      twitter_data[key] = content
    end
    twitter_data
  end

  def extract_specific_twitter_tags
    {
      twitter_card: extract_meta_content('twitter:card'),
      twitter_site: extract_meta_content('twitter:site'),
      twitter_creator: extract_meta_content('twitter:creator'),
      twitter_title: extract_meta_content('twitter:title'),
      twitter_description: extract_meta_content('twitter:description'),
      twitter_image: extract_meta_content('twitter:image')
    }
  end
end
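One caveat: some sites emit Twitter Card tags with a property attribute instead of name, mirroring the Open Graph convention. If you want to catch both variants, widen the selector inside extract_twitter_cards along these lines:

@doc.css('meta[name^="twitter:"], meta[property^="twitter:"]').each do |meta|
  key = (meta['name'] || meta['property']).sub('twitter:', '').to_sym
  twitter_data[key] = meta['content']
end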
Extracting JSON-LD Structured Data
JSON-LD is a popular format for structured data markup:
require 'json'
class MetadataExtractor
  def extract_json_ld
    json_ld_scripts = @doc.css('script[type="application/ld+json"]')
    structured_data = []
    json_ld_scripts.each do |script|
      begin
        data = JSON.parse(script.content)
        structured_data << data
      rescue JSON::ParserError => e
        # Report malformed JSON-LD on stderr and keep going
        warn "Error parsing JSON-LD: #{e.message}"
      end
    end
    structured_data
  end

  def extract_schema_org_data
    json_ld_data = extract_json_ld
    schema_data = {}
    json_ld_data.each do |data|
      if data.is_a?(Hash) && data['@type']
        schema_data[data['@type']] = data
      elsif data.is_a?(Array)
        data.each do |item|
          if item.is_a?(Hash) && item['@type']
            schema_data[item['@type']] = item
          end
        end
      end
    end
    schema_data
  end
end
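With extract_schema_org_data in place, pulling out a specific field is an ordinary hash lookup. For a hypothetical page that embeds a schema.org Article object:

schema = extractor.extract_schema_org_data
if (article = schema['Article'])
  puts article['headline']
  puts article['datePublished']
end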
Extracting Link Relations and Other Metadata
Link relations provide additional metadata about page relationships:
class MetadataExtractor
  def extract_link_relations
    links = {}
    @doc.css('link[rel]').each do |link|
      rel = link['rel']
      href = link['href']
      if links[rel]
        # Handle multiple links with the same rel
        links[rel] = [links[rel]] unless links[rel].is_a?(Array)
        links[rel] << href
      else
        links[rel] = href
      end
    end
    links
  end

  def extract_canonical_url
    canonical_link = @doc.at_css('link[rel="canonical"]')
    canonical_link ? canonical_link['href'] : nil
  end

  def extract_alternate_languages
    alternates = []
    @doc.css('link[rel="alternate"][hreflang]').each do |link|
      alternates << {
        url: link['href'],
        language: link['hreflang']
      }
    end
    alternates
  end
end
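Keep in mind that href values in link tags are often relative (for example /feed.xml). If you need absolute URLs, resolve them against the page URL; a minimal sketch (the absolutize helper is a hypothetical name, not part of the class above):

require 'uri'

def absolutize(href, base_url)
  return nil if href.nil? || href.empty?
  URI.join(base_url, href).to_s
rescue URI::InvalidURIError
  href # fall back to the raw value if it cannot be parsed
end

absolutize('/feed.xml', 'https://example.com/blog') # => "https://example.com/feed.xml"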
Complete Metadata Extraction Class
Here's a comprehensive class that combines all the extraction methods:
require 'nokogiri'
require 'net/http'
require 'uri'
require 'json'
class ComprehensiveMetadataExtractor
  attr_reader :url, :doc

  def initialize(url)
    @url = url
    @doc = fetch_and_parse
  end

  def extract_all_metadata
    {
      basic: extract_basic_metadata,
      open_graph: extract_open_graph,
      twitter: extract_twitter_cards,
      json_ld: extract_json_ld,
      links: extract_link_relations,
      images: extract_images,
      additional: extract_additional_metadata
    }
  rescue StandardError => e
    { error: "Failed to extract metadata: #{e.message}" }
  end

  def extract_images
    images = []
    @doc.css('img[src]').each do |img|
      images << {
        src: img['src'],
        alt: img['alt'],
        title: img['title']
      }
    end
    images
  end

  def extract_additional_metadata
    {
      charset: extract_charset,
      language: extract_language,
      generator: extract_meta_content('generator'),
      theme_color: extract_meta_content('theme-color'),
      manifest: extract_link_href('manifest')
    }
  end
  private

  def fetch_and_parse
    uri = URI(@url)
    http = Net::HTTP.new(uri.host, uri.port)
    http.use_ssl = true if uri.scheme == 'https'
    request = Net::HTTP::Get.new(uri)
    request['User-Agent'] = 'Mozilla/5.0 (compatible; MetadataExtractor/1.0)'
    response = http.request(request)
    if response.code.to_i == 200
      Nokogiri::HTML(response.body)
    else
      raise "HTTP Error: #{response.code} #{response.message}"
    end
  end

  def extract_title
    title_tag = @doc.at_css('title')
    title_tag ? title_tag.text.strip : nil
  end

  def extract_meta_content(name)
    # Check both name= and property= so the same helper works for
    # standard meta tags and property-style tags
    meta_tag = @doc.at_css("meta[name='#{name}'], meta[property='#{name}']")
    meta_tag ? meta_tag['content'] : nil
  end

  def extract_property_content(property)
    meta_tag = @doc.at_css("meta[property='#{property}']")
    meta_tag ? meta_tag['content'] : nil
  end

  def extract_link_href(rel)
    link_tag = @doc.at_css("link[rel='#{rel}']")
    link_tag ? link_tag['href'] : nil
  end

  def extract_charset
    charset_meta = @doc.at_css('meta[charset]')
    if charset_meta
      charset_meta['charset']
    else
      # Fall back to the legacy <meta http-equiv="Content-Type"> form
      http_equiv_meta = @doc.at_css('meta[http-equiv="Content-Type"]')
      if http_equiv_meta && http_equiv_meta['content']
        match = http_equiv_meta['content'].match(/charset=([^;]+)/)
        match ? match[1] : nil
      end
    end
  end

  def extract_language
    html_tag = @doc.at_css('html[lang]')
    html_tag ? html_tag['lang'] : nil
  end
  # Extraction methods carried over from the earlier examples
  def extract_basic_metadata
    {
      title: extract_title,
      description: extract_meta_content('description'),
      keywords: extract_meta_content('keywords'),
      author: extract_meta_content('author'),
      robots: extract_meta_content('robots'),
      viewport: extract_meta_content('viewport')
    }
  end

  def extract_open_graph
    og_data = {}
    @doc.css('meta[property^="og:"]').each do |meta|
      property = meta['property']
      content = meta['content']
      key = property.sub('og:', '').to_sym
      og_data[key] = content
    end
    og_data
  end

  def extract_twitter_cards
    twitter_data = {}
    @doc.css('meta[name^="twitter:"]').each do |meta|
      name = meta['name']
      content = meta['content']
      key = name.sub('twitter:', '').to_sym
      twitter_data[key] = content
    end
    twitter_data
  end

  def extract_json_ld
    json_ld_scripts = @doc.css('script[type="application/ld+json"]')
    structured_data = []
    json_ld_scripts.each do |script|
      begin
        data = JSON.parse(script.content)
        structured_data << data
      rescue JSON::ParserError => e
        warn "Error parsing JSON-LD: #{e.message}"
      end
    end
    structured_data
  end

  def extract_link_relations
    links = {}
    @doc.css('link[rel]').each do |link|
      rel = link['rel']
      href = link['href']
      if links[rel]
        links[rel] = [links[rel]] unless links[rel].is_a?(Array)
        links[rel] << href
      else
        links[rel] = href
      end
    end
    links
  end
end
Usage Examples
Here's how to use the comprehensive metadata extractor:
# Extract metadata from a webpage
extractor = ComprehensiveMetadataExtractor.new('https://example.com')
all_metadata = extractor.extract_all_metadata
# Print specific metadata
puts "Title: #{all_metadata[:basic][:title]}"
puts "Description: #{all_metadata[:basic][:description]}"
puts "Open Graph Image: #{all_metadata[:open_graph][:image]}"
# Extract metadata from multiple URLs
urls = ['https://example1.com', 'https://example2.com', 'https://example3.com']
urls.each do |url|
  extractor = ComprehensiveMetadataExtractor.new(url)
  metadata = extractor.extract_all_metadata
  puts "=== #{url} ==="
  puts "Title: #{metadata[:basic][:title]}"
  puts "Description: #{metadata[:basic][:description]}"
  puts "---"
end
Error Handling and Best Practices
When extracting metadata, it's important to handle errors gracefully:
class RobustMetadataExtractor < ComprehensiveMetadataExtractor
  MAX_REDIRECTS = 5

  def initialize(url, options = {})
    @url = url
    @timeout = options[:timeout] || 30
    @retries = options[:retries] || 3
    @redirects = 0
    @doc = fetch_and_parse_with_retry
  end

  private

  def fetch_and_parse_with_retry
    retries = @retries
    begin
      fetch_and_parse
    rescue StandardError => e
      retries -= 1
      if retries > 0
        sleep(1)
        retry
      else
        raise e
      end
    end
  end

  def fetch_and_parse
    uri = URI(@url)
    http = Net::HTTP.new(uri.host, uri.port)
    http.use_ssl = true if uri.scheme == 'https'
    http.read_timeout = @timeout
    http.open_timeout = @timeout
    request = Net::HTTP::Get.new(uri)
    request['User-Agent'] = 'Mozilla/5.0 (compatible; MetadataExtractor/1.0)'
    request['Accept'] = 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'
    response = http.request(request)
    case response.code.to_i
    when 200
      Nokogiri::HTML(response.body)
    when 301, 302, 303, 307, 308
      # Follow the redirect, resolving relative Location headers and
      # capping the chain length to avoid redirect loops
      location = response['location']
      raise "Redirect without location header" unless location
      @redirects += 1
      raise "Too many redirects" if @redirects > MAX_REDIRECTS
      @url = URI.join(@url, location).to_s
      fetch_and_parse
    else
      raise "HTTP Error: #{response.code} #{response.message}"
    end
  end
end
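Usage is the same as for the parent class, with optional timeout and retry settings:

extractor = RobustMetadataExtractor.new('https://example.com', timeout: 10, retries: 2)
metadata = extractor.extract_all_metadata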
Advanced Use Cases
For more complex scenarios, you might want to integrate metadata extraction with other web scraping techniques. JavaScript-heavy websites are the main pain point: if the metadata you need is injected into the page after the initial HTML response, a plain HTTP fetch will never see it, and you'll need a browser automation tool to render the page first.
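As a rough sketch of that approach, the snippet below renders a page with the Ferrum gem (a headless-Chrome driver for Ruby; gem install ferrum, and it assumes Chrome or Chromium is installed locally) and hands the rendered HTML to Nokogiri. Treat it as an illustration rather than a drop-in part of the extractor classes above:

require 'ferrum'
require 'nokogiri'

browser = Ferrum::Browser.new(timeout: 30)
browser.go_to('https://example.com')
# browser.body returns the DOM serialized after JavaScript has run
doc = Nokogiri::HTML(browser.body)
browser.quit

puts doc.at_css('title')&.text

From here you could adapt the extractor classes to accept a pre-parsed document instead of fetching over Net::HTTP.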
Performance Optimization
When extracting metadata from many pages, you can process them concurrently with the concurrent-ruby gem (gem install concurrent-ruby):
require 'concurrent'
class BatchMetadataExtractor
  def self.extract_from_urls(urls, max_threads: 5)
    pool = Concurrent::ThreadPoolExecutor.new(
      min_threads: 1,
      max_threads: max_threads
    )
    futures = urls.map do |url|
      Concurrent::Future.execute(executor: pool) do
        extractor = RobustMetadataExtractor.new(url)
        { url: url, metadata: extractor.extract_all_metadata }
      rescue StandardError => e
        { url: url, error: e.message }
      end
    end
    # Future#value blocks until the result for each URL is available
    results = futures.map(&:value)
    pool.shutdown
    pool.wait_for_termination
    results
  end
end
# Usage
urls = ['https://example1.com', 'https://example2.com', 'https://example3.com']
results = BatchMetadataExtractor.extract_from_urls(urls)
Conclusion
Ruby's Nokogiri library provides a powerful and flexible way to extract metadata from web pages. Whether you need basic meta tags, social media metadata, or structured data, the techniques shown in this guide will help you build robust metadata extraction tools. Remember to handle errors gracefully, respect website rate limits, and consider the legal implications of web scraping.
For websites with complex JavaScript-rendered content, you might need to combine these Ruby techniques with browser automation tools to ensure you're capturing all available metadata. This comprehensive approach will serve you well for SEO analysis, content aggregation, and web scraping projects.