How do I extract metadata from HTML documents using Nokogiri?
HTML metadata extraction is a crucial skill for web scraping, SEO analysis, and content processing. Nokogiri, Ruby's premier HTML/XML parsing library, provides powerful tools for pulling metadata out of HTML documents. This guide shows how to extract page titles, meta tags, Open Graph data, Twitter Cards, and structured data.
Understanding HTML Metadata
HTML metadata includes information that describes the document but isn't displayed as part of the main content. Common metadata types include:
- Page title (the <title> tag)
- Meta tags (description, keywords, viewport, etc.)
- Open Graph metadata (for social media sharing)
- Twitter Card metadata
- Structured data (JSON-LD, microdata)
- Link tags (canonical URLs, favicons, stylesheets)
Basic Setup and Installation
First, ensure you have Nokogiri installed:
gem install nokogiri
For a Gemfile:
gem 'nokogiri'
Basic setup for parsing HTML:
require 'nokogiri'
require 'open-uri'
# Parse HTML from a string
html_content = '<html><head><title>Example</title></head></html>'
doc = Nokogiri::HTML(html_content)
# Parse HTML from a URL
doc = Nokogiri::HTML(URI.open('https://example.com'))
# Parse HTML from a file
doc = Nokogiri::HTML(File.open('page.html'))
Extracting Basic Metadata
Page Title
The page title is one of the most important metadata elements:
require 'nokogiri'
html = <<-HTML
<!DOCTYPE html>
<html>
<head>
<title>Complete Guide to Web Scraping with Ruby</title>
</head>
<body>
<h1>Content here</h1>
</body>
</html>
HTML
doc = Nokogiri::HTML(html)
# Extract the title
title = doc.at('title')&.text&.strip
puts "Page Title: #{title}"
# Output: Page Title: Complete Guide to Web Scraping with Ruby
# Alternative method using CSS selector
title = doc.css('title').first&.text&.strip
Meta Tags
Meta tags provide essential information about the document:
html = <<-HTML
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<meta name="description" content="Learn web scraping techniques using Ruby and Nokogiri">
<meta name="keywords" content="ruby, nokogiri, web scraping, html parsing">
<meta name="author" content="John Doe">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta name="robots" content="index, follow">
</head>
</html>
HTML
doc = Nokogiri::HTML(html)
# Extract all meta tags
meta_tags = {}
doc.css('meta').each do |meta|
  name = meta['name'] || meta['property'] || meta['http-equiv']
  content = meta['content']
  if name && content
    meta_tags[name] = content
  end
end
puts "Meta Tags:"
meta_tags.each { |name, content| puts " #{name}: #{content}" }
# Extract specific meta tags
description = doc.at('meta[name="description"]')&.[]('content')
keywords = doc.at('meta[name="keywords"]')&.[]('content')
author = doc.at('meta[name="author"]')&.[]('content')
puts "\nSpecific Meta Tags:"
puts "Description: #{description}"
puts "Keywords: #{keywords}"
puts "Author: #{author}"
Extracting Social Media Metadata
Open Graph Data
Open Graph metadata is used by Facebook, LinkedIn, and other social platforms:
html = <<-HTML
<!DOCTYPE html>
<html>
<head>
<meta property="og:title" content="Amazing Web Scraping Tutorial">
<meta property="og:description" content="Learn advanced web scraping techniques">
<meta property="og:image" content="https://example.com/image.jpg">
<meta property="og:url" content="https://example.com/tutorial">
<meta property="og:type" content="article">
<meta property="og:site_name" content="Web Scraping Hub">
<meta property="article:author" content="Jane Smith">
<meta property="article:published_time" content="2024-01-15T10:00:00Z">
</head>
</html>
HTML
doc = Nokogiri::HTML(html)
# Extract Open Graph data
open_graph = {}
doc.css('meta[property^="og:"]').each do |meta|
  property = meta['property']
  content = meta['content']
  if property && content
    # Remove 'og:' prefix for cleaner keys
    key = property.sub(/^og:/, '')
    open_graph[key] = content
  end
end
puts "Open Graph Data:"
open_graph.each { |key, value| puts " #{key}: #{value}" }
# Extract article-specific metadata
article_meta = {}
doc.css('meta[property^="article:"]').each do |meta|
  property = meta['property']
  content = meta['content']
  if property && content
    key = property.sub(/^article:/, '')
    article_meta[key] = content
  end
end
puts "\nArticle Metadata:"
article_meta.each { |key, value| puts " #{key}: #{value}" }
Twitter Card Data
Twitter uses its own metadata format:
html = <<-HTML
<!DOCTYPE html>
<html>
<head>
<meta name="twitter:card" content="summary_large_image">
<meta name="twitter:site" content="@webscraping">
<meta name="twitter:creator" content="@johndoe">
<meta name="twitter:title" content="Master Web Scraping with Nokogiri">
<meta name="twitter:description" content="Complete guide to HTML parsing">
<meta name="twitter:image" content="https://example.com/twitter-image.jpg">
</head>
</html>
HTML
doc = Nokogiri::HTML(html)
# Extract Twitter Card data
twitter_meta = {}
doc.css('meta[name^="twitter:"]').each do |meta|
  name = meta['name']
  content = meta['content']
  if name && content
    key = name.sub(/^twitter:/, '')
    twitter_meta[key] = content
  end
end
puts "Twitter Card Data:"
twitter_meta.each { |key, value| puts " #{key}: #{value}" }
Extracting Structured Data
JSON-LD Structured Data
JSON-LD is a popular format for structured data:
require 'json'
html = <<-HTML
<!DOCTYPE html>
<html>
<head>
<script type="application/ld+json">
{
"@context": "https://schema.org",
"@type": "Article",
"headline": "Web Scraping Best Practices",
"author": {
"@type": "Person",
"name": "John Doe"
},
"datePublished": "2024-01-15",
"dateModified": "2024-01-20",
"publisher": {
"@type": "Organization",
"name": "Tech Blog"
}
}
</script>
</head>
</html>
HTML
doc = Nokogiri::HTML(html)
# Extract JSON-LD structured data
json_ld_scripts = doc.css('script[type="application/ld+json"]')
structured_data = []
json_ld_scripts.each do |script|
  begin
    data = JSON.parse(script.content)
    structured_data << data
  rescue JSON::ParserError => e
    puts "Error parsing JSON-LD: #{e.message}"
  end
end
puts "Structured Data (JSON-LD):"
structured_data.each_with_index do |data, index|
  puts "Script #{index + 1}:"
  puts JSON.pretty_generate(data)
end
Microdata Extraction
Microdata uses HTML attributes to embed structured data:
html = <<-HTML
<!DOCTYPE html>
<html>
<body>
<div itemscope itemtype="https://schema.org/Person">
<span itemprop="name">John Doe</span>
<span itemprop="jobTitle">Web Developer</span>
<div itemprop="address" itemscope itemtype="https://schema.org/PostalAddress">
<span itemprop="streetAddress">123 Main St</span>
<span itemprop="addressLocality">New York</span>
</div>
</div>
</body>
</html>
HTML
doc = Nokogiri::HTML(html)
# Extract microdata
microdata = []
# Note: nested itemscopes (like the address here) appear both as their own
# items and flattened into their parent's properties; a full microdata parser
# would track nesting.
doc.css('[itemscope]').each do |item|
  item_data = {
    type: item['itemtype'],
    properties: {}
  }
  item.css('[itemprop]').each do |prop|
    prop_name = prop['itemprop']
    prop_value = prop.text.strip
    item_data[:properties][prop_name] = prop_value
  end
  microdata << item_data
end
puts "Microdata:"
microdata.each_with_index do |data, index|
  puts "Item #{index + 1}:"
  puts "  Type: #{data[:type]}"
  puts "  Properties:"
  data[:properties].each { |key, value| puts "    #{key}: #{value}" }
end
Advanced Metadata Extraction
Comprehensive Metadata Extractor Class
Here's a complete class for extracting various types of metadata:
require 'nokogiri'
require 'json'
class MetadataExtractor
  def initialize(html_content)
    @doc = Nokogiri::HTML(html_content)
  end

  def extract_all
    {
      title: extract_title,
      meta_tags: extract_meta_tags,
      open_graph: extract_open_graph,
      twitter_cards: extract_twitter_cards,
      canonical_url: extract_canonical_url,
      structured_data: extract_structured_data,
      links: extract_links,
      favicon: extract_favicon
    }
  end

  private

  def extract_title
    @doc.at('title')&.text&.strip
  end

  def extract_meta_tags
    meta_tags = {}
    @doc.css('meta').each do |meta|
      name = meta['name'] || meta['property'] || meta['http-equiv']
      content = meta['content']
      if name && content
        meta_tags[name] = content
      end
    end
    meta_tags
  end

  def extract_open_graph
    og_data = {}
    @doc.css('meta[property^="og:"]').each do |meta|
      property = meta['property']
      content = meta['content']
      if property && content
        key = property.sub(/^og:/, '')
        og_data[key] = content
      end
    end
    og_data
  end

  def extract_twitter_cards
    twitter_data = {}
    @doc.css('meta[name^="twitter:"]').each do |meta|
      name = meta['name']
      content = meta['content']
      if name && content
        key = name.sub(/^twitter:/, '')
        twitter_data[key] = content
      end
    end
    twitter_data
  end

  def extract_canonical_url
    @doc.at('link[rel="canonical"]')&.[]('href')
  end

  def extract_structured_data
    structured_data = []
    @doc.css('script[type="application/ld+json"]').each do |script|
      begin
        data = JSON.parse(script.content)
        structured_data << data
      rescue JSON::ParserError
        # Skip invalid JSON
      end
    end
    structured_data
  end

  def extract_links
    links = {}
    @doc.css('link').each do |link|
      rel = link['rel']
      href = link['href']
      if rel && href
        links[rel] ||= []
        links[rel] << {
          href: href,
          type: link['type'],
          title: link['title']
        }.compact
      end
    end
    links
  end

  def extract_favicon
    favicon_link = @doc.at('link[rel="icon"], link[rel="shortcut icon"]')
    favicon_link&.[]('href')
  end
end
# Usage example
html_content = File.read('webpage.html') # or fetch from URL
extractor = MetadataExtractor.new(html_content)
metadata = extractor.extract_all
puts JSON.pretty_generate(metadata)
Error Handling and Best Practices
When extracting metadata, always implement proper error handling:
def safe_extract_metadata(html_content)
  doc = Nokogiri::HTML(html_content)
  metadata = {
    title: doc.at('title')&.text&.strip || 'No title found',
    description: doc.at('meta[name="description"]')&.[]('content') || 'No description found'
  }
  # Check for empty or missing content
  metadata.each do |key, value|
    if value.nil? || value.empty?
      puts "Warning: #{key} is empty or missing"
    end
  end
  metadata
rescue => e
  puts "Error extracting metadata: #{e.message}"
  {}
end
Performance Optimization
For large-scale metadata extraction, consider these optimizations:
# Use xpath for faster selections on large documents
title = doc.xpath('//title').first&.text&.strip
# Limit parsing to head section for metadata-only extraction
head_html = html_content.match(/<head.*?<\/head>/mi)&.[](0)
if head_html
  head_doc = Nokogiri::HTML::DocumentFragment.parse(head_html)
  # Extract metadata from head_doc
end
# Batch process multiple documents
def extract_metadata_batch(html_documents)
  html_documents.map do |html|
    MetadataExtractor.new(html).extract_all
  rescue => e # block-level rescue requires Ruby 2.6+
    puts "Error processing document: #{e.message}"
    nil
  end.compact
end
Integration with Web Scraping Workflows
While Nokogiri excels at parsing static HTML, it never executes JavaScript. For sites that render content dynamically, such as single page applications or pages that load data via AJAX, you may need a browser automation tool like Puppeteer to render the page first, then hand the resulting HTML to Nokogiri.
Real-World Use Cases
SEO Analysis
Extract metadata for SEO auditing:
def analyze_seo_metadata(url)
  html = URI.open(url).read
  doc = Nokogiri::HTML(html)
  seo_data = {
    title: doc.at('title')&.text&.strip,
    title_length: doc.at('title')&.text&.strip&.length,
    description: doc.at('meta[name="description"]')&.[]('content'),
    h1_tags: doc.css('h1').map(&:text),
    canonical: doc.at('link[rel="canonical"]')&.[]('href'),
    robots: doc.at('meta[name="robots"]')&.[]('content'),
    og_image: doc.at('meta[property="og:image"]')&.[]('content')
  }
  # Check for SEO issues
  issues = []
  issues << "Missing title" if seo_data[:title].nil?
  issues << "Title too long" if seo_data[:title_length] && seo_data[:title_length] > 60
  issues << "Missing description" if seo_data[:description].nil?
  issues << "Missing canonical URL" if seo_data[:canonical].nil?
  { metadata: seo_data, issues: issues }
end
Content Management
Extract metadata for content cataloging:
def catalog_content(html_files)
  catalog = []
  html_files.each do |file_path|
    html = File.read(file_path)
    extractor = MetadataExtractor.new(html)
    metadata = extractor.extract_all
    catalog << {
      file: file_path,
      title: metadata[:title],
      description: metadata[:meta_tags]['description'],
      last_modified: File.mtime(file_path),
      word_count: Nokogiri::HTML(html).text.split.size
    }
  end
  catalog
end
Common Pitfalls and Solutions
Handling Missing Metadata
Always use safe navigation and provide fallbacks:
# Safe extraction with fallbacks
title = doc.at('title')&.text&.strip ||
        doc.at('meta[property="og:title"]')&.[]('content') ||
        'Untitled Document'

description = doc.at('meta[name="description"]')&.[]('content') ||
              doc.at('meta[property="og:description"]')&.[]('content') ||
              doc.css('p').first&.text&.strip&.[](0, 160)
Character Encoding Issues
Handle encoding properly:
def parse_with_encoding(html_content)
  # Detect encoding from meta tag
  encoding = html_content.match(/<meta[^>]+charset=["']?([^"'>]+)/i)&.[](1)
  if encoding
    # Re-encode to UTF-8, replacing bytes that are invalid in the declared encoding
    html_content = html_content.force_encoding(encoding)
                               .encode('UTF-8', invalid: :replace, undef: :replace)
  end
  Nokogiri::HTML(html_content)
rescue ArgumentError
  # Unknown encoding name in the meta tag; fall back to Nokogiri's own detection
  Nokogiri::HTML(html_content)
end
Conclusion
Nokogiri provides comprehensive tools for extracting metadata from HTML documents. Whether you need basic page titles and descriptions or complex structured data, Nokogiri's CSS selectors and XPath support make metadata extraction straightforward and efficient. Remember to implement proper error handling, validate extracted data, and consider performance implications when processing large volumes of documents.
The techniques covered in this guide will help you build robust metadata extraction systems for SEO analysis, content management, social media optimization, and data mining applications. Practice with different HTML structures to become proficient in handling various metadata formats and edge cases.