How do I Extract Metadata from Web Pages Using Ruby?

Extracting metadata from web pages is a crucial skill for web scraping, SEO analysis, and content aggregation. Ruby provides excellent tools for parsing HTML and extracting various types of metadata including meta tags, Open Graph properties, Twitter Cards, and structured data. This comprehensive guide will show you how to efficiently extract metadata using Ruby's most popular HTML parsing library, Nokogiri.

What is Web Page Metadata?

Web page metadata consists of information about a webpage that is typically not visible to users but is essential for search engines, social media platforms, and other automated systems. Common types of metadata include the following (a short Nokogiri sketch after this list shows one example of each):

  • HTML Meta Tags: Title, description, keywords, author
  • Open Graph Tags: Social media sharing information
  • Twitter Cards: Twitter-specific sharing metadata
  • JSON-LD Structured Data: Schema.org markup
  • Link Relations: Canonical URLs, alternate versions
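
To make this concrete, here is a minimal, self-contained sketch that locates one instance of each type with Nokogiri (the HTML fragment and its values are purely illustrative):

require 'nokogiri'

# Illustrative HTML fragment containing one example of each metadata type
html = <<~HTML
  <html lang="en">
    <head>
      <title>Example Page</title>
      <meta name="description" content="A sample page">
      <meta property="og:title" content="Example Page">
      <meta name="twitter:card" content="summary">
      <script type="application/ld+json">{"@type": "Article"}</script>
      <link rel="canonical" href="https://example.com/page">
    </head>
  </html>
HTML

doc = Nokogiri::HTML(html)
puts doc.at_css('title').text                                  # HTML title tag
puts doc.at_css("meta[name='description']")['content']         # meta tag
puts doc.at_css("meta[property='og:title']")['content']        # Open Graph
puts doc.at_css("meta[name='twitter:card']")['content']        # Twitter Card
puts doc.at_css('script[type="application/ld+json"]').content  # JSON-LD payload
puts doc.at_css('link[rel="canonical"]')['href']               # link relation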

Setting Up Your Ruby Environment

First, install Nokogiri for HTML parsing. The net/http and uri libraries ship with Ruby's standard library, so there is nothing extra to install for them:

gem install nokogiri

Or add it to your Gemfile:

# Gemfile
gem 'nokogiri'
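
You can confirm the installation with a quick one-liner:

ruby -e "require 'nokogiri'; puts Nokogiri::VERSION"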

Basic HTML Meta Tag Extraction

Let's start with a simple example that extracts basic meta tags from a webpage:

require 'nokogiri'
require 'net/http'
require 'uri'

class MetadataExtractor
  def initialize(url)
    @url = url
    @doc = fetch_and_parse
  end

  def extract_basic_metadata
    {
      title: extract_title,
      description: extract_meta_content('description'),
      keywords: extract_meta_content('keywords'),
      author: extract_meta_content('author'),
      robots: extract_meta_content('robots'),
      viewport: extract_meta_content('viewport')
    }
  end

  private

  def fetch_and_parse
    uri = URI(@url)
    response = Net::HTTP.get_response(uri)
    Nokogiri::HTML(response.body)
  end

  def extract_title
    title_tag = @doc.at_css('title')
    title_tag ? title_tag.text.strip : nil
  end

  def extract_meta_content(name)
    meta_tag = @doc.at_css("meta[name='#{name}']")
    meta_tag ? meta_tag['content'] : nil
  end
end

# Usage
extractor = MetadataExtractor.new('https://example.com')
metadata = extractor.extract_basic_metadata
puts metadata

Extracting Open Graph Metadata

Open Graph tags are essential for social media sharing. Here's how to extract them:

class MetadataExtractor
  def extract_open_graph
    og_data = {}

    @doc.css('meta[property^="og:"]').each do |meta|
      property = meta['property']
      content = meta['content']

      # Remove 'og:' prefix and use as key
      key = property.sub('og:', '').to_sym
      og_data[key] = content
    end

    og_data
  end

  def extract_specific_og_tags
    {
      og_title: extract_property_content('og:title'),
      og_description: extract_property_content('og:description'),
      og_image: extract_property_content('og:image'),
      og_url: extract_property_content('og:url'),
      og_type: extract_property_content('og:type'),
      og_site_name: extract_property_content('og:site_name')
    }
  end

  private

  def extract_property_content(property)
    meta_tag = @doc.at_css("meta[property='#{property}']")
    meta_tag ? meta_tag['content'] : nil
  end
end

Extracting Twitter Card Metadata

Twitter Cards provide rich media experiences when URLs are shared on Twitter:

class MetadataExtractor
  def extract_twitter_cards
    twitter_data = {}

    @doc.css('meta[name^="twitter:"]').each do |meta|
      name = meta['name']
      content = meta['content']

      # Remove 'twitter:' prefix and use as key
      key = name.sub('twitter:', '').to_sym
      twitter_data[key] = content
    end

    twitter_data
  end

  def extract_specific_twitter_tags
    {
      twitter_card: extract_meta_content('twitter:card'),
      twitter_site: extract_meta_content('twitter:site'),
      twitter_creator: extract_meta_content('twitter:creator'),
      twitter_title: extract_meta_content('twitter:title'),
      twitter_description: extract_meta_content('twitter:description'),
      twitter_image: extract_meta_content('twitter:image')
    }
  end
end

Extracting JSON-LD Structured Data

JSON-LD is a popular format for structured data markup:

require 'json'

class MetadataExtractor
  def extract_json_ld
    json_ld_scripts = @doc.css('script[type="application/ld+json"]')
    structured_data = []

    json_ld_scripts.each do |script|
      begin
        data = JSON.parse(script.content)
        structured_data << data
      rescue JSON::ParserError => e
        puts "Error parsing JSON-LD: #{e.message}"
      end
    end

    structured_data
  end

  def extract_schema_org_data
    json_ld_data = extract_json_ld
    schema_data = {}

    # Note: a later item with the same @type overwrites an earlier one
    json_ld_data.each do |data|
      if data.is_a?(Hash) && data['@type']
        schema_data[data['@type']] = data
      elsif data.is_a?(Array)
        data.each do |item|
          if item.is_a?(Hash) && item['@type']
            schema_data[item['@type']] = item
          end
        end
      end
    end

    schema_data
  end
end

Extracting Link Relations and Other Metadata

Link relations provide additional metadata about page relationships:

class MetadataExtractor
  def extract_link_relations
    links = {}

    @doc.css('link[rel]').each do |link|
      rel = link['rel']
      href = link['href']

      if links[rel]
        # Handle multiple links with same rel
        links[rel] = [links[rel]] unless links[rel].is_a?(Array)
        links[rel] << href
      else
        links[rel] = href
      end
    end

    links
  end

  def extract_canonical_url
    canonical_link = @doc.at_css('link[rel="canonical"]')
    canonical_link ? canonical_link['href'] : nil
  end

  def extract_alternate_languages
    alternates = []

    @doc.css('link[rel="alternate"][hreflang]').each do |link|
      alternates << {
        url: link['href'],
        language: link['hreflang']
      }
    end

    alternates
  end
end

Complete Metadata Extraction Class

Here's a comprehensive class that combines all the extraction methods:

require 'nokogiri'
require 'net/http'
require 'uri'
require 'json'

class ComprehensiveMetadataExtractor
  attr_reader :url, :doc

  def initialize(url)
    @url = url
    @doc = fetch_and_parse
  end

  def extract_all_metadata
    {
      basic: extract_basic_metadata,
      open_graph: extract_open_graph,
      twitter: extract_twitter_cards,
      json_ld: extract_json_ld,
      links: extract_link_relations,
      images: extract_images,
      additional: extract_additional_metadata
    }
  rescue StandardError => e
    { error: "Failed to extract metadata: #{e.message}" }
  end

  def extract_images
    images = []

    @doc.css('img[src]').each do |img|
      images << {
        src: img['src'],
        alt: img['alt'],
        title: img['title']
      }
    end

    images
  end

  def extract_additional_metadata
    {
      charset: extract_charset,
      language: extract_language,
      generator: extract_meta_content('generator'),
      theme_color: extract_meta_content('theme-color'),
      manifest: extract_link_href('manifest')
    }
  end

  private

  def fetch_and_parse
    uri = URI(@url)
    http = Net::HTTP.new(uri.host, uri.port)
    http.use_ssl = true if uri.scheme == 'https'

    request = Net::HTTP::Get.new(uri)
    request['User-Agent'] = 'Mozilla/5.0 (compatible; MetadataExtractor/1.0)'

    response = http.request(request)

    if response.code.to_i == 200
      Nokogiri::HTML(response.body)
    else
      raise "HTTP Error: #{response.code} #{response.message}"
    end
  end

  def extract_title
    title_tag = @doc.at_css('title')
    title_tag ? title_tag.text.strip : nil
  end

  def extract_meta_content(name)
    meta_tag = @doc.at_css("meta[name='#{name}'], meta[property='#{name}']")
    meta_tag ? meta_tag['content'] : nil
  end

  def extract_property_content(property)
    meta_tag = @doc.at_css("meta[property='#{property}']")
    meta_tag ? meta_tag['content'] : nil
  end

  def extract_link_href(rel)
    link_tag = @doc.at_css("link[rel='#{rel}']")
    link_tag ? link_tag['href'] : nil
  end

  def extract_charset
    charset_meta = @doc.at_css('meta[charset]')
    if charset_meta
      charset_meta['charset']
    else
      http_equiv_meta = @doc.at_css('meta[http-equiv="Content-Type"]')
      if http_equiv_meta && http_equiv_meta['content']
        match = http_equiv_meta['content'].match(/charset=([^;]+)/)
        match ? match[1] : nil
      end
    end
  end

  def extract_language
    html_tag = @doc.at_css('html[lang]')
    html_tag ? html_tag['lang'] : nil
  end

  # The same extraction methods shown in the earlier sections:
  def extract_basic_metadata
    {
      title: extract_title,
      description: extract_meta_content('description'),
      keywords: extract_meta_content('keywords'),
      author: extract_meta_content('author'),
      robots: extract_meta_content('robots'),
      viewport: extract_meta_content('viewport')
    }
  end

  def extract_open_graph
    og_data = {}

    @doc.css('meta[property^="og:"]').each do |meta|
      property = meta['property']
      content = meta['content']
      key = property.sub('og:', '').to_sym
      og_data[key] = content
    end

    og_data
  end

  def extract_twitter_cards
    twitter_data = {}

    @doc.css('meta[name^="twitter:"]').each do |meta|
      name = meta['name']
      content = meta['content']
      key = name.sub('twitter:', '').to_sym
      twitter_data[key] = content
    end

    twitter_data
  end

  def extract_json_ld
    json_ld_scripts = @doc.css('script[type="application/ld+json"]')
    structured_data = []

    json_ld_scripts.each do |script|
      begin
        data = JSON.parse(script.content)
        structured_data << data
      rescue JSON::ParserError => e
        puts "Error parsing JSON-LD: #{e.message}"
      end
    end

    structured_data
  end

  def extract_link_relations
    links = {}

    @doc.css('link[rel]').each do |link|
      rel = link['rel']
      href = link['href']

      if links[rel]
        links[rel] = [links[rel]] unless links[rel].is_a?(Array)
        links[rel] << href
      else
        links[rel] = href
      end
    end

    links
  end
end

Usage Examples

Here's how to use the comprehensive metadata extractor:

# Extract metadata from a webpage
extractor = ComprehensiveMetadataExtractor.new('https://example.com')
all_metadata = extractor.extract_all_metadata

# Print specific metadata
puts "Title: #{all_metadata[:basic][:title]}"
puts "Description: #{all_metadata[:basic][:description]}"
puts "Open Graph Image: #{all_metadata[:open_graph][:image]}"

# Extract metadata from multiple URLs
urls = ['https://example1.com', 'https://example2.com', 'https://example3.com']

urls.each do |url|
  extractor = ComprehensiveMetadataExtractor.new(url)
  metadata = extractor.extract_all_metadata

  puts "=== #{url} ==="
  puts "Title: #{metadata[:basic][:title]}"
  puts "Description: #{metadata[:basic][:description]}"
  puts "---"
end

Error Handling and Best Practices

When extracting metadata, it's important to handle errors gracefully:

class RobustMetadataExtractor < ComprehensiveMetadataExtractor
  def initialize(url, options = {})
    @url = url
    @timeout = options[:timeout] || 30
    @retries = options[:retries] || 3
    @doc = fetch_and_parse_with_retry
  end

  private

  def fetch_and_parse_with_retry
    retries = @retries

    begin
      fetch_and_parse
    rescue StandardError => e
      retries -= 1
      if retries > 0
        sleep(1)
        retry
      else
        raise e
      end
    end
  end

  def fetch_and_parse(redirect_limit = 5)
    raise "Too many redirects" if redirect_limit.zero?

    uri = URI(@url)
    http = Net::HTTP.new(uri.host, uri.port)
    http.use_ssl = true if uri.scheme == 'https'
    http.read_timeout = @timeout
    http.open_timeout = @timeout

    request = Net::HTTP::Get.new(uri)
    request['User-Agent'] = 'Mozilla/5.0 (compatible; MetadataExtractor/1.0)'
    request['Accept'] = 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'

    response = http.request(request)

    case response.code.to_i
    when 200
      Nokogiri::HTML(response.body)
    when 301, 302, 303, 307, 308
      # Follow the redirect, resolving relative Location headers against the current URL
      location = response['location']
      if location
        @url = URI.join(@url, location).to_s
        fetch_and_parse(redirect_limit - 1)
      else
        raise "Redirect without location header"
      end
    else
      raise "HTTP Error: #{response.code} #{response.message}"
    end
  end
end
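
Beyond retries, respecting rate limits is part of scraping responsibly. Here is a minimal throttling sketch (the helper name and the one-second default delay are assumptions; tune the delay per site):

# Hypothetical helper: process URLs sequentially with a fixed pause between them
def extract_politely(urls, delay: 1.0)
  urls.map do |url|
    begin
      { url: url, metadata: RobustMetadataExtractor.new(url).extract_all_metadata }
    rescue StandardError => e
      { url: url, error: e.message }
    ensure
      sleep(delay) # pause between requests to avoid overloading the server
    end
  end
end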

Advanced Use Cases

For more complex scenarios, you might want to integrate metadata extraction with other web scraping techniques. When a site injects its metadata with JavaScript after the initial page load, plain Net::HTTP only sees the pre-rendered HTML, so consider combining Ruby metadata extraction with a browser automation tool that executes the page's JavaScript first.
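
For example, here is a minimal sketch using the ferrum gem (a headless Chrome driver; this assumes ferrum is installed and a Chrome/Chromium binary is available) to render the page before handing the HTML to Nokogiri:

require 'ferrum'   # gem install ferrum
require 'nokogiri'

# Render the page in headless Chrome so JavaScript-injected tags are present
browser = Ferrum::Browser.new(timeout: 30)
browser.go_to('https://example.com')
rendered_html = browser.body # HTML after JavaScript execution
browser.quit

doc = Nokogiri::HTML(rendered_html)
og_title = doc.at_css("meta[property='og:title']")
puts og_title['content'] if og_title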

Performance Optimization

When extracting metadata from multiple pages, consider implementing concurrent processing:

require 'concurrent' # provided by the concurrent-ruby gem

class BatchMetadataExtractor
  def self.extract_from_urls(urls, max_threads: 5)
    pool = Concurrent::ThreadPoolExecutor.new(
      min_threads: 1,
      max_threads: max_threads
    )

    futures = urls.map do |url|
      Concurrent::Future.execute(executor: pool) do
        extractor = RobustMetadataExtractor.new(url)
        { url: url, metadata: extractor.extract_all_metadata }
      rescue StandardError => e
        { url: url, error: e.message }
      end
    end

    results = futures.map(&:value)
    pool.shutdown
    pool.wait_for_termination

    results
  end
end

# Usage
urls = ['https://example1.com', 'https://example2.com', 'https://example3.com']
results = BatchMetadataExtractor.extract_from_urls(urls)

Conclusion

Ruby's Nokogiri library provides a powerful and flexible way to extract metadata from web pages. Whether you need basic meta tags, social media metadata, or structured data, the techniques shown in this guide will help you build robust metadata extraction tools. Remember to handle errors gracefully, respect website rate limits, and consider the legal implications of web scraping.

For websites with complex JavaScript-rendered content, you might need to combine these Ruby techniques with browser automation tools to ensure you're capturing all available metadata. This comprehensive approach will serve you well for SEO analysis, content aggregation, and web scraping projects.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data (the -g flag stops curl from treating the square brackets as URL glob ranges):

curl -g "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page%20title&fields[price]=Product%20price&api_key=YOUR_API_KEY"
