# How do I parse RSS and Atom feeds with Nokogiri?
Parsing RSS and Atom feeds is a common requirement when building web scrapers and data aggregation tools. Nokogiri, Ruby's premier XML/HTML parsing library, provides excellent support for parsing both RSS and Atom feeds through its robust XML parsing capabilities.
## Understanding RSS and Atom Feed Formats
Before diving into parsing, it's important to understand the structure of these feed formats:
- **RSS (Really Simple Syndication)**: Uses XML with elements like `<channel>`, `<item>`, `<title>`, `<description>`, and `<link>`
- **Atom**: A more modern XML-based format with elements like `<feed>`, `<entry>`, `<title>`, `<content>`, and `<link>`
## Setting Up Nokogiri for Feed Parsing

First, ensure you have Nokogiri installed in your Ruby environment:

```bash
gem install nokogiri
```

Or add it to your Gemfile:

```ruby
gem 'nokogiri'
```
## Parsing RSS Feeds with Nokogiri

### Basic RSS Parsing

Here's a complete example of parsing an RSS feed:
```ruby
require 'nokogiri'
require 'open-uri'

def parse_rss_feed(url)
  # Fetch and parse the RSS feed
  doc = Nokogiri::XML(URI.open(url))

  # Extract feed metadata
  feed_info = {
    title: doc.at('channel title')&.text,
    description: doc.at('channel description')&.text,
    link: doc.at('channel link')&.text,
    items: []
  }

  # Parse individual items
  doc.xpath('//item').each do |item|
    feed_info[:items] << {
      title: item.at('title')&.text,
      description: item.at('description')&.text,
      link: item.at('link')&.text,
      pub_date: item.at('pubDate')&.text,
      guid: item.at('guid')&.text
    }
  end

  feed_info
end

# Usage example
feed_url = 'https://example.com/rss.xml'
feed_data = parse_rss_feed(feed_url)

puts "Feed Title: #{feed_data[:title]}"
puts "Total Items: #{feed_data[:items].length}"

feed_data[:items].first(5).each_with_index do |item, index|
  puts "\n#{index + 1}. #{item[:title]}"
  puts "   Link: #{item[:link]}"
  puts "   Published: #{item[:pub_date]}"
end
```
### Advanced RSS Parsing with Namespaces

Many RSS feeds include additional namespaces for extended functionality:
```ruby
def parse_rss_with_namespaces(url)
  doc = Nokogiri::XML(URI.open(url))

  # Common namespaces used by RSS extensions
  namespaces = {
    'content' => 'http://purl.org/rss/1.0/modules/content/',
    'dc' => 'http://purl.org/dc/elements/1.1/',
    'media' => 'http://search.yahoo.com/mrss/'
  }

  items = []
  # Use at_xpath for the namespaced elements; prefixed names like
  # content:encoded are XPath queries, not CSS selectors
  doc.xpath('//item').each do |item|
    items << {
      title: item.at('title')&.text,
      description: item.at('description')&.text,
      content: item.at_xpath('content:encoded', namespaces)&.text,
      author: item.at_xpath('dc:creator', namespaces)&.text,
      media_url: item.at_xpath('media:content', namespaces)&.[]('url'),
      link: item.at('link')&.text,
      pub_date: item.at('pubDate')&.text
    }
  end
  items
end
```
## Parsing Atom Feeds with Nokogiri

### Basic Atom Parsing

Atom feeds have a different structure than RSS feeds:
```ruby
def parse_atom_feed(url)
  doc = Nokogiri::XML(URI.open(url))

  # Every Atom element lives in this namespace; register it for XPath queries
  atom_ns = { 'atom' => 'http://www.w3.org/2005/Atom' }

  # Extract feed metadata
  feed_info = {
    title: doc.at_xpath('/atom:feed/atom:title', atom_ns)&.text,
    subtitle: doc.at_xpath('/atom:feed/atom:subtitle', atom_ns)&.text,
    link: doc.at_xpath('/atom:feed/atom:link[@rel="alternate"]', atom_ns)&.[]('href'),
    updated: doc.at_xpath('/atom:feed/atom:updated', atom_ns)&.text,
    entries: []
  }

  # Parse individual entries
  doc.xpath('//atom:entry', atom_ns).each do |entry|
    feed_info[:entries] << {
      title: entry.at_xpath('atom:title', atom_ns)&.text,
      content: entry.at_xpath('atom:content', atom_ns)&.text,
      summary: entry.at_xpath('atom:summary', atom_ns)&.text,
      link: entry.at_xpath('atom:link[@rel="alternate"]', atom_ns)&.[]('href'),
      author: entry.at_xpath('atom:author/atom:name', atom_ns)&.text,
      published: entry.at_xpath('atom:published', atom_ns)&.text,
      updated: entry.at_xpath('atom:updated', atom_ns)&.text,
      id: entry.at_xpath('atom:id', atom_ns)&.text
    }
  end

  feed_info
end

# Usage example
atom_url = 'https://example.com/atom.xml'
atom_data = parse_atom_feed(atom_url)

puts "Feed Title: #{atom_data[:title]}"
puts "Last Updated: #{atom_data[:updated]}"
puts "Total Entries: #{atom_data[:entries].length}"
```
## Unified Feed Parser for Both RSS and Atom

Create a flexible parser that can handle both feed types:
```ruby
class FeedParser
  ATOM_NS = { 'atom' => 'http://www.w3.org/2005/Atom' }.freeze

  def self.parse(url)
    doc = Nokogiri::XML(URI.open(url))

    # The root element name distinguishes the two formats reliably,
    # even when the Atom default namespace is present
    case doc.root&.name
    when 'rss'
      parse_rss(doc)
    when 'feed'
      parse_atom(doc)
    else
      raise "Unknown feed format"
    end
  end

  def self.parse_rss(doc)
    {
      format: 'RSS',
      title: doc.at('channel title')&.text,
      description: doc.at('channel description')&.text,
      items: extract_rss_items(doc)
    }
  end

  def self.parse_atom(doc)
    {
      format: 'Atom',
      title: doc.at_xpath('/atom:feed/atom:title', ATOM_NS)&.text,
      description: doc.at_xpath('/atom:feed/atom:subtitle', ATOM_NS)&.text,
      items: extract_atom_entries(doc)
    }
  end

  def self.extract_rss_items(doc)
    doc.xpath('//item').map do |item|
      {
        title: item.at('title')&.text,
        description: item.at('description')&.text,
        link: item.at('link')&.text,
        date: item.at('pubDate')&.text
      }
    end
  end

  def self.extract_atom_entries(doc)
    doc.xpath('//atom:entry', ATOM_NS).map do |entry|
      {
        title: entry.at_xpath('atom:title', ATOM_NS)&.text,
        description: entry.at_xpath('atom:summary', ATOM_NS)&.text,
        link: entry.at_xpath('atom:link[@rel="alternate"]', ATOM_NS)&.[]('href'),
        date: entry.at_xpath('atom:published', ATOM_NS)&.text
      }
    end
  end

  # A bare `private` has no effect on class methods; hide them explicitly
  private_class_method :parse_rss, :parse_atom,
                       :extract_rss_items, :extract_atom_entries
end

# Usage
feed_data = FeedParser.parse('https://example.com/feed.xml')
puts "Feed Format: #{feed_data[:format]}"
puts "Title: #{feed_data[:title]}"
```
## Error Handling and Best Practices

### Robust Feed Parsing with Error Handling
```ruby
require 'nokogiri'
require 'open-uri'
require 'timeout'

def safe_parse_feed(url, timeout_seconds = 10)
  Timeout.timeout(timeout_seconds) do
    doc = Nokogiri::XML(URI.open(url))

    # Validate that we have a feed by checking the root element
    root = doc.root&.name
    raise "Invalid feed format" unless %w[rss feed].include?(root)

    # Parse based on format
    if root == 'rss'
      parse_rss_safely(doc)
    else
      parse_atom_safely(doc) # Atom counterpart, mirroring parse_rss_safely
    end
  end
rescue Timeout::Error
  { error: "Feed parsing timed out after #{timeout_seconds} seconds" }
rescue OpenURI::HTTPError => e
  { error: "HTTP error: #{e.message}" }
rescue Nokogiri::XML::SyntaxError => e
  { error: "XML parsing error: #{e.message}" }
rescue => e
  { error: "Unexpected error: #{e.message}" }
end

def parse_rss_safely(doc)
  {
    success: true,
    format: 'RSS',
    title: safe_extract_text(doc, 'channel title'),
    description: safe_extract_text(doc, 'channel description'),
    items: doc.xpath('//item').map { |item| extract_rss_item_safely(item) }
  }
end

def safe_extract_text(doc, selector)
  element = doc.at(selector)
  element ? element.text.strip : nil
end

def extract_rss_item_safely(item)
  {
    title: safe_extract_text(item, 'title'),
    description: safe_extract_text(item, 'description'),
    link: safe_extract_text(item, 'link'),
    pub_date: safe_extract_text(item, 'pubDate')
  }
end
```
## Performance Optimization Techniques

### Efficient Feed Processing
```ruby
def optimized_feed_parser(url, limit: nil)
  # NOBLANKS skips whitespace-only text nodes, shrinking the parsed tree
  doc = Nokogiri::XML(URI.open(url)) do |config|
    config.noblanks
  end

  items = []
  # local-name() matches <entry> even inside the Atom default namespace,
  # which a plain //entry query would miss
  doc.xpath('//item | //*[local-name()="entry"]').each_with_index do |element, index|
    break if limit && index >= limit
    items << if element.name == 'item'
               extract_rss_item(element)
             else
               extract_atom_entry(element)
             end
  end
  items
end

def extract_rss_item(item)
  # Use at() instead of xpath() for single elements (faster)
  {
    title: item.at('title')&.text,
    link: item.at('link')&.text,
    date: item.at('pubDate')&.text
  }
end

def extract_atom_entry(entry)
  # Namespace-agnostic lookups so this works for any Atom entry
  {
    title: entry.at_xpath('*[local-name()="title"]')&.text,
    link: entry.at_xpath('*[local-name()="link"]')&.[]('href'),
    date: entry.at_xpath('*[local-name()="updated"]')&.text
  }
end
```
## Handling Different Character Encodings

When dealing with international feeds, proper encoding handling is crucial:
```ruby
def parse_feed_with_encoding(url)
  content = URI.open(url).read

  # Detect encoding from the XML declaration, defaulting to UTF-8
  encoding = content.match(/encoding=["']([^"']+)["']/i)&.captures&.first || 'UTF-8'

  # Reinterpret the raw bytes, then transcode to UTF-8,
  # replacing any invalid or unmappable characters
  content.force_encoding(encoding)
  content = content.encode('UTF-8', invalid: :replace, undef: :replace)

  doc = Nokogiri::XML(content)
  # Continue with normal parsing...
  doc
end
```
## Working with Feed Updates and Caching

For applications that need to monitor feeds regularly, implementing caching and conditional requests is important:
```ruby
require 'open-uri'

def fetch_feed_conditionally(url, last_modified: nil, etag: nil)
  headers = {}
  headers['If-Modified-Since'] = last_modified if last_modified
  headers['If-None-Match'] = etag if etag

  response = URI.open(url, headers)
  {
    content: response.read,
    last_modified: response.meta['last-modified'],
    etag: response.meta['etag'],
    modified: true
  }
rescue OpenURI::HTTPError => e
  # open-uri raises on 304 Not Modified; that just means our cache is current
  if e.message.start_with?('304')
    { modified: false }
  else
    raise
  end
end
```
## Integration with Web Scraping Workflows
While Nokogiri excels at parsing XML-based feeds, modern web applications often require more complex scraping capabilities. For JavaScript-heavy sites or dynamic content that requires browser automation, you might want to consider combining Nokogiri with browser automation tools for comprehensive data extraction.
## Common Pitfalls and Solutions

### 1. Namespace Issues

Always check for and handle XML namespaces properly, especially with Atom feeds, where every element sits in the `http://www.w3.org/2005/Atom` namespace.
### 2. Malformed XML

Nokogiri's default XML parse options already include error recovery; if you've customized the options, you can set it back explicitly:

```ruby
doc = Nokogiri::XML(content) do |config|
  config.recover
end
```
### 3. Memory Usage with Large Feeds

For very large feeds, consider using Nokogiri's SAX parser for streaming:
```ruby
# A minimal SAX handler that collects every <title> as it streams past,
# without building the full document tree in memory
class FeedHandler < Nokogiri::XML::SAX::Document
  attr_reader :titles

  def initialize
    super
    @titles = []
    @buffer = nil
  end

  def start_element(name, attributes = [])
    @buffer = +'' if name == 'title'
  end

  def characters(string)
    @buffer << string if @buffer
  end

  def end_element(name)
    return unless name == 'title' && @buffer
    @titles << @buffer.strip
    @buffer = nil
  end
end

handler = FeedHandler.new
Nokogiri::XML::SAX::Parser.new(handler).parse(File.open('large_feed.xml'))
```
## Conclusion
Nokogiri provides powerful and flexible tools for parsing both RSS and Atom feeds in Ruby applications. Whether you're building a simple feed reader or a complex data aggregation system, these techniques will help you extract valuable information from syndicated content efficiently and reliably.
For applications requiring real-time feed monitoring or handling feeds from JavaScript-heavy websites, consider combining these Nokogiri techniques with modern web automation approaches for comprehensive data extraction capabilities.