# How do I parse RSS and Atom feeds with Nokogiri?
Parsing RSS and Atom feeds is a common requirement when building web scrapers and data aggregation tools. Nokogiri, Ruby's premier XML/HTML parsing library, provides excellent support for parsing both RSS and Atom feeds through its robust XML parsing capabilities.
## Understanding RSS and Atom Feed Formats
Before diving into parsing, it's important to understand the structure of these feed formats:
- **RSS (Really Simple Syndication)**: Uses XML with elements like `<channel>`, `<item>`, `<title>`, `<description>`, and `<link>`
- **Atom**: A more modern XML-based format with elements like `<feed>`, `<entry>`, `<title>`, `<content>`, and `<link>`
## Setting Up Nokogiri for Feed Parsing

First, ensure you have Nokogiri installed in your Ruby environment:

```bash
gem install nokogiri
```

Or add it to your Gemfile:

```ruby
gem 'nokogiri'
```
## Parsing RSS Feeds with Nokogiri

### Basic RSS Parsing

Here's a complete example of parsing an RSS feed:
```ruby
require 'nokogiri'
require 'open-uri'

def parse_rss_feed(url)
  # Fetch and parse the RSS feed
  doc = Nokogiri::XML(URI.open(url))

  # Extract feed metadata
  feed_info = {
    title: doc.at('channel title')&.text,
    description: doc.at('channel description')&.text,
    link: doc.at('channel link')&.text,
    items: []
  }

  # Parse individual items
  doc.xpath('//item').each do |item|
    feed_info[:items] << {
      title: item.at('title')&.text,
      description: item.at('description')&.text,
      link: item.at('link')&.text,
      pub_date: item.at('pubDate')&.text,
      guid: item.at('guid')&.text
    }
  end

  feed_info
end

# Usage example
feed_url = 'https://example.com/rss.xml'
feed_data = parse_rss_feed(feed_url)

puts "Feed Title: #{feed_data[:title]}"
puts "Total Items: #{feed_data[:items].length}"

feed_data[:items].first(5).each_with_index do |item, index|
  puts "\n#{index + 1}. #{item[:title]}"
  puts "   Link: #{item[:link]}"
  puts "   Published: #{item[:pub_date]}"
end
```
### Advanced RSS Parsing with Namespaces

Many RSS feeds include additional namespaces for extended functionality:
```ruby
def parse_rss_with_namespaces(url)
  doc = Nokogiri::XML(URI.open(url))

  # Common namespaces used by RSS extensions
  namespaces = {
    'content' => 'http://purl.org/rss/1.0/modules/content/',
    'dc' => 'http://purl.org/dc/elements/1.1/',
    'media' => 'http://search.yahoo.com/mrss/'
  }

  items = []
  # Use at_xpath for the namespaced elements; prefixed names like
  # content:encoded are XPath queries, not CSS selectors
  doc.xpath('//item').each do |item|
    items << {
      title: item.at('title')&.text,
      description: item.at('description')&.text,
      content: item.at_xpath('content:encoded', namespaces)&.text,
      author: item.at_xpath('dc:creator', namespaces)&.text,
      media_url: item.at_xpath('media:content', namespaces)&.[]('url'),
      link: item.at('link')&.text,
      pub_date: item.at('pubDate')&.text
    }
  end
  items
end
```
## Parsing Atom Feeds with Nokogiri

### Basic Atom Parsing

Atom feeds have a different structure than RSS feeds:
```ruby
def parse_atom_feed(url)
  doc = Nokogiri::XML(URI.open(url))

  # Every Atom element lives in this namespace; register it for XPath queries
  atom_ns = { 'atom' => 'http://www.w3.org/2005/Atom' }

  # Extract feed metadata
  feed_info = {
    title: doc.at_xpath('/atom:feed/atom:title', atom_ns)&.text,
    subtitle: doc.at_xpath('/atom:feed/atom:subtitle', atom_ns)&.text,
    link: doc.at_xpath('/atom:feed/atom:link[@rel="alternate"]', atom_ns)&.[]('href'),
    updated: doc.at_xpath('/atom:feed/atom:updated', atom_ns)&.text,
    entries: []
  }

  # Parse individual entries
  doc.xpath('//atom:entry', atom_ns).each do |entry|
    feed_info[:entries] << {
      title: entry.at_xpath('atom:title', atom_ns)&.text,
      content: entry.at_xpath('atom:content', atom_ns)&.text,
      summary: entry.at_xpath('atom:summary', atom_ns)&.text,
      link: entry.at_xpath('atom:link[@rel="alternate"]', atom_ns)&.[]('href'),
      author: entry.at_xpath('atom:author/atom:name', atom_ns)&.text,
      published: entry.at_xpath('atom:published', atom_ns)&.text,
      updated: entry.at_xpath('atom:updated', atom_ns)&.text,
      id: entry.at_xpath('atom:id', atom_ns)&.text
    }
  end

  feed_info
end

# Usage example
atom_url = 'https://example.com/atom.xml'
atom_data = parse_atom_feed(atom_url)

puts "Feed Title: #{atom_data[:title]}"
puts "Last Updated: #{atom_data[:updated]}"
puts "Total Entries: #{atom_data[:entries].length}"
```
## Unified Feed Parser for Both RSS and Atom

Create a flexible parser that can handle both feed types:
```ruby
class FeedParser
  ATOM_NS = { 'atom' => 'http://www.w3.org/2005/Atom' }.freeze

  def self.parse(url)
    doc = Nokogiri::XML(URI.open(url))

    # The root element name distinguishes the two formats reliably,
    # even when the Atom default namespace is present
    case doc.root&.name
    when 'rss'
      parse_rss(doc)
    when 'feed'
      parse_atom(doc)
    else
      raise "Unknown feed format"
    end
  end

  def self.parse_rss(doc)
    {
      format: 'RSS',
      title: doc.at('channel title')&.text,
      description: doc.at('channel description')&.text,
      items: extract_rss_items(doc)
    }
  end

  def self.parse_atom(doc)
    {
      format: 'Atom',
      title: doc.at_xpath('/atom:feed/atom:title', ATOM_NS)&.text,
      description: doc.at_xpath('/atom:feed/atom:subtitle', ATOM_NS)&.text,
      items: extract_atom_entries(doc)
    }
  end

  def self.extract_rss_items(doc)
    doc.xpath('//item').map do |item|
      {
        title: item.at('title')&.text,
        description: item.at('description')&.text,
        link: item.at('link')&.text,
        date: item.at('pubDate')&.text
      }
    end
  end

  def self.extract_atom_entries(doc)
    doc.xpath('//atom:entry', ATOM_NS).map do |entry|
      {
        title: entry.at_xpath('atom:title', ATOM_NS)&.text,
        description: entry.at_xpath('atom:summary', ATOM_NS)&.text,
        link: entry.at_xpath('atom:link[@rel="alternate"]', ATOM_NS)&.[]('href'),
        date: entry.at_xpath('atom:published', ATOM_NS)&.text
      }
    end
  end

  # A bare `private` has no effect on class methods; hide them explicitly
  private_class_method :parse_rss, :parse_atom,
                       :extract_rss_items, :extract_atom_entries
end

# Usage
feed_data = FeedParser.parse('https://example.com/feed.xml')
puts "Feed Format: #{feed_data[:format]}"
puts "Title: #{feed_data[:title]}"
```
## Error Handling and Best Practices

### Robust Feed Parsing with Error Handling
```ruby
require 'nokogiri'
require 'open-uri'
require 'timeout'

def safe_parse_feed(url, timeout_seconds = 10)
  Timeout.timeout(timeout_seconds) do
    doc = Nokogiri::XML(URI.open(url))

    # Validate that we have a feed by checking the root element
    root = doc.root&.name
    raise "Invalid feed format" unless %w[rss feed].include?(root)

    # Parse based on format
    if root == 'rss'
      parse_rss_safely(doc)
    else
      parse_atom_safely(doc) # Atom counterpart, mirroring parse_rss_safely
    end
  end
rescue Timeout::Error
  { error: "Feed parsing timed out after #{timeout_seconds} seconds" }
rescue OpenURI::HTTPError => e
  { error: "HTTP error: #{e.message}" }
rescue Nokogiri::XML::SyntaxError => e
  { error: "XML parsing error: #{e.message}" }
rescue => e
  { error: "Unexpected error: #{e.message}" }
end

def parse_rss_safely(doc)
  {
    success: true,
    format: 'RSS',
    title: safe_extract_text(doc, 'channel title'),
    description: safe_extract_text(doc, 'channel description'),
    items: doc.xpath('//item').map { |item| extract_rss_item_safely(item) }
  }
end

def safe_extract_text(doc, selector)
  element = doc.at(selector)
  element ? element.text.strip : nil
end

def extract_rss_item_safely(item)
  {
    title: safe_extract_text(item, 'title'),
    description: safe_extract_text(item, 'description'),
    link: safe_extract_text(item, 'link'),
    pub_date: safe_extract_text(item, 'pubDate')
  }
end
```
## Performance Optimization Techniques

### Efficient Feed Processing
```ruby
def optimized_feed_parser(url, limit: nil)
  # NOBLANKS skips whitespace-only text nodes, shrinking the parsed tree
  doc = Nokogiri::XML(URI.open(url)) do |config|
    config.noblanks
  end

  items = []
  # local-name() matches <entry> even inside the Atom default namespace,
  # which a plain //entry query would miss
  doc.xpath('//item | //*[local-name()="entry"]').each_with_index do |element, index|
    break if limit && index >= limit
    items << if element.name == 'item'
               extract_rss_item(element)
             else
               extract_atom_entry(element)
             end
  end
  items
end

def extract_rss_item(item)
  # Use at() instead of xpath() for single elements (faster)
  {
    title: item.at('title')&.text,
    link: item.at('link')&.text,
    date: item.at('pubDate')&.text
  }
end

def extract_atom_entry(entry)
  # Namespace-agnostic lookups so this works for any Atom entry
  {
    title: entry.at_xpath('*[local-name()="title"]')&.text,
    link: entry.at_xpath('*[local-name()="link"]')&.[]('href'),
    date: entry.at_xpath('*[local-name()="updated"]')&.text
  }
end
```
## Handling Different Character Encodings

When dealing with international feeds, proper encoding handling is crucial:
```ruby
def parse_feed_with_encoding(url)
  content = URI.open(url).read

  # Detect encoding from the XML declaration, defaulting to UTF-8
  encoding = content.match(/encoding=["']([^"']+)["']/i)&.captures&.first || 'UTF-8'

  # Reinterpret the raw bytes, then transcode to UTF-8,
  # replacing any invalid or unmappable characters
  content.force_encoding(encoding)
  content = content.encode('UTF-8', invalid: :replace, undef: :replace)

  doc = Nokogiri::XML(content)
  # Continue with normal parsing...
  doc
end
```
## Working with Feed Updates and Caching

For applications that need to monitor feeds regularly, implementing caching and conditional requests is important:
```ruby
require 'open-uri'

def fetch_feed_conditionally(url, last_modified: nil, etag: nil)
  headers = {}
  headers['If-Modified-Since'] = last_modified if last_modified
  headers['If-None-Match'] = etag if etag

  response = URI.open(url, headers)
  {
    content: response.read,
    last_modified: response.meta['last-modified'],
    etag: response.meta['etag'],
    modified: true
  }
rescue OpenURI::HTTPError => e
  # open-uri raises on 304 Not Modified; that just means our cache is current
  if e.message.start_with?('304')
    { modified: false }
  else
    raise
  end
end
```
## Integration with Web Scraping Workflows
While Nokogiri excels at parsing XML-based feeds, modern web applications often require more complex scraping capabilities. For JavaScript-heavy sites or dynamic content that requires browser automation, you might want to consider combining Nokogiri with browser automation tools for comprehensive data extraction.
## Common Pitfalls and Solutions

### 1. Namespace Issues

Always check for and handle XML namespaces properly, especially with Atom feeds, where every element sits in the `http://www.w3.org/2005/Atom` namespace.
### 2. Malformed XML

Nokogiri's default XML parse options already include error recovery; if you've customized the options, you can set it back explicitly:

```ruby
doc = Nokogiri::XML(content) do |config|
  config.recover
end
```
### 3. Memory Usage with Large Feeds

For very large feeds, consider using Nokogiri's SAX parser for streaming:
```ruby
# A minimal SAX handler that collects every <title> as it streams past,
# without building the full document tree in memory
class FeedHandler < Nokogiri::XML::SAX::Document
  attr_reader :titles

  def initialize
    super
    @titles = []
    @buffer = nil
  end

  def start_element(name, attributes = [])
    @buffer = +'' if name == 'title'
  end

  def characters(string)
    @buffer << string if @buffer
  end

  def end_element(name)
    return unless name == 'title' && @buffer
    @titles << @buffer.strip
    @buffer = nil
  end
end

handler = FeedHandler.new
Nokogiri::XML::SAX::Parser.new(handler).parse(File.open('large_feed.xml'))
```
## Conclusion
Nokogiri provides powerful and flexible tools for parsing both RSS and Atom feeds in Ruby applications. Whether you're building a simple feed reader or a complex data aggregation system, these techniques will help you extract valuable information from syndicated content efficiently and reliably.
For applications requiring real-time feed monitoring or handling feeds from JavaScript-heavy websites, consider combining these Nokogiri techniques with modern web automation approaches for comprehensive data extraction capabilities.