How can I parse HTML responses with HTTParty and Nokogiri?
Combining HTTParty and Nokogiri creates a powerful toolchain for web scraping in Ruby. HTTParty handles HTTP requests while Nokogiri excels at parsing and manipulating HTML/XML documents. This combination allows you to fetch web pages and extract structured data efficiently.
Overview of HTTParty and Nokogiri
HTTParty is a Ruby gem that simplifies HTTP requests with an intuitive API. It handles various HTTP methods, headers, authentication, and response parsing.
Nokogiri is a fast and flexible HTML/XML parser that provides CSS selectors and XPath support for element selection and manipulation.
Installation and Setup
First, add both gems to your Gemfile:
gem 'httparty'
gem 'nokogiri'
Then install them:
bundle install
Or install individually:
gem install httparty nokogiri
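To confirm both gems load and to see each one's role on its own before combining them, here's a quick irb session (the response values shown are illustrative):
require 'httparty'
require 'nokogiri'

# HTTParty makes the request and exposes the response
response = HTTParty.get('https://example.com')
response.code                     # e.g. 200
response.headers['content-type']  # e.g. "text/html; charset=UTF-8"

# Nokogiri parses markup and lets you query it with CSS or XPath
doc = Nokogiri::HTML('<html><body><h1>Hello</h1></body></html>')
doc.css('h1').text      # => "Hello"
doc.xpath('//h1').text  # => "Hello"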
Basic HTML Parsing with HTTParty and Nokogiri
Here's a simple example that fetches a webpage and parses its content:
require 'httparty'
require 'nokogiri'
class WebScraper
  include HTTParty

  def self.scrape_page(url)
    # Fetch the webpage
    response = get(url)

    # Parse the HTML response
    doc = Nokogiri::HTML(response.body)

    # Extract data using CSS selectors
    title = doc.css('title').text
    headings = doc.css('h1, h2, h3').map(&:text)

    {
      title: title,
      headings: headings,
      status: response.code
    }
  end
end
# Usage
result = WebScraper.scrape_page('https://example.com')
puts result[:title]
Advanced HTML Parsing Techniques
Using XPath Selectors
XPath provides more powerful selection capabilities than CSS selectors:
require 'httparty'
require 'nokogiri'
def scrape_with_xpath(url)
  response = HTTParty.get(url)
  doc = Nokogiri::HTML(response.body)

  # Extract links with specific attributes
  external_links = doc.xpath('//a[starts-with(@href, "http")]/@href').map(&:value)

  # Find elements containing specific text; contains(., ...) checks all text
  # inside the paragraph, not just its first text node
  paragraphs_with_keyword = doc.xpath('//p[contains(., "scraping")]').map(&:text)

  # Get elements by position; (//p)[1] is the first paragraph in the document,
  # whereas //p[1] would match the first <p> child of every parent
  first_paragraph = doc.xpath('(//p)[1]').text

  {
    external_links: external_links,
    keyword_paragraphs: paragraphs_with_keyword,
    first_paragraph: first_paragraph
  }
end
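Usage mirrors the earlier example (the URL here is just a placeholder):
data = scrape_with_xpath('https://example.com')
puts data[:external_links].first(5)
puts data[:first_paragraph]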
Handling Different Content Types
def parse_response_by_type(url)
  response = HTTParty.get(url)

  case response.headers['content-type']
  when /html/
    doc = Nokogiri::HTML(response.body)
    extract_html_data(doc)
  when /xml/
    doc = Nokogiri::XML(response.body)
    extract_xml_data(doc)
  when /json/
    JSON.parse(response.body)
  else
    response.body
  end
end

def extract_html_data(doc)
  {
    title: doc.css('title').text,
    meta_description: doc.css('meta[name="description"]').first&.[]('content'),
    article_content: doc.css('article, .content, .post').map(&:text)
  }
end
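parse_response_by_type also calls extract_xml_data, which isn't defined above. A minimal sketch, assuming an RSS-style feed with item/title elements (adjust the XPath, and account for namespaces, for the XML you actually receive):
def extract_xml_data(doc)
  {
    root_element: doc.root&.name,
    # Hypothetical structure: pulls <title> values from RSS-style <item> nodes
    item_titles: doc.xpath('//item/title').map { |node| node.text.strip }
  }
end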
Error Handling and Robust Parsing
Implement proper error handling to make your scraper more reliable:
require 'httparty'
require 'nokogiri'
class RobustScraper
  include HTTParty

  # Set default options
  default_timeout 30
  headers 'User-Agent' => 'Mozilla/5.0 (compatible; RubyScraper/1.0)'

  def self.safe_scrape(url)
    response = get(url)

    # Check response status
    unless response.success?
      return { error: "HTTP #{response.code}: #{response.message}" }
    end

    # Attempt to parse HTML
    doc = Nokogiri::HTML(response.body)

    # Nokogiri::HTML is lenient and doc.errors is rarely empty for real-world
    # pages, so only treat a document with no root node as a hard failure
    if doc.nil? || doc.root.nil?
      return { error: "Failed to parse HTML", parse_errors: doc.errors }
    end

    # Extract data safely
    data = extract_safe_data(doc)
    { success: true, data: data }
  rescue HTTParty::Error => e
    { error: "HTTP error: #{e.message}" }
  rescue Nokogiri::XML::SyntaxError => e
    { error: "Parse error: #{e.message}" }
  rescue StandardError => e
    { error: "Unexpected error: #{e.message}" }
  end

  # `private` has no effect on class methods, so mark it explicitly
  private_class_method def self.extract_safe_data(doc)
    {
      title: doc.css('title').first&.text&.strip || 'No title found',
      headings: doc.css('h1, h2, h3').map { |h| h.text.strip }.reject(&:empty?),
      links: doc.css('a[href]').map { |link|
        { text: link.text.strip, url: link['href'] }
      }.reject { |link| link[:text].empty? },
      images: doc.css('img[src]').map { |img|
        { alt: img['alt'] || '', src: img['src'] }
      }
    }
  end
end
# Usage with error handling
result = RobustScraper.safe_scrape('https://example.com')
if result[:success]
  puts "Title: #{result[:data][:title]}"
  puts "Found #{result[:data][:links].count} links"
else
  puts "Error: #{result[:error]}"
end
Working with Forms and Form Data
Extract form information for automated form submissions:
def extract_form_data(url)
  response = HTTParty.get(url)
  doc = Nokogiri::HTML(response.body)

  doc.css('form').map do |form|
    {
      action: form['action'],
      method: form['method'] || 'GET',
      fields: form.css('input, select, textarea').map do |field|
        {
          name: field['name'],
          # field.name is the tag name, e.g. "select" or "textarea"
          type: field['type'] || field.name,
          value: field['value'],
          required: field.has_attribute?('required')
        }
      end
    }
  end
end
# Example: Auto-fill and submit a form
def submit_form_data(url, form_data)
  # First, get the page to extract CSRF tokens or hidden fields
  response = HTTParty.get(url)
  doc = Nokogiri::HTML(response.body)

  # Extract hidden fields (like CSRF tokens)
  hidden_fields = {}
  doc.css('form input[type="hidden"]').each do |field|
    hidden_fields[field['name']] = field['value']
  end

  # Merge with our form data
  complete_data = hidden_fields.merge(form_data)

  # Submit the form. Form actions are often relative paths, so resolve
  # them against the page URL before posting.
  form_action = doc.at_css('form')['action']
  HTTParty.post(URI.join(url, form_action).to_s, body: complete_data)
end
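Putting the two together (the URL and field names below are hypothetical; use the names reported by extract_form_data for the real form):
forms = extract_form_data('https://example.com/contact')
puts forms.first[:fields].map { |f| f[:name] }

submit_form_data('https://example.com/contact',
                 'email'   => 'user@example.com',
                 'message' => 'Hello there')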
Performance Optimization
Streaming Large Documents
For large HTML documents, consider streaming to reduce memory usage:
require 'httparty'
require 'nokogiri'

def parse_large_document(url)
  handler = MyDocumentHandler.new
  parser = Nokogiri::HTML::SAX::PushParser.new(handler)
  # Feed the streamed body fragments to the SAX push parser so the full
  # document never has to be held in memory at once
  HTTParty.get(url, stream_body: true) { |fragment| parser << fragment }
  parser.finish
  handler
end
class MyDocumentHandler < Nokogiri::XML::SAX::Document
  # Expose the collected data to callers
  attr_reader :data

  def initialize
    @current_element = nil
    @data = []
  end

  def start_element(name, attrs = [])
    @current_element = name
    href = attrs.find { |attr| attr[0] == 'href' }
    @data << { type: 'link', href: href[1] } if name == 'a' && href
  end

  def end_element(name)
    # Reset so stray text nodes aren't attributed to the last opened element
    @current_element = nil
  end

  def characters(string)
    if @current_element == 'title'
      @data << { type: 'title', text: string.strip }
    end
  end
end
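With the streaming version above, usage looks like this (the URL is a placeholder):
handler = parse_large_document('https://example.com')
puts handler.data.first(5)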
Connection Reuse and Pooling
class PooledScraper
  include HTTParty

  # Connection reuse is not built into HTTParty; persistent_connection_adapter
  # comes from the persistent_httparty gem, which also takes the pool options
  persistent_connection_adapter pool_size: 10

  # Timeouts are regular HTTParty class-level settings
  default_timeout 30
  open_timeout 10
  read_timeout 30

  def self.scrape_multiple_pages(urls)
    urls.map do |url|
      response = get(url)
      doc = Nokogiri::HTML(response.body)
      {
        url: url,
        title: doc.css('title').text,
        status: response.code
      }
    end
  end
end
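For example (the URLs are placeholders):
pages = PooledScraper.scrape_multiple_pages(['https://example.com', 'https://example.org'])
pages.each { |page| puts "#{page[:url]} -> #{page[:title]} (#{page[:status]})" }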
Common Patterns and Best Practices
Data Extraction with CSS Selectors
def extract_structured_data(doc)
  # Extract article metadata
  article_data = {
    title: doc.css('h1').first&.text&.strip,
    author: doc.css('.author, [data-author]').first&.text&.strip,
    date: doc.css('time, .date, [datetime]').first&.[]('datetime'),
    content: doc.css('.content, .article-body, main').first&.text&.strip,
    tags: doc.css('.tag, .category').map { |tag| tag.text.strip },
    images: doc.css('img').map { |img|
      {
        src: img['src'],
        alt: img['alt'],
        caption: img.parent.css('figcaption').text
      }
    }
  }

  # Drop keys whose values came back nil
  article_data.compact
end
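A typical call site, fetching the page first (the URL is illustrative):
doc = Nokogiri::HTML(HTTParty.get('https://example.com/article').body)
pp extract_structured_data(doc)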
Handling Dynamic Content
While HTTParty and Nokogiri work well for static HTML, some sites only render their content with JavaScript, which HTTParty cannot execute. For those cases you need a browser automation tool (for example Selenium, Capybara, or Ferrum) to render the page first; the rendered HTML can then be handed to Nokogiri exactly as before.
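A minimal sketch of that approach, assuming the ferrum gem (headless Chrome driver) is installed and Chrome is available; the rendered HTML is parsed with Nokogiri just like an HTTParty response body:
require 'ferrum'
require 'nokogiri'

def scrape_rendered_page(url)
  browser = Ferrum::Browser.new(headless: true)
  browser.go_to(url)

  # browser.body is the HTML after JavaScript has run
  doc = Nokogiri::HTML(browser.body)
  doc.css('h1, h2, h3').map(&:text)
ensure
  browser&.quit
end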
Rate Limiting and Respectful Scraping
class RespectfulScraper
  include HTTParty

  def self.scrape_with_delays(urls, delay: 1)
    results = []

    urls.each_with_index do |url, index|
      # Add delay between requests
      sleep(delay) unless index.zero?

      begin
        response = get(url)
        doc = Nokogiri::HTML(response.body)
        # extract_data stands in for any of the extraction helpers shown earlier
        results << extract_data(doc)
        puts "Scraped: #{url} (#{response.code})"
      rescue StandardError => e
        puts "Failed to scrape #{url}: #{e.message}"
        results << { error: e.message, url: url }
      end
    end

    results
  end
end
Debugging and Troubleshooting
Inspecting HTTP Responses
def debug_response(url)
  response = HTTParty.get(url, debug_output: $stdout)

  puts "Status: #{response.code}"
  puts "Headers: #{response.headers}"
  puts "Content-Type: #{response.headers['content-type']}"
  puts "Body length: #{response.body.length}"

  # Save response for inspection
  File.write('debug_response.html', response.body)

  doc = Nokogiri::HTML(response.body)
  puts "Document errors: #{doc.errors}" unless doc.errors.empty?
end
Testing Selectors
def test_selectors(html_content)
  doc = Nokogiri::HTML(html_content)

  selectors = [
    'title',
    'h1, h2, h3',
    '.content, .article, main',
    'a[href^="http"]'
  ]

  selectors.each do |selector|
    elements = doc.css(selector)
    puts "#{selector}: Found #{elements.count} elements"
    elements.first(3).each { |el| puts " - #{el.text.strip[0..50]}..." }
  end
end
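To try it against a live page, pass in a fetched body (the URL is a placeholder):
test_selectors(HTTParty.get('https://example.com').body)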
Conclusion
HTTParty and Nokogiri provide a robust foundation for web scraping in Ruby. HTTParty handles the HTTP complexity while Nokogiri offers powerful HTML parsing capabilities. This combination allows you to build efficient scrapers that can extract structured data from websites.
Key takeaways:
- Always implement proper error handling for network and parsing errors
- Use appropriate selectors (CSS or XPath) based on your extraction needs
- Respect website resources with proper delays and connection management
- Consider the website's structure and content type when designing your parsing logic
- Test your selectors thoroughly and handle edge cases gracefully
For more complex scenarios involving dynamic content or sophisticated browser interactions, you might need to explore headless browser solutions that can handle authentication flows or manage complex browser sessions.