Using HTTParty with Nokogiri is a common approach to making HTTP requests and parsing HTML responses in Ruby. HTTParty simplifies the process of making HTTP requests, while Nokogiri is a fast, powerful HTML and XML parser.
Here is a step-by-step guide on how to use HTTParty with Nokogiri for parsing HTML responses:
Step 1: Install the Required Gems
Before you start, make sure you have both the HTTParty and Nokogiri gems installed. You can install them using gem install:
gem install httparty
gem install nokogiri
Or include them in your Gemfile if you're using Bundler:
gem 'httparty'
gem 'nokogiri'
And run bundle install to install the gems.
Step 2: Require the Gems in Your Script
In your Ruby script or application, require both HTTParty and Nokogiri:
require 'httparty'
require 'nokogiri'
Step 3: Make an HTTP Request Using HTTParty
Use HTTParty to make an HTTP request to the desired URL:
response = HTTParty.get('https://example.com')
Step 4: Parse the HTML Response Using Nokogiri
Create a Nokogiri HTML document from the response body:
html_doc = Nokogiri::HTML(response.body)
Step 5: Extract Data Using Nokogiri's Search Methods
Use Nokogiri's .css, .xpath, or other search methods to parse and extract data from the HTML document:
# Using CSS selectors
titles = html_doc.css('h1').map(&:text)
# Using XPath selectors
links = html_doc.xpath('//a[@href]').map { |link| link['href'] }
Complete Example
Here is a complete example that fetches a web page and extracts the titles (using h1 tags) and all the hyperlinks:
require 'httparty'
require 'nokogiri'
# Make an HTTP GET request
response = HTTParty.get('https://example.com')
# Parse the HTML response body with Nokogiri
html_doc = Nokogiri::HTML(response.body)
# Extract and print all h1 titles
html_doc.css('h1').each do |title|
puts title.text.strip
end
# Extract and print all hyperlinks
html_doc.css('a').each do |link|
puts link['href']
end
Keep in mind that web scraping might be against the terms of service of some websites. Always check the website's robots.txt file and ensure you are allowed to scrape it. Also, ensure that your scraping activities do not put an undue load on the website's servers.
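As a quick illustration, you could fetch robots.txt with HTTParty and eyeball the Disallow rules. This is a deliberately naive sketch (the URL is just an example, and the string matching is simplistic); a real crawler should use a dedicated robots.txt parser:
require 'httparty'
# Fetch the site's robots.txt (URL is illustrative)
robots = HTTParty.get('https://example.com/robots.txt')
if robots.code == 200
  # Naive check: list every path that appears in a Disallow rule
  disallowed = robots.body.lines
                     .map(&:strip)
                     .select { |line| line.start_with?('Disallow:') }
                     .map { |line| line.sub('Disallow:', '').strip }
  puts "Disallowed paths: #{disallowed.inspect}"
else
  puts "No robots.txt found (HTTP #{robots.code})"
end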
Remember to handle errors and potential exceptions that may arise when making HTTP requests or parsing HTML. You might want to add error checking after the HTTP request to verify that it was successful before attempting to parse the HTML.
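As a rough sketch of that error handling (the URL and the exact set of rescued exception classes are just examples, not an exhaustive list), you might wrap the request like this:
require 'httparty'
require 'nokogiri'
begin
  response = HTTParty.get('https://example.com')
  # Only parse the body if the request returned a successful (2xx) status
  if response.success?
    html_doc = Nokogiri::HTML(response.body)
    puts html_doc.css('h1').map(&:text)
  else
    puts "Request failed with status #{response.code}"
  end
rescue HTTParty::Error, SocketError, Timeout::Error => e
  # Network-level failures (DNS errors, timeouts, etc.) end up here
  puts "Request error: #{e.message}"
end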