How do I use HTTParty with Nokogiri for parsing HTML responses?

Using HTTParty with Nokogiri is a common approach to making HTTP requests and parsing HTML responses in Ruby. HTTParty simplifies making HTTP requests, while Nokogiri is a fast and powerful HTML and XML parser.

Here is a step-by-step guide on how to use HTTParty with Nokogiri for parsing HTML responses:

Step 1: Install the Required Gems

Before you start, make sure you have both the HTTParty and Nokogiri gems installed. You can install them with the gem command:

gem install httparty
gem install nokogiri

Or include them in your Gemfile if you're using Bundler:

gem 'httparty'
gem 'nokogiri'

And run bundle install to install the gems.

Step 2: Require the Gems in Your Script

In your Ruby script or application, require both HTTParty and Nokogiri:

require 'httparty'
require 'nokogiri'

Step 3: Make an HTTP Request Using HTTParty

Use HTTParty to make an HTTP request to the desired URL:

response = HTTParty.get('https://example.com')
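
HTTParty.get also accepts an options hash, so you can send custom request headers, query string parameters, or a timeout. Here is a small sketch; the header value, path, and query parameters are only placeholders:

response = HTTParty.get(
  'https://example.com/search',
  headers: { 'User-Agent' => 'MyScraper/1.0' },  # placeholder User-Agent string
  query:   { q: 'ruby', page: 1 },               # appended as ?q=ruby&page=1
  timeout: 10                                    # seconds before the request is aborted
)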

Step 4: Parse the HTML Response Using Nokogiri

Create a Nokogiri HTML document from the response body:

html_doc = Nokogiri::HTML(response.body)

Step 5: Extract Data Using Nokogiri's Search Methods

Use Nokogiri's .css, .xpath, or other search methods to extract data from the parsed HTML document:

# Using CSS selectors
titles = html_doc.css('h1').map(&:text)

# Using XPath selectors
links = html_doc.xpath('//a[@href]').map { |link| link['href'] }
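
If you only need the first matching element, .at_css (or .at_xpath) returns a single node, or nil when nothing matches. A minimal sketch, assuming the page has a title element:

# Grab the first matching node; .at_css returns nil if there is no match
page_title = html_doc.at_css('title')
puts page_title.text.strip if page_title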

Complete Example

Here is a complete example that fetches a web page and extracts the titles (using h1 tags) and all the hyperlinks:

require 'httparty'
require 'nokogiri'

# Make an HTTP GET request
response = HTTParty.get('https://example.com')

# Parse the HTML response body with Nokogiri
html_doc = Nokogiri::HTML(response.body)

# Extract and print all h1 titles
html_doc.css('h1').each do |title|
  puts title.text.strip
end

# Extract and print all hyperlinks
html_doc.css('a').each do |link|
  puts link['href']
end

Keep in mind that web scraping might be against the terms of service of some websites. Always check the robots.txt file of the website and ensure you are allowed to scrape it. Also, ensure that your scraping activities do not put an undue load on the website's servers.
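
If you fetch several pages in a row, one simple way to limit the load is to pause between requests. A minimal sketch, assuming a hypothetical list of URLs:

require 'httparty'
require 'nokogiri'

urls = ['https://example.com/page1', 'https://example.com/page2']  # placeholder URLs

urls.each do |url|
  response = HTTParty.get(url)
  html_doc = Nokogiri::HTML(response.body)
  puts html_doc.css('h1').map(&:text)
  sleep 1  # wait a second between requests to avoid hammering the server
end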

Remember to handle errors and potential exceptions that may arise when making HTTP requests or parsing HTML. You might want to add error checking after the HTTP request to verify that it was successful before attempting to parse the HTML.
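
For example, here is a minimal sketch that checks the HTTP status before parsing and rescues common network errors (the exact set of exceptions you rescue will depend on your needs):

require 'httparty'
require 'nokogiri'

begin
  response = HTTParty.get('https://example.com')

  if response.success?  # true for 2xx status codes
    html_doc = Nokogiri::HTML(response.body)
    puts html_doc.css('h1').map(&:text)
  else
    warn "Request failed with status #{response.code}"
  end
rescue SocketError, Net::OpenTimeout, Net::ReadTimeout => e
  warn "Network error: #{e.message}"
end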
