When scraping web pages with Nokogiri, an open-source Ruby library for parsing HTML and XML, you may encounter redirects. Servers often use redirects to send you from an old or alternate URL to the current one. When you're scraping, it's important to follow these redirects so you reach the actual content you're after.
Nokogiri itself doesn't handle HTTP requests; it's simply a parsing library. To deal with redirects, you need to pair Nokogiri with an HTTP client. You can follow redirects manually with Net::HTTP (part of Ruby's standard library), or use a client that follows them for you, such as Open-URI (also in the standard library) or third-party gems like HTTParty and Faraday.
Here is an example of handling redirects manually using Net::HTTP:
require 'net/http'
require 'nokogiri'
require 'uri'

def fetch_and_parse(url, limit = 5)
  # Guard against circular redirects by capping how many we follow
  raise 'Too many redirects' if limit.zero?

  uri = URI.parse(url)
  response = Net::HTTP.get_response(uri)

  # Follow redirects
  if response.is_a?(Net::HTTPRedirection)
    # The Location header may be relative, so resolve it against the current URL
    location = URI.join(url, response['location']).to_s
    warn "Redirected to #{location}"
    return fetch_and_parse(location, limit - 1) # Recursive call for the new location
  end

  # Ensure that the response code is 200 OK before parsing
  unless response.is_a?(Net::HTTPOK)
    raise "Unable to fetch page: #{response.code} #{response.message}"
  end

  # Parse the body with Nokogiri
  Nokogiri::HTML(response.body)
end

# Example usage
begin
  url = 'http://example.com/some-page'
  document = fetch_and_parse(url)
  # Now you can work with `document` as a Nokogiri::HTML::Document
  puts document.title
rescue => e
  puts e.message
end
In the example above, the fetch_and_parse method handles fetching the page and following any redirects. Net::HTTP.get_response performs an HTTP GET request; if a redirect response is received (Net::HTTPRedirection), the method resolves the Location header against the current URL (it may be a relative path) and recursively calls itself with the new URL. The limit parameter caps how many redirects are followed, which guards against infinite loops in the case of circular redirects.
If you prefer a more concise, less manual approach, you can use the Open-URI library, a wrapper around Net::HTTP that handles redirects automatically.
require 'open-uri'
require 'nokogiri'
url = 'http://example.com/some-page'
document = Nokogiri::HTML(URI.open(url)) # Open-URI handles redirects for you
# Now you can work with `document` as a Nokogiri::HTML::Document
puts document.title
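The third-party clients mentioned earlier work the same way: fetch the page with the client, then hand the body to Nokogiri. As a minimal sketch, HTTParty follows redirects out of the box:

require 'httparty'
require 'nokogiri'

url = 'http://example.com/some-page'
response = HTTParty.get(url) # HTTParty follows redirects by default
document = Nokogiri::HTML(response.body)
puts document.title

Faraday, by contrast, only follows redirects when the appropriate middleware is enabled; this sketch assumes the faraday-follow_redirects gem is installed:

require 'faraday'
require 'faraday/follow_redirects' # provided by the faraday-follow_redirects gem
require 'nokogiri'

conn = Faraday.new do |f|
  f.response :follow_redirects # enable the redirect-following middleware
end

response = conn.get('http://example.com/some-page')
document = Nokogiri::HTML(response.body)
puts document.title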
Remember to be considerate when scraping websites: check the site's robots.txt to see if scraping is allowed, don't overload servers with frequent requests, and respect any rate limits they may have. Also be aware of the legal implications of scraping, as it can be restricted or prohibited by a website's terms of service.
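If you want to put the robots.txt advice into practice, here is a minimal sketch; robots_txt is a hypothetical helper name, and actually interpreting the directives is best left to a dedicated robots.txt parser gem:

require 'net/http'
require 'uri'

# Hypothetical helper: fetch a site's robots.txt so you can review its rules.
# Retrieving the file is the easy part; correctly honoring its directives
# should be delegated to a dedicated robots.txt parser.
def robots_txt(site)
  uri = URI.join(site, '/robots.txt')
  response = Net::HTTP.get_response(uri)
  response.is_a?(Net::HTTPOK) ? response.body : nil
end

puts robots_txt('http://example.com')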