How do I handle redirects when scraping with Nokogiri?

When scraping web pages using Nokogiri, an open-source Ruby library for parsing HTML and XML, you may encounter redirects. Web servers often use redirects to send you from an old or alternate URL to the current one. When you're scraping, it's important to handle these redirects so you reach the actual content you're after.

Nokogiri itself doesn't make HTTP requests; it's purely a parsing library. To handle redirects, you need to pair Nokogiri with an HTTP client that can follow them, such as Net::HTTP or OpenURI (both part of Ruby's standard library) or third-party gems like HTTParty or Faraday.

Here is an example of handling redirects using Net::HTTP:

require 'net/http'
require 'nokogiri'
require 'uri'

def fetch_and_parse(url, limit = 5)
  # Guard against circular redirects
  raise 'Too many redirects' if limit.zero?

  uri = URI.parse(url)
  response = Net::HTTP.get_response(uri)

  # Follow redirects
  if response.is_a?(Net::HTTPRedirection)
    # The Location header may be relative, so resolve it against the current URI
    location = URI.join(uri, response['location']).to_s
    warn "Redirected to #{location}"
    return fetch_and_parse(location, limit - 1) # Recursive call to handle the new location
  end

  # Ensure that the response code is 200 OK before parsing
  unless response.is_a?(Net::HTTPOK)
    raise "Unable to fetch page: #{response.code} #{response.message}"
  end

  # Parse the body with Nokogiri
  Nokogiri::HTML(response.body)
end

# Example usage
begin
  url = 'http://example.com/some-page'
  document = fetch_and_parse(url)
  # Now you can work with `document` as a Nokogiri::HTML::Document
  puts document.title
rescue => e
  puts e.message
end

In the example above, the fetch_and_parse method handles fetching the page and following any redirects. Net::HTTP.get_response performs an HTTP GET request, and if a redirect response is received (Net::HTTPRedirection), the method resolves the Location header against the current URI (the header may contain a relative path) and recursively calls itself with the new URL. The limit parameter caps the number of redirects followed, which prevents infinite loops in the case of circular redirects.

If you prefer a more concise and less manual approach, you can use OpenURI, a standard-library wrapper around Net::HTTP that follows redirects automatically.

require 'open-uri'
require 'nokogiri'

url = 'http://example.com/some-page'
document = Nokogiri::HTML(URI.open(url)) # OpenURI follows redirects for you

# Now you can work with `document` as a Nokogiri::HTML::Document
puts document.title
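
HTTParty, one of the third-party clients mentioned above, follows redirects by default, so the fetch step stays just as short. Here is a minimal sketch, assuming the httparty gem is installed (the URL and the status check are illustrative):

require 'httparty'
require 'nokogiri'

url = 'http://example.com/some-page'
response = HTTParty.get(url) # HTTParty follows redirects by default

# Make sure the final response is 200 OK before parsing
raise "Unable to fetch page: #{response.code}" unless response.code == 200

document = Nokogiri::HTML(response.body)
puts document.title

Faraday works similarly, but it only follows redirects once you add middleware such as the faraday-follow_redirects gem (response :follow_redirects in the connection block).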

Remember to be considerate when scraping websites: check the site's robots.txt to see if scraping is allowed, don't overload their servers with frequent requests, and respect any rate limits they may have. Also, be aware of the legal implications of scraping, as it can be restricted or prohibited by the website's terms of service.
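
As a starting point, you can fetch robots.txt with the same tools. This is a minimal sketch that simply prints the file for manual review (a real crawler would parse the rules or use a dedicated library; the URL is illustrative):

require 'open-uri'

# Fetch and print the site's robots.txt for manual review before scraping
puts URI.open('http://example.com/robots.txt').read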
