How do I use Nokogiri with a proxy?

Nokogiri is a popular Ruby library for parsing HTML and XML. It offers DOM, SAX, and Reader parsing interfaces. It's a great tool for web scraping because it provides a simple way to navigate and manipulate the structure of an HTML/XML document.

When web scraping, you might need to use a proxy to avoid IP bans, bypass geographical restrictions, or scrape websites that have rate limiting in place. Here's how you can use Nokogiri with a proxy in Ruby:

Nokogiri only parses content; it does not make network requests itself. To fetch pages through a proxy, you'll need an HTTP client that supports proxies, such as Net::HTTP, OpenURI, or a third-party library like HTTParty or Faraday. In this example, we will use Net::HTTP:

require 'nokogiri'
require 'net/http'
require 'uri'

# Specify the proxy address and port
proxy_addr = 'your.proxy.address'
proxy_port = 8080

# Create a Net::HTTP subclass that routes its connections through the proxy
proxy = Net::HTTP::Proxy(proxy_addr, proxy_port)

# Define the URL you want to scrape (parsed so that host and port are available)
url = URI.parse('http://example.com/')

# Use the proxy class to open a connection and send a GET request
response = proxy.start(url.host, url.port, use_ssl: url.scheme == 'https') do |http|
  request = Net::HTTP::Get.new(url)
  http.request(request)
end

# Parse the response body with Nokogiri
doc = Nokogiri::HTML(response.body)

# Now you can use Nokogiri methods to navigate the document
titles = doc.css('h1').map(&:text)
puts titles

If you need to use authentication for the proxy, you can modify the proxy initialization to include the username and password:

proxy = Net::HTTP::Proxy(proxy_addr, proxy_port, proxy_user, proxy_pass)

Replace proxy_user and proxy_pass with your proxy's username and password.
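
If your proxy details arrive as a single URL, the same settings can be passed straight to Net::HTTP.new instead of going through Net::HTTP::Proxy. The following is a minimal sketch under that assumption; build_proxied_client is a hypothetical helper name, and the host and credentials are placeholders:

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: returns a Net::HTTP client for target_url that
# connects through proxy_url. Credentials written into the proxy URL
# (user:pass@host) are handed to the proxy as-is; values containing
# percent-encoded characters would need decoding first.
def build_proxied_client(target_url, proxy_url)
  target = URI.parse(target_url)
  proxy  = URI.parse(proxy_url)

  http = Net::HTTP.new(
    target.host, target.port,
    proxy.host, proxy.port,
    proxy.user, proxy.password
  )
  # Enable TLS only when the target site itself is https
  http.use_ssl = (target.scheme == 'https')
  http
end

# Example usage (placeholder values, requires the nokogiri gem):
#   http = build_proxied_client('http://example.com/',
#                               'http://user:secret@your.proxy.address:8080')
#   doc = Nokogiri::HTML(http.get('/').body)
```

Before sending any request, you can confirm the configuration with http.proxy?, http.proxy_address, and http.proxy_port.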

If you're using OpenURI, you can set the proxy environment variables before making the request:

require 'nokogiri'
require 'open-uri'

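# Set environment variables for the proxy.
# The scheme here is the one used to reach the proxy itself, which is
# normally http:// even when the target site is https.
ENV['http_proxy'] = 'http://your.proxy.address:8080'
ENV['https_proxy'] = 'http://your.proxy.address:8080'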

# Now when you open a URL, OpenURI will use the proxy settings
doc = Nokogiri::HTML(URI.open('http://example.com'))

# Use Nokogiri as you normally would
puts doc.css('title').text

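Keep in mind that the environment variables http_proxy and https_proxy should be set to the full URL of the proxy, including the scheme. The variable name refers to the scheme of the site you are requesting, while the scheme inside the value describes how your client connects to the proxy itself. Most proxies accept plain HTTP connections even when they carry HTTPS traffic, so the value usually starts with "http://" for both variables.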
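
If you would rather not change environment variables for the whole process, OpenURI also accepts the proxy per request through its :proxy and :proxy_http_basic_authentication options. Here is a small sketch; open_uri_proxy_options is a hypothetical helper name, and the proxy URL and credentials are placeholders:

```ruby
require 'open-uri'

# Hypothetical helper: builds the options hash to pass to URI.open.
# OpenURI does not allow :proxy and :proxy_http_basic_authentication
# together, so exactly one of them is returned.
def open_uri_proxy_options(proxy_url, user: nil, password: nil)
  if user && password
    { proxy_http_basic_authentication: [proxy_url, user, password] }
  else
    { proxy: proxy_url }
  end
end

# Example usage (placeholder values, requires the nokogiri gem):
#   options = open_uri_proxy_options('http://your.proxy.address:8080')
#   doc = Nokogiri::HTML(URI.open('http://example.com', options))
```

Passing the proxy this way keeps the setting local to one call, which can be convenient when different requests need different proxies.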

Remember that using a proxy can slow down your web scraping because your requests have to go through another server. Additionally, ensure you have the legal right to scrape the website and that you comply with its robots.txt file and Terms of Service.
