Nokogiri is a popular Ruby library for parsing HTML and XML, and it also provides SAX and Reader parser interfaces for streaming large documents. It's a great tool for web scraping because it provides a simple way to navigate and manipulate the structure of an HTML/XML document.
When web scraping, you might need to use a proxy to avoid IP bans, bypass geographical restrictions, or scrape websites that have rate limiting in place. Here's how you can use Nokogiri with a proxy in Ruby:
Nokogiri only parses documents and does not fetch them, so the proxy is configured on the HTTP client rather than on Nokogiri itself. First, you'll need an HTTP client that supports proxies, such as Net::HTTP, OpenURI, HTTParty, or Faraday. In this example, we will use Net::HTTP:
require 'nokogiri'
require 'net/http'
# Specify the proxy address and port
proxy_addr = 'your.proxy.address'
proxy_port = 8080
# Create a Net::HTTP subclass preconfigured with the proxy settings
proxy = Net::HTTP::Proxy(proxy_addr, proxy_port)
# Define the URL you want to scrape (parsed into a URI so host and port are available)
url = URI.parse('http://example.com')
# Use the proxy class to open a connection, send a GET request, and get the response
response = proxy.start(url.host, url.port) do |http|
  request = Net::HTTP::Get.new(url)
  http.request(request)
end
# Parse the response body with Nokogiri
doc = Nokogiri::HTML(response.body)
# Now you can use Nokogiri methods to navigate the document
titles = doc.css('h1').map(&:text)
puts titles
If you need to use authentication for the proxy, you can modify the proxy initialization to include the username and password:
proxy = Net::HTTP::Proxy(proxy_addr, proxy_port, proxy_user, proxy_pass)
Replace proxy_user and proxy_pass with your proxy's username and password.
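If your proxy details come as a single URL with the credentials embedded, the same settings can be passed to Net::HTTP.new instead. The following is a minimal sketch under that assumption: proxied_client is a hypothetical helper name, and the proxy address and credentials are placeholders. It also enables TLS when the target URL uses https, which the plain HTTP example above does not need.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper (not part of Net::HTTP): builds an unstarted client for
# target_url whose requests go through the proxy described by proxy_url.
def proxied_client(target_url, proxy_url)
  target = URI.parse(target_url)
  proxy  = URI.parse(proxy_url)

  # proxy.user and proxy.password are nil when the URL carries no credentials
  http = Net::HTTP.new(target.host, target.port,
                       proxy.host, proxy.port, proxy.user, proxy.password)
  http.use_ssl = (target.scheme == 'https') # enable TLS for https:// targets
  http
end

# Placeholder proxy URL; no connection is opened until a request is made
http = proxied_client('https://example.com/',
                      'http://proxy_user:proxy_pass@your.proxy.address:8080')
puts http.proxy_address
puts http.proxy_port
```

Calling http.get('/') on the returned client sends the request through the proxy, and the response body can then be handed to Nokogiri::HTML as before.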
If you're using OpenURI, you can set the proxy environment variables before making the request:
require 'nokogiri'
require 'open-uri'
# Set environment variables for the proxy
ENV['http_proxy'] = 'http://your.proxy.address:8080'
ENV['https_proxy'] = 'https://your.proxy.address:8080'
# Now when you open a URL, OpenURI will use the proxy settings
doc = Nokogiri::HTML(URI.open('http://example.com'))
# Use Nokogiri as you normally would
puts doc.css('title').text
Keep in mind that the environment variables http_proxy and https_proxy should be set to the full URL of the proxy, including "http://" or "https://".
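Environment variables affect every request the process makes. If you would rather choose the proxy per request, OpenURI also accepts :proxy and :proxy_http_basic_authentication options on URI.open. The sketch below only builds the options hash; open_uri_proxy_options is a hypothetical helper name and the proxy URL is a placeholder.

```ruby
require 'open-uri'

# Hypothetical helper: returns the options hash to pass to URI.open.
# OpenURI rejects a call that sets both options, so only one is returned.
def open_uri_proxy_options(proxy_url, user: nil, password: nil)
  if user && password
    # Authenticated proxies are given as [proxy_uri, user, password]
    { proxy_http_basic_authentication: [proxy_url, user, password] }
  else
    { proxy: proxy_url }
  end
end

options = open_uri_proxy_options('http://your.proxy.address:8080')

# Needs network access and a reachable proxy, so it is left commented out:
# doc = Nokogiri::HTML(URI.open('http://example.com', **options))
```

Because the proxy travels with each call, different requests in the same script can use different proxies without touching ENV.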
Remember that using a proxy can slow down your web scraping because your requests have to go through another server. Additionally, ensure you have the legal right to scrape the website and that you comply with its robots.txt file and Terms of Service.
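Because proxied requests tend to be slower, it can help to set explicit timeouts and retry a request that times out. The following is a rough sketch, not a definitive implementation: with_retries is a hypothetical helper, and the attempt count, delays, and timeout values are arbitrary assumptions you should tune for your proxy.

```ruby
require 'net/http'

# Hypothetical helper: re-runs the block when a connection or read timeout occurs.
def with_retries(attempts: 3, base_delay: 1.0)
  tries = 0
  begin
    tries += 1
    yield tries
  rescue Net::OpenTimeout, Net::ReadTimeout
    raise if tries >= attempts   # give up after the last attempt
    sleep(base_delay * tries)    # wait a little longer before each retry
    retry
  end
end

# Usage sketch (placeholder proxy; needs network access, so it is commented out):
# response = with_retries do
#   Net::HTTP.start('example.com', 80, 'your.proxy.address', 8080,
#                   open_timeout: 5, read_timeout: 15) { |http| http.get('/') }
# end
```

Only timeout errors are retried here; other failures, such as a refused connection or an HTTP error status, are raised or returned unchanged so they can be handled separately.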