Nokogiri and Mechanize are two Ruby gems that are often used together for web scraping. Nokogiri is an HTML and XML parser that offers DOM, SAX, and Reader interfaces, while Mechanize simulates a web browser session by tracking cookies, following links, and submitting forms. Mechanize uses Nokogiri to parse HTML, so for most tasks you never need to invoke the parser yourself. However, knowing how to reach Nokogiri through Mechanize matters when you need complex document traversal or manipulation.
Here's how to use Nokogiri with Mechanize for stateful web scraping:
Installation
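First, install the gems if you haven't already. Mechanize declares Nokogiri as a dependency, so installing Mechanize pulls in Nokogiri as well; the second command only matters if you want Nokogiri on its own: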
gem install mechanize
gem install nokogiri
Basic Usage
When Mechanize fetches an HTML page, it parses the response with Nokogiri automatically, and the resulting page object lets you call Nokogiri search methods to navigate the document.
Below is an example of using Mechanize and Nokogiri together:
require 'mechanize'
# Create a new Mechanize object
agent = Mechanize.new
# Fetch a page
page = agent.get('https://example.com')
# Mechanize returns a Mechanize::Page object which uses Nokogiri internally
# for parsing and provides Nokogiri methods for traversal and manipulation
# Use Nokogiri to search for nodes by CSS
nodes = page.search('css-selector')
# Iterate over found nodes
nodes.each do |node|
  puts node.text.strip
end
# search accepts XPath expressions as well as CSS selectors
nodes = page.search('//xpath/expression')
# Perform actions with the found nodes
nodes.each do |node|
  # Do something with the node
end
Handling Forms
Mechanize excels at handling forms, making it easy to perform stateful operations such as logging in:
# Assuming 'page' is a Mechanize::Page object containing a form
form = page.form_with(id: 'login-form')
# Fill in the form fields
form.field_with(name: 'username').value = 'your_username'
form.field_with(name: 'password').value = 'your_password'
# Submit the form
page = agent.submit(form)
# Now you're logged in, and 'page' contains the new page after login
Advanced Nokogiri Usage
Sometimes you may want to do more advanced manipulations with Nokogiri that aren't directly supported by Mechanize. You can access the underlying Nokogiri document directly:
# Access the Nokogiri document
nokogiri_doc = page.parser
# Now you can use any Nokogiri method on nokogiri_doc
nokogiri_doc.css('div.some-class').each do |div|
  # Do something with each div
end
Example: Extracting Data
Here's a more concrete example of extracting data from a page:
require 'mechanize'
agent = Mechanize.new
page = agent.get('https://example.com')
# Let's say we want to extract product information from an e-commerce page
products = page.search('div.product')
product_details = products.map do |product|
  {
    # at returns nil when nothing matches, so &. avoids a NoMethodError
    name: product.at('h2.product-name')&.text&.strip,
    price: product.at('span.product-price')&.text&.strip,
    url: product.at('a')&.attr('href')
  }
end
# Now product_details contains an array of hashes with product information
product_details.each do |product|
  puts "Name: #{product[:name]}, Price: #{product[:price]}, URL: #{product[:url]}"
end
In this example, at is a Nokogiri method that finds the first matching element for the provided CSS selector, while search finds all matching elements.
Conclusion
Using Mechanize with Nokogiri for stateful web scraping allows you to automate interactions with websites and extract data efficiently. Mechanize handles the stateful parts like maintaining sessions and submitting forms, while Nokogiri provides powerful document traversal and manipulation capabilities. By combining the two, you can create robust web scraping scripts that can handle complex scraping tasks.