How do you create a basic Mechanize agent to fetch web pages?
Mechanize is a Ruby library that automates web interactions: it fetches pages, follows links, fills in forms, and tracks cookies for you, which makes it an excellent choice for web scraping tasks. Creating a basic Mechanize agent is straightforward and provides a solid foundation for more complex scraping operations.
Installing Mechanize
Before creating a Mechanize agent, you need to install the gem in your Ruby project:
gem install mechanize
Or add it to your Gemfile:
gem 'mechanize'
Then run:
bundle install
Creating a Basic Mechanize Agent
Here's how to create and use a basic Mechanize agent to fetch web pages:
Simple Page Fetching
require 'mechanize'
# Create a new Mechanize agent
agent = Mechanize.new
# Fetch a web page
page = agent.get('https://example.com')
# Access page content
puts page.title
puts page.body
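Because Mechanize parses pages with Nokogiri, you can query a fetched page with CSS or XPath selectors and work with links as first-class objects. A quick sketch (the selectors assume example.com's simple markup):
require 'mechanize'

agent = Mechanize.new
page = agent.get('https://example.com')

# CSS/XPath queries via the underlying Nokogiri document
page.search('h1').each { |heading| puts heading.text.strip }
first_paragraph = page.at('p')
puts first_paragraph.text if first_paragraph

# Links can be listed, filtered, and followed
page.links.each { |link| puts "#{link.text} -> #{link.href}" }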
Complete Basic Example
require 'mechanize'
class WebScraper
  def initialize
    @agent = Mechanize.new
    configure_agent
  end

  def fetch_page(url)
    page = @agent.get(url)
    puts "Successfully fetched: #{page.title}"
    page
  rescue Mechanize::ResponseCodeError => e
    puts "HTTP Error: #{e.response_code}"
    nil
  rescue StandardError => e
    puts "Error: #{e.message}"
    nil
  end

  private

  def configure_agent
    # Set a user agent so requests look more like a real browser
    @agent.user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    # Configure timeouts (in seconds)
    @agent.open_timeout = 10
    @agent.read_timeout = 30
    # Follow redirects automatically (enabled by default)
    @agent.redirect_ok = true
    @agent.redirection_limit = 5
  end
end
# Usage
scraper = WebScraper.new
page = scraper.fetch_page('https://httpbin.org/html')
if page
  # Extract specific elements
  puts "Page title: #{page.title}"
  puts "Number of links: #{page.links.count}"
end
Advanced Agent Configuration
Setting Custom Headers
require 'mechanize'
agent = Mechanize.new
# Set default headers sent with every request
agent.request_headers = {
  'Accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  'Accept-Language' => 'en-US,en;q=0.5',
  'Accept-Encoding' => 'gzip, deflate',
  'Connection' => 'keep-alive',
  'Upgrade-Insecure-Requests' => '1'
}
# Set a realistic user agent
agent.user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
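Headers set via request_headers apply to every request the agent makes. To send headers for a single request, Mechanize's get method also accepts a headers hash as its fourth argument (after parameters and referer):
# One-off headers: get(uri, parameters, referer, headers)
page = agent.get('https://example.com/api', [], nil, { 'X-Requested-With' => 'XMLHttpRequest' })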
Handling Cookies and Sessions
require 'mechanize'
agent = Mechanize.new
# A cookie jar is enabled by default; assigning a new one simply starts from an empty jar
agent.cookie_jar = Mechanize::CookieJar.new
# Fetch a page that sets cookies
login_page = agent.get('https://example.com/login')
# Cookies are automatically stored and sent with subsequent requests
dashboard = agent.get('https://example.com/dashboard')
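Cookies can also be persisted between runs. In recent Mechanize versions the jar is backed by the http-cookie gem, which supports saving and loading; the filename below is arbitrary:
# Save cookies to disk (session: true also keeps session-only cookies)
agent.cookie_jar.save('cookies.yml', session: true)

# In a later run, restore them before making requests
agent.cookie_jar.load('cookies.yml') if File.exist?('cookies.yml')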
Configuring SSL and Security
require 'mechanize'
agent = Mechanize.new
# Configure SSL verification
agent.verify_mode = OpenSSL::SSL::VERIFY_PEER
# For development/testing only - disable SSL verification
# agent.verify_mode = OpenSSL::SSL::VERIFY_NONE
# Point the agent at a custom CA bundle if needed
# agent.ca_file = '/path/to/ca-bundle.crt'
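If an endpoint requires mutual TLS, Mechanize can present a client certificate via its cert and key settings; the paths below are placeholders:
# Present a client certificate for mutual TLS (placeholder paths)
agent.cert = '/path/to/client_cert.pem'
agent.key  = '/path/to/client_key.pem'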
Working with Forms and Authentication
Basic Form Submission
require 'mechanize'
agent = Mechanize.new
page = agent.get('https://example.com/login')
# Find a form by name, falling back to the first form on the page
form = page.form_with(name: 'login') || page.forms.first
# Attribute-style accessors match the HTML input names (here: "username" and "password")
form.username = 'your_username'
form.password = 'your_password'
# Submit the form
result_page = agent.submit(form)
puts "Login result: #{result_page.title}"
Handling Different Response Types
require 'mechanize'
agent = Mechanize.new
# Handle different content types
agent.pluggable_parser['text/html'] = Mechanize::Page
agent.pluggable_parser['application/json'] = Mechanize::File
agent.pluggable_parser['text/plain'] = Mechanize::File
page = agent.get('https://api.example.com/data.json')
case page
when Mechanize::Page
  puts "HTML content: #{page.title}"
when Mechanize::File
  puts "File content: #{page.body}"
end
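Since Mechanize::File only exposes the raw response body, JSON responses can be parsed with the standard library:
require 'json'

data = JSON.parse(page.body)
puts data.inspect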
Error Handling and Debugging
Comprehensive Error Handling
require 'mechanize'

class RobustScraper
  # Status codes worth retrying: rate limiting and transient server errors
  RETRYABLE_CODES = %w[429 502 503 504].freeze

  def initialize
    @agent = Mechanize.new
    setup_agent
  end

  def fetch_with_retry(url, max_retries = 3)
    retries = 0
    begin
      @agent.get(url)
    rescue Mechanize::ResponseCodeError => e
      # response_code is a String in Mechanize
      if RETRYABLE_CODES.include?(e.response_code) && retries < max_retries
        sleep_time = 2**retries # exponential backoff: 1s, 2s, 4s...
        puts "Rate limited or server error. Retrying in #{sleep_time} seconds..."
        sleep(sleep_time)
        retries += 1
        retry
      elsif e.response_code == '404'
        puts "Page not found: #{url}"
        nil
      else
        puts "HTTP Error #{e.response_code}: #{e.message}"
        nil
      end
    rescue Net::OpenTimeout, Net::ReadTimeout
      if retries < max_retries
        puts "Timeout error. Retrying..."
        retries += 1
        retry
      else
        puts "Max retries exceeded for timeout: #{url}"
        nil
      end
    rescue StandardError => e
      puts "Unexpected error: #{e.class} - #{e.message}"
      nil
    end
  end

  private

  def setup_agent
    @agent.open_timeout = 10
    @agent.read_timeout = 30
    @agent.user_agent = 'Mozilla/5.0 (compatible; Ruby Scraper)'
    # Mechanize negotiates and decodes gzip automatically; no extra setting needed
  end
end
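Usage mirrors the earlier WebScraper example; httpbin's /status endpoint is handy for exercising the retry path:
scraper = RobustScraper.new
page = scraper.fetch_with_retry('https://httpbin.org/status/503')
puts page ? "Fetched: #{page.title}" : "Gave up after retries"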
Performance Optimization
Connection Pooling and Keep-Alive
require 'mechanize'

agent = Mechanize.new
# Keep-alive connections are enabled by default; set explicitly if desired
agent.keep_alive = true
# Seconds an idle persistent connection is kept open for reuse
agent.idle_timeout = 300

# Reusing a single agent lets requests to the same host share connections
urls = ['https://example.com/page1', 'https://example.com/page2', 'https://example.com/page3']
pages = urls.map do |url|
  agent.get(url)
end
Concurrent Processing with Thread Pool
require 'mechanize'
require 'concurrent'
class ConcurrentScraper
  def initialize(pool_size = 5)
    @pool = Concurrent::FixedThreadPool.new(pool_size)
  end

  def scrape_urls(urls)
    futures = urls.map do |url|
      Concurrent::Future.execute(executor: @pool) do
        # Give each thread its own agent; Mechanize agents are not thread-safe
        agent = Mechanize.new
        agent.user_agent = 'Mozilla/5.0 (compatible; Ruby Scraper)'
        begin
          page = agent.get(url)
          { url: url, title: page.title, success: true }
        rescue StandardError => e
          { url: url, error: e.message, success: false }
        end
      end
    end

    # Future#value blocks until each result is ready
    results = futures.map(&:value)
    @pool.shutdown
    results
  end
end
# Usage
scraper = ConcurrentScraper.new(3)
urls = ['https://example.com', 'https://google.com', 'https://github.com']
results = scraper.scrape_urls(urls)
results.each do |result|
  if result[:success]
    puts "#{result[:url]}: #{result[:title]}"
  else
    puts "#{result[:url]}: Error - #{result[:error]}"
  end
end
Best Practices and Considerations
Rate Limiting and Politeness
require 'mechanize'
class PoliteScraper
  def initialize(delay = 1)
    @agent = Mechanize.new
    @delay = delay
    @last_request_time = nil
    setup_agent
  end

  def get(url)
    # Implement a polite delay between requests
    if @last_request_time
      time_since_last = Time.now - @last_request_time
      sleep(@delay - time_since_last) if time_since_last < @delay
    end
    page = @agent.get(url)
    @last_request_time = Time.now
    page
  end

  private

  def setup_agent
    @agent.user_agent = 'Mozilla/5.0 (compatible; Ruby Scraper; +http://example.com/bot)'
    @agent.robots = true # Respect robots.txt
  end
end
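With robots enabled, Mechanize raises Mechanize::RobotsDisallowedError for paths that robots.txt forbids, so wrap calls accordingly:
scraper = PoliteScraper.new(2) # at least 2 seconds between requests
begin
  page = scraper.get('https://example.com/')
  puts page.title
rescue Mechanize::RobotsDisallowedError => e
  puts "Blocked by robots.txt: #{e.message}"
end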
User Agent Rotation
require 'mechanize'
class UserAgentRotator
  USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101'
  ].freeze

  def initialize
    @agent = Mechanize.new
  end

  def get(url)
    # Pick a random user agent for each request
    @agent.user_agent = USER_AGENTS.sample
    @agent.get(url)
  end
end
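A quick demonstration against httpbin, which echoes back the User-Agent header it received:
rotator = UserAgentRotator.new
3.times do
  page = rotator.get('https://httpbin.org/user-agent')
  puts page.body # shows the User-Agent header that was sent
end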
Comparing with Browser Automation Tools
While Mechanize is excellent for straightforward scraping tasks, it cannot execute JavaScript, which limits it on script-heavy websites. For such sites you may need a browser automation tool; for example, navigating to different pages using Puppeteer offers an alternative approach for complex web applications.
Troubleshooting Common Issues
Debugging Network Requests
require 'mechanize'
require 'logger'
agent = Mechanize.new
agent.log = Logger.new(STDOUT)
agent.log.level = Logger::DEBUG
# This will log all HTTP requests and responses
page = agent.get('https://example.com')
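Beyond logging, you can inspect the last response directly from the page and the agent:
puts "Status code: #{page.code}"   # HTTP status as a string, e.g. "200"
puts "Final URL: #{page.uri}"      # reflects any redirects that were followed
puts "Pages visited: #{agent.history.size}"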
Handling JavaScript Requirements
# Mechanize cannot execute JavaScript
# For JavaScript-heavy sites, consider using:
# - Watir with headless browsers
# - Puppeteer (Node.js)
# - Selenium WebDriver
# Example check for a JavaScript requirement
require 'mechanize'

agent = Mechanize.new
page = agent.get('https://example.com')
if page.body.include?('Please enable JavaScript')
  puts "This site requires JavaScript - consider using a headless browser"
end
Integration with WebScraping.AI API
For more complex scraping tasks that require JavaScript execution or advanced features, consider using the WebScraping.AI API which handles browser automation, proxy rotation, and CAPTCHA solving automatically. This can complement Mechanize for comprehensive web scraping solutions.
Conclusion
Creating a basic Mechanize agent is straightforward and provides a solid foundation for web scraping projects. The library excels at handling forms, cookies, sessions, and standard HTTP interactions. While it cannot execute JavaScript like browser automation tools such as Puppeteer (with features like the waitFor function for dynamic content), Mechanize is a great fit for traditional scraping tasks where speed and efficiency are priorities.
Remember to always respect robots.txt files, implement proper rate limiting, and handle errors gracefully to create robust and ethical web scraping applications.