How do you test Mechanize scripts and mock HTTP responses?
Testing Mechanize scripts is essential for building reliable web scraping applications. Unlike traditional unit tests, web scraping tests need to handle external dependencies, network requests, and dynamic content. This guide covers comprehensive testing strategies, mocking techniques, and best practices for testing Ruby Mechanize scripts.
Why Testing Mechanize Scripts is Important
Testing web scraping scripts presents unique challenges:
- External Dependencies: Scripts rely on third-party websites that can change or become unavailable
- Network Variability: Internet connectivity and server response times can affect test reliability
- Dynamic Content: Websites may serve different content based on location, time, or user session
- Rate Limiting: Frequent testing can trigger rate limits or IP blocks
Proper testing ensures your scraping scripts are robust, maintainable, and won't break when websites change.
Setting Up the Testing Environment
Required Gems
Add these gems to your Gemfile:
group :test do
  gem 'rspec'
  gem 'webmock'
  gem 'vcr'
  gem 'mechanize'
end
Install the dependencies:
bundle install
Basic RSpec Configuration
Create a spec/spec_helper.rb file:
require 'mechanize'
require 'webmock/rspec'
require 'vcr'

# Configure WebMock
WebMock.disable_net_connect!(allow_localhost: true)

# Configure VCR
VCR.configure do |config|
  config.cassette_library_dir = 'spec/vcr_cassettes'
  config.hook_into :webmock
  config.configure_rspec_metadata!
  config.allow_http_connections_when_no_cassette = false
end

RSpec.configure do |config|
  config.expect_with :rspec do |expectations|
    expectations.include_chain_clauses_in_custom_matcher_descriptions = true
  end
end
Mocking HTTP Responses with WebMock
WebMock allows you to stub HTTP requests and return predetermined responses, making tests fast and reliable.
Basic WebMock Example
require 'spec_helper'

RSpec.describe 'Product Scraper' do
  let(:agent) { Mechanize.new }

  before do
    # Mock the HTTP response
    stub_request(:get, 'https://example-shop.com/products')
      .to_return(
        status: 200,
        headers: { 'Content-Type' => 'text/html' },
        body: File.read('spec/fixtures/products_page.html')
      )
  end

  it 'extracts product information correctly' do
    page = agent.get('https://example-shop.com/products')
    products = page.search('.product')

    expect(products.length).to eq(10)

    first_product = products.first
    expect(first_product.at('.product-name').text.strip).to eq('Sample Product')
    expect(first_product.at('.price').text.strip).to eq('$29.99')
  end
end
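The spec above reads spec/fixtures/products_page.html from disk. If you do not have a captured page handy, a small script can generate a fixture satisfying the spec's assertions (ten .product nodes, the first named "Sample Product" at $29.99). The markup below is an assumption about page structure, not a real site's HTML:

```ruby
# Hypothetical fixture generator; the class names (.product, .product-name,
# .price) match what the spec's assertions expect.
products = (1..10).map do |i|
  name  = i == 1 ? 'Sample Product' : "Product #{i}"
  price = i == 1 ? '29.99' : format('%.2f', i * 5)
  <<~HTML
    <div class="product">
      <span class="product-name">#{name}</span>
      <span class="price">$#{price}</span>
    </div>
  HTML
end

fixture = "<html>\n<body>\n#{products.join}</body>\n</html>\n"
# Write it out only if the fixtures directory exists in this project
File.write('spec/fixtures/products_page.html', fixture) if Dir.exist?('spec/fixtures')
```

Generating fixtures in code keeps them in sync with your assertions, at the cost of being less realistic than a page captured from the live site.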
Advanced WebMock Patterns
RSpec.describe 'Advanced Scraping Scenarios' do
  let(:agent) { Mechanize.new }

  context 'handling different response codes' do
    it 'handles 404 errors gracefully' do
      stub_request(:get, 'https://example.com/missing-page')
        .to_return(status: 404)

      expect {
        agent.get('https://example.com/missing-page')
      }.to raise_error(Mechanize::ResponseCodeError)
    end

    it 'retries on 5xx errors' do
      stub_request(:get, 'https://example.com/unstable-page')
        .to_return(status: 500)
        .then.to_return(
          status: 200,
          body: '<html><body>Success</body></html>'
        )

      # Implement retry logic in your scraper
      page = scraper_with_retry('https://example.com/unstable-page')
      expect(page.body).to include('Success')
    end
  end

  context 'testing form submissions' do
    it 'mocks form submission responses' do
      # Mock the login form page
      stub_request(:get, 'https://example.com/login')
        .to_return(
          status: 200,
          body: File.read('spec/fixtures/login_form.html')
        )

      # Mock the form submission
      stub_request(:post, 'https://example.com/login')
        .with(
          body: hash_including({
            'username' => 'testuser',
            'password' => 'testpass'
          })
        )
        .to_return(
          status: 302,
          headers: { 'Location' => '/dashboard' }
        )

      # Mock the dashboard page the redirect lands on
      stub_request(:get, 'https://example.com/dashboard')
        .to_return(
          status: 200,
          body: '<html><body>Welcome, testuser!</body></html>'
        )

      login_page = agent.get('https://example.com/login')
      form = login_page.forms.first
      form.username = 'testuser'
      form.password = 'testpass'
      dashboard = agent.submit(form)

      expect(dashboard.body).to include('Welcome, testuser!')
    end
  end
end
Using VCR for Recording Real HTTP Interactions
VCR records real HTTP interactions and replays them in tests, providing more realistic test scenarios.
Basic VCR Usage
RSpec.describe 'Real Website Integration', :vcr do
  let(:agent) { Mechanize.new }

  it 'scrapes real website data' do
    VCR.use_cassette('github_homepage') do
      page = agent.get('https://github.com')
      expect(page.title).to include('GitHub')
    end
  end
end
Advanced VCR Configuration
# In spec_helper.rb
VCR.configure do |config|
  config.cassette_library_dir = 'spec/vcr_cassettes'
  config.hook_into :webmock

  # Filter sensitive data out of recorded cassettes
  config.filter_sensitive_data('<API_KEY>') { ENV['API_KEY'] }
  config.filter_sensitive_data('<SESSION_ID>') do |interaction|
    interaction.request.headers['Cookie']&.first
  end

  # Configure recording modes
  config.default_cassette_options = {
    record: :new_episodes, # Record new interactions, replay existing ones
    match_requests_on: [:method, :uri, :body]
  }
end
VCR with Dynamic Content
RSpec.describe 'Dynamic Content Scraping' do
  let(:agent) { Mechanize.new }

  it 'handles time-sensitive content', :vcr do
    VCR.use_cassette('weather_data', record: :new_episodes) do
      page = agent.get('https://weather-api.com/current')

      # Test structure rather than exact values for dynamic content
      expect(page.body).to match(/temperature.*\d+/)
      expect(page.search('.weather-condition')).not_to be_empty
    end
  end
end
Testing Complex Scraping Workflows
Multi-Page Scraping Tests
RSpec.describe 'E-commerce Scraper' do
  let(:scraper) { EcommerceScraper.new }

  before do
    # Mock category page
    stub_request(:get, 'https://shop.com/categories/electronics')
      .to_return(body: File.read('spec/fixtures/category_page.html'))

    # Mock individual product pages
    (1..5).each do |id|
      stub_request(:get, "https://shop.com/products/#{id}")
        .to_return(body: File.read("spec/fixtures/product_#{id}.html"))
    end
  end

  it 'scrapes all products from category' do
    products = scraper.scrape_category('electronics')

    expect(products.length).to eq(5)
    expect(products.first).to include(:name, :price, :description, :url)
    expect(products.all? { |p| p[:price] > 0 }).to be true
  end
end
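The spec assumes an EcommerceScraper class that is never shown. Below is a minimal sketch: the host and URL shapes mirror the stubs above, but the class internals are assumptions. The HTTP call is injectable as a `fetch` callable and the extraction uses plain string matching, purely so the crawl logic can run without Mechanize or network access; a real implementation would use `agent.get(url)` and `page.search` with CSS selectors instead of regexes.

```ruby
# Hypothetical sketch of the EcommerceScraper exercised by the spec above.
class EcommerceScraper
  BASE_URL = 'https://shop.com'.freeze # assumed host, matching the stubs

  def initialize(fetch: nil)
    # Default to a real Mechanize GET; the lambda only touches Mechanize
    # when actually called, so injected fetchers need no gems at all.
    @fetch = fetch || ->(url) { Mechanize.new.get(url).body }
  end

  def scrape_category(category)
    html = @fetch.call("#{BASE_URL}/categories/#{category}")
    # Follow every /products/<id> link found on the category page
    html.scan(%r{href="(/products/\d+)"}).flatten.map do |path|
      scrape_product("#{BASE_URL}#{path}")
    end
  end

  private

  def scrape_product(url)
    html = @fetch.call(url)
    {
      url: url,
      name: html[%r{<h2 class="name">(.*?)</h2>}m, 1],
      price: html[%r{<span class="price">\$([\d.]+)</span>}, 1].to_f,
      description: html[%r{<p class="description">(.*?)</p>}m, 1]
    }
  end
end
```

Injecting the fetcher is what makes the multi-page crawl testable: the WebMock version above exercises the Mechanize path, while unit tests can hand in canned HTML directly.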
Testing Error Handling and Resilience
RSpec.describe 'Scraper Error Handling' do
  let(:agent) { Mechanize.new }

  it 'handles network timeouts gracefully' do
    stub_request(:get, 'https://slow-site.com/page')
      .to_timeout

    agent.read_timeout = 1
    expect {
      agent.get('https://slow-site.com/page')
    }.to raise_error(Net::OpenTimeout)
  end

  it 'handles malformed HTML' do
    stub_request(:get, 'https://broken-site.com/page')
      .to_return(
        status: 200,
        body: '<html><div><p>Unclosed tags<div></html>'
      )

    page = agent.get('https://broken-site.com/page')

    # Mechanize (via Nokogiri) should still parse this
    expect(page.search('p').text).to eq('Unclosed tags')
  end

  it 'retries failed requests with exponential backoff' do
    call_count = 0
    stub_request(:get, 'https://unreliable-site.com/data')
      .to_return do
        call_count += 1
        if call_count < 3
          { status: 503 }
        else
          { status: 200, body: 'Success' }
        end
      end

    result = scraper_with_retry('https://unreliable-site.com/data')
    expect(result.body).to eq('Success')
    expect(call_count).to eq(3)
  end
end
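Both retry specs call a scraper_with_retry helper that they assume but never define. One possible sketch is below; the signature and defaults are assumptions, and the HTTP call is injectable so the backoff logic itself can be tested in isolation:

```ruby
# Hypothetical helper assumed by the retry specs above. `fetch` defaults to a
# plain Mechanize GET (which raises Mechanize::ResponseCodeError on 5xx);
# tests can inject any callable that raises on failure.
def scraper_with_retry(url, max_attempts: 3, base_delay: 0.5,
                       fetch: ->(u) { Mechanize.new.get(u) })
  attempts = 0
  begin
    attempts += 1
    fetch.call(url)
  rescue StandardError
    raise if attempts >= max_attempts
    sleep(base_delay * (2**(attempts - 1))) # exponential backoff: 0.5s, 1s, 2s, ...
    retry
  end
end
```

In the specs, the default fetcher is used, so each failed Mechanize request sleeps and retries until the WebMock stub finally returns a success response.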
Performance Testing
Memory Usage Testing
RSpec.describe 'Memory Performance' do
  let(:agent) { Mechanize.new }

  it 'does not leak memory during large scraping operations' do
    # Mock 100 pages
    (1..100).each do |i|
      stub_request(:get, "https://site.com/page#{i}")
        .to_return(body: "<html><body>Page #{i}</body></html>")
    end

    initial_memory = memory_usage

    (1..100).each do |i|
      page = agent.get("https://site.com/page#{i}")
      # Process page data
    end

    final_memory = memory_usage
    memory_increase = final_memory - initial_memory

    # Expect memory increase to be reasonable (adjust threshold as needed)
    expect(memory_increase).to be < 50 # MB
  end

  private

  def memory_usage
    `ps -o rss= -p #{Process.pid}`.to_i / 1024.0 # RSS is in KB; convert to MB
  end
end
Integration with CI/CD
GitHub Actions Configuration
Create a .github/workflows/test.yml file:
name: Test Mechanize Scripts

on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Ruby
        uses: ruby/setup-ruby@v1
        with:
          ruby-version: '3.2'
          bundler-cache: true

      - name: Run tests
        run: bundle exec rspec --format documentation

      - name: Upload VCR cassettes
        uses: actions/upload-artifact@v4
        if: failure()
        with:
          name: vcr-cassettes
          path: spec/vcr_cassettes/
Best Practices for Testing Mechanize Scripts
1. Separate Business Logic from HTTP Requests
class ProductScraper
  def initialize(agent = Mechanize.new)
    @agent = agent
  end

  def scrape_products(url)
    page = @agent.get(url)
    extract_products(page)
  end

  private

  def extract_products(page)
    page.search('.product').map do |product_element|
      {
        name: product_element.at('.name')&.text&.strip,
        price: parse_price(product_element.at('.price')&.text),
        url: product_element.at('a')&.[]('href')
      }
    end
  end

  def parse_price(price_text)
    return nil unless price_text
    digits = price_text.gsub(/[^\d.]/, '')
    digits.empty? ? nil : digits.to_f
  end
end
2. Test Edge Cases
RSpec.describe ProductScraper do
  let(:scraper) { ProductScraper.new(agent) }
  let(:agent) { instance_double(Mechanize) }

  context 'edge cases' do
    it 'handles missing product names' do
      allow(agent).to receive(:get).and_return(mock_page_with_incomplete_data)

      products = scraper.scrape_products('https://example.com')
      expect(products.first[:name]).to be_nil
    end

    it 'handles invalid price formats' do
      allow(agent).to receive(:get).and_return(mock_page_with_invalid_prices)

      products = scraper.scrape_products('https://example.com')
      expect(products.first[:price]).to be_nil
    end
  end
end
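These specs reference mock_page_with_incomplete_data and mock_page_with_invalid_prices without defining them. One way to sketch such helpers (the class names and structure here are assumptions) is with hand-rolled fakes that imitate only the slice of the page/element API that ProductScraper touches (#search, #at, #text), so no parsing library is needed:

```ruby
# Hypothetical fakes for the mock_page_* helpers used in the edge-case specs.
class FakeElement
  attr_reader :text

  def initialize(children = {}, text = nil)
    @children = children # selector => child FakeElement
    @text = text
  end

  def at(selector)
    @children[selector] # nil when the child is absent, like Nokogiri's #at
  end
end

class FakePage
  def initialize(elements)
    @elements = elements
  end

  def search(_selector)
    @elements
  end
end

def mock_page_with_incomplete_data
  # A product node with a price but no ".name" child
  FakePage.new([FakeElement.new('.price' => FakeElement.new({}, '$10.00'))])
end

def mock_page_with_invalid_prices
  # A price string containing no digits at all
  FakePage.new([FakeElement.new('.name' => FakeElement.new({}, 'Widget'),
                                '.price' => FakeElement.new({}, 'Call for price'))])
end
```

An alternative is to build real Nokogiri documents from HTML strings, which stays closer to production behavior at the cost of coupling the specs to a parser.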
3. Use Factories for Test Data
# spec/factories/pages.rb
FactoryBot.define do
factory :product_page, class: String do
initialize_with do
<<~HTML
<html>
<body>
<div class="product">
<h2 class="name">#{name}</h2>
<span class="price">$#{price}</span>
<a href="/product/#{id}">View Details</a>
</div>
</body>
</html>
HTML
end
transient do
name { 'Sample Product' }
price { '29.99' }
id { 1 }
end
end
end
Debugging Failed Tests
Logging HTTP Interactions
# Add to spec_helper.rb
WebMock.after_request do |request_signature, response|
puts "#{request_signature.method.upcase} #{request_signature.uri}"
puts "Response: #{response.status.first} #{response.status.last}"
puts "Body preview: #{response.body&.slice(0, 200)}..."
puts "-" * 50
end
Saving Failed Response Bodies
RSpec.configure do |config|
  config.after(:each) do |example|
    # Assumes @last_response is set by your scraper code -- for example,
    # from a Mechanize post_connect hook that stashes each response.
    if example.exception && @last_response
      File.write(
        "tmp/failed_response_#{example.full_description.gsub(/\W/, '_')}.html",
        @last_response.body
      )
    end
  end
end
When testing web scraping applications, it's important to consider the broader testing ecosystem. For JavaScript-based scraping solutions, you might want to explore how to handle browser sessions in Puppeteer for comparison with Mechanize's session handling capabilities.
Conclusion
Testing Mechanize scripts effectively requires a combination of mocking techniques, realistic test data, and comprehensive edge case coverage. By using WebMock for fast unit tests and VCR for integration tests, you can build a robust testing suite that ensures your web scraping applications remain reliable and maintainable.
Key takeaways:
- Use WebMock for fast, isolated unit tests
- Employ VCR for realistic integration testing with real HTTP interactions
- Test edge cases and error conditions thoroughly
- Separate business logic from HTTP requests for better testability
- Implement performance tests for memory usage and execution time
- Use CI/CD integration to catch regressions early
Similar testing principles apply to other scraping tools, and understanding how to handle timeouts in Puppeteer can provide additional insights into building resilient scraping applications across different technologies.