How do you test Mechanize scripts and mock HTTP responses?

Testing Mechanize scripts is essential for building reliable web scraping applications. Unlike traditional unit tests, web scraping tests need to handle external dependencies, network requests, and dynamic content. This guide covers comprehensive testing strategies, mocking techniques, and best practices for testing Ruby Mechanize scripts.

Why Testing Mechanize Scripts is Important

Testing web scraping scripts presents unique challenges:

  • External Dependencies: Scripts rely on third-party websites that can change or become unavailable
  • Network Variability: Internet connectivity and server response times can affect test reliability
  • Dynamic Content: Websites may serve different content based on location, time, or user session
  • Rate Limiting: Frequent testing can trigger rate limits or IP blocks

Proper testing keeps your scraping scripts robust and maintainable, and surfaces failures quickly when target websites change.

Setting Up the Testing Environment

Required Gems

Add these gems to your Gemfile:

# Mechanize is a runtime dependency; keep it outside the test group
gem 'mechanize'

group :test do
  gem 'rspec'
  gem 'webmock'
  gem 'vcr'
end

Install the dependencies:

bundle install

Basic RSpec Configuration

Create a spec/spec_helper.rb file:

require 'mechanize'
require 'webmock/rspec'
require 'vcr'

# Configure WebMock
WebMock.disable_net_connect!(allow_localhost: true)

# Configure VCR
VCR.configure do |config|
  config.cassette_library_dir = 'spec/vcr_cassettes'
  config.hook_into :webmock
  config.configure_rspec_metadata!
  config.allow_http_connections_when_no_cassette = false
end

RSpec.configure do |config|
  config.expect_with :rspec do |expectations|
    expectations.include_chain_clauses_in_custom_matcher_descriptions = true
  end
end

Mocking HTTP Responses with WebMock

WebMock allows you to stub HTTP requests and return predetermined responses, making tests fast and reliable.

Basic WebMock Example

require 'spec_helper'

RSpec.describe 'Product Scraper' do
  let(:agent) { Mechanize.new }

  before do
    # Mock the HTTP response
    stub_request(:get, 'https://example-shop.com/products')
      .to_return(
        status: 200,
        headers: { 'Content-Type' => 'text/html' },
        body: File.read('spec/fixtures/products_page.html')
      )
  end

  it 'extracts product information correctly' do
    page = agent.get('https://example-shop.com/products')

    products = page.search('.product')
    expect(products.length).to eq(10)

    first_product = products.first
    expect(first_product.at('.product-name').text.strip).to eq('Sample Product')
    expect(first_product.at('.price').text.strip).to eq('$29.99')
  end
end
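
If you prefer not to maintain an HTML fixture file, the stubbed body can be generated inline. A sketch that builds the ten-product page the example above expects (the markup mirrors what the fixture is assumed to contain):

# Builds the same ten-product page inline; pass it to the stub
# as body: products_html instead of reading the fixture file.
let(:products_html) do
  items = (1..10).map do |i|
    name  = i == 1 ? 'Sample Product' : "Product #{i}"
    price = i == 1 ? '29.99' : format('%.2f', 10.0 + i)
    "<div class=\"product\"><h2 class=\"product-name\">#{name}</h2>" \
      "<span class=\"price\">$#{price}</span></div>"
  end
  "<html><body>#{items.join}</body></html>"
end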

Advanced WebMock Patterns

RSpec.describe 'Advanced Scraping Scenarios' do
  let(:agent) { Mechanize.new }

  context 'handling different response codes' do
    it 'handles 404 errors gracefully' do
      stub_request(:get, 'https://example.com/missing-page')
        .to_return(status: 404)

      expect {
        agent.get('https://example.com/missing-page')
      }.to raise_error(Mechanize::ResponseCodeError)
    end

    it 'retries on 5xx errors' do
      stub_request(:get, 'https://example.com/unstable-page')
        .to_return(status: 500)
        .then.to_return(
          status: 200,
          headers: { 'Content-Type' => 'text/html' },
          body: '<html><body>Success</body></html>'
        )

      # Uses the scraper_with_retry helper sketched after this section
      page = scraper_with_retry('https://example.com/unstable-page')
      expect(page.body).to include('Success')
    end
  end

  context 'testing form submissions' do
    it 'mocks form submission responses' do
      # Mock the login form page; the text/html Content-Type is what
      # makes Mechanize parse the response as a Page with forms
      stub_request(:get, 'https://example.com/login')
        .to_return(
          status: 200,
          headers: { 'Content-Type' => 'text/html' },
          body: File.read('spec/fixtures/login_form.html')
        )

      # Mock the form submission
      stub_request(:post, 'https://example.com/login')
        .with(
          body: hash_including({
            'username' => 'testuser',
            'password' => 'testpass'
          })
        )
        .to_return(
          status: 302,
          headers: { 'Location' => '/dashboard' }
        )

      # Mock the dashboard page
      stub_request(:get, 'https://example.com/dashboard')
        .to_return(
          status: 200,
          headers: { 'Content-Type' => 'text/html' },
          body: '<html><body>Welcome, testuser!</body></html>'
        )

      login_page = agent.get('https://example.com/login')
      form = login_page.forms.first
      form.username = 'testuser'
      form.password = 'testpass'

      dashboard = agent.submit(form)
      expect(dashboard.body).to include('Welcome, testuser!')
    end
  end
end
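
The retry example above calls a scraper_with_retry helper that is not part of Mechanize; you define it in your own code. A minimal sketch, assuming a standalone helper with exponential backoff (the name, signature, and delay values are illustrative):

require 'mechanize'

# Hypothetical retry helper used by the specs above: retries 5xx
# responses with exponential backoff before giving up.
def scraper_with_retry(url, max_attempts: 3, base_delay: 0.1)
  agent = Mechanize.new
  attempts = 0
  begin
    attempts += 1
    agent.get(url)
  rescue Mechanize::ResponseCodeError => e
    raise if attempts >= max_attempts || !e.response_code.start_with?('5')
    sleep(base_delay * (2**(attempts - 1))) # 0.1s, 0.2s, 0.4s, ...
    retry
  end
end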

Using VCR for Recording Real HTTP Interactions

VCR records real HTTP interactions and replays them in tests, providing more realistic test scenarios.

Basic VCR Usage

RSpec.describe 'Real Website Integration' do
  let(:agent) { Mechanize.new }

  it 'scrapes real website data' do
    VCR.use_cassette('github_homepage') do
      page = agent.get('https://github.com')
      expect(page.title).to include('GitHub')
    end
  end
end

Because configure_rspec_metadata! is enabled in the spec helper, tagging an example with :vcr wraps it in an automatically named cassette, so the explicit VCR.use_cassette block is only needed when you want to control the cassette name yourself.

Advanced VCR Configuration

# In spec_helper.rb
VCR.configure do |config|
  config.cassette_library_dir = 'spec/vcr_cassettes'
  config.hook_into :webmock

  # Filter sensitive data
  config.filter_sensitive_data('<API_KEY>') { ENV['API_KEY'] }
  config.filter_sensitive_data('<SESSION_ID>') do |interaction|
    interaction.request.headers['Cookie']&.first
  end

  # Configure recording modes
  config.default_cassette_options = {
    record: :new_episodes,    # Record new interactions, replay existing ones
    match_requests_on: [:method, :uri, :body]
  }
end
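
Cassettes recorded against live sites go stale over time. VCR can re-record a cassette automatically once it is older than a given age via the re_record_interval option (the seven-day value below is just an example; the option takes seconds):

VCR.configure do |config|
  # Re-record any cassette older than seven days
  config.default_cassette_options[:re_record_interval] = 7 * 24 * 60 * 60
end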

VCR with Dynamic Content

RSpec.describe 'Dynamic Content Scraping' do
  let(:agent) { Mechanize.new }

  it 'handles time-sensitive content' do
    VCR.use_cassette('weather_data', record: :new_episodes) do
      page = agent.get('https://weather-api.com/current')

      # Test structure rather than exact values for dynamic content
      expect(page.body).to match(/temperature.*\d+/)
      expect(page.search('.weather-condition')).not_to be_empty
    end
  end
end

Testing Complex Scraping Workflows

Multi-Page Scraping Tests

RSpec.describe 'E-commerce Scraper' do
  let(:scraper) { EcommerceScraper.new }

  before do
    # Mock the category page (text/html so Mechanize parses it as a Page)
    stub_request(:get, 'https://shop.com/categories/electronics')
      .to_return(
        headers: { 'Content-Type' => 'text/html' },
        body: File.read('spec/fixtures/category_page.html')
      )

    # Mock individual product pages
    (1..5).each do |id|
      stub_request(:get, "https://shop.com/products/#{id}")
        .to_return(
          headers: { 'Content-Type' => 'text/html' },
          body: File.read("spec/fixtures/product_#{id}.html")
        )
    end
  end

  it 'scrapes all products from category' do
    products = scraper.scrape_category('electronics')

    expect(products.length).to eq(5)
    expect(products.first).to include(:name, :price, :description, :url)
    expect(products.all? { |p| p[:price] > 0 }).to be true
  end
end
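
The spec above assumes an EcommerceScraper class. A minimal sketch of one that would satisfy it, where the class name, URL layout, and CSS selectors are assumptions matching the stubs and fixtures:

require 'mechanize'

# Hypothetical scraper: collects product links from a category page,
# then visits each product page and extracts its fields.
class EcommerceScraper
  BASE_URL = 'https://shop.com'

  def initialize(agent = Mechanize.new)
    @agent = agent
  end

  def scrape_category(category)
    category_page = @agent.get("#{BASE_URL}/categories/#{category}")
    category_page.search('.product a').map { |link| scrape_product(link['href']) }
  end

  private

  def scrape_product(path)
    page = @agent.get("#{BASE_URL}#{path}")
    {
      name: page.at('.name')&.text&.strip,
      price: page.at('.price')&.text.to_s.gsub(/[^\d.]/, '').to_f,
      description: page.at('.description')&.text&.strip,
      url: page.uri.to_s
    }
  end
end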

Testing Error Handling and Resilience

RSpec.describe 'Scraper Error Handling' do
  let(:agent) { Mechanize.new }

  it 'handles network timeouts gracefully' do
    stub_request(:get, 'https://slow-site.com/page')
      .to_timeout

    agent.read_timeout = 1

    expect {
      agent.get('https://slow-site.com/page')
    }.to raise_error(Net::OpenTimeout)
  end

  it 'handles malformed HTML' do
    stub_request(:get, 'https://broken-site.com/page')
      .to_return(
        status: 200,
        headers: { 'Content-Type' => 'text/html' },
        body: '<html><div><p>Unclosed tags<div></html>'
      )

    page = agent.get('https://broken-site.com/page')
    # Mechanize should still parse this
    expect(page.search('p').text).to eq('Unclosed tags')
  end

  it 'retries failed requests with exponential backoff' do
    call_count = 0

    stub_request(:get, 'https://unreliable-site.com/data')
      .to_return do
        call_count += 1
        if call_count < 3
          { status: 503 }
        else
          { status: 200, body: 'Success' }
        end
      end

    result = scraper_with_retry('https://unreliable-site.com/data')
    expect(result.body).to eq('Success')
    expect(call_count).to eq(3)
  end
end

Performance Testing

Memory Usage Testing

RSpec.describe 'Memory Performance' do
  let(:agent) { Mechanize.new }

  it 'does not leak memory during large scraping operations' do
    # Mock 100 pages
    (1..100).each do |i|
      stub_request(:get, "https://site.com/page#{i}")
        .to_return(body: "<html><body>Page #{i}</body></html>")
    end

    initial_memory = memory_usage

    (1..100).each do |i|
      page = agent.get("https://site.com/page#{i}")
      # Process page data
    end

    final_memory = memory_usage
    memory_increase = final_memory - initial_memory

    # Expect memory increase to be reasonable (adjust threshold as needed)
    expect(memory_increase).to be < 50 # MB
  end

  private

  def memory_usage
    `ps -o rss= -p #{Process.pid}`.to_i / 1024.0 # Convert to MB
  end
end
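
Process RSS is a noisy signal, so forcing a full garbage-collection pass before each sample makes the before/after comparison somewhat more stable. A variant of the helper above:

def memory_usage_after_gc
  GC.start # collect garbage first so transient objects don't skew the sample
  `ps -o rss= -p #{Process.pid}`.to_i / 1024.0 # RSS in MB
end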

Integration with CI/CD

GitHub Actions Configuration

Create .github/workflows/test.yml:

name: Test Mechanize Scripts

on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest

    steps:
    - uses: actions/checkout@v4

    - name: Set up Ruby
      uses: ruby/setup-ruby@v1
      with:
        ruby-version: '3.0'
        bundler-cache: true

    - name: Run tests
      run: |
        bundle exec rspec --format documentation

    - name: Upload VCR cassettes
      uses: actions/upload-artifact@v4
      if: failure()
      with:
        name: vcr-cassettes
        path: spec/vcr_cassettes/

Best Practices for Testing Mechanize Scripts

1. Separate Business Logic from HTTP Requests

class ProductScraper
  def initialize(agent = Mechanize.new)
    @agent = agent
  end

  def scrape_products(url)
    page = @agent.get(url)
    extract_products(page)
  end

  private

  def extract_products(page)
    page.search('.product').map do |product_element|
      {
        name: product_element.at('.name')&.text&.strip,
        price: parse_price(product_element.at('.price')&.text),
        url: product_element.at('a')&.[]('href')
      }
    end
  end

  def parse_price(price_text)
    return nil unless price_text

    # Return nil (rather than 0.0) when the text contains no digits,
    # e.g. 'N/A', so callers can tell invalid prices apart
    digits = price_text.gsub(/[^\d.]/, '')
    digits.empty? ? nil : digits.to_f
  end
end
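
With parsing separated from fetching, the extraction logic can be unit-tested without any HTTP stubbing at all. Nokogiri documents respond to search and at just like Mechanize pages, so a sketch can feed extract_products a parsed string directly (using send because the method is private):

require 'nokogiri'

RSpec.describe ProductScraper do
  it 'parses product markup without any HTTP calls' do
    html = <<~HTML
      <div class="product">
        <span class="name">Widget</span>
        <span class="price">$9.99</span>
        <a href="/products/widget">Details</a>
      </div>
    HTML

    products = ProductScraper.new.send(:extract_products, Nokogiri::HTML(html))

    expect(products.first[:name]).to eq('Widget')
    expect(products.first[:price]).to eq(9.99)
  end
end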

2. Test Edge Cases

RSpec.describe ProductScraper do
  let(:scraper) { ProductScraper.new(agent) }
  let(:agent) { instance_double(Mechanize) }

  context 'edge cases' do
    it 'handles missing product names' do
      allow(agent).to receive(:get).and_return(mock_page_with_incomplete_data)

      products = scraper.scrape_products('https://example.com')
      expect(products.first[:name]).to be_nil
    end

    it 'handles invalid price formats' do
      allow(agent).to receive(:get).and_return(mock_page_with_invalid_prices)

      products = scraper.scrape_products('https://example.com')
      expect(products.first[:price]).to be_nil
    end
  end
end
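
The mock_page_with_incomplete_data and mock_page_with_invalid_prices helpers are left undefined above. One way to build them is to construct Mechanize::Page objects by hand; this sketch assumes the (uri, response, body, code, mech) constructor, where a plain hash works as the response because Mechanize only reads its headers:

# Hypothetical helpers for the edge-case specs above
def build_mock_page(html)
  Mechanize::Page.new(
    URI('https://example.com'),
    { 'content-type' => 'text/html' },
    html,
    200,
    Mechanize.new
  )
end

def mock_page_with_incomplete_data
  build_mock_page('<div class="product"><span class="price">$5.00</span></div>')
end

def mock_page_with_invalid_prices
  build_mock_page(
    '<div class="product"><span class="name">X</span><span class="price">N/A</span></div>'
  )
end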

3. Use Factories for Test Data

# spec/factories/pages.rb
FactoryBot.define do
  factory :product_page, class: String do
    initialize_with do
      <<~HTML
        <html>
          <body>
            <div class="product">
              <h2 class="name">#{name}</h2>
              <span class="price">$#{price}</span>
              <a href="/product/#{id}">View Details</a>
            </div>
          </body>
        </html>
      HTML
    end

    transient do
      name { 'Sample Product' }
      price { '29.99' }
      id { 1 }
    end
  end
end
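
A factory like this plugs straight into WebMock stubs. A usage sketch, assuming FactoryBot's syntax methods are included in your RSpec configuration:

it 'scrapes a generated product page' do
  html = build(:product_page, name: 'Test Widget', price: '19.99')

  stub_request(:get, 'https://example.com/products/1')
    .to_return(
      status: 200,
      headers: { 'Content-Type' => 'text/html' },
      body: html
    )

  page = Mechanize.new.get('https://example.com/products/1')
  expect(page.at('.name').text).to eq('Test Widget')
  expect(page.at('.price').text).to eq('$19.99')
end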

Debugging Failed Tests

Logging HTTP Interactions

# Add to spec_helper.rb
WebMock.after_request do |request_signature, response|
  puts "#{request_signature.method.upcase} #{request_signature.uri}"
  puts "Response: #{response.status.first} #{response.status.last}"
  puts "Body preview: #{response.body&.slice(0, 200)}..."
  puts "-" * 50
end

Saving Failed Response Bodies

RSpec.configure do |config|
  config.after(:each) do |example|
    if example.exception && @last_response
      File.write(
        "tmp/failed_response_#{example.full_description.gsub(/\W/, '_')}.html",
        @last_response.body
      )
    end
  end
end
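
Note that @last_response is not something Mechanize sets for you; the suite has to capture it. One way is a Mechanize post-connect hook, which receives the agent, URI, response, and body for every request (a sketch; specs must then use the shared @agent):

require 'ostruct'

RSpec.configure do |config|
  config.before(:each) do
    @agent = Mechanize.new
    # Stash each response so the after(:each) hook above has
    # something to write out when an example fails
    @agent.post_connect_hooks << lambda do |_agent, _uri, response, body|
      @last_response = OpenStruct.new(code: response.code, body: body)
    end
  end
end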

When testing web scraping applications, it's important to consider the broader testing ecosystem. For JavaScript-based scraping solutions, you might want to explore how to handle browser sessions in Puppeteer for comparison with Mechanize's session handling capabilities.

Conclusion

Testing Mechanize scripts effectively requires a combination of mocking techniques, realistic test data, and comprehensive edge case coverage. By using WebMock for fast unit tests and VCR for integration tests, you can build a robust testing suite that ensures your web scraping applications remain reliable and maintainable.

Key takeaways:

  • Use WebMock for fast, isolated unit tests
  • Employ VCR for realistic integration testing with real HTTP interactions
  • Test edge cases and error conditions thoroughly
  • Separate business logic from HTTP requests for better testability
  • Implement performance tests for memory usage and execution time
  • Use CI/CD integration to catch regressions early

Similar testing principles apply to other scraping tools, and understanding how to handle timeouts in Puppeteer can provide additional insights into building resilient scraping applications across different technologies.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
