How to Set Custom Headers When Making Requests with Mechanize

When web scraping with Mechanize in Ruby, setting custom HTTP headers is essential for mimicking real browser behavior, handling authentication, and bypassing certain restrictions. Custom headers allow you to control how your requests appear to target servers, making your scraping activities more reliable and less likely to be blocked.

Understanding HTTP Headers in Web Scraping

HTTP headers are key-value pairs sent with every HTTP request that provide additional information about the request or the client making it. Common headers include User-Agent (identifying the browser), Accept (specifying content types), Authorization (for authentication), and many others.
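
On the wire, headers appear as plain key-value lines at the top of the request. An illustrative (trimmed) GET request might read:

GET /products HTTP/1.1
Host: example.com
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36
Accept: text/html,application/xhtml+xml
Accept-Language: en-US,en;q=0.5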

Basic Header Configuration in Mechanize

Mechanize provides several methods to set custom headers. The most straightforward approach is using the request_headers property:

require 'mechanize'

agent = Mechanize.new

# Set headers using request_headers
agent.request_headers = {
  'User-Agent' => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
  'Accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  'Accept-Language' => 'en-US,en;q=0.5',
  'Accept-Encoding' => 'gzip, deflate',
  'Connection' => 'keep-alive'
}

page = agent.get('https://example.com')

Setting Individual Headers

You can also set headers individually, which is useful when you need to modify specific headers without affecting others:

agent = Mechanize.new

# Set individual headers
agent.request_headers['User-Agent'] = 'CustomBot/1.0'
agent.request_headers['Referer'] = 'https://google.com'
agent.request_headers['X-Requested-With'] = 'XMLHttpRequest'

# These headers will be sent with all subsequent requests
page = agent.get('https://api.example.com/data')
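
Since request_headers is an ordinary Ruby Hash, standard Hash operations also apply, which is handy for inspecting or dropping a header you set earlier:

# Inspect the headers currently configured on the agent
puts agent.request_headers.inspect

# Stop sending a header on future requests
agent.request_headers.delete('Referer')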

Using the user_agent Property

Mechanize provides a convenient shortcut for setting the User-Agent header:

agent = Mechanize.new

# Method 1: Direct assignment
agent.user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36'

# Method 2: Using predefined aliases
agent.user_agent_alias = 'Windows Chrome'

# Method 3: Registering a custom alias (note: this mutates the
# library-wide AGENT_ALIASES hash)
Mechanize::AGENT_ALIASES['Custom'] = 'MyCustomBot/2.0 (compatible; Ruby/Mechanize)'
agent.user_agent_alias = 'Custom'
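
To see which aliases your installed version ships with, print the keys of the aliases hash:

# Lists names such as 'Linux Firefox', 'Mac Safari', 'Windows Chrome'
# (the exact set depends on your Mechanize version)
puts Mechanize::AGENT_ALIASES.keys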

Per-Request Header Customization

Sometimes you need different headers for specific requests. Mechanize supports this through the headers argument that get and post accept, so the extra headers apply only to that one request:

require 'mechanize'
require 'base64'
require 'json'

agent = Mechanize.new

# Headers for a specific GET request: the fourth positional
# argument to #get is a headers hash
page = agent.get('https://example.com', [], nil, {
  'Authorization' => 'Bearer your_token_here',
  'X-API-Key' => 'your_api_key'
})

# Headers for a POST request: a String query is used as the raw
# request body, and the third argument is the headers hash
agent.post('https://api.example.com/submit',
  JSON.generate({ data: 'value' }),
  'Authorization' => 'Basic ' + Base64.strict_encode64('username:password'),
  'Content-Type' => 'application/json'
)

Authentication Headers

Setting authentication headers is crucial when scraping protected content:

require 'mechanize'
require 'base64'

agent = Mechanize.new

# Basic Authentication (strict_encode64 avoids the trailing newline
# that encode64 would append)
credentials = Base64.strict_encode64('username:password')
agent.request_headers['Authorization'] = "Basic #{credentials}"

# Bearer Token Authentication
agent.request_headers['Authorization'] = 'Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...'

# API Key Authentication
agent.request_headers['X-API-Key'] = 'your_api_key_here'
agent.request_headers['X-RapidAPI-Key'] = 'your_rapidapi_key'

# Custom authentication headers
agent.request_headers['X-Auth-Token'] = 'custom_token'
agent.request_headers['X-Session-ID'] = 'session_identifier'
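
For HTTP Basic and Digest authentication specifically, Mechanize also offers a built-in alternative to hand-crafting the header: add_auth registers credentials for a URI and answers the server's challenge automatically (the host below is a placeholder):

agent = Mechanize.new

# Let Mechanize negotiate Basic/Digest auth for this host
agent.add_auth('https://protected.example.com', 'username', 'password')

page = agent.get('https://protected.example.com/dashboard')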

Advanced Header Management

For complex scraping scenarios, you might need dynamic header management:

require 'mechanize'

class AdvancedMechanizeAgent
  def initialize
    @agent = Mechanize.new
    setup_default_headers
  end

  # Public entry point: per-request headers are merged over the
  # defaults for the duration of a single fetch, then restored
  def scrape_with_custom_headers(url, custom_headers = {})
    original_headers = @agent.request_headers.dup
    @agent.request_headers.merge!(custom_headers)

    begin
      @agent.get(url)
    ensure
      # Restore original headers even if the request raises
      @agent.request_headers = original_headers
    end
  end

  private

  def setup_default_headers
    @agent.request_headers = {
      'User-Agent' => random_user_agent,
      'Accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
      'Accept-Language' => 'en-US,en;q=0.9',
      # Advertise only encodings Mechanize can decode; it has no
      # built-in Brotli ('br') support
      'Accept-Encoding' => 'gzip, deflate',
      'DNT' => '1',
      'Connection' => 'keep-alive',
      'Upgrade-Insecure-Requests' => '1'
    }
  end

  def random_user_agent
    user_agents = [
      'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
      'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
      'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36'
    ]
    user_agents.sample
  end
end

# Usage
scraper = AdvancedMechanizeAgent.new
page = scraper.scrape_with_custom_headers(
  'https://example.com',
  { 'Referer' => 'https://google.com', 'X-Requested-With' => 'XMLHttpRequest' }
)

Common Header Patterns for Web Scraping

Here are some commonly used header combinations for different scenarios:

# Mobile device simulation
agent.request_headers = {
  'User-Agent' => 'Mozilla/5.0 (iPhone; CPU iPhone OS 14_7_1 like Mac OS X) AppleWebKit/605.1.15',
  'Accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  'Accept-Language' => 'en-US,en;q=0.9',
  'Accept-Encoding' => 'gzip, deflate',
  'Connection' => 'keep-alive'
}

# AJAX request simulation
agent.request_headers = {
  'User-Agent' => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
  'Accept' => 'application/json, text/javascript, */*; q=0.01',
  'Accept-Language' => 'en-US,en;q=0.9',
  'Accept-Encoding' => 'gzip, deflate', # Mechanize cannot decode Brotli ('br')
  'X-Requested-With' => 'XMLHttpRequest',
  'Referer' => 'https://example.com/page'
}

# API client simulation
agent.request_headers = {
  'User-Agent' => 'YourApp/1.0 (Ruby Mechanize)',
  'Accept' => 'application/json',
  'Content-Type' => 'application/json',
  'Cache-Control' => 'no-cache'
}

Debugging and Monitoring Headers

To verify that your headers are being sent correctly, enable Mechanize's debug logging:

require 'logger'

agent = Mechanize.new
agent.log = Logger.new(STDOUT)
agent.log.level = Logger::DEBUG

# This will show all HTTP communication including headers
page = agent.get('https://httpbin.org/headers')
puts page.body  # Will show the headers received by the server
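
If you prefer inspecting outgoing headers programmatically rather than reading debug logs, a pre-connect hook works too; hooks are called with the agent and the underlying Net::HTTP request just before it is sent:

agent.pre_connect_hooks << lambda do |hook_agent, request|
  # Dump every header about to go over the wire
  request.each_header { |name, value| puts "#{name}: #{value}" }
end

agent.get('https://httpbin.org/headers')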

Best Practices and Considerations

Header Rotation

For large-scale scraping, consider rotating headers to avoid detection:

def rotate_headers(agent)
  headers_sets = [
    { 'User-Agent' => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36' },
    { 'User-Agent' => 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36' },
    { 'User-Agent' => 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36' }
  ]

  agent.request_headers.merge!(headers_sets.sample)
end
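
You might call the rotation helper inside your crawl loop, paired with a polite delay (the urls list below is a stand-in for your own targets):

urls = ['https://example.com/page1', 'https://example.com/page2']

urls.each do |url|
  rotate_headers(agent)
  page = agent.get(url)
  sleep rand(1.0..3.0)  # small random pause between requests
end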

Respect robots.txt

Always check the website's robots.txt file and respect rate limiting to maintain ethical scraping practices.
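
Mechanize can enforce the robots.txt part for you: with robots enabled, the agent fetches each site's robots.txt and raises Mechanize::RobotsDisallowedError for disallowed URLs. A minimal sketch:

agent = Mechanize.new
agent.robots = true  # obey robots.txt automatically

begin
  page = agent.get('https://example.com/private-area')
rescue Mechanize::RobotsDisallowedError => e
  puts "Blocked by robots.txt: #{e.message}"
end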

Error Handling

Implement proper error handling when working with custom headers:

begin
  agent.request_headers['Authorization'] = "Bearer #{token}"
  page = agent.get(url)
rescue Mechanize::UnauthorizedError => e
  puts "Authentication failed: #{e.message}"
rescue Mechanize::Error => e
  puts "Request failed: #{e.message}"
end
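
For transient failures such as HTTP 429 or 5xx responses, you can layer a simple retry with exponential backoff on top of that rescue logic; this helper is a sketch, not a Mechanize built-in:

def fetch_with_retries(agent, url, attempts = 3)
  attempt = 0
  begin
    attempt += 1
    agent.get(url)
  rescue Mechanize::ResponseCodeError => e
    # Retry rate-limit and server errors; response_code is a String
    if attempt < attempts && %w[429 500 502 503].include?(e.response_code)
      sleep(2**attempt)  # exponential backoff: 2s, 4s, ...
      retry
    end
    raise
  end
end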

Conclusion

Setting custom headers in Mechanize is essential for effective web scraping. Whether you're simulating different browsers, handling authentication, or making API requests, proper header management ensures your scraping activities are successful and respectful of target websites. Remember to always follow ethical scraping practices and respect website terms of service.

For more complex scenarios involving JavaScript-heavy websites, consider exploring how to handle browser sessions in Puppeteer or learn about handling authentication in Puppeteer for headless browser automation alternatives.
