How to Handle Cookies with Specific Domain or Path Settings in Mechanize

Managing cookies with specific domain and path settings matters when scraping with Mechanize, especially against complex authentication systems, sites that span several subdomains, or applications that scope sessions by path. This guide shows how to create, inspect, modify, persist, and debug such cookies in Mechanize.

Understanding Cookie Domain and Path Attributes

Before diving into implementation, it's essential to understand how cookie domains and paths work:

Domain: Specifies which hosts can receive the cookie. A cookie with domain=.example.com is sent to example.com and to its subdomains, such as www.example.com and api.example.com. A cookie set without a Domain attribute is host-only and goes back only to the host that set it.
Path: Restricts the cookie to request paths at or below the given path. A cookie with path=/admin is sent to /admin and /admin/users, but not to /administrator.
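
The two rules above can be sketched in plain Ruby. This is a simplified reading of the RFC 6265 domain-match and path-match algorithms, not Mechanize's own code; the helper names domain_match? and path_match? are made up for illustration, and IP addresses and public-suffix checks are left out.

```ruby
# Simplified RFC 6265 matching rules (illustrative helpers, not Mechanize API).

# True when `host` is the cookie domain itself or one of its subdomains.
def domain_match?(cookie_domain, host)
  domain = cookie_domain.downcase.sub(/\A\./, '') # a leading dot is ignored
  host = host.downcase
  host == domain || host.end_with?(".#{domain}")
end

# True when `request_path` equals the cookie path or lies beneath it.
def path_match?(cookie_path, request_path)
  request_path = '/' if request_path.empty?
  return true if cookie_path == request_path
  return false unless request_path.start_with?(cookie_path)

  # The prefix must end at a path-segment boundary.
  cookie_path.end_with?('/') || request_path[cookie_path.length] == '/'
end

puts domain_match?('.example.com', 'api.example.com') # true
puts domain_match?('.example.com', 'badexample.com')  # false
puts path_match?('/admin', '/admin/users')            # true
puts path_match?('/admin', '/administrator')          # false
```

The segment-boundary check is what keeps a /admin cookie away from /administrator, which a plain prefix comparison would let through.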

Basic Cookie Management in Mechanize

Mechanize keeps cookies in a cookie jar attached to each agent:

require 'mechanize'
require 'logger'

# Create a new Mechanize agent
agent = Mechanize.new

# Access the cookie jar
cookie_jar = agent.cookie_jar

# Log HTTP requests and responses for debugging
agent.log = Logger.new(STDOUT)
agent.log.level = Logger::DEBUG

Setting Cookies with Specific Domain Settings

Creating Domain-Specific Cookies

require 'mechanize'

agent = Mechanize.new

# Create a cookie with specific domain settings
cookie = Mechanize::Cookie.new(
  name: 'session_token',
  value: 'abc123xyz789',
  domain: '.example.com',  # Will work for all subdomains
  path: '/',
  secure: true,
  httponly: true
)

# Add the cookie to the jar
agent.cookie_jar.add(URI('https://www.example.com'), cookie)

# Verify cookie was added
puts "Cookies for example.com:"
agent.cookie_jar.cookies(URI('https://www.example.com')).each do |c|
  puts "#{c.name}: #{c.value} (Domain: #{c.domain}, Path: #{c.path})"
end

Handling Subdomain Cookies

# Set up cookies for different subdomains
subdomains = ['www', 'api', 'admin']

subdomains.each do |subdomain|
  cookie = Mechanize::Cookie.new(
    name: "#{subdomain}_session",
    value: "#{subdomain}_token_123",
    domain: "#{subdomain}.example.com",  # Specific subdomain
    path: '/',
    expires: Time.now + 3600  # 1 hour from now
  )

  agent.cookie_jar.add(URI("https://#{subdomain}.example.com"), cookie)
end

# Test cookie availability across subdomains
['www.example.com', 'api.example.com', 'shop.example.com'].each do |host|
  uri = URI("https://#{host}")
  cookies = agent.cookie_jar.cookies(uri)
  puts "#{host}: #{cookies.length} cookies available"
end
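
The counts printed by the loop above depend on whether each cookie is host-only or domain-wide. Assuming the jar follows RFC 6265, a cookie created with a plain host name such as www.example.com (no leading dot) stays host-only, so shop.example.com would not receive any of the per-subdomain cookies. The plain-Ruby model below illustrates the distinction; SketchCookie, its for_domain field, and sent_to? are hypothetical names for this sketch, not Mechanize API.

```ruby
# Hypothetical model of host-only versus domain-wide cookies.
SketchCookie = Struct.new(:name, :domain, :for_domain) do
  # A domain-wide cookie matches the domain and its subdomains;
  # a host-only cookie matches the exact host only.
  def sent_to?(host)
    return host == domain unless for_domain

    host == domain || host.end_with?(".#{domain}")
  end
end

cookies = [
  SketchCookie.new('www_session', 'www.example.com', false), # host-only
  SketchCookie.new('api_session', 'api.example.com', false), # host-only
  SketchCookie.new('shared', 'example.com', true)            # all subdomains
]

%w[www.example.com api.example.com shop.example.com].each do |host|
  names = cookies.select { |c| c.sent_to?(host) }.map(&:name)
  puts "#{host}: #{names.join(', ')}"
end
```

In this model only the shared cookie reaches shop.example.com, which is the behaviour to expect when a subdomain unexpectedly reports zero cookies.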

Managing Path-Specific Cookies

Setting Cookies for Specific Paths

# Create cookies for different application sections
sections = [
  { path: '/admin', name: 'admin_session', value: 'admin_token_456' },
  { path: '/api/v1', name: 'api_key', value: 'api_secret_789' },
  { path: '/user/profile', name: 'profile_prefs', value: 'theme=dark' }
]

sections.each do |section|
  cookie = Mechanize::Cookie.new(
    name: section[:name],
    value: section[:value],
    domain: 'example.com',
    path: section[:path],
    secure: true
  )

  agent.cookie_jar.add(URI('https://example.com'), cookie)
end

# Test path-specific cookie behavior
test_urls = [
  'https://example.com/',
  'https://example.com/admin',
  'https://example.com/admin/users',
  'https://example.com/api/v1/data',
  'https://example.com/user/profile/settings'
]

test_urls.each do |url|
  uri = URI(url)
  cookies = agent.cookie_jar.cookies(uri)
  puts "#{url}: #{cookies.map(&:name).join(', ')}"
end
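
When several path-scoped cookies apply to one URL, RFC 6265 recommends listing cookies with longer paths first in the Cookie header. The sketch below models that selection and ordering in plain Ruby; PathCookie, path_match?, and cookies_for are illustrative names, and Mechanize's own ordering is not shown here.

```ruby
# Hypothetical helpers: select cookies by path, most specific path first.
PathCookie = Struct.new(:name, :path)

# True when request_path equals cookie_path or lies beneath it.
def path_match?(cookie_path, request_path)
  return true if cookie_path == request_path
  return false unless request_path.start_with?(cookie_path)

  cookie_path.end_with?('/') || request_path[cookie_path.length] == '/'
end

# Matching cookies sorted by descending path length.
def cookies_for(cookies, request_path)
  cookies.select { |c| path_match?(c.path, request_path) }
         .sort_by { |c| -c.path.length }
end

jar = [
  PathCookie.new('site_prefs', '/'),
  PathCookie.new('admin_session', '/admin'),
  PathCookie.new('api_key', '/api/v1')
]

puts cookies_for(jar, '/admin/users').map(&:name).join('; ')
puts cookies_for(jar, '/api/v1/data').map(&:name).join('; ')
puts cookies_for(jar, '/').map(&:name).join('; ')
```

The ordering matters mostly when two cookies share a name at different paths, since many servers read the first one they see.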

Advanced Cookie Configuration

Creating Cookies with All Attributes

def create_advanced_cookie(agent, options = {})
  defaults = {
    name: 'advanced_session',
    value: 'secure_token_xyz',
    domain: '.example.com',
    path: '/',
    secure: true,
    httponly: true,
    expires: Time.now + 86400  # 24 hours
    # Max-Age and SameSite are left out: the constructor call below does not pass them on
  }

  config = defaults.merge(options)

  cookie = Mechanize::Cookie.new(
    name: config[:name],
    value: config[:value],
    domain: config[:domain],
    path: config[:path],
    secure: config[:secure],
    httponly: config[:httponly],
    expires: config[:expires]
  )

  agent.cookie_jar.add(URI("https://#{config[:domain].sub(/^\./, '')}"), cookie)
  cookie
end

# Usage examples
agent = Mechanize.new

# Production session cookie
create_advanced_cookie(agent, {
  name: 'prod_session',
  domain: '.myapp.com',
  path: '/',
  secure: true
})

# Development cookie with different settings
create_advanced_cookie(agent, {
  name: 'dev_session',
  domain: 'localhost',
  path: '/dev',
  secure: false,
  expires: Time.now + 3600
})
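
A cookie can carry both an Expires date and a Max-Age value. RFC 6265 gives Max-Age precedence when both are present, and a cookie with neither is a session cookie. The helper below is a minimal sketch of that rule in plain Ruby; effective_expiry and its keyword arguments are hypothetical names, not part of Mechanize.

```ruby
require 'time'

# Hypothetical helper: effective expiry following RFC 6265 precedence,
# where Max-Age wins over Expires when both are present.
def effective_expiry(created_at:, expires: nil, max_age: nil)
  return created_at + max_age if max_age

  expires # nil means a session cookie
end

created = Time.utc(2024, 1, 1, 12, 0, 0)
both = effective_expiry(created_at: created,
                        expires: created + 86_400,
                        max_age: 3_600)
puts both.iso8601 # 2024-01-01T13:00:00Z
puts effective_expiry(created_at: created).inspect # nil
```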

Extracting and Manipulating Existing Cookies

Reading Cookies from Server Responses

agent = Mechanize.new

# Make a request to get cookies from server
page = agent.get('https://example.com/login')

# Examine all cookies received
agent.cookie_jar.each do |cookie|
  puts "Cookie: #{cookie.name}"
  puts "  Value: #{cookie.value}"
  puts "  Domain: #{cookie.domain}"
  puts "  Path: #{cookie.path}"
  puts "  Secure: #{cookie.secure?}"
  puts "  HttpOnly: #{cookie.httponly?}"
  puts "  Expires: #{cookie.expires}"
  puts "---"
end
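
The domain and path values printed above originate in the server's Set-Cookie header. When the header omits Path, RFC 6265 derives a default from the request URL. The sketch below shows both steps in plain Ruby under simplifying assumptions (no quoted values, no date parsing); parse_set_cookie and default_path are illustrative names, not Mechanize's parser.

```ruby
require 'uri'

# Hypothetical parser sketch: split a Set-Cookie header into name, value,
# and lower-cased attributes. Flag attributes such as Secure map to true.
def parse_set_cookie(header)
  pair, *attrs = header.split(';').map(&:strip)
  name, value = pair.split('=', 2)
  attributes = attrs.to_h do |attr|
    key, val = attr.split('=', 2)
    [key.downcase, val || true]
  end
  { name: name, value: value, attributes: attributes }
end

# Default path per RFC 6265 section 5.1.4: the request path up to, but not
# including, its last "/".
def default_path(request_uri)
  path = URI(request_uri).path
  return '/' if path.empty? || !path.start_with?('/')

  last_slash = path.rindex('/')
  last_slash.zero? ? '/' : path[0...last_slash]
end

cookie = parse_set_cookie('JSESSIONID=abc123; Domain=example.com; Path=/app; Secure; HttpOnly')
puts cookie[:attributes]['domain']                 # example.com
puts cookie[:attributes]['secure']                 # true
puts default_path('https://example.com/app/login') # /app
```

A cookie set by /app/login without a Path attribute therefore defaults to /app, which explains why it may be missing on requests to other sections of the site.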

Modifying Existing Cookies

# Find and modify a specific cookie
session_cookie = agent.cookie_jar.find do |cookie|
  cookie.name == 'JSESSIONID' && cookie.domain.include?('example.com')
end

if session_cookie
  # Create a modified version
  modified_cookie = Mechanize::Cookie.new(
    name: session_cookie.name,
    value: session_cookie.value,
    domain: '.example.com',  # Change to allow subdomains
    path: '/',               # Broaden the path
    secure: true,
    httponly: true,
    expires: Time.now + 7200 # Extend expiration
  )

  # Remove old cookie and add modified one
  agent.cookie_jar.delete(session_cookie)
  agent.cookie_jar.add(URI('https://example.com'), modified_cookie)
end

Cookie Persistence and Management

Saving and Loading Cookie Files

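# Save cookies to a file (YAML by default). session: true also keeps
# cookies that have no expiry, which are skipped otherwise
agent.cookie_jar.save('cookies.yml', session: true)

# Load cookies from file in a new session
new_agent = Mechanize.new
new_agent.cookie_jar.load('cookies.yml')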

# Custom cookie serialization
require 'json'

def export_cookies_to_json(agent, filename)
  cookies_data = agent.cookie_jar.map do |cookie|
    {
      name: cookie.name,
      value: cookie.value,
      domain: cookie.domain,
      for_domain: cookie.for_domain?,  # true when subdomains also match
      path: cookie.path,
      secure: cookie.secure?,
      httponly: cookie.httponly?,
      expires: cookie.expires&.to_i
    }
  end

  File.write(filename, JSON.pretty_generate(cookies_data))
end

def import_cookies_from_json(agent, filename)
  cookies_data = JSON.parse(File.read(filename))

  cookies_data.each do |cookie_data|
    cookie = Mechanize::Cookie.new(
      name: cookie_data['name'],
      value: cookie_data['value'],
      domain: cookie_data['domain'],
      for_domain: cookie_data['for_domain'],  # restore subdomain matching
      path: cookie_data['path'],
      secure: cookie_data['secure'],
      httponly: cookie_data['httponly'],
      expires: cookie_data['expires'] ? Time.at(cookie_data['expires']) : nil
    )

    agent.cookie_jar.add(URI("https://#{cookie_data['domain'].sub(/^\./, '')}"), cookie)
  end
end
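
Before re-importing a saved file, it can help to drop records that have already expired. The sketch below assumes the JSON layout used above, with expires stored as epoch seconds and null for session cookies; drop_expired is a hypothetical helper and the sample records are made up.

```ruby
require 'json'

# Hypothetical helper: keep only records that are still valid at `now`.
# Records with a null "expires" are session cookies and are kept.
def drop_expired(records, now: Time.now)
  records.reject { |r| r['expires'] && Time.at(r['expires']) <= now }
end

json = <<~JSON
  [
    {"name": "old", "domain": "example.com", "path": "/", "expires": 1000},
    {"name": "session", "domain": "example.com", "path": "/", "expires": null},
    {"name": "fresh", "domain": "example.com", "path": "/admin", "expires": 4102444800}
  ]
JSON

valid = drop_expired(JSON.parse(json), now: Time.at(1_700_000_000))
puts valid.map { |r| r['name'] }.join(', ') # session, fresh
```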

Debugging Cookie Issues

Cookie Inspection and Troubleshooting

def debug_cookie_behavior(agent, url)
  uri = URI(url)
  puts "Debugging cookies for: #{url}"
  puts "Host: #{uri.host}"
  puts "Path: #{uri.path}"
  puts ""

  # Show all cookies in jar
  puts "All cookies in jar:"
  agent.cookie_jar.each_with_index do |cookie, index|
    puts "#{index + 1}. #{cookie.name} = #{cookie.value}"
    puts "   Domain: #{cookie.domain} | Path: #{cookie.path}"
    puts "   Secure: #{cookie.secure?} | HttpOnly: #{cookie.httponly?}"
    puts ""
  end

  # Show cookies that would be sent to this URL
  applicable_cookies = agent.cookie_jar.cookies(uri)
  puts "Cookies that would be sent to #{url}:"
  if applicable_cookies.empty?
    puts "  None"
  else
    applicable_cookies.each do |cookie|
      puts "  #{cookie.name} = #{cookie.value}"
    end
  end
  puts "=" * 50
end

# Usage
agent = Mechanize.new
debug_cookie_behavior(agent, 'https://www.example.com/admin/dashboard')
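
When the second list comes back empty, the usual causes are a domain mismatch, a path mismatch, a Secure cookie on a plain HTTP URL, or an expired cookie. The sketch below checks those four conditions against a cookie described by a plain hash. It is a simplified model of the RFC 6265 rules rather than Mechanize's logic, and rejection_reasons along with the hash keys are hypothetical.

```ruby
require 'uri'

# Hypothetical diagnostic: list the reasons a cookie, described by a plain
# hash, would be withheld from a request URL.
def rejection_reasons(cookie, url, now: Time.now)
  uri = URI(url)
  host = uri.host.downcase
  path = uri.path.empty? ? '/' : uri.path
  domain = cookie[:domain].downcase.sub(/\A\./, '')
  reasons = []

  # Host-only cookies need an exact host; domain-wide ones accept subdomains.
  domain_ok = host == domain || (cookie[:for_domain] && host.end_with?(".#{domain}"))
  reasons << 'domain mismatch' unless domain_ok

  # The cookie path must equal the request path or be a segment-wise prefix.
  cpath = cookie[:path]
  path_ok = path == cpath ||
            (path.start_with?(cpath) && (cpath.end_with?('/') || path[cpath.length] == '/'))
  reasons << 'path mismatch' unless path_ok

  reasons << 'secure cookie on non-HTTPS URL' if cookie[:secure] && uri.scheme != 'https'
  reasons << 'expired' if cookie[:expires] && cookie[:expires] <= now
  reasons
end

admin = { domain: 'example.com', for_domain: true, path: '/admin', secure: true, expires: nil }
puts rejection_reasons(admin, 'https://www.example.com/admin/dashboard').inspect
puts rejection_reasons(admin, 'http://www.example.com/profile').inspect
```

The first call reports no reasons; the second reports both the path mismatch and the insecure scheme.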

Integration with Authentication Systems

Multi-Domain Authentication Flow

class MultiDomainAuthenticator
  def initialize
    @agent = Mechanize.new
    @agent.user_agent_alias = 'Mac Safari'
  end

  def authenticate_main_domain(username, password)
    # Login to main domain
    login_page = @agent.get('https://auth.example.com/login')
    login_form = login_page.form_with(action: /login/)

    login_form.username = username
    login_form.password = password

    result = @agent.submit(login_form)

    # Extract authentication token from response; stop if none was found
    auth_token = extract_auth_token(result)
    raise 'auth_token not found in login response' unless auth_token

    # Set cross-domain authentication cookie
    auth_cookie = Mechanize::Cookie.new(
      name: 'auth_token',
      value: auth_token,
      domain: '.example.com',  # Available to all subdomains
      path: '/',
      secure: true,
      httponly: true,
      expires: Time.now + 3600
    )

    @agent.cookie_jar.add(URI('https://example.com'), auth_cookie)
  end

  def access_protected_resource(subdomain, path)
    url = "https://#{subdomain}.example.com#{path}"

    # The shared auth_token cookie is sent automatically because its domain
    # covers every subdomain of example.com
    @agent.get(url)
  end

  private

  def extract_auth_token(response)
    # Extract the token from the response body; returns nil when it is absent
    response.body[/auth_token['"]:['"]([^'"]+)/, 1]
  end
end
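
How the token is extracted depends on what the login endpoint returns. The sketch below assumes a JSON body with an auth_token key and falls back to a whitespace-tolerant regex for tokens embedded in inline script. Both helper names and the sample body are made up for illustration.

```ruby
require 'json'

# Assumed response shape: a JSON object with an "auth_token" key.
# Returns nil when the body is not JSON or the key is missing.
def extract_token_from_json(body)
  data = JSON.parse(body)
  data.is_a?(Hash) ? data['auth_token'] : nil
rescue JSON::ParserError
  nil
end

# Regex fallback for bodies that embed the token in inline script or HTML.
def extract_token_with_regex(body)
  body[/auth_token['"]\s*:\s*['"]([^'"]+)/, 1]
end

sample = '{"user": "alice", "auth_token": "tok_123"}'
puts extract_token_from_json(sample)                          # tok_123
puts extract_token_with_regex(sample)                         # tok_123
puts extract_token_from_json('<html>no token</html>').inspect # nil
```

Parsing the JSON avoids false matches on look-alike strings, while the regex copes with pages that only mention the token inside a script tag.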

Best Practices and Security Considerations

Secure Cookie Handling

require 'base64'
require 'securerandom'

class SecureCookieManager
  def initialize(agent)
    @agent = agent
  end

  def create_secure_session_cookie(domain, session_id)
    # Always use secure settings for production
    cookie = Mechanize::Cookie.new(
      name: 'secure_session',
      value: encrypt_session_id(session_id),
      domain: domain,
      path: '/',
      secure: true,      # Only send over HTTPS
      httponly: true,    # Hide the cookie from client-side scripts
      expires: Time.now + 1800  # 30 minutes
    )

    @agent.cookie_jar.add(URI("https://#{domain}"), cookie)
  end

  def rotate_session_cookies
    # Find existing session cookies
    session_cookies = @agent.cookie_jar.select do |cookie|
      cookie.name.include?('session')
    end

    session_cookies.each do |old_cookie|
      # Create new cookie with updated value
      new_cookie = Mechanize::Cookie.new(
        name: old_cookie.name,
        value: generate_new_session_value,
        domain: old_cookie.domain,
        path: old_cookie.path,
        secure: old_cookie.secure?,
        httponly: old_cookie.httponly?,
        expires: Time.now + 1800
      )

      # Replace old with new
      @agent.cookie_jar.delete(old_cookie)
      @agent.cookie_jar.add(URI("https://#{old_cookie.domain}"), new_cookie)
    end
  end

  private

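  def encrypt_session_id(session_id)
    # Placeholder only: Base64 is a reversible encoding, not encryption.
    # Substitute real encryption here if the value must stay confidential.
    Base64.strict_encode64(session_id)
  end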

  def generate_new_session_value
    SecureRandom.hex(32)
  end
end
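
The Base64 step in the class above stands in for real encryption. The short check below, using only the standard library, shows that the encoding is trivially reversible and what a rotated value from SecureRandom looks like; the sample session ID is made up.

```ruby
require 'base64'
require 'securerandom'

session_id = 'user-42-session'
encoded = Base64.strict_encode64(session_id)

# Anyone holding the cookie value can reverse the encoding.
puts Base64.strict_decode64(encoded) == session_id # true

# A rotated value from SecureRandom.hex(32) is 64 hexadecimal characters.
puts SecureRandom.hex(32).length # 64
```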

Conclusion

Handling cookies with specific domain and path settings in Mechanize requires understanding both the HTTP cookie specification and Mechanize's cookie jar implementation. By properly configuring domain and path attributes, you can create robust web scraping applications that maintain session state across complex multi-domain architectures.

Key takeaways for effective cookie management:

Use a leading-dot domain (.example.com) when a cookie must reach subdomains
Set appropriate path restrictions for security
Always use secure flags for production environments
Implement proper cookie persistence for long-running scrapers
Debug cookie behavior thoroughly when troubleshooting authentication issues

For applications that require JavaScript execution alongside cookie management, consider pairing Mechanize with a headless-browser tool such as Puppeteer, which can handle full browser sessions.
