How do I handle cookies and sessions in Ruby web scraping?

Handling cookies and sessions is crucial for Ruby web scraping, especially when dealing with authenticated websites, e-commerce platforms, or any site that tracks user state. This guide covers various Ruby libraries and techniques for managing cookies and maintaining persistent sessions throughout your scraping workflow.

Understanding Cookies and Sessions in Web Scraping

Cookies are small pieces of data stored by websites in your browser to remember information about your visit. Sessions use cookies to maintain state across multiple HTTP requests, enabling features like user authentication, shopping carts, and personalized content.

In web scraping, proper cookie and session management allows you to:

  • Maintain login status across requests
  • Preserve user preferences and settings
  • Handle multi-step forms and workflows
  • Avoid repeated authentication processes
  • Access session-protected content
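
At the HTTP level this is just an exchange of headers: the server answers with Set-Cookie, and the client sends the value back in a Cookie header on the next request. A minimal sketch with Net::HTTP, using httpbin.org purely as a demo endpoint:

require 'net/http'
require 'uri'

# First response: the server sets a session cookie
first = Net::HTTP.get_response(URI('https://httpbin.org/cookies/set?session=abc123'))
cookie = first['Set-Cookie']&.split(';')&.first # e.g. "session=abc123"

# Next request: echo the cookie back to stay in the same "session"
uri = URI('https://httpbin.org/cookies')
request = Net::HTTP::Get.new(uri)
request['Cookie'] = cookie
response = Net::HTTP.start(uri.host, uri.port, use_ssl: true) { |http| http.request(request) }
puts response.body # httpbin reports the cookie it received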

Using HTTParty for Cookie Management

HTTParty is one of the most popular Ruby gems for making HTTP requests. It does not keep a cookie jar across separate requests on its own, but its built-in CookieHash helper and per-request cookies option make session handling straightforward.

Basic Cookie Handling with HTTParty

require 'httparty'

class WebScraper
  include HTTParty
  base_uri 'https://example.com'

  def initialize
    # HTTParty does not carry cookies across separate requests,
    # so keep a cookie jar and attach it to every call
    @cookie_jar = HTTParty::CookieHash.new
    @headers = {
      'User-Agent' => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    }
  end

  def store_cookies(response)
    # Each Set-Cookie header becomes an entry in the cookie jar
    Array(response.headers.get_fields('set-cookie')).each do |cookie|
      @cookie_jar.add_cookies(cookie)
    end
  end

  def login(username, password)
    # Perform login and store the session cookies from the response.
    # If the site redirects after login, consider follow_redirects: false
    # so the Set-Cookie headers of the redirect response are not lost.
    response = self.class.post('/login',
      headers: @headers,
      body: {
        username: username,
        password: password
      }
    )
    store_cookies(response)

    puts "Login status: #{response.code}"
    response
  end

  def scrape_protected_page
    # Session cookies are sent via the :cookies option
    response = self.class.get('/protected-content',
      headers: @headers,
      cookies: @cookie_jar
    )
    store_cookies(response)
    response.body
  end
end

# Usage
scraper = WebScraper.new
scraper.login('your_username', 'your_password')
content = scraper.scrape_protected_page

Manual Cookie Management with HTTParty

For more control over cookie handling, you can manually manage cookies:

require 'httparty'

class ManualCookieScraper
  include HTTParty
  base_uri 'https://example.com'

  def initialize
    @cookie_jar = {}
    @headers = {
      'User-Agent' => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    }
  end

  def extract_cookies(response)
    # get_fields returns one entry per Set-Cookie header (or nil if there are none);
    # headers['set-cookie'] would join multiple cookies into a single string
    cookies = response.headers.get_fields('set-cookie')
    return unless cookies

    cookies.each do |cookie|
      name, value = cookie.split(';').first.split('=', 2)
      @cookie_jar[name] = value if name && value
    end
  end

  def cookie_header
    @cookie_jar.map { |name, value| "#{name}=#{value}" }.join('; ')
  end

  def make_request(path, method: :get, body: nil)
    options = {
      headers: @headers.merge('Cookie' => cookie_header)
    }
    options[:body] = body if body

    response = self.class.send(method, path, options)
    extract_cookies(response)
    response
  end

  def login(username, password)
    # Load the login page first so any session cookies it sets are captured
    make_request('/login')

    # Perform login
    login_response = make_request('/login', 
      method: :post,
      body: {
        username: username,
        password: password
      }
    )

    login_response
  end
end
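
A quick usage sketch for the class above (the credentials and the /account path are placeholders):

scraper = ManualCookieScraper.new
scraper.login('your_username', 'your_password')

# Cookies captured during login are sent automatically on later calls
orders_page = scraper.make_request('/account/orders')
puts orders_page.code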

Using Net::HTTP with Cookie Management

For lower-level control, you can use Ruby's built-in Net::HTTP library with manual cookie handling:

require 'net/http'
require 'uri'

class NetHTTPScraper
  def initialize
    @cookies = {}
    @user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
  end

  def parse_cookies(response)
    # Parse Set-Cookie headers
    response.get_fields('Set-Cookie')&.each do |cookie|
      name, value = cookie.split(';').first.split('=', 2)
      @cookies[name] = value if name && value
    end
  end

  def cookie_string
    @cookies.map { |name, value| "#{name}=#{value}" }.join('; ')
  end

  def make_request(url, method: 'GET', data: nil)
    uri = URI(url)
    http = Net::HTTP.new(uri.host, uri.port)
    http.use_ssl = true if uri.scheme == 'https'

    request = case method.upcase
              when 'GET'
                Net::HTTP::Get.new(uri)
              when 'POST'
                Net::HTTP::Post.new(uri)
              end

    # Set headers
    request['User-Agent'] = @user_agent
    request['Cookie'] = cookie_string unless @cookies.empty?

    # Set body for POST requests
    if data && method.upcase == 'POST'
      if data.is_a?(Hash)
        request.set_form_data(data)
      else
        request.body = data
      end
    end

    response = http.request(request)
    parse_cookies(response)
    response
  end

  def login_and_scrape
    # Step 1: Get login page
    login_page = make_request('https://example.com/login')

    # Step 2: Submit login form
    login_response = make_request(
      'https://example.com/login',
      method: 'POST',
      data: {
        'username' => 'your_username',
        'password' => 'your_password'
      }
    )

    # Step 3: Access protected content
    protected_content = make_request('https://example.com/dashboard')
    protected_content.body
  end
end
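
Usage comes down to a single call; the URLs and credentials hard-coded in login_and_scrape are placeholders to adapt to your target site:

scraper = NetHTTPScraper.new
dashboard_html = scraper.login_and_scrape
puts dashboard_html[0, 200]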

Using Mechanize for Advanced Session Management

Mechanize is a powerful Ruby library that provides automatic cookie and session management along with form handling capabilities, similar to how browser sessions work in Puppeteer.

require 'mechanize'
require 'json' # used below to save and restore exported cookies

class MechanizeScraper
  def initialize
    @agent = Mechanize.new

    # Configure user agent and other settings
    @agent.user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    @agent.follow_meta_refresh = true
    @agent.redirect_ok = true

    # The cookie jar is enabled automatically; a new agent starts with an empty jar
  end

  def login(username, password)
    # Navigate to login page
    login_page = @agent.get('https://example.com/login')

    # Find and fill login form
    login_form = login_page.form_with(action: '/login') || login_page.forms.first
    login_form.field_with(name: 'username').value = username
    login_form.field_with(name: 'password').value = password

    # Submit form (cookies are automatically handled)
    result_page = @agent.submit(login_form)

    # Check if login was successful
    if result_page.uri.to_s.include?('dashboard') || 
       result_page.body.include?('Welcome')
      puts "Login successful"
      true
    else
      puts "Login failed"
      false
    end
  end

  def scrape_with_session
    # All subsequent requests will include session cookies
    dashboard = @agent.get('https://example.com/dashboard')
    profile = @agent.get('https://example.com/profile')

    {
      dashboard: dashboard.body,
      profile: profile.body
    }
  end

  def export_cookies
    # Export cookies for later use
    cookies = {}
    @agent.cookie_jar.each do |cookie|
      cookies[cookie.name] = cookie.value
    end
    cookies
  end

  def import_cookies(cookie_hash)
    # Import previously saved cookies
    cookie_hash.each do |name, value|
      cookie = Mechanize::Cookie.new(name, value)
      cookie.domain = '.example.com'
      cookie.path = '/'
      @agent.cookie_jar.add(cookie)
    end
  end
end

# Usage example
scraper = MechanizeScraper.new

# Login and scrape
if scraper.login('username', 'password')
  data = scraper.scrape_with_session

  # Save cookies for future sessions
  saved_cookies = scraper.export_cookies
  File.write('cookies.json', saved_cookies.to_json)
end
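
On a later run you can restore the saved cookies and skip logging in again (the file name matches the snippet above; adjust the domain inside import_cookies to your target site):

scraper = MechanizeScraper.new
saved = JSON.parse(File.read('cookies.json'))
scraper.import_cookies(saved)

# Requests now reuse the previous session's cookies
data = scraper.scrape_with_session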

Persistent Cookie Storage

For long-running scraping tasks, you may want to persist cookies between script executions:

require 'json'
require 'fileutils'

class PersistentCookieScraper
  COOKIE_FILE = 'session_cookies.json'

  def initialize
    @cookies = load_cookies
  end

  def load_cookies
    if File.exist?(COOKIE_FILE)
      JSON.parse(File.read(COOKIE_FILE))
    else
      {}
    end
  rescue JSON::ParserError
    {}
  end

  def save_cookies
    File.write(COOKIE_FILE, @cookies.to_json)
  end

  def add_cookie(name, value, domain: nil, path: '/')
    @cookies[name] = {
      'value' => value,
      'domain' => domain,
      'path' => path,
      'created_at' => Time.now.to_i
    }
    save_cookies
  end

  def get_cookies_for_domain(domain)
    @cookies.select do |name, data|
      data['domain'].nil? || domain.include?(data['domain'])
    end
  end

  def cookie_header_for_domain(domain)
    relevant_cookies = get_cookies_for_domain(domain)
    relevant_cookies.map { |name, data| "#{name}=#{data['value']}" }.join('; ')
  end

  def clear_expired_cookies(max_age_days: 30)
    cutoff_time = Time.now.to_i - (max_age_days * 24 * 60 * 60)
    @cookies.delete_if { |name, data| data['created_at'] < cutoff_time }
    save_cookies
  end
end
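
A short sketch of how the store plugs into an outgoing request (the cookie name and value here are made up):

require 'net/http'
require 'uri'

store = PersistentCookieScraper.new
store.add_cookie('session_id', 'abc123', domain: 'example.com')
store.clear_expired_cookies(max_age_days: 7)

# Attach the stored cookies to a request for the same domain
uri = URI('https://example.com/dashboard')
request = Net::HTTP::Get.new(uri)
request['Cookie'] = store.cookie_header_for_domain(uri.host)
response = Net::HTTP.start(uri.host, uri.port, use_ssl: true) { |http| http.request(request) }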

Handling CSRF Tokens and Form Security

Many websites use CSRF tokens for security. Here's how to handle them with session management:

require 'nokogiri'
require 'httparty'

class CSRFAwareScraper
  include HTTParty

  def initialize(base_url)
    @base_url = base_url
    # Keep session cookies ourselves so the CSRF token stays tied to the session
    @cookie_jar = HTTParty::CookieHash.new
    self.class.base_uri base_url
  end

  def store_cookies(response)
    Array(response.headers.get_fields('set-cookie')).each do |cookie|
      @cookie_jar.add_cookies(cookie)
    end
  end

  def extract_csrf_token(html_content)
    doc = Nokogiri::HTML(html_content)
    csrf_input = doc.css('input[name="csrf_token"], input[name="_token"], meta[name="csrf-token"]').first

    if csrf_input
      csrf_input['value'] || csrf_input['content']
    else
      # Fall back to looking for a token assignment in inline scripts
      script_content = doc.css('script').map(&:content).join(' ')
      token_match = script_content.match(/csrf[_-]?token['"]?\s*[:=]\s*['"]([^'"]+)['"]/)
      token_match ? token_match[1] : nil
    end
  end

  def login_with_csrf(username, password)
    # Get the login page to pick up the session cookie and CSRF token
    login_page = self.class.get('/login')
    store_cookies(login_page)
    csrf_token = extract_csrf_token(login_page.body)

    # Prepare login data
    login_data = {
      username: username,
      password: password
    }
    login_data[:csrf_token] = csrf_token if csrf_token

    # Submit the form with the same session cookies the login page set
    response = self.class.post('/login', body: login_data, cookies: @cookie_jar)
    store_cookies(response)

    if response.code == 200
      puts "Login successful with CSRF protection"
      true
    else
      puts "Login failed: #{response.code}"
      false
    end
  end
end
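
Typical usage with placeholder credentials; subsequent requests for protected pages should pass the same cookie jar via the cookies: option, exactly as login_with_csrf does:

scraper = CSRFAwareScraper.new('https://example.com')
scraper.login_with_csrf('your_username', 'your_password')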

Best Practices and Troubleshooting

Session Management Best Practices

  1. Always respect robots.txt: Check the website's robots.txt file before scraping
  2. Implement rate limiting: Add delays between requests to avoid being blocked
  3. Handle session expiration: Implement logic to detect and handle expired sessions
  4. Use realistic headers: Set proper User-Agent and other headers to appear more legitimate

The example below sketches how points 2 and 3 can be combined: detect an expired session, re-authenticate, and retry with exponential backoff.
class RobustSessionScraper
  def initialize
    @max_retries = 3
    @delay_between_requests = 1
  end

  def make_request_with_retry(url, attempts: 0)
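    # NOTE: make_request and re_login are not defined in this snippet; implement
    # them with whichever client from the sections above you are using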
    sleep(@delay_between_requests) if attempts > 0

    response = make_request(url)

    # Check if session expired (customize based on your target site)
    if session_expired?(response)
      puts "Session expired, attempting to re-login..."
      if re_login && attempts < @max_retries
        return make_request_with_retry(url, attempts: attempts + 1)
      else
        raise "Failed to maintain session after #{@max_retries} attempts"
      end
    end

    response
  rescue StandardError => e
    if attempts < @max_retries
      puts "Request failed, retrying... (#{attempts + 1}/#{@max_retries})"
      sleep(2 ** attempts) # Exponential backoff
      make_request_with_retry(url, attempts: attempts + 1)
    else
      raise e
    end
  end

  private

  def session_expired?(response)
    response.code == 401 || 
    response.body.include?('login') ||
    response.body.include?('session expired')
  end
end

Common Issues and Solutions

Issue: Cookies not being set properly
Solution: Check that you're following redirects and that the cookie domain matches your requests

Issue: Session timing out during long scraping sessions
Solution: Implement periodic session refresh or re-authentication

Issue: Anti-bot measures detecting automated requests
Solution: Randomize request timing, use realistic headers, and consider using residential proxies
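
For the last point, randomizing the delay between requests is a one-liner; the bounds here are arbitrary:

# Sleep a random interval so request timing looks less mechanical
def polite_sleep(min_seconds: 1.0, max_seconds: 3.5)
  sleep(rand(min_seconds..max_seconds))
end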

Conclusion

Effective cookie and session management is essential for successful Ruby web scraping. Whether you choose HTTParty for simplicity, Mechanize for advanced form handling, or Net::HTTP for complete control, understanding how to maintain session state will enable you to scrape authenticated content and handle complex user flows.

Remember to always scrape responsibly, respect website terms of service, and implement proper error handling and rate limiting in your scraping scripts. For sites with complex authentication flows, consider using browser automation tools that can handle authentication processes more naturally.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
