How do you handle websites that use complex authentication flows like OAuth?

Handling complex authentication flows like OAuth in Mechanize requires understanding the authentication process and implementing proper session management. While Mechanize excels at form-based authentication, OAuth and similar flows present unique challenges that require careful handling of redirects, tokens, and API endpoints.

Understanding OAuth Flow with Mechanize

OAuth (Open Authorization) is a standard for access delegation that allows applications to access user accounts without exposing passwords. The OAuth 2.0 authorization code flow, the variant used throughout this article, involves five steps:

  1. Authorization Request: Redirect user to authorization server
  2. User Authorization: User grants permission
  3. Authorization Grant: Server returns authorization code
  4. Access Token Request: Exchange code for access token
  5. Protected Resource Access: Use token to access resources
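
Steps 1 and 3 can be sketched with Ruby's standard library alone. This is a minimal sketch, not a definitive implementation: the endpoint `https://api.example.com/oauth/authorize` and the helper names `authorization_url` and `extract_code` are hypothetical. It also compares the returned `state` with the value that was sent, which protects the flow against cross-site request forgery.

```ruby
require 'securerandom'
require 'uri'

# Step 1: build the URL the user is sent to.
# The default endpoint is a placeholder, not a real service.
def authorization_url(client_id:, redirect_uri:, scope:, state:,
                      endpoint: 'https://api.example.com/oauth/authorize')
  query = URI.encode_www_form(
    response_type: 'code',
    client_id: client_id,
    redirect_uri: redirect_uri,
    scope: scope,
    state: state
  )
  "#{endpoint}?#{query}"
end

# Step 3: read the authorization code from the callback URL,
# rejecting the response if the state does not match what was sent.
def extract_code(callback_url, expected_state)
  params = URI.decode_www_form(URI.parse(callback_url).query || '').to_h
  raise "Authorization failed: #{params['error']}" if params['error']
  raise 'State mismatch' unless params['state'] == expected_state

  params['code'] or raise 'No authorization code in callback'
end

state = SecureRandom.hex(16)
url = authorization_url(
  client_id: 'your_client_id',
  redirect_uri: 'http://localhost:8080/callback',
  scope: 'read write',
  state: state
)
puts url
```

Keeping the `state` value around until the callback arrives is the important design point: generating it without checking it later gives no protection.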

Basic OAuth Implementation with Mechanize

Here's a Ruby implementation using Mechanize for OAuth authentication:

require 'mechanize'
require 'uri'
require 'json'
require 'securerandom' # SecureRandom is used by generate_state_token

class OAuthScraper
  def initialize
    @agent = Mechanize.new
    @agent.user_agent_alias = 'Windows Mozilla'
    @client_id = 'your_client_id'
    @client_secret = 'your_client_secret'
    @redirect_uri = 'http://localhost:8080/callback'
  end

  def authenticate
    # Step 1: Get authorization URL
    auth_url = build_authorization_url
    puts "Visit this URL to authorize: #{auth_url}"

    # Step 2: Handle authorization (simulate or manual)
    authorization_code = get_authorization_code(auth_url)

    # Step 3: Exchange code for access token
    access_token = exchange_code_for_token(authorization_code)

    # Step 4: Store token for subsequent requests
    @access_token = access_token
    setup_authenticated_agent

    access_token
  end

  private

  def build_authorization_url
    params = {
      response_type: 'code',
      client_id: @client_id,
      redirect_uri: @redirect_uri,
      scope: 'read write',
      state: generate_state_token
    }

    "https://api.example.com/oauth/authorize?" + URI.encode_www_form(params)
  end

  def get_authorization_code(auth_url)
    # Navigate to authorization page
    page = @agent.get(auth_url)

    # Find and fill login form if present
    form = page.form_with(action: /login|signin/)
    if form
      form.username = 'your_username'
      form.password = 'your_password'
      page = @agent.submit(form)
    end

    # Look for authorization form
    auth_form = page.form_with(action: /authorize|consent/)
    if auth_form
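      # Submit authorization (grant permission). Mechanize follows the
      # redirect to @redirect_uri by default, so something must be
      # listening there or the request fails before the code is read.
      redirect_page = @agent.submit(auth_form)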

      # Extract authorization code from redirect URL
      if redirect_page.uri.to_s.include?(@redirect_uri)
        uri = URI.parse(redirect_page.uri.to_s)
        params = URI.decode_www_form(uri.query || '')
        code = params.find { |k, v| k == 'code' }&.last
        return code
      end
    end

    raise "Failed to obtain authorization code"
  end

  def exchange_code_for_token(code)
    token_url = 'https://api.example.com/oauth/token'

    response = @agent.post(token_url, {
      grant_type: 'authorization_code',
      code: code,
      redirect_uri: @redirect_uri,
      client_id: @client_id,
      client_secret: @client_secret
    })

    token_data = JSON.parse(response.body)
    token_data['access_token']
  end

  def setup_authenticated_agent
    # Add Authorization header for subsequent requests
    @agent.pre_connect_hooks << lambda do |agent, request|
      request['Authorization'] = "Bearer #{@access_token}"
    end
  end

  def generate_state_token
    SecureRandom.hex(16)
  end
end
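
The token exchange above returns `nil` silently when the server rejects the request. OAuth 2.0 token endpoints report failures in `error` and `error_description` fields, so a stricter parser can surface them. A minimal sketch, assuming a JSON response; the helper name `access_token_from` is hypothetical:

```ruby
require 'json'

# Pull the access token out of a token-endpoint response body,
# raising a descriptive error instead of silently returning nil.
def access_token_from(body)
  data = JSON.parse(body)
  if data['error']
    detail = [data['error'], data['error_description']].compact.join(': ')
    raise "Token request failed: #{detail}"
  end
  data.fetch('access_token') { raise 'Token response had no access_token' }
rescue JSON::ParserError
  raise 'Token endpoint did not return JSON'
end
```

Calling `access_token_from(response.body)` in place of the bare `JSON.parse` lookup would make a rejected or expired authorization code fail loudly at the exchange step rather than later, on the first authenticated request.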

Handling Different OAuth Providers

Google OAuth Implementation

class GoogleOAuthScraper < OAuthScraper
  def initialize
    super
    @auth_base_url = 'https://accounts.google.com/o/oauth2/v2/auth'
    @token_url = 'https://oauth2.googleapis.com/token'
    @scope = 'https://www.googleapis.com/auth/userinfo.profile'
  end

  def build_authorization_url
    params = {
      response_type: 'code',
      client_id: @client_id,
      redirect_uri: @redirect_uri,
      scope: @scope,
      access_type: 'offline',
      state: generate_state_token
    }

    "#{@auth_base_url}?" + URI.encode_www_form(params)
  end

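  # The base class posts to a hard-coded example URL, so override the
  # exchange to use Google's token endpoint instead.
  def exchange_code_for_token(code)
    response = @agent.post(@token_url, {
      grant_type: 'authorization_code',
      code: code,
      redirect_uri: @redirect_uri,
      client_id: @client_id,
      client_secret: @client_secret
    })

    JSON.parse(response.body)['access_token']
  end

  def get_user_profile
    profile_url = 'https://www.googleapis.com/oauth2/v2/userinfo'
    response = @agent.get(profile_url)
    JSON.parse(response.body)
  end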
end

GitHub OAuth Implementation

class GitHubOAuthScraper < OAuthScraper
  def initialize
    super
    @auth_base_url = 'https://github.com/login/oauth/authorize'
    @token_url = 'https://github.com/login/oauth/access_token'
  end

  def build_authorization_url
    params = {
      client_id: @client_id,
      redirect_uri: @redirect_uri,
      scope: 'user:email',
      state: generate_state_token
    }

    "#{@auth_base_url}?" + URI.encode_www_form(params)
  end

  def exchange_code_for_token(code)
    response = @agent.post(@token_url, {
      client_id: @client_id,
      client_secret: @client_secret,
      code: code
    }, { 'Accept' => 'application/json' })

    token_data = JSON.parse(response.body)
    token_data['access_token']
  end
end
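
The `Accept` header in the GitHub exchange matters: without it, GitHub's token endpoint answers with a form-encoded body rather than JSON, and `JSON.parse` would fail. A small sketch of a parser that tolerates both formats; the helper name `parse_token_body` and the sample token value are hypothetical:

```ruby
require 'json'
require 'uri'

# Decode a token response as JSON when the content type says so,
# otherwise treat it as application/x-www-form-urlencoded.
def parse_token_body(body, content_type)
  if content_type.to_s.include?('json')
    JSON.parse(body)
  else
    URI.decode_www_form(body).to_h
  end
end

form_body = 'access_token=gho_example&scope=user%3Aemail&token_type=bearer'
puts parse_token_body(form_body, 'application/x-www-form-urlencoded')['access_token']
```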

Advanced Authentication Patterns

PKCE (Proof Key for Code Exchange)

For enhanced security, implement PKCE flow:

require 'digest'
require 'base64'
require 'securerandom' # SecureRandom generates the code verifier

class PKCEOAuthScraper < OAuthScraper
  def initialize
    super
    generate_pkce_challenge
  end

  private

  def generate_pkce_challenge
    @code_verifier = SecureRandom.urlsafe_base64(43)
    @code_challenge = Base64.urlsafe_encode64(
      Digest::SHA256.digest(@code_verifier)
    ).tr('=', '')
  end

  def build_authorization_url
    params = {
      response_type: 'code',
      client_id: @client_id,
      redirect_uri: @redirect_uri,
      scope: 'read write',
      state: generate_state_token,
      code_challenge: @code_challenge,
      code_challenge_method: 'S256'
    }

    "https://api.example.com/oauth/authorize?" + URI.encode_www_form(params)
  end

  def exchange_code_for_token(code)
    response = @agent.post('https://api.example.com/oauth/token', {
      grant_type: 'authorization_code',
      code: code,
      redirect_uri: @redirect_uri,
      client_id: @client_id,
      code_verifier: @code_verifier
    })

    token_data = JSON.parse(response.body)
    token_data['access_token']
  end
end

Token Management and Refresh

Implement proper token lifecycle management:

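require 'json'
require 'time' # Time.parse, used in get_valid_token

class TokenManager
  # client_id and client_secret are needed by refresh_access_token
  def initialize(agent, client_id:, client_secret:)
    @agent = agent
    @client_id = client_id
    @client_secret = client_secret
    @token_file = 'oauth_tokens.json'
  end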

  def save_tokens(access_token, refresh_token = nil, expires_in = nil)
    tokens = {
      access_token: access_token,
      refresh_token: refresh_token,
      expires_at: expires_in ? Time.now + expires_in : nil
    }

    File.write(@token_file, JSON.pretty_generate(tokens))
  end

  def load_tokens
    return nil unless File.exist?(@token_file)
    JSON.parse(File.read(@token_file), symbolize_names: true)
  end

  def refresh_access_token(refresh_token)
    response = @agent.post('https://api.example.com/oauth/token', {
      grant_type: 'refresh_token',
      refresh_token: refresh_token,
      client_id: @client_id,
      client_secret: @client_secret
    })

    token_data = JSON.parse(response.body)
    save_tokens(
      token_data['access_token'],
      token_data['refresh_token'] || refresh_token,
      token_data['expires_in']
    )

    token_data['access_token']
  end

  def get_valid_token
    tokens = load_tokens
    return nil unless tokens

    # Check if token is expired
    if tokens[:expires_at] && Time.now > Time.parse(tokens[:expires_at])
      if tokens[:refresh_token]
        return refresh_access_token(tokens[:refresh_token])
      else
        return nil # Need to re-authenticate
      end
    end

    tokens[:access_token]
  end
end
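
A token that expires a second after the check can still fail in flight, so it is common to refresh slightly early. A minimal sketch of an expiry check with a safety margin; the helper name `token_expired?` and the 60-second default are assumptions, not part of the class above:

```ruby
require 'time'

# True when the token has expired or will expire within `skew` seconds.
# `expires_at` may be a Time, a time string as stored in the JSON file,
# or nil when the server gave no expiry.
def token_expired?(expires_at, now: Time.now, skew: 60)
  return false if expires_at.nil?

  expiry = expires_at.is_a?(Time) ? expires_at : Time.parse(expires_at.to_s)
  now >= expiry - skew
end
```

Passing `now` as a parameter keeps the check easy to test without stubbing the clock.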

Handling Complex Multi-Step Flows

Some applications require additional steps beyond basic OAuth:

class ComplexAuthFlow
  def initialize
    @agent = Mechanize.new
    @agent.user_agent_alias = 'Windows Mozilla'
  end

  def authenticate_with_mfa
    # Step 1: Basic login
    login_page = @agent.get('https://example.com/login')
    form = login_page.form_with(action: /login/)
    form.username = 'your_username'
    form.password = 'your_password'

    # Without an MFA challenge, the login response is already the landing page
    dashboard = @agent.submit(form)

    # Step 2: Handle MFA challenge
    if dashboard.content.include?('verification code')
      mfa_form = dashboard.form_with(action: /verify|mfa/)
      print 'Enter MFA code: '
      mfa_form.code = $stdin.gets.chomp

      dashboard = @agent.submit(mfa_form)
    end

    # Step 3: Extract session tokens or cookies
    extract_session_data(dashboard)
  end

  private

  def extract_session_data(page)
    # Look for session tokens in various places
    csrf_token = page.at('meta[name="csrf-token"]')&.attr('content')

    # Extract from JavaScript variables
    script_content = page.search('script').map(&:content).join
    session_match = script_content.match(/sessionToken['"]\s*:\s*['"]([^'"]+)/)
    session_token = session_match[1] if session_match

    {
      csrf_token: csrf_token,
      session_token: session_token,
      cookies: @agent.cookies
    }
  end
end

Best Practices and Security Considerations

Secure Credential Storage

class SecureCredentialManager
  def initialize
    @keyring = ENV['CREDENTIAL_STORE'] || 'keychain'
  end

  def store_credentials(service, account, password)
    case @keyring
    when 'keychain'
      # Pass arguments as a list so no shell parses the values
      system('security', 'add-generic-password',
             '-s', service, '-a', account, '-w', password)
    when 'env'
      # Store in environment variables (less secure)
      ENV["#{service}_#{account}".upcase] = password
    end
  end

  def get_credentials(service, account)
    case @keyring
    when 'keychain'
      # Argument list again avoids shell interpolation of service/account
      IO.popen(['security', 'find-generic-password',
                '-s', service, '-a', account, '-w'],
               err: File::NULL, &:read).strip
    when 'env'
      ENV["#{service}_#{account}".upcase]
    end
  end
end

Error Handling and Retry Logic

class RobustOAuthScraper < OAuthScraper
  MAX_RETRIES = 3
  RETRY_DELAY = 2

  def authenticated_request(url, method = :get, data = nil)
    retries = 0

    begin
      # Mechanize returns the page on success and raises
      # Mechanize::ResponseCodeError for 4xx/5xx responses.
      case method
      when :get  then @agent.get(url)
      when :post then @agent.post(url, data || {})
      else raise ArgumentError, "Unsupported method: #{method}"
      end
    rescue Mechanize::ResponseCodeError => e
      # Only 401 (unauthorized) and 429 (rate limited) are retried
      raise unless %w[401 429].include?(e.response_code)

      retries += 1
      if retries > MAX_RETRIES
        raise "Request failed with HTTP #{e.response_code} after #{MAX_RETRIES} retries"
      end

      if e.response_code == '401'
        # refresh_token_if_available is left for you to implement,
        # for example on top of TokenManager#refresh_access_token
        refresh_token_if_available
        sleep(RETRY_DELAY * retries)
      else
        sleep(extract_retry_after(e.page.response) || (RETRY_DELAY * retries))
      end
      retry
    end
  end

  private

  def extract_retry_after(response)
    retry_after = response['Retry-After']
    retry_after ? retry_after.to_i : nil
  end
end
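
The `Retry-After` header may carry either a number of seconds or an HTTP date, and calling `to_i` on a date yields 0. A sketch of a parser that handles both forms; the helper name `retry_after_seconds` is hypothetical:

```ruby
require 'time'

# Convert a Retry-After header value into a delay in whole seconds.
# Returns nil when the header is missing or unparseable.
def retry_after_seconds(value, now: Time.now)
  return nil if value.nil? || value.strip.empty?
  return value.to_i if value.strip.match?(/\A\d+\z/)

  # Otherwise expect an HTTP date such as "Wed, 21 Oct 2015 07:28:00 GMT"
  delay = Time.httpdate(value.strip) - now
  delay.positive? ? delay.ceil : 0
rescue ArgumentError
  nil
end
```

Returning `nil` for unusable values lets the caller fall back to its own backoff delay, as `authenticated_request` does.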

Integration with Modern Authentication

For applications requiring more sophisticated authentication handling, consider integrating Mechanize with browser automation tools. You can complete a JavaScript-heavy OAuth flow in Puppeteer, export the resulting session cookies, and load them into Mechanize:

def transfer_session_from_puppeteer
  # Load cookies previously exported from a Puppeteer session
  puppeteer_cookies = JSON.parse(File.read('puppeteer_cookies.json'))

  puppeteer_cookies.each do |cookie_data|
    cookie = Mechanize::Cookie.new(
      cookie_data['name'],
      cookie_data['value']
    )
    cookie.domain = cookie_data['domain']
    cookie.path = cookie_data['path']

    @agent.cookie_jar.add(cookie)
  end
end
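
Exported cookies can go stale between the browser run and the Mechanize run. The sketch below filters and normalizes them before they are added to the jar. It assumes the export uses the field names Puppeteer gives its cookie objects (`name`, `value`, `domain`, `path`, `expires` as Unix seconds with -1 for session cookies, `secure`, `httpOnly`); the helper names are hypothetical.

```ruby
require 'json'

# Convert one exported cookie hash into a plain Ruby hash with a Time expiry.
def normalize_puppeteer_cookie(data)
  expires = data['expires'].to_f
  {
    name: data['name'],
    value: data['value'],
    domain: data['domain'],
    path: data['path'] || '/',
    secure: data['secure'] == true,
    httponly: data['httpOnly'] == true,
    # Session cookies are exported with a non-positive expiry
    expires: expires.positive? ? Time.at(expires) : nil
  }
end

# Parse the export and drop cookies that have already expired.
def usable_cookies(json, now: Time.now)
  JSON.parse(json)
      .map { |cookie| normalize_puppeteer_cookie(cookie) }
      .reject { |cookie| cookie[:expires] && cookie[:expires] <= now }
end
```

Carrying over the `secure` flag and expiry, rather than only name and value, keeps the transferred session closer to what the browser held.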

Testing OAuth Flows

Create a test harness for OAuth implementations:

require 'rspec'
require 'webmock/rspec'

RSpec.describe OAuthScraper do
  before do
    WebMock.enable!
    stub_oauth_endpoints
  end

  let(:scraper) { OAuthScraper.new }

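  it 'successfully completes OAuth flow' do
    html = { 'Content-Type' => 'text/html' }

    # Mechanize only parses responses served as HTML into pages with forms
    stub_request(:get, /oauth\/authorize/)
      .to_return(status: 200, body: mock_authorization_page, headers: html)

    # Submitting the consent form redirects to the callback with a code
    stub_request(:post, /oauth\/authorize/)
      .to_return(status: 302, headers: {
        'Location' => 'http://localhost:8080/callback?code=test_code'
      })

    stub_request(:get, /localhost:8080\/callback/)
      .to_return(status: 200, body: '<html></html>', headers: html)

    stub_request(:post, /oauth\/token/)
      .to_return(
        status: 200,
        body: { access_token: 'test_token' }.to_json,
        headers: { 'Content-Type' => 'application/json' }
      )

    token = scraper.authenticate
    expect(token).to eq('test_token')
  end

  private

  def stub_oauth_endpoints
    # Add WebMock stubs for OAuth endpoints
  end

  def mock_authorization_page
    '<html><body><form action="/oauth/authorize" method="post">
      <button type="submit">Authorize</button>
    </form></body></html>'
  end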
end

Conclusion

Handling complex authentication flows like OAuth with Mechanize requires careful planning and implementation. While Mechanize excels at form-based interactions, OAuth flows often involve multiple redirects, token exchanges, and API calls that need proper session management.

Key considerations include secure credential storage, proper token lifecycle management, robust error handling, and understanding the specific OAuth implementation of your target service. For JavaScript-heavy authentication flows, consider combining Mechanize with browser automation tools for optimal results.

Remember to always respect rate limits, implement proper retry logic, and follow OAuth security best practices when building production scraping applications. When dealing with browser sessions in Puppeteer, you can often transfer the authenticated session to Mechanize for more efficient subsequent requests.
