How do you handle websites that use complex authentication flows like OAuth?
Mechanize can drive OAuth and similar multi-step authentication flows as long as the provider's login and consent pages are plain HTML forms. Mechanize excels at form-based authentication but does not execute JavaScript, so these flows require careful handling of redirects, tokens, and API endpoints, along with proper session management.
Understanding OAuth Flow with Mechanize
OAuth (Open Authorization) is a standard for access delegation that lets an application access a user's account without ever handling the user's password. The most common OAuth 2.0 grant, the authorization code flow, involves these steps:
- Authorization Request: Redirect user to authorization server
- User Authorization: User grants permission
- Authorization Grant: Server returns authorization code
- Access Token Request: Exchange code for access token
- Protected Resource Access: Use token to access resources
Basic OAuth Implementation with Mechanize
Here's a Ruby implementation using Mechanize for OAuth authentication:
require 'mechanize'
require 'uri'
require 'json'
require 'securerandom'

class OAuthScraper
  def initialize
    @agent = Mechanize.new
    @agent.user_agent_alias = 'Windows Mozilla'
    # Placeholders: load real values from the environment or a secret store
    @client_id = 'your_client_id'
    @client_secret = 'your_client_secret'
    @redirect_uri = 'http://localhost:8080/callback'
  end

  def authenticate
    # Step 1: Get authorization URL
    auth_url = build_authorization_url
    puts "Visit this URL to authorize: #{auth_url}"

    # Step 2: Handle authorization (simulate or manual)
    authorization_code = get_authorization_code(auth_url)

    # Step 3: Exchange code for access token
    access_token = exchange_code_for_token(authorization_code)

    # Step 4: Store token for subsequent requests
    @access_token = access_token
    setup_authenticated_agent
    access_token
  end

  private

  def build_authorization_url
    params = {
      response_type: 'code',
      client_id: @client_id,
      redirect_uri: @redirect_uri,
      scope: 'read write',
      state: generate_state_token
    }
    "https://api.example.com/oauth/authorize?" + URI.encode_www_form(params)
  end

  def get_authorization_code(auth_url)
    # Navigate to authorization page
    page = @agent.get(auth_url)

    # Find and fill login form if present (field names vary by provider)
    form = page.form_with(action: /login|signin/)
    if form
      form.username = 'your_username'
      form.password = 'your_password'
      page = @agent.submit(form)
    end

    # Look for authorization form
    auth_form = page.form_with(action: /authorize|consent/)
    if auth_form
      # Submit authorization (grant permission). Mechanize follows the
      # redirect, so something must be listening on the redirect URI.
      redirect_page = @agent.submit(auth_form)

      # Extract authorization code from redirect URL
      if redirect_page.uri.to_s.start_with?(@redirect_uri)
        params = URI.decode_www_form(redirect_page.uri.query || '').to_h
        return params['code'] if params['code']
      end
    end

    raise "Failed to obtain authorization code"
  end

  def exchange_code_for_token(code)
    token_url = 'https://api.example.com/oauth/token'
    response = @agent.post(token_url, {
      grant_type: 'authorization_code',
      code: code,
      redirect_uri: @redirect_uri,
      client_id: @client_id,
      client_secret: @client_secret
    })
    token_data = JSON.parse(response.body)
    token_data['access_token']
  end

  def setup_authenticated_agent
    # Add Authorization header for subsequent requests. The hook runs for
    # every request this agent makes, including requests to other hosts.
    @agent.pre_connect_hooks << lambda do |agent, request|
      request['Authorization'] = "Bearer #{@access_token}"
    end
  end

  def generate_state_token
    SecureRandom.hex(16)
  end
end
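The class above sends a state value with the authorization request but never compares it with the one that comes back on the callback. A minimal sketch of that check follows, assuming the callback URL is available as a string. `extract_code` is a hypothetical helper, not part of Mechanize.

```ruby
require 'uri'
require 'securerandom'

# Hypothetical helper: pull the authorization code out of a callback URL,
# refusing to continue if the state parameter does not match the one we sent.
def extract_code(callback_url, expected_state)
  params = URI.decode_www_form(URI.parse(callback_url).query || '').to_h

  if params['error']
    raise ArgumentError, "authorization failed: #{params['error']}"
  end
  unless params['state'] == expected_state
    raise ArgumentError, 'state mismatch, possible CSRF'
  end

  params.fetch('code') { raise ArgumentError, 'no code in callback URL' }
end

state = SecureRandom.hex(16)
callback = "http://localhost:8080/callback?code=abc123&state=#{state}"
puts extract_code(callback, state) # prints abc123
```

To use a check like this inside the scraper, the state would have to be generated once, stored, and passed in when the redirect is parsed.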
Handling Different OAuth Providers
Google OAuth Implementation
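class GoogleOAuthScraper < OAuthScraper
  def initialize
    super
    @auth_base_url = 'https://accounts.google.com/o/oauth2/v2/auth'
    @token_url = 'https://oauth2.googleapis.com/token'
    @scope = 'https://www.googleapis.com/auth/userinfo.profile'
  end

  def get_user_profile
    profile_url = 'https://www.googleapis.com/oauth2/v2/userinfo'
    response = @agent.get(profile_url)
    JSON.parse(response.body)
  end

  private

  def build_authorization_url
    params = {
      response_type: 'code',
      client_id: @client_id,
      redirect_uri: @redirect_uri,
      scope: @scope,
      access_type: 'offline',
      state: generate_state_token
    }
    "#{@auth_base_url}?" + URI.encode_www_form(params)
  end

  # The base class posts to a hard-coded token URL, so the exchange is
  # overridden here to use Google's token endpoint.
  def exchange_code_for_token(code)
    response = @agent.post(@token_url, {
      grant_type: 'authorization_code',
      code: code,
      redirect_uri: @redirect_uri,
      client_id: @client_id,
      client_secret: @client_secret
    })
    JSON.parse(response.body)['access_token']
  end
end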
GitHub OAuth Implementation
class GitHubOAuthScraper < OAuthScraper
  def initialize
    super
    @auth_base_url = 'https://github.com/login/oauth/authorize'
    @token_url = 'https://github.com/login/oauth/access_token'
  end

  def build_authorization_url
    params = {
      client_id: @client_id,
      redirect_uri: @redirect_uri,
      scope: 'user:email',
      state: generate_state_token
    }
    "#{@auth_base_url}?" + URI.encode_www_form(params)
  end

  def exchange_code_for_token(code)
    # Ask for JSON explicitly via the Accept header
    response = @agent.post(@token_url, {
      client_id: @client_id,
      client_secret: @client_secret,
      code: code
    }, { 'Accept' => 'application/json' })
    token_data = JSON.parse(response.body)
    token_data['access_token']
  end
end
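The token exchanges above read `access_token` straight from the parsed body, which silently yields nil when the server reports a failure. OAuth 2.0 error responses carry an `error` field and an optional `error_description`, so a small parser can surface them. This is a sketch only; `parse_token_response` is a hypothetical helper name.

```ruby
require 'json'

# Hypothetical helper: turn a token endpoint response body into a hash,
# raising if the server reported an OAuth error or omitted the token.
def parse_token_response(body)
  data = JSON.parse(body)

  if data['error']
    detail = data['error_description'] || 'no description'
    raise "token request failed: #{data['error']} (#{detail})"
  end
  raise 'response has no access_token' unless data['access_token']

  {
    access_token: data['access_token'],
    refresh_token: data['refresh_token'],
    expires_in: data['expires_in']
  }
end

tokens = parse_token_response('{"access_token":"abc","expires_in":3600}')
puts tokens[:access_token] # prints abc
```

Each `exchange_code_for_token` method could pass `response.body` through a helper like this instead of indexing the hash directly.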
Advanced Authentication Patterns
PKCE (Proof Key for Code Exchange)
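For clients that cannot keep a client secret, or for added protection against intercepted authorization codes, implement the PKCE flow. The client sends a hash of a random verifier with the authorization request and proves possession of the verifier when it exchanges the code: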
require 'digest'
require 'base64'
require 'securerandom'

class PKCEOAuthScraper < OAuthScraper
  def initialize
    super
    generate_pkce_challenge
  end

  private

  def generate_pkce_challenge
    # 43 random bytes give a verifier within the allowed 43-128 characters
    @code_verifier = SecureRandom.urlsafe_base64(43)
    # S256 challenge: unpadded base64url of the verifier's SHA-256 digest
    @code_challenge = Base64.urlsafe_encode64(
      Digest::SHA256.digest(@code_verifier),
      padding: false
    )
  end

  def build_authorization_url
    params = {
      response_type: 'code',
      client_id: @client_id,
      redirect_uri: @redirect_uri,
      scope: 'read write',
      state: generate_state_token,
      code_challenge: @code_challenge,
      code_challenge_method: 'S256'
    }
    "https://api.example.com/oauth/authorize?" + URI.encode_www_form(params)
  end

  def exchange_code_for_token(code)
    # The verifier replaces the client secret in the token request
    response = @agent.post('https://api.example.com/oauth/token', {
      grant_type: 'authorization_code',
      code: code,
      redirect_uri: @redirect_uri,
      client_id: @client_id,
      code_verifier: @code_verifier
    })
    token_data = JSON.parse(response.body)
    token_data['access_token']
  end
end
Token Management and Refresh
Implement proper token lifecycle management:
require 'json'
require 'time'

class TokenManager
  # client_id and client_secret are needed for the refresh request
  def initialize(agent, client_id, client_secret)
    @agent = agent
    @client_id = client_id
    @client_secret = client_secret
    @token_file = 'oauth_tokens.json'
  end

  def save_tokens(access_token, refresh_token = nil, expires_in = nil)
    tokens = {
      access_token: access_token,
      refresh_token: refresh_token,
      # ISO 8601 so the timestamp survives the JSON round trip
      expires_at: expires_in ? (Time.now + expires_in.to_i).iso8601 : nil
    }
    # Tokens are written in plain text; restrict access to this file
    File.write(@token_file, JSON.pretty_generate(tokens))
  end

  def load_tokens
    return nil unless File.exist?(@token_file)

    JSON.parse(File.read(@token_file), symbolize_names: true)
  end

  def refresh_access_token(refresh_token)
    response = @agent.post('https://api.example.com/oauth/token', {
      grant_type: 'refresh_token',
      refresh_token: refresh_token,
      client_id: @client_id,
      client_secret: @client_secret
    })
    token_data = JSON.parse(response.body)
    # Some servers rotate the refresh token; keep the old one otherwise
    save_tokens(
      token_data['access_token'],
      token_data['refresh_token'] || refresh_token,
      token_data['expires_in']
    )
    token_data['access_token']
  end

  def get_valid_token
    tokens = load_tokens
    return nil unless tokens

    # Check if token is expired
    expired = tokens[:expires_at] && Time.now > Time.parse(tokens[:expires_at])
    return tokens[:access_token] unless expired
    return nil unless tokens[:refresh_token] # Need to re-authenticate

    refresh_access_token(tokens[:refresh_token])
  end
end
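A token that expires a second after the check passes will still fail on the wire, so it can help to treat tokens as expired slightly early. The following is a minimal sketch of such a check; `token_expired?` and the 60-second default leeway are assumptions for illustration, not values any provider prescribes.

```ruby
require 'time'

# Hypothetical helper: treat a token as expired slightly before its real
# expiry so a request started now does not fail mid-flight.
def token_expired?(expires_at, now: Time.now, leeway: 60)
  return false if expires_at.nil? # no expiry recorded, assume still valid

  expiry = expires_at.is_a?(Time) ? expires_at : Time.parse(expires_at.to_s)
  now >= expiry - leeway
end

now = Time.utc(2024, 1, 1, 12, 0, 0)
puts token_expired?('2024-01-01T12:00:30Z', now: now) # prints true
puts token_expired?('2024-01-01T12:05:00Z', now: now) # prints false
```

The `now:` keyword exists so the check can be tested without stubbing the clock.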
Handling Complex Multi-Step Flows
Some applications require additional steps beyond basic OAuth:
require 'mechanize'

class ComplexAuthFlow
  def initialize
    @agent = Mechanize.new
    @agent.user_agent_alias = 'Windows Mozilla'
  end

  def authenticate_with_mfa
    # Step 1: Basic login
    login_page = @agent.get('https://example.com/login')
    form = login_page.form_with(action: /login/)
    raise 'Login form not found' unless form

    form.username = 'your_username'
    form.password = 'your_password'
    page = @agent.submit(form)

    # Step 2: Handle MFA challenge, if the site presents one
    if page.content.include?('verification code')
      mfa_form = page.form_with(action: /verify|mfa/)
      raise 'MFA form not found' unless mfa_form

      print 'Enter MFA code: '
      mfa_form.code = $stdin.gets.to_s.chomp
      page = @agent.submit(mfa_form)
    end

    # Step 3: Extract session tokens or cookies from the page we ended on
    extract_session_data(page)
  end

  private

  def extract_session_data(page)
    # Look for session tokens in various places
    csrf_token = page.at('meta[name="csrf-token"]')&.attr('content')

    # Extract from JavaScript variables
    script_content = page.search('script').map(&:content).join
    session_match = script_content.match(/sessionToken['"]\s*:\s*['"]([^'"]+)/)
    session_token = session_match[1] if session_match

    {
      csrf_token: csrf_token,
      session_token: session_token,
      cookies: @agent.cookies
    }
  end
end
Best Practices and Security Considerations
Secure Credential Storage
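require 'open3'

class SecureCredentialManager
  def initialize
    @keyring = ENV['CREDENTIAL_STORE'] || 'keychain'
  end

  def store_credentials(service, account, password)
    case @keyring
    when 'keychain'
      # macOS only. Arguments are passed as a list so no shell parses them,
      # which avoids command injection through quotes in the values.
      system('security', 'add-generic-password',
             '-s', service, '-a', account, '-w', password)
    when 'env'
      # Store in environment variables (less secure, current process only)
      ENV[env_key(service, account)] = password
    end
  end

  def get_credentials(service, account)
    case @keyring
    when 'keychain'
      stdout, _stderr, status = Open3.capture3(
        'security', 'find-generic-password', '-s', service, '-a', account, '-w'
      )
      # nil when the entry does not exist or the lookup fails
      status.success? ? stdout.strip : nil
    when 'env'
      ENV[env_key(service, account)]
    end
  end

  private

  def env_key(service, account)
    "#{service}_#{account}".upcase
  end
end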
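For the client ID and secret themselves, reading from the environment and failing early is often enough. A minimal sketch follows; the variable names `OAUTH_CLIENT_ID` and `OAUTH_CLIENT_SECRET` and the helper `oauth_credentials` are assumptions chosen for this example.

```ruby
# Hypothetical helper: read OAuth client credentials from the environment
# (or any hash-like object) and fail early if one is missing.
def oauth_credentials(env = ENV)
  {
    client_id: env.fetch('OAUTH_CLIENT_ID'),
    client_secret: env.fetch('OAUTH_CLIENT_SECRET')
  }
rescue KeyError => e
  raise ArgumentError, "missing credential (#{e.message})"
end

creds = oauth_credentials(
  'OAUTH_CLIENT_ID' => 'demo-id',
  'OAUTH_CLIENT_SECRET' => 'demo-secret'
)
puts creds[:client_id] # prints demo-id
```

Passing the environment in as an argument keeps the helper testable with a plain hash.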
Error Handling and Retry Logic
class RobustOAuthScraper < OAuthScraper
  MAX_RETRIES = 3
  RETRY_DELAY = 2

  def authenticated_request(url, method = :get, data = nil)
    retries = 0
    begin
      case method
      when :get
        @agent.get(url)
      when :post
        @agent.post(url, data || {})
      else
        raise ArgumentError, "Unsupported method: #{method}"
      end
    rescue Mechanize::ResponseCodeError => e
      # Mechanize raises ResponseCodeError for error responses such as
      # 401 and 429; response_code is a string.
      raise unless %w[401 429].include?(e.response_code)

      if retries >= MAX_RETRIES
        reason = e.response_code == '401' ? 'Authentication failed' : 'Rate limited'
        raise "#{reason} after #{MAX_RETRIES} retries"
      end

      retries += 1
      if e.response_code == '401'
        refresh_token_if_available
        sleep(RETRY_DELAY * retries)
      else
        sleep(extract_retry_after(e.page) || (RETRY_DELAY * retries))
      end
      retry
    end
  end

  private

  # Placeholder: re-runs the full flow. Replace with a refresh-token
  # request if the provider issues refresh tokens.
  def refresh_token_if_available
    authenticate
  end

  def extract_retry_after(page)
    retry_after = page&.response&.[]('retry-after')
    retry_after ? retry_after.to_i : nil
  end
end
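The wait time between attempts can be factored into a small pure function, which makes the backoff policy easy to test. The sketch below honors a numeric Retry-After value when one is present and otherwise doubles the delay on each attempt. `retry_delay` and its default values are assumptions for illustration.

```ruby
# Hypothetical helper: seconds to wait before retry number `attempt` (1-based).
# Honors a numeric Retry-After value when present, otherwise doubles the
# base delay on each attempt. Both paths are capped at max_delay.
def retry_delay(attempt, retry_after: nil, base: 2, max_delay: 60)
  if retry_after.to_s.match?(/\A\d+\z/)
    return [retry_after.to_i, max_delay].min
  end

  [base * (2**(attempt - 1)), max_delay].min
end

puts retry_delay(1)                    # prints 2
puts retry_delay(3)                    # prints 8
puts retry_delay(1, retry_after: '30') # prints 30
```

Retry-After may also be an HTTP date. This sketch ignores that form and falls back to exponential backoff.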
Integration with Modern Authentication
For applications requiring more sophisticated authentication handling, consider integrating Mechanize with browser automation tools. For JavaScript-heavy OAuth flows, you can complete the login in Puppeteer, export the session cookies, and load them into Mechanize:
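def transfer_session_from_puppeteer
  # Cookies exported from the Puppeteer session as a JSON array
  puppeteer_cookies = JSON.parse(File.read('puppeteer_cookies.json'))

  puppeteer_cookies.each do |cookie_data|
    cookie = Mechanize::Cookie.new(
      cookie_data['name'],
      cookie_data['value']
    )
    cookie.domain = cookie_data['domain']
    cookie.path = cookie_data['path'] || '/'
    # Carry over the security flags so the cookie is sent under the same rules
    cookie.secure = cookie_data['secure'] == true
    cookie.httponly = cookie_data['httpOnly'] == true
    @agent.cookie_jar.add(cookie)
  end
end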
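Exported cookie files are easy to get subtly wrong, so it can be worth normalizing each entry before building cookies from it. The sketch below uses only the standard library. It assumes the export has the shape Puppeteer's `page.cookies()` typically produces, with `httpOnly` in camel case and a non-positive `expires` for session cookies. `cookie_attributes` is a hypothetical helper.

```ruby
require 'json'

# Hypothetical helper: normalize one exported cookie hash into the
# attributes needed to rebuild it on the Mechanize side.
def cookie_attributes(data)
  expires = data['expires']
  {
    name: data.fetch('name'),
    value: data.fetch('value'),
    domain: data.fetch('domain'),
    path: data['path'] || '/',
    secure: data['secure'] == true,
    httponly: data['httpOnly'] == true,
    # Assumption: session cookies are exported with a non-positive expires
    expires: expires.is_a?(Numeric) && expires.positive? ? Time.at(expires) : nil
  }
end

exported = JSON.parse('[{"name":"sid","value":"abc","domain":".example.com","expires":-1,"httpOnly":true}]')
attrs = cookie_attributes(exported.first)
puts attrs[:path]          # prints /
puts attrs[:expires].inspect # prints nil
```

Using `fetch` for name, value, and domain makes a malformed export fail loudly instead of producing a cookie that is never sent.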
Testing OAuth Flows
Create a test harness for OAuth implementations:
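require 'rspec'
require 'webmock/rspec'
# Also load the file that defines OAuthScraper, for example with require_relative

RSpec.describe OAuthScraper do
  let(:scraper) { OAuthScraper.new }
  # Mechanize only parses forms when the response is served as HTML
  let(:html_headers) { { 'Content-Type' => 'text/html' } }

  # Consent page with a POST form so the submit differs from the initial GET
  def mock_authorization_page
    '<html><body><form action="/oauth/authorize" method="post">
    <button type="submit">Authorize</button>
    </form></body></html>'
  end

  before do
    # Authorization page
    stub_request(:get, %r{oauth/authorize})
      .to_return(status: 200, body: mock_authorization_page, headers: html_headers)
    # Granting consent redirects to the callback with a code
    stub_request(:post, %r{oauth/authorize})
      .to_return(status: 302, headers: { 'Location' => 'http://localhost:8080/callback?code=test_code' })
    stub_request(:get, %r{localhost:8080/callback})
      .to_return(status: 200, body: 'OK', headers: html_headers)
    # Token exchange
    stub_request(:post, %r{oauth/token})
      .to_return(
        status: 200,
        body: { access_token: 'test_token' }.to_json,
        headers: { 'Content-Type' => 'application/json' }
      )
  end

  it 'successfully completes OAuth flow' do
    token = scraper.authenticate
    expect(token).to eq('test_token')
  end
end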
Conclusion
Handling complex authentication flows like OAuth with Mechanize requires careful planning and implementation. While Mechanize excels at form-based interactions, OAuth flows often involve multiple redirects, token exchanges, and API calls that need proper session management.
Key considerations include secure credential storage, proper token lifecycle management, robust error handling, and understanding the specific OAuth implementation of your target service. For JavaScript-heavy authentication flows, consider combining Mechanize with browser automation tools for optimal results.
Always respect rate limits, implement retry logic with backoff, and follow OAuth security best practices when building production scraping applications. When the login itself has to run in a browser such as Puppeteer, you can often transfer the authenticated session to Mechanize for more efficient subsequent requests.