How to Handle Forms with CSRF Tokens and Anti-Bot Measures in Mechanize
When scraping websites with forms, you'll often encounter security measures designed to prevent automated submissions. Cross-Site Request Forgery (CSRF) tokens are the most common protection mechanism, but websites may also implement CAPTCHAs, rate limiting, and behavioral analysis. This guide shows you how to handle these challenges using Mechanize in Ruby.
Understanding CSRF Tokens
CSRF tokens are random values embedded in forms to verify that form submissions come from legitimate users who accessed the form page. These tokens are typically:
- Generated server-side for each session or request
- Embedded as hidden form fields
- Required for successful form submission
- Single-use or time-limited
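Before writing any extraction code, it helps to look at what the server actually sends. Here is a minimal sketch that lists a form's hidden fields; the URL is an assumption, and the token's field name varies by framework (Rails uses authenticity_token, Laravel uses _token):
require 'mechanize'
agent = Mechanize.new
page = agent.get('https://example.com/login') # assumed URL
# Hidden inputs usually carry the CSRF token
page.forms.first.hiddens.each do |field|
  puts "#{field.name} = #{field.value}"
end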
Basic CSRF Token Handling
Extracting CSRF Tokens from Forms
Mechanize automatically handles most CSRF tokens when they're embedded as hidden form fields. Here's how to work with them:
require 'mechanize'
agent = Mechanize.new
page = agent.get('https://example.com/login')
# Find the form containing the CSRF token
form = page.forms.first
# Mechanize automatically includes hidden fields including CSRF tokens
puts "Form fields:"
form.fields.each do |field|
puts "#{field.name}: #{field.value}"
end
# Fill in other form fields
form.username = 'your_username'
form.password = 'your_password'
# Submit the form - CSRF token is automatically included
result_page = agent.submit(form)
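If a page contains several forms, page.forms.first may pick the wrong one. Mechanize's form_with helper selects a form by attribute instead; in this sketch the action and name values are assumptions:
# Select the form by its action or name rather than by position
form = page.form_with(action: '/login')
# or
form = page.form_with(name: 'login_form')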
Manual CSRF Token Extraction
Sometimes you need to extract CSRF tokens manually, especially when they're not in standard hidden fields:
require 'mechanize'
agent = Mechanize.new
page = agent.get('https://example.com/protected-form')
# Extract CSRF token from meta tag
csrf_token = page.at('meta[name="csrf-token"]')['content']
# Or extract from a custom location
csrf_token = page.at('#csrf-token')['value']
# Or extract from a JavaScript variable (this assumes the first script tag defines csrfToken)
script_content = page.at('script').text
csrf_token = script_content.match(/csrfToken:\s*['"]([^'"]+)['"]/)[1]
puts "Extracted CSRF token: #{csrf_token}"
# Use the token in form submission
form = page.forms.first
form.add_field!('_token', csrf_token)
form.username = 'user@example.com'
result_page = agent.submit(form)
Advanced CSRF Token Scenarios
Dynamic Token Generation
Some websites generate CSRF tokens through AJAX requests:
require 'mechanize'
require 'json'
agent = Mechanize.new
# First, get the main page
page = agent.get('https://example.com/form-page')
# Make an AJAX request to get the CSRF token
token_response = agent.get('https://example.com/api/csrf-token')
token_data = JSON.parse(token_response.body)
csrf_token = token_data['token']
# Now use the token in your form submission
form = page.forms.first
form.add_field!('csrf_token', csrf_token)
form.email = 'user@example.com'
result_page = agent.submit(form)
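Some applications expect the token in a request header (commonly X-CSRF-Token) rather than in the form body. A sketch of that variant, reusing the token fetched from the API above:
# Send the token as a header on every subsequent request instead of a form field
agent.request_headers = { 'X-CSRF-Token' => csrf_token }
result_page = agent.submit(form)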
Session-Based Token Management
For multi-step forms or session-dependent tokens:
require 'mechanize'
class CSRFFormHandler
  def initialize
    @agent = Mechanize.new
    @csrf_token = nil
  end
  def login_and_get_token
    # Log in to establish a session
    login_page = @agent.get('https://example.com/login')
    login_form = login_page.forms.first
    login_form.username = 'your_username'
    login_form.password = 'your_password'
    dashboard = @agent.submit(login_form)
    # Extract the CSRF token from the authenticated page
    @csrf_token = dashboard.at('meta[name="csrf-token"]')['content']
  end
  def submit_protected_form(data)
    form_page = @agent.get('https://example.com/protected-form')
    form = form_page.forms.first
    # Add the CSRF token
    form.add_field!('_token', @csrf_token)
    # Add form data
    data.each do |key, value|
      form.send("#{key}=", value)
    end
    @agent.submit(form)
  end
end
# Usage
handler = CSRFFormHandler.new
handler.login_and_get_token
result = handler.submit_protected_form(
  title: 'My Post',
  content: 'Post content here'
)
Handling Other Anti-Bot Measures
User-Agent Rotation
Many websites block requests with default or suspicious user agents:
require 'mechanize'
# Set a realistic user agent
agent = Mechanize.new
agent.user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
# Or rotate between multiple user agents
user_agents = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
  'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36'
]
def make_request_with_rotation(agent, url, user_agents)
  agent.user_agent = user_agents.sample
  agent.get(url)
end
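A quick usage sketch (the URL is an assumption):
page = make_request_with_rotation(agent, 'https://example.com/form-page', user_agents)
puts page.title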
Rate Limiting and Delays
Implement delays to avoid triggering rate limits:
require 'mechanize'
class PoliteFormSubmitter
  def initialize(delay_range: 1..3)
    @agent = Mechanize.new
    @delay_range = delay_range
  end
  def submit_forms(urls_and_data)
    urls_and_data.each do |url, form_data|
      # Random delay between requests
      sleep(rand(@delay_range))
      begin
        page = @agent.get(url)
        form = page.forms.first
        # Extract CSRF token
        csrf_token = extract_csrf_token(page)
        form.add_field!('_token', csrf_token) if csrf_token
        # Fill form data
        form_data.each { |key, value| form.send("#{key}=", value) }
        @agent.submit(form)
        puts "Form submitted successfully for #{url}"
      rescue StandardError => e
        puts "Error submitting form for #{url}: #{e.message}"
      end
    end
  end
  private
  def extract_csrf_token(page)
    page.at('meta[name="csrf-token"]')&.[]('content') ||
      page.at('input[name="_token"]')&.[]('value')
  end
end
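A hypothetical usage sketch; the URLs and field names below are assumptions and must match fields that actually exist on each form:
submitter = PoliteFormSubmitter.new(delay_range: 2..5)
submitter.submit_forms(
  'https://example.com/contact' => { email: 'jane@example.com', message: 'Hello' },
  'https://example.com/feedback' => { email: 'jane@example.com', comment: 'Great service' }
)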
Cookie and Session Management
Proper cookie handling is crucial for maintaining sessions:
require 'mechanize'
agent = Mechanize.new
# Enable cookie persistence
agent.cookie_jar.load('/path/to/cookies.txt') if File.exist?('/path/to/cookies.txt')
# Set up cookie saving
# Save cookies on exit; HTTP::CookieJar#save skips session-only cookies
# unless you pass session: true
at_exit do
  agent.cookie_jar.save('/path/to/cookies.txt')
end
# Handle session cookies properly
page = agent.get('https://example.com/login')
form = page.forms.first
form.username = 'user'
form.password = 'pass'
# Submit and maintain session
logged_in_page = agent.submit(form)
# Now access protected resources with maintained session
protected_page = agent.get('https://example.com/protected-area')
Handling JavaScript-Heavy Protection
For websites that generate CSRF tokens or implement protection via JavaScript, Mechanize alone may not be enough, because it does not execute JavaScript. Often, though, you can still extract the necessary values directly from the page source:
require 'mechanize'
def extract_js_variables(page, variable_name)
  scripts = page.search('script')
  scripts.each do |script|
    content = script.text
    if (match = content.match(/#{variable_name}\s*[:=]\s*['"]([^'"]+)['"]/))
      return match[1]
    end
  end
  nil
end
agent = Mechanize.new
page = agent.get('https://example.com/js-protected-form')
# Extract JavaScript-embedded CSRF token
csrf_token = extract_js_variables(page, 'csrfToken')
puts "Extracted JS CSRF token: #{csrf_token}"
# Use in form submission
form = page.forms.first
form.add_field!('csrf_token', csrf_token)
result_page = agent.submit(form)
For more complex JavaScript scenarios, consider using browser automation tools like Puppeteer, which can execute JavaScript and handle dynamic content.
Best Practices and Error Handling
Robust CSRF Token Extraction
Create a robust token extraction method:
def extract_csrf_token(page)
  # Try multiple common CSRF token locations
  selectors = [
    'meta[name="csrf-token"]',
    'meta[name="_token"]',
    'input[name="csrf_token"]',
    'input[name="_token"]',
    'input[name="authenticity_token"]'
  ]
  selectors.each do |selector|
    element = page.at(selector)
    if element
      return element['content'] || element['value']
    end
  end
  # Try extracting from JavaScript
  scripts = page.search('script')
  scripts.each do |script|
    content = script.text
    patterns = [
      /csrfToken['":\s]*['"]([^'"]+)['"]/,
      /_token['":\s]*['"]([^'"]+)['"]/,
      /authenticity_token['":\s]*['"]([^'"]+)['"]/
    ]
    patterns.each do |pattern|
      if (match = content.match(pattern))
        return match[1]
      end
    end
  end
  nil
end
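For example, assuming the agent and URL from the earlier snippets:
page = agent.get('https://example.com/protected-form')
token = extract_csrf_token(page)
puts token ? "Found CSRF token: #{token}" : 'No CSRF token found'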
Error Handling and Retry Logic
Implement proper error handling for failed submissions:
def submit_form_with_retry(agent, url, form_data, max_retries: 3)
  retries = 0
  begin
    page = agent.get(url)
    form = page.forms.first
    # Extract and add CSRF token
    csrf_token = extract_csrf_token(page)
    raise "CSRF token not found" unless csrf_token
    form.add_field!('_token', csrf_token)
    # Fill form data
    form_data.each { |key, value| form.send("#{key}=", value) }
    result = agent.submit(form)
    # Naive success check; adapt this heuristic to the target site
    if result.body.include?('success') || result.code == '200'
      return result
    else
      raise "Form submission failed"
    end
  rescue StandardError => e
    retries += 1
    if retries <= max_retries
      puts "Retry #{retries}/#{max_retries}: #{e.message}"
      sleep(2**retries) # Exponential backoff
      retry
    else
      raise "Form submission failed after #{max_retries} retries: #{e.message}"
    end
  end
end
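A usage sketch building on the helpers above; the URL and field names are assumptions:
result = submit_form_with_retry(
  agent,
  'https://example.com/protected-form',
  { email: 'user@example.com', message: 'Hello' }
)
puts "Submitted with status #{result.code}"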
Working with Complex Anti-Bot Systems
Behavioral Mimicking
Some advanced protection systems analyze user behavior:
require 'mechanize'
class HumanLikeFormFiller
  attr_reader :agent # expose the agent so callers can fetch pages
  def initialize
    @agent = Mechanize.new
    setup_realistic_headers
  end
  def fill_form_naturally(page, form_data)
    form = page.forms.first
    # Simulate human-like pacing. Note that Mechanize only sends a single
    # request when the form is submitted, so what the server can observe
    # is the overall timing, not individual keystrokes.
    form_data.each do |field, value|
      # Small delay before filling each field
      sleep(rand(0.5..2.0))
      # Pause longer for longer values, as typing them would take longer
      sleep(rand(0.3..0.8)) if value.length > 10
      form.send("#{field}=", value)
    end
    # Wait before submitting (simulating user review)
    sleep(rand(2..5))
    @agent.submit(form)
  end
  private
  def setup_realistic_headers
    @agent.user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    @agent.request_headers = {
      'Accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
      'Accept-Language' => 'en-US,en;q=0.5',
      'Accept-Encoding' => 'gzip, deflate',
      'DNT' => '1',
      'Connection' => 'keep-alive',
      'Upgrade-Insecure-Requests' => '1'
    }
  end
end
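A hypothetical usage sketch that relies on the agent reader above; the URL and field names are assumptions, and values should be strings because fill_form_naturally checks value.length:
filler = HumanLikeFormFiller.new
page = filler.agent.get('https://example.com/signup')
filler.fill_form_naturally(page, { username: 'jane_doe', email: 'jane.doe@example.com' })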
Conclusion
Handling CSRF tokens and anti-bot measures with Mechanize requires understanding how these protection mechanisms work and implementing appropriate extraction and handling strategies. Key principles include:
Automatic vs Manual Token Handling: Let Mechanize handle standard hidden fields automatically, but be prepared to extract tokens manually from meta tags or JavaScript.
Session Management: Maintain proper cookie handling and session state across requests.
Respectful Scraping: Implement delays, rotate user agents, and avoid overwhelming target servers.
Error Handling: Build robust retry logic and graceful error handling.
Behavioral Mimicking: For advanced protection systems, simulate human-like behavior with realistic timing and headers.
For websites with heavy JavaScript-based protection that Mechanize cannot handle alone, consider using headless browser solutions that can execute JavaScript and handle dynamic content more effectively.
Remember to always respect robots.txt files, implement appropriate delays, and ensure your scraping activities comply with the website's terms of service and applicable laws.