How to Handle Forms with CSRF Tokens Using MechanicalSoup
Cross-Site Request Forgery (CSRF) tokens are security mechanisms used by web applications to prevent unauthorized form submissions. When scraping websites with MechanicalSoup, you'll often encounter forms protected by CSRF tokens. This guide will show you how to properly handle these tokens to successfully submit forms.
Understanding CSRF Tokens
CSRF tokens are unique, randomly generated values that web applications embed in forms to verify that a submission originates from a page the application itself served, rather than from a third-party site. These tokens are typically:
- Generated server-side and embedded in HTML forms
- Required to be submitted along with form data
- Validated server-side before processing the request
- Single-use or session-based
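To make the list above concrete, here is a toy sketch of what the server side might do. It uses only the standard library; the `TokenStore` class and the session IDs are hypothetical and do not correspond to any particular framework's implementation.

```python
import hmac
import secrets


class TokenStore:
    """Toy server-side store holding one CSRF token per session ID."""

    def __init__(self):
        self._tokens = {}

    def issue(self, session_id):
        # Generated server-side; this value would be embedded in the HTML form.
        token = secrets.token_urlsafe(32)
        self._tokens[session_id] = token
        return token

    def validate(self, session_id, submitted, single_use=False):
        expected = self._tokens.get(session_id)
        # Constant-time comparison of the stored and submitted values.
        if expected is None or not hmac.compare_digest(expected, submitted):
            return False
        if single_use:
            # A single-use token is discarded after the first successful check.
            del self._tokens[session_id]
        return True


store = TokenStore()
token = store.issue("session-a")
print(store.validate("session-a", token))                   # same session
print(store.validate("session-b", token))                   # other session
print(store.validate("session-a", token, single_use=True))  # consumes the token
print(store.validate("session-a", token))                   # already consumed
```

The practical consequence for scraping is that the token must be read from a page fetched in the same session that later submits the form.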
Basic CSRF Token Handling
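MechanicalSoup does not treat CSRF tokens specially, but it handles the most common case for free: when the token is a hidden input inside the form, it is submitted along with every other field of the selected form, and the session cookies it is tied to are sent automatically. Here's a basic example: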
import mechanicalsoup
# Create a browser instance
browser = mechanicalsoup.StatefulBrowser()
browser.set_user_agent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36')
# Navigate to the login page
browser.open('https://example.com/login')
# Select the form (its hidden inputs, including any CSRF token, keep the values the server sent)
browser.select_form('form[action="/login"]')
# Fill in the form fields
browser['username'] = 'your_username'
browser['password'] = 'your_password'
# Submit the form (the hidden CSRF field is sent along with the other fields)
response = browser.submit_selected()
print(f"Status code: {response.status_code}")
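The following standard-library sketch illustrates why the basic example works without touching the token. It is not MechanicalSoup's own code; the HTML, the field name `csrf_token`, and its value are made up for illustration. It collects every input of a form, hidden ones included, and builds the body that a POST would carry.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode

# Hypothetical login page; the token field name and value are invented.
LOGIN_HTML = """
<form action="/login" method="post">
  <input type="hidden" name="csrf_token" value="abc123">
  <input type="text" name="username">
  <input type="password" name="password">
</form>
"""


class FormFieldCollector(HTMLParser):
    """Collect name/value pairs of <input> elements, hidden ones included."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        if name:
            # Inputs without a value attribute start out empty.
            self.fields[name] = attributes.get("value") or ""


collector = FormFieldCollector()
collector.feed(LOGIN_HTML)

# Fill in the visible fields; the hidden token keeps the server-sent value.
payload = dict(collector.fields, username="your_username", password="your_password")
body = urlencode(payload)
print(body)
```

The hidden field travels in the request body next to the username and password, which is the behavior the basic example relies on.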
Manual CSRF Token Extraction
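Sometimes the token lives outside the form (for example in a meta tag), uses a non-standard field name, or has to be sent as a request header. In those cases, extract it from the parsed page yourself: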
import mechanicalsoup

browser = mechanicalsoup.StatefulBrowser()
browser.open('https://example.com/form-page')

# Get the current page (a BeautifulSoup object)
page = browser.get_current_page()
csrf_token = None

# Method 1: Extract CSRF token from a hidden input
csrf_input = page.find('input', {'name': 'csrf_token'})
if csrf_input:
    csrf_token = csrf_input.get('value')

# Method 2: Extract from a meta tag
meta_csrf = page.find('meta', {'name': 'csrf-token'})
if meta_csrf:
    csrf_token = meta_csrf.get('content')

# Method 3: Extract from a hidden input with a different name (e.g. '_token')
token_input = page.find('input', {'name': '_token'})
if token_input:
    csrf_token = token_input.get('value')

print(f"CSRF Token: {csrf_token}")
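A token found in a meta tag is usually meant to be sent as a request header rather than as a form field. The sketch below maps three frameworks to the header names they accept by default, to the best of my knowledge; a given site may be configured differently, and the `csrf_headers` helper is a hypothetical name introduced here.

```python
# Default header names per framework; individual sites may override them.
CSRF_HEADER_NAMES = {
    "django": "X-CSRFToken",
    "laravel": "X-CSRF-TOKEN",
    "rails": "X-CSRF-Token",
}


def csrf_headers(token, framework):
    """Build the header dict carrying the token for the given framework."""
    try:
        header_name = CSRF_HEADER_NAMES[framework]
    except KeyError:
        raise ValueError(f"Unknown framework: {framework!r}") from None
    return {header_name: token}


print(csrf_headers("abc123", "rails"))
```

With MechanicalSoup, the result can be applied to all later requests through the underlying requests session, for example `browser.session.headers.update(csrf_headers(csrf_token, "rails"))`.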
Advanced CSRF Token Handling
For more complex scenarios, you can wrap token extraction and form submission in a reusable helper class:
import mechanicalsoup

class CSRFHandler:
    def __init__(self):
        self.browser = mechanicalsoup.StatefulBrowser()
        self.browser.set_user_agent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36')

    def extract_csrf_token(self, page, token_name='csrf_token'):
        """Extract CSRF token from various sources"""
        # Try hidden input field
        token_input = page.find('input', {'name': token_name})
        if token_input and token_input.get('value'):
            return token_input['value']
        # Try meta tag
        meta_token = page.find('meta', {'name': 'csrf-token'})
        if meta_token and meta_token.get('content'):
            return meta_token['content']
        # Try data attributes
        for attr in ['data-csrf-token', 'data-token']:
            element = page.find(attrs={attr: True})
            if element:
                return element[attr]
        return None

    def submit_form_with_csrf(self, url, form_data, token_field='csrf_token'):
        """Submit form with proper CSRF token handling"""
        # Navigate to the form page
        self.browser.open(url)
        page = self.browser.get_current_page()
        # Extract CSRF token
        csrf_token = self.extract_csrf_token(page, token_field)
        if not csrf_token:
            raise ValueError("CSRF token not found")
        # Add the token to a copy so the caller's dict is not modified
        fields = {**form_data, token_field: csrf_token}
        # Select the first form on the page
        self.browser.select_form()
        # browser.form wraps the form; its .form attribute is the BeautifulSoup tag
        form_tag = self.browser.form.form
        for field, value in fields.items():
            # Only set fields that exist in the form (a token taken from a
            # meta tag, for instance, has no matching input)
            if form_tag.find(attrs={'name': field}) is not None:
                self.browser[field] = value
        # Submit form
        return self.browser.submit_selected()

# Usage example
csrf_handler = CSRFHandler()
form_data = {
    'username': 'john_doe',
    'email': 'john@example.com',
    'message': 'Hello from MechanicalSoup!'
}
response = csrf_handler.submit_form_with_csrf(
    'https://example.com/contact',
    form_data
)
Handling Different CSRF Token Formats
Different web frameworks use various CSRF token implementations:
Django CSRF Tokens
def handle_django_csrf(browser, url):
    browser.open(url)
    page = browser.get_current_page()
    # Django uses 'csrfmiddlewaretoken'
    csrf_token = page.find('input', {'name': 'csrfmiddlewaretoken'})['value']
    browser.select_form()
    browser['csrfmiddlewaretoken'] = csrf_token
    # Fill other form fields...
    return browser.submit_selected()
Laravel CSRF Tokens
def handle_laravel_csrf(browser, url):
    browser.open(url)
    page = browser.get_current_page()
    # Laravel uses '_token'
    csrf_token = page.find('input', {'name': '_token'})['value']
    browser.select_form()
    browser['_token'] = csrf_token
    # Fill other form fields...
    return browser.submit_selected()
Rails CSRF Tokens
def handle_rails_csrf(browser, url):
    browser.open(url)
    page = browser.get_current_page()
    # Rails uses 'authenticity_token'
    csrf_token = page.find('input', {'name': 'authenticity_token'})['value']
    browser.select_form()
    browser['authenticity_token'] = csrf_token
    # Fill other form fields...
    return browser.submit_selected()
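The three functions above differ only in the field name, so one possible consolidation is a small lookup table. The sketch below is an illustration under that assumption; `detect_token_field` is a hypothetical helper, and the names are the framework defaults, which a site can rename.

```python
# Default hidden-field names used by each framework.
CSRF_FIELD_NAMES = {
    "django": "csrfmiddlewaretoken",
    "laravel": "_token",
    "rails": "authenticity_token",
}


def detect_token_field(form_field_names):
    """Return (framework, field_name) for the first known token field present."""
    present = set(form_field_names)
    for framework, field_name in CSRF_FIELD_NAMES.items():
        if field_name in present:
            return framework, field_name
    return None, None


print(detect_token_field(["username", "password", "csrfmiddlewaretoken"]))
print(detect_token_field(["q"]))
```

The list of field names could be gathered from the selected form's inputs before deciding which token field to read.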
Session Management and CSRF Tokens
CSRF tokens are often tied to user sessions. Here's how to maintain sessions properly:
import mechanicalsoup

class SessionAwareCSRFHandler:
    def __init__(self):
        self.browser = mechanicalsoup.StatefulBrowser()
        # Start from an empty cookie jar; the underlying requests session
        # stores and resends cookies automatically
        self.browser.session.cookies.clear()

    def login_and_get_session(self, login_url, username, password):
        """Login and establish session"""
        self.browser.open(login_url)
        self.browser.select_form()
        # Fill login form (the hidden CSRF field is submitted with it)
        self.browser['username'] = username
        self.browser['password'] = password
        response = self.browser.submit_selected()
        # Note: some sites also answer 200 when the login fails
        return response.status_code == 200

    def submit_authenticated_form(self, form_url, form_data):
        """Submit form using established session"""
        # Re-open the page so the form carries a token valid for this session
        self.browser.open(form_url)
        # MechanicalSoup maintains session cookies automatically
        self.browser.select_form()
        for field, value in form_data.items():
            self.browser[field] = value
        return self.browser.submit_selected()

# Usage
handler = SessionAwareCSRFHandler()
handler.login_and_get_session('https://example.com/login', 'user', 'pass')
handler.submit_authenticated_form('https://example.com/profile/edit', {
    'name': 'Updated Name',
    'email': 'new@example.com'
})
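A status code of 200 alone is a weak signal that a login worked, because a failed login often re-renders the form with the same status. The heuristic below is a sketch under assumptions: the failure markers are placeholders to adapt per site, and staying on the login URL is treated as failure, which will not hold everywhere.

```python
def login_succeeded(status_code, final_url, login_url, page_text,
                    failure_markers=("invalid", "incorrect")):
    """Heuristic login check; the markers are site-specific placeholders."""
    if status_code >= 400:
        return False
    # Still on the login URL after submitting often means the form was re-rendered.
    if final_url.rstrip("/") == login_url.rstrip("/"):
        return False
    lowered = page_text.lower()
    return not any(marker in lowered for marker in failure_markers)


print(login_succeeded(200, "https://example.com/dashboard",
                      "https://example.com/login", "Welcome back"))
print(login_succeeded(200, "https://example.com/login",
                      "https://example.com/login", "Invalid password"))
```

Since `submit_selected()` returns a requests response, the arguments could be taken from `response.status_code`, `response.url`, and `response.text`.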
Error Handling and Debugging
When working with CSRF tokens, you might encounter various errors. Here's how to handle them:
import mechanicalsoup
import logging

# Set up logging for debugging
logging.basicConfig(level=logging.DEBUG)

def robust_csrf_form_submission(url, form_data, max_retries=3):
    browser = mechanicalsoup.StatefulBrowser()
    for attempt in range(max_retries):
        try:
            # Navigate to form
            response = browser.open(url)
            if response.status_code != 200:
                raise Exception(f"Failed to load form page: {response.status_code}")
            page = browser.get_current_page()
            # Try to find and select form
            forms = page.find_all('form')
            if not forms:
                raise Exception("No forms found on page")
            browser.select_form()
            # Fill form data
            for field, value in form_data.items():
                try:
                    browser[field] = value
                except Exception as e:
                    print(f"Warning: Could not set field {field}: {e}")
            # Submit form
            submit_response = browser.submit_selected()
            # Check for CSRF errors
            if submit_response.status_code == 403:
                print(f"CSRF error on attempt {attempt + 1}, retrying...")
                continue
            elif submit_response.status_code == 422:
                print("Form validation error - check your data")
                break
            else:
                print(f"Form submitted successfully: {submit_response.status_code}")
                return submit_response
        except Exception as e:
            print(f"Attempt {attempt + 1} failed: {e}")
            if attempt == max_retries - 1:
                raise
    return None
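The status code that signals a rejected token varies by framework. Django typically answers 403, Laravel 419, and Rails 422 for an invalid authenticity token, so a 422 is not always a plain validation error. The sketch below combines the status with body hints; the hint strings are assumptions and should be tuned to the target site.

```python
# Status codes commonly used for CSRF rejections (framework defaults).
CSRF_STATUS_CODES = {403, 419, 422}
# Hypothetical body hints; adjust them to what the target site actually returns.
CSRF_BODY_HINTS = ("csrf", "authenticity token", "page expired")


def looks_like_csrf_failure(status_code, body_text):
    """Guess whether a response was rejected because of its CSRF token."""
    if status_code not in CSRF_STATUS_CODES:
        return False
    if status_code == 419:
        # 419 is a non-standard code that Laravel uses for token mismatches.
        return True
    lowered = body_text.lower()
    return any(hint in lowered for hint in CSRF_BODY_HINTS)


print(looks_like_csrf_failure(403, "CSRF verification failed. Request aborted."))
print(looks_like_csrf_failure(403, "You do not have permission to view this page."))
```

A check like this helps decide whether re-fetching the form for a fresh token is worth a retry, or whether the failure has another cause.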
Best Practices
- Always check for CSRF tokens: Before submitting any form, verify that CSRF tokens are properly handled.
- Maintain sessions: Use MechanicalSoup's session management to ensure CSRF tokens remain valid.
- Handle different token names: Different frameworks use different field names for CSRF tokens.
- Implement retry logic: CSRF tokens can expire, so implement retry mechanisms for failed submissions.
- Respect rate limits: Don't submit forms too quickly, as this might trigger anti-bot measures.
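The retry and rate-limit points can be combined by waiting longer after each failed attempt. Below is a minimal sketch of exponential backoff with jitter; `retry_with_backoff` and its parameters are hypothetical names, and `action` stands for any callable that re-fetches the form and submits it, returning a falsy value on failure.

```python
import random
import time


def retry_with_backoff(action, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call action() until it returns a truthy value, pausing between attempts."""
    for attempt in range(max_retries):
        result = action()
        if result:
            return result
        if attempt < max_retries - 1:
            # Double the delay on each attempt and add up to 0.5 s of jitter.
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            sleep(delay)
    return None


# Demo with a fake sleep so the example finishes instantly.
attempts = []
recorded_delays = []


def flaky_submit():
    attempts.append(1)
    return "ok" if len(attempts) == 3 else None


print(retry_with_backoff(flaky_submit, sleep=recorded_delays.append))
print(len(recorded_delays))
```

Passing the sleep function as a parameter keeps the helper easy to test; in real use the default `time.sleep` applies.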
Alternative Approaches
MechanicalSoup does not execute JavaScript, so it cannot see tokens that a page generates or injects at runtime. For JavaScript-heavy sites, you might need to consider alternatives such as handling authentication with Puppeteer or other browser automation tools that drive a real browser.
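Before switching tools, it can be worth checking whether the token is present as a literal in an inline script, since that text is part of the HTML source even when no JavaScript runs. The sketch below assumes a hypothetical `csrfToken` key; the page snippet and pattern are illustrative only.

```python
import re

# Hypothetical page fragment; the key name "csrfToken" is an assumption.
PAGE_HTML = """
<script>
  window.appConfig = {"csrfToken": "tok_9f8e7d", "locale": "en"};
</script>
"""

# Matches csrfToken: "..." or csrfToken = '...', with or without quoted key.
TOKEN_PATTERN = re.compile(r'["\']?csrfToken["\']?\s*[:=]\s*["\']([^"\']+)["\']')


def token_from_inline_script(html):
    """Return the token literal found in the HTML source, or None."""
    match = TOKEN_PATTERN.search(html)
    return match.group(1) if match else None


print(token_from_inline_script(PAGE_HTML))
```

If the token is instead computed or fetched by JavaScript at runtime, this approach finds nothing and a real browser is needed.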
Common Issues and Solutions
Issue: CSRF Token Not Found
# Solution: Check multiple possible locations
def find_csrf_token(page):
    selectors = [
        'input[name="csrf_token"]',
        'input[name="_token"]',
        'input[name="csrfmiddlewaretoken"]',
        'input[name="authenticity_token"]',
        'meta[name="csrf-token"]'
    ]
    for selector in selectors:
        element = page.select_one(selector)
        if element:
            return element.get('value') or element.get('content')
    return None
Issue: Token Expiry
# Solution: Refresh page and get new token
def refresh_csrf_token(browser, url):
    browser.open(url)
    page = browser.get_current_page()
    # Reuses find_csrf_token() from the previous issue
    return find_csrf_token(page)
When dealing with complex web applications that require sophisticated session management, consider exploring browser session handling techniques used in other automation tools.
Conclusion
Handling CSRF tokens with MechanicalSoup requires understanding how these security tokens work and implementing proper extraction and submission techniques. By following the patterns and examples in this guide, you can successfully interact with CSRF-protected forms while maintaining good security practices and robust error handling.