How do I Debug MechanicalSoup Requests and Responses?
Debugging MechanicalSoup requests and responses is essential for building robust web scraping applications. This guide covers debugging techniques ranging from basic logging and response inspection to session hooks, proxy tools, and performance instrumentation, so you can identify and resolve issues in your scraping workflows.
Understanding MechanicalSoup's Request-Response Cycle
MechanicalSoup builds on top of the requests library, providing a higher-level interface for browser automation. Understanding the underlying request-response cycle helps in effective debugging:
import mechanicalsoup
# Create a browser instance
browser = mechanicalsoup.StatefulBrowser()
# The browser maintains session state and handles cookies automatically
response = browser.open("https://httpbin.org/get")
print(f"Status Code: {response.status_code}")
print(f"URL: {response.url}")
Basic Request and Response Inspection
Enabling Debug Output
The most straightforward way to debug MechanicalSoup is by enabling verbose output and inspecting response objects:
import mechanicalsoup
import logging
# Enable detailed logging
logging.basicConfig(level=logging.DEBUG)
browser = mechanicalsoup.StatefulBrowser()
# Open a page and inspect the response
response = browser.open("https://httpbin.org/get")
# Basic response inspection
print(f"Status Code: {response.status_code}")
print(f"Headers: {response.headers}")
print(f"URL: {response.url}")
print(f"Content Type: {response.headers.get('content-type')}")
print(f"Response Size: {len(response.content)} bytes")
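If root-level DEBUG logging is too noisy, another option, not specific to MechanicalSoup, is wire-level logging via the standard library's http.client module, which requests uses under the hood. A minimal sketch:
import http.client
import logging
import mechanicalsoup
# Echo raw request and response headers for every connection
http.client.HTTPConnection.debuglevel = 1
# Surface urllib3's connection-pool messages as well
logging.basicConfig(level=logging.DEBUG)
logging.getLogger("urllib3").setLevel(logging.DEBUG)
browser = mechanicalsoup.StatefulBrowser()
browser.open("https://httpbin.org/get")  # send/reply lines are printed to stdout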
Inspecting Request Details
Access the underlying request object to examine what was sent:
# Access the request that generated the response
request = response.request
print(f"Request Method: {request.method}")
print(f"Request URL: {request.url}")
print(f"Request Headers: {request.headers}")
# For POST requests, inspect the body
if hasattr(request, 'body') and request.body:
print(f"Request Body: {request.body}")
Advanced Debugging Techniques
Custom Session Hooks
The requests library that MechanicalSoup builds on only dispatches 'response' hooks, so there is no separate 'request' hook to register. You can still log all HTTP traffic by attaching a response hook and reading the outgoing request from response.request:
import mechanicalsoup
import json
def debug_request(prepared_request):
    print("\n--- OUTGOING REQUEST ---")
    print(f"Method: {prepared_request.method}")
    print(f"URL: {prepared_request.url}")
    print(f"Headers: {json.dumps(dict(prepared_request.headers), indent=2)}")
    if prepared_request.body:
        print(f"Body: {prepared_request.body}")
    print("------------------------\n")
def debug_response(response, *args, **kwargs):
    debug_request(response.request)
    print("\n--- INCOMING RESPONSE ---")
    print(f"Status: {response.status_code}")
    print(f"Headers: {json.dumps(dict(response.headers), indent=2)}")
    print(f"Content Length: {len(response.content)}")
    print("-------------------------\n")
# Create a browser and register the hook on its underlying session
browser = mechanicalsoup.StatefulBrowser()
browser.session.hooks['response'].append(debug_response)
# Now all requests and responses will be logged
response = browser.open("https://httpbin.org/json")
Session State Debugging
Monitor cookies and session state changes:
def debug_session_state(browser):
print(f"\n--- SESSION STATE ---")
print(f"Cookies: {dict(browser.session.cookies)}")
print(f"Headers: {dict(browser.session.headers)}")
print("--------------------\n")
browser = mechanicalsoup.StatefulBrowser()
debug_session_state(browser)
# After visiting a page that sets cookies
browser.open("https://httpbin.org/cookies/set/debug/true")
debug_session_state(browser)
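Converting the cookie jar to a dict hides attributes such as domain, path, and expiry, which often matter when a session cookie is silently not being sent back. A small sketch that iterates the jar directly, using the same httpbin endpoint as above:
import mechanicalsoup
browser = mechanicalsoup.StatefulBrowser()
browser.open("https://httpbin.org/cookies/set/debug/true")
for cookie in browser.session.cookies:
    # Each entry is a standard http.cookiejar.Cookie object
    print(f"{cookie.name}={cookie.value} "
          f"(domain={cookie.domain}, path={cookie.path}, "
          f"secure={cookie.secure}, expires={cookie.expires})")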
Form Debugging Strategies
When working with forms, debugging becomes crucial for understanding submission issues:
import mechanicalsoup
browser = mechanicalsoup.StatefulBrowser()
response = browser.open("https://httpbin.org/forms/post")
# Find and inspect forms
forms = browser.get_current_page().find_all('form')
print(f"Found {len(forms)} forms on the page")
for i, form in enumerate(forms):
print(f"\n--- FORM {i+1} ---")
print(f"Action: {form.get('action', 'Not specified')}")
print(f"Method: {form.get('method', 'GET')}")
# List all form fields
inputs = form.find_all(['input', 'select', 'textarea'])
for input_field in inputs:
field_type = input_field.get('type', 'text')
field_name = input_field.get('name', 'unnamed')
field_value = input_field.get('value', '')
print(f" {field_type}: {field_name} = '{field_value}'")
# Select and fill a form
if forms:
    form = browser.select_form('form')
    # The Form object wraps the underlying <form> tag in form.form
    print(f"\nSelected form action: {form.form.get('action')}")
    # Fill form fields (example)
    browser['custname'] = 'Test User'
    browser['custtel'] = '123-456-7890'
    # Debug the form's fields and current values before submission
    browser.get_current_form().print_summary()
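When a printed field list is not enough, StatefulBrowser.launch_browser() writes the page as MechanicalSoup currently sees it to a temporary file and opens it in your default web browser, which is handy for comparing against what a real browser renders:
import mechanicalsoup
browser = mechanicalsoup.StatefulBrowser()
browser.open("https://httpbin.org/forms/post")
browser.select_form('form')
browser['custname'] = 'Test User'
# Visually inspect the page, including the values MechanicalSoup has filled in
browser.launch_browser()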
Error Handling and Recovery
Implement comprehensive error handling with detailed logging:
import mechanicalsoup
import requests
import time
def robust_page_fetch(url, max_retries=3):
browser = mechanicalsoup.StatefulBrowser()
for attempt in range(max_retries):
try:
print(f"Attempt {attempt + 1}: Fetching {url}")
response = browser.open(url)
# Check for HTTP errors
response.raise_for_status()
print(f"Success: {response.status_code} - {len(response.content)} bytes")
return response
except requests.exceptions.HTTPError as e:
print(f"HTTP Error: {e}")
print(f"Response status: {e.response.status_code}")
print(f"Response headers: {dict(e.response.headers)}")
except requests.exceptions.ConnectionError as e:
print(f"Connection Error: {e}")
except requests.exceptions.Timeout as e:
print(f"Timeout Error: {e}")
except requests.exceptions.RequestException as e:
print(f"Request Error: {e}")
        if attempt < max_retries - 1:
            print("Retrying in 2 seconds...")
            time.sleep(2)
print(f"Failed to fetch {url} after {max_retries} attempts")
return None
# Usage
response = robust_page_fetch("https://httpbin.org/status/500")
Network-Level Debugging
For complex debugging scenarios, configure the underlying requests session with an explicit retry strategy and identifiable headers, so transient failures are handled predictably and your traffic is easy to spot in server logs or a capturing proxy:
import mechanicalsoup
from urllib3.util.retry import Retry
from requests.adapters import HTTPAdapter
def create_debug_browser():
browser = mechanicalsoup.StatefulBrowser()
# Configure retry strategy
retry_strategy = Retry(
total=3,
backoff_factor=1,
status_forcelist=[429, 500, 502, 503, 504],
)
adapter = HTTPAdapter(max_retries=retry_strategy)
browser.session.mount("http://", adapter)
browser.session.mount("https://", adapter)
# Set comprehensive headers for debugging
browser.session.headers.update({
'User-Agent': 'MechanicalSoup-Debug/1.0',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'en-US,en;q=0.5',
'Accept-Encoding': 'gzip, deflate',
'Connection': 'keep-alive',
})
return browser
browser = create_debug_browser()
Debugging with mitmproxy
For advanced network debugging, use mitmproxy to intercept HTTPS traffic:
First, install and start mitmproxy from a terminal:
pip install mitmproxy
mitmproxy -p 8080
Then route MechanicalSoup's traffic through the proxy:
import mechanicalsoup
# Configure MechanicalSoup to use mitmproxy
browser = mechanicalsoup.StatefulBrowser()
browser.session.proxies = {
'http': 'http://127.0.0.1:8080',
'https': 'http://127.0.0.1:8080'
}
# Disable SSL verification for proxy debugging
browser.session.verify = False
# Now all traffic will be visible in mitmproxy
response = browser.open("https://example.com")
Performance Debugging
Monitor timing and performance metrics:
import mechanicalsoup
import time
class TimingBrowser:
def __init__(self):
self.browser = mechanicalsoup.StatefulBrowser()
self.timings = []
def timed_open(self, url):
start_time = time.time()
response = self.browser.open(url)
end_time = time.time()
timing_info = {
'url': url,
'status_code': response.status_code,
'duration': end_time - start_time,
'content_length': len(response.content)
}
self.timings.append(timing_info)
print(f"Fetched {url}: {timing_info['duration']:.2f}s, "
f"{timing_info['content_length']} bytes, "
f"Status {timing_info['status_code']}")
return response
def get_performance_summary(self):
if not self.timings:
return "No requests made yet"
total_time = sum(t['duration'] for t in self.timings)
avg_time = total_time / len(self.timings)
total_bytes = sum(t['content_length'] for t in self.timings)
        return (f"Total requests: {len(self.timings)}, "
                f"Total time: {total_time:.2f}s, "
                f"Average time: {avg_time:.2f}s, "
                f"Total data: {total_bytes} bytes")
# Usage
timing_browser = TimingBrowser()
timing_browser.timed_open("https://httpbin.org/delay/1")
timing_browser.timed_open("https://httpbin.org/gzip")
print(timing_browser.get_performance_summary())
Debugging with External Tools
Using Charles Proxy or Fiddler
Configure MechanicalSoup to work with desktop proxy tools:
import mechanicalsoup
browser = mechanicalsoup.StatefulBrowser()
# Configure for Charles Proxy (default port 8888)
browser.session.proxies = {
'http': 'http://127.0.0.1:8888',
'https': 'http://127.0.0.1:8888'
}
# For HTTPS interception, you may need to disable SSL verification
# (only in development/testing environments)
browser.session.verify = False
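Note that with verify=False, urllib3 emits an InsecureRequestWarning on every request, which can drown out your own debug output. In a throwaway debugging session you can silence it explicitly:
import urllib3
# Only silence this during local proxy debugging, never in production
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)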
Best Practices for Production Debugging
- Implement structured logging: Use Python's logging module for consistent log formatting (a minimal sketch combining several of these practices follows this list)
- Sanitize sensitive data: Never log passwords, API keys, or personal information
- Use correlation IDs: Track requests across multiple operations
- Monitor rate limits: Watch for 429 responses and implement backoff strategies
- Validate responses: Always check status codes and content before processing
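A minimal sketch of several of these practices together; the logger name and the fetch_and_validate helper are illustrative, not part of MechanicalSoup:
import logging
import mechanicalsoup
logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(name)s %(message)s")
log = logging.getLogger("scraper")  # illustrative logger name
SENSITIVE_HEADERS = {"authorization", "cookie", "set-cookie"}
def sanitize_headers(headers):
    # Redact sensitive header values before they reach the logs
    return {name: ("<redacted>" if name.lower() in SENSITIVE_HEADERS else value)
            for name, value in headers.items()}
def fetch_and_validate(browser, url):
    # Fetch a page, log sanitized metadata, and validate the response
    response = browser.open(url)
    log.info("GET %s -> %s %s", url, response.status_code,
             sanitize_headers(response.headers))
    if response.status_code == 429:
        log.warning("Rate limited on %s; back off before retrying", url)
    response.raise_for_status()
    if "text/html" not in response.headers.get("content-type", ""):
        log.warning("Unexpected content type for %s", url)
    return response
browser = mechanicalsoup.StatefulBrowser()
fetch_and_validate(browser, "https://httpbin.org/html")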
Similar to how error handling in Puppeteer requires careful consideration of various failure modes, MechanicalSoup debugging benefits from a comprehensive approach that covers network issues, parsing errors, and session management problems.
For complex scenarios involving dynamic content, you might also consider monitoring network requests in Puppeteer as an alternative approach when MechanicalSoup's capabilities are insufficient for JavaScript-heavy websites.
Conclusion
Effective debugging of MechanicalSoup requests and responses requires a multi-layered approach combining built-in logging, custom instrumentation, and external tools. By implementing these debugging strategies, you can quickly identify and resolve issues in your web scraping projects, leading to more reliable and maintainable code.
Remember to always respect robots.txt files and website terms of service when debugging and testing your scraping applications. Proper debugging not only helps you build better scrapers but also ensures you're being a responsible web citizen.