How do I Debug MechanicalSoup Requests and Responses?

Debugging MechanicalSoup requests and responses is essential for building robust web scraping applications. This comprehensive guide covers various debugging techniques, from basic logging to advanced request inspection methods that will help you identify and resolve issues in your scraping workflows.

Understanding MechanicalSoup's Request-Response Cycle

MechanicalSoup builds on top of the requests library, providing a higher-level interface for browser automation. Understanding the underlying request-response cycle helps in effective debugging:

import mechanicalsoup

# Create a browser instance
browser = mechanicalsoup.StatefulBrowser()

# The browser maintains session state and handles cookies automatically
response = browser.open("https://httpbin.org/get")
print(f"Status Code: {response.status_code}")
print(f"URL: {response.url}")

Basic Request and Response Inspection

Enabling Debug Output

The most straightforward way to debug MechanicalSoup is by enabling verbose output and inspecting response objects:

import mechanicalsoup
import logging

# Enable detailed logging
logging.basicConfig(level=logging.DEBUG)

browser = mechanicalsoup.StatefulBrowser()

# Open a page and inspect the response
response = browser.open("https://httpbin.org/get")

# Basic response inspection
print(f"Status Code: {response.status_code}")
print(f"Headers: {response.headers}")
print(f"URL: {response.url}")
print(f"Content Type: {response.headers.get('content-type')}")
print(f"Response Size: {len(response.content)} bytes")

Inspecting Request Details

Access the underlying request object to examine what was sent:

# Access the request that generated the response
request = response.request
print(f"Request Method: {request.method}")
print(f"Request URL: {request.url}")
print(f"Request Headers: {request.headers}")

# For POST requests, inspect the body
if hasattr(request, 'body') and request.body:
    print(f"Request Body: {request.body}")

Advanced Debugging Techniques

Custom Session Hooks

The requests library underneath MechanicalSoup only dispatches a 'response' hook, but every response carries the prepared request that produced it, so one hook can log both sides of the HTTP traffic:

import mechanicalsoup
import json

def debug_traffic(response, *args, **kwargs):
    request = response.request  # the PreparedRequest that produced this response

    print("\n--- OUTGOING REQUEST ---")
    print(f"Method: {request.method}")
    print(f"URL: {request.url}")
    print(f"Headers: {json.dumps(dict(request.headers), indent=2)}")
    if request.body:
        print(f"Body: {request.body}")

    print("\n--- INCOMING RESPONSE ---")
    print(f"Status: {response.status_code}")
    print(f"Headers: {json.dumps(dict(response.headers), indent=2)}")
    print(f"Content Length: {len(response.content)}")
    print("-------------------------\n")

# Register the hook on the underlying requests session
browser = mechanicalsoup.StatefulBrowser()
browser.session.hooks['response'].append(debug_traffic)

# Now all requests and responses will be logged
response = browser.open("https://httpbin.org/json")

Session State Debugging

Monitor cookies and session state changes:

def debug_session_state(browser):
    print(f"\n--- SESSION STATE ---")
    print(f"Cookies: {dict(browser.session.cookies)}")
    print(f"Headers: {dict(browser.session.headers)}")
    print("--------------------\n")

browser = mechanicalsoup.StatefulBrowser()
debug_session_state(browser)

# After visiting a page that sets cookies
browser.open("https://httpbin.org/cookies/set/debug/true")
debug_session_state(browser)
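
Note that dict(browser.session.cookies) flattens the jar to name/value pairs and hides each cookie's domain, path, and expiry. To see the full attributes, iterate over the jar directly (a short sketch):

import mechanicalsoup

browser = mechanicalsoup.StatefulBrowser()
browser.open("https://httpbin.org/cookies/set/session_id/abc123")

# Each entry in the jar is a Cookie object carrying its full metadata
for cookie in browser.session.cookies:
    print(f"{cookie.name}={cookie.value} "
          f"(domain={cookie.domain}, path={cookie.path}, expires={cookie.expires})")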

Form Debugging Strategies

When working with forms, debugging becomes crucial for understanding submission issues:

import mechanicalsoup

browser = mechanicalsoup.StatefulBrowser()
response = browser.open("https://httpbin.org/forms/post")

# Find and inspect forms
forms = browser.get_current_page().find_all('form')
print(f"Found {len(forms)} forms on the page")

for i, form in enumerate(forms):
    print(f"\n--- FORM {i+1} ---")
    print(f"Action: {form.get('action', 'Not specified')}")
    print(f"Method: {form.get('method', 'GET')}")

    # List all form fields
    inputs = form.find_all(['input', 'select', 'textarea'])
    for input_field in inputs:
        field_type = input_field.get('type', 'text')
        field_name = input_field.get('name', 'unnamed')
        field_value = input_field.get('value', '')
        print(f"  {field_type}: {field_name} = '{field_value}'")

# Select and fill a form
if forms:
    form = browser.select_form('form')
    print(f"\nSelected form action: {form.form.get('action', 'Not specified')}")

    # Fill form fields (example)
    browser['custname'] = 'Test User'
    browser['custtel'] = '123-456-7890'

    # Debug form state before submission
    print("\nForm state to be submitted:")
    browser.get_current_form().print_summary()

Error Handling and Recovery

Implement comprehensive error handling with detailed logging:

import mechanicalsoup
import requests
import time

def robust_page_fetch(url, max_retries=3):
    browser = mechanicalsoup.StatefulBrowser()

    for attempt in range(max_retries):
        try:
            print(f"Attempt {attempt + 1}: Fetching {url}")
            response = browser.open(url)

            # Check for HTTP errors
            response.raise_for_status()

            print(f"Success: {response.status_code} - {len(response.content)} bytes")
            return response

        except requests.exceptions.HTTPError as e:
            print(f"HTTP Error: {e}")
            print(f"Response status: {e.response.status_code}")
            print(f"Response headers: {dict(e.response.headers)}")

        except requests.exceptions.ConnectionError as e:
            print(f"Connection Error: {e}")

        except requests.exceptions.Timeout as e:
            print(f"Timeout Error: {e}")

        except requests.exceptions.RequestException as e:
            print(f"Request Error: {e}")

        if attempt < max_retries - 1:
            print("Retrying in 2 seconds...")
            time.sleep(2)

    print(f"Failed to fetch {url} after {max_retries} attempts")
    return None

# Usage
response = robust_page_fetch("https://httpbin.org/status/500")

Network-Level Debugging

For more complex scenarios, harden the underlying requests session so that transient failures are retried automatically and the headers you send are explicit and easy to inspect:

import mechanicalsoup
from urllib3.util.retry import Retry
from requests.adapters import HTTPAdapter

def create_debug_browser():
    browser = mechanicalsoup.StatefulBrowser()

    # Configure retry strategy
    retry_strategy = Retry(
        total=3,
        backoff_factor=1,
        status_forcelist=[429, 500, 502, 503, 504],
    )

    adapter = HTTPAdapter(max_retries=retry_strategy)
    browser.session.mount("http://", adapter)
    browser.session.mount("https://", adapter)

    # Set comprehensive headers for debugging
    browser.session.headers.update({
        'User-Agent': 'MechanicalSoup-Debug/1.0',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.5',
        'Accept-Encoding': 'gzip, deflate',
        'Connection': 'keep-alive',
    })

    return browser

browser = create_debug_browser()
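
A quick way to confirm that the custom headers and retry adapter are actually in effect is to request an endpoint that echoes back what the server received:

# httpbin returns the request headers it saw as JSON
response = browser.open("https://httpbin.org/headers")
print(response.text)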

Debugging with mitmproxy

For advanced network debugging, use mitmproxy to intercept HTTPS traffic:

Install and start mitmproxy from a terminal:

pip install mitmproxy
mitmproxy -p 8080

Then route MechanicalSoup's traffic through the proxy:

import mechanicalsoup

# Configure MechanicalSoup to use mitmproxy
browser = mechanicalsoup.StatefulBrowser()
browser.session.proxies = {
    'http': 'http://127.0.0.1:8080',
    'https': 'http://127.0.0.1:8080'
}

# Disable SSL verification so mitmproxy can present its own certificates
# (development/testing environments only)
browser.session.verify = False

# Now all traffic will be visible in mitmproxy
response = browser.open("https://example.com")

Performance Debugging

Monitor timing and performance metrics:

import mechanicalsoup
import time

class TimingBrowser:
    def __init__(self):
        self.browser = mechanicalsoup.StatefulBrowser()
        self.timings = []

    def timed_open(self, url):
        start_time = time.time()
        response = self.browser.open(url)
        end_time = time.time()

        timing_info = {
            'url': url,
            'status_code': response.status_code,
            'duration': end_time - start_time,
            'content_length': len(response.content)
        }

        self.timings.append(timing_info)
        print(f"Fetched {url}: {timing_info['duration']:.2f}s, "
              f"{timing_info['content_length']} bytes, "
              f"Status {timing_info['status_code']}")

        return response

    def get_performance_summary(self):
        if not self.timings:
            return "No requests made yet"

        total_time = sum(t['duration'] for t in self.timings)
        avg_time = total_time / len(self.timings)
        total_bytes = sum(t['content_length'] for t in self.timings)

        return f"Total requests: {len(self.timings)}, "
               f"Total time: {total_time:.2f}s, "
               f"Average time: {avg_time:.2f}s, "
               f"Total data: {total_bytes} bytes"

# Usage
timing_browser = TimingBrowser()
timing_browser.timed_open("https://httpbin.org/delay/1")
timing_browser.timed_open("https://httpbin.org/gzip")
print(timing_browser.get_performance_summary())

Debugging with External Tools

Using Charles Proxy or Fiddler

Configure MechanicalSoup to work with desktop proxy tools:

import mechanicalsoup

browser = mechanicalsoup.StatefulBrowser()

# Configure for Charles Proxy (default port 8888)
browser.session.proxies = {
    'http': 'http://127.0.0.1:8888',
    'https': 'http://127.0.0.1:8888'
}

# For HTTPS interception, you may need to disable SSL verification
# (only in development/testing environments)
browser.session.verify = False
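
A safer alternative is to keep verification on and trust the proxy's root certificate instead. The path below is hypothetical; export the root certificate from your proxy tool and point verify at the saved PEM file:

# Trust the proxy's root CA instead of disabling verification
# (the path is a placeholder for wherever you saved the exported certificate)
browser.session.verify = "/path/to/proxy-root-certificate.pem"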

Best Practices for Production Debugging

  1. Implement structured logging: Use Python's logging module for consistent log formatting
  2. Sanitize sensitive data: Never log passwords, API keys, or personal information
  3. Use correlation IDs: Track requests across multiple operations
  4. Monitor rate limits: Watch for 429 responses and implement backoff strategies (see the sketch after this list)
  5. Validate responses: Always check status codes and content before processing
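
As a minimal sketch of points 1 and 4 combined, the helper below logs every request through the standard logging module and backs off when the server answers 429. It assumes a numeric Retry-After header (the other allowed format is an HTTP date, which this sketch ignores):

import logging
import time
import mechanicalsoup

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("scraper")

def fetch_with_backoff(browser, url, max_attempts=3):
    for attempt in range(1, max_attempts + 1):
        response = browser.open(url)
        logger.info("GET %s -> %s (attempt %d)", url, response.status_code, attempt)

        if response.status_code != 429:
            return response

        # Respect a numeric Retry-After header; otherwise back off exponentially
        retry_after = response.headers.get("Retry-After", "")
        delay = int(retry_after) if retry_after.isdigit() else 2 ** attempt
        logger.warning("Rate limited; sleeping %d seconds", delay)
        time.sleep(delay)

    return response

browser = mechanicalsoup.StatefulBrowser()
fetch_with_backoff(browser, "https://httpbin.org/status/200")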

Similar to how error handling in Puppeteer requires careful consideration of various failure modes, MechanicalSoup debugging benefits from a comprehensive approach that covers network issues, parsing errors, and session management problems.

For complex scenarios involving dynamic content, you might also consider monitoring network requests in Puppeteer as an alternative approach when MechanicalSoup's capabilities are insufficient for JavaScript-heavy websites.

Conclusion

Effective debugging of MechanicalSoup requests and responses requires a multi-layered approach combining built-in logging, custom instrumentation, and external tools. By implementing these debugging strategies, you can quickly identify and resolve issues in your web scraping projects, leading to more reliable and maintainable code.

Remember to always respect robots.txt files and website terms of service when debugging and testing your scraping applications. Proper debugging not only helps you build better scrapers but also ensures you're being a responsible web citizen.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
