
What is the Best Way to Debug Python Web Scraping Scripts?

Debugging Python web scraping scripts can be challenging due to the dynamic nature of websites, network issues, and complex data extraction logic. This comprehensive guide covers the most effective debugging techniques and tools to help you identify and resolve issues in your web scraping projects.

Essential Debugging Strategies

1. Implement Comprehensive Logging

Logging is crucial for understanding what your scraper is doing and identifying where issues occur. Use Python's built-in logging module to create detailed logs:

import logging
import requests
from bs4 import BeautifulSoup

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('scraper.log'),
        logging.StreamHandler()
    ]
)

logger = logging.getLogger(__name__)

def scrape_website(url):
    try:
        logger.info(f"Starting to scrape: {url}")

        response = requests.get(url)
        logger.info(f"Response status: {response.status_code}")
        logger.info(f"Response headers: {response.headers}")

        if response.status_code == 200:
            soup = BeautifulSoup(response.content, 'html.parser')
            logger.info(f"Successfully parsed HTML, title: {soup.title.string if soup.title else 'No title'}")

            # Extract data
            data = extract_data(soup)
            logger.info(f"Extracted {len(data)} items")

            return data
        else:
            logger.error(f"Failed to fetch page: {response.status_code}")

    except Exception as e:
        logger.error(f"Error scraping {url}: {str(e)}", exc_info=True)
        raise

def extract_data(soup):
    items = []
    elements = soup.find_all('div', class_='product')
    logger.debug(f"Found {len(elements)} product elements")

    for element in elements:
        try:
            title = element.find('h2').text.strip()
            price = element.find('span', class_='price').text.strip()
            items.append({'title': title, 'price': price})
            logger.debug(f"Extracted: {title} - {price}")
        except AttributeError as e:
            logger.warning(f"Failed to extract data from element: {e}")

    return items

2. Use Interactive Debugging with PDB

Python's built-in debugger (pdb) allows you to pause execution and inspect variables:

import pdb
import requests
from bs4 import BeautifulSoup

def debug_scraper(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    # Set a breakpoint (on Python 3.7+, the built-in breakpoint() does the same)
    pdb.set_trace()

    # Now you can inspect variables interactively
    # Commands: n (next), s (step), c (continue), l (list), p <variable> (print)
    products = soup.find_all('div', class_='product')

    for product in products:
        title = product.find('h2')
        if title:
            print(title.text)
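
If the failure happens deep inside your extraction logic, post-mortem debugging drops you into the debugger at the exact frame where the exception was raised. A minimal sketch of this approach (the scrape_page helper and its selectors are hypothetical):

import pdb
import requests
from bs4 import BeautifulSoup

def scrape_page(url):
    # Hypothetical extraction that may raise AttributeError on unexpected markup
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    return [el.find('h2').text for el in soup.find_all('div', class_='product')]

try:
    scrape_page("https://example.com")
except Exception:
    # Open the debugger on the traceback of the exception currently being handled
    pdb.post_mortem()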

3. Inspect HTTP Requests and Responses

Understanding the actual HTTP traffic is crucial for debugging scraping issues:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
import json

# Enable detailed HTTP logging
import logging
import http.client as http_client

http_client.HTTPConnection.debuglevel = 1
logging.basicConfig()
logging.getLogger().setLevel(logging.DEBUG)
requests_log = logging.getLogger("urllib3")  # requests uses urllib3 under the hood
requests_log.setLevel(logging.DEBUG)
requests_log.propagate = True

def debug_http_request(url):
    session = requests.Session()

    # Add retry strategy
    retry_strategy = Retry(
        total=3,
        backoff_factor=1,
        status_forcelist=[429, 500, 502, 503, 504],
    )

    adapter = HTTPAdapter(max_retries=retry_strategy)
    session.mount("http://", adapter)
    session.mount("https://", adapter)

    # Set headers to mimic a real browser
    session.headers.update({
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.5',
        'Accept-Encoding': 'gzip, deflate',
        'Connection': 'keep-alive',
    })

    try:
        response = session.get(url, timeout=30)

        print(f"Status Code: {response.status_code}")
        print(f"Headers: {json.dumps(dict(response.headers), indent=2)}")
        print(f"Cookies: {response.cookies}")
        print(f"URL: {response.url}")
        print(f"History: {response.history}")

        return response

    except requests.exceptions.RequestException as e:
        print(f"Request failed: {e}")
        return None

Advanced Debugging Techniques

4. Save HTML for Offline Analysis

When debugging parsing logic, save the actual HTML to analyze it offline:

import os
from datetime import datetime

def save_html_for_debug(url, html_content, identifier="debug"):
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    filename = f"{identifier}_{timestamp}.html"

    os.makedirs("debug_html", exist_ok=True)
    filepath = os.path.join("debug_html", filename)

    with open(filepath, 'w', encoding='utf-8') as f:
        f.write(f"<!-- URL: {url} -->\n")
        f.write(f"<!-- Saved: {datetime.now().isoformat()} -->\n")
        f.write(html_content)

    print(f"HTML saved to: {filepath}")
    return filepath

# Usage (assumes requests is imported and url is defined)
response = requests.get(url)
if response.status_code == 200:
    save_html_for_debug(url, response.text, "homepage")

5. Validate Selectors with Browser DevTools

Before implementing selectors in your script, test them in the browser's developer console:

// Test CSS selectors in browser console
document.querySelectorAll('.product h2');

// Test XPath expressions
$x('//div[@class="product"]//h2');

// Check if elements are visible
Array.from(document.querySelectorAll('.product')).map(el => ({
    visible: el.offsetParent !== null,
    text: el.textContent.trim()
}));
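
Once a selector works in the console, it's worth cross-checking it against the HTML your script actually receives, since the server may return different markup than the browser renders after JavaScript runs. A minimal sketch using BeautifulSoup's CSS selector support (the URL and selectors are placeholders):

import requests
from bs4 import BeautifulSoup

def compare_selectors(url, selectors):
    # Fetch the raw HTML exactly as the scraper sees it (no JavaScript execution)
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')

    for selector in selectors:
        matches = soup.select(selector)
        print(f"{selector!r}: {len(matches)} match(es)")
        if matches:
            print(f"  first match: {matches[0].get_text(strip=True)[:80]}")

# Selectors that already work in the browser console
compare_selectors("https://example.com", ['.product h2', '.product .price'])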

6. Handle Dynamic Content with Selenium Debugging

For JavaScript-heavy sites, use Selenium with debugging capabilities:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
import time

def debug_selenium_scraper(url):
    # Configure Chrome options for debugging
    chrome_options = Options()
    chrome_options.add_argument("--disable-blink-features=AutomationControlled")
    chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
    chrome_options.add_experimental_option('useAutomationExtension', False)

    # Keep the browser window open if the script exits without calling quit()
    chrome_options.add_experimental_option("detach", True)

    driver = webdriver.Chrome(options=chrome_options)

    try:
        driver.get(url)

        # Wait for page to load
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.TAG_NAME, "body"))
        )

        # Take screenshot for debugging
        driver.save_screenshot(f"debug_screenshot_{int(time.time())}.png")

        # Print page source for analysis
        with open(f"debug_page_source_{int(time.time())}.html", 'w', encoding='utf-8') as f:
            f.write(driver.page_source)

        # Debug element selection
        elements = driver.find_elements(By.CLASS_NAME, "product")
        print(f"Found {len(elements)} product elements")

        for i, element in enumerate(elements[:3]):  # Debug first 3 elements
            print(f"\nElement {i+1}:")
            print(f"Text: {element.text}")
            print(f"HTML: {element.get_attribute('outerHTML')}")

        # Pause for manual inspection
        input("Press Enter to continue...")

    except Exception as e:
        print(f"Selenium error: {e}")
        driver.save_screenshot(f"error_screenshot_{int(time.time())}.png")

    finally:
        # quit() closes the browser even with "detach" set above; comment this
        # out if you want the window to stay open for manual inspection
        driver.quit()

Common Debugging Scenarios

7. Debugging Rate Limiting and Bot Detection

Implement detection and handling for common scraping obstacles:

import time
import random
import requests

class ScrapingDebugger:
    def __init__(self):
        self.session = requests.Session()
        self.last_request_time = 0

    def smart_request(self, url, delay_range=(1, 3)):
        # Implement random delays
        if self.last_request_time:
            elapsed = time.time() - self.last_request_time
            min_delay = delay_range[0]
            if elapsed < min_delay:
                sleep_time = random.uniform(min_delay, delay_range[1])
                time.sleep(sleep_time)
                print(f"Delayed {sleep_time:.2f} seconds")

        self.last_request_time = time.time()

        try:
            response = self.session.get(url)

            # Check for common bot detection patterns
            if response.status_code == 429:
                print("Rate limited! Waiting longer...")
                time.sleep(60)
                return self.smart_request(url, (30, 60))

            if "captcha" in response.text.lower():
                print("CAPTCHA detected!")
                return None

            if response.status_code == 403:
                print("Access forbidden - possible bot detection")
                return None

            return response

        except Exception as e:
            print(f"Request failed: {e}")
            return None

8. Network and Proxy Debugging

Test different network configurations and proxy setups:

import requests

def test_proxy_connection(url, proxy_config=None):
    session = requests.Session()

    if proxy_config:
        session.proxies.update(proxy_config)
        print(f"Using proxy: {proxy_config}")

    # Test basic connectivity
    try:
        # First, test with a simple request
        test_response = session.get("http://httpbin.org/ip", timeout=10)
        print(f"IP check: {test_response.json()}")

        # Then test the actual target
        response = session.get(url, timeout=30)
        print(f"Target response: {response.status_code}")

        return response

    except requests.exceptions.ProxyError as e:
        print(f"Proxy error: {e}")
    except requests.exceptions.Timeout as e:
        print(f"Timeout error: {e}")
    except requests.exceptions.ConnectionError as e:
        print(f"Connection error: {e}")

    return None

# Test different proxy configurations
proxies = [
    None,  # No proxy
    {"http": "http://proxy1:8080", "https": "http://proxy1:8080"},
    {"http": "socks5://proxy2:1080", "https": "socks5://proxy2:1080"},
]

for proxy in proxies:
    print(f"\n--- Testing with proxy: {proxy} ---")
    test_proxy_connection("https://example.com", proxy)

Debugging Tools and Libraries

9. Use Debugging-Specific Libraries

Several Python libraries can enhance your debugging capabilities:

# Install: pip install requests-toolbelt loguru rich

import requests
from requests_toolbelt.utils import dump
from loguru import logger
from rich.console import Console
from rich.table import Table

# Enhanced HTTP debugging with requests-toolbelt
def debug_http_with_toolbelt(url):
    response = requests.get(url)

    # Dump the entire HTTP exchange
    data = dump.dump_all(response)
    print(data.decode('utf-8'))

# Better logging with loguru
logger.add("scraper_{time}.log", rotation="1 day", retention="7 days")

@logger.catch
def scrape_with_loguru(url):
    logger.info(f"Starting scrape of {url}")
    response = requests.get(url)
    logger.success(f"Got response: {response.status_code}")
    return response

# Rich console output for better debugging
console = Console()

def debug_with_rich(data):
    table = Table(title="Scraped Data")
    table.add_column("Title", style="cyan")
    table.add_column("Price", style="magenta")
    table.add_column("Status", style="green")

    for item in data:
        table.add_row(item['title'], item['price'], "✓")

    console.print(table)

Best Practices for Debugging

10. Create a Debugging Checklist

Always follow this systematic approach when debugging scraping issues (a short sketch automating the first few checks follows the list):

  1. Verify the target URL - Ensure it's accessible and returns expected content
  2. Check HTTP status codes - Handle redirects, errors, and rate limiting
  3. Inspect response headers - Look for content-type, encoding, and security headers
  4. Validate HTML structure - Ensure your selectors match the actual DOM
  5. Test with different user agents - Some sites serve different content to different browsers
  6. Monitor network timing - Identify slow requests and timeout issues
  7. Handle JavaScript rendering - Use browser automation for dynamic content
  8. Test error scenarios - Verify your error handling works correctly
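
As a rough illustration, the first few checks can be automated so they run before any parsing. A minimal sketch (the target URL and expected selector are placeholders):

import requests
from bs4 import BeautifulSoup

def preflight_check(url, expected_selector):
    response = requests.get(url, timeout=30)

    # 1-2: verify the URL is reachable and the status code is acceptable
    print(f"Final URL: {response.url} (after {len(response.history)} redirect(s))")
    print(f"Status code: {response.status_code}")

    # 3: inspect key response headers
    print(f"Content-Type: {response.headers.get('Content-Type')}")
    print(f"Content-Encoding: {response.headers.get('Content-Encoding')}")

    # 4: confirm the selector matches the actual DOM your script receives
    soup = BeautifulSoup(response.content, 'html.parser')
    matches = soup.select(expected_selector)
    print(f"Selector {expected_selector!r} matched {len(matches)} element(s)")

    return response.ok and bool(matches)

preflight_check("https://example.com", "div.product")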

For complex debugging scenarios involving JavaScript-heavy sites, you may also want to explore browser automation guides such as how to handle authentication in Puppeteer or how to monitor network requests in Puppeteer for comprehensive debugging approaches.

Conclusion

Effective debugging of Python web scraping scripts requires a combination of proper logging, interactive debugging tools, network analysis, and systematic testing approaches. By implementing these debugging strategies and using the right tools, you can quickly identify and resolve issues in your scraping projects.

Remember to always respect websites' robots.txt files and terms of service, implement appropriate delays between requests, and consider using professional web scraping APIs for production applications to avoid many of these debugging challenges altogether.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
