What are the common pitfalls to avoid when web scraping with Python?

Web scraping with Python can be highly effective, but several common pitfalls can lead to blocked requests, legal issues, poor performance, or unreliable data extraction. Understanding and avoiding these mistakes is crucial for building robust and sustainable scraping solutions.

1. Ignoring Rate Limiting and Making Too Many Requests

One of the most common mistakes is bombarding a website with rapid-fire requests, which can quickly get your IP address banned.

The Problem

# BAD: Aggressive scraping without delays
import requests
from bs4 import BeautifulSoup

urls = ['http://example.com/page{}'.format(i) for i in range(1000)]
for url in urls:
    response = requests.get(url)  # No delay between requests
    # Process response...

The Solution

# GOOD: Implement proper rate limiting
import requests
import time
from bs4 import BeautifulSoup

def scrape_with_rate_limit(urls, delay=1):
    results = []
    for url in urls:
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            results.append(response.text)
            time.sleep(delay)  # Respectful delay
        except requests.RequestException as e:
            print(f"Error scraping {url}: {e}")
    return results

Advanced Rate Limiting with Session Management

import requests
from time import sleep
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

class RateLimitedScraper:
    def __init__(self, requests_per_second=1):
        self.delay = 1.0 / requests_per_second
        self.session = requests.Session()

        # Configure retry strategy
        retry_strategy = Retry(
            total=3,
            status_forcelist=[429, 500, 502, 503, 504],
            allowed_methods=["HEAD", "GET", "OPTIONS"],  # renamed from method_whitelist in urllib3 1.26+
            backoff_factor=1
        )

        adapter = HTTPAdapter(max_retries=retry_strategy)
        self.session.mount("http://", adapter)
        self.session.mount("https://", adapter)

    def get(self, url, **kwargs):
        response = self.session.get(url, **kwargs)
        sleep(self.delay)
        return response
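
A quick usage sketch of the class above (the URL is a placeholder):

scraper = RateLimitedScraper(requests_per_second=2)
response = scraper.get('http://example.com/page1', timeout=10)  # kwargs are passed straight to session.get
print(response.status_code)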

2. Not Handling HTTP Errors and Exceptions Properly

Failing to handle network errors, timeouts, and HTTP status codes can cause your scraper to crash unexpectedly.

The Problem

# BAD: No error handling
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
data = soup.find('div', class_='content').text  # Can raise AttributeError

The Solution

# GOOD: Comprehensive error handling
import requests
from bs4 import BeautifulSoup
import logging
import time

def robust_scrape(url, retries=3):
    for attempt in range(retries):
        try:
            response = requests.get(
                url, 
                timeout=10,
                headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'}
            )
            response.raise_for_status()  # Raises HTTPError for bad responses

            soup = BeautifulSoup(response.text, 'html.parser')
            content_div = soup.find('div', class_='content')

            if content_div:
                return content_div.get_text(strip=True)
            else:
                logging.warning(f"Content div not found on {url}")
                return None

        except requests.exceptions.Timeout:
            logging.error(f"Timeout on attempt {attempt + 1} for {url}")
        except requests.exceptions.ConnectionError:
            logging.error(f"Connection error on attempt {attempt + 1} for {url}")
        except requests.exceptions.HTTPError as e:
            logging.error(f"HTTP error {e.response.status_code} for {url}")
            if e.response.status_code == 404:
                return None  # Don't retry for 404s
        except Exception as e:
            logging.error(f"Unexpected error for {url}: {e}")

        if attempt < retries - 1:
            time.sleep(2 ** attempt)  # Exponential backoff

    return None

3. Using Inadequate User Agents and Headers

Many websites block requests that don't include realistic browser headers or use obvious bot user agents.

The Problem

# BAD: Default requests user agent
response = requests.get(url)  # Uses 'python-requests/2.x.x'

The Solution

# GOOD: Realistic browser headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1'
}

response = requests.get(url, headers=headers)

Rotating User Agents

import random

user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
]

def get_random_headers():
    return {
        'User-Agent': random.choice(user_agents),
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.5',
        'Accept-Encoding': 'gzip, deflate',
        'Connection': 'keep-alive'
    }

4. Not Handling Dynamic Content and JavaScript

Many modern websites load content dynamically with JavaScript, which basic HTTP requests cannot capture.

The Problem

# BAD: Only gets initial HTML, misses JavaScript-loaded content
response = requests.get('https://spa-website.com')
soup = BeautifulSoup(response.text, 'html.parser')
# Content will be empty or incomplete

The Solution with Selenium

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options

def scrape_dynamic_content(url):
    chrome_options = Options()
    chrome_options.add_argument('--headless')  # Run in background
    chrome_options.add_argument('--no-sandbox')
    chrome_options.add_argument('--disable-dev-shm-usage')

    driver = webdriver.Chrome(options=chrome_options)

    try:
        driver.get(url)

        # Wait for specific element to load
        wait = WebDriverWait(driver, 10)
        content = wait.until(
            EC.presence_of_element_located((By.CLASS_NAME, "dynamic-content"))
        )

        return content.text

    finally:
        driver.quit()

For complex JavaScript interactions, you might also consider how to handle AJAX requests using Puppeteer as an alternative approach.
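
Before reaching for a full browser, it is also worth checking the browser's network tab: many "dynamic" pages simply fetch their data from a JSON endpoint, which you can call directly with requests. A minimal sketch, assuming a hypothetical endpoint and response shape:

import requests

def scrape_json_endpoint():
    # Hypothetical endpoint discovered in the network tab -- not part of any real site
    api_url = 'https://spa-website.com/api/items?page=1'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
        'Accept': 'application/json',
    }
    response = requests.get(api_url, headers=headers, timeout=10)
    response.raise_for_status()
    data = response.json()
    return data.get('items', [])  # Field name is an assumption; inspect the actual payload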

5. Poor Data Validation and Cleaning

Extracted data often needs validation and cleaning, but many developers skip this crucial step.

The Problem

# BAD: No data validation
price = soup.find('span', class_='price').text
total += float(price)  # Can crash if price is "N/A" or formatted like "$1,299.00"

The Solution

import re
from decimal import Decimal, InvalidOperation

def clean_price(price_text):
    """Clean and validate price data"""
    if not price_text:
        return None

    # Remove currency symbols and whitespace
    cleaned = re.sub(r'[^\d.,]', '', price_text.strip())

    # Handle different decimal separators
    if ',' in cleaned and '.' in cleaned:
        # Assume comma is thousands separator
        cleaned = cleaned.replace(',', '')
    elif ',' in cleaned:
        # Might be decimal separator in some locales
        cleaned = cleaned.replace(',', '.')

    try:
        return float(cleaned)
    except ValueError:
        return None

def safe_extract_price(soup):
    price_element = soup.find('span', class_='price')
    if price_element:
        return clean_price(price_element.get_text())
    return None
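
A few quick checks of the helper above (the results follow directly from the logic in clean_price):

print(clean_price('$1,299.00'))  # 1299.0 -- comma treated as a thousands separator
print(clean_price('19,99 €'))    # 19.99 -- lone comma treated as a decimal separator
print(clean_price('N/A'))        # None -- no digits survive cleaning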

6. Not Handling Pagination Correctly

Many websites paginate their content, and missing pages or infinite loops are common mistakes.

The Problem

# BAD: Hardcoded page numbers or infinite loops
for page in range(1, 100):  # Might not have 100 pages
    url = f"https://example.com/page/{page}"
    # Scrape page...

The Solution

def scrape_all_pages(base_url):
    page = 1
    all_data = []

    while True:
        url = f"{base_url}?page={page}"
        response = requests.get(url, headers=get_random_headers())

        if response.status_code == 404:
            break  # No more pages

        soup = BeautifulSoup(response.text, 'html.parser')

        # Extract data from current page
        page_data = extract_page_data(soup)

        if not page_data:  # No data found, likely end of pages
            break

        all_data.extend(page_data)

        # Check for "next" button or page indicator
        next_button = soup.find('a', {'class': 'next-page'})
        if not next_button or 'disabled' in next_button.get('class', []):
            break

        page += 1
        time.sleep(1)  # Rate limiting

    return all_data

7. Ignoring Legal and Ethical Considerations

Web scraping exists in a legal gray area, and violating terms of service can lead to serious consequences.

Best Practices for Legal Compliance

  1. Always check robots.txt:
import urllib.robotparser
from urllib.parse import urlparse

def can_scrape(url, user_agent='*'):
    """Check if scraping is allowed according to robots.txt"""
    parsed = urlparse(url)
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(f"{parsed.scheme}://{parsed.netloc}/robots.txt")  # Build robots.txt from the site root, not the page URL
    rp.read()
    return rp.can_fetch(user_agent, url)

# Usage
if can_scrape('https://example.com'):
    # Proceed with scraping
    pass
else:
    print("Scraping not allowed according to robots.txt")
  2. Respect terms of service and copyright
  3. Avoid scraping personal or sensitive data
  4. Consider using official APIs when available

8. Not Implementing Proper Session Management

Using sessions can improve performance and help maintain state across requests.

The Problem

# BAD: New connection for each request
for url in urls:
    response = requests.get(url)  # New TCP connection each time

The Solution

# GOOD: Reuse connections with session
session = requests.Session()
session.headers.update(get_random_headers())

for url in urls:
    response = session.get(url)
    # Process response...
    time.sleep(1)

session.close()
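
Sessions also persist cookies across requests, which matters when the pages you need sit behind a login form. A minimal sketch, assuming a hypothetical login endpoint and form field names:

session = requests.Session()
session.headers.update(get_random_headers())

# Hypothetical login endpoint and field names -- adapt to the target site's form
session.post('https://example.com/login',
             data={'username': 'user', 'password': 'secret'},
             timeout=10)

# The session now sends the authentication cookies automatically
response = session.get('https://example.com/account/orders', timeout=10)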

9. Memory Management Issues with Large Datasets

Processing large amounts of data without proper memory management can cause crashes.

The Solution

import csv
from contextlib import contextmanager

@contextmanager
def csv_writer(filename):
    """Context manager for CSV writing"""
    with open(filename, 'w', newline='', encoding='utf-8') as file:
        writer = csv.writer(file)
        yield writer

def process_large_dataset(urls):
    """Process data in chunks to manage memory"""
    with csv_writer('scraped_data.csv') as writer:
        writer.writerow(['url', 'title', 'content'])  # Header

        for i, url in enumerate(urls):
            data = scrape_single_page(url)
            if data:
                writer.writerow([url, data['title'], data['content']])

            # Log progress
            if i % 100 == 0:
                print(f"Processed {i} URLs")

            time.sleep(1)

10. Not Using Appropriate Parsing Libraries

Choosing the wrong parsing library for your use case can impact performance and reliability.

Library Comparison

# For simple HTML parsing - BeautifulSoup
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')  # Built-in parser
soup = BeautifulSoup(html, 'lxml')         # Faster, requires lxml

# For large XML documents - lxml
from lxml import etree
tree = etree.fromstring(xml_data)

# For high-performance HTML parsing - selectolax
from selectolax.parser import HTMLParser
tree = HTMLParser(html)
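
A quick side-by-side of extracting the same element with each HTML parser (the selector is a placeholder):

# BeautifulSoup
title = soup.select_one('h1.title')
print(title.get_text(strip=True) if title else None)

# selectolax
node = tree.css_first('h1.title')
print(node.text(strip=True) if node else None)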

Conclusion

Avoiding these common pitfalls will help you build more robust, efficient, and legally compliant web scrapers. Remember to always:

  • Implement proper rate limiting and error handling
  • Use realistic headers and user agents
  • Handle dynamic content appropriately
  • Validate and clean your data
  • Respect legal boundaries and terms of service
  • Manage resources efficiently

For handling complex scenarios involving dynamic content and JavaScript-heavy applications, you might also want to explore how to crawl a single page application (SPA) using Puppeteer, which provides additional strategies for modern web applications.

By following these best practices and avoiding common mistakes, you'll be well-equipped to create reliable and maintainable Python web scraping solutions.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
