What Error Handling Strategies Should I Use When Scraping with LLMs?

Error handling is critical when using Large Language Models (LLMs) for web scraping because LLMs introduce unique challenges beyond traditional scraping errors. You'll need to handle not only network and parsing errors but also API rate limits, token limits, hallucinations, inconsistent outputs, and cost overruns.

This guide covers comprehensive error handling strategies specifically designed for LLM-based web scraping workflows.

Understanding LLM-Specific Errors

When scraping with LLMs, you'll encounter several types of errors (see the exception sketch after this list):

  1. API Errors: Rate limits, authentication failures, timeouts
  2. Token Limit Errors: Content exceeds the LLM's context window
  3. Validation Errors: LLM returns malformed or unexpected data
  4. Hallucination Errors: LLM generates plausible but incorrect data
  5. Network Errors: Connection issues when fetching pages or calling APIs
  6. Cost Threshold Errors: Budget limits exceeded
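
To make these categories explicit in your own code, it helps to define a small exception hierarchy and raise the matching subclass at each failure point; later strategies in this guide do exactly that for budget and rate-limit errors. A minimal sketch (class names are illustrative, not from any library):

class ScrapingError(Exception):
    """Base class for all scraping-related failures."""

class TokenLimitError(ScrapingError):
    """Content exceeds the model's context window."""

class LLMValidationError(ScrapingError):
    """The LLM returned malformed or unexpected data."""

class HallucinationError(ScrapingError):
    """Extracted data failed a cross-validation check."""

Catching ScrapingError at the top of your pipeline handles all of these uniformly while still letting you branch on the specific subclass where it matters.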

Strategy 1: Implement Retry Logic with Exponential Backoff

Retry logic is essential for handling transient errors like rate limits and temporary API failures.

Python Example with OpenAI

import time
import openai
from tenacity import retry, wait_exponential, stop_after_attempt, retry_if_exception_type

client = openai.OpenAI()  # Reads OPENAI_API_KEY from the environment

@retry(
    retry=retry_if_exception_type((openai.RateLimitError, openai.APIConnectionError)),
    wait=wait_exponential(multiplier=1, min=4, max=60),
    stop=stop_after_attempt(5)
)
def extract_data_with_llm(html_content, prompt):
    try:
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": "You are a web scraping assistant that extracts structured data."},
                {"role": "user", "content": f"{prompt}\n\nHTML:\n{html_content}"}
            ],
            temperature=0
        )
        return response.choices[0].message.content
    except openai.BadRequestError as e:
        # Don't retry invalid requests (e.g., token limit exceeded)
        raise ValueError(f"Invalid request: {e}")
    except Exception as e:
        print(f"Error calling LLM: {e}")
        raise

# Usage
try:
    result = extract_data_with_llm(html_content, "Extract product name and price as JSON")
except ValueError as e:
    print(f"Non-retryable error: {e}")
except Exception as e:
    print(f"All retries failed: {e}")

JavaScript Example with Anthropic Claude

async function callLLMWithRetry(htmlContent, prompt, maxRetries = 5) {
    const baseDelay = 1000; // 1 second

    for (let attempt = 0; attempt < maxRetries; attempt++) {
        try {
            const response = await fetch('https://api.anthropic.com/v1/messages', {
                method: 'POST',
                headers: {
                    'Content-Type': 'application/json',
                    'x-api-key': process.env.ANTHROPIC_API_KEY,
                    'anthropic-version': '2023-06-01'
                },
                body: JSON.stringify({
                    model: 'claude-3-opus-20240229',
                    max_tokens: 1024,
                    messages: [{
                        role: 'user',
                        content: `${prompt}\n\nHTML:\n${htmlContent}`
                    }]
                })
            });

            if (response.status === 429) {
                // Rate limited - respect the server's Retry-After header when present
                const retryAfter = response.headers.get('retry-after');
                const delay = retryAfter
                    ? parseInt(retryAfter, 10) * 1000
                    : baseDelay * Math.pow(2, attempt);
                console.log(`Rate limited. Retrying in ${delay}ms...`);
                await new Promise(resolve => setTimeout(resolve, delay));
                continue;
            }

            if (!response.ok) {
                throw new Error(`API error: ${response.status} ${response.statusText}`);
            }

            return await response.json();

        } catch (error) {
            if (attempt === maxRetries - 1) {
                throw new Error(`Failed after ${maxRetries} attempts: ${error.message}`);
            }

            const delay = baseDelay * Math.pow(2, attempt);
            console.log(`Attempt ${attempt + 1} failed. Retrying in ${delay}ms...`);
            await new Promise(resolve => setTimeout(resolve, delay));
        }
    }
}

Strategy 2: Handle Token Limit Errors

When HTML content exceeds the LLM's context window, you need to either truncate the content or split it into chunks. The example below shows intelligent truncation; a chunking sketch follows it.

Python Example: Smart Content Truncation

import tiktoken

def count_tokens(text, model="gpt-4"):
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

def truncate_html_intelligently(html_content, max_tokens=6000, model="gpt-4"):
    from bs4 import BeautifulSoup

    if count_tokens(html_content, model) <= max_tokens:
        return html_content

    # Parse and extract only relevant content
    soup = BeautifulSoup(html_content, 'html.parser')

    # Remove non-content elements
    for element in soup(['script', 'style', 'nav', 'footer', 'header', 'iframe']):
        element.decompose()

    # Extract main content
    main_content = soup.find('main') or soup.find('article') or soup.body

    if main_content:
        text = main_content.get_text(separator=' ', strip=True)
        # Truncate to fit within token limit
        encoding = tiktoken.encoding_for_model(model)
        tokens = encoding.encode(text)
        if len(tokens) > max_tokens:
            tokens = tokens[:max_tokens]
            text = encoding.decode(tokens)
        return text

    return html_content[:max_tokens * 4]  # Rough character estimate

def scrape_with_token_handling(url, prompt):
    import requests

    try:
        response = requests.get(url, timeout=10)
        html_content = response.text

        # Truncate if necessary
        processed_content = truncate_html_intelligently(html_content)

        # Call LLM
        result = extract_data_with_llm(processed_content, prompt)
        return result

    except ValueError as e:
        if "token" in str(e).lower():
            # Try more aggressive truncation
            processed_content = truncate_html_intelligently(html_content, max_tokens=3000)
            return extract_data_with_llm(processed_content, prompt)
        raise
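
Truncation discards content, which is risky when the data you need might sit anywhere on the page. The alternative mentioned above is chunking: split the text into pieces that each fit the context window and run the extraction per piece. A minimal sketch reusing extract_data_with_llm from Strategy 1 (the merge step here is a plain list; combining partial results depends on your schema):

import tiktoken

def chunk_text_by_tokens(text, max_tokens=6000, overlap=200, model="gpt-4"):
    # Encode once, then slice the token list into overlapping windows
    encoding = tiktoken.encoding_for_model(model)
    tokens = encoding.encode(text)
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(encoding.decode(tokens[start:start + max_tokens]))
        start += max_tokens - overlap  # Overlap so records aren't cut at chunk boundaries
    return chunks

def extract_from_chunks(text, prompt):
    results = []
    for chunk in chunk_text_by_tokens(text):
        try:
            results.append(extract_data_with_llm(chunk, prompt))
        except ValueError as e:
            print(f"Skipping chunk: {e}")
    return results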

Strategy 3: Validate LLM Output

Always validate that the LLM returns data in the expected format before using it.

Python Example: JSON Schema Validation

import json
from jsonschema import validate, ValidationError

def validate_llm_output(llm_response, schema):
    try:
        # Try to parse as JSON
        data = json.loads(llm_response)

        # Validate against schema
        validate(instance=data, schema=schema)

        return data
    except json.JSONDecodeError as e:
        raise ValueError(f"LLM returned invalid JSON: {e}")
    except ValidationError as e:
        raise ValueError(f"LLM output doesn't match schema: {e}")

def scrape_with_validation(html_content, prompt):
    # Define expected schema
    schema = {
        "type": "object",
        "properties": {
            "product_name": {"type": "string"},
            "price": {"type": "number"},
            "currency": {"type": "string"}
        },
        "required": ["product_name", "price"]
    }

    max_attempts = 3
    for attempt in range(max_attempts):
        try:
            # Get LLM response
            llm_response = extract_data_with_llm(html_content, prompt)

            # Validate
            validated_data = validate_llm_output(llm_response, schema)
            return validated_data

        except ValueError as e:
            print(f"Validation failed on attempt {attempt + 1}: {e}")

            if attempt < max_attempts - 1:
                # Retry with more explicit instructions
                prompt += f"\n\nIMPORTANT: Return valid JSON matching this schema: {json.dumps(schema)}"
            else:
                raise ValueError(f"Failed to get valid output after {max_attempts} attempts")

JavaScript Example: Type Checking

function validateProductData(data) {
    const errors = [];

    if (typeof data !== 'object' || data === null) {
        throw new Error('LLM response must be an object');
    }

    if (typeof data.product_name !== 'string' || !data.product_name.trim()) {
        errors.push('product_name must be a non-empty string');
    }

    if (typeof data.price !== 'number' || Number.isNaN(data.price) || data.price < 0) {
        errors.push('price must be a non-negative number');
    }

    if (errors.length > 0) {
        throw new Error(`Validation errors: ${errors.join(', ')}`);
    }

    return data;
}

async function scrapeWithValidation(htmlContent, prompt) {
    const maxAttempts = 3;

    for (let attempt = 0; attempt < maxAttempts; attempt++) {
        try {
            const response = await callLLMWithRetry(htmlContent, prompt);
            const parsed = JSON.parse(response.content[0].text);

            // Validate the parsed data
            return validateProductData(parsed);

        } catch (error) {
            console.error(`Attempt ${attempt + 1} failed: ${error.message}`);

            if (attempt === maxAttempts - 1) {
                throw new Error(`Validation failed after ${maxAttempts} attempts`);
            }

            // Add more explicit instructions for next attempt
            prompt += '\n\nReturn ONLY valid JSON with product_name (string) and price (number).';
        }
    }
}
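
On the Python side, Pydantic is a common alternative to raw JSON Schema when you want typed access to the validated data. A minimal sketch (assumes Pydantic v2; the Product fields mirror the schema above):

from typing import Optional
from pydantic import BaseModel, ValidationError

class Product(BaseModel):
    product_name: str
    price: float
    currency: Optional[str] = None

def parse_product(llm_response: str) -> Product:
    try:
        # Parses the JSON string and validates field types in one step
        return Product.model_validate_json(llm_response)
    except ValidationError as e:
        raise ValueError(f"LLM output doesn't match the Product model: {e}")

The retry-with-clarified-prompt loop from the Python example above works unchanged; only the validation call differs.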

Strategy 4: Implement Fallback Mechanisms

When LLM extraction fails, fall back to traditional parsing methods.

Python Example: Multi-Tier Fallback

from bs4 import BeautifulSoup
import re

def extract_with_fallback(url):
    import requests
    response = requests.get(url, timeout=10)
    html_content = response.text

    # Tier 1: Try LLM extraction
    try:
        prompt = "Extract product name and price. Return JSON with 'product_name' and 'price' fields."
        result = scrape_with_validation(html_content, prompt)
        result['method'] = 'llm'
        return result
    except Exception as e:
        print(f"LLM extraction failed: {e}. Trying traditional parsing...")

    # Tier 2: Try CSS selectors/XPath
    try:
        soup = BeautifulSoup(html_content, 'html.parser')

        name_element = soup.select_one('.product-name, [itemprop="name"], h1')
        price_element = soup.select_one('.price, [itemprop="price"]')

        if name_element and price_element:
            price_text = price_element.get_text()
            price = float(re.search(r'[\d.]+', price_text).group())

            return {
                'product_name': name_element.get_text(strip=True),
                'price': price,
                'method': 'css_selector'
            }
    except Exception as e:
        print(f"CSS selector extraction failed: {e}. Trying regex...")

    # Tier 3: Try regex patterns
    try:
        price_match = re.search(r'\$\s*(\d+\.?\d*)', html_content)
        name_match = re.search(r'<h1[^>]*>([^<]+)</h1>', html_content)

        if price_match and name_match:
            return {
                'product_name': name_match.group(1).strip(),
                'price': float(price_match.group(1)),
                'method': 'regex'
            }
    except Exception as e:
        print(f"Regex extraction failed: {e}")

    # All methods failed
    raise ValueError("All extraction methods failed")

Strategy 5: Monitor and Log Errors

Implement comprehensive logging to track error patterns and costs, similar to how you would handle errors in traditional browser automation.

Python Example: Structured Logging

import logging
from datetime import datetime
import json

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

class LLMScrapingMonitor:
    def __init__(self):
        self.errors = []
        self.costs = []
        self.requests = []

    def log_request(self, url, tokens_used, cost, success, error=None):
        log_entry = {
            'timestamp': datetime.utcnow().isoformat(),
            'url': url,
            'tokens_used': tokens_used,
            'cost': cost,
            'success': success,
            'error': str(error) if error else None
        }

        self.requests.append(log_entry)

        if not success:
            self.errors.append(log_entry)
            logger.error(f"Scraping failed for {url}: {error}")
        else:
            logger.info(f"Successfully scraped {url} - Tokens: {tokens_used}, Cost: ${cost:.4f}")

    def get_error_summary(self):
        error_types = {}
        for error in self.errors:
            error_msg = error['error']
            error_type = error_msg.split(':')[0] if error_msg else 'Unknown'
            error_types[error_type] = error_types.get(error_type, 0) + 1

        return error_types

    def get_total_cost(self):
        return sum(r['cost'] for r in self.requests)

# Usage
monitor = LLMScrapingMonitor()

def scrape_with_monitoring(url, prompt):
    tokens_used = 0
    cost = 0.0

    try:
        result = extract_with_fallback(url)

        # Estimate cost (illustrative GPT-4 input pricing; check current rates)
        tokens_used = count_tokens(prompt) + 1000  # Rough estimate of prompt + response
        cost = (tokens_used / 1000) * 0.03  # e.g., $0.03 per 1K tokens

        monitor.log_request(url, tokens_used, cost, success=True)
        return result

    except Exception as e:
        monitor.log_request(url, tokens_used, cost, success=False, error=e)
        raise

# After scraping multiple URLs
print(f"Total cost: ${monitor.get_total_cost():.2f}")
print(f"Error summary: {monitor.get_error_summary()}")

Strategy 6: Set Budget and Rate Limits

Prevent cost overruns by implementing budget controls.

Python Example: Budget Control

class BudgetController:
    def __init__(self, max_daily_cost=10.0, max_requests_per_minute=60):
        self.max_daily_cost = max_daily_cost
        self.max_requests_per_minute = max_requests_per_minute
        self.daily_cost = 0.0
        self.requests_this_minute = []
        self.last_reset = datetime.utcnow()

    def check_budget(self, estimated_cost):
        # Reset the cost counter once 24 hours have passed since the last reset
        if (datetime.utcnow() - self.last_reset).days >= 1:
            self.daily_cost = 0.0
            self.last_reset = datetime.utcnow()

        # Check daily budget
        if self.daily_cost + estimated_cost > self.max_daily_cost:
            raise BudgetExceededError(
                f"Daily budget of ${self.max_daily_cost} would be exceeded. "
                f"Current: ${self.daily_cost:.2f}, Estimated: ${estimated_cost:.2f}"
            )

        # Check rate limit (keep only requests from the last 60 seconds)
        now = datetime.utcnow()
        self.requests_this_minute = [
            req for req in self.requests_this_minute
            if (now - req).total_seconds() < 60
        ]

        if len(self.requests_this_minute) >= self.max_requests_per_minute:
            raise RateLimitError(
                f"Rate limit of {self.max_requests_per_minute} requests/minute exceeded"
            )

    def record_request(self, actual_cost):
        self.daily_cost += actual_cost
        self.requests_this_minute.append(datetime.utcnow())

class BudgetExceededError(Exception):
    pass

class RateLimitError(Exception):
    pass

# Usage
budget = BudgetController(max_daily_cost=50.0, max_requests_per_minute=30)

def scrape_with_budget_control(url, prompt):
    estimated_cost = 0.10  # Placeholder; estimate from content length in practice

    try:
        budget.check_budget(estimated_cost)

        result = extract_with_fallback(url)

        # Calculate actual cost
        actual_cost = 0.08  # Placeholder; compute from the API's reported token usage
        budget.record_request(actual_cost)

        return result

    except BudgetExceededError as e:
        logger.error(f"Budget exceeded: {e}")
        raise
    except RateLimitError as e:
        logger.warning(f"Rate limit hit: {e}")
        time.sleep(60)  # Wait before retrying
        return scrape_with_budget_control(url, prompt)

Strategy 7: Handle Hallucinations with Cross-Validation

Validate critical data by cross-referencing with multiple sources or using different extraction methods.

Python Example: Cross-Validation

def cross_validate_extraction(html_content, prompt):
    results = {}

    # Method 1: LLM extraction (the LLM returns a JSON string, so parse it)
    try:
        results['llm'] = json.loads(extract_data_with_llm(html_content, prompt))
    except Exception as e:
        logger.warning(f"LLM extraction failed: {e}")

    # Method 2: Traditional parsing
    try:
        soup = BeautifulSoup(html_content, 'html.parser')
        price_elem = soup.select_one('[itemprop="price"]')
        if price_elem and price_elem.get('content'):
            results['traditional'] = {'price': float(price_elem['content'])}
    except Exception as e:
        logger.warning(f"Traditional extraction failed: {e}")

    # Compare results when both methods produced a price
    llm_price = results.get('llm', {}).get('price')
    traditional_price = results.get('traditional', {}).get('price')

    if llm_price is not None and traditional_price:
        # Check if prices are within 5% of each other
        if abs(llm_price - traditional_price) / traditional_price > 0.05:
            logger.warning(
                f"Price mismatch detected! LLM: {llm_price}, Traditional: {traditional_price}"
            )
            # Prefer the deterministic method when there's a discrepancy
            return results['traditional']

    return results.get('llm') or results.get('traditional')
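
A complementary guard is self-consistency: run the same extraction several times and accept only values the runs agree on. Note that with temperature=0 repeated runs are near-deterministic, so vary the prompt wording or raise the temperature slightly to get genuinely independent samples. A minimal sketch using a majority vote on the price field (the field name follows the examples above):

import json
from collections import Counter

def extract_price_with_consensus(html_content, prompt, runs=3):
    prices = []
    for _ in range(runs):
        try:
            data = json.loads(extract_data_with_llm(html_content, prompt))
            if 'price' in data:
                prices.append(data['price'])
        except (ValueError, json.JSONDecodeError):
            continue  # Ignore failed runs; the vote only counts successes
    if not prices:
        return None
    value, count = Counter(prices).most_common(1)[0]
    # Require a strict majority before trusting the value
    return value if count > runs / 2 else None

This multiplies API cost by the number of runs, so reserve it for fields where a hallucinated value is expensive.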

Best Practices Summary

  1. Always use retry logic with exponential backoff for transient errors
  2. Validate all LLM outputs against expected schemas before using them
  3. Implement fallback mechanisms to traditional scraping when LLMs fail
  4. Monitor costs and set budgets to prevent unexpected charges
  5. Handle token limits by intelligently truncating or chunking content
  6. Log all errors systematically to identify patterns and improve reliability
  7. Cross-validate critical data to catch hallucinations
  8. Set appropriate timeouts for both network requests and LLM API calls, similar to handling timeouts in browser automation (see the sketch after this list)
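
For the last point, a minimal timeout sketch (values are illustrative; requests takes a timeout in seconds, and the OpenAI v1 client accepts one at construction):

import requests
import openai

client = openai.OpenAI(timeout=30.0)  # Applies to every API call made with this client

def fetch_page(url):
    # Fail fast instead of hanging indefinitely on a slow server
    return requests.get(url, timeout=10).text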

Conclusion

Error handling for LLM-based web scraping requires a multi-layered approach that addresses both traditional scraping challenges and LLM-specific issues. By implementing robust retry logic, validation, fallback mechanisms, and monitoring, you can build reliable scraping systems that gracefully handle failures while controlling costs.

Remember that LLMs are probabilistic systems, so perfect reliability is impossible. The key is to design your error handling strategy to fail gracefully, provide useful fallbacks, and give you visibility into what's happening in your scraping pipeline.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
