What Error Handling Strategies Should I Use When Scraping with LLMs?
Error handling is critical when using Large Language Models (LLMs) for web scraping because LLMs introduce unique challenges beyond traditional scraping errors. You'll need to handle not only network and parsing errors but also API rate limits, token limits, hallucinations, inconsistent outputs, and cost overruns.
This guide covers seven error handling strategies for LLM-based web scraping workflows, with examples in Python and JavaScript.
Understanding LLM-Specific Errors
When scraping with LLMs, you'll encounter several types of errors:
- API Errors: Rate limits, authentication failures, timeouts
- Token Limit Errors: Content exceeds the LLM's context window
- Validation Errors: LLM returns malformed or unexpected data
- Hallucination Errors: LLM generates plausible but incorrect data
- Network Errors: Connection issues when fetching pages or calling APIs
- Cost Threshold Errors: Budget limits exceeded
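One way to act on these categories is to decide up front which failures are worth retrying and which need a different response. The sketch below is a minimal illustration, assuming you have the HTTP status code of the failed API call. The names `ErrorAction` and `classify_status` are hypothetical and do not come from any SDK, and the status-code groupings are a reasonable default rather than a rule every provider follows.

```python
from enum import Enum


class ErrorAction(Enum):
    RETRY = "retry"          # transient: back off and try again
    FIX_INPUT = "fix_input"  # the request itself must change (e.g., shrink the content)
    ABORT = "abort"          # retrying cannot help (e.g., bad credentials)


def classify_status(status_code: int) -> ErrorAction:
    """Map a failing HTTP status code from an LLM API to a handling action."""
    if status_code < 400:
        raise ValueError(f"{status_code} is not an error status")
    # Timeouts, rate limits, and server-side failures are usually transient
    if status_code in (408, 429) or status_code >= 500:
        return ErrorAction.RETRY
    # Malformed or oversized requests need a changed payload, not a retry
    if status_code in (400, 413, 422):
        return ErrorAction.FIX_INPUT
    # Everything else (401, 403, 404, ...) is treated as permanent
    return ErrorAction.ABORT
```

Keeping this decision in one place makes it harder for a retry loop to hammer an endpoint with a request that can never succeed.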
Strategy 1: Implement Retry Logic with Exponential Backoff
Retry logic is essential for handling transient errors like rate limits and temporary API failures.
Python Example with OpenAI
import time
import openai
from openai import OpenAI
from tenacity import retry, wait_exponential, stop_after_attempt, retry_if_exception_type

# Assumes the openai>=1.0 SDK; the client reads OPENAI_API_KEY from the environment
client = OpenAI()

@retry(
    retry=retry_if_exception_type((openai.RateLimitError, openai.APIConnectionError)),
    wait=wait_exponential(multiplier=1, min=4, max=60),
    stop=stop_after_attempt(5),
    reraise=True  # Surface the last API error instead of tenacity's RetryError
)
def extract_data_with_llm(html_content, prompt):
    try:
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": "You are a web scraping assistant that extracts structured data."},
                {"role": "user", "content": f"{prompt}\n\nHTML:\n{html_content}"}
            ],
            temperature=0
        )
        return response.choices[0].message.content
    except openai.BadRequestError as e:
        # Don't retry invalid requests (e.g., token limit exceeded)
        raise ValueError(f"Invalid request: {e}") from e
    except Exception as e:
        print(f"Error calling LLM: {e}")
        raise

# Usage
try:
    result = extract_data_with_llm(html_content, "Extract product name and price as JSON")
except ValueError as e:
    print(f"Non-retryable error: {e}")
except Exception as e:
    print(f"All retries failed: {e}")
JavaScript Example with Anthropic Claude
async function callLLMWithRetry(htmlContent, prompt, maxRetries = 5) {
  const baseDelay = 1000; // 1 second
  const sleep = (ms) => new Promise(resolve => setTimeout(resolve, ms));

  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const isLastAttempt = attempt === maxRetries - 1;
    const delay = baseDelay * Math.pow(2, attempt);
    let response;

    try {
      response = await fetch('https://api.anthropic.com/v1/messages', {
        method: 'POST',
        headers: {
          'Content-Type': 'application/json',
          'x-api-key': process.env.ANTHROPIC_API_KEY,
          'anthropic-version': '2023-06-01'
        },
        body: JSON.stringify({
          model: 'claude-3-opus-20240229',
          max_tokens: 1024,
          messages: [{
            role: 'user',
            content: `${prompt}\n\nHTML:\n${htmlContent}`
          }]
        })
      });
    } catch (error) {
      // Network failure - retry unless this was the last attempt
      if (isLastAttempt) {
        throw new Error(`Failed after ${maxRetries} attempts: ${error.message}`);
      }
      console.log(`Attempt ${attempt + 1} failed. Retrying in ${delay}ms...`);
      await sleep(delay);
      continue;
    }

    if (response.status === 429 || response.status >= 500) {
      // Rate limited or server error - wait and retry, but never fall out of the loop silently
      if (isLastAttempt) {
        throw new Error(`Failed after ${maxRetries} attempts: HTTP ${response.status}`);
      }
      console.log(`HTTP ${response.status}. Retrying in ${delay}ms...`);
      await sleep(delay);
      continue;
    }

    if (!response.ok) {
      // Other 4xx errors (bad request, auth failure) will not succeed on retry
      throw new Error(`API error: ${response.status} ${response.statusText}`);
    }

    return await response.json();
  }
}
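Both examples above use a fixed doubling schedule. When many workers hit a rate limit at the same moment, identical delays make them all retry in lockstep, so a common refinement is to add random jitter. The sketch below shows one way to compute such delays. The function name `backoff_delay` and its defaults are illustrative assumptions, not part of tenacity or any API client.

```python
import random


def backoff_delay(attempt, base=1.0, cap=60.0, rng=None):
    """Return the wait in seconds before retry number `attempt` (0-based).

    Without `rng` the delay is deterministic: min(cap, base * 2**attempt).
    With `rng`, "full jitter" picks a uniform value between 0 and that ceiling.
    """
    ceiling = min(cap, base * (2 ** attempt))
    if rng is None:
        return ceiling
    return rng.uniform(0, ceiling)


# Deterministic schedule, capped at 60 seconds
schedule = [backoff_delay(n) for n in range(8)]

# Jittered schedule; seeding the generator only makes this demo reproducible
jittered = [backoff_delay(n, rng=random.Random(42)) for n in range(8)]
```

Passing the random generator in as a parameter keeps the function easy to test. In production code a single shared `random.Random()` instance is enough.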
Strategy 2: Handle Token Limit Errors
When HTML content exceeds the LLM's context window, you need to either truncate or chunk the content.
Python Example: Smart Content Truncation
import tiktoken
from bs4 import BeautifulSoup

def count_tokens(text, model="gpt-4"):
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

def truncate_html_intelligently(html_content, max_tokens=6000, model="gpt-4"):
    if count_tokens(html_content, model) <= max_tokens:
        return html_content

    # Parse and extract only relevant content
    soup = BeautifulSoup(html_content, 'html.parser')

    # Remove non-content elements
    for element in soup(['script', 'style', 'nav', 'footer', 'header', 'iframe']):
        element.decompose()

    # Extract main content
    main_content = soup.find('main') or soup.find('article') or soup.body
    if main_content:
        text = main_content.get_text(separator=' ', strip=True)
        # Truncate to fit within token limit
        encoding = tiktoken.encoding_for_model(model)
        tokens = encoding.encode(text)
        if len(tokens) > max_tokens:
            tokens = tokens[:max_tokens]
            text = encoding.decode(tokens)
        return text

    # No <main>, <article>, or <body> found: fall back to a rough estimate of 4 characters per token
    return html_content[:max_tokens * 4]
import requests

def scrape_with_token_handling(url, prompt):
    # Fetch outside the try block so html_content is always defined in the handler below
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    html_content = response.text

    try:
        # Truncate if necessary
        processed_content = truncate_html_intelligently(html_content)
        # Call LLM
        return extract_data_with_llm(processed_content, prompt)
    except ValueError as e:
        if "token" in str(e).lower():
            # Try more aggressive truncation
            processed_content = truncate_html_intelligently(html_content, max_tokens=3000)
            return extract_data_with_llm(processed_content, prompt)
        raise
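Truncation discards everything past the limit. Chunking instead splits the content into pieces that each fit the context window and sends them separately. The sketch below groups a list of text blocks (for example, one per paragraph or product card) under a token budget. The function name `chunk_blocks` and the default counter of roughly 4 characters per token are assumptions for illustration. A tokenizer-based counter such as the `count_tokens` function above could be passed in as `token_counter` instead.

```python
def chunk_blocks(blocks, max_tokens, token_counter=lambda s: max(1, len(s) // 4)):
    """Group text blocks into chunks whose estimated token total stays within max_tokens."""
    chunks = []
    current, current_tokens = [], 0
    for block in blocks:
        n = token_counter(block)
        if n > max_tokens:
            # A block this large cannot fit in any chunk; the caller must split or truncate it
            raise ValueError("single block exceeds max_tokens")
        if current and current_tokens + n > max_tokens:
            # Adding this block would overflow the budget, so start a new chunk
            chunks.append(current)
            current, current_tokens = [], 0
        current.append(block)
        current_tokens += n
    if current:
        chunks.append(current)
    return chunks


# Usage: three blocks of about 10 estimated tokens each, with a budget of 20 per chunk
groups = chunk_blocks(["a" * 40, "b" * 40, "c" * 40], max_tokens=20)
```

Each chunk can then be sent to the LLM on its own and the partial results merged. How to merge depends on the data: lists of items concatenate cleanly, while a single record spread over several chunks needs reconciliation.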
Strategy 3: Validate LLM Output
Always validate that the LLM returns data in the expected format before using it.
Python Example: JSON Schema Validation
import json
from jsonschema import validate, ValidationError

def validate_llm_output(llm_response, schema):
    try:
        # Try to parse as JSON
        data = json.loads(llm_response)
        # Validate against schema
        validate(instance=data, schema=schema)
        return data
    except json.JSONDecodeError as e:
        raise ValueError(f"LLM returned invalid JSON: {e}") from e
    except ValidationError as e:
        raise ValueError(f"LLM output doesn't match schema: {e}") from e

def scrape_with_validation(html_content, prompt):
    # Define expected schema
    schema = {
        "type": "object",
        "properties": {
            "product_name": {"type": "string"},
            "price": {"type": "number"},
            "currency": {"type": "string"}
        },
        "required": ["product_name", "price"]
    }

    max_attempts = 3
    current_prompt = prompt
    for attempt in range(max_attempts):
        try:
            # Get LLM response
            llm_response = extract_data_with_llm(html_content, current_prompt)
            # Validate
            return validate_llm_output(llm_response, schema)
        except ValueError as e:
            print(f"Validation failed on attempt {attempt + 1}: {e}")
            if attempt < max_attempts - 1:
                # Retry with more explicit instructions (built from the original prompt
                # so the reminder is not appended again on every attempt)
                current_prompt = prompt + f"\n\nIMPORTANT: Return valid JSON matching this schema: {json.dumps(schema)}"
            else:
                raise ValueError(f"Failed to get valid output after {max_attempts} attempts") from e
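A frequent cause of JSON parse failures is that the model wraps an otherwise valid object in a Markdown code fence or a sentence of explanation. Before spending another API call on a retry, it can be worth trying to recover the JSON from the reply. The sketch below is one minimal approach. The helper name `parse_llm_json` is hypothetical, and the recovery heuristics (strip a fence, then take the outermost braces) are assumptions that will not cover every malformed reply.

```python
import json
import re

# Build the fence marker programmatically to keep this snippet easy to embed in Markdown
FENCE = "`" * 3
FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE, re.DOTALL)


def parse_llm_json(raw):
    """Parse JSON from an LLM reply, tolerating a code fence or surrounding prose."""
    text = raw.strip()
    match = FENCE_RE.search(text)
    if match:
        # Keep only the content between the fence markers
        text = match.group(1)
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # Last resort: take the span from the first "{" to the last "}" and try again
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("No JSON object found in LLM response")
    try:
        return json.loads(text[start:end + 1])
    except json.JSONDecodeError as e:
        raise ValueError(f"LLM returned invalid JSON: {e}") from e
```

The result still needs schema validation afterwards, since this step only recovers syntax and says nothing about whether the fields are correct.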
JavaScript Example: Type Checking
function validateProductData(data) {
  const errors = [];

  if (typeof data !== 'object' || data === null) {
    throw new Error('LLM response must be an object');
  }
  if (typeof data.product_name !== 'string' || !data.product_name.trim()) {
    errors.push('product_name must be a non-empty string');
  }
  // Number.isFinite also rejects NaN and Infinity, which typeof reports as 'number'
  if (!Number.isFinite(data.price) || data.price < 0) {
    errors.push('price must be a non-negative number');
  }
  if (errors.length > 0) {
    throw new Error(`Validation errors: ${errors.join(', ')}`);
  }
  return data;
}

async function scrapeWithValidation(htmlContent, prompt) {
  const maxAttempts = 3;
  let currentPrompt = prompt;

  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      const response = await callLLMWithRetry(htmlContent, currentPrompt);
      const parsed = JSON.parse(response.content[0].text);
      // Validate the parsed data
      return validateProductData(parsed);
    } catch (error) {
      console.error(`Attempt ${attempt + 1} failed: ${error.message}`);
      if (attempt === maxAttempts - 1) {
        throw new Error(`Validation failed after ${maxAttempts} attempts: ${error.message}`);
      }
      // Add more explicit instructions for the next attempt (without stacking repeats)
      currentPrompt = `${prompt}\n\nReturn ONLY valid JSON with product_name (string) and price (number).`;
    }
  }
}
Strategy 4: Implement Fallback Mechanisms
When LLM extraction fails, fall back to traditional parsing methods.
Python Example: Multi-Tier Fallback
from bs4 import BeautifulSoup
import re
import requests

def extract_with_fallback(url):
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    html_content = response.text

    # Tier 1: Try LLM extraction
    try:
        prompt = "Extract product name and price. Return JSON with 'product_name' and 'price' fields."
        result = scrape_with_validation(html_content, prompt)
        result['method'] = 'llm'
        return result
    except Exception as e:
        print(f"LLM extraction failed: {e}. Trying traditional parsing...")

    # Tier 2: Try CSS selectors
    try:
        soup = BeautifulSoup(html_content, 'html.parser')
        name_element = soup.select_one('.product-name, [itemprop="name"], h1')
        price_element = soup.select_one('.price, [itemprop="price"]')
        if name_element and price_element:
            price_text = price_element.get_text()
            # Assumes US-style formatting such as "1,299.99"; commas are dropped before parsing
            price_match = re.search(r'\d[\d,]*\.?\d*', price_text)
            if price_match:
                return {
                    'product_name': name_element.get_text(strip=True),
                    'price': float(price_match.group().replace(',', '')),
                    'method': 'css_selector'
                }
    except Exception as e:
        print(f"CSS selector extraction failed: {e}. Trying regex...")

    # Tier 3: Try regex patterns
    try:
        price_match = re.search(r'\$\s*(\d+\.?\d*)', html_content)
        name_match = re.search(r'<h1[^>]*>([^<]+)</h1>', html_content)
        if price_match and name_match:
            return {
                'product_name': name_match.group(1).strip(),
                'price': float(price_match.group(1)),
                'method': 'regex'
            }
    except Exception as e:
        print(f"Regex extraction failed: {e}")

    # All methods failed
    raise ValueError("All extraction methods failed")
Strategy 5: Monitor and Log Errors
Implement comprehensive logging to track error patterns and costs, similar to how you would handle errors in traditional browser automation.
Python Example: Structured Logging
import logging
from datetime import datetime, timezone
import json

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

class LLMScrapingMonitor:
    def __init__(self):
        self.errors = []
        self.costs = []
        self.requests = []

    def log_request(self, url, tokens_used, cost, success, error=None):
        log_entry = {
            # datetime.utcnow() is deprecated since Python 3.12; use an aware UTC timestamp
            'timestamp': datetime.now(timezone.utc).isoformat(),
            'url': url,
            'tokens_used': tokens_used,
            'cost': cost,
            'success': success,
            'error': str(error) if error else None
        }
        self.requests.append(log_entry)
        if not success:
            self.errors.append(log_entry)
            logger.error(f"Scraping failed for {url}: {error}")
        else:
            logger.info(f"Successfully scraped {url} - Tokens: {tokens_used}, Cost: ${cost:.4f}")

    def get_error_summary(self):
        error_types = {}
        for error in self.errors:
            error_msg = error['error']
            error_type = error_msg.split(':')[0] if error_msg else 'Unknown'
            error_types[error_type] = error_types.get(error_type, 0) + 1
        return error_types

    def get_total_cost(self):
        return sum(r['cost'] for r in self.requests)

# Usage
monitor = LLMScrapingMonitor()

def scrape_with_monitoring(url, prompt):
    tokens_used = 0
    cost = 0.0
    try:
        result = extract_with_fallback(url)
        # Approximate cost only: the token count is an estimate and the rate is illustrative,
        # so check your provider's current pricing and the usage reported by the API
        tokens_used = count_tokens(prompt) + 1000  # Estimated
        cost = (tokens_used / 1000) * 0.03  # $0.03 per 1K tokens
        monitor.log_request(url, tokens_used, cost, success=True)
        return result
    except Exception as e:
        monitor.log_request(url, tokens_used, cost, success=False, error=e)
        raise

# After scraping multiple URLs
print(f"Total cost: ${monitor.get_total_cost():.2f}")
print(f"Error summary: {monitor.get_error_summary()}")
Strategy 6: Set Budget and Rate Limits
Prevent cost overruns by implementing budget controls.
Python Example: Budget Control
import time
from datetime import datetime, timezone

class BudgetExceededError(Exception):
    pass

class RateLimitError(Exception):
    pass

class BudgetController:
    def __init__(self, max_daily_cost=10.0, max_requests_per_minute=60):
        self.max_daily_cost = max_daily_cost
        self.max_requests_per_minute = max_requests_per_minute
        self.daily_cost = 0.0
        self.requests_this_minute = []
        self.last_reset = datetime.now(timezone.utc)

    def check_budget(self, estimated_cost):
        now = datetime.now(timezone.utc)

        # Reset the daily counter once 24 hours have passed since the last reset
        if (now - self.last_reset).days >= 1:
            self.daily_cost = 0.0
            self.last_reset = now

        # Check daily budget
        if self.daily_cost + estimated_cost > self.max_daily_cost:
            raise BudgetExceededError(
                f"Daily budget of ${self.max_daily_cost} would be exceeded. "
                f"Current: ${self.daily_cost:.2f}, Estimated: ${estimated_cost:.2f}"
            )

        # Check rate limit; total_seconds() is required because timedelta.seconds
        # holds only the seconds component and ignores whole days
        self.requests_this_minute = [
            req for req in self.requests_this_minute
            if (now - req).total_seconds() < 60
        ]
        if len(self.requests_this_minute) >= self.max_requests_per_minute:
            raise RateLimitError(
                f"Rate limit of {self.max_requests_per_minute} requests/minute exceeded"
            )

    def record_request(self, actual_cost):
        self.daily_cost += actual_cost
        self.requests_this_minute.append(datetime.now(timezone.utc))

# Usage
budget = BudgetController(max_daily_cost=50.0, max_requests_per_minute=30)

def scrape_with_budget_control(url, prompt, max_waits=3):
    estimated_cost = 0.10  # Placeholder: estimate from content length in practice
    # Loop with a bounded number of waits instead of recursing without limit
    for _ in range(max_waits + 1):
        try:
            budget.check_budget(estimated_cost)
        except BudgetExceededError as e:
            logger.error(f"Budget exceeded: {e}")
            raise
        except RateLimitError as e:
            logger.warning(f"Rate limit hit: {e}")
            time.sleep(60)  # Wait before retrying
            continue
        result = extract_with_fallback(url)
        actual_cost = 0.08  # Placeholder: compute from the token usage the API reports
        budget.record_request(actual_cost)
        return result
    raise RateLimitError(f"Still rate limited after {max_waits} waits")
Strategy 7: Handle Hallucinations with Cross-Validation
Validate critical data by cross-referencing with multiple sources or using different extraction methods.
Python Example: Cross-Validation
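def cross_validate_extraction(html_content, prompt):
    results = {}

    # Method 1: LLM extraction (the response is a JSON string, so parse it first)
    try:
        parsed = json.loads(extract_data_with_llm(html_content, prompt))
        if isinstance(parsed, dict):
            results['llm'] = parsed
    except Exception as e:
        logger.warning(f"LLM extraction failed: {e}")

    # Method 2: Traditional parsing
    try:
        soup = BeautifulSoup(html_content, 'html.parser')
        price_elem = soup.select_one('[itemprop="price"]')
        if price_elem:
            # Prefer the machine-readable content attribute, else the visible text
            price_text = price_elem.get('content') or price_elem.get_text(strip=True)
            results['traditional'] = {'price': float(price_text)}
    except Exception as e:
        logger.warning(f"Traditional extraction failed: {e}")

    llm_result = results.get('llm')
    traditional_result = results.get('traditional')

    # Compare results when both methods produced output
    if llm_result and traditional_result:
        llm_price = llm_result.get('price')
        traditional_price = traditional_result['price']
        if not isinstance(llm_price, (int, float)):
            logger.warning("LLM returned no numeric price; using traditional result")
            return traditional_result
        # Check if prices are within 5% of each other (written without division
        # so a traditional price of 0 cannot raise ZeroDivisionError)
        if abs(llm_price - traditional_price) > 0.05 * abs(traditional_price):
            logger.warning(
                f"Price mismatch detected! LLM: {llm_price}, Traditional: {traditional_price}"
            )
            # Use traditional method when there's a discrepancy
            return traditional_result
        return llm_result

    # Only one method (or neither) succeeded
    return llm_result or traditional_result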
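Cross-validation needs a second extraction method, which is not always available. A cheaper complementary check is to confirm that each extracted string actually occurs in the page text, since a hallucinated product name usually does not. The sketch below illustrates the idea. The function name `ungrounded_fields` is hypothetical, and `page_text` is assumed to be the visible text of the page (for example, from BeautifulSoup's `get_text()`). It only catches invented strings, not wrong numbers or values the model paraphrased.

```python
import re


def _normalize(text):
    """Lowercase and collapse whitespace so formatting differences do not matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def ungrounded_fields(extracted, page_text):
    """Return the names of string fields whose value does not appear in page_text."""
    haystack = _normalize(page_text)
    missing = []
    for field, value in extracted.items():
        # Only non-empty strings are checked; numbers and nested values are skipped
        if isinstance(value, str) and value.strip():
            if _normalize(value) not in haystack:
                missing.append(field)
    return missing


# Usage
page = "Acme  Widget\nPro - $19.99"
suspect = ungrounded_fields({"product_name": "Acme Gadget", "price": 19.99}, page)
```

Fields reported here are candidates for a retry or manual review rather than proof of a hallucination, because legitimate normalization by the model (such as expanding an abbreviation) also fails a verbatim match.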
Best Practices Summary
- Always use retry logic with exponential backoff for transient errors
- Validate all LLM outputs against expected schemas before using them
- Implement fallback mechanisms to traditional scraping when LLMs fail
- Monitor costs and set budgets to prevent unexpected charges
- Handle token limits by intelligently truncating or chunking content
- Log all errors systematically to identify patterns and improve reliability
- Cross-validate critical data to catch hallucinations
- Set appropriate timeouts for both network requests and LLM API calls, similar to handling timeouts in browser automation
Conclusion
Error handling for LLM-based web scraping requires a multi-layered approach that addresses both traditional scraping challenges and LLM-specific issues. By implementing robust retry logic, validation, fallback mechanisms, and monitoring, you can build reliable scraping systems that gracefully handle failures while controlling costs.
Remember that LLMs are probabilistic systems, so perfect reliability is impossible. The key is to design your error handling strategy to fail gracefully, provide useful fallbacks, and give you visibility into what's happening in your scraping pipeline.