How Do I Handle Errors When Using GPT for Data Extraction?
Error handling is critical when using GPT models for web scraping and data extraction. GPT APIs can fail due to rate limits, network issues, invalid responses, or content policy violations. Implementing robust error handling ensures your scraping pipeline remains reliable and resilient.
Common Error Types in GPT-Based Data Extraction
When working with GPT for web scraping, you'll encounter several types of errors:
1. API-Level Errors
- Rate limiting (429): Exceeding API request limits
- Authentication errors (401): Invalid or expired API keys
- Timeout errors: Requests taking too long
- Server errors (500-series): OpenAI service issues
- Content policy violations: Input or output triggering safety filters
2. Data Quality Errors
- Malformed JSON responses: GPT returning invalid structured data
- Hallucinations: GPT generating fictitious information
- Missing fields: Incomplete data extraction
- Type mismatches: Returned data not matching expected schema
3. Network and Infrastructure Errors
- Connection failures: Network connectivity issues
- DNS resolution failures: Unable to reach API endpoints
- SSL certificate errors: Security-related connection problems
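If you use the official openai Python SDK (v1.x), these categories map onto specific exception classes. A minimal sketch of that mapping (the function below is illustrative only, assuming the v1.x client):

import openai
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def categorize_api_error(messages):
    """Demonstrates which SDK exception corresponds to each error category."""
    try:
        return client.chat.completions.create(model="gpt-4", messages=messages)
    except openai.RateLimitError as e:        # 429: rate limiting
        print(f"Rate limited: {e}")
        raise
    except openai.AuthenticationError as e:   # 401: invalid or expired API key
        print(f"Auth error: {e}")
        raise
    except openai.APITimeoutError as e:       # request took too long
        print(f"Timeout: {e}")
        raise
    except openai.InternalServerError as e:   # 500-series: OpenAI service issues
        print(f"Server error: {e}")
        raise
    except openai.APIConnectionError as e:    # network, DNS, or SSL problems
        print(f"Connection error: {e}")
        raise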
Implementing Retry Logic with Exponential Backoff
Retry logic is essential for handling transient errors. Here's a robust implementation in Python:
import time
import random

import openai
from openai import OpenAI


def extract_with_retry(prompt, html_content, max_retries=3, base_delay=1):
    """
    Extract data using GPT with exponential backoff retry logic.
    """
    client = OpenAI(api_key="your-api-key")

    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model="gpt-4",
                messages=[
                    {"role": "system", "content": "Extract structured data from HTML."},
                    {"role": "user", "content": f"{prompt}\n\nHTML:\n{html_content}"}
                ],
                temperature=0.1,
                max_tokens=2000
            )
            return response.choices[0].message.content
        except openai.RateLimitError:
            if attempt == max_retries - 1:
                raise
            # Exponential backoff with jitter
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            print(f"Rate limit hit. Retrying in {delay:.2f} seconds...")
            time.sleep(delay)
        except openai.APIConnectionError:
            if attempt == max_retries - 1:
                raise
            delay = base_delay * (2 ** attempt)
            print(f"Connection error. Retrying in {delay:.2f} seconds...")
            time.sleep(delay)
        except openai.APIError as e:
            if attempt == max_retries - 1:
                raise
            delay = base_delay * (2 ** attempt)
            print(f"API error: {e}. Retrying in {delay:.2f} seconds...")
            time.sleep(delay)
        except Exception as e:
            print(f"Unexpected error: {e}")
            raise

    raise Exception("Max retries exceeded")
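A quick usage sketch (the file path and prompt here are placeholders):

# Hypothetical usage: load a saved page and extract product data from it
with open("product-page.html", encoding="utf-8") as f:
    html = f.read()

result = extract_with_retry(
    prompt="Extract the product title, price, currency, and availability as JSON.",
    html_content=html,
    max_retries=5,   # allow more attempts on flaky networks
    base_delay=2     # start with a 2-second delay
)
print(result)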
JavaScript developers can implement similar retry logic:
const OpenAI = require('openai');

async function extractWithRetry(prompt, htmlContent, maxRetries = 3, baseDelay = 1000) {
  const openai = new OpenAI({
    apiKey: process.env.OPENAI_API_KEY
  });

  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      const response = await openai.chat.completions.create({
        model: 'gpt-4',
        messages: [
          { role: 'system', content: 'Extract structured data from HTML.' },
          { role: 'user', content: `${prompt}\n\nHTML:\n${htmlContent}` }
        ],
        temperature: 0.1,
        max_tokens: 2000
      });
      return response.choices[0].message.content;
    } catch (error) {
      const isLastAttempt = attempt === maxRetries - 1;

      if (error.status === 429 && !isLastAttempt) {
        // Rate limit error
        const delay = baseDelay * Math.pow(2, attempt) + Math.random() * 1000;
        console.log(`Rate limit hit. Retrying in ${delay}ms...`);
        await new Promise(resolve => setTimeout(resolve, delay));
      } else if (error.code === 'ECONNREFUSED' && !isLastAttempt) {
        // Connection error
        const delay = baseDelay * Math.pow(2, attempt);
        console.log(`Connection error. Retrying in ${delay}ms...`);
        await new Promise(resolve => setTimeout(resolve, delay));
      } else {
        throw error;
      }
    }
  }

  throw new Error('Max retries exceeded');
}
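Note that the official SDKs also retry some transient failures for you. In the Python SDK (v1.x), for example, retries and timeouts can be configured on the client itself; a minimal sketch:

from openai import OpenAI

# The v1.x Python client retries certain transient errors (connection errors,
# 429s, 5xx responses) automatically with exponential backoff; both knobs are configurable.
client = OpenAI(
    max_retries=5,   # default is 2
    timeout=30.0     # seconds; prevents requests from hanging indefinitely
)

Custom loops like the ones above are still useful when you need jitter, logging, or per-call behavior beyond what the client defaults provide.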
Validating GPT Responses
Always validate GPT responses to catch malformed data early. Here's a comprehensive validation approach:
import json
from typing import Dict, Any, Optional

from pydantic import BaseModel, ValidationError, field_validator


class ProductData(BaseModel):
    """Schema for product data extraction."""
    title: str
    price: float
    currency: str
    availability: str
    rating: Optional[float] = None

    @field_validator('price')
    def price_must_be_positive(cls, v):
        if v < 0:
            raise ValueError('Price must be positive')
        return v

    @field_validator('rating')
    def rating_must_be_valid(cls, v):
        if v is not None and (v < 0 or v > 5):
            raise ValueError('Rating must be between 0 and 5')
        return v


def validate_and_parse_response(gpt_response: str) -> Optional[Dict[str, Any]]:
    """
    Validate and parse GPT JSON response with error handling.
    """
    try:
        # Step 1: Parse JSON
        data = json.loads(gpt_response)
        # Step 2: Validate against schema
        validated_data = ProductData(**data)
        return validated_data.model_dump()
    except json.JSONDecodeError as e:
        print(f"JSON parsing error: {e}")
        # Attempt to extract JSON from a markdown code block (```json ... ```)
        if '```json' in gpt_response:
            try:
                json_start = gpt_response.find('```json') + 7  # skip past the opening fence
                json_end = gpt_response.find('```', json_start)
                json_str = gpt_response[json_start:json_end].strip()
                data = json.loads(json_str)
                validated_data = ProductData(**data)
                return validated_data.model_dump()
            except Exception as nested_error:
                print(f"Failed to extract JSON from markdown: {nested_error}")
        return None
    except ValidationError as e:
        print(f"Validation error: {e}")
        return None
    except Exception as e:
        print(f"Unexpected error during validation: {e}")
        return None
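You can also reduce malformed output at the source. Models that support JSON mode will only emit syntactically valid JSON when you set the response_format parameter; a minimal sketch, assuming a model that supports it (e.g. gpt-4o):

from openai import OpenAI

client = OpenAI()

def extract_as_json(prompt: str, html_content: str) -> str:
    """Request a JSON object directly; the result still needs schema validation."""
    response = client.chat.completions.create(
        model="gpt-4o",  # JSON mode requires a model that supports response_format
        response_format={"type": "json_object"},
        messages=[
            # JSON mode requires the word "JSON" to appear somewhere in the messages
            {"role": "system", "content": "Extract structured data from HTML and reply with a JSON object."},
            {"role": "user", "content": f"{prompt}\n\nHTML:\n{html_content}"}
        ],
        temperature=0.1
    )
    return response.choices[0].message.content

JSON mode guarantees valid JSON syntax, not that the fields match your schema, so keep the Pydantic validation step regardless.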
Implementing Fallback Mechanisms
When GPT fails, having fallback mechanisms ensures continuity. Here's a multi-layered approach:
import re
from bs4 import BeautifulSoup


def extract_with_fallback(html_content: str, prompt: str) -> Dict[str, Any]:
    """
    Extract data with GPT as primary method and traditional scraping as fallback.
    """
    # Try GPT extraction first
    try:
        gpt_response = extract_with_retry(prompt, html_content)
        validated_data = validate_and_parse_response(gpt_response)
        if validated_data:
            return {
                'data': validated_data,
                'method': 'gpt',
                'confidence': 'high'
            }
    except Exception as e:
        print(f"GPT extraction failed: {e}")

    # Fallback to traditional scraping
    try:
        soup = BeautifulSoup(html_content, 'html.parser')

        # Extract using CSS selectors; select_one() returns None when nothing
        # matches, so check each element before accessing .text
        title_el = soup.select_one('h1.product-title')
        price_el = soup.select_one('.price')
        availability_el = soup.select_one('.availability')

        fallback_data = {
            'title': title_el.text.strip() if title_el else None,
            'price': extract_price(price_el.text if price_el else None),
            'currency': 'USD',  # Default or extract from page
            'availability': availability_el.text.strip() if availability_el else None
        }
        return {
            'data': fallback_data,
            'method': 'traditional',
            'confidence': 'medium'
        }
    except Exception as e:
        print(f"Fallback extraction failed: {e}")
        return {
            'data': None,
            'method': 'none',
            'confidence': 'none',
            'error': str(e)
        }


def extract_price(price_text: Optional[str]) -> Optional[float]:
    """Extract numeric price from text."""
    if not price_text:
        return None
    # Remove currency symbols and extract number
    price_match = re.search(r'[\d,]+\.?\d*', price_text.replace(',', ''))
    return float(price_match.group()) if price_match else None
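A quick usage sketch, reusing the html string loaded earlier (the CSS selectors above, such as h1.product-title, are site-specific placeholders you would adapt):

result = extract_with_fallback(html_content=html, prompt="Extract product data as JSON.")

if result['data'] is None:
    print(f"Both extraction methods failed: {result.get('error')}")
else:
    print(f"Extracted via {result['method']} ({result['confidence']} confidence)")
    print(result['data'])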
Handling Content Policy Violations
GPT may refuse to process certain content. Handle these cases gracefully:
def safe_extract(html_content: str, prompt: str) -> Optional[Dict[str, Any]]:
    """
    Extract data with content policy violation handling.
    """
    try:
        response = extract_with_retry(prompt, html_content)
        return validate_and_parse_response(response)
    except openai.BadRequestError as e:
        # Content policy violation
        if 'content_policy_violation' in str(e).lower():
            print("Content policy violation detected. Sanitizing input...")

            # Sanitize HTML (remove scripts, styles, etc.)
            soup = BeautifulSoup(html_content, 'html.parser')
            for tag in soup(['script', 'style', 'iframe']):
                tag.decompose()
            sanitized_html = soup.get_text(separator=' ', strip=True)

            # Retry with sanitized content
            try:
                response = extract_with_retry(prompt, sanitized_html)
                return validate_and_parse_response(response)
            except Exception as nested_error:
                print(f"Sanitization didn't help: {nested_error}")
                return None
        else:
            raise
Monitoring and Logging Errors
Implement comprehensive logging to track error patterns:
import logging
from datetime import datetime
import json

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('gpt_extraction.log'),
        logging.StreamHandler()
    ]
)
logger = logging.getLogger(__name__)


class GPTExtractor:
    def __init__(self, api_key: str):
        self.client = OpenAI(api_key=api_key)
        self.error_stats = {
            'rate_limits': 0,
            'timeouts': 0,
            'validation_errors': 0,
            'api_errors': 0
        }

    def extract(self, html_content: str, prompt: str) -> Optional[Dict[str, Any]]:
        """Extract with comprehensive error logging."""
        start_time = datetime.now()
        try:
            response = extract_with_retry(prompt, html_content)
            validated_data = validate_and_parse_response(response)
            duration = (datetime.now() - start_time).total_seconds()
            logger.info(f"Extraction successful. Duration: {duration}s")
            return validated_data
        except openai.RateLimitError as e:
            self.error_stats['rate_limits'] += 1
            logger.error(f"Rate limit error: {e}")
            raise
        except openai.APITimeoutError as e:
            self.error_stats['timeouts'] += 1
            logger.error(f"Timeout error: {e}")
            raise
        except ValidationError as e:
            self.error_stats['validation_errors'] += 1
            logger.error(f"Validation error: {e}")
            raise
        except Exception as e:
            self.error_stats['api_errors'] += 1
            logger.error(f"Unexpected error: {e}", exc_info=True)
            raise

    def get_error_report(self) -> str:
        """Generate error statistics report."""
        return json.dumps(self.error_stats, indent=2)
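A sketch of how the extractor and its error report might be used in a crawl loop (pages is a hypothetical dict mapping URLs to already-fetched HTML):

import os

extractor = GPTExtractor(api_key=os.environ["OPENAI_API_KEY"])

for url, page_html in pages.items():  # `pages` is a placeholder: URL -> fetched HTML
    try:
        data = extractor.extract(page_html, "Extract product data as JSON.")
        logger.info(f"Extracted from {url}: {data}")
    except Exception:
        logger.warning(f"Skipping {url} after extraction failure")

# Periodically dump the counters to spot patterns (e.g. mostly rate limits vs. timeouts)
print(extractor.get_error_report())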
Rate Limiting and Token Management
Implement token counting and rate limiting to prevent errors:
import tiktoken


def count_tokens(text: str, model: str = "gpt-4") -> int:
    """Count tokens in text for given model."""
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))


def truncate_html(html_content: str, max_tokens: int = 8000, model: str = "gpt-4") -> str:
    """Truncate HTML to fit within token limit."""
    encoding = tiktoken.encoding_for_model(model)
    tokens = encoding.encode(html_content)

    if len(tokens) <= max_tokens:
        return html_content

    # Truncate and decode
    truncated_tokens = tokens[:max_tokens]
    return encoding.decode(truncated_tokens)


def safe_api_call(html_content: str, prompt: str, max_tokens: int = 8000):
    """Make API call with token limit enforcement."""
    # Count tokens for prompt + HTML
    total_tokens = count_tokens(prompt) + count_tokens(html_content)

    if total_tokens > max_tokens:
        print(f"Content exceeds token limit ({total_tokens} > {max_tokens}). Truncating...")
        html_content = truncate_html(html_content, max_tokens - count_tokens(prompt) - 1000)

    return extract_with_retry(prompt, html_content)
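Truncation discards content, so for long pages you may prefer to split the HTML into token-sized chunks and merge the per-chunk results afterwards. A minimal sketch (the merge step depends on your schema and is left out):

from typing import List

import tiktoken

def chunk_by_tokens(text: str, chunk_tokens: int = 6000, model: str = "gpt-4") -> List[str]:
    """Split text into pieces that each fit within chunk_tokens."""
    encoding = tiktoken.encoding_for_model(model)
    tokens = encoding.encode(text)
    return [
        encoding.decode(tokens[i:i + chunk_tokens])
        for i in range(0, len(tokens), chunk_tokens)
    ]

def extract_in_chunks(html_content: str, prompt: str) -> List[str]:
    """Run the extraction prompt over each chunk; merge the results downstream."""
    return [extract_with_retry(prompt, chunk) for chunk in chunk_by_tokens(html_content)]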
Best Practices Summary
- Always implement retry logic with exponential backoff for transient errors
- Validate all responses using schema validation libraries like Pydantic
- Implement fallback mechanisms to traditional scraping when GPT fails
- Monitor and log errors to identify patterns and optimize your pipeline
- Handle token limits proactively by truncating or chunking content
- Sanitize input content to avoid content policy violations
- Use timeouts to prevent indefinite waiting on API calls
- Cache successful extractions to reduce API calls and costs (a minimal caching sketch follows this list)
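As an illustration of the caching point, a minimal sketch using an in-memory dictionary keyed on a hash of the prompt and HTML (swap in Redis or a database for anything long-running):

import hashlib

_extraction_cache = {}  # in-memory cache; use Redis or a database for persistence

def cached_extract(html_content: str, prompt: str):
    """Return a cached result when the same prompt + HTML has already been processed."""
    cache_key = hashlib.sha256(f"{prompt}\n{html_content}".encode("utf-8")).hexdigest()
    if cache_key not in _extraction_cache:
        _extraction_cache[cache_key] = extract_with_fallback(html_content, prompt)
    return _extraction_cache[cache_key]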
By implementing these error handling strategies, you'll build a robust GPT-powered data extraction pipeline that gracefully handles failures and maintains high reliability. Just as with traditional scraping tools, defensive programming and comprehensive error handling are essential for production systems.
For complex scraping workflows that require browser automation alongside GPT extraction, consider implementing timeout handling strategies to ensure your entire pipeline remains responsive and reliable.