How do I Handle Errors When Using GPT for Data Extraction?

Error handling is critical when using GPT models for web scraping and data extraction. GPT APIs can fail due to rate limits, network issues, invalid responses, or content policy violations. Implementing robust error handling ensures your scraping pipeline remains reliable and resilient.

Common Error Types in GPT-Based Data Extraction

When working with GPT for web scraping, you'll encounter several types of errors; the snippet after these lists maps them to the openai Python SDK's exception classes:

1. API-Level Errors

  • Rate limiting (429): Exceeding API request limits
  • Authentication errors (401): Invalid or expired API keys
  • Timeout errors: Requests taking too long
  • Server errors (500-series): OpenAI service issues
  • Content policy violations: Input or output triggering safety filters

2. Data Quality Errors

  • Malformed JSON responses: GPT returning invalid structured data
  • Hallucinations: GPT generating fictitious information
  • Missing fields: Incomplete data extraction
  • Type mismatches: Returned data not matching expected schema

3. Network and Infrastructure Errors

  • Connection failures: Network connectivity issues
  • DNS resolution failures: Unable to reach API endpoints
  • SSL certificate errors: Security-related connection problems
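
These categories map onto concrete exception classes in the official openai Python SDK (v1.x). The sketch below shows one way to group them for logging or retry decisions; the category labels are our own, only the exception classes come from the SDK.

import openai

# Rough mapping of the error categories above to openai-python (v1.x)
# exception classes. The category labels are illustrative; only the
# exception classes come from the SDK.
ERROR_CATEGORIES = {
    openai.RateLimitError: "rate_limit",            # HTTP 429
    openai.AuthenticationError: "authentication",   # HTTP 401
    openai.APITimeoutError: "timeout",              # request took too long
    openai.InternalServerError: "server_error",     # HTTP 500-series
    openai.BadRequestError: "bad_request",          # includes content policy rejections
    openai.APIConnectionError: "network",           # DNS, SSL, connection failures
}

def categorize_error(exc: Exception) -> str:
    """Return a coarse category label for an OpenAI SDK exception."""
    for exc_type, label in ERROR_CATEGORIES.items():
        if isinstance(exc, exc_type):
            return label
    return "unknown"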

Implementing Retry Logic with Exponential Backoff

Retry logic is essential for handling transient errors. Here's a robust implementation in Python:

import time
import openai
from openai import OpenAI
import random

def extract_with_retry(prompt, html_content, max_retries=3, base_delay=1):
    """
    Extract data using GPT with exponential backoff retry logic.
    """
    client = OpenAI(api_key="your-api-key")

    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model="gpt-4",
                messages=[
                    {"role": "system", "content": "Extract structured data from HTML."},
                    {"role": "user", "content": f"{prompt}\n\nHTML:\n{html_content}"}
                ],
                temperature=0.1,
                max_tokens=2000
            )

            return response.choices[0].message.content

        except openai.RateLimitError as e:
            if attempt == max_retries - 1:
                raise
            # Exponential backoff with jitter
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            print(f"Rate limit hit. Retrying in {delay:.2f} seconds...")
            time.sleep(delay)

        except openai.APIConnectionError as e:
            if attempt == max_retries - 1:
                raise
            delay = base_delay * (2 ** attempt)
            print(f"Connection error. Retrying in {delay:.2f} seconds...")
            time.sleep(delay)

        except openai.APIError as e:
            if attempt == max_retries - 1:
                raise
            delay = base_delay * (2 ** attempt)
            print(f"API error: {e}. Retrying in {delay:.2f} seconds...")
            time.sleep(delay)

        except Exception as e:
            print(f"Unexpected error: {e}")
            raise

    raise Exception("Max retries exceeded")

JavaScript developers can implement similar retry logic:

const OpenAI = require('openai');

async function extractWithRetry(prompt, htmlContent, maxRetries = 3, baseDelay = 1000) {
    const openai = new OpenAI({
        apiKey: process.env.OPENAI_API_KEY
    });

    for (let attempt = 0; attempt < maxRetries; attempt++) {
        try {
            const response = await openai.chat.completions.create({
                model: 'gpt-4',
                messages: [
                    { role: 'system', content: 'Extract structured data from HTML.' },
                    { role: 'user', content: `${prompt}\n\nHTML:\n${htmlContent}` }
                ],
                temperature: 0.1,
                max_tokens: 2000
            });

            return response.choices[0].message.content;

        } catch (error) {
            const isLastAttempt = attempt === maxRetries - 1;

            if (error.status === 429 && !isLastAttempt) {
                // Rate limit error
                const delay = baseDelay * Math.pow(2, attempt) + Math.random() * 1000;
                console.log(`Rate limit hit. Retrying in ${delay}ms...`);
                await new Promise(resolve => setTimeout(resolve, delay));
            } else if (error instanceof OpenAI.APIConnectionError && !isLastAttempt) {
                // Connection error (the SDK wraps DNS, socket and TLS failures)
                const delay = baseDelay * Math.pow(2, attempt);
                console.log(`Connection error. Retrying in ${delay}ms...`);
                await new Promise(resolve => setTimeout(resolve, delay));
            } else {
                throw error;
            }
        }
    }

    throw new Error('Max retries exceeded');
}

Validating GPT Responses

Always validate GPT responses to catch malformed data early. Here's a comprehensive validation approach:

import json
from typing import Dict, Any, Optional
from pydantic import BaseModel, ValidationError, field_validator

class ProductData(BaseModel):
    """Schema for product data extraction."""
    title: str
    price: float
    currency: str
    availability: str
    rating: Optional[float] = None

    @field_validator('price')
    def price_must_be_positive(cls, v):
        if v < 0:
            raise ValueError('Price must be positive')
        return v

    @field_validator('rating')
    def rating_must_be_valid(cls, v):
        if v is not None and (v < 0 or v > 5):
            raise ValueError('Rating must be between 0 and 5')
        return v

def validate_and_parse_response(gpt_response: str) -> Optional[Dict[str, Any]]:
    """
    Validate and parse GPT JSON response with error handling.
    """
    try:
        # Step 1: Parse JSON
        data = json.loads(gpt_response)

        # Step 2: Validate against schema
        validated_data = ProductData(**data)

        return validated_data.model_dump()

    except json.JSONDecodeError as e:
        print(f"JSON parsing error: {e}")
        # Attempt to extract JSON from a fenced code block such as ```json
        if '```json' in gpt_response:
            try:
                json_start = gpt_response.find('```json') + len('```json')
                json_end = gpt_response.find('```', json_start)
                json_str = gpt_response[json_start:json_end].strip()
                data = json.loads(json_str)
                validated_data = ProductData(**data)
                return validated_data.model_dump()
            except Exception as nested_error:
                print(f"Failed to extract JSON from markdown: {nested_error}")
        return None

    except ValidationError as e:
        print(f"Validation error: {e}")
        return None

    except Exception as e:
        print(f"Unexpected error during validation: {e}")
        return None

Implementing Fallback Mechanisms

When GPT fails, having fallback mechanisms ensures continuity. Here's a multi-layered approach:

import re
from bs4 import BeautifulSoup

def extract_with_fallback(html_content: str, prompt: str) -> Dict[str, Any]:
    """
    Extract data with GPT as primary method and traditional scraping as fallback.
    """
    # Try GPT extraction first
    try:
        gpt_response = extract_with_retry(prompt, html_content)
        validated_data = validate_and_parse_response(gpt_response)

        if validated_data:
            return {
                'data': validated_data,
                'method': 'gpt',
                'confidence': 'high'
            }
    except Exception as e:
        print(f"GPT extraction failed: {e}")

    # Fallback to traditional scraping
    try:
        soup = BeautifulSoup(html_content, 'html.parser')

        # Extract using CSS selectors, guarding against missing elements
        # (Python has no optional chaining like JavaScript's ?. operator)
        title_el = soup.select_one('h1.product-title')
        price_el = soup.select_one('.price')
        availability_el = soup.select_one('.availability')

        fallback_data = {
            'title': title_el.text.strip() if title_el else None,
            'price': extract_price(price_el.text if price_el else None),
            'currency': 'USD',  # Default or extract from page
            'availability': availability_el.text.strip() if availability_el else None
        }

        return {
            'data': fallback_data,
            'method': 'traditional',
            'confidence': 'medium'
        }

    except Exception as e:
        print(f"Fallback extraction failed: {e}")
        return {
            'data': None,
            'method': 'none',
            'confidence': 'none',
            'error': str(e)
        }

def extract_price(price_text: Optional[str]) -> Optional[float]:
    """Extract numeric price from text."""
    if not price_text:
        return None

    # Remove currency symbols and extract number
    price_match = re.search(r'[\d,]+\.?\d*', price_text.replace(',', ''))
    return float(price_match.group()) if price_match else None

Handling Content Policy Violations

GPT may refuse to process certain content. Handle these cases gracefully:

def safe_extract(html_content: str, prompt: str) -> Optional[Dict[str, Any]]:
    """
    Extract data with content policy violation handling.
    """
    try:
        response = extract_with_retry(prompt, html_content)
        return validate_and_parse_response(response)

    except openai.BadRequestError as e:
        # Content policy violation
        if 'content_policy_violation' in str(e).lower():
            print("Content policy violation detected. Sanitizing input...")

            # Sanitize HTML (remove scripts, styles, etc.)
            soup = BeautifulSoup(html_content, 'html.parser')
            for tag in soup(['script', 'style', 'iframe']):
                tag.decompose()

            sanitized_html = soup.get_text(separator=' ', strip=True)

            # Retry with sanitized content
            try:
                response = extract_with_retry(prompt, sanitized_html)
                return validate_and_parse_response(response)
            except Exception as nested_error:
                print(f"Sanitization didn't help: {nested_error}")
                return None
        else:
            raise

Monitoring and Logging Errors

Implement comprehensive logging to track error patterns:

import logging
from datetime import datetime
import json

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('gpt_extraction.log'),
        logging.StreamHandler()
    ]
)

logger = logging.getLogger(__name__)

class GPTExtractor:
    def __init__(self, api_key: str):
        self.client = OpenAI(api_key=api_key)
        self.error_stats = {
            'rate_limits': 0,
            'timeouts': 0,
            'validation_errors': 0,
            'api_errors': 0
        }

    def extract(self, html_content: str, prompt: str) -> Optional[Dict[str, Any]]:
        """Extract with comprehensive error logging."""
        start_time = datetime.now()

        try:
            response = extract_with_retry(prompt, html_content)
            validated_data = validate_and_parse_response(response)

            duration = (datetime.now() - start_time).total_seconds()

            logger.info(f"Extraction successful. Duration: {duration}s")

            return validated_data

        except openai.RateLimitError as e:
            self.error_stats['rate_limits'] += 1
            logger.error(f"Rate limit error: {e}")
            raise

        except openai.APITimeoutError as e:
            self.error_stats['timeouts'] += 1
            logger.error(f"Timeout error: {e}")
            raise

        except ValidationError as e:
            self.error_stats['validation_errors'] += 1
            logger.error(f"Validation error: {e}")
            raise

        except Exception as e:
            self.error_stats['api_errors'] += 1
            logger.error(f"Unexpected error: {e}", exc_info=True)
            raise

    def get_error_report(self) -> str:
        """Generate error statistics report."""
        return json.dumps(self.error_stats, indent=2)

Rate Limiting and Token Management

Implement token counting and rate limiting to prevent errors:

import tiktoken

def count_tokens(text: str, model: str = "gpt-4") -> int:
    """Count tokens in text for given model."""
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

def truncate_html(html_content: str, max_tokens: int = 8000, model: str = "gpt-4") -> str:
    """Truncate HTML to fit within token limit."""
    encoding = tiktoken.encoding_for_model(model)
    tokens = encoding.encode(html_content)

    if len(tokens) <= max_tokens:
        return html_content

    # Truncate and decode
    truncated_tokens = tokens[:max_tokens]
    return encoding.decode(truncated_tokens)

def safe_api_call(html_content: str, prompt: str, max_tokens: int = 8000):
    """Make API call with token limit enforcement."""
    # Count tokens for prompt + HTML
    total_tokens = count_tokens(prompt) + count_tokens(html_content)

    if total_tokens > max_tokens:
        print(f"Content exceeds token limit ({total_tokens} > {max_tokens}). Truncating...")
        # Reserve ~1000 tokens of headroom for message formatting and the model's response
        html_content = truncate_html(html_content, max_tokens - count_tokens(prompt) - 1000)

    return extract_with_retry(prompt, html_content)
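
The helpers above keep each request under the model's input-token budget, but they do nothing about request-per-minute limits. A simple client-side throttle applied before each call reduces how often you hit rate limit errors in the first place. Below is a minimal sketch using a rolling window; the 60-requests-per-minute default is a placeholder, so set it from your account's actual quota.

import threading
import time

class RequestRateLimiter:
    """Simple client-side limiter: allow at most `max_requests` per `period` seconds."""

    def __init__(self, max_requests: int = 60, period: float = 60.0):
        self.max_requests = max_requests
        self.period = period
        self.request_times = []  # timestamps of recent requests
        self.lock = threading.Lock()

    def wait_for_slot(self):
        """Block until a request can be made without exceeding the limit."""
        while True:
            with self.lock:
                now = time.monotonic()
                # Drop timestamps that fell outside the rolling window
                self.request_times = [t for t in self.request_times if now - t < self.period]
                if len(self.request_times) < self.max_requests:
                    self.request_times.append(now)
                    return
                # Sleep until the oldest request leaves the window
                sleep_for = self.period - (now - self.request_times[0])
            time.sleep(max(sleep_for, 0.1))

# Usage: throttle before each call to safe_api_call / extract_with_retry
rate_limiter = RequestRateLimiter(max_requests=60, period=60.0)

def rate_limited_extract(html_content: str, prompt: str):
    rate_limiter.wait_for_slot()
    return safe_api_call(html_content, prompt)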

Best Practices Summary

  1. Always implement retry logic with exponential backoff for transient errors
  2. Validate all responses using schema validation libraries like Pydantic
  3. Implement fallback mechanisms to traditional scraping when GPT fails
  4. Monitor and log errors to identify patterns and optimize your pipeline
  5. Handle token limits proactively by truncating or chunking content
  6. Sanitize input content to avoid content policy violations
  7. Use timeouts to prevent indefinite waiting on API calls
  8. Cache successful extractions to reduce API calls and costs (points 7 and 8 are sketched below)
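
Points 7 and 8 don't appear in the extraction code above, so here is a minimal sketch of both, assuming the openai v1.x client (which accepts a per-client timeout in seconds) and a plain in-memory dict as the cache. Swap the dict for Redis or a database if your pipeline runs across multiple processes.

import hashlib
import os
from openai import OpenAI

# Point 7: an explicit timeout (in seconds) so a stuck request fails fast.
# In production you would pass this shared client into extract_with_retry
# instead of letting it construct a new client on every call.
client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    timeout=30.0,
    max_retries=0,  # retries are handled by extract_with_retry
)

# Point 8: cache validated extractions in memory, keyed on prompt + HTML.
_extraction_cache = {}

def cached_extract(prompt: str, html_content: str):
    cache_key = hashlib.sha256(f"{prompt}\n{html_content}".encode()).hexdigest()

    if cache_key in _extraction_cache:
        return _extraction_cache[cache_key]

    response = extract_with_retry(prompt, html_content)
    validated = validate_and_parse_response(response)

    if validated is not None:
        # Only cache responses that passed validation, so failures are retried later
        _extraction_cache[cache_key] = validated

    return validated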

By implementing these error handling strategies, you'll build a robust GPT-powered data extraction pipeline that gracefully handles failures and maintains high reliability. Similar to how you handle errors in traditional scraping tools, defensive programming and comprehensive error handling are essential for production systems.

For complex scraping workflows that require browser automation alongside GPT extraction, consider implementing timeout handling strategies to ensure your entire pipeline remains responsive and reliable.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
