How do I Handle Errors When Using GPT for Data Extraction?

Error handling is critical when using GPT models for web scraping and data extraction. GPT APIs can fail due to rate limits, network issues, invalid responses, or content policy violations. Implementing robust error handling ensures your scraping pipeline remains reliable and resilient.

Common Error Types in GPT-Based Data Extraction

When working with GPT for web scraping, you'll encounter several types of errors; the snippet after these lists maps them to the openai Python SDK's exception classes:

1. API-Level Errors

  • Rate limiting (429): Exceeding API request limits
  • Authentication errors (401): Invalid or expired API keys
  • Timeout errors: Requests taking too long
  • Server errors (500-series): OpenAI service issues
  • Content policy violations: Input or output triggering safety filters

2. Data Quality Errors

  • Malformed JSON responses: GPT returning invalid structured data
  • Hallucinations: GPT generating fictitious information
  • Missing fields: Incomplete data extraction
  • Type mismatches: Returned data not matching expected schema

3. Network and Infrastructure Errors

  • Connection failures: Network connectivity issues
  • DNS resolution failures: Unable to reach API endpoints
  • SSL certificate errors: Security-related connection problems
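
These categories map onto concrete exception classes in the official openai Python SDK (v1.x). The sketch below shows one way to group them for logging or retry decisions; the category labels are our own, only the exception classes come from the SDK.

import openai

# Rough mapping of the error categories above to openai-python (v1.x)
# exception classes. The category labels are illustrative; only the
# exception classes come from the SDK.
ERROR_CATEGORIES = {
    openai.RateLimitError: "rate_limit",            # HTTP 429
    openai.AuthenticationError: "authentication",   # HTTP 401
    openai.APITimeoutError: "timeout",              # request took too long
    openai.InternalServerError: "server_error",     # HTTP 500-series
    openai.BadRequestError: "bad_request",          # includes content policy rejections
    openai.APIConnectionError: "network",           # DNS, SSL, connection failures
}

def categorize_error(exc: Exception) -> str:
    """Return a coarse category label for an OpenAI SDK exception."""
    for exc_type, label in ERROR_CATEGORIES.items():
        if isinstance(exc, exc_type):
            return label
    return "unknown"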

Implementing Retry Logic with Exponential Backoff

Retry logic is essential for handling transient errors. Here's a robust implementation in Python:

import time
import openai
from openai import OpenAI
import random

def extract_with_retry(prompt, html_content, max_retries=3, base_delay=1):
    """
    Extract data using GPT with exponential backoff retry logic.
    """
    client = OpenAI(api_key="your-api-key")

    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model="gpt-4",
                messages=[
                    {"role": "system", "content": "Extract structured data from HTML."},
                    {"role": "user", "content": f"{prompt}\n\nHTML:\n{html_content}"}
                ],
                temperature=0.1,
                max_tokens=2000
            )

            return response.choices[0].message.content

        except openai.RateLimitError as e:
            if attempt == max_retries - 1:
                raise
            # Exponential backoff with jitter
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            print(f"Rate limit hit. Retrying in {delay:.2f} seconds...")
            time.sleep(delay)

        except openai.APIConnectionError as e:
            if attempt == max_retries - 1:
                raise
            delay = base_delay * (2 ** attempt)
            print(f"Connection error. Retrying in {delay:.2f} seconds...")
            time.sleep(delay)

        except openai.APIError as e:
            if attempt == max_retries - 1:
                raise
            delay = base_delay * (2 ** attempt)
            print(f"API error: {e}. Retrying in {delay:.2f} seconds...")
            time.sleep(delay)

        except Exception as e:
            print(f"Unexpected error: {e}")
            raise

    raise Exception("Max retries exceeded")

JavaScript developers can implement similar retry logic:

const OpenAI = require('openai');

async function extractWithRetry(prompt, htmlContent, maxRetries = 3, baseDelay = 1000) {
    const openai = new OpenAI({
        apiKey: process.env.OPENAI_API_KEY
    });

    for (let attempt = 0; attempt < maxRetries; attempt++) {
        try {
            const response = await openai.chat.completions.create({
                model: 'gpt-4',
                messages: [
                    { role: 'system', content: 'Extract structured data from HTML.' },
                    { role: 'user', content: `${prompt}\n\nHTML:\n${htmlContent}` }
                ],
                temperature: 0.1,
                max_tokens: 2000
            });

            return response.choices[0].message.content;

        } catch (error) {
            const isLastAttempt = attempt === maxRetries - 1;

            if (error.status === 429 && !isLastAttempt) {
                // Rate limit error
                const delay = baseDelay * Math.pow(2, attempt) + Math.random() * 1000;
                console.log(`Rate limit hit. Retrying in ${delay}ms...`);
                await new Promise(resolve => setTimeout(resolve, delay));
            } else if (error instanceof OpenAI.APIConnectionError && !isLastAttempt) {
                // Connection error (the SDK wraps DNS, socket and TLS failures)
                const delay = baseDelay * Math.pow(2, attempt);
                console.log(`Connection error. Retrying in ${delay}ms...`);
                await new Promise(resolve => setTimeout(resolve, delay));
            } else {
                throw error;
            }
        }
    }

    throw new Error('Max retries exceeded');
}

Validating GPT Responses

Always validate GPT responses to catch malformed data early. Here's a comprehensive validation approach:

import json
from typing import Dict, Any, Optional
from pydantic import BaseModel, ValidationError, field_validator

class ProductData(BaseModel):
    """Schema for product data extraction."""
    title: str
    price: float
    currency: str
    availability: str
    rating: Optional[float] = None

    @field_validator('price')
    def price_must_be_positive(cls, v):
        if v < 0:
            raise ValueError('Price must be positive')
        return v

    @field_validator('rating')
    def rating_must_be_valid(cls, v):
        if v is not None and (v < 0 or v > 5):
            raise ValueError('Rating must be between 0 and 5')
        return v

def validate_and_parse_response(gpt_response: str) -> Optional[Dict[str, Any]]:
    """
    Validate and parse GPT JSON response with error handling.
    """
    try:
        # Step 1: Parse JSON
        data = json.loads(gpt_response)

        # Step 2: Validate against schema
        validated_data = ProductData(**data)

        return validated_data.model_dump()

    except json.JSONDecodeError as e:
        print(f"JSON parsing error: {e}")
        # Attempt to extract JSON from a fenced code block such as ```json
        if '```json' in gpt_response:
            try:
                json_start = gpt_response.find('```json') + len('```json')
                json_end = gpt_response.find('```', json_start)
                json_str = gpt_response[json_start:json_end].strip()
                data = json.loads(json_str)
                validated_data = ProductData(**data)
                return validated_data.model_dump()
            except Exception as nested_error:
                print(f"Failed to extract JSON from markdown: {nested_error}")
        return None

    except ValidationError as e:
        print(f"Validation error: {e}")
        return None

    except Exception as e:
        print(f"Unexpected error during validation: {e}")
        return None

Implementing Fallback Mechanisms

When GPT fails, having fallback mechanisms ensures continuity. Here's a multi-layered approach:

import re
from bs4 import BeautifulSoup

def extract_with_fallback(html_content: str, prompt: str) -> Dict[str, Any]:
    """
    Extract data with GPT as primary method and traditional scraping as fallback.
    """
    # Try GPT extraction first
    try:
        gpt_response = extract_with_retry(prompt, html_content)
        validated_data = validate_and_parse_response(gpt_response)

        if validated_data:
            return {
                'data': validated_data,
                'method': 'gpt',
                'confidence': 'high'
            }
    except Exception as e:
        print(f"GPT extraction failed: {e}")

    # Fallback to traditional scraping
    try:
        soup = BeautifulSoup(html_content, 'html.parser')

        # Extract using CSS selectors, guarding against missing elements
        # (Python has no optional chaining like JavaScript's ?. operator)
        title_el = soup.select_one('h1.product-title')
        price_el = soup.select_one('.price')
        availability_el = soup.select_one('.availability')

        fallback_data = {
            'title': title_el.text.strip() if title_el else None,
            'price': extract_price(price_el.text if price_el else None),
            'currency': 'USD',  # Default or extract from page
            'availability': availability_el.text.strip() if availability_el else None
        }

        return {
            'data': fallback_data,
            'method': 'traditional',
            'confidence': 'medium'
        }

    except Exception as e:
        print(f"Fallback extraction failed: {e}")
        return {
            'data': None,
            'method': 'none',
            'confidence': 'none',
            'error': str(e)
        }

def extract_price(price_text: Optional[str]) -> Optional[float]:
    """Extract numeric price from text."""
    if not price_text:
        return None

    # Remove currency symbols and extract number
    price_match = re.search(r'[\d,]+\.?\d*', price_text.replace(',', ''))
    return float(price_match.group()) if price_match else None

Handling Content Policy Violations

GPT may refuse to process certain content. Handle these cases gracefully:

def safe_extract(html_content: str, prompt: str) -> Optional[Dict[str, Any]]:
    """
    Extract data with content policy violation handling.
    """
    try:
        response = extract_with_retry(prompt, html_content)
        return validate_and_parse_response(response)

    except openai.BadRequestError as e:
        # Content policy violation
        if 'content_policy_violation' in str(e).lower():
            print("Content policy violation detected. Sanitizing input...")

            # Sanitize HTML (remove scripts, styles, etc.)
            soup = BeautifulSoup(html_content, 'html.parser')
            for tag in soup(['script', 'style', 'iframe']):
                tag.decompose()

            sanitized_html = soup.get_text(separator=' ', strip=True)

            # Retry with sanitized content
            try:
                response = extract_with_retry(prompt, sanitized_html)
                return validate_and_parse_response(response)
            except Exception as nested_error:
                print(f"Sanitization didn't help: {nested_error}")
                return None
        else:
            raise

Monitoring and Logging Errors

Implement comprehensive logging to track error patterns:

import logging
from datetime import datetime
import json

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('gpt_extraction.log'),
        logging.StreamHandler()
    ]
)

logger = logging.getLogger(__name__)

class GPTExtractor:
    def __init__(self, api_key: str):
        self.client = OpenAI(api_key=api_key)
        self.error_stats = {
            'rate_limits': 0,
            'timeouts': 0,
            'validation_errors': 0,
            'api_errors': 0
        }

    def extract(self, html_content: str, prompt: str) -> Optional[Dict[str, Any]]:
        """Extract with comprehensive error logging."""
        start_time = datetime.now()

        try:
            response = extract_with_retry(prompt, html_content)
            validated_data = validate_and_parse_response(response)

            duration = (datetime.now() - start_time).total_seconds()

            logger.info(f"Extraction successful. Duration: {duration}s")

            return validated_data

        except openai.RateLimitError as e:
            self.error_stats['rate_limits'] += 1
            logger.error(f"Rate limit error: {e}")
            raise

        except openai.APITimeoutError as e:
            self.error_stats['timeouts'] += 1
            logger.error(f"Timeout error: {e}")
            raise

        except ValidationError as e:
            self.error_stats['validation_errors'] += 1
            logger.error(f"Validation error: {e}")
            raise

        except Exception as e:
            self.error_stats['api_errors'] += 1
            logger.error(f"Unexpected error: {e}", exc_info=True)
            raise

    def get_error_report(self) -> str:
        """Generate error statistics report."""
        return json.dumps(self.error_stats, indent=2)

Rate Limiting and Token Management

Implement token counting and rate limiting to prevent errors:

import tiktoken

def count_tokens(text: str, model: str = "gpt-4") -> int:
    """Count tokens in text for given model."""
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

def truncate_html(html_content: str, max_tokens: int = 8000, model: str = "gpt-4") -> str:
    """Truncate HTML to fit within token limit."""
    encoding = tiktoken.encoding_for_model(model)
    tokens = encoding.encode(html_content)

    if len(tokens) <= max_tokens:
        return html_content

    # Truncate and decode
    truncated_tokens = tokens[:max_tokens]
    return encoding.decode(truncated_tokens)

def safe_api_call(html_content: str, prompt: str, max_tokens: int = 8000):
    """Make API call with token limit enforcement."""
    # Count tokens for prompt + HTML
    total_tokens = count_tokens(prompt) + count_tokens(html_content)

    if total_tokens > max_tokens:
        print(f"Content exceeds token limit ({total_tokens} > {max_tokens}). Truncating...")
        # Reserve ~1000 tokens of headroom for message formatting and the model's response
        html_content = truncate_html(html_content, max_tokens - count_tokens(prompt) - 1000)

    return extract_with_retry(prompt, html_content)
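
The helpers above keep each request under the model's input-token budget, but they do nothing about request-per-minute limits. A simple client-side throttle applied before each call reduces how often you hit rate limit errors in the first place. Below is a minimal sketch using a rolling window; the 60-requests-per-minute default is a placeholder, so set it from your account's actual quota.

import threading
import time

class RequestRateLimiter:
    """Simple client-side limiter: allow at most `max_requests` per `period` seconds."""

    def __init__(self, max_requests: int = 60, period: float = 60.0):
        self.max_requests = max_requests
        self.period = period
        self.request_times = []  # timestamps of recent requests
        self.lock = threading.Lock()

    def wait_for_slot(self):
        """Block until a request can be made without exceeding the limit."""
        while True:
            with self.lock:
                now = time.monotonic()
                # Drop timestamps that fell outside the rolling window
                self.request_times = [t for t in self.request_times if now - t < self.period]
                if len(self.request_times) < self.max_requests:
                    self.request_times.append(now)
                    return
                # Sleep until the oldest request leaves the window
                sleep_for = self.period - (now - self.request_times[0])
            time.sleep(max(sleep_for, 0.1))

# Usage: throttle before each call to safe_api_call / extract_with_retry
rate_limiter = RequestRateLimiter(max_requests=60, period=60.0)

def rate_limited_extract(html_content: str, prompt: str):
    rate_limiter.wait_for_slot()
    return safe_api_call(html_content, prompt)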

Best Practices Summary

  1. Always implement retry logic with exponential backoff for transient errors
  2. Validate all responses using schema validation libraries like Pydantic
  3. Implement fallback mechanisms to traditional scraping when GPT fails
  4. Monitor and log errors to identify patterns and optimize your pipeline
  5. Handle token limits proactively by truncating or chunking content
  6. Sanitize input content to avoid content policy violations
  7. Use timeouts to prevent indefinite waiting on API calls
  8. Cache successful extractions to reduce API calls and costs (points 7 and 8 are sketched below)
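
Points 7 and 8 don't appear in the extraction code above, so here is a minimal sketch of both, assuming the openai v1.x client (which accepts a per-client timeout in seconds) and a plain in-memory dict as the cache. Swap the dict for Redis or a database if your pipeline runs across multiple processes.

import hashlib
import os
from openai import OpenAI

# Point 7: an explicit timeout (in seconds) so a stuck request fails fast.
# In production you would pass this shared client into extract_with_retry
# instead of letting it construct a new client on every call.
client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    timeout=30.0,
    max_retries=0,  # retries are handled by extract_with_retry
)

# Point 8: cache validated extractions in memory, keyed on prompt + HTML.
_extraction_cache = {}

def cached_extract(prompt: str, html_content: str):
    cache_key = hashlib.sha256(f"{prompt}\n{html_content}".encode()).hexdigest()

    if cache_key in _extraction_cache:
        return _extraction_cache[cache_key]

    response = extract_with_retry(prompt, html_content)
    validated = validate_and_parse_response(response)

    if validated is not None:
        # Only cache responses that passed validation, so failures are retried later
        _extraction_cache[cache_key] = validated

    return validated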

By implementing these error handling strategies, you'll build a robust GPT-powered data extraction pipeline that gracefully handles failures and maintains high reliability. Similar to how you handle errors in traditional scraping tools, defensive programming and comprehensive error handling are essential for production systems.

For complex scraping workflows that require browser automation alongside GPT extraction, consider implementing timeout handling strategies to ensure your entire pipeline remains responsive and reliable.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
