What is the Accuracy of ChatGPT for Data Extraction?

ChatGPT's accuracy for data extraction typically ranges from 85-95% depending on data complexity, prompt engineering quality, and the specific GPT model used. While traditional parsing methods like XPath or CSS selectors achieve near-100% accuracy on structured data, ChatGPT excels at handling unstructured content, varying layouts, and complex semantic understanding where rule-based approaches fail.

Understanding ChatGPT Extraction Accuracy

The accuracy of ChatGPT for data extraction isn't a single fixed number—it varies significantly based on several factors:

Factors Affecting Accuracy

  1. Data Structure Complexity: Simple, well-formatted data (like product prices or titles) can achieve 95%+ accuracy, while extracting nuanced information (like sentiment or implied relationships) typically achieves 85-90% accuracy.

  2. Prompt Quality: Well-engineered prompts with clear examples and constraints can improve accuracy by 10-20% compared to basic prompts.

  3. Model Version: GPT-4 generally achieves 10-15% higher accuracy than GPT-3.5-turbo for complex extraction tasks.

  4. Content Consistency: Pages with consistent formatting yield higher accuracy than highly variable layouts.

Accuracy by Data Type

Different types of data extraction show varying accuracy levels:

| Data Type | Typical Accuracy | Best Use Case |
|-----------|------------------|---------------|
| Product names/titles | 90-95% | E-commerce scraping |
| Numerical data (prices, dates) | 85-92% | Financial data, pricing |
| Contact information | 88-94% | Business directories |
| Article summaries | 85-90% | Content aggregation |
| Sentiment/opinion | 80-88% | Review analysis |
| Complex relationships | 75-85% | Graph/network data |

Practical Implementation

Here's how to implement ChatGPT-based data extraction with error handling to maximize accuracy:

Python Example with OpenAI API

import json
from typing import Dict

from openai import OpenAI

client = OpenAI(api_key="your-api-key")

def extract_product_data(html_content: str) -> Dict:
    """
    Extract product information from HTML using ChatGPT.
    Returns structured data with confidence indicators.
    """
    # Truncate the HTML up front so the prompt stays within the model's context window
    truncated_html = html_content[:4000]

    prompt = f"""
    Extract product information from the following HTML.
    Return ONLY valid JSON with these exact fields:
    - name: product name
    - price: numerical price value only
    - currency: currency code (USD, EUR, etc.)
    - availability: in_stock or out_of_stock
    - rating: numerical rating (0-5)

    HTML:
    {truncated_html}

    JSON output:
    """

    try:
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": "You are a precise data extraction tool. Always return valid JSON."},
                {"role": "user", "content": prompt}
            ],
            temperature=0.1,  # Low temperature for consistency
            max_tokens=500
        )

        extracted_data = json.loads(response.choices[0].message.content)
        return {
            "data": extracted_data,
            "confidence": "high" if response.choices[0].finish_reason == "stop" else "low"
        }
    except json.JSONDecodeError:
        return {"error": "Invalid JSON response", "confidence": "none"}
    except Exception as e:
        return {"error": str(e), "confidence": "none"}

# Example usage
html = """
<div class="product">
    <h1>Wireless Headphones Pro</h1>
    <span class="price">$149.99</span>
    <div class="stock">In Stock</div>
    <div class="rating">4.5 stars</div>
</div>
"""

result = extract_product_data(html)
print(json.dumps(result, indent=2))

JavaScript Example with OpenAI SDK

import OpenAI from 'openai';

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY
});

async function extractStructuredData(htmlContent) {
  const prompt = `
    Extract all contact information from this HTML.
    Return ONLY valid JSON: an object with a "contacts" array, where each contact contains:
    - name: person/company name
    - email: email address
    - phone: phone number
    - role: job title or role

    If a field is not found, use null.

    HTML:
    ${htmlContent.substring(0, 4000)}

    JSON:
  `;

  try {
    const response = await openai.chat.completions.create({
      model: "gpt-4",
      messages: [
        {
          role: "system",
          content: "You extract data with perfect JSON formatting. Never add explanatory text."
        },
        {
          role: "user",
          content: prompt
        }
      ],
      temperature: 0.2,
      response_format: { type: "json_object" } // Ensures valid JSON
    });

    const extracted = JSON.parse(response.choices[0].message.content);

    return {
      data: extracted,
      tokensUsed: response.usage.total_tokens,
      accuracy: estimateAccuracy(extracted)
    };
  } catch (error) {
    console.error('Extraction error:', error);
    return { error: error.message };
  }
}

function estimateAccuracy(data) {
  // Simple heuristic: completeness of returned fields indicates accuracy
  const records = Array.isArray(data) ? data : (data.contacts || [data]);
  const fields = records.flatMap(r => Object.values(r));
  const filledFields = fields.filter(f => f !== null && f !== '');
  return fields.length > 0 ? (filledFields.length / fields.length * 100).toFixed(1) : 0;
}

// Usage
const html = `
  <div class="team">
    <div class="member">
      <h3>John Smith</h3>
      <p>CEO</p>
      <a href="mailto:john@example.com">john@example.com</a>
      <span>+1-555-0123</span>
    </div>
  </div>
`;

const result = await extractStructuredData(html);
console.log(result);

Improving Extraction Accuracy

1. Use Structured Output Formatting

When working with AI-powered web scraping tools, force JSON output to improve parsing reliability:

# Use function calling (tools) to guarantee the output structure
tools = [
    {
        "type": "function",
        "function": {
            "name": "extract_product",
            "description": "Extract product information",
            "parameters": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "price": {"type": "number"},
                    "currency": {"type": "string"},
                    "in_stock": {"type": "boolean"}
                },
                "required": ["name", "price"]
            }
        }
    }
]

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": html_content}],
    tools=tools,
    tool_choice={"type": "function", "function": {"name": "extract_product"}}
)

# The arguments come back as a JSON string matching the declared schema
extracted = json.loads(response.choices[0].message.tool_calls[0].function.arguments)

2. Implement Validation and Retry Logic

def validated_extraction(html: str, max_retries: int = 3) -> Dict:
    """Extract data with validation and automatic retries."""
    for attempt in range(max_retries):
        result = extract_product_data(html)

        if validate_extraction(result):
            return result

        # Prepend a retry hint so it is included in the next prompt
        html = f"RETRY {attempt+1}: Please ensure all fields are filled accurately.\n{html}"

    return {"error": "Max retries exceeded", "partial_data": result}

def validate_extraction(result: Dict) -> bool:
    """Validate extracted data meets quality thresholds."""
    if "error" in result:
        return False

    data = result.get("data", {})

    # Check required fields exist
    required = ["name", "price"]
    if not all(field in data for field in required):
        return False

    # Validate data types
    if not isinstance(data.get("price"), (int, float)):
        return False

    # Check for reasonable values
    if data.get("price", 0) <= 0 or data.get("price", 0) > 1000000:
        return False

    return True

3. Use Few-Shot Examples

Providing examples in your prompt dramatically improves accuracy:

few_shot_prompt = """
Extract product data following these examples:

Example 1:
HTML: <div><h2>Blue Widget</h2><p>$29.99</p></div>
Output: {"name": "Blue Widget", "price": 29.99, "currency": "USD"}

Example 2:
HTML: <div><span>Red Gadget</span><span>€45.50</span></div>
Output: {"name": "Red Gadget", "price": 45.50, "currency": "EUR"}

Now extract from this HTML:
{actual_html}

Output:
"""

4. Combine with Traditional Methods

For maximum accuracy, use ChatGPT alongside traditional scraping when dealing with dynamic websites and AJAX requests:

import puppeteer from 'puppeteer';

async function hybridExtraction(url) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'networkidle2' });

  // Extract with traditional selectors where possible
  const structuredData = await page.evaluate(() => {
    return {
      title: document.querySelector('h1')?.textContent,
      price: document.querySelector('.price')?.textContent,
      // Traditional selectors for simple, reliable data
    };
  });

  // Get full HTML for complex extraction with ChatGPT
  const html = await page.content();
  await browser.close();

  // Use ChatGPT only for the complex/unstructured parts.
  // extractWithChatGPT is a placeholder for a ChatGPT call like extractStructuredData above.
  const aiExtracted = await extractWithChatGPT(html, [
    'product_description',
    'key_features',
    'customer_sentiment'
  ]);

  return { ...structuredData, ...aiExtracted };
}

Measuring and Monitoring Accuracy

Track extraction accuracy over time:

import logging
from datetime import datetime
from typing import Dict, Optional

class ExtractionMonitor:
    def __init__(self):
        self.results = []

    def log_extraction(self, url: str, extracted: Dict, ground_truth: Optional[Dict] = None):
        """Log extraction results for accuracy analysis."""
        entry = {
            "timestamp": datetime.now().isoformat(),
            "url": url,
            "extracted": extracted,
            "has_error": "error" in extracted
        }

        if ground_truth:
            entry["accuracy"] = self.calculate_accuracy(extracted, ground_truth)

        self.results.append(entry)
        logging.info(f"Extraction logged: {entry}")

    def calculate_accuracy(self, extracted: Dict, ground_truth: Dict) -> float:
        """Calculate field-level accuracy."""
        matches = sum(
            1 for k, v in ground_truth.items()
            if extracted.get(k) == v
        )
        return (matches / len(ground_truth)) * 100 if ground_truth else 0

    def get_statistics(self) -> Dict:
        """Get accuracy statistics."""
        accuracies = [r["accuracy"] for r in self.results if "accuracy" in r]
        errors = sum(1 for r in self.results if r["has_error"])

        return {
            "total_extractions": len(self.results),
            "average_accuracy": sum(accuracies) / len(accuracies) if accuracies else 0,
            "error_rate": (errors / len(self.results) * 100) if self.results else 0
        }

# Usage
monitor = ExtractionMonitor()

extracted = extract_product_data(html)
ground_truth = {"name": "Wireless Headphones Pro", "price": 149.99}
monitor.log_extraction("https://example.com/product", extracted, ground_truth)

print(monitor.get_statistics())

Cost vs. Accuracy Trade-offs

When implementing ChatGPT for web scraping, consider:

  • GPT-4: Higher accuracy (90-95%) but 10-15x more expensive
  • GPT-3.5-turbo: Lower accuracy (80-88%) but much cheaper for bulk extraction
  • Hybrid approach: Use GPT-3.5 with GPT-4 validation for failed extractions

def cost_optimized_extraction(html: str) -> Dict:
    """Try cheaper model first, escalate to GPT-4 if validation fails."""
    # Try GPT-3.5 first
    result = extract_with_model(html, "gpt-3.5-turbo")

    if validate_extraction(result) and result.get("confidence") == "high":
        return result

    # Escalate to GPT-4 for difficult cases
    return extract_with_model(html, "gpt-4")
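
The extract_with_model helper above isn't shown in the article; a minimal sketch, reusing the client object, prompt style, and confidence convention from the earlier Python example, could look like this:

def extract_with_model(html: str, model: str) -> Dict:
    """Sketch: run the extraction prompt against a specific model."""
    try:
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": "You are a precise data extraction tool. Always return valid JSON."},
                {"role": "user", "content": f"Extract product data as JSON (name, price, currency) from:\n{html[:4000]}"}
            ],
            temperature=0.1,
            max_tokens=500
        )
        data = json.loads(response.choices[0].message.content)
        return {"data": data, "confidence": "high" if response.choices[0].finish_reason == "stop" else "low"}
    except Exception as e:
        # Any API or JSON error is treated as a failed, zero-confidence extraction
        return {"error": str(e), "confidence": "none"}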

Conclusion

ChatGPT achieves 85-95% accuracy for data extraction tasks, with the exact figure depending on data complexity, prompt engineering, and model choice. While not as precise as traditional selectors for simple structured data, ChatGPT excels at handling varied layouts, understanding context, and extracting semantic information that would be difficult or impossible with rule-based approaches.

For production use, combine ChatGPT with validation, retry logic, and traditional scraping methods to achieve the reliability needed while leveraging AI's flexibility for complex extraction scenarios.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
