How Do I Handle LLM Hallucinations When Extracting Data from Web Pages?
LLM hallucinations—when large language models generate plausible but incorrect or fabricated information—pose a significant challenge in AI-powered web scraping. While AI web scraping offers powerful advantages over traditional methods, ensuring data accuracy requires implementing validation strategies, prompt engineering techniques, and verification workflows to minimize hallucinations and maintain data integrity.
Understanding LLM Hallucinations in Web Scraping
Hallucinations occur when an LLM generates information that wasn't present in the source data. In web scraping contexts, this might manifest as:
- Invented values: Creating prices, dates, or numbers that don't exist on the page
- Assumed information: Inferring details based on patterns rather than actual content
- Merged data: Combining information from multiple elements incorrectly
- Format inconsistencies: Converting data to formats that weren't in the original source
- Missing data fabrication: Making up values when the requested field is absent
Understanding when and why hallucinations occur is the first step toward preventing them.
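For instance, given a product page that shows a name and a price but no rating, a faithful extraction and a hallucinated one might look like this (illustrative values only):

```python
# What the page actually supports:
grounded = {'name': 'Acme Widget', 'price': 19.99, 'rating': None}

# What a hallucinating model might return instead, filling the gap
# with a plausible-looking value:
hallucinated = {'name': 'Acme Widget', 'price': 19.99, 'rating': 4.5}
```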
Why LLMs Hallucinate During Data Extraction
Several factors contribute to hallucinations in web scraping scenarios:
1. Ambiguous Instructions
Vague prompts or field descriptions lead LLMs to make assumptions:
```python
# Problematic - too vague
fields = {
    'price': 'price'  # What if multiple prices exist?
}

# Better - explicit and specific
fields = {
    'price': 'The final checkout price in USD, after all discounts, excluding shipping'
}
```
2. Missing Data Handling
When requested information doesn't exist, LLMs may fabricate plausible values instead of returning null:
```javascript
// The LLM might invent a shipping cost if none is displayed
const data = await client.getFields(url, {
  'shipping_cost': 'shipping cost' // Risky if not always present
});
```
3. Pattern Recognition Over-reliance
LLMs might extrapolate patterns from training data rather than extracting actual content:
```python
# If the page doesn't show a rating, the LLM might assume "4.5 stars"
# based on common e-commerce patterns
result = client.get_question(
    url,
    'What is the product rating?'
)
```
4. Context Window Limitations
Large pages may exceed the LLM's context window, causing incomplete analysis and potential hallucinations about unseen content.
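One partial mitigation is to shrink the page before extraction. The sketch below is a minimal, illustrative example (assuming BeautifulSoup; the tag list and character limit are arbitrary assumptions, not fixed recommendations) that strips boilerplate markup so more of the relevant content fits into the model's context:

```python
from bs4 import BeautifulSoup

def trim_for_context(html, max_chars=32000):
    """Drop boilerplate markup that rarely contains target data,
    so more of the relevant page fits in the LLM's context window."""
    soup = BeautifulSoup(html, 'html.parser')
    # Scripts, styles, and navigation chrome consume tokens
    # without contributing extractable content
    for tag in soup(['script', 'style', 'noscript', 'nav', 'footer', 'svg']):
        tag.decompose()
    trimmed = str(soup)
    # Hard truncation is a last resort; prefer isolating the main content node
    return trimmed[:max_chars]
```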
Strategies to Prevent and Detect Hallucinations
1. Explicit Null Handling Instructions
Always instruct the LLM to return null or specific values for missing data:
```python
from webscraping_ai import WebScrapingAI

client = WebScrapingAI(api_key='YOUR_API_KEY')

# Clear instructions for handling missing data
fields = {
    'product_name': 'The exact product name as displayed. Return null if not found.',
    'list_price': 'The original price before discounts. Return null if no original price is shown.',
    'sale_price': 'The current discounted price. Return null if no sale is active.',
    'discount_percentage': 'The discount percentage. Return null if not explicitly stated.',
    'shipping_cost': 'The shipping cost in USD. Return "FREE" if free shipping, null if not mentioned.',
    'availability': 'In stock status. Return exactly "In Stock", "Out of Stock", or null if unclear.'
}

result = client.get_fields(
    url='https://example.com/product',
    fields=fields
)
```
2. Validation Against Source HTML
Cross-reference extracted data with the original HTML to verify presence:
```python
from bs4 import BeautifulSoup

def validate_extraction(url, extracted_data):
    """Verify extracted data actually exists in source HTML"""
    # Get the raw HTML
    html = client.get_html(url)
    soup = BeautifulSoup(html, 'html.parser')
    text_content = soup.get_text().lower()

    validation_results = {}

    for field, value in extracted_data.items():
        if value is None:
            validation_results[field] = {'valid': True, 'reason': 'Null value accepted'}
            continue

        # Check if the value appears in the source
        value_str = str(value).lower()

        # For numeric values, check if they exist in some form
        if isinstance(value, (int, float)):
            # Look for the number with various formatting
            patterns = [
                str(value),
                f"{value:,.2f}",
                f"${value}",
                str(value).replace('.', ',')
            ]
            found = any(pattern.lower() in text_content for pattern in patterns)
        else:
            # For text, check for exact or partial matches
            found = value_str in text_content

        validation_results[field] = {
            'valid': found,
            'reason': 'Found in source' if found else 'NOT FOUND - potential hallucination'
        }

    return validation_results

# Example usage
extracted = client.get_fields(
    url='https://example.com/product',
    fields=fields
)

validation = validate_extraction('https://example.com/product', extracted)

for field, result in validation.items():
    if not result['valid']:
        print(f"WARNING: {field} may be hallucinated - {result['reason']}")
        # Set hallucinated values to null
        extracted[field] = None
```
3. Multiple Extraction Attempts with Consistency Checking
Run the same extraction multiple times and verify consistency:
```python
from collections import Counter

def extract_with_consistency_check(url, fields, attempts=3, threshold=0.8):
    """Extract data multiple times and flag inconsistencies"""
    results = []
    for i in range(attempts):
        result = client.get_fields(url, fields)
        results.append(result)

    # Check consistency across attempts
    consensus_data = {}
    for field in fields.keys():
        values = [r.get(field) for r in results]

        # Count occurrences of each value
        value_counts = Counter(str(v) for v in values)
        most_common_value, count = value_counts.most_common(1)[0]

        # Calculate agreement percentage
        agreement = count / attempts

        if agreement >= threshold:
            # Recover the original (non-string) value from the first
            # attempt that produced the consensus answer
            consensus_data[field] = next(
                r.get(field) for r in results
                if str(r.get(field)) == most_common_value
            )
        else:
            # Flag as inconsistent - likely hallucination
            consensus_data[field] = None
            print(f"WARNING: Inconsistent results for {field}: {values}")

    return consensus_data

# Usage
reliable_data = extract_with_consistency_check(
    url='https://example.com/product',
    fields=fields,
    attempts=3
)
```
4. Structured Output with Schema Validation
Use strict schema definitions to constrain LLM outputs:
```javascript
const WebScrapingAI = require('webscraping.ai');
const Joi = require('joi'); // Schema validation library

const client = new WebScrapingAI('YOUR_API_KEY');

// Define strict schema
const productSchema = Joi.object({
  product_name: Joi.string().required(),
  price: Joi.number().min(0).max(1000000).allow(null),
  currency: Joi.string().valid('USD', 'EUR', 'GBP').allow(null),
  in_stock: Joi.boolean().allow(null),
  rating: Joi.number().min(0).max(5).allow(null),
  review_count: Joi.number().integer().min(0).allow(null),
  shipping_days: Joi.number().integer().min(0).max(365).allow(null)
});

async function extractAndValidate(url) {
  const fields = {
    'product_name': 'Exact product name. Return null if not found.',
    'price': 'Numeric price value only, no currency symbols. Return null if not shown.',
    'currency': 'Currency code (USD, EUR, or GBP). Return null if unclear.',
    'in_stock': 'Boolean true if in stock, false if out of stock, null if unclear.',
    'rating': 'Average rating as a number between 0 and 5. Return null if not shown.',
    'review_count': 'Total number of reviews as an integer. Return null if not shown.',
    'shipping_days': 'Estimated shipping time in days as an integer. Return null if not mentioned.'
  };

  const extracted = await client.getFields(url, fields);

  // Validate against schema
  const { error } = productSchema.validate(extracted, {
    abortEarly: false,
    stripUnknown: true
  });

  if (error) {
    console.error('Validation errors (potential hallucinations):');
    error.details.forEach(detail => {
      console.error(`  - ${detail.path}: ${detail.message}`);
      // Set invalid fields to null
      const field = detail.path[0];
      extracted[field] = null;
    });
  }

  return extracted;
}

// Usage
extractAndValidate('https://example.com/product')
  .then(data => console.log('Validated data:', data))
  .catch(err => console.error('Extraction failed:', err));
```
5. Grounding with Source Attribution
Request that the LLM provide source references for extracted data:
```python
import json

def extract_with_attribution(url):
    """Extract data with source attribution to verify against hallucinations"""
    # Modified prompt requesting attribution, with explicit JSON keys
    # so the response can be parsed programmatically
    question = """
    Extract the following information from this product page.
    For each field, return an object with two keys:
    "value" (the extracted value, or null if the information is not present) and
    "source_text" (the exact text snippet from the page where you found it).

    Extract:
    - product_name
    - current_price
    - original_price (if on sale)
    - availability

    Format your response as a single JSON object keyed by field name.
    """

    result = client.get_question(url, question)

    # Parse and validate the attributed response
    try:
        data = json.loads(result)

        # Verify each attributed snippet actually appears in the page
        html = client.get_html(url)
        for field, info in data.items():
            if info.get('source_text') and info['source_text'] not in html:
                print(f"WARNING: Source text for {field} not found in HTML - possible hallucination")
                info['value'] = None

        return {k: v.get('value') for k, v in data.items()}
    except json.JSONDecodeError:
        print("ERROR: Could not parse attributed response")
        return None

result = extract_with_attribution('https://example.com/product')
```
6. Confidence Scores and Uncertainty Detection
Explicitly ask the LLM to indicate confidence levels:
```python
import re

def extract_with_confidence(url, fields):
    """Extract data with confidence scores to identify potential hallucinations"""
    enhanced_fields = {}
    for field, description in fields.items():
        enhanced_fields[field] = (
            f"{description} Append your confidence (0-100) that this information "
            "is explicitly stated on the page, formatted as 'confidence: <number>'."
        )

    result = client.get_fields(url, enhanced_fields)

    # Parse confidence scores
    high_confidence_data = {}
    flagged_fields = []

    for field, value in result.items():
        # Check if the value carries a confidence indicator
        if isinstance(value, str):
            match = re.search(r'confidence:\s*(\d+)', value, re.IGNORECASE)
        else:
            match = None

        if match:
            # Split the value from its confidence score
            actual_value = value[:match.start()].strip().rstrip(',.')
            confidence = int(match.group(1))

            if confidence >= 80:
                high_confidence_data[field] = actual_value
            else:
                flagged_fields.append((field, actual_value, confidence))
                high_confidence_data[field] = None
        else:
            high_confidence_data[field] = value

    if flagged_fields:
        print("Low confidence fields (potential hallucinations):")
        for field, value, conf in flagged_fields:
            print(f"  {field}: {value} (confidence: {conf}%)")

    return high_confidence_data
```
7. Dual-Model Verification
Use multiple LLM models and compare results:
```python
import json
from openai import OpenAI

def multi_model_extraction(url, fields):
    """Extract using multiple LLMs and flag discrepancies"""
    # WebScrapingAI (uses various models)
    ws_result = client.get_fields(url, fields)

    # For critical applications, verify with direct LLM calls
    html = client.get_html(url)

    # Using OpenAI GPT-4 (truncate the HTML to fit the context window)
    openai_client = OpenAI(api_key='YOUR_OPENAI_KEY')
    gpt_prompt = f"""
    Extract the following fields from this HTML and return a JSON object:
    {json.dumps(fields, indent=2)}

    Return only exact information found in the HTML. Use null for missing data.

    HTML:
    {html[:8000]}
    """

    gpt_response = openai_client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": gpt_prompt}],
        response_format={"type": "json_object"}
    )
    gpt_result = json.loads(gpt_response.choices[0].message.content)

    # Compare results
    consensus = {}
    discrepancies = []

    for field in fields.keys():
        ws_val = ws_result.get(field)
        gpt_val = gpt_result.get(field)

        if ws_val == gpt_val:
            consensus[field] = ws_val
        else:
            discrepancies.append({
                'field': field,
                'webscraping_ai': ws_val,
                'gpt4': gpt_val
            })
            # Default to null on discrepancy
            consensus[field] = None

    if discrepancies:
        print("Model discrepancies detected (potential hallucinations):")
        print(json.dumps(discrepancies, indent=2))

    return consensus
```
Best Practices for Hallucination Prevention
1. Design Hallucination-Resistant Prompts
```python
# Poor prompt - encourages hallucination
bad_fields = {
    'specifications': 'product specs'
}

# Good prompt - discourages hallucination
good_fields = {
    'specifications': '''
        List only the technical specifications explicitly shown in a specifications
        table or list on this page. Do not infer specifications from the description.
        Return null if no specifications table exists. Format as a dictionary with
        exact specification names as keys.
    '''
}
```
2. Implement Fallback to Traditional Scraping
For critical data, use traditional HTML parsing as a verification layer:
```python
from bs4 import BeautifulSoup
import re

def hybrid_extraction(url):
    """Combine AI extraction with traditional parsing for validation"""
    # AI extraction
    ai_result = client.get_fields(
        url,
        {
            'price': 'Current price as a number',
            'title': 'Product title'
        }
    )

    # Traditional parsing as validation
    html = client.get_html(url)
    soup = BeautifulSoup(html, 'html.parser')

    # Look for common price patterns
    price_patterns = [
        r'\$\s*(\d+\.?\d*)',
        r'(\d+\.?\d*)\s*USD',
        r'Price:\s*\$?(\d+\.?\d*)'
    ]

    traditional_price = None
    for pattern in price_patterns:
        match = re.search(pattern, soup.get_text())
        if match:
            traditional_price = float(match.group(1))
            break

    # Validate AI result against traditional parsing
    # (coerce the AI value to a number, since LLMs may return strings)
    try:
        ai_price = float(ai_result.get('price')) if ai_result.get('price') is not None else None
    except (TypeError, ValueError):
        ai_price = None

    if ai_price is not None and traditional_price is not None:
        if abs(ai_price - traditional_price) > 0.01:
            print(f"WARNING: Price mismatch - AI: {ai_price}, Traditional: {traditional_price}")
            # Use the traditional result when there's a conflict
            ai_result['price'] = traditional_price

    return ai_result
```
3. Maintain Audit Trails
Log extraction attempts with timestamps and source URLs for traceability:
```python
import logging
from datetime import datetime

# Configure logging
logging.basicConfig(
    filename='scraping_audit.log',
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)

def extract_with_audit(url, fields):
    """Extract data with comprehensive audit logging"""
    start_time = datetime.now()

    try:
        result = client.get_fields(url, fields)

        # Log successful extraction
        logging.info(f"""
        Extraction successful
        URL: {url}
        Fields requested: {list(fields.keys())}
        Fields returned: {list(result.keys())}
        Null fields: {[k for k, v in result.items() if v is None]}
        Duration: {(datetime.now() - start_time).total_seconds()}s
        """)

        return result
    except Exception as e:
        logging.error(f"""
        Extraction failed
        URL: {url}
        Error: {str(e)}
        Duration: {(datetime.now() - start_time).total_seconds()}s
        """)
        raise
```
4. Set Realistic Expectations
Understand that some hallucination risk is inherent to LLM-based extraction:
```python
def extract_with_risk_assessment(url, fields):
    """Classify fields by hallucination risk"""
    # Categorize field types by risk
    high_risk_fields = []    # Subjective or inferable
    medium_risk_fields = []  # Semi-structured data
    low_risk_fields = []     # Highly structured, easy to verify

    risk_keywords = {
        'high': ['summary', 'description', 'opinion', 'recommendation'],
        'medium': ['features', 'specifications', 'details'],
        'low': ['price', 'date', 'quantity', 'stock']
    }

    for field, description in fields.items():
        desc_lower = description.lower()
        if any(kw in desc_lower for kw in risk_keywords['high']):
            high_risk_fields.append(field)
        elif any(kw in desc_lower for kw in risk_keywords['medium']):
            medium_risk_fields.append(field)
        else:
            low_risk_fields.append(field)

    result = client.get_fields(url, fields)

    # Add risk metadata
    result['_metadata'] = {
        'high_risk_fields': high_risk_fields,
        'medium_risk_fields': medium_risk_fields,
        'low_risk_fields': low_risk_fields,
        'note': 'High risk fields should be manually verified'
    }

    return result
```
Monitoring and Detection in Production
Real-time Hallucination Detection
```python
import statistics

class HallucinationDetector:
    def __init__(self, client):
        self.client = client

    def detect_anomalies(self, extracted_data, historical_data):
        """Compare current extraction against historical patterns"""
        anomalies = []

        for field, value in extracted_data.items():
            if value is None:
                continue

            # Check if value is a statistical outlier
            historical_values = [d.get(field) for d in historical_data if d.get(field) is not None]

            if historical_values and isinstance(value, (int, float)):
                mean = statistics.mean(historical_values)
                stdev = statistics.stdev(historical_values) if len(historical_values) > 1 else 0

                # Flag values more than 3 standard deviations from the mean
                if stdev > 0 and abs(value - mean) > 3 * stdev:
                    anomalies.append({
                        'field': field,
                        'value': value,
                        'mean': mean,
                        'stdev': stdev,
                        'reason': 'Statistical outlier - possible hallucination'
                    })

        return anomalies

# Usage
detector = HallucinationDetector(client)

# Store historical extractions
historical_products = [
    {'price': 29.99, 'rating': 4.5},
    {'price': 31.50, 'rating': 4.3},
    {'price': 28.75, 'rating': 4.7}
]

current_extraction = client.get_fields(url, fields)
anomalies = detector.detect_anomalies(current_extraction, historical_products)

if anomalies:
    print("Potential hallucinations detected:")
    for anomaly in anomalies:
        print(f"  {anomaly['field']}: {anomaly['value']} (expected ~{anomaly['mean']:.2f})")
```
Handling Hallucinations When Detected
When you detect potential hallucinations:
- Flag for manual review: Queue suspicious extractions for human verification
- Fall back to traditional methods: Use CSS selectors or XPath as a backup
- Request re-extraction: Retry with improved prompts
- Return null values: Better to have missing data than incorrect data
- Log for model improvement: Report patterns to improve future extractions
```python
def handle_suspected_hallucination(url, field, suspected_value):
    """Decision tree for handling detected hallucinations"""
    # Step 1: Verify against source
    html = client.get_html(url)

    if str(suspected_value) not in html:
        print(f"Value '{suspected_value}' not found in source - likely hallucination")

        # Step 2: Attempt traditional extraction
        soup = BeautifulSoup(html, 'html.parser')
        # ... traditional extraction logic ...

        # Step 3: If traditional extraction fails, return null
        return None

    # Step 4: Re-extract with stricter prompt
    strict_result = client.get_fields(
        url,
        {
            field: f'ONLY extract {field} if explicitly shown. Return exact value or null. Do not infer or estimate.'
        }
    )

    return strict_result.get(field)
```
Conclusion
Handling LLM hallucinations in web scraping requires a multi-layered approach combining clear prompt engineering, validation against source HTML, consistency checking, schema validation, and monitoring. While hallucinations cannot be eliminated entirely, implementing these strategies significantly reduces their occurrence and impact.
The key is to treat AI extraction as part of a larger validation pipeline rather than a complete solution. By combining AI-powered data extraction with traditional parsing methods, schema validation, and human oversight where critical, you can leverage the power of LLMs while maintaining data accuracy and reliability.
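As a closing illustration, here is a minimal sketch of such a pipeline, composing the helper functions defined earlier in this article (extract_with_consistency_check and validate_extraction) and preferring nulls over unverified values:

```python
def extraction_pipeline(url, fields):
    """Illustrative pipeline: extract with consistency checking,
    then null out anything that cannot be verified in the source HTML."""
    # Step 1: cross-check multiple extraction attempts
    data = extract_with_consistency_check(url, fields)

    # Step 2: verify surviving values against the raw page
    validation = validate_extraction(url, data)
    for field, outcome in validation.items():
        if not outcome['valid']:
            data[field] = None  # missing beats fabricated

    return data

clean_data = extraction_pipeline('https://example.com/product', fields)
```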
Always remember: it's better to return null for uncertain data than to propagate hallucinated information into your systems. Implement robust validation, maintain audit trails, and continuously monitor extraction quality to ensure your AI-powered web scraping remains accurate and trustworthy.