What is LLM Hallucination and How Can I Prevent It in Web Scraping?
LLM hallucination occurs when large language models generate information that appears plausible but is not actually present in the source data. In web scraping, this means the AI might extract or create data that doesn't exist on the scraped page, leading to inaccurate results and unreliable data pipelines.
Understanding and preventing hallucinations is critical for anyone using AI-powered web scraping tools, as data accuracy is paramount in production systems.
Understanding LLM Hallucination in Web Scraping
When you use an LLM to extract data from HTML content, the model processes the text and attempts to identify the requested information. However, LLMs are trained to be helpful and complete, which can lead them to:
- Fill in missing data with plausible-sounding values
- Infer relationships that don't explicitly exist in the source
- Generate synthetic examples when patterns are ambiguous
- Confuse similar entities from their training data with actual page content
For example, if you ask an LLM to extract a product price from a page that doesn't list one, it might generate a reasonable-looking price based on similar products it has seen during training, rather than returning null or indicating the data is unavailable.
Common Hallucination Scenarios in Web Scraping
1. Missing Data Fabrication
When a field is absent from the HTML, LLMs may generate plausible values:
# Example: LLM hallucinating missing phone numbers
html_content = """
<div class="contact">
<h3>John's Restaurant</h3>
<p>123 Main Street</p>
<!-- No phone number listed -->
</div>
"""
# Hallucinated output might include:
# {"name": "John's Restaurant", "phone": "(555) 123-4567"}
# The phone number was fabricated by the LLM
2. Structural Assumptions
LLMs may impose structure that doesn't exist in the source:
// JavaScript example using the OpenAI SDK
const OpenAI = require('openai');
const openai = new OpenAI();
const htmlSnippet = `
<ul>
<li>Red apples - fresh</li>
<li>Green grapes</li>
<li>Yellow bananas - organic</li>
</ul>
`;
// LLM might hallucinate consistent structure:
// [
// {"name": "Red apples", "quality": "fresh", "organic": false},
// {"name": "Green grapes", "quality": "fresh", "organic": false}, // hallucinated "quality"
// {"name": "Yellow bananas", "quality": null, "organic": true}
// ]
3. Date and Number Inference
LLMs sometimes calculate or infer numerical data:
html = '<p>Product launched 5 years ago</p>'
# LLM might hallucinate: {"launch_date": "2019-01-15"}
# Instead of preserving the relative description
Proven Prevention Techniques
1. Use Structured Output Formats
When working with OpenAI function calling or similar structured output features, define strict schemas:
import openai
import json
# Define strict schema with required fields
function_schema = {
"name": "extract_product_data",
"description": "Extract product information from HTML",
"parameters": {
"type": "object",
"properties": {
"name": {
"type": "string",
"description": "Product name as it appears on page"
},
"price": {
"type": ["number", "null"],
"description": "Numeric price if explicitly stated, null otherwise"
},
"availability": {
"type": ["string", "null"],
"enum": ["in_stock", "out_of_stock", None],
"description": "Availability status only if explicitly mentioned"
}
},
"required": ["name"] # Only require fields that must exist
}
}
response = openai.ChatCompletion.create(
model="gpt-4",
messages=[
{"role": "system", "content": "Extract ONLY information explicitly present in the HTML. Use null for missing data. Never infer or generate data not in the source."},
{"role": "user", "content": f"Extract product data from: {html_content}"}
],
functions=[function_schema],
function_call={"name": "extract_product_data"}
)
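With the legacy 0.x Python SDK used above, the function-call arguments come back as a JSON string inside the response; a minimal sketch of reading them (the field access follows that SDK's response shape):

arguments = json.loads(response["choices"][0]["message"]["function_call"]["arguments"])
price = arguments.get("price")  # None when the page does not list a price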
2. Implement Explicit Null Handling
Instruct the LLM to use null values for missing data:
const OpenAI = require('openai');
const openai = new OpenAI();
const systemPrompt = `You are a precise data extraction tool. Rules:
1. Extract ONLY data explicitly present in the HTML
2. Return null for any missing fields
3. Do not infer, calculate, or generate any data
4. If a field is ambiguous, return null
5. Preserve exact text as it appears, do not normalize or standardize`;
const extractData = async (html) => {
const response = await openai.chat.completions.create({
model: "gpt-4-turbo",
messages: [
{ role: "system", content: systemPrompt },
{
role: "user",
content: `Extract data as JSON from this HTML:\n${html}\n\nReturn: {name: string|null, price: number|null, description: string|null}`
}
],
response_format: { type: "json_object" }
});
return JSON.parse(response.choices[0].message.content);
};
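Even with these rules, models occasionally return placeholder strings such as "N/A" or "unknown" instead of a real null. A small post-processing guard, sketched here in Python and independent of any particular SDK, normalizes those values before they enter your pipeline:

MISSING_MARKERS = {"", "n/a", "none", "null", "unknown", "not available"}

def normalize_nulls(record):
    """Convert placeholder strings the model may emit into real None values."""
    return {
        key: None if isinstance(value, str) and value.strip().lower() in MISSING_MARKERS else value
        for key, value in record.items()
    }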
3. Add Verification Layers
Implement multi-step verification to catch hallucinations:
def verify_extraction(html_content, extracted_data):
"""Verify extracted data against source HTML"""
verification_prompt = f"""
HTML Content: {html_content}
Extracted Data: {json.dumps(extracted_data)}
For each field in the extracted data:
1. Find the exact location in the HTML where it appears
2. Quote the relevant HTML snippet
3. Mark as VERIFIED if found, HALLUCINATED if not found
Return JSON: {{"field_name": {{"status": "VERIFIED|HALLUCINATED", "evidence": "html snippet or null"}}}}
"""
verification = llm_call(verification_prompt)
# Filter out hallucinated fields
verified_data = {
key: value
for key, value in extracted_data.items()
        if verification.get(key, {}).get("status") == "VERIFIED"
}
return verified_data
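The llm_call helper above is left abstract; one way it might be wired up, assuming the same legacy openai SDK as the earlier Python examples and a model that returns valid JSON:

import json
import openai

def llm_call(prompt):
    """Send the verification prompt and parse the model's JSON reply."""
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "Respond with valid JSON only."},
            {"role": "user", "content": prompt},
        ],
        temperature=0.0,
    )
    return json.loads(response["choices"][0]["message"]["content"])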
4. Use Confidence Scores
Request a confidence level for each extracted field, and treat the model's self-reported scores as a heuristic signal rather than a calibrated probability:
extraction_schema = {
"type": "object",
"properties": {
"data": {
"type": "object",
"properties": {
"title": {"type": "string"},
"price": {"type": "number"},
"rating": {"type": "number"}
}
},
"confidence": {
"type": "object",
"description": "Confidence score 0-1 for each field",
"properties": {
"title": {"type": "number", "minimum": 0, "maximum": 1},
"price": {"type": "number", "minimum": 0, "maximum": 1},
"rating": {"type": "number", "minimum": 0, "maximum": 1}
}
}
}
}
# Filter results based on confidence threshold
def filter_by_confidence(result, threshold=0.8):
return {
key: value
for key, value in result["data"].items()
if result["confidence"].get(key, 0) >= threshold
}
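As a quick illustration with made-up scores, the filter keeps only the fields the model reported with high confidence:

sample_result = {
    "data": {"title": "Widget Pro", "price": 29.99, "rating": 4.5},
    "confidence": {"title": 0.97, "price": 0.92, "rating": 0.35},
}
print(filter_by_confidence(sample_result))
# {'title': 'Widget Pro', 'price': 29.99} -- the low-confidence 'rating' is dropped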
5. Hybrid Approach: LLM + Traditional Parsing
Combine traditional selectors with LLM extraction to validate results:
from bs4 import BeautifulSoup
def hybrid_extraction(html):
soup = BeautifulSoup(html, 'html.parser')
# Traditional extraction
traditional_price = soup.select_one('.price')
traditional_title = soup.select_one('h1.product-title')
    # LLM extraction (extract_with_llm wraps your LLM-based extractor; see the stub below)
llm_data = extract_with_llm(html)
# Cross-validate
validated_data = {}
# If traditional parsing finds it, trust that over LLM
if traditional_title:
validated_data['title'] = traditional_title.text.strip()
    elif llm_data.get('title'):
# LLM found something when traditional parsing didn't - potential hallucination
validated_data['title'] = None # Mark as uncertain
if traditional_price:
        validated_data['price'] = float(traditional_price.text.strip().lstrip('$'))
elif llm_data.get('price'):
# Verify LLM price appears in the HTML text
if str(llm_data['price']) in html:
validated_data['price'] = llm_data['price']
else:
validated_data['price'] = None # Likely hallucinated
return validated_data
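To see the cross-check at work, here is a quick run with a stubbed-out extract_with_llm (a hypothetical stand-in for your LLM extractor) that simulates a hallucinated price on a page that lists none:

def extract_with_llm(html):
    # Stub that simulates an LLM hallucinating a price absent from the HTML
    return {"title": "Test Product", "price": 19.99}

html = '<div><h1 class="product-title">Test Product</h1></div>'
print(hybrid_extraction(html))
# {'title': 'Test Product', 'price': None} -- the fabricated price is discarded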
6. Few-Shot Examples with Null Cases
Provide examples that include missing data:
const fewShotPrompt = `Extract product data from HTML. Examples:
Example 1:
HTML: <div><h2>Widget Pro</h2><span class="price">$29.99</span></div>
Output: {"name": "Widget Pro", "price": 29.99, "rating": null}
Example 2:
HTML: <div><h2>Gadget</h2><p>Currently unavailable</p></div>
Output: {"name": "Gadget", "price": null, "rating": null}
Example 3:
HTML: <div><h2>Tool</h2><span>$15</span><span>4.5 stars</span></div>
Output: {"name": "Tool", "price": 15, "rating": 4.5}
Notice: Return null for any data not explicitly present.
Now extract from: ${htmlContent}`;
7. Temperature and Parameter Tuning
Lower temperature settings reduce creative hallucinations:
# More deterministic, less creative = fewer hallucinations
response = openai.ChatCompletion.create(
model="gpt-4",
messages=messages,
temperature=0.0, # Most deterministic
top_p=0.1, # Limit token sampling
frequency_penalty=0.0,
presence_penalty=0.0
)
Testing and Monitoring
Create Hallucination Test Cases
Build a test suite specifically for hallucination detection:
import pytest
def test_missing_price_returns_null():
html = '<div class="product"><h1>Test Product</h1></div>'
result = extract_product(html)
assert result['price'] is None, "Should return null for missing price, not hallucinate"
def test_no_fabricated_fields():
html = '<div><p>Simple text</p></div>'
result = extract_product(html)
expected_fields = {'name', 'price', 'description'}
actual_fields = set(result.keys())
assert actual_fields == expected_fields, f"Should not add extra fields: {actual_fields - expected_fields}"
def test_partial_data_not_completed():
html = '<div><h1>Product X</h1><p>Price coming soon</p></div>'
result = extract_product(html)
assert result['name'] == 'Product X'
assert result['price'] is None, "Should not generate price from 'coming soon' text"
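If your schema also exposes a launch_date field, a similar test (a sketch, assuming extract_product returns that field) guards against the relative-date inference shown in scenario 3:

def test_relative_date_not_converted():
    html = '<p>Product launched 5 years ago</p>'
    result = extract_product(html)
    # The relative phrase must not be turned into a concrete date
    assert result.get('launch_date') is None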
Monitor Extraction Accuracy
Implement logging to track potential hallucinations in production:
import logging
def extract_with_monitoring(html, expected_fields):
extracted = llm_extract(html)
# Check for suspicious patterns
    warnings = []
    # Flag fields the LLM returned that were never requested
    unexpected = set(extracted) - set(expected_fields)
    if unexpected:
        warnings.append(f"Unexpected fields returned: {unexpected}")
    # Check if extracted data contains values not in source
for field, value in extracted.items():
if value is not None and str(value) not in html:
warnings.append(f"Field '{field}' value '{value}' not found in source HTML")
# Check for overly consistent data (possible pattern hallucination)
if all(v is not None for v in extracted.values()):
if count_missing_selectors(html) > 0:
warnings.append("All fields populated despite missing HTML elements")
if warnings:
logging.warning(f"Potential hallucination detected: {warnings}")
logging.debug(f"HTML: {html[:200]}...")
logging.debug(f"Extracted: {extracted}")
return extracted, warnings
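The count_missing_selectors helper is assumed above; a minimal sketch, assuming you know which CSS selectors a complete page of this type should contain:

from bs4 import BeautifulSoup

EXPECTED_SELECTORS = ['.price', 'h1.product-title', '.rating']  # assumed page layout

def count_missing_selectors(html):
    """Count how many expected CSS selectors are absent from the HTML."""
    soup = BeautifulSoup(html, 'html.parser')
    return sum(1 for selector in EXPECTED_SELECTORS if soup.select_one(selector) is None)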
Best Practices Summary
- Explicit instructions: Always instruct the LLM to return null for missing data
- Structured schemas: Use strict type definitions with nullable fields
- Low temperature: Set temperature to 0 for maximum determinism
- Verification layer: Implement a second LLM call to verify extractions
- Hybrid validation: Cross-check LLM results with traditional parsing
- Confidence scoring: Request and filter by confidence scores
- Few-shot examples: Include examples with null/missing values
- Comprehensive testing: Build test suites targeting hallucination scenarios
- Production monitoring: Log and alert on suspicious extraction patterns
- Regular auditing: Manually review a sample of extractions periodically
When to Use Traditional vs. LLM Scraping
Given the hallucination risks, consider using traditional CSS selectors or XPath when:
- The page structure is consistent and predictable
- Exact accuracy is critical (financial data, medical information)
- The data format is standardized
- You're scraping high volumes where verification overhead is costly
Use LLM-powered extraction when:
- Page structures vary significantly across sources
- Data is embedded in natural language text
- You need semantic understanding of context
- Traditional selectors frequently break due to HTML changes
Conclusion
LLM hallucination is a significant challenge in AI-powered web scraping, but it can be effectively managed through careful prompt engineering, structured output validation, and hybrid verification approaches. By implementing the techniques outlined above, you can build reliable scraping systems that leverage the flexibility of LLMs while maintaining data accuracy.
The key is to treat LLMs as powerful but fallible tools that require guardrails. Always validate extracted data, use null values for missing information, and combine AI extraction with traditional parsing methods when possible. With proper safeguards, you can harness the power of LLM-based scraping while minimizing the risk of hallucinated data compromising your applications.