What is LLM Hallucination and How Can I Prevent It in Web Scraping?
LLM hallucination occurs when large language models generate information that appears plausible but is not actually present in the source data. In web scraping, this means the AI might extract or create data that doesn't exist on the scraped page, leading to inaccurate results and unreliable data pipelines.
Understanding and preventing hallucinations is critical for anyone using AI-powered web scraping tools, as data accuracy is paramount in production systems.
Understanding LLM Hallucination in Web Scraping
When you use an LLM to extract data from HTML content, the model processes the text and attempts to identify the requested information. However, LLMs are trained to be helpful and complete, which can lead them to:
- Fill in missing data with plausible-sounding values
- Infer relationships that don't explicitly exist in the source
- Generate synthetic examples when patterns are ambiguous
- Confuse similar entities from their training data with actual page content
For example, if you ask an LLM to extract a product price from a page that doesn't list one, it might generate a reasonable-looking price based on similar products it has seen during training, rather than returning null or indicating the data is unavailable.
Common Hallucination Scenarios in Web Scraping
1. Missing Data Fabrication
When a field is absent from the HTML, LLMs may generate plausible values:
# Example: LLM hallucinating missing phone numbers
html_content = """
<div class="contact">
<h3>John's Restaurant</h3>
<p>123 Main Street</p>
<!-- No phone number listed -->
</div>
"""
# Hallucinated output might include:
# {"name": "John's Restaurant", "phone": "(555) 123-4567"}
# The phone number was fabricated by the LLM
2. Structural Assumptions
LLMs may impose structure that doesn't exist in the source:
// JavaScript example using the OpenAI SDK
const OpenAI = require('openai');
const openai = new OpenAI();
const htmlSnippet = `
<ul>
<li>Red apples - fresh</li>
<li>Green grapes</li>
<li>Yellow bananas - organic</li>
</ul>
`;
// LLM might hallucinate consistent structure:
// [
// {"name": "Red apples", "quality": "fresh", "organic": false},
// {"name": "Green grapes", "quality": "fresh", "organic": false}, // hallucinated "quality"
// {"name": "Yellow bananas", "quality": null, "organic": true}
// ]
3. Date and Number Inference
LLMs sometimes calculate or infer numerical data:
html = '<p>Product launched 5 years ago</p>'
# LLM might hallucinate: {"launch_date": "2019-01-15"}
# Instead of preserving the relative description
Proven Prevention Techniques
1. Use Structured Output Formats
When working with OpenAI function calling or similar structured output features, define strict schemas:
import openai
import json
# Define strict schema with required fields
function_schema = {
"name": "extract_product_data",
"description": "Extract product information from HTML",
"parameters": {
"type": "object",
"properties": {
"name": {
"type": "string",
"description": "Product name as it appears on page"
},
"price": {
"type": ["number", "null"],
"description": "Numeric price if explicitly stated, null otherwise"
},
"availability": {
"type": ["string", "null"],
"enum": ["in_stock", "out_of_stock", None],
"description": "Availability status only if explicitly mentioned"
}
},
"required": ["name"] # Only require fields that must exist
}
}
response = openai.ChatCompletion.create(
model="gpt-4",
messages=[
{"role": "system", "content": "Extract ONLY information explicitly present in the HTML. Use null for missing data. Never infer or generate data not in the source."},
{"role": "user", "content": f"Extract product data from: {html_content}"}
],
functions=[function_schema],
function_call={"name": "extract_product_data"}
)
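With the legacy 0.x Python SDK used above, the function-call arguments come back as a JSON string inside the response; a minimal sketch of reading them (the field access follows that SDK's response shape):

arguments = json.loads(response["choices"][0]["message"]["function_call"]["arguments"])
price = arguments.get("price")  # None when the page does not list a price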
2. Implement Explicit Null Handling
Instruct the LLM to use null values for missing data:
const OpenAI = require('openai');
const openai = new OpenAI();
const systemPrompt = `You are a precise data extraction tool. Rules:
1. Extract ONLY data explicitly present in the HTML
2. Return null for any missing fields
3. Do not infer, calculate, or generate any data
4. If a field is ambiguous, return null
5. Preserve exact text as it appears, do not normalize or standardize`;
const extractData = async (html) => {
const response = await openai.chat.completions.create({
model: "gpt-4-turbo",
messages: [
{ role: "system", content: systemPrompt },
{
role: "user",
content: `Extract data as JSON from this HTML:\n${html}\n\nReturn: {name: string|null, price: number|null, description: string|null}`
}
],
response_format: { type: "json_object" }
});
return JSON.parse(response.choices[0].message.content);
};
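Even with these rules, models occasionally return placeholder strings such as "N/A" or "unknown" instead of a real null. A small post-processing guard, sketched here in Python and independent of any particular SDK, normalizes those values before they enter your pipeline:

MISSING_MARKERS = {"", "n/a", "none", "null", "unknown", "not available"}

def normalize_nulls(record):
    """Convert placeholder strings the model may emit into real None values."""
    return {
        key: None if isinstance(value, str) and value.strip().lower() in MISSING_MARKERS else value
        for key, value in record.items()
    }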
3. Add Verification Layers
Implement multi-step verification to catch hallucinations:
def verify_extraction(html_content, extracted_data):
"""Verify extracted data against source HTML"""
verification_prompt = f"""
HTML Content: {html_content}
Extracted Data: {json.dumps(extracted_data)}
For each field in the extracted data:
1. Find the exact location in the HTML where it appears
2. Quote the relevant HTML snippet
3. Mark as VERIFIED if found, HALLUCINATED if not found
Return JSON: {{"field_name": {{"status": "VERIFIED|HALLUCINATED", "evidence": "html snippet or null"}}}}
"""
verification = llm_call(verification_prompt)
# Filter out hallucinated fields
verified_data = {
key: value
for key, value in extracted_data.items()
        if verification.get(key, {}).get("status") == "VERIFIED"
}
return verified_data
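The llm_call helper above is left abstract; one way it might be wired up, assuming the same legacy openai SDK as the earlier Python examples and a model that returns valid JSON:

import json
import openai

def llm_call(prompt):
    """Send the verification prompt and parse the model's JSON reply."""
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "Respond with valid JSON only."},
            {"role": "user", "content": prompt},
        ],
        temperature=0.0,
    )
    return json.loads(response["choices"][0]["message"]["content"])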
4. Use Confidence Scores
Request a confidence level for each extracted field, and treat the model's self-reported scores as a heuristic signal rather than a calibrated probability:
extraction_schema = {
"type": "object",
"properties": {
"data": {
"type": "object",
"properties": {
"title": {"type": "string"},
"price": {"type": "number"},
"rating": {"type": "number"}
}
},
"confidence": {
"type": "object",
"description": "Confidence score 0-1 for each field",
"properties": {
"title": {"type": "number", "minimum": 0, "maximum": 1},
"price": {"type": "number", "minimum": 0, "maximum": 1},
"rating": {"type": "number", "minimum": 0, "maximum": 1}
}
}
}
}
# Filter results based on confidence threshold
def filter_by_confidence(result, threshold=0.8):
return {
key: value
for key, value in result["data"].items()
if result["confidence"].get(key, 0) >= threshold
}
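As a quick illustration with made-up scores, the filter keeps only the fields the model reported with high confidence:

sample_result = {
    "data": {"title": "Widget Pro", "price": 29.99, "rating": 4.5},
    "confidence": {"title": 0.97, "price": 0.92, "rating": 0.35},
}
print(filter_by_confidence(sample_result))
# {'title': 'Widget Pro', 'price': 29.99} -- the low-confidence 'rating' is dropped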
5. Hybrid Approach: LLM + Traditional Parsing
Combine traditional selectors with LLM extraction to validate results:
from bs4 import BeautifulSoup
def hybrid_extraction(html):
soup = BeautifulSoup(html, 'html.parser')
# Traditional extraction
traditional_price = soup.select_one('.price')
traditional_title = soup.select_one('h1.product-title')
    # LLM extraction (extract_with_llm wraps your LLM-based extractor; see the stub below)
llm_data = extract_with_llm(html)
# Cross-validate
validated_data = {}
# If traditional parsing finds it, trust that over LLM
if traditional_title:
validated_data['title'] = traditional_title.text.strip()
    elif llm_data.get('title'):
# LLM found something when traditional parsing didn't - potential hallucination
validated_data['title'] = None # Mark as uncertain
if traditional_price:
        validated_data['price'] = float(traditional_price.text.strip().lstrip('$'))
elif llm_data.get('price'):
# Verify LLM price appears in the HTML text
if str(llm_data['price']) in html:
validated_data['price'] = llm_data['price']
else:
validated_data['price'] = None # Likely hallucinated
return validated_data
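To see the cross-check at work, here is a quick run with a stubbed-out extract_with_llm (a hypothetical stand-in for your LLM extractor) that simulates a hallucinated price on a page that lists none:

def extract_with_llm(html):
    # Stub that simulates an LLM hallucinating a price absent from the HTML
    return {"title": "Test Product", "price": 19.99}

html = '<div><h1 class="product-title">Test Product</h1></div>'
print(hybrid_extraction(html))
# {'title': 'Test Product', 'price': None} -- the fabricated price is discarded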
6. Few-Shot Examples with Null Cases
Provide examples that include missing data:
const fewShotPrompt = `Extract product data from HTML. Examples:
Example 1:
HTML: <div><h2>Widget Pro</h2><span class="price">$29.99</span></div>
Output: {"name": "Widget Pro", "price": 29.99, "rating": null}
Example 2:
HTML: <div><h2>Gadget</h2><p>Currently unavailable</p></div>
Output: {"name": "Gadget", "price": null, "rating": null}
Example 3:
HTML: <div><h2>Tool</h2><span>$15</span><span>4.5 stars</span></div>
Output: {"name": "Tool", "price": 15, "rating": 4.5}
Notice: Return null for any data not explicitly present.
Now extract from: ${htmlContent}`;
7. Temperature and Parameter Tuning
Lower temperature settings reduce creative hallucinations:
# More deterministic, less creative = fewer hallucinations
response = openai.ChatCompletion.create(
model="gpt-4",
messages=messages,
temperature=0.0, # Most deterministic
top_p=0.1, # Limit token sampling
frequency_penalty=0.0,
presence_penalty=0.0
)
Testing and Monitoring
Create Hallucination Test Cases
Build a test suite specifically for hallucination detection:
import pytest
def test_missing_price_returns_null():
html = '<div class="product"><h1>Test Product</h1></div>'
result = extract_product(html)
assert result['price'] is None, "Should return null for missing price, not hallucinate"
def test_no_fabricated_fields():
html = '<div><p>Simple text</p></div>'
result = extract_product(html)
expected_fields = {'name', 'price', 'description'}
actual_fields = set(result.keys())
assert actual_fields == expected_fields, f"Should not add extra fields: {actual_fields - expected_fields}"
def test_partial_data_not_completed():
html = '<div><h1>Product X</h1><p>Price coming soon</p></div>'
result = extract_product(html)
assert result['name'] == 'Product X'
assert result['price'] is None, "Should not generate price from 'coming soon' text"
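If your schema also exposes a launch_date field, a similar test (a sketch, assuming extract_product returns that field) guards against the relative-date inference shown in scenario 3:

def test_relative_date_not_converted():
    html = '<p>Product launched 5 years ago</p>'
    result = extract_product(html)
    # The relative phrase must not be turned into a concrete date
    assert result.get('launch_date') is None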
Monitor Extraction Accuracy
Implement logging to track potential hallucinations in production:
import logging
def extract_with_monitoring(html, expected_fields):
extracted = llm_extract(html)
# Check for suspicious patterns
    warnings = []
    # Flag fields the LLM returned that were never requested
    unexpected = set(extracted) - set(expected_fields)
    if unexpected:
        warnings.append(f"Unexpected fields returned: {unexpected}")
    # Check if extracted data contains values not in source
for field, value in extracted.items():
if value is not None and str(value) not in html:
warnings.append(f"Field '{field}' value '{value}' not found in source HTML")
# Check for overly consistent data (possible pattern hallucination)
if all(v is not None for v in extracted.values()):
if count_missing_selectors(html) > 0:
warnings.append("All fields populated despite missing HTML elements")
if warnings:
logging.warning(f"Potential hallucination detected: {warnings}")
logging.debug(f"HTML: {html[:200]}...")
logging.debug(f"Extracted: {extracted}")
return extracted, warnings
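The count_missing_selectors helper is assumed above; a minimal sketch, assuming you know which CSS selectors a complete page of this type should contain:

from bs4 import BeautifulSoup

EXPECTED_SELECTORS = ['.price', 'h1.product-title', '.rating']  # assumed page layout

def count_missing_selectors(html):
    """Count how many expected CSS selectors are absent from the HTML."""
    soup = BeautifulSoup(html, 'html.parser')
    return sum(1 for selector in EXPECTED_SELECTORS if soup.select_one(selector) is None)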
Best Practices Summary
- Explicit instructions: Always instruct the LLM to return null for missing data
- Structured schemas: Use strict type definitions with nullable fields
- Low temperature: Set temperature to 0 for maximum determinism
- Verification layer: Implement a second LLM call to verify extractions
- Hybrid validation: Cross-check LLM results with traditional parsing
- Confidence scoring: Request and filter by confidence scores
- Few-shot examples: Include examples with null/missing values
- Comprehensive testing: Build test suites targeting hallucination scenarios
- Production monitoring: Log and alert on suspicious extraction patterns
- Regular auditing: Manually review a sample of extractions periodically
When to Use Traditional vs. LLM Scraping
Given the hallucination risks, consider using traditional CSS selectors or XPath when:
- The page structure is consistent and predictable
- Exact accuracy is critical (financial data, medical information)
- The data format is standardized
- You're scraping high volumes where verification overhead is costly
Use LLM-powered extraction when:
- Page structures vary significantly across sources
- Data is embedded in natural language text
- You need semantic understanding of context
- Traditional selectors frequently break due to HTML changes
Conclusion
LLM hallucination is a significant challenge in AI-powered web scraping, but it can be effectively managed through careful prompt engineering, structured output validation, and hybrid verification approaches. By implementing the techniques outlined above, you can build reliable scraping systems that leverage the flexibility of LLMs while maintaining data accuracy.
The key is to treat LLMs as powerful but fallible tools that require guardrails. Always validate extracted data, use null values for missing information, and combine AI extraction with traditional parsing methods when possible. With proper safeguards, you can harness the power of LLM-based scraping while minimizing the risk of hallucinated data compromising your applications.