How Do I Handle LLM Hallucinations When Extracting Data from Web Pages?
LLM hallucinations—when large language models generate plausible but incorrect or fabricated information—pose a significant challenge in AI-powered web scraping. While AI web scraping offers powerful advantages over traditional methods, ensuring data accuracy requires implementing validation strategies, prompt engineering techniques, and verification workflows to minimize hallucinations and maintain data integrity.
Understanding LLM Hallucinations in Web Scraping
Hallucinations occur when an LLM generates information that wasn't present in the source data. In web scraping contexts, this might manifest as:
- Invented values: Creating prices, dates, or numbers that don't exist on the page
- Assumed information: Inferring details based on patterns rather than actual content
- Merged data: Combining information from multiple elements incorrectly
- Format inconsistencies: Converting data to formats that weren't in the original source
- Missing data fabrication: Making up values when the requested field is absent
Understanding when and why hallucinations occur is the first step toward preventing them.
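For instance, given a product page that shows a name and a price but no rating, a faithful extraction and a hallucinated one might look like this (illustrative values only):

```python
# What the page actually supports:
grounded = {'name': 'Acme Widget', 'price': 19.99, 'rating': None}

# What a hallucinating model might return instead, filling the gap
# with a plausible-looking value:
hallucinated = {'name': 'Acme Widget', 'price': 19.99, 'rating': 4.5}
```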
Why LLMs Hallucinate During Data Extraction
Several factors contribute to hallucinations in web scraping scenarios:
1. Ambiguous Instructions
Vague prompts or field descriptions lead LLMs to make assumptions:
```python
# Problematic - too vague
fields = {
    'price': 'price'  # What if multiple prices exist?
}

# Better - explicit and specific
fields = {
    'price': 'The final checkout price in USD, after all discounts, excluding shipping'
}
```
2. Missing Data Handling
When requested information doesn't exist, LLMs may fabricate plausible values instead of returning null:
```javascript
// The LLM might invent a shipping cost if none is displayed
const data = await client.getFields(url, {
  'shipping_cost': 'shipping cost' // Risky if not always present
});
```
3. Pattern Recognition Over-reliance
LLMs might extrapolate patterns from training data rather than extracting actual content:
```python
# If the page doesn't show a rating, the LLM might assume "4.5 stars"
# based on common e-commerce patterns
result = client.get_question(
    url,
    'What is the product rating?'
)
```
4. Context Window Limitations
Large pages may exceed the LLM's context window, causing incomplete analysis and potential hallucinations about unseen content.
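One partial mitigation is to shrink the page before extraction. The sketch below is a minimal, illustrative example (assuming BeautifulSoup; the tag list and character limit are arbitrary assumptions, not fixed recommendations) that strips boilerplate markup so more of the relevant content fits into the model's context:

```python
from bs4 import BeautifulSoup

def trim_for_context(html, max_chars=32000):
    """Drop boilerplate markup that rarely contains target data,
    so more of the relevant page fits in the LLM's context window."""
    soup = BeautifulSoup(html, 'html.parser')
    # Scripts, styles, and navigation chrome consume tokens
    # without contributing extractable content
    for tag in soup(['script', 'style', 'noscript', 'nav', 'footer', 'svg']):
        tag.decompose()
    trimmed = str(soup)
    # Hard truncation is a last resort; prefer isolating the main content node
    return trimmed[:max_chars]
```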
Strategies to Prevent and Detect Hallucinations
1. Explicit Null Handling Instructions
Always instruct the LLM to return null or specific values for missing data:
```python
from webscraping_ai import WebScrapingAI

client = WebScrapingAI(api_key='YOUR_API_KEY')

# Clear instructions for handling missing data
fields = {
    'product_name': 'The exact product name as displayed. Return null if not found.',
    'list_price': 'The original price before discounts. Return null if no original price is shown.',
    'sale_price': 'The current discounted price. Return null if no sale is active.',
    'discount_percentage': 'The discount percentage. Return null if not explicitly stated.',
    'shipping_cost': 'The shipping cost in USD. Return "FREE" if free shipping, null if not mentioned.',
    'availability': 'In stock status. Return exactly "In Stock", "Out of Stock", or null if unclear.'
}

result = client.get_fields(
    url='https://example.com/product',
    fields=fields
)
```
2. Validation Against Source HTML
Cross-reference extracted data with the original HTML to verify presence:
```python
from bs4 import BeautifulSoup

def validate_extraction(url, extracted_data):
    """Verify extracted data actually exists in source HTML"""
    # Get the raw HTML
    html = client.get_html(url)
    soup = BeautifulSoup(html, 'html.parser')
    text_content = soup.get_text().lower()

    validation_results = {}

    for field, value in extracted_data.items():
        if value is None:
            validation_results[field] = {'valid': True, 'reason': 'Null value accepted'}
            continue

        # Check if the value appears in the source
        value_str = str(value).lower()

        # For numeric values, check if they exist in some form
        if isinstance(value, (int, float)):
            # Look for the number with various formatting
            patterns = [
                str(value),
                f"{value:,.2f}",
                f"${value}",
                str(value).replace('.', ',')
            ]
            found = any(pattern.lower() in text_content for pattern in patterns)
        else:
            # For text, check for exact or partial matches
            found = value_str in text_content

        validation_results[field] = {
            'valid': found,
            'reason': 'Found in source' if found else 'NOT FOUND - potential hallucination'
        }

    return validation_results

# Example usage
extracted = client.get_fields(
    url='https://example.com/product',
    fields=fields
)

validation = validate_extraction('https://example.com/product', extracted)

for field, result in validation.items():
    if not result['valid']:
        print(f"WARNING: {field} may be hallucinated - {result['reason']}")
        # Set hallucinated values to null
        extracted[field] = None
```
3. Multiple Extraction Attempts with Consistency Checking
Run the same extraction multiple times and verify consistency:
```python
from collections import Counter

def extract_with_consistency_check(url, fields, attempts=3, threshold=0.8):
    """Extract data multiple times and flag inconsistencies"""
    results = []
    for i in range(attempts):
        result = client.get_fields(url, fields)
        results.append(result)

    # Check consistency across attempts
    consensus_data = {}
    for field in fields.keys():
        values = [r.get(field) for r in results]

        # Count occurrences of each value
        value_counts = Counter(str(v) for v in values)
        most_common_value, count = value_counts.most_common(1)[0]

        # Calculate agreement percentage
        agreement = count / attempts

        if agreement >= threshold:
            # Recover the original (non-string) value from the first
            # attempt that produced the consensus answer
            consensus_data[field] = next(
                r.get(field) for r in results
                if str(r.get(field)) == most_common_value
            )
        else:
            # Flag as inconsistent - likely hallucination
            consensus_data[field] = None
            print(f"WARNING: Inconsistent results for {field}: {values}")

    return consensus_data

# Usage
reliable_data = extract_with_consistency_check(
    url='https://example.com/product',
    fields=fields,
    attempts=3
)
```
4. Structured Output with Schema Validation
Use strict schema definitions to constrain LLM outputs:
```javascript
const WebScrapingAI = require('webscraping.ai');
const Joi = require('joi'); // Schema validation library

const client = new WebScrapingAI('YOUR_API_KEY');

// Define strict schema
const productSchema = Joi.object({
  product_name: Joi.string().required(),
  price: Joi.number().min(0).max(1000000).allow(null),
  currency: Joi.string().valid('USD', 'EUR', 'GBP').allow(null),
  in_stock: Joi.boolean().allow(null),
  rating: Joi.number().min(0).max(5).allow(null),
  review_count: Joi.number().integer().min(0).allow(null),
  shipping_days: Joi.number().integer().min(0).max(365).allow(null)
});

async function extractAndValidate(url) {
  const fields = {
    'product_name': 'Exact product name. Return null if not found.',
    'price': 'Numeric price value only, no currency symbols. Return null if not shown.',
    'currency': 'Currency code (USD, EUR, or GBP). Return null if unclear.',
    'in_stock': 'Boolean true if in stock, false if out of stock, null if unclear.',
    'rating': 'Average rating as a number between 0 and 5. Return null if not shown.',
    'review_count': 'Total number of reviews as an integer. Return null if not shown.',
    'shipping_days': 'Estimated shipping time in days as an integer. Return null if not mentioned.'
  };

  const extracted = await client.getFields(url, fields);

  // Validate against schema
  const { error } = productSchema.validate(extracted, {
    abortEarly: false,
    stripUnknown: true
  });

  if (error) {
    console.error('Validation errors (potential hallucinations):');
    error.details.forEach(detail => {
      console.error(`  - ${detail.path}: ${detail.message}`);
      // Set invalid fields to null
      const field = detail.path[0];
      extracted[field] = null;
    });
  }

  return extracted;
}

// Usage
extractAndValidate('https://example.com/product')
  .then(data => console.log('Validated data:', data))
  .catch(err => console.error('Extraction failed:', err));
```
5. Grounding with Source Attribution
Request that the LLM provide source references for extracted data:
```python
import json

def extract_with_attribution(url):
    """Extract data with source attribution to verify against hallucinations"""
    # Modified prompt requesting attribution, with explicit JSON keys
    # so the response can be parsed programmatically
    question = """
    Extract the following information from this product page.
    For each field, return an object with two keys:
    "value" (the extracted value, or null if the information is not present) and
    "source_text" (the exact text snippet from the page where you found it).

    Extract:
    - product_name
    - current_price
    - original_price (if on sale)
    - availability

    Format your response as a single JSON object keyed by field name.
    """

    result = client.get_question(url, question)

    # Parse and validate the attributed response
    try:
        data = json.loads(result)

        # Verify each attributed snippet actually appears in the page
        html = client.get_html(url)
        for field, info in data.items():
            if info.get('source_text') and info['source_text'] not in html:
                print(f"WARNING: Source text for {field} not found in HTML - possible hallucination")
                info['value'] = None

        return {k: v.get('value') for k, v in data.items()}
    except json.JSONDecodeError:
        print("ERROR: Could not parse attributed response")
        return None

result = extract_with_attribution('https://example.com/product')
```
6. Confidence Scores and Uncertainty Detection
Explicitly ask the LLM to indicate confidence levels:
```python
import re

def extract_with_confidence(url, fields):
    """Extract data with confidence scores to identify potential hallucinations"""
    enhanced_fields = {}
    for field, description in fields.items():
        enhanced_fields[field] = (
            f"{description} Append your confidence (0-100) that this information "
            "is explicitly stated on the page, formatted as 'confidence: <number>'."
        )

    result = client.get_fields(url, enhanced_fields)

    # Parse confidence scores
    high_confidence_data = {}
    flagged_fields = []

    for field, value in result.items():
        # Check if the value carries a confidence indicator
        if isinstance(value, str):
            match = re.search(r'confidence:\s*(\d+)', value, re.IGNORECASE)
        else:
            match = None

        if match:
            # Split the value from its confidence score
            actual_value = value[:match.start()].strip().rstrip(',.')
            confidence = int(match.group(1))

            if confidence >= 80:
                high_confidence_data[field] = actual_value
            else:
                flagged_fields.append((field, actual_value, confidence))
                high_confidence_data[field] = None
        else:
            high_confidence_data[field] = value

    if flagged_fields:
        print("Low confidence fields (potential hallucinations):")
        for field, value, conf in flagged_fields:
            print(f"  {field}: {value} (confidence: {conf}%)")

    return high_confidence_data
```
7. Dual-Model Verification
Use multiple LLM models and compare results:
```python
import json
from openai import OpenAI

def multi_model_extraction(url, fields):
    """Extract using multiple LLMs and flag discrepancies"""
    # WebScrapingAI (uses various models)
    ws_result = client.get_fields(url, fields)

    # For critical applications, verify with direct LLM calls
    html = client.get_html(url)

    # Using OpenAI GPT-4 (truncate the HTML to fit the context window)
    openai_client = OpenAI(api_key='YOUR_OPENAI_KEY')
    gpt_prompt = f"""
    Extract the following fields from this HTML and return a JSON object:
    {json.dumps(fields, indent=2)}

    Return only exact information found in the HTML. Use null for missing data.

    HTML:
    {html[:8000]}
    """

    gpt_response = openai_client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": gpt_prompt}],
        response_format={"type": "json_object"}
    )
    gpt_result = json.loads(gpt_response.choices[0].message.content)

    # Compare results
    consensus = {}
    discrepancies = []

    for field in fields.keys():
        ws_val = ws_result.get(field)
        gpt_val = gpt_result.get(field)

        if ws_val == gpt_val:
            consensus[field] = ws_val
        else:
            discrepancies.append({
                'field': field,
                'webscraping_ai': ws_val,
                'gpt4': gpt_val
            })
            # Default to null on discrepancy
            consensus[field] = None

    if discrepancies:
        print("Model discrepancies detected (potential hallucinations):")
        print(json.dumps(discrepancies, indent=2))

    return consensus
```
Best Practices for Hallucination Prevention
1. Design Hallucination-Resistant Prompts
```python
# Poor prompt - encourages hallucination
bad_fields = {
    'specifications': 'product specs'
}

# Good prompt - discourages hallucination
good_fields = {
    'specifications': '''
        List only the technical specifications explicitly shown in a specifications
        table or list on this page. Do not infer specifications from the description.
        Return null if no specifications table exists. Format as a dictionary with
        exact specification names as keys.
    '''
}
```
2. Implement Fallback to Traditional Scraping
For critical data, use traditional HTML parsing as a verification layer:
```python
from bs4 import BeautifulSoup
import re

def hybrid_extraction(url):
    """Combine AI extraction with traditional parsing for validation"""
    # AI extraction
    ai_result = client.get_fields(
        url,
        {
            'price': 'Current price as a number',
            'title': 'Product title'
        }
    )

    # Traditional parsing as validation
    html = client.get_html(url)
    soup = BeautifulSoup(html, 'html.parser')

    # Look for common price patterns
    price_patterns = [
        r'\$\s*(\d+\.?\d*)',
        r'(\d+\.?\d*)\s*USD',
        r'Price:\s*\$?(\d+\.?\d*)'
    ]

    traditional_price = None
    for pattern in price_patterns:
        match = re.search(pattern, soup.get_text())
        if match:
            traditional_price = float(match.group(1))
            break

    # Validate AI result against traditional parsing
    # (coerce the AI value to a number, since LLMs may return strings)
    try:
        ai_price = float(ai_result.get('price')) if ai_result.get('price') is not None else None
    except (TypeError, ValueError):
        ai_price = None

    if ai_price is not None and traditional_price is not None:
        if abs(ai_price - traditional_price) > 0.01:
            print(f"WARNING: Price mismatch - AI: {ai_price}, Traditional: {traditional_price}")
            # Use the traditional result when there's a conflict
            ai_result['price'] = traditional_price

    return ai_result
```
3. Maintain Audit Trails
Log extraction attempts with timestamps and source URLs for traceability:
```python
import logging
from datetime import datetime

# Configure logging
logging.basicConfig(
    filename='scraping_audit.log',
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)

def extract_with_audit(url, fields):
    """Extract data with comprehensive audit logging"""
    start_time = datetime.now()

    try:
        result = client.get_fields(url, fields)

        # Log successful extraction
        logging.info(f"""
        Extraction successful
        URL: {url}
        Fields requested: {list(fields.keys())}
        Fields returned: {list(result.keys())}
        Null fields: {[k for k, v in result.items() if v is None]}
        Duration: {(datetime.now() - start_time).total_seconds()}s
        """)

        return result
    except Exception as e:
        logging.error(f"""
        Extraction failed
        URL: {url}
        Error: {str(e)}
        Duration: {(datetime.now() - start_time).total_seconds()}s
        """)
        raise
```
4. Set Realistic Expectations
Understand that some hallucination risk is inherent to LLM-based extraction:
```python
def extract_with_risk_assessment(url, fields):
    """Classify fields by hallucination risk"""
    # Categorize field types by risk
    high_risk_fields = []    # Subjective or inferable
    medium_risk_fields = []  # Semi-structured data
    low_risk_fields = []     # Highly structured, easy to verify

    risk_keywords = {
        'high': ['summary', 'description', 'opinion', 'recommendation'],
        'medium': ['features', 'specifications', 'details'],
        'low': ['price', 'date', 'quantity', 'stock']
    }

    for field, description in fields.items():
        desc_lower = description.lower()
        if any(kw in desc_lower for kw in risk_keywords['high']):
            high_risk_fields.append(field)
        elif any(kw in desc_lower for kw in risk_keywords['medium']):
            medium_risk_fields.append(field)
        else:
            low_risk_fields.append(field)

    result = client.get_fields(url, fields)

    # Add risk metadata
    result['_metadata'] = {
        'high_risk_fields': high_risk_fields,
        'medium_risk_fields': medium_risk_fields,
        'low_risk_fields': low_risk_fields,
        'note': 'High risk fields should be manually verified'
    }

    return result
```
Monitoring and Detection in Production
Real-time Hallucination Detection
```python
import statistics

class HallucinationDetector:
    def __init__(self, client):
        self.client = client

    def detect_anomalies(self, extracted_data, historical_data):
        """Compare current extraction against historical patterns"""
        anomalies = []

        for field, value in extracted_data.items():
            if value is None:
                continue

            # Check if value is a statistical outlier
            historical_values = [d.get(field) for d in historical_data if d.get(field) is not None]

            if historical_values and isinstance(value, (int, float)):
                mean = statistics.mean(historical_values)
                stdev = statistics.stdev(historical_values) if len(historical_values) > 1 else 0

                # Flag values more than 3 standard deviations from the mean
                if stdev > 0 and abs(value - mean) > 3 * stdev:
                    anomalies.append({
                        'field': field,
                        'value': value,
                        'mean': mean,
                        'stdev': stdev,
                        'reason': 'Statistical outlier - possible hallucination'
                    })

        return anomalies

# Usage
detector = HallucinationDetector(client)

# Store historical extractions
historical_products = [
    {'price': 29.99, 'rating': 4.5},
    {'price': 31.50, 'rating': 4.3},
    {'price': 28.75, 'rating': 4.7}
]

current_extraction = client.get_fields(url, fields)
anomalies = detector.detect_anomalies(current_extraction, historical_products)

if anomalies:
    print("Potential hallucinations detected:")
    for anomaly in anomalies:
        print(f"  {anomaly['field']}: {anomaly['value']} (expected ~{anomaly['mean']:.2f})")
```
Handling Hallucinations When Detected
When you detect potential hallucinations:
- Flag for manual review: Queue suspicious extractions for human verification
- Fall back to traditional methods: Use CSS selectors or XPath as a backup
- Request re-extraction: Retry with improved prompts
- Return null values: Better to have missing data than incorrect data
- Log for model improvement: Report patterns to improve future extractions
```python
def handle_suspected_hallucination(url, field, suspected_value):
    """Decision tree for handling detected hallucinations"""
    # Step 1: Verify against source
    html = client.get_html(url)

    if str(suspected_value) not in html:
        print(f"Value '{suspected_value}' not found in source - likely hallucination")

        # Step 2: Attempt traditional extraction
        soup = BeautifulSoup(html, 'html.parser')
        # ... traditional extraction logic ...

        # Step 3: If traditional extraction fails, return null
        return None

    # Step 4: Re-extract with stricter prompt
    strict_result = client.get_fields(
        url,
        {
            field: f'ONLY extract {field} if explicitly shown. Return exact value or null. Do not infer or estimate.'
        }
    )

    return strict_result.get(field)
```
Conclusion
Handling LLM hallucinations in web scraping requires a multi-layered approach combining clear prompt engineering, validation against source HTML, consistency checking, schema validation, and monitoring. While hallucinations cannot be eliminated entirely, implementing these strategies significantly reduces their occurrence and impact.
The key is to treat AI extraction as part of a larger validation pipeline rather than a complete solution. By combining AI-powered data extraction with traditional parsing methods, schema validation, and human oversight where critical, you can leverage the power of LLMs while maintaining data accuracy and reliability.
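As a closing illustration, here is a minimal sketch of such a pipeline, composing the helper functions defined earlier in this article (extract_with_consistency_check and validate_extraction) and preferring nulls over unverified values:

```python
def extraction_pipeline(url, fields):
    """Illustrative pipeline: extract with consistency checking,
    then null out anything that cannot be verified in the source HTML."""
    # Step 1: cross-check multiple extraction attempts
    data = extract_with_consistency_check(url, fields)

    # Step 2: verify surviving values against the raw page
    validation = validate_extraction(url, data)
    for field, outcome in validation.items():
        if not outcome['valid']:
            data[field] = None  # missing beats fabricated

    return data

clean_data = extraction_pipeline('https://example.com/product', fields)
```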
Always remember: it's better to return null for uncertain data than to propagate hallucinated information into your systems. Implement robust validation, maintain audit trails, and continuously monitor extraction quality to ensure your AI-powered web scraping remains accurate and trustworthy.