How do I validate data extracted by an LLM from a web page?
Validating data extracted by Large Language Models (LLMs) from web pages is crucial for ensuring accuracy and reliability in your web scraping workflows. While LLMs offer powerful data extraction capabilities, they can occasionally produce errors, hallucinations, or inconsistent outputs. This guide covers comprehensive validation strategies to ensure the quality of LLM-extracted data.
Why Validation Matters for LLM-Based Web Scraping
LLMs can sometimes "hallucinate" information, meaning they generate plausible-sounding but incorrect data. Additionally, they may misinterpret HTML structure, extract incomplete information, or format data inconsistently. Proper validation helps you:
- Detect hallucinations and fabricated data
- Ensure data completeness and consistency
- Verify data type and format correctness
- Maintain data quality across large-scale scraping operations
- Build reliable automated pipelines
Schema Validation
Schema validation is the first line of defense for ensuring LLM-extracted data matches your expected structure.
Using JSON Schema Validation (Python)
import jsonschema
from jsonschema import validate
import json

# Define your expected schema
schema = {
    "type": "object",
    "required": ["title", "price", "rating"],
    "properties": {
        "title": {"type": "string", "minLength": 1},
        "price": {
            "type": "number",
            "minimum": 0
        },
        "rating": {
            "type": "number",
            "minimum": 0,
            "maximum": 5
        },
        "description": {"type": "string"},
        "availability": {
            "type": "string",
            "enum": ["in_stock", "out_of_stock", "preorder"]
        }
    }
}

def validate_llm_output(data):
    """Validate LLM-extracted data against schema"""
    try:
        validate(instance=data, schema=schema)
        print("✓ Data validation passed")
        return True
    except jsonschema.exceptions.ValidationError as e:
        print(f"✗ Validation error: {e.message}")
        return False

# Example LLM output
llm_extracted_data = {
    "title": "Wireless Headphones",
    "price": 79.99,
    "rating": 4.5,
    "description": "High-quality wireless headphones with noise cancellation",
    "availability": "in_stock"
}

validate_llm_output(llm_extracted_data)
Using Pydantic for Type Safety (Python)
Pydantic provides runtime type checking and data validation:
from pydantic import BaseModel, Field, validator, ValidationError
from typing import Optional
from datetime import datetime

# Note: @validator is Pydantic v1 style; on Pydantic v2 use @field_validator instead
class ProductData(BaseModel):
    title: str = Field(..., min_length=1, max_length=500)
    price: float = Field(..., gt=0)
    rating: Optional[float] = Field(None, ge=0, le=5)
    review_count: Optional[int] = Field(None, ge=0)
    url: str
    scraped_at: datetime = Field(default_factory=datetime.now)

    @validator('title')
    def title_not_empty(cls, v):
        if not v.strip():
            raise ValueError('Title cannot be empty or whitespace')
        return v.strip()

    @validator('url')
    def validate_url(cls, v):
        if not v.startswith(('http://', 'https://')):
            raise ValueError('Invalid URL format')
        return v

# Validate LLM output
try:
    product = ProductData(
        title="Wireless Mouse",
        price=29.99,
        rating=4.2,
        review_count=150,
        url="https://example.com/product/123"
    )
    print(f"✓ Valid product: {product.title}")
except ValidationError as e:
    print(f"✗ Validation errors:\n{e}")
TypeScript/JavaScript Schema Validation
const Ajv = require('ajv');
const ajv = new Ajv();

// Define schema
const productSchema = {
  type: 'object',
  required: ['title', 'price', 'rating'],
  properties: {
    title: {
      type: 'string',
      minLength: 1,
      maxLength: 500
    },
    price: {
      type: 'number',
      minimum: 0
    },
    rating: {
      type: 'number',
      minimum: 0,
      maximum: 5
    },
    availability: {
      type: 'string',
      enum: ['in_stock', 'out_of_stock', 'preorder']
    }
  }
};

const validate = ajv.compile(productSchema);

function validateLLMOutput(data) {
  const valid = validate(data);
  if (!valid) {
    console.error('Validation errors:', validate.errors);
    return false;
  }
  console.log('✓ Data validation passed');
  return true;
}

// Example usage
const llmData = {
  title: 'Smart Watch',
  price: 199.99,
  rating: 4.7,
  availability: 'in_stock'
};

validateLLMOutput(llmData);
Cross-Verification Techniques
Cross-verification involves validating LLM-extracted data against the original source or alternative extraction methods.
Dual Extraction Strategy
Extract data using both LLM and traditional methods, then compare:
import difflib
import json

from bs4 import BeautifulSoup
from openai import OpenAI  # OpenAI Python SDK v1+

client = OpenAI()

def traditional_extraction(html):
    """Extract data using traditional parsing"""
    soup = BeautifulSoup(html, 'html.parser')
    return {
        'title': soup.select_one('h1.product-title').text.strip(),
        'price': float(soup.select_one('.price').text.replace('$', '')),
        'rating': float(soup.select_one('.rating')['data-rating'])
    }

def llm_extraction(html):
    """Extract data using an LLM (JSON mode requires a model that supports it)"""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Extract product data as JSON"},
            {"role": "user", "content": f"Extract title, price, and rating from:\n{html}"}
        ],
        response_format={"type": "json_object"}
    )
    return json.loads(response.choices[0].message.content)

def cross_verify(html):
    """Compare both extraction methods"""
    traditional = traditional_extraction(html)
    llm = llm_extraction(html)

    discrepancies = []
    for key in traditional.keys():
        if key in llm:
            # Compare with tolerance for floats
            if isinstance(traditional[key], float):
                if abs(traditional[key] - llm[key]) > 0.01:
                    discrepancies.append(f"{key}: {traditional[key]} vs {llm[key]}")
            elif traditional[key] != llm[key]:
                similarity = difflib.SequenceMatcher(
                    None, str(traditional[key]), str(llm[key])
                ).ratio()
                if similarity < 0.95:  # 95% similarity threshold
                    discrepancies.append(f"{key}: {traditional[key]} vs {llm[key]}")

    if discrepancies:
        print(f"⚠ Discrepancies found: {discrepancies}")
        return False
    print("✓ Cross-verification passed")
    return True
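A minimal sketch of calling this from a scraping loop, assuming the page is fetched with requests and that the CSS selectors in traditional_extraction() actually match the target page:

import requests

# Hypothetical product URL; adjust the selectors in traditional_extraction() to the real page
page_html = requests.get("https://example.com/product/123", timeout=30).text

if not cross_verify(page_html):
    print("Flag this page for manual review or re-extraction")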
Detecting LLM Hallucinations
LLM hallucinations are a critical concern in web scraping. Here are techniques to detect them:
Source Text Verification
Verify that extracted data actually appears in the source HTML:
def verify_data_in_source(html_content, extracted_data):
    """Verify extracted data exists in source HTML"""
    html_lower = html_content.lower()
    warnings = []

    for key, value in extracted_data.items():
        if isinstance(value, str) and len(value) > 3:
            # Check if string value appears in HTML
            if value.lower() not in html_lower:
                # Check for close matches
                words = value.split()
                word_matches = sum(1 for word in words if word.lower() in html_lower)
                match_ratio = word_matches / len(words) if words else 0
                if match_ratio < 0.5:
                    warnings.append(f"{key}: '{value}' not found in source (possible hallucination)")

    if warnings:
        print("⚠ Potential hallucinations detected:")
        for warning in warnings:
            print(f"  - {warning}")
        return False
    return True

# Example usage
html = "<html><body><h1>Wireless Keyboard</h1><p>Price: $45.99</p></body></html>"
extracted = {
    "title": "Wireless Keyboard",
    "price": 45.99,
    "description": "RGB backlit gaming keyboard"  # This might be a hallucination!
}

verify_data_in_source(html, extracted)
Confidence Scoring with Multiple Passes
Use multiple LLM extractions and compare results:
import json
from collections import Counter

def extract_with_confidence(html, num_attempts=3):
    """Extract data multiple times and calculate confidence"""
    results = []
    for _ in range(num_attempts):
        # Make an LLM extraction call (llm_extraction defined above)
        data = llm_extraction(html)
        results.append(json.dumps(data, sort_keys=True))

    # Count identical results
    result_counts = Counter(results)
    most_common = result_counts.most_common(1)[0]
    confidence = most_common[1] / num_attempts

    if confidence < 0.66:  # Less than 2/3 agreement
        print(f"⚠ Low confidence: {confidence:.0%}")
        return None
    print(f"✓ High confidence: {confidence:.0%}")
    return json.loads(most_common[0])
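A quick usage sketch; every extra attempt is another LLM call, so keep num_attempts small:

# page_html: raw HTML fetched earlier (see the cross-verification example)
data = extract_with_confidence(page_html, num_attempts=3)
if data is None:
    print("Agreement too low; re-queue the page or fall back to traditional parsing")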
Business Rule Validation
Implement domain-specific validation rules:
class ProductValidator:
    """Domain-specific validation for product data"""

    @staticmethod
    def validate_price_range(price, min_price=0.01, max_price=100000):
        """Validate price is within reasonable range"""
        if not min_price <= price <= max_price:
            raise ValueError(f"Price {price} outside expected range")
        return True

    @staticmethod
    def validate_rating_reviews_correlation(rating, review_count):
        """High ratings with no reviews might be suspicious"""
        if rating and review_count is not None:
            if rating > 4.5 and review_count < 5:
                raise ValueError("Suspicious: high rating with few reviews")
        return True

    @staticmethod
    def validate_text_quality(text, min_words=3):
        """Ensure text fields have minimum quality"""
        if not text or len(text.split()) < min_words:
            raise ValueError("Text field too short or empty")
        return True

    @staticmethod
    def validate_date_logic(published_date, scraped_date):
        """Ensure dates make logical sense"""
        if published_date > scraped_date:
            raise ValueError("Published date cannot be in the future")
        return True

# Usage
def validate_business_rules(product_data):
    """Apply all business rules"""
    try:
        validator = ProductValidator()
        validator.validate_price_range(product_data['price'])
        validator.validate_rating_reviews_correlation(
            product_data.get('rating'),
            product_data.get('review_count')
        )
        validator.validate_text_quality(product_data['title'])
        print("✓ Business rule validation passed")
        return True
    except ValueError as e:
        print(f"✗ Business rule violation: {e}")
        return False
Implementing a Complete Validation Pipeline
Here's a comprehensive validation pipeline combining multiple techniques:
class LLMDataValidator:
    """Complete validation pipeline for LLM-extracted data"""

    def __init__(self, schema, business_rules=None):
        self.schema = schema
        self.business_rules = business_rules or []
        self.validation_results = {}

    def validate(self, html_content, extracted_data):
        """Run complete validation pipeline"""
        checks = [
            ('schema', self._validate_schema),
            ('source_verification', self._verify_in_source),
            ('business_rules', self._validate_business_rules),
            ('data_quality', self._validate_data_quality)
        ]

        all_passed = True
        for check_name, check_func in checks:
            try:
                result = check_func(html_content, extracted_data)
                self.validation_results[check_name] = {
                    'passed': result,
                    'error': None
                }
                if not result:
                    all_passed = False
            except Exception as e:
                self.validation_results[check_name] = {
                    'passed': False,
                    'error': str(e)
                }
                all_passed = False
        return all_passed

    def _validate_schema(self, html_content, data):
        validate(instance=data, schema=self.schema)
        return True

    def _verify_in_source(self, html_content, data):
        return verify_data_in_source(html_content, data)

    def _validate_business_rules(self, html_content, data):
        for rule in self.business_rules:
            rule(data)
        return True

    def _validate_data_quality(self, html_content, data):
        # Check for suspiciously short or generic values
        for key, value in data.items():
            if isinstance(value, str):
                if len(value) < 2 or value.lower() in ['n/a', 'null', 'none']:
                    raise ValueError(f"Low quality data in field: {key}")
        return True

    def get_report(self):
        """Generate validation report"""
        return {
            'overall_passed': all(r['passed'] for r in self.validation_results.values()),
            'checks': self.validation_results
        }
# Usage example
# html_content: raw HTML of the scraped page; llm_extracted_data: the LLM output shown earlier
validator = LLMDataValidator(
    schema=schema,
    business_rules=[
        lambda d: ProductValidator.validate_price_range(d['price']),
        lambda d: ProductValidator.validate_text_quality(d['title'])
    ]
)

if validator.validate(html_content, llm_extracted_data):
    print("✓ All validations passed")
else:
    print("✗ Validation failed")
print(json.dumps(validator.get_report(), indent=2))
Monitoring and Logging
Implement monitoring to track validation failures over time:
import logging

class ValidationMonitor:
    """Monitor and log validation results"""

    def __init__(self, log_file='validation.log'):
        logging.basicConfig(
            filename=log_file,
            level=logging.INFO,
            format='%(asctime)s - %(levelname)s - %(message)s'
        )
        self.logger = logging.getLogger(__name__)
        self.stats = {
            'total': 0,
            'passed': 0,
            'failed': 0,
            'errors': {}
        }

    def log_validation(self, url, validation_result, errors=None):
        """Log validation result"""
        self.stats['total'] += 1
        if validation_result:
            self.stats['passed'] += 1
            self.logger.info(f"Validation passed for {url}")
        else:
            self.stats['failed'] += 1
            error_type = type(errors).__name__ if errors else 'Unknown'
            self.stats['errors'][error_type] = self.stats['errors'].get(error_type, 0) + 1
            self.logger.error(f"Validation failed for {url}: {errors}")

    def get_success_rate(self):
        """Calculate validation success rate"""
        if self.stats['total'] == 0:
            return 0
        return (self.stats['passed'] / self.stats['total']) * 100

    def get_stats(self):
        """Get validation statistics"""
        return {
            **self.stats,
            'success_rate': f"{self.get_success_rate():.2f}%"
        }
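A short usage sketch, assuming the LLMDataValidator instance from the previous section and a hypothetical pages iterable produced by your scraper:

monitor = ValidationMonitor(log_file='validation.log')

# pages: hypothetical iterable of (url, html, extracted_data) tuples from your scraper
for url, page_html, extracted in pages:
    passed = validator.validate(page_html, extracted)
    monitor.log_validation(url, passed, errors=None if passed else validator.get_report())

print(monitor.get_stats())  # totals, per-error-type counts, and success_rate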
Best Practices
- Use Structured Outputs: When working with OpenAI function calling or similar features, define strict schemas upfront (see the sketch after this list)
- Implement Multi-Layer Validation: Combine schema, business rule, and source verification
- Log Everything: Keep detailed logs of validation failures for continuous improvement
- Set Confidence Thresholds: Reject or flag data that doesn't meet minimum confidence levels
- Use Fallback Mechanisms: Have traditional parsing as a backup when LLM extraction fails validation (a fallback sketch follows this list)
- Monitor Trends: Track validation failure rates over time to detect systemic issues
- Test with Edge Cases: Include unusual but valid data in your test suite
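For the structured-outputs point, here is a minimal sketch using the OpenAI Python SDK's tool-calling interface to constrain the model to the JSON Schema defined earlier; the tool name and model choice are illustrative, and other providers offer similar schema-constrained modes:

# Minimal sketch: constrain extraction with a JSON Schema via tool calling (OpenAI SDK v1+)
extraction_tool = {
    "type": "function",
    "function": {
        "name": "record_product",   # illustrative tool name
        "description": "Record extracted product data",
        "parameters": schema        # the JSON Schema defined earlier in this guide
    }
}

# page_html: raw HTML of the page (see the cross-verification example)
response = client.chat.completions.create(
    model="gpt-4o",  # any tool-calling model
    messages=[{"role": "user", "content": f"Extract the product data from:\n{page_html}"}],
    tools=[extraction_tool],
    tool_choice={"type": "function", "function": {"name": "record_product"}}
)

arguments = response.choices[0].message.tool_calls[0].function.arguments
product = json.loads(arguments)
validate_llm_output(product)  # still validate; structured outputs reduce errors but do not remove them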
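For fallback mechanisms, one possible wiring that reuses the helpers defined above (llm_extraction, traditional_extraction, and the LLMDataValidator instance); the exact fallback policy is an assumption to tune for your own pipeline:

def extract_with_fallback(page_html):
    """Try LLM extraction first; fall back to selector-based parsing if validation fails."""
    data = llm_extraction(page_html)
    if validator.validate(page_html, data):
        return data

    # LLM output failed validation: fall back to traditional parsing
    fallback = traditional_extraction(page_html)
    if validator.validate(page_html, fallback):
        return fallback

    # Neither method produced valid data: surface the failure instead of guessing
    raise ValueError(f"Extraction failed validation: {validator.get_report()}")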
Conclusion
Validating LLM-extracted web data requires a multi-faceted approach combining schema validation, cross-verification, hallucination detection, and business rule enforcement. By implementing a comprehensive validation pipeline, you can harness the power of LLMs for web scraping while maintaining high data quality standards. Regular monitoring and continuous refinement of your validation rules will ensure reliable automated data extraction at scale.