How do I validate data extracted by an LLM from a web page?

Validating data extracted by Large Language Models (LLMs) from web pages is crucial for ensuring accuracy and reliability in your web scraping workflows. While LLMs offer powerful data extraction capabilities, they can occasionally produce errors, hallucinations, or inconsistent outputs. This guide covers comprehensive validation strategies to ensure the quality of LLM-extracted data.

Why Validation Matters for LLM-Based Web Scraping

LLMs can sometimes "hallucinate" information, meaning they generate plausible-sounding but incorrect data. Additionally, they may misinterpret HTML structure, extract incomplete information, or format data inconsistently. Proper validation helps you:

  • Detect hallucinations and fabricated data
  • Ensure data completeness and consistency
  • Verify data type and format correctness
  • Maintain data quality across large-scale scraping operations
  • Build reliable automated pipelines

Schema Validation

Schema validation is the first line of defense for ensuring LLM-extracted data matches your expected structure.

Using JSON Schema Validation (Python)

import jsonschema
from jsonschema import validate
import json

# Define your expected schema
schema = {
    "type": "object",
    "required": ["title", "price", "rating"],
    "properties": {
        "title": {"type": "string", "minLength": 1},
        "price": {
            "type": "number",
            "minimum": 0
        },
        "rating": {
            "type": "number",
            "minimum": 0,
            "maximum": 5
        },
        "description": {"type": "string"},
        "availability": {
            "type": "string",
            "enum": ["in_stock", "out_of_stock", "preorder"]
        }
    }
}

def validate_llm_output(data):
    """Validate LLM-extracted data against schema"""
    try:
        validate(instance=data, schema=schema)
        print("✓ Data validation passed")
        return True
    except jsonschema.exceptions.ValidationError as e:
        print(f"✗ Validation error: {e.message}")
        return False

# Example LLM output
llm_extracted_data = {
    "title": "Wireless Headphones",
    "price": 79.99,
    "rating": 4.5,
    "description": "High-quality wireless headphones with noise cancellation",
    "availability": "in_stock"
}

validate_llm_output(llm_extracted_data)

Using Pydantic for Type Safety (Python)

Pydantic provides runtime type checking and data validation:

from pydantic import BaseModel, Field, validator, ValidationError
from typing import Optional
from datetime import datetime

class ProductData(BaseModel):
    title: str = Field(..., min_length=1, max_length=500)
    price: float = Field(..., gt=0)
    rating: Optional[float] = Field(None, ge=0, le=5)
    review_count: Optional[int] = Field(None, ge=0)
    url: str
    scraped_at: datetime = Field(default_factory=datetime.now)

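    # Note: @validator is the Pydantic v1 API; in Pydantic v2, use @field_validator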
    @validator('title')
    def title_not_empty(cls, v):
        if not v.strip():
            raise ValueError('Title cannot be empty or whitespace')
        return v.strip()

    @validator('url')
    def validate_url(cls, v):
        if not v.startswith(('http://', 'https://')):
            raise ValueError('Invalid URL format')
        return v

# Validate LLM output
try:
    product = ProductData(
        title="Wireless Mouse",
        price=29.99,
        rating=4.2,
        review_count=150,
        url="https://example.com/product/123"
    )
    print(f"✓ Valid product: {product.title}")
except ValidationError as e:
    print(f"✗ Validation errors:\n{e}")

TypeScript/JavaScript Schema Validation

const Ajv = require('ajv');
const ajv = new Ajv();

// Define schema
const productSchema = {
  type: 'object',
  required: ['title', 'price', 'rating'],
  properties: {
    title: {
      type: 'string',
      minLength: 1,
      maxLength: 500
    },
    price: {
      type: 'number',
      minimum: 0
    },
    rating: {
      type: 'number',
      minimum: 0,
      maximum: 5
    },
    availability: {
      type: 'string',
      enum: ['in_stock', 'out_of_stock', 'preorder']
    }
  }
};

const validate = ajv.compile(productSchema);

function validateLLMOutput(data) {
  const valid = validate(data);

  if (!valid) {
    console.error('Validation errors:', validate.errors);
    return false;
  }

  console.log('✓ Data validation passed');
  return true;
}

// Example usage
const llmData = {
  title: 'Smart Watch',
  price: 199.99,
  rating: 4.7,
  availability: 'in_stock'
};

validateLLMOutput(llmData);

Cross-Verification Techniques

Cross-verification involves validating LLM-extracted data against the original source or alternative extraction methods.

Dual Extraction Strategy

Extract data using both LLM and traditional methods, then compare:

import json
import difflib
from bs4 import BeautifulSoup
from openai import OpenAI  # OpenAI Python SDK v1+

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def traditional_extraction(html):
    """Extract data using traditional CSS-selector parsing"""
    # Selectors are site-specific; add None checks for production use
    soup = BeautifulSoup(html, 'html.parser')
    return {
        'title': soup.select_one('h1.product-title').text.strip(),
        'price': float(soup.select_one('.price').text.replace('$', '')),
        'rating': float(soup.select_one('.rating')['data-rating'])
    }

def llm_extraction(html):
    """Extract data using an LLM"""
    response = client.chat.completions.create(
        model="gpt-4o",  # any model that supports JSON-mode output
        messages=[
            {"role": "system", "content": "Extract product data as JSON"},
            {"role": "user", "content": f"Extract title, price, and rating from:\n{html}"}
        ],
        response_format={"type": "json_object"}
    )
    return json.loads(response.choices[0].message.content)

def cross_verify(html):
    """Compare both extraction methods"""
    traditional = traditional_extraction(html)
    llm = llm_extraction(html)

    discrepancies = []

    for key in traditional.keys():
        if key in llm:
            # Compare with tolerance for floats
            if isinstance(traditional[key], float):
                if abs(traditional[key] - llm[key]) > 0.01:
                    discrepancies.append(f"{key}: {traditional[key]} vs {llm[key]}")
            elif traditional[key] != llm[key]:
                similarity = difflib.SequenceMatcher(
                    None, str(traditional[key]), str(llm[key])
                ).ratio()
                if similarity < 0.95:  # 95% similarity threshold
                    discrepancies.append(f"{key}: {traditional[key]} vs {llm[key]}")

    if discrepancies:
        print(f"⚠ Discrepancies found: {discrepancies}")
        return False

    print("✓ Cross-verification passed")
    return True

Detecting LLM Hallucinations

LLM hallucinations are a critical concern in web scraping. Here are techniques to detect them:

Source Text Verification

Verify that extracted data actually appears in the source HTML:

def verify_data_in_source(html_content, extracted_data):
    """Verify extracted data exists in source HTML"""
    html_lower = html_content.lower()
    warnings = []

    for key, value in extracted_data.items():
        if isinstance(value, str) and len(value) > 3:
            # Check if string value appears in HTML
            if value.lower() not in html_lower:
                # Check for close matches
                words = value.split()
                word_matches = sum(1 for word in words if word.lower() in html_lower)
                match_ratio = word_matches / len(words) if words else 0

                if match_ratio < 0.5:
                    warnings.append(f"{key}: '{value}' not found in source (possible hallucination)")

    if warnings:
        print("⚠ Potential hallucinations detected:")
        for warning in warnings:
            print(f"  - {warning}")
        return False

    return True

# Example usage
html = "<html><body><h1>Wireless Keyboard</h1><p>Price: $45.99</p></body></html>"
extracted = {
    "title": "Wireless Keyboard",
    "price": 45.99,
    "description": "RGB backlit gaming keyboard"  # This might be a hallucination!
}

verify_data_in_source(html, extracted)

Confidence Scoring with Multiple Passes

Use multiple LLM extractions and compare results:

import asyncio
import json
from collections import Counter

async def extract_with_confidence(html, num_attempts=3):
    """Extract data multiple times and calculate confidence"""
    results = []

    for i in range(num_attempts):
        # Run the synchronous llm_extraction() defined earlier in a worker thread
        data = await asyncio.to_thread(llm_extraction, html)
        results.append(json.dumps(data, sort_keys=True))

    # Count identical results
    result_counts = Counter(results)
    most_common = result_counts.most_common(1)[0]
    confidence = most_common[1] / num_attempts

    if confidence < 0.66:  # Less than 2/3 agreement
        print(f"⚠ Low confidence: {confidence:.0%}")
        return None

    print(f"✓ High confidence: {confidence:.0%}")
    return json.loads(most_common[0])

# Run from synchronous code with:
# result = asyncio.run(extract_with_confidence(html))

Business Rule Validation

Implement domain-specific validation rules:

class ProductValidator:
    """Domain-specific validation for product data"""

    @staticmethod
    def validate_price_range(price, min_price=0.01, max_price=100000):
        """Validate price is within reasonable range"""
        if not min_price <= price <= max_price:
            raise ValueError(f"Price {price} outside expected range")
        return True

    @staticmethod
    def validate_rating_reviews_correlation(rating, review_count):
        """High ratings with no reviews might be suspicious"""
        if rating and review_count is not None:
            if rating > 4.5 and review_count < 5:
                raise ValueError("Suspicious: high rating with few reviews")
        return True

    @staticmethod
    def validate_text_quality(text, min_words=3):
        """Ensure text fields have minimum quality"""
        if not text or len(text.split()) < min_words:
            raise ValueError("Text field too short or empty")
        return True

    @staticmethod
    def validate_date_logic(published_date, scraped_date):
        """Ensure dates make logical sense"""
        if published_date > scraped_date:
            raise ValueError("Published date cannot be after the scrape date")
        return True

# Usage
def validate_business_rules(product_data):
    """Apply all business rules"""
    try:
        validator = ProductValidator()
        validator.validate_price_range(product_data['price'])
        validator.validate_rating_reviews_correlation(
            product_data.get('rating'),
            product_data.get('review_count')
        )
        validator.validate_text_quality(product_data['title'])
        print("✓ Business rule validation passed")
        return True
    except ValueError as e:
        print(f"✗ Business rule violation: {e}")
        return False

Implementing a Complete Validation Pipeline

Here's a comprehensive validation pipeline combining multiple techniques:

class LLMDataValidator:
    """Complete validation pipeline for LLM-extracted data"""

    def __init__(self, schema, business_rules=None):
        self.schema = schema
        self.business_rules = business_rules or []
        self.validation_results = {}

    def validate(self, html_content, extracted_data):
        """Run complete validation pipeline"""
        checks = [
            ('schema', self._validate_schema),
            ('source_verification', self._verify_in_source),
            ('business_rules', self._validate_business_rules),
            ('data_quality', self._validate_data_quality)
        ]

        all_passed = True

        for check_name, check_func in checks:
            try:
                result = check_func(html_content, extracted_data)
                self.validation_results[check_name] = {
                    'passed': result,
                    'error': None
                }
                if not result:
                    all_passed = False
            except Exception as e:
                self.validation_results[check_name] = {
                    'passed': False,
                    'error': str(e)
                }
                all_passed = False

        return all_passed

    def _validate_schema(self, html_content, data):
        validate(instance=data, schema=self.schema)
        return True

    def _verify_in_source(self, html_content, data):
        return verify_data_in_source(html_content, data)

    def _validate_business_rules(self, html_content, data):
        for rule in self.business_rules:
            rule(data)
        return True

    def _validate_data_quality(self, html_content, data):
        # Check for suspiciously short or generic values
        for key, value in data.items():
            if isinstance(value, str):
                if len(value) < 2 or value.lower() in ['n/a', 'null', 'none']:
                    raise ValueError(f"Low quality data in field: {key}")
        return True

    def get_report(self):
        """Generate validation report"""
        return {
            'overall_passed': all(r['passed'] for r in self.validation_results.values()),
            'checks': self.validation_results
        }

# Usage example
validator = LLMDataValidator(
    schema=schema,
    business_rules=[
        lambda d: ProductValidator.validate_price_range(d['price']),
        lambda d: ProductValidator.validate_text_quality(d['title'])
    ]
)

if validator.validate(html_content, llm_extracted_data):
    print("✓ All validations passed")
    print(json.dumps(validator.get_report(), indent=2))
else:
    print("✗ Validation failed")
    print(json.dumps(validator.get_report(), indent=2))

Monitoring and Logging

Implement monitoring to track validation failures over time:

import logging
from datetime import datetime

class ValidationMonitor:
    """Monitor and log validation results"""

    def __init__(self, log_file='validation.log'):
        logging.basicConfig(
            filename=log_file,
            level=logging.INFO,
            format='%(asctime)s - %(levelname)s - %(message)s'
        )
        self.logger = logging.getLogger(__name__)
        self.stats = {
            'total': 0,
            'passed': 0,
            'failed': 0,
            'errors': {}
        }

    def log_validation(self, url, validation_result, errors=None):
        """Log validation result"""
        self.stats['total'] += 1

        if validation_result:
            self.stats['passed'] += 1
            self.logger.info(f"Validation passed for {url}")
        else:
            self.stats['failed'] += 1
            error_type = type(errors).__name__ if errors else 'Unknown'
            self.stats['errors'][error_type] = self.stats['errors'].get(error_type, 0) + 1
            self.logger.error(f"Validation failed for {url}: {errors}")

    def get_success_rate(self):
        """Calculate validation success rate"""
        if self.stats['total'] == 0:
            return 0
        return (self.stats['passed'] / self.stats['total']) * 100

    def get_stats(self):
        """Get validation statistics"""
        return {
            **self.stats,
            'success_rate': f"{self.get_success_rate():.2f}%"
        }
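
For example, wiring the monitor into a scraping loop might look like this (a minimal sketch: scrape_and_extract is a hypothetical placeholder for your own fetch-and-extract step, and validator is the LLMDataValidator instance from the previous section):

monitor = ValidationMonitor()

# scrape_and_extract() is a hypothetical stand-in for your own
# fetch-and-LLM-extract step
for url in ["https://example.com/product/1", "https://example.com/product/2"]:
    html_content, data = scrape_and_extract(url)
    passed = validator.validate(html_content, data)
    monitor.log_validation(url, passed, errors=None if passed else validator.get_report())

print(monitor.get_stats())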

Best Practices

  1. Use Structured Outputs: When working with OpenAI function calling or similar features, define strict schemas upfront (see the first sketch after this list)
  2. Implement Multi-Layer Validation: Combine schema, business rule, and source verification
  3. Log Everything: Keep detailed logs of validation failures for continuous improvement
  4. Set Confidence Thresholds: Reject or flag data that doesn't meet minimum confidence levels
  5. Use Fallback Mechanisms: Have traditional parsing as a backup when LLM extraction fails validation (see the second sketch after this list)
  6. Monitor Trends: Track validation failure rates over time to detect systemic issues
  7. Test with Edge Cases: Include unusual but valid data in your test suite
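
As an illustration of point 1, below is a minimal sketch of requesting schema-constrained output, assuming the OpenAI Python SDK v1+, a model that supports Structured Outputs (e.g. gpt-4o), and that html holds the page content:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Strict mode requires "additionalProperties": false and every property
# listed under "required"
product_schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "price": {"type": "number"},
        "rating": {"type": "number"}
    },
    "required": ["title", "price", "rating"],
    "additionalProperties": False
}

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Extract product data from the page."},
        {"role": "user", "content": html}
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "product", "schema": product_schema, "strict": True}
    }
)

For point 5, a fallback chain might look like the sketch below, reusing llm_extraction, traditional_extraction, and validate_llm_output from earlier in this guide:

def extract_with_fallback(html):
    """Try LLM extraction first; fall back to deterministic parsing
    when the LLM output fails schema validation."""
    try:
        data = llm_extraction(html)
        if validate_llm_output(data):  # schema check from earlier
            return data
    except Exception as e:
        print(f"LLM extraction failed: {e}")

    # Fallback: CSS-selector parsing with known-good selectors
    print("Falling back to traditional extraction")
    return traditional_extraction(html)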

Conclusion

Validating LLM-extracted web data requires a multi-faceted approach combining schema validation, cross-verification, hallucination detection, and business rule enforcement. By implementing a comprehensive validation pipeline, you can harness the power of LLMs for web scraping while maintaining high data quality standards. Regular monitoring and continuous refinement of your validation rules will ensure reliable automated data extraction at scale.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
