How to Minimize LLM Hallucination When Using Deepseek for Data Extraction

LLM hallucination—when language models generate false or fabricated information—is a critical challenge when using Deepseek or any large language model for web scraping and data extraction. This guide provides practical strategies to minimize hallucinations and ensure accurate, reliable data extraction with Deepseek.

Understanding LLM Hallucination in Data Extraction

Hallucination occurs when an LLM "fills in the gaps" with plausible-sounding but inaccurate data. In web scraping contexts, this might mean:

  • Inventing prices, dates, or numbers that don't exist on the page
  • Creating product descriptions from generic knowledge instead of actual content
  • Fabricating links, emails, or contact information
  • Making assumptions about missing data rather than returning null values
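To make the last failure mode concrete, here is a tiny, hypothetical comparison (the page and values are invented for illustration): a grounded extraction returns null for data the page never shows, while a hallucinated one invents it.

```python
# Hypothetical page: it names a product but shows no price
page_html = '<div class="product"><h1>Widget</h1></div>'

hallucinated = {"name": "Widget", "price": 19.99}  # price was invented
grounded = {"name": "Widget", "price": None}       # missing data stays null

# A simple grounding check: extracted strings should occur in the source
assert str(hallucinated["price"]) not in page_html  # 19.99 never appears
assert grounded["name"] in page_html
```

The rest of this guide is about pushing the model toward the second output and catching the first programmatically.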

1. Use Structured Output Schemas

The most effective way to reduce hallucinations is to enforce strict output schemas using JSON mode or function calling. This constrains Deepseek to only return data in predefined formats.

Python Example with JSON Schema

import requests
import json

def extract_product_data(html_content):
    # Define strict schema
    schema = {
        "type": "object",
        "properties": {
            "name": {"type": "string"},
            "price": {"type": ["number", "null"]},
            "in_stock": {"type": ["boolean", "null"]},
            "description": {"type": "string"}
        },
        "required": ["name", "price", "in_stock", "description"],
        "additionalProperties": False
    }

    prompt = f"""Extract product information from this HTML.

CRITICAL RULES:
- Only extract information that is explicitly present in the HTML
- If a field is not found, set it to null
- Do not infer, assume, or generate any information
- Return ONLY the JSON object with no additional text

HTML:
{html_content}

Return JSON matching this schema:
{json.dumps(schema, indent=2)}"""

    response = requests.post(
        'https://api.deepseek.com/v1/chat/completions',
        headers={
            'Authorization': 'Bearer YOUR_API_KEY',  # replace with your real key
            'Content-Type': 'application/json'
        },
        json={
            "model": "deepseek-chat",
            "messages": [{"role": "user", "content": prompt}],
            "response_format": {"type": "json_object"},
            "temperature": 0.0  # Lower temperature reduces creativity/hallucination
        }
    )

    return json.loads(response.json()['choices'][0]['message']['content'])
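The schema above is sent to the model as guidance, but nothing stops a non-conforming reply from slipping through. A lightweight client-side check catches type drift before bad data enters your pipeline; this is a stdlib-only sketch (in production, the third-party `jsonschema` package does full JSON Schema validation):

```python
# Stdlib-only sketch of client-side schema checking
TYPE_MAP = {
    "string": (str,),
    "number": (int, float),
    "boolean": (bool,),
    "null": (type(None),),
}

def check_against_schema(data, schema):
    """Return a list of violations; an empty list means the payload conforms."""
    errors = []
    for field in schema.get("required", []):
        if field not in data:
            errors.append(f"missing required field: {field}")
    for field, value in data.items():
        spec = schema["properties"].get(field)
        if spec is None:
            errors.append(f"unexpected field: {field}")
            continue
        types = spec["type"] if isinstance(spec["type"], list) else [spec["type"]]
        # bool is a subclass of int in Python, so guard it explicitly
        if isinstance(value, bool) and "boolean" not in types:
            errors.append(f"{field}: got boolean, expected {types}")
            continue
        allowed = tuple(t for name in types for t in TYPE_MAP[name])
        if not isinstance(value, allowed):
            errors.append(f"{field}: expected {types}, got {type(value).__name__}")
    return errors

# The product schema from extract_product_data, repeated for a runnable demo
product_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "price": {"type": ["number", "null"]},
        "in_stock": {"type": ["boolean", "null"]},
        "description": {"type": "string"},
    },
    "required": ["name", "price", "in_stock", "description"],
    "additionalProperties": False,
}
```

A common real-world catch: a model that returns the price as the string "9.99" instead of the number 9.99 will fail this check and can be rejected or retried.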

JavaScript Example with Strict Typing

const axios = require('axios');

async function extractProductData(htmlContent) {
  const systemPrompt = `You are a precise data extraction tool.
Rules:
1. Extract ONLY information explicitly present in the provided HTML
2. Never infer, assume, or generate missing information
3. Use null for missing values
4. Return valid JSON only`;

  const userPrompt = `Extract product data from this HTML:

${htmlContent}

Return JSON with these exact fields:
{
  "name": string,
  "price": number | null,
  "currency": string | null,
  "availability": string | null,
  "sku": string | null
}`;

  const response = await axios.post(
    'https://api.deepseek.com/v1/chat/completions',
    {
      model: 'deepseek-chat',
      messages: [
        { role: 'system', content: systemPrompt },
        { role: 'user', content: userPrompt }
      ],
      response_format: { type: 'json_object' },
      temperature: 0.0,
      max_tokens: 1000
    },
    {
      headers: {
        'Authorization': `Bearer ${process.env.DEEPSEEK_API_KEY}`,
        'Content-Type': 'application/json'
      }
    }
  );

  return JSON.parse(response.data.choices[0].message.content);
}

2. Provide Explicit Anti-Hallucination Instructions

Your prompts should explicitly forbid hallucination with clear, direct instructions:

prompt = f"""Extract contact information from this webpage HTML.

STRICT RULES - READ CAREFULLY:
1. Extract ONLY information that appears verbatim in the HTML below
2. If you cannot find a field, return null - DO NOT guess or generate
3. DO NOT use your training knowledge to fill in missing information
4. DO NOT make assumptions about standard formats
5. If you're uncertain about any value, set it to null
6. Respond ONLY with valid JSON, no explanatory text

HTML Content:
{html_content}

Expected JSON format:
{{
  "email": "string or null",
  "phone": "string or null",
  "address": "string or null",
  "company_name": "string or null"
}}"""

3. Set Temperature to Zero

Temperature controls randomness in LLM outputs. For data extraction, always use temperature: 0.0 to get deterministic, consistent results:

response = requests.post(
    'https://api.deepseek.com/v1/chat/completions',
    json={
        "model": "deepseek-chat",
        "messages": messages,
        "temperature": 0.0,  # Maximum determinism
        "top_p": 1.0,
        "frequency_penalty": 0.0,
        "presence_penalty": 0.0
    }
)

4. Use Few-Shot Examples with Null Values

Provide examples that explicitly show how to handle missing data:

few_shot_prompt = """Extract pricing information from HTML snippets.

Example 1:
HTML: <div class="price">$29.99</div><span class="currency">USD</span>
Output: {"price": 29.99, "currency": "USD", "discount": null}

Example 2:
HTML: <div class="product">Great item</div>
Output: {"price": null, "currency": null, "discount": null}

Example 3:
HTML: <span class="sale">Was $50, now $35</span>
Output: {"price": 35.00, "currency": null, "discount": 15.00}

Now extract from this HTML:
{actual_html}
"""

# Note: str.format() would trip over the literal JSON braces in the examples
# above, so substitute the placeholder directly:
# prompt = few_shot_prompt.replace("{actual_html}", html_snippet)

5. Implement Multi-Step Validation

Extract data in stages with validation between steps:

def validated_extraction(html_content):
    # Step 1: Extract raw data
    extraction_prompt = f"""Extract all visible text content and prices from this HTML.
    Return JSON with 'text_content' and 'numeric_values' arrays.
    HTML: {html_content}"""

    raw_data = call_deepseek(extraction_prompt)

    # Step 2: Validate against source
    validation_prompt = f"""Given this HTML and extracted data, verify accuracy.

HTML:
{html_content}

Extracted Data:
{json.dumps(raw_data)}

For each extracted field, respond with:
- 'confirmed': value appears in HTML
- 'not_found': value not in HTML (potential hallucination)
- 'uncertain': unclear

Return JSON validation report."""

    validation = call_deepseek(validation_prompt)

    # Step 3: Filter out unconfirmed data
    cleaned_data = {
        k: v for k, v in raw_data.items()
        if validation.get(k) == 'confirmed'
    }

    return cleaned_data

6. Limit Context Window Size

Avoid overwhelming Deepseek with excessive HTML. Preprocess to extract relevant sections:

from bs4 import BeautifulSoup

def extract_relevant_html(full_html, target_selectors):
    """Extract only relevant HTML sections to reduce noise"""
    soup = BeautifulSoup(full_html, 'html.parser')

    relevant_parts = []
    for selector in target_selectors:
        elements = soup.select(selector)
        relevant_parts.extend([str(el) for el in elements])

    # Send only relevant HTML to Deepseek
    condensed_html = '\n'.join(relevant_parts)
    return condensed_html

# Usage
html_subset = extract_relevant_html(
    full_page_html,
    ['.product-info', '.pricing', '.description']
)

7. Use Regex Post-Validation

Validate extracted data against expected patterns:

import re

def validate_extracted_data(data):
    """Validate common data types and flag suspicious values"""
    validation_rules = {
        'email': r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$',
        'phone': r'^\+?[\d\s\-\(\)]{10,}$',
        'url': r'^https?://[^\s]+$',
        'price': r'^\d+(\.\d{2})?$'
    }

    validated = {}
    for field, value in data.items():
        if value is None:
            validated[field] = None
            continue

        if field in validation_rules:
            pattern = validation_rules[field]
            if re.match(pattern, str(value)):
                validated[field] = value
            else:
                print(f"Warning: {field} value '{value}' failed validation")
                validated[field] = None  # Reject invalid data
        else:
            validated[field] = value

    return validated

8. Implement Confidence Scoring

Ask Deepseek to rate its confidence in extracted values:

confidence_prompt = f"""Extract product information and rate your confidence for each field.

HTML:
{html_content}

Return JSON format:
{{
  "data": {{
    "name": "value",
    "price": value
  }},
  "confidence": {{
    "name": 0.0-1.0,
    "price": 0.0-1.0
  }}
}}

Confidence rules:
- 1.0: Value explicitly present in HTML
- 0.5-0.9: Value present but requires interpretation
- <0.5: Value uncertain or inferred
"""

# Filter out low-confidence extractions
result = call_deepseek(confidence_prompt)
high_confidence_data = {
    k: v for k, v in result['data'].items()
    if result['confidence'].get(k, 0) >= 0.8
}

9. Use Deterministic Fallbacks

Combine Deepseek with traditional parsing methods for validation:

from bs4 import BeautifulSoup
import json
import re

def hybrid_extraction(html_content):
    # Traditional extraction
    soup = BeautifulSoup(html_content, 'html.parser')
    traditional_price = soup.select_one('.price')
    traditional_price = traditional_price.text if traditional_price else None

    # LLM extraction
    llm_result = extract_with_deepseek(html_content)

    # Cross-validate
    if traditional_price and llm_result.get('price'):
        # Compare traditional vs LLM extraction
        traditional_clean = re.sub(r'[^\d.]', '', traditional_price)
        # Guard against non-numeric text like "Call for price"
        if traditional_clean and abs(float(traditional_clean) - llm_result['price']) > 0.01:
            print("Warning: Price mismatch between methods")
            return None  # Reject conflicting data

    return llm_result

10. Monitor and Log Hallucinations

Track extraction quality over time:

import logging
from datetime import datetime

class HallucinationMonitor:
    def __init__(self):
        self.logger = logging.getLogger('hallucination_monitor')

    def log_extraction(self, html, extracted_data, source_url):
        """Log extractions for later review"""
        log_entry = {
            'timestamp': datetime.now().isoformat(),
            'url': source_url,
            'extracted': extracted_data,
            'html_length': len(html),
            'null_fields': [k for k, v in extracted_data.items() if v is None]
        }

        self.logger.info(json.dumps(log_entry))

        # Heuristic: real pages rarely fill every field, so records with
        # zero nulls across many extractions may signal over-eager filling
        if len(log_entry['null_fields']) == 0:
            self.logger.warning("No null fields - possible hallucination")

    def validate_against_samples(self, extracted_data, known_good_samples):
        """Compare against manually verified samples"""
        for sample in known_good_samples:
            if sample['url'] == extracted_data.get('url'):
                mismatches = [
                    k for k in sample.keys()
                    if sample[k] != extracted_data.get(k)
                ]
                if mismatches:
                    self.logger.error(f"Hallucination detected: {mismatches}")

Best Practices Summary

  1. Always use temperature: 0.0 for deterministic outputs
  2. Enforce JSON schemas with strict validation
  3. Explicitly forbid hallucination in your prompts
  4. Provide examples with null values to train proper handling of missing data
  5. Validate outputs using regex, traditional parsing, or multi-step verification
  6. Limit input size by preprocessing HTML to relevant sections
  7. Monitor extraction quality and log suspicious results
  8. Use confidence scoring to filter uncertain extractions

Conclusion

Minimizing hallucination in Deepseek-powered data extraction requires a multi-layered approach: strict prompting, schema validation, low temperature settings, and traditional verification methods. By implementing these strategies, you can significantly improve the accuracy and reliability of your LLM-based web scraping workflows.

For complex scenarios involving dynamic content, consider combining Deepseek with browser automation tools for more reliable data extraction. Remember that while LLMs are powerful for handling unstructured data, they should always be paired with validation mechanisms to ensure data integrity.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
