How to Minimize LLM Hallucination When Using Deepseek for Data Extraction
LLM hallucination—when language models generate false or fabricated information—is a critical challenge when using Deepseek or any large language model for web scraping and data extraction. This guide provides practical strategies to minimize hallucinations and ensure accurate, reliable data extraction with Deepseek.
Understanding LLM Hallucination in Data Extraction
Hallucination occurs when an LLM "fills in the gaps" with plausible-sounding but inaccurate data. In web scraping contexts, this might mean:
- Inventing prices, dates, or numbers that don't exist on the page
- Creating product descriptions from generic knowledge instead of actual content
- Fabricating links, emails, or contact information
- Making assumptions about missing data rather than returning null values
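
As a concrete illustration (the HTML snippet and values below are hypothetical), consider a page that lists a product name but no price. A faithful extraction returns null for the missing field; a hallucinating one invents a plausible number that cannot be traced back to the source:

```python
def appears_in_source(value, html: str) -> bool:
    """Crude grounding check: a scalar value should be traceable to the raw HTML."""
    return value is None or str(value) in html

html = '<div class="product"><h2>Acme Widget</h2></div>'  # no price on the page

faithful = {"name": "Acme Widget", "price": None}       # missing data -> null
hallucinated = {"name": "Acme Widget", "price": 19.99}  # invented number

print(all(appears_in_source(v, html) for v in faithful.values()))      # True
print(all(appears_in_source(v, html) for v in hallucinated.values()))  # False
```

Every technique in this guide is ultimately a way of pushing the model toward the first output and catching the second.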
1. Use Structured Output Schemas
The most effective way to reduce hallucinations is to enforce strict output schemas using JSON mode or function calling. This constrains Deepseek to only return data in predefined formats.
Python Example with JSON Schema
```python
import json
import os

import requests

def extract_product_data(html_content):
    # Define a strict schema; nullable types let the model return null for missing data
    schema = {
        "type": "object",
        "properties": {
            "name": {"type": "string"},
            "price": {"type": ["number", "null"]},
            "in_stock": {"type": ["boolean", "null"]},
            "description": {"type": ["string", "null"]}
        },
        "required": ["name", "price", "in_stock", "description"],
        "additionalProperties": False
    }

    prompt = f"""Extract product information from this HTML.

CRITICAL RULES:
- Only extract information that is explicitly present in the HTML
- If a field is not found, set it to null
- Do not infer, assume, or generate any information
- Return ONLY the JSON object with no additional text

HTML:
{html_content}

Return JSON matching this schema:
{json.dumps(schema, indent=2)}"""

    response = requests.post(
        'https://api.deepseek.com/v1/chat/completions',
        headers={
            'Authorization': f"Bearer {os.environ['DEEPSEEK_API_KEY']}",
            'Content-Type': 'application/json'
        },
        json={
            "model": "deepseek-chat",
            "messages": [{"role": "user", "content": prompt}],
            "response_format": {"type": "json_object"},
            "temperature": 0.0  # Lower temperature reduces creativity/hallucination
        }
    )
    return json.loads(response.json()['choices'][0]['message']['content'])
```
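
Embedding the schema in the prompt constrains the model, but nothing yet stops a malformed response from slipping through on the client side. A minimal stdlib-only conformance check might look like the sketch below (in practice a library such as `jsonschema` does this far more thoroughly; the function name here is illustrative):

```python
TYPE_MAP = {"string": str, "boolean": bool, "null": type(None)}

def _matches(value, type_name: str) -> bool:
    if type_name == "number":
        # Exclude bool explicitly: Python bools are a subclass of int
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    return isinstance(value, TYPE_MAP[type_name])

def check_against_schema(data: dict, schema: dict) -> list:
    """Return a list of violations; an empty list means the data conforms."""
    errors = []
    props = schema.get("properties", {})
    for field in schema.get("required", []):
        if field not in data:
            errors.append(f"missing required field: {field}")
    if not schema.get("additionalProperties", True):
        errors += [f"unexpected field: {f}" for f in data if f not in props]
    for field, spec in props.items():
        if field in data:
            types = spec["type"] if isinstance(spec["type"], list) else [spec["type"]]
            if not any(_matches(data[field], t) for t in types):
                errors.append(f"{field}: expected {types}")
    return errors

# Usage
schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "price": {"type": ["number", "null"]}},
    "required": ["name", "price"],
    "additionalProperties": False
}
print(check_against_schema({"name": "Widget", "price": None}, schema))     # []
print(check_against_schema({"name": "Widget", "price": "cheap"}, schema))  # one violation reported
```

Rejecting or re-requesting any response with a non-empty violation list catches a whole class of hallucinated output before it enters your pipeline.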
JavaScript Example with Strict Typing
```javascript
const axios = require('axios');

async function extractProductData(htmlContent) {
  const systemPrompt = `You are a precise data extraction tool.
Rules:
1. Extract ONLY information explicitly present in the provided HTML
2. Never infer, assume, or generate missing information
3. Use null for missing values
4. Return valid JSON only`;

  const userPrompt = `Extract product data from this HTML:
${htmlContent}

Return JSON with these exact fields:
{
  "name": string,
  "price": number | null,
  "currency": string | null,
  "availability": string | null,
  "sku": string | null
}`;

  const response = await axios.post(
    'https://api.deepseek.com/v1/chat/completions',
    {
      model: 'deepseek-chat',
      messages: [
        { role: 'system', content: systemPrompt },
        { role: 'user', content: userPrompt }
      ],
      response_format: { type: 'json_object' },
      temperature: 0.0,
      max_tokens: 1000
    },
    {
      headers: {
        'Authorization': `Bearer ${process.env.DEEPSEEK_API_KEY}`,
        'Content-Type': 'application/json'
      }
    }
  );

  return JSON.parse(response.data.choices[0].message.content);
}
```
2. Provide Explicit Anti-Hallucination Instructions
Your prompts should explicitly forbid hallucination with clear, direct instructions:
```python
prompt = f"""Extract contact information from this webpage HTML.

STRICT RULES - READ CAREFULLY:
1. Extract ONLY information that appears verbatim in the HTML below
2. If you cannot find a field, return null - DO NOT guess or generate
3. DO NOT use your training knowledge to fill in missing information
4. DO NOT make assumptions about standard formats
5. If you're uncertain about any value, set it to null
6. Respond ONLY with valid JSON, no explanatory text

HTML Content:
{html_content}

Expected JSON format:
{{
  "email": "string or null",
  "phone": "string or null",
  "address": "string or null",
  "company_name": "string or null"
}}"""
```
3. Set Temperature to Zero
Temperature controls randomness in LLM outputs. For data extraction, always set `temperature` to `0.0` to get the most deterministic, consistent results:

```python
response = requests.post(
    'https://api.deepseek.com/v1/chat/completions',
    json={
        "model": "deepseek-chat",
        "messages": messages,
        "temperature": 0.0,  # Maximum determinism
        "top_p": 1.0,
        "frequency_penalty": 0.0,
        "presence_penalty": 0.0
    }
)
```
4. Use Few-Shot Examples with Null Values
Provide examples that explicitly show how to handle missing data:
```python
few_shot_template = """Extract pricing information from HTML snippets.

Example 1:
HTML: <div class="price">$29.99</div><span class="currency">USD</span>
Output: {"price": 29.99, "currency": "USD", "discount": null}

Example 2:
HTML: <div class="product">Great item</div>
Output: {"price": null, "currency": null, "discount": null}

Example 3:
HTML: <span class="sale">Was $50, now $35</span>
Output: {"price": 35.00, "currency": null, "discount": 15.00}

Now extract from this HTML:
"""

# Append the target HTML directly; an f-string would require escaping
# the literal braces in the example outputs above
few_shot_prompt = few_shot_template + actual_html
```
5. Implement Multi-Step Validation
Extract data in stages with validation between steps:
```python
import json

def validated_extraction(html_content):
    # Step 1: Extract raw data
    extraction_prompt = f"""Extract all visible text content and prices from this HTML.
Return JSON with 'text_content' and 'numeric_values' arrays.

HTML: {html_content}"""
    raw_data = call_deepseek(extraction_prompt)

    # Step 2: Validate against source
    validation_prompt = f"""Given this HTML and extracted data, verify accuracy.

HTML:
{html_content}

Extracted Data:
{json.dumps(raw_data)}

For each extracted field, respond with:
- 'confirmed': value appears in HTML
- 'not_found': value not in HTML (potential hallucination)
- 'uncertain': unclear

Return JSON validation report."""
    validation = call_deepseek(validation_prompt)

    # Step 3: Keep only fields the validation pass confirmed
    cleaned_data = {
        k: v for k, v in raw_data.items()
        if validation.get(k) == 'confirmed'
    }
    return cleaned_data
```
6. Limit Context Window Size
Avoid overwhelming Deepseek with excessive HTML. Preprocess to extract relevant sections:
```python
from bs4 import BeautifulSoup

def extract_relevant_html(full_html, target_selectors):
    """Extract only relevant HTML sections to reduce noise"""
    soup = BeautifulSoup(full_html, 'html.parser')
    relevant_parts = []
    for selector in target_selectors:
        elements = soup.select(selector)
        relevant_parts.extend([str(el) for el in elements])
    # Send only the relevant HTML to Deepseek
    condensed_html = '\n'.join(relevant_parts)
    return condensed_html

# Usage
html_subset = extract_relevant_html(
    full_page_html,
    ['.product-info', '.pricing', '.description']
)
```
7. Use Regex Post-Validation
Validate extracted data against expected patterns:
```python
import re

def validate_extracted_data(data):
    """Validate common data types and flag suspicious values"""
    validation_rules = {
        'email': r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$',
        'phone': r'^\+?[\d\s\-()]{10,}$',
        'url': r'^https?://\S+$',
        'price': r'^\d+(\.\d{2})?$'
    }
    validated = {}
    for field, value in data.items():
        if value is None:
            validated[field] = None
            continue
        if field in validation_rules:
            if re.match(validation_rules[field], str(value)):
                validated[field] = value
            else:
                print(f"Warning: {field} value '{value}' failed validation")
                validated[field] = None  # Reject invalid data
        else:
            validated[field] = value
    return validated
```
8. Implement Confidence Scoring
Ask Deepseek to rate its confidence in extracted values:
```python
confidence_prompt = f"""Extract product information and rate your confidence for each field.

HTML:
{html_content}

Return JSON format:
{{
  "data": {{
    "name": "value",
    "price": value
  }},
  "confidence": {{
    "name": 0.0-1.0,
    "price": 0.0-1.0
  }}
}}

Confidence rules:
- 1.0: Value explicitly present in HTML
- 0.5-0.9: Value present but requires interpretation
- <0.5: Value uncertain or inferred
"""

# Filter out low-confidence extractions
result = call_deepseek(confidence_prompt)
high_confidence_data = {
    k: v for k, v in result['data'].items()
    if result['confidence'].get(k, 0) >= 0.8
}
```
9. Use Deterministic Fallbacks
Combine Deepseek with traditional parsing methods for validation:
```python
import re

from bs4 import BeautifulSoup

def hybrid_extraction(html_content):
    # Traditional extraction
    soup = BeautifulSoup(html_content, 'html.parser')
    traditional_price = soup.select_one('.price')
    traditional_price = traditional_price.text if traditional_price else None

    # LLM extraction
    llm_result = extract_with_deepseek(html_content)

    # Cross-validate the two methods
    if traditional_price and llm_result.get('price'):
        traditional_clean = re.sub(r'[^\d.]', '', traditional_price)
        if abs(float(traditional_clean) - llm_result['price']) > 0.01:
            print("Warning: Price mismatch between methods")
            return None  # Reject conflicting data

    return llm_result
```
10. Monitor and Log Hallucinations
Track extraction quality over time:
```python
import json
import logging
from datetime import datetime

class HallucinationMonitor:
    def __init__(self):
        self.logger = logging.getLogger('hallucination_monitor')

    def log_extraction(self, html, extracted_data, source_url):
        """Log extractions for later review"""
        log_entry = {
            'timestamp': datetime.now().isoformat(),
            'url': source_url,
            'extracted': extracted_data,
            'html_length': len(html),
            'null_fields': [k for k, v in extracted_data.items() if v is None]
        }
        self.logger.info(json.dumps(log_entry))

        # Flag suspicious patterns: a page that fills every field is worth a second look
        if len(log_entry['null_fields']) == 0:
            self.logger.warning("No null fields - possible hallucination")

    def validate_against_samples(self, extracted_data, known_good_samples):
        """Compare against manually verified samples"""
        for sample in known_good_samples:
            if sample['url'] == extracted_data.get('url'):
                mismatches = [
                    k for k in sample.keys()
                    if sample[k] != extracted_data.get(k)
                ]
                if mismatches:
                    self.logger.error(f"Hallucination detected: {mismatches}")
```
Best Practices Summary
- Always use `temperature: 0.0` for deterministic outputs
- Enforce JSON schemas with strict validation
- Explicitly forbid hallucination in your prompts
- Provide examples with null values to demonstrate proper handling of missing data
- Validate outputs using regex, traditional parsing, or multi-step verification
- Limit input size by preprocessing HTML to relevant sections
- Monitor extraction quality and log suspicious results
- Use confidence scoring to filter uncertain extractions
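
The practices above compose naturally into a single layered pipeline. The sketch below ties a few of them together; the `llm_call` parameter stands in for whatever Deepseek client wrapper you use, and all names are illustrative rather than part of any API:

```python
import json

def safe_extract(html: str, llm_call) -> dict:
    """Layered extraction: strict prompt -> JSON parse -> grounding check."""
    prompt = (
        "Extract product data as JSON with fields name, price. "
        "Use null for anything not explicitly present in the HTML.\n" + html
    )
    try:
        data = json.loads(llm_call(prompt))  # schema-constrained call, temperature 0
    except (json.JSONDecodeError, TypeError):
        return {}  # reject non-JSON responses outright
    if not isinstance(data, dict):
        return {}
    cleaned = {}
    for field, value in data.items():
        if value is None or str(value) in html:  # ground each value in the source
            cleaned[field] = value
        else:
            cleaned[field] = None  # drop values that cannot be traced to the HTML
    return cleaned

# Usage with a stub in place of a real Deepseek call:
fake_llm = lambda prompt: '{"name": "Widget", "price": 99.99}'
html = '<h2>Widget</h2>'  # the page has no price at all
print(safe_extract(html, fake_llm))  # {'name': 'Widget', 'price': None}
```

The hallucinated price is silently nulled rather than passed downstream; in production you would likely log it via something like the monitoring class from section 10 instead of discarding it silently.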
Conclusion
Minimizing hallucination in Deepseek-powered data extraction requires a multi-layered approach: strict prompting, schema validation, low temperature settings, and traditional verification methods. By implementing these strategies, you can significantly improve the accuracy and reliability of your LLM-based web scraping workflows.
For complex scenarios involving dynamic content, consider combining Deepseek with browser automation tools for more reliable data extraction. Remember that while LLMs are powerful for handling unstructured data, they should always be paired with validation mechanisms to ensure data integrity.