What is Structured Data Extraction with LLM and How Does Deepseek Support It?
Structured data extraction with Large Language Models (LLMs) is a transformative approach to converting unstructured or semi-structured web content into well-defined, machine-readable formats like JSON, XML, or CSV. Unlike traditional web scraping methods that rely on brittle CSS selectors or XPath expressions, LLM-based extraction uses natural language understanding to identify and extract relevant information from HTML content, making it more resilient to website layout changes.
Deepseek, a powerful and cost-effective LLM, provides robust support for structured data extraction through function calling, JSON mode, and advanced prompt engineering capabilities. This guide explores how structured data extraction works with LLMs and demonstrates practical implementation patterns using Deepseek.
Understanding Structured Data Extraction with LLMs
Traditional web scraping requires developers to manually identify HTML selectors for each piece of data they want to extract. This approach becomes problematic when:
- Website layouts change frequently
- Data appears in inconsistent formats across pages
- Content is dynamically generated or embedded in complex structures
- You need to extract semantic meaning rather than just raw text
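To make the brittleness concrete, here is a minimal, hypothetical illustration (the markup and class names are invented): a selector-style pattern hardwired to one class name silently stops matching after a simple rename, while an instruction like "extract the product price" given to an LLM would survive the redesign.

```python
import re

# Hypothetical pages: the same product rendered before and after a redesign.
old_html = '<span class="price">$299.99</span>'
new_html = '<span class="product-cost">$299.99</span>'

# A selector-style pattern hardwired to the old class name.
pattern = re.compile(r'class="price">\$([\d.]+)')

print(pattern.search(old_html) is not None)  # matches the old layout
print(pattern.search(new_html) is not None)  # breaks after the redesign
```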
LLM-based structured data extraction solves these challenges by using the model's natural language understanding to:
- Parse unstructured content and identify relevant information
- Transform data into predefined schemas
- Handle variations in content structure automatically
- Extract semantic relationships between data points
- Validate and normalize extracted data
How Deepseek Supports Structured Data Extraction
Deepseek offers several powerful features for structured data extraction:
1. Function Calling (Tool Use)
Deepseek supports function calling, allowing you to define JSON schemas that the model will populate with extracted data. This is the most reliable method for structured extraction.
import requests
import json

def extract_product_data(html_content):
    """Extract structured product data using Deepseek function calling"""
    # Define the schema for product data
    tools = [{
        "type": "function",
        "function": {
            "name": "save_product",
            "description": "Save extracted product information",
            "parameters": {
                "type": "object",
                "properties": {
                    "name": {
                        "type": "string",
                        "description": "Product name"
                    },
                    "price": {
                        "type": "number",
                        "description": "Product price in USD"
                    },
                    "rating": {
                        "type": "number",
                        "description": "Product rating (0-5)"
                    },
                    "availability": {
                        "type": "string",
                        "enum": ["in_stock", "out_of_stock", "preorder"],
                        "description": "Product availability status"
                    },
                    "features": {
                        "type": "array",
                        "items": {"type": "string"},
                        "description": "List of key product features"
                    }
                },
                "required": ["name", "price", "availability"]
            }
        }
    }]

    response = requests.post(
        "https://api.deepseek.com/v1/chat/completions",
        headers={
            "Authorization": "Bearer YOUR_API_KEY",
            "Content-Type": "application/json"
        },
        json={
            "model": "deepseek-chat",
            "messages": [
                {
                    "role": "user",
                    "content": f"Extract product information from this HTML:\n\n{html_content}"
                }
            ],
            "tools": tools,
            "tool_choice": "auto"
        }
    )

    result = response.json()

    # Extract the function call arguments
    if result["choices"][0]["message"].get("tool_calls"):
        tool_call = result["choices"][0]["message"]["tool_calls"][0]
        product_data = json.loads(tool_call["function"]["arguments"])
        return product_data
    return None

# Example usage
html = """
<div class="product">
    <h1>Premium Wireless Headphones</h1>
    <span class="price">$299.99</span>
    <div class="rating">4.5 stars</div>
    <p class="stock">In Stock</p>
    <ul class="features">
        <li>Active Noise Cancellation</li>
        <li>40-hour battery life</li>
        <li>Bluetooth 5.0</li>
    </ul>
</div>
"""

product = extract_product_data(html)
print(json.dumps(product, indent=2))
2. JSON Mode for Structured Output
Deepseek also supports a JSON output mode (`response_format: {"type": "json_object"}`) that guarantees the model returns valid JSON. Note that, per the Deepseek API documentation, the word "json" must appear somewhere in your prompt for this mode to work:
const axios = require('axios');

async function extractArticleMetadata(htmlContent) {
  const response = await axios.post(
    'https://api.deepseek.com/v1/chat/completions',
    {
      model: 'deepseek-chat',
      messages: [
        {
          role: 'system',
          content: `You are a data extraction assistant. Extract article metadata and return it as JSON with these fields:
- title (string)
- author (string)
- publishDate (ISO date string)
- category (string)
- tags (array of strings)
- wordCount (number)
- excerpt (string, max 200 chars)`
        },
        {
          role: 'user',
          content: `Extract metadata from this HTML:\n\n${htmlContent}`
        }
      ],
      response_format: { type: 'json_object' },
      temperature: 0.1
    },
    {
      headers: {
        'Authorization': `Bearer ${process.env.DEEPSEEK_API_KEY}`,
        'Content-Type': 'application/json'
      }
    }
  );

  return JSON.parse(response.data.choices[0].message.content);
}

// Example usage
const html = `
<article>
  <h1>Understanding AI in Web Scraping</h1>
  <div class="author">By Jane Smith</div>
  <time datetime="2025-01-15">January 15, 2025</time>
  <div class="category">Technology</div>
  <div class="tags">AI, Web Scraping, Machine Learning</div>
  <p>Artificial intelligence is revolutionizing how we extract data from websites...</p>
</article>
`;

extractArticleMetadata(html).then(metadata => {
  console.log(JSON.stringify(metadata, null, 2));
});
Advanced Prompt Engineering for Structured Extraction
The quality of structured data extraction heavily depends on prompt engineering. Here are proven patterns for Deepseek:
Pattern 1: Schema-First Extraction
Provide the exact schema upfront in your system prompt:
def extract_with_schema(html_content, schema):
    """Extract data according to a predefined schema"""
    system_prompt = f"""You are a precise data extraction system.
Extract information from HTML and return ONLY valid JSON matching this exact schema:

{json.dumps(schema, indent=2)}

Rules:
- Use null for missing values
- Convert dates to ISO 8601 format
- Extract numbers without currency symbols or commas
- Preserve arrays even if empty []
- Do not add fields not in the schema"""

    response = requests.post(
        "https://api.deepseek.com/v1/chat/completions",
        headers={
            "Authorization": "Bearer YOUR_API_KEY",
            "Content-Type": "application/json"
        },
        json={
            "model": "deepseek-chat",
            "messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": f"HTML:\n{html_content}"}
            ],
            "response_format": {"type": "json_object"},
            "temperature": 0.0
        }
    )

    return response.json()["choices"][0]["message"]["content"]
Pattern 2: Few-Shot Learning for Complex Extractions
For complex extraction tasks, provide examples:
def extract_with_examples(html_content):
    """Use few-shot learning for better extraction accuracy"""
    messages = [
        {
            "role": "system",
            "content": "Extract event information from HTML and return structured JSON."
        },
        {
            "role": "user",
            "content": '<div><h2>Tech Conference 2025</h2><p>Date: March 15-17, 2025</p><p>Location: San Francisco, CA</p></div>'
        },
        {
            "role": "assistant",
            "content": json.dumps({
                "name": "Tech Conference 2025",
                "startDate": "2025-03-15",
                "endDate": "2025-03-17",
                "location": {
                    "city": "San Francisco",
                    "state": "CA",
                    "country": "USA"
                }
            })
        },
        {
            "role": "user",
            "content": html_content
        }
    ]

    response = requests.post(
        "https://api.deepseek.com/v1/chat/completions",
        headers={
            "Authorization": "Bearer YOUR_API_KEY",
            "Content-Type": "application/json"
        },
        json={
            "model": "deepseek-chat",
            "messages": messages,
            "response_format": {"type": "json_object"},
            "temperature": 0.1
        }
    )

    return json.loads(response.json()["choices"][0]["message"]["content"])
Handling Batch Extraction
When extracting data from multiple similar elements (like product listings), use this pattern:
async function extractMultipleProducts(htmlContent) {
  const response = await axios.post(
    'https://api.deepseek.com/v1/chat/completions',
    {
      model: 'deepseek-chat',
      messages: [
        {
          role: 'system',
          content: `Extract ALL products from the HTML. Return a JSON object with a "products" array.
Each product should have: id, name, price, imageUrl, rating, reviewCount.
If a field is missing, use null.`
        },
        {
          role: 'user',
          content: htmlContent
        }
      ],
      response_format: { type: 'json_object' },
      temperature: 0.0
    },
    {
      headers: {
        'Authorization': `Bearer ${process.env.DEEPSEEK_API_KEY}`,
        'Content-Type': 'application/json'
      }
    }
  );

  const data = JSON.parse(response.data.choices[0].message.content);
  return data.products || [];
}
Error Handling and Validation
Always validate extracted data and handle potential errors:
from pydantic import BaseModel, ValidationError, Field
from typing import List, Optional

class Product(BaseModel):
    name: str
    price: float = Field(gt=0)
    rating: Optional[float] = Field(None, ge=0, le=5)
    availability: str
    features: List[str] = []

def extract_and_validate(html_content):
    """Extract data with validation"""
    try:
        # Extract using Deepseek
        raw_data = extract_product_data(html_content)
        # Validate with Pydantic (use product.model_dump() on Pydantic v2)
        product = Product(**raw_data)
        return product.dict()
    except ValidationError as e:
        print(f"Validation error: {e}")
        return None
    except Exception as e:
        print(f"Extraction error: {e}")
        return None
Combining LLM Extraction with Traditional Scraping
For optimal results, combine traditional web scraping techniques with LLM-based extraction:
from bs4 import BeautifulSoup

def hybrid_extraction(url):
    """Combine traditional scraping with LLM extraction"""
    # Step 1: Fetch HTML with traditional tools
    html = fetch_html(url)  # Your fetching logic

    # Step 2: Pre-process with BeautifulSoup to reduce token usage
    soup = BeautifulSoup(html, 'html.parser')
    main_content = soup.find('main') or soup.find('article')

    if main_content:
        # Only send relevant content to LLM
        cleaned_html = str(main_content)
    else:
        cleaned_html = html

    # Step 3: Extract structured data with Deepseek
    structured_data = extract_with_schema(cleaned_html, YOUR_SCHEMA)
    return structured_data
Best Practices for Deepseek Structured Extraction
- Minimize Token Usage: Pre-clean HTML to remove scripts, styles, and irrelevant elements
- Use Low Temperature: Set temperature to 0.0-0.1 for consistent, deterministic extraction
- Define Strict Schemas: Use function calling with detailed parameter descriptions
- Implement Retry Logic: Handle API errors and rate limits gracefully
- Validate Output: Always validate extracted JSON against your schema
- Cache Results: Cache extraction results to minimize API costs
- Monitor Costs: Track token usage and implement budget controls
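The retry point in the list above can be sketched as a small helper. This is a minimal illustration, not tied to any particular HTTP client; `with_retries` and its parameters are invented names:

```python
import random
import time

def with_retries(call, max_attempts=4, base_delay=1.0):
    """Retry a flaky API call with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:  # in practice, narrow this to your HTTP client's errors
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Wait 1x, 2x, 4x ... the base delay, with a little random jitter
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
```

A Deepseek call can then be wrapped as `product = with_retries(lambda: extract_product_data(html))`.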
Cost Optimization Strategies
Deepseek is significantly cheaper than alternatives like GPT-4, but costs can still add up:
def estimate_extraction_cost(html_content, price_per_million_tokens=0.14):
    """Estimate cost before extraction"""
    # Rough estimation: 1 token ≈ 4 characters
    input_tokens = len(html_content) / 4
    estimated_output_tokens = 500  # Adjust based on schema
    total_tokens = input_tokens + estimated_output_tokens
    cost = (total_tokens / 1_000_000) * price_per_million_tokens

    return {
        'estimated_tokens': int(total_tokens),
        'estimated_cost_usd': round(cost, 6)
    }
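Since input tokens dominate that estimate, a rough pre-cleaning pass that strips scripts, styles, and comments before the HTML reaches the model directly cuts cost. Below is an illustrative stdlib-only sketch (`strip_noise` is an invented name); a real HTML parser such as BeautifulSoup, as in the hybrid example earlier, is more robust for production use:

```python
import re

def strip_noise(html):
    """Drop scripts, styles, and comments, then collapse whitespace."""
    html = re.sub(r'<(script|style)\b[^>]*>.*?</\1>', '', html,
                  flags=re.DOTALL | re.IGNORECASE)
    html = re.sub(r'<!--.*?-->', '', html, flags=re.DOTALL)
    return re.sub(r'\s+', ' ', html).strip()

page = ('<html><script>track();</script><style>p{}</style>'
        '<main><p>Widget  $9.99</p></main></html>')
cleaned = strip_noise(page)
print(cleaned)  # only the content worth sending to the model remains
```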
When to Use LLM-Based vs Traditional Extraction
Use LLM extraction when:
- Website structures vary significantly
- You need semantic understanding of content
- Data appears in natural language (reviews, descriptions)
- Layout changes frequently
Use traditional extraction when:
- Website structure is consistent
- You need real-time, low-latency extraction
- Budget is extremely constrained
- Data is in clearly defined HTML elements
For complex scenarios, consider integrating both approaches with tools that handle dynamic content alongside LLM-based extraction for optimal results.
Conclusion
Structured data extraction with LLMs like Deepseek represents a paradigm shift in web scraping, offering flexibility and resilience that traditional methods cannot match. By leveraging function calling, JSON mode, and sophisticated prompt engineering, developers can build extraction systems that adapt to changes automatically while maintaining high accuracy.
Deepseek's competitive pricing and strong performance make it an excellent choice for production web scraping workflows, especially when combined with proper validation, error handling, and cost optimization strategies.