How do I Perform LLM Data Extraction Using Deepseek?
LLM data extraction with Deepseek involves using the Deepseek API to intelligently parse and extract structured information from unstructured web content. Unlike traditional web scraping that relies on CSS selectors or XPath, Deepseek's language models can understand context, handle varying HTML structures, and extract data based on semantic meaning.
What is LLM Data Extraction?
LLM (Large Language Model) data extraction leverages AI to understand and extract information from web pages without requiring rigid selectors (a short contrast sketch follows the list below). This approach is particularly useful when:
- Web page structures frequently change
- Data isn't consistently formatted
- You need to extract information based on context rather than exact position
- Multiple variations of the same type of content exist
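The difference is easy to see in a small sketch (the class names and HTML snippets below are illustrative, not from a real site): a CSS selector breaks as soon as the markup changes, while a semantic prompt describes what to extract rather than where it lives:
from bs4 import BeautifulSoup

html_a = '<span class="price">$29.99</span>'
html_b = '<div class="cost">Now only $29.99!</div>'

# Selector-based extraction is tied to one exact class name
print(BeautifulSoup(html_a, 'html.parser').select_one('.price'))  # matches
print(BeautifulSoup(html_b, 'html.parser').select_one('.price'))  # None: the layout changed

# LLM-based extraction: the same semantic prompt covers both variants
prompt = f"Extract the product price as JSON from this HTML: {html_b}"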
Setting Up Deepseek for Data Extraction
Getting Your API Key
First, obtain your Deepseek API key from the Deepseek platform. You'll need this for authentication.
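The examples below inline the key for brevity; in practice you would usually read it from an environment variable instead (DEEPSEEK_API_KEY here is an illustrative name, not something the SDK picks up on its own):
import os
from openai import OpenAI

# Read the key from the environment instead of hard-coding it
client = OpenAI(
    api_key=os.environ["DEEPSEEK_API_KEY"],  # illustrative variable name
    base_url="https://api.deepseek.com"
)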
Installation
Python:
pip install openai requests  # Deepseek uses an OpenAI-compatible API; requests fetches pages
JavaScript/Node.js:
npm install openai axios
Basic Data Extraction with Deepseek
Python Example
Here's a complete example of extracting product information from HTML:
import json
import requests
from openai import OpenAI

# Initialize the Deepseek client
client = OpenAI(
    api_key="your-deepseek-api-key",
    base_url="https://api.deepseek.com"
)

# Fetch the HTML content
url = "https://example.com/product-page"
html_content = requests.get(url).text
# Create the extraction prompt (truncate the HTML to stay within token limits)
prompt = f"""
Extract the following information from this product page:
- Product name
- Price
- Description
- Availability status
- Customer rating

Return the data in JSON format.

HTML:
{html_content[:8000]}
"""
# Call the Deepseek API
response = client.chat.completions.create(
    model="deepseek-chat",
    messages=[
        {"role": "system", "content": "You are a data extraction assistant that returns valid JSON."},
        {"role": "user", "content": prompt}
    ],
    temperature=0.0,  # Low temperature for consistent extraction
    response_format={"type": "json_object"}  # Ensures JSON output
)
# Parse the extracted JSON string into a dict
extracted_data = json.loads(response.choices[0].message.content)
print(extracted_data)
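Because the response follows the OpenAI-compatible shape, it should also include a usage object, which is worth logging with an eye toward the cost tips later in this article:
# Token accounting (OpenAI-compatible usage fields)
print(response.usage.prompt_tokens, response.usage.completion_tokens)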
JavaScript Example
const OpenAI = require('openai');
const axios = require('axios');

// Initialize the Deepseek client
const client = new OpenAI({
  apiKey: 'your-deepseek-api-key',
  baseURL: 'https://api.deepseek.com'
});

async function extractProductData(url) {
  // Fetch the HTML content
  const response = await axios.get(url);
  const htmlContent = response.data;

  // Create the extraction prompt (truncated to stay within token limits)
  const prompt = `
Extract the following information from this product page:
- Product name
- Price
- Description
- Availability status
- Customer rating

Return the data in JSON format.

HTML:
${htmlContent.substring(0, 8000)}
`;

  // Call the Deepseek API
  const completion = await client.chat.completions.create({
    model: 'deepseek-chat',
    messages: [
      { role: 'system', content: 'You are a data extraction assistant that returns valid JSON.' },
      { role: 'user', content: prompt }
    ],
    temperature: 0.0,
    response_format: { type: 'json_object' }
  });

  // Parse and return the extracted data
  return JSON.parse(completion.choices[0].message.content);
}

// Usage
extractProductData('https://example.com/product-page')
  .then(data => console.log(data))
  .catch(error => console.error('Extraction error:', error));
Advanced Data Extraction Techniques
Using Few-Shot Examples
Providing examples improves extraction accuracy:
prompt = f"""
I need to extract product information. Here are examples of the expected format:

Example 1:
Input: <h1>Wireless Mouse</h1><span class="price">$29.99</span>
Output: {{"name": "Wireless Mouse", "price": 29.99}}

Example 2:
Input: <div class="product">Gaming Keyboard - $89.99</div>
Output: {{"name": "Gaming Keyboard", "price": 89.99}}

Now extract from this HTML:
{html_content[:8000]}

Return only the JSON object.
"""
Structured Output with Function Calling
Deepseek supports function calling through the OpenAI-compatible tools parameter, which constrains the output to a declared schema:
import json

# Define the extraction schema
extraction_schema = {
    "name": "extract_product_data",
    "description": "Extract product information from HTML",
    "parameters": {
        "type": "object",
        "properties": {
            "product_name": {"type": "string", "description": "The name of the product"},
            "price": {"type": "number", "description": "Price in USD"},
            "currency": {"type": "string", "description": "Currency code"},
            "in_stock": {"type": "boolean", "description": "Whether the product is available"},
            "rating": {"type": "number", "description": "Customer rating out of 5"},
            "review_count": {"type": "integer", "description": "Number of reviews"}
        },
        "required": ["product_name", "price"]
    }
}

response = client.chat.completions.create(
    model="deepseek-chat",
    messages=[
        {"role": "user", "content": f"Extract product data from: {html_content[:8000]}"}
    ],
    tools=[{"type": "function", "function": extraction_schema}],
    tool_choice={"type": "function", "function": {"name": "extract_product_data"}}
)

# Parse the structured arguments from the tool call
function_args = json.loads(response.choices[0].message.tool_calls[0].function.arguments)
print(function_args)
Handling Large HTML Documents
When dealing with pages that exceed token limits:
from bs4 import BeautifulSoup

def preprocess_html(html_content, max_length=8000):
    """Clean and reduce HTML to its essential content."""
    soup = BeautifulSoup(html_content, 'html.parser')

    # Remove script, style, and navigation elements
    for element in soup(['script', 'style', 'nav', 'footer', 'header']):
        element.decompose()

    # Serialize the cleaned tree back to HTML
    clean_html = str(soup)

    # Truncate if still too long
    if len(clean_html) > max_length:
        clean_html = clean_html[:max_length]
    return clean_html
# Use the preprocessed HTML
clean_html = preprocess_html(html_content)
response = client.chat.completions.create(
    model="deepseek-chat",
    messages=[
        {"role": "user", "content": f"Extract data from: {clean_html}"}
    ]
)
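When the fields you need don't depend on markup at all, you can go further and send only the visible text, which usually cuts token usage substantially; here is a variant of the function above using BeautifulSoup's get_text():
def preprocess_text_only(html_content, max_length=8000):
    """Variant: keep only the visible text, discarding all markup."""
    soup = BeautifulSoup(html_content, 'html.parser')
    for element in soup(['script', 'style', 'nav', 'footer', 'header']):
        element.decompose()
    # Collapse the document to plain text
    text = soup.get_text(separator=' ', strip=True)
    return text[:max_length]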
Batch Processing Multiple Pages
For extracting data from multiple URLs:
import requests
from concurrent.futures import ThreadPoolExecutor

def extract_from_urls(urls, extraction_prompt_template):
    """Extract data from multiple URLs concurrently."""
    def extract_single(url):
        html = requests.get(url).text
        prompt = extraction_prompt_template.format(html=html[:8000])
        response = client.chat.completions.create(
            model="deepseek-chat",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.0,
            response_format={"type": "json_object"}
        )
        return {
            "url": url,
            "data": response.choices[0].message.content
        }

    # Process URLs in parallel with a bounded thread pool
    with ThreadPoolExecutor(max_workers=5) as executor:
        results = list(executor.map(extract_single, urls))
    return results
# Usage
urls = [
    "https://example.com/product1",
    "https://example.com/product2",
    "https://example.com/product3"
]

template = """
Extract the product name and price as JSON from this HTML:
{html}
"""

results = extract_from_urls(urls, template)
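At this level of parallelism you may run into provider rate limits (HTTP 429). A minimal retry-with-exponential-backoff wrapper, assuming any exception from the call is worth retrying, looks like this:
import time

def with_backoff(fn, retries=3, base_delay=1.0):
    """Retry fn() with exponential backoff on any exception."""
    for attempt in range(retries):
        try:
            return fn()
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))

# Usage inside extract_single:
# response = with_backoff(lambda: client.chat.completions.create(...))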
Integrating with Web Scraping Tools
Deepseek LLM extraction works well when combined with traditional scraping tools. For JavaScript-heavy sites, you can first render the page with browser automation tools and then use Deepseek for extraction:
from playwright.sync_api import sync_playwright

def scrape_with_browser_and_llm(url):
    """Render a JavaScript-heavy page, then extract with Deepseek."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)

        # Wait for network activity to settle so dynamic content loads
        page.wait_for_load_state('networkidle')

        # Get the fully rendered HTML
        html_content = page.content()
        browser.close()

    # Extract with Deepseek
    response = client.chat.completions.create(
        model="deepseek-chat",
        messages=[{
            "role": "user",
            "content": f"Extract all product listings as JSON from: {html_content[:8000]}"
        }],
        response_format={"type": "json_object"}
    )
    return response.choices[0].message.content
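Note that Playwright's browser binaries must be installed once (pip install playwright, then playwright install). After that, the function is called like any other; the URL here is a placeholder:
data = scrape_with_browser_and_llm('https://example.com/products')
print(data)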
Error Handling and Validation
Implement robust error handling for production use:
import json
from jsonschema import validate, ValidationError

def extract_with_validation(html_content, schema):
    """Extract data and validate it against a JSON schema, retrying on failure."""
    max_retries = 3
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model="deepseek-chat",
                messages=[{
                    "role": "user",
                    "content": f"Extract data as JSON following this schema: {schema}\n\nHTML: {html_content[:8000]}"
                }],
                temperature=0.0,
                response_format={"type": "json_object"}
            )
            data = json.loads(response.choices[0].message.content)

            # Validate against the schema
            validate(instance=data, schema=schema)
            return data
        except (json.JSONDecodeError, ValidationError) as e:
            if attempt == max_retries - 1:
                raise Exception(f"Failed to extract valid data after {max_retries} attempts: {e}")
            continue
# Define the validation schema
schema = {
    "type": "object",
    "properties": {
        "product_name": {"type": "string"},
        "price": {"type": "number"}
    },
    "required": ["product_name", "price"]
}

result = extract_with_validation(html_content, schema)
Cost Optimization Tips
LLM data extraction can be expensive at scale. Here are optimization strategies:
- Preprocess HTML: Remove unnecessary content before sending it to the API
- Cache results: Store extracted data to avoid reprocessing the same pages (see the sketch after the snippet below)
- Use cheaper models: Start with deepseek-chat before trying larger models
- Batch requests: Group multiple extractions when possible
- Set token limits: Use the max_tokens parameter to control costs
response = client.chat.completions.create(
    model="deepseek-chat",
    messages=[{"role": "user", "content": prompt}],
    max_tokens=500,  # Limit the response length
    temperature=0.0
)
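As a sketch of the caching strategy above, you can key stored results on a hash of the page content so identical pages are never re-extracted; the cache directory and helper names here are illustrative:
import hashlib
import json
import os

CACHE_DIR = "extraction_cache"  # illustrative location
os.makedirs(CACHE_DIR, exist_ok=True)

def cached_extract(html_content, extract_fn):
    """Return a cached result if this exact content was seen before."""
    key = hashlib.sha256(html_content.encode()).hexdigest()
    path = os.path.join(CACHE_DIR, f"{key}.json")
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    data = extract_fn(html_content)  # e.g. a function wrapping the API call
    with open(path, "w") as f:
        json.dump(data, f)
    return data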
When to Use LLM Data Extraction
LLM extraction with Deepseek is ideal for:
- Unstructured content: News articles, blog posts, product descriptions
- Varying layouts: Sites with inconsistent HTML structure
- Semantic extraction: Getting information based on meaning, not position
- Complex data: Extracting relationships and context between elements
For structured, consistent pages with predictable layouts, traditional CSS selectors or XPath may be more cost-effective and faster.
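A pragmatic middle ground is to try a cheap selector first and fall back to the LLM only when it fails. A minimal sketch, assuming a site-specific .price selector and the Deepseek client configured earlier:
from bs4 import BeautifulSoup

def get_price(html_content):
    """Try a CSS selector first; fall back to LLM extraction."""
    node = BeautifulSoup(html_content, 'html.parser').select_one('.price')  # site-specific selector
    if node is not None:
        return node.get_text(strip=True)
    # Fallback: semantic extraction via Deepseek
    response = client.chat.completions.create(
        model="deepseek-chat",
        messages=[{"role": "user", "content": f"Extract the price as JSON from: {html_content[:8000]}"}],
        temperature=0.0,
        response_format={"type": "json_object"}
    )
    return response.choices[0].message.content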
Conclusion
Deepseek provides powerful LLM capabilities for intelligent data extraction from web pages. By combining it with proper preprocessing, validation, and error handling, you can build robust extraction pipelines that handle varying HTML structures gracefully. When dealing with dynamic content and AJAX requests, pairing Deepseek with browser automation creates a comprehensive scraping solution.
Remember to always respect robots.txt, rate limits, and website terms of service when extracting data at scale. For production workloads requiring higher reliability and built-in proxy rotation, consider using a dedicated web scraping API service.