How do I use Deepseek for JSON output in web scraping?
Deepseek is a powerful large language model (LLM) that excels at extracting structured data from unstructured web content. When web scraping, getting data in JSON format is essential for downstream processing, storage, and integration with other systems. This guide shows you how to configure Deepseek to return clean, structured JSON output from your web scraping tasks.
Why Use Deepseek for JSON Extraction?
Traditional web scraping relies on CSS selectors and XPath to extract data, and those break whenever a website changes its HTML structure. Deepseek offers several advantages:
- Schema flexibility: Define custom JSON schemas without writing complex parsing logic
- Resilience to layout changes: Understands content semantically, not just structurally
- Natural language instructions: Describe what you want in plain English
- Complex data extraction: Handles nested structures, relationships, and context-aware extraction
- Cost-effective: Deepseek's pricing is competitive with other LLMs
Basic JSON Output with Deepseek API
The Deepseek API supports structured output through its chat completion endpoint. Here's how to configure it to return JSON:
Python Example
import requests
import json
# Deepseek API configuration
API_KEY = "your-deepseek-api-key"
API_URL = "https://api.deepseek.com/v1/chat/completions"
# HTML content to scrape (simplified example)
html_content = """
<div class="product">
<h1>Wireless Headphones</h1>
<span class="price">$79.99</span>
<p class="description">Premium noise-canceling headphones with 30-hour battery life.</p>
<div class="rating">4.5 stars (234 reviews)</div>
</div>
"""
# Define the JSON schema you want
json_schema = {
"type": "object",
"properties": {
"name": {"type": "string"},
"price": {"type": "number"},
"description": {"type": "string"},
"rating": {"type": "number"},
"review_count": {"type": "integer"}
},
"required": ["name", "price", "description"]
}
# Create the prompt
prompt = f"""
Extract product information from the following HTML and return it as JSON matching this schema:
{json.dumps(json_schema, indent=2)}
HTML:
{html_content}
Return only valid JSON, no additional text.
"""
# Make API request
headers = {
"Authorization": f"Bearer {API_KEY}",
"Content-Type": "application/json"
}
payload = {
"model": "deepseek-chat",
"messages": [
{
"role": "system",
"content": "You are a data extraction assistant. Always return valid JSON matching the requested schema."
},
{
"role": "user",
"content": prompt
}
],
"response_format": {"type": "json_object"},
"temperature": 0.0 # Lower temperature for consistent output
}
response = requests.post(API_URL, headers=headers, json=payload)
result = response.json()
# Parse the JSON output
extracted_data = json.loads(result['choices'][0]['message']['content'])
print(json.dumps(extracted_data, indent=2))
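For the sample HTML above, the printed result should look roughly like this (the field values come straight from the markup, though exact model output can vary):
{
  "name": "Wireless Headphones",
  "price": 79.99,
  "description": "Premium noise-canceling headphones with 30-hour battery life.",
  "rating": 4.5,
  "review_count": 234
}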
JavaScript/Node.js Example
const axios = require('axios');
const API_KEY = 'your-deepseek-api-key';
const API_URL = 'https://api.deepseek.com/v1/chat/completions';
async function extractProductData(html) {
const jsonSchema = {
type: "object",
properties: {
name: { type: "string" },
price: { type: "number" },
description: { type: "string" },
rating: { type: "number" },
review_count: { type: "integer" }
},
required: ["name", "price", "description"]
};
const prompt = `
Extract product information from the following HTML and return it as JSON matching this schema:
${JSON.stringify(jsonSchema, null, 2)}
HTML:
${html}
Return only valid JSON, no additional text.
`;
try {
const response = await axios.post(
API_URL,
{
model: "deepseek-chat",
messages: [
{
role: "system",
content: "You are a data extraction assistant. Always return valid JSON matching the requested schema."
},
{
role: "user",
content: prompt
}
],
response_format: { type: "json_object" },
temperature: 0.0
},
{
headers: {
'Authorization': `Bearer ${API_KEY}`,
'Content-Type': 'application/json'
}
}
);
const extractedData = JSON.parse(response.data.choices[0].message.content);
return extractedData;
} catch (error) {
console.error('Error extracting data:', error.message);
throw error;
}
}
// Usage example
const htmlContent = `
<div class="product">
<h1>Wireless Headphones</h1>
<span class="price">$79.99</span>
<p class="description">Premium noise-canceling headphones with 30-hour battery life.</p>
<div class="rating">4.5 stars (234 reviews)</div>
</div>
`;
extractProductData(htmlContent)
.then(data => console.log(JSON.stringify(data, null, 2)))
.catch(error => console.error(error));
Advanced JSON Extraction Techniques
Using JSON Schema for Complex Structures
For more complex data structures like arrays and nested objects, you can define detailed JSON schemas:
# Define a complex schema for e-commerce listings
complex_schema = {
"type": "object",
"properties": {
"products": {
"type": "array",
"items": {
"type": "object",
"properties": {
"id": {"type": "string"},
"name": {"type": "string"},
"price": {"type": "number"},
"currency": {"type": "string"},
"availability": {"type": "boolean"},
"specs": {
"type": "object",
"properties": {
"brand": {"type": "string"},
"color": {"type": "string"},
"weight": {"type": "string"}
}
},
"reviews": {
"type": "object",
"properties": {
"average_rating": {"type": "number"},
"total_count": {"type": "integer"}
}
}
},
"required": ["name", "price"]
}
}
}
}
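To put this schema to work, you can reuse the request pattern from the basic Python example. The sketch below assumes the API_KEY, API_URL, and headers defined earlier, and listing_html is a placeholder for the HTML of a fetched category page:
prompt = f"""
Extract all products from the following HTML and return them as JSON matching this schema:
{json.dumps(complex_schema, indent=2)}

HTML:
{listing_html}

Return only valid JSON, no additional text.
"""

payload = {
    "model": "deepseek-chat",
    "messages": [
        {"role": "system", "content": "You are a data extraction assistant. Always return valid JSON matching the requested schema."},
        {"role": "user", "content": prompt},
    ],
    "response_format": {"type": "json_object"},
    "temperature": 0.0,
}

response = requests.post(API_URL, headers=headers, json=payload)
listings = json.loads(response.json()["choices"][0]["message"]["content"])

# The schema's top-level "products" array makes iteration straightforward
for product in listings.get("products", []):
    print(product["name"], product["price"])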
Combining Deepseek with Traditional Scraping
For optimal results, combine traditional web scraping libraries with Deepseek for JSON extraction. First, fetch the HTML content, then use Deepseek to parse it:
import json
import requests
from bs4 import BeautifulSoup
def scrape_and_extract(url):
# Fetch the page content
response = requests.get(url, headers={
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
})
# Parse HTML and extract relevant section
soup = BeautifulSoup(response.text, 'html.parser')
product_section = soup.find('div', class_='product-details')
# Send to Deepseek for structured extraction
extracted_json = extract_with_deepseek(str(product_section))
return extracted_json
def extract_with_deepseek(html_content):
    # Reuses API_KEY, API_URL, headers, and json_schema from the earlier example
    prompt = (
        "Extract product information from the following HTML and return it as JSON "
        f"matching this schema:\n{json.dumps(json_schema, indent=2)}\n\n"
        f"HTML:\n{html_content}\n\n"
        "Return only valid JSON, no additional text."
    )
    payload = {
        "model": "deepseek-chat",
        "messages": [{"role": "user", "content": prompt}],
        "response_format": {"type": "json_object"},
        "temperature": 0.0,
    }
    api_response = requests.post(API_URL, headers=headers, json=payload)
    return json.loads(api_response.json()["choices"][0]["message"]["content"])
When scraping JavaScript-heavy sites, you may need a browser automation tool to render dynamic content before passing the HTML to Deepseek; the Working with Dynamic Content section below shows this with Playwright.
Best Practices for JSON Output
1. Use Low Temperature Settings
Set temperature to 0.0 or close to it for consistent, near-deterministic JSON output:
payload = {
"model": "deepseek-chat",
"temperature": 0.0, # Ensures consistent output
# ... other parameters
}
2. Validate JSON Output
Always validate the returned JSON against your expected schema:
import jsonschema
def validate_json_output(data, schema):
try:
jsonschema.validate(instance=data, schema=schema)
return True
except jsonschema.exceptions.ValidationError as e:
print(f"Validation error: {e.message}")
return False
# Usage
if validate_json_output(extracted_data, json_schema):
print("Data is valid!")
else:
print("Data validation failed")
3. Handle Parsing Errors Gracefully
Implement robust error handling for JSON parsing:
function safeJsonParse(content) {
try {
return JSON.parse(content);
} catch (error) {
console.error('Failed to parse JSON:', error.message);
console.error('Raw content:', content);
// Attempt to extract JSON from markdown code blocks
const jsonMatch = content.match(/```(?:json)?\s*\n([\s\S]*?)\n```/);
if (jsonMatch) {
try {
return JSON.parse(jsonMatch[1]);
} catch (e) {
console.error('Failed to parse extracted JSON:', e.message);
}
}
return null;
}
}
4. Optimize Token Usage
To reduce costs, pre-process HTML to remove unnecessary elements:
from bs4 import BeautifulSoup, Comment
def clean_html_for_llm(html):
soup = BeautifulSoup(html, 'html.parser')
# Remove script and style tags
for tag in soup(['script', 'style', 'meta', 'link']):
tag.decompose()
# Remove comments
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        comment.extract()
    # Return the cleaned markup with surrounding whitespace stripped
    return str(soup).strip()
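A quick before-and-after comparison makes the savings visible. The snippet below uses a tiny inline page for illustration; on real pages, stripping scripts and styles often cuts the payload substantially:
raw_html = "<html><head><style>p {}</style></head><body><p>Hello</p><script>track();</script></body></html>"
cleaned = clean_html_for_llm(raw_html)
print(len(raw_html), "->", len(cleaned))  # fewer characters sent means fewer tokens billed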
5. Use System Prompts Effectively
Craft clear system prompts to guide Deepseek's behavior:
system_prompt = """You are a precise data extraction assistant specializing in web scraping.
Rules:
1. Always return valid JSON matching the provided schema
2. Extract data accurately from HTML without hallucinating
3. Use null for missing optional fields
4. Convert prices to numbers (remove currency symbols)
5. Parse dates to ISO 8601 format when possible
6. If data cannot be found, return an empty object: {}
"""
Working with Dynamic Content
When scraping dynamic websites that load content via JavaScript, you'll need to render the page first. While Deepseek handles the extraction, you can use browser automation tools for rendering:
from playwright.sync_api import sync_playwright
def scrape_spa_with_deepseek(url):
with sync_playwright() as p:
browser = p.chromium.launch()
page = browser.new_page()
page.goto(url)
# Wait for dynamic content to load
page.wait_for_selector('.product-list')
# Get rendered HTML
html_content = page.content()
browser.close()
# Extract JSON using Deepseek
return extract_with_deepseek(html_content)
Batch Processing Multiple Pages
For scraping multiple pages, implement batch processing with rate limiting:
import time
from typing import List, Dict
def batch_scrape_to_json(urls: List[str], delay: float = 1.0) -> List[Dict]:
results = []
for i, url in enumerate(urls):
try:
print(f"Processing {i+1}/{len(urls)}: {url}")
# Fetch and extract
html = fetch_page(url)
json_data = extract_with_deepseek(html)
results.append(json_data)
# Rate limiting
if i < len(urls) - 1:
time.sleep(delay)
except Exception as e:
print(f"Error processing {url}: {e}")
results.append(None)
return results
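The fetch_page helper isn't defined in this guide; a minimal version, assuming a static page that plain requests can fetch, might look like this (swap in the Playwright approach shown earlier for JavaScript-rendered pages):
import requests

def fetch_page(url: str) -> str:
    # Simple HTTP fetch with a browser-like User-Agent, matching the earlier example
    response = requests.get(url, headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    }, timeout=30)
    response.raise_for_status()
    return response.text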
Error Handling and Retries
Implement retry logic for API failures:
import time
import requests
from typing import Optional
def extract_with_retry(html: str, max_retries: int = 3) -> Optional[dict]:
for attempt in range(max_retries):
try:
return extract_with_deepseek(html)
except requests.exceptions.RequestException as e:
if attempt < max_retries - 1:
wait_time = 2 ** attempt # Exponential backoff
print(f"Attempt {attempt + 1} failed. Retrying in {wait_time}s...")
time.sleep(wait_time)
else:
print(f"All {max_retries} attempts failed")
raise
Monitoring and Debugging
Log all API interactions for debugging:
import logging
import json
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
def extract_with_logging(html: str) -> dict:
logger.info("Starting extraction")
logger.debug(f"HTML length: {len(html)} characters")
try:
result = extract_with_deepseek(html)
logger.info(f"Extraction successful: {len(result)} fields")
logger.debug(f"Result: {json.dumps(result, indent=2)}")
return result
except Exception as e:
logger.error(f"Extraction failed: {e}")
raise
Conclusion
Deepseek provides a powerful way to extract structured JSON data from web pages without relying on fragile CSS selectors. By following these best practices—using JSON schemas, setting low temperature values, validating output, and implementing proper error handling—you can build robust web scraping pipelines that convert unstructured HTML into clean, structured JSON data.
The combination of Deepseek's natural language understanding with traditional scraping techniques creates a flexible system that adapts to website changes while maintaining consistent data quality. Whether you're scraping product catalogs, news articles, or research data, Deepseek's JSON output capabilities streamline the entire extraction process.