# How Do I Handle JSON Extraction Using Deepseek?
JSON extraction with Deepseek is a powerful approach to converting unstructured HTML content into clean, structured JSON data. Deepseek's language models excel at understanding context and extracting specific fields from web pages, making them well suited to web scraping tasks that require structured output.
## Understanding Deepseek for JSON Extraction
Deepseek offers several models optimized for different tasks, with Deepseek-V3 and Deepseek-R1 being particularly effective for data extraction. These models can parse HTML content and return JSON-formatted responses based on your schema requirements.
The key advantages of using Deepseek for JSON extraction include:

- **Schema-based extraction**: Define your desired JSON structure and let Deepseek extract matching data
- **Context awareness**: The model understands relationships between data points
- **Flexible parsing**: Works with varied HTML structures without brittle selectors
- **Multi-field extraction**: Extract multiple related fields in a single API call
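To make the schema-based idea concrete, the extraction prompt can be assembled directly from the schema. The helper below is a minimal sketch; the function name is our own, not part of any Deepseek SDK:

```python
import json

def build_extraction_prompt(schema, html):
    """Compose a schema-guided extraction prompt for the model."""
    return (
        "Extract the following information from this HTML and return it as valid JSON:\n\n"
        f"Schema:\n{json.dumps(schema, indent=2)}\n\n"
        f"HTML Content:\n{html}\n\n"
        "Return only the JSON object with no additional text or explanation."
    )

prompt = build_extraction_prompt({"name": "string", "price": "number"}, "<div>...</div>")
```

The same prompt-building pattern appears in the full examples below.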
## Basic JSON Extraction with Deepseek API

### Python Implementation
Here's a complete example of extracting product data from an HTML page using Deepseek:
```python
import requests
import json

def extract_json_with_deepseek(html_content, schema):
    """
    Extract structured JSON data from HTML using the Deepseek API.
    """
    api_key = "your-deepseek-api-key"

    # Define the extraction prompt
    prompt = f"""
Extract the following information from this HTML and return it as valid JSON:

Schema:
{json.dumps(schema, indent=2)}

HTML Content:
{html_content}

Return only the JSON object with no additional text or explanation.
"""

    # Call the Deepseek API
    response = requests.post(
        "https://api.deepseek.com/v1/chat/completions",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json"
        },
        json={
            "model": "deepseek-chat",
            "messages": [
                {
                    "role": "system",
                    "content": "You are a data extraction expert. Always return valid JSON."
                },
                {
                    "role": "user",
                    "content": prompt
                }
            ],
            "temperature": 0.1,  # Lower temperature for consistent extraction
            "response_format": {"type": "json_object"}
        }
    )
    response.raise_for_status()  # Surface HTTP errors early

    result = response.json()
    extracted_data = json.loads(result["choices"][0]["message"]["content"])
    return extracted_data

# Example usage
html = """
<div class="product">
    <h1>Wireless Headphones</h1>
    <span class="price">$99.99</span>
    <p class="description">High-quality Bluetooth headphones with noise cancellation</p>
    <div class="rating">4.5 stars (230 reviews)</div>
    <span class="availability">In Stock</span>
</div>
"""

schema = {
    "name": "string",
    "price": "number",
    "description": "string",
    "rating": "number",
    "review_count": "number",
    "in_stock": "boolean"
}

product_data = extract_json_with_deepseek(html, schema)
print(json.dumps(product_data, indent=2))
```
Expected output:
```json
{
  "name": "Wireless Headphones",
  "price": 99.99,
  "description": "High-quality Bluetooth headphones with noise cancellation",
  "rating": 4.5,
  "review_count": 230,
  "in_stock": true
}
```
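Even with `response_format` set to `json_object`, defensive parsing is worth having in case the model ever wraps its answer in markdown fences. This helper is our own convention, not part of the Deepseek API:

```python
import json

def parse_model_json(raw):
    """Parse model output, tolerating optional markdown code fences."""
    text = raw.strip()
    if text.startswith("```"):
        # Drop the opening fence line (possibly "```json") and the closing fence
        text = text.split("\n", 1)[1] if "\n" in text else text
        text = text.rsplit("```", 1)[0]
    return json.loads(text)
```

With this in place, the bare `json.loads(...)` call in the example above can be swapped for `parse_model_json(...)`.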
### JavaScript/Node.js Implementation
Here's the equivalent implementation in JavaScript:
```javascript
const axios = require('axios');

async function extractJsonWithDeepseek(htmlContent, schema) {
  const apiKey = 'your-deepseek-api-key';

  const prompt = `
Extract the following information from this HTML and return it as valid JSON:

Schema:
${JSON.stringify(schema, null, 2)}

HTML Content:
${htmlContent}

Return only the JSON object with no additional text or explanation.
`;

  try {
    const response = await axios.post(
      'https://api.deepseek.com/v1/chat/completions',
      {
        model: 'deepseek-chat',
        messages: [
          {
            role: 'system',
            content: 'You are a data extraction expert. Always return valid JSON.'
          },
          {
            role: 'user',
            content: prompt
          }
        ],
        temperature: 0.1,
        response_format: { type: 'json_object' }
      },
      {
        headers: {
          'Authorization': `Bearer ${apiKey}`,
          'Content-Type': 'application/json'
        }
      }
    );

    const extractedData = JSON.parse(
      response.data.choices[0].message.content
    );
    return extractedData;
  } catch (error) {
    console.error('Extraction error:', error.message);
    throw error;
  }
}

// Example usage
const html = `
<article class="blog-post">
  <h2>Getting Started with AI Web Scraping</h2>
  <div class="meta">
    <span class="author">John Doe</span>
    <time>2024-01-15</time>
  </div>
  <div class="content">
    <p>Learn how to use AI for efficient web scraping...</p>
  </div>
  <div class="tags">
    <span>AI</span>
    <span>Web Scraping</span>
    <span>Tutorial</span>
  </div>
</article>
`;

const schema = {
  title: 'string',
  author: 'string',
  publish_date: 'string',
  tags: 'array of strings',
  summary: 'string'
};

extractJsonWithDeepseek(html, schema)
  .then(data => console.log(JSON.stringify(data, null, 2)))
  .catch(err => console.error(err));
```
## Advanced JSON Extraction Techniques

### Extracting Nested JSON Structures
For complex data with nested relationships, define a hierarchical schema:
```python
# Schema for extracting nested product data
nested_schema = {
    "product": {
        "name": "string",
        "price": {
            "amount": "number",
            "currency": "string",
            "on_sale": "boolean",
            "original_price": "number or null"
        },
        "specifications": {
            "brand": "string",
            "model": "string",
            "features": "array of strings"
        },
        "shipping": {
            "available": "boolean",
            "cost": "number",
            "estimated_days": "number"
        }
    }
}

# The prompt should explicitly mention the nested structure
prompt = f"""
Extract product information from the HTML into a nested JSON structure.
Follow this exact schema and maintain the hierarchy:
{json.dumps(nested_schema, indent=2)}

HTML: {html_content}

Return valid JSON only.
"""
```
### Extracting Arrays and Lists
When scraping multiple items from a page (like search results or product listings):
```python
def extract_list_with_deepseek(html_content):
    """Extract a list of items as a JSON array"""
    prompt = f"""
Extract all product items from this HTML page.
Return a JSON object with an "items" array, where each item has this structure:
{{
    "items": [
        {{
            "title": "string",
            "price": "number",
            "url": "string",
            "image_url": "string"
        }}
    ]
}}

HTML:
{html_content}

Return only valid JSON.
"""
    # API call similar to the previous examples
    # ...
    return extracted_data
```
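Listing pages can exceed the model's context window. One pragmatic pattern, assuming each chunk of HTML is extracted separately into the `{"items": [...]}` shape above, is to merge the per-chunk results and de-duplicate by URL. A sketch with a hypothetical helper name:

```python
def merge_item_chunks(chunks):
    """Merge the {"items": [...]} objects returned for each HTML chunk,
    de-duplicating items by their "url" field. Illustrative helper."""
    seen, merged = set(), []
    for chunk in chunks:
        for item in chunk.get("items", []):
            key = item.get("url")
            if key not in seen:
                seen.add(key)
                merged.append(item)
    return {"items": merged}
```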
## Handling Dynamic Content with Deepseek

When working with JavaScript-rendered pages, combine Deepseek with a browser automation tool. Puppeteer can handle AJAX requests and render the dynamic content; you then pass the fully rendered HTML to Deepseek for JSON extraction:
```javascript
const puppeteer = require('puppeteer');
const axios = require('axios');

async function scrapeAndExtractJson(url, schema) {
  // Launch a browser and get the fully rendered HTML
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'networkidle2' });

  // Wait for dynamic content to load
  await page.waitForSelector('.product-list');

  const html = await page.content();
  await browser.close();

  // Extract JSON using Deepseek (extractJsonWithDeepseek from the earlier example)
  const extractedData = await extractJsonWithDeepseek(html, schema);
  return extractedData;
}
```
## Best Practices for JSON Extraction

### 1. Define Clear Schemas
Always provide explicit schema definitions with data types:
```python
# Good: explicit schema with types
good_schema = {
    "title": "string",
    "price": "number (USD)",
    "published_date": "string (ISO 8601 format)",
    "available": "boolean"
}

# Avoid: vague schema
bad_schema = {
    "data": "various fields"
}
```
### 2. Use Type Validation
Validate the extracted JSON to ensure data quality:
```python
import jsonschema

def validate_extracted_data(data, validation_schema):
    """Validate extracted JSON against a JSON Schema"""
    try:
        jsonschema.validate(instance=data, schema=validation_schema)
        return True
    except jsonschema.exceptions.ValidationError as e:
        print(f"Validation error: {e.message}")
        return False

# JSON Schema for validation
validation_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "price": {"type": "number", "minimum": 0},
        "in_stock": {"type": "boolean"}
    },
    "required": ["name", "price"]
}

extracted = extract_json_with_deepseek(html, schema)
if validate_extracted_data(extracted, validation_schema):
    print("Data is valid!")
```
### 3. Handle Missing or Null Values
Specify how to handle missing data in your schema:
```python
schema_with_nulls = {
    "title": "string (required)",
    "subtitle": "string or null if not present",
    "price": "number or null if not available",
    "discount_price": "number or null if no discount"
}
```
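On the client side you can then coerce absent fields to explicit `None` values so downstream code always sees a fixed shape. A small sketch (the helper name is our own; the default field list mirrors the schema above):

```python
def normalize_nullable_fields(data, nullable_fields=("subtitle", "price", "discount_price")):
    """Return a copy of data in which every expected nullable field is
    present, defaulting absent fields to None."""
    return {**{field: None for field in nullable_fields}, **data}
```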
### 4. Optimize Token Usage
To reduce costs when extracting JSON from large HTML pages:
```python
from bs4 import BeautifulSoup

def preprocess_html(html_content):
    """Strip unnecessary HTML before sending it to Deepseek"""
    soup = BeautifulSoup(html_content, 'html.parser')

    # Remove script, style, meta, and link tags
    for element in soup(['script', 'style', 'meta', 'link']):
        element.decompose()

    # Keep only the relevant section
    main_content = soup.find('main') or soup.find('body')
    return str(main_content)

# Use the preprocessed HTML
clean_html = preprocess_html(raw_html)
extracted_data = extract_json_with_deepseek(clean_html, schema)
```
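You can also cap the payload with a rough character budget. A common rule of thumb is roughly four characters per token for English text; the numbers below are assumptions to tune for your model, and the helper counts characters, not real tokens:

```python
def truncate_for_budget(html, max_tokens=8000, chars_per_token=4):
    """Crudely cap prompt size using a characters-per-token heuristic."""
    limit = max_tokens * chars_per_token
    return html if len(html) <= limit else html[:limit]
```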
### 5. Implement Error Handling and Retries
Always implement robust error handling:
```python
import json

import requests
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=10)
)
def extract_with_retry(html, schema):
    """Extract JSON with automatic retries on failure"""
    try:
        result = extract_json_with_deepseek(html, schema)
        # Verify we got a valid JSON object
        if not isinstance(result, dict):
            raise ValueError("Response is not a valid JSON object")
        return result
    except json.JSONDecodeError as e:
        print(f"JSON parsing error: {e}")
        raise
    except requests.exceptions.RequestException as e:
        print(f"API request error: {e}")
        raise
```
## Comparing Deepseek with Traditional JSON Extraction
Traditional web scraping relies on CSS selectors or XPath to extract data, which can be brittle:
```python
# Traditional approach with BeautifulSoup
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
traditional_data = {
    'name': soup.select_one('.product h1').text,
    'price': float(soup.select_one('.price').text.strip('$')),
    # Breaks if the HTML structure changes
}

# Deepseek approach - more resilient to HTML changes
deepseek_data = extract_json_with_deepseek(html, schema)
```
The Deepseek approach is more maintainable for sites that frequently change their HTML structure, because the model extracts data by meaning rather than by its position in the markup.
## Real-World Use Cases

### E-commerce Product Scraping

```python
ecommerce_schema = {
    "products": [
        {
            "name": "string",
            "brand": "string",
            "price": "number",
            "currency": "string",
            "rating": "number (0-5)",
            "review_count": "integer",
            "availability": "string (in stock/out of stock)",
            "specifications": "object with key-value pairs"
        }
    ]
}
```
### News Article Extraction

```python
article_schema = {
    "headline": "string",
    "subheadline": "string or null",
    "author": "string or array of strings",
    "publish_date": "string (ISO format)",
    "last_updated": "string or null",
    "categories": "array of strings",
    "content": "string (article body)",
    "image_urls": "array of strings"
}
```
### Job Listing Aggregation

```python
job_schema = {
    "jobs": [
        {
            "title": "string",
            "company": "string",
            "location": "string",
            "salary_range": "string or null",
            "employment_type": "string (full-time/part-time/contract)",
            "posted_date": "string",
            "requirements": "array of strings",
            "description": "string"
        }
    ]
}
```
## Conclusion
Deepseek provides a powerful and flexible approach to JSON extraction from web pages. By leveraging its natural language understanding capabilities, you can create more maintainable web scraping solutions that are resilient to HTML structure changes. The key to success is defining clear schemas, implementing proper error handling, and optimizing your prompts for consistent results.
Whether you're building a data pipeline, aggregating content, or creating a web scraping service, Deepseek's JSON extraction capabilities can significantly reduce development time and improve data quality compared to traditional parsing methods.