# How Do I Handle JSON Extraction Using Deepseek?
JSON extraction with Deepseek is a powerful approach to converting unstructured HTML content into clean, structured JSON data. Deepseek's language models excel at understanding context and extracting specific fields from web pages, making them well suited to web scraping tasks that require structured output.
## Understanding Deepseek for JSON Extraction
Deepseek offers several models optimized for different tasks, with Deepseek-V3 and Deepseek-R1 being particularly effective for data extraction. These models can parse HTML content and return JSON-formatted responses based on your schema requirements.
The key advantages of using Deepseek for JSON extraction include:

- **Schema-based extraction**: Define your desired JSON structure and let Deepseek extract matching data
- **Context awareness**: The model understands relationships between data points
- **Flexible parsing**: Works with varied HTML structures without brittle selectors
- **Multi-field extraction**: Extract multiple related fields in a single API call
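To make the schema-based idea concrete, the extraction prompt can be assembled directly from the schema. The helper below is a minimal sketch; the function name is our own, not part of any Deepseek SDK:

```python
import json

def build_extraction_prompt(schema, html):
    """Compose a schema-guided extraction prompt for the model."""
    return (
        "Extract the following information from this HTML and return it as valid JSON:\n\n"
        f"Schema:\n{json.dumps(schema, indent=2)}\n\n"
        f"HTML Content:\n{html}\n\n"
        "Return only the JSON object with no additional text or explanation."
    )

prompt = build_extraction_prompt({"name": "string", "price": "number"}, "<div>...</div>")
```

The same prompt-building pattern appears in the full examples below.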
## Basic JSON Extraction with Deepseek API

### Python Implementation
Here's a complete example of extracting product data from an HTML page using Deepseek:
```python
import requests
import json

def extract_json_with_deepseek(html_content, schema):
    """
    Extract structured JSON data from HTML using the Deepseek API.
    """
    api_key = "your-deepseek-api-key"

    # Define the extraction prompt
    prompt = f"""
Extract the following information from this HTML and return it as valid JSON:

Schema:
{json.dumps(schema, indent=2)}

HTML Content:
{html_content}

Return only the JSON object with no additional text or explanation.
"""

    # Call the Deepseek API
    response = requests.post(
        "https://api.deepseek.com/v1/chat/completions",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json"
        },
        json={
            "model": "deepseek-chat",
            "messages": [
                {
                    "role": "system",
                    "content": "You are a data extraction expert. Always return valid JSON."
                },
                {
                    "role": "user",
                    "content": prompt
                }
            ],
            "temperature": 0.1,  # Lower temperature for consistent extraction
            "response_format": {"type": "json_object"}
        }
    )
    response.raise_for_status()  # Surface HTTP errors early

    result = response.json()
    extracted_data = json.loads(result["choices"][0]["message"]["content"])
    return extracted_data

# Example usage
html = """
<div class="product">
    <h1>Wireless Headphones</h1>
    <span class="price">$99.99</span>
    <p class="description">High-quality Bluetooth headphones with noise cancellation</p>
    <div class="rating">4.5 stars (230 reviews)</div>
    <span class="availability">In Stock</span>
</div>
"""

schema = {
    "name": "string",
    "price": "number",
    "description": "string",
    "rating": "number",
    "review_count": "number",
    "in_stock": "boolean"
}

product_data = extract_json_with_deepseek(html, schema)
print(json.dumps(product_data, indent=2))
```
Expected output:
```json
{
  "name": "Wireless Headphones",
  "price": 99.99,
  "description": "High-quality Bluetooth headphones with noise cancellation",
  "rating": 4.5,
  "review_count": 230,
  "in_stock": true
}
```
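Even with `response_format` set to `json_object`, defensive parsing is worth having in case the model ever wraps its answer in markdown fences. This helper is our own convention, not part of the Deepseek API:

```python
import json

def parse_model_json(raw):
    """Parse model output, tolerating optional markdown code fences."""
    text = raw.strip()
    if text.startswith("```"):
        # Drop the opening fence line (possibly "```json") and the closing fence
        text = text.split("\n", 1)[1] if "\n" in text else text
        text = text.rsplit("```", 1)[0]
    return json.loads(text)
```

With this in place, the bare `json.loads(...)` call in the example above can be swapped for `parse_model_json(...)`.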
### JavaScript/Node.js Implementation
Here's the equivalent implementation in JavaScript:
```javascript
const axios = require('axios');

async function extractJsonWithDeepseek(htmlContent, schema) {
  const apiKey = 'your-deepseek-api-key';

  const prompt = `
Extract the following information from this HTML and return it as valid JSON:

Schema:
${JSON.stringify(schema, null, 2)}

HTML Content:
${htmlContent}

Return only the JSON object with no additional text or explanation.
`;

  try {
    const response = await axios.post(
      'https://api.deepseek.com/v1/chat/completions',
      {
        model: 'deepseek-chat',
        messages: [
          {
            role: 'system',
            content: 'You are a data extraction expert. Always return valid JSON.'
          },
          {
            role: 'user',
            content: prompt
          }
        ],
        temperature: 0.1,
        response_format: { type: 'json_object' }
      },
      {
        headers: {
          'Authorization': `Bearer ${apiKey}`,
          'Content-Type': 'application/json'
        }
      }
    );

    const extractedData = JSON.parse(
      response.data.choices[0].message.content
    );
    return extractedData;
  } catch (error) {
    console.error('Extraction error:', error.message);
    throw error;
  }
}

// Example usage
const html = `
<article class="blog-post">
  <h2>Getting Started with AI Web Scraping</h2>
  <div class="meta">
    <span class="author">John Doe</span>
    <time>2024-01-15</time>
  </div>
  <div class="content">
    <p>Learn how to use AI for efficient web scraping...</p>
  </div>
  <div class="tags">
    <span>AI</span>
    <span>Web Scraping</span>
    <span>Tutorial</span>
  </div>
</article>
`;

const schema = {
  title: 'string',
  author: 'string',
  publish_date: 'string',
  tags: 'array of strings',
  summary: 'string'
};

extractJsonWithDeepseek(html, schema)
  .then(data => console.log(JSON.stringify(data, null, 2)))
  .catch(err => console.error(err));
```
## Advanced JSON Extraction Techniques

### Extracting Nested JSON Structures
For complex data with nested relationships, define a hierarchical schema:
```python
# Schema for extracting nested product data
nested_schema = {
    "product": {
        "name": "string",
        "price": {
            "amount": "number",
            "currency": "string",
            "on_sale": "boolean",
            "original_price": "number or null"
        },
        "specifications": {
            "brand": "string",
            "model": "string",
            "features": "array of strings"
        },
        "shipping": {
            "available": "boolean",
            "cost": "number",
            "estimated_days": "number"
        }
    }
}

# The prompt should explicitly mention the nested structure
prompt = f"""
Extract product information from the HTML into a nested JSON structure.
Follow this exact schema and maintain the hierarchy:
{json.dumps(nested_schema, indent=2)}

HTML: {html_content}

Return valid JSON only.
"""
```
### Extracting Arrays and Lists
When scraping multiple items from a page (like search results or product listings):
```python
def extract_list_with_deepseek(html_content):
    """Extract a list of items as a JSON array"""
    prompt = f"""
Extract all product items from this HTML page.
Return a JSON object with an "items" array, where each item has this structure:
{{
    "items": [
        {{
            "title": "string",
            "price": "number",
            "url": "string",
            "image_url": "string"
        }}
    ]
}}

HTML:
{html_content}

Return only valid JSON.
"""
    # API call similar to the previous examples
    # ...
    return extracted_data
```
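Listing pages can exceed the model's context window. One pragmatic pattern, assuming each chunk of HTML is extracted separately into the `{"items": [...]}` shape above, is to merge the per-chunk results and de-duplicate by URL. A sketch with a hypothetical helper name:

```python
def merge_item_chunks(chunks):
    """Merge the {"items": [...]} objects returned for each HTML chunk,
    de-duplicating items by their "url" field. Illustrative helper."""
    seen, merged = set(), []
    for chunk in chunks:
        for item in chunk.get("items", []):
            key = item.get("url")
            if key not in seen:
                seen.add(key)
                merged.append(item)
    return {"items": merged}
```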
## Handling Dynamic Content with Deepseek

When working with JavaScript-rendered pages, combine Deepseek with a browser automation tool. Puppeteer can handle AJAX requests and render the dynamic content; you then pass the fully rendered HTML to Deepseek for JSON extraction:
```javascript
const puppeteer = require('puppeteer');
const axios = require('axios');

async function scrapeAndExtractJson(url, schema) {
  // Launch a browser and get the fully rendered HTML
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'networkidle2' });

  // Wait for dynamic content to load
  await page.waitForSelector('.product-list');

  const html = await page.content();
  await browser.close();

  // Extract JSON using Deepseek (extractJsonWithDeepseek from the earlier example)
  const extractedData = await extractJsonWithDeepseek(html, schema);
  return extractedData;
}
```
## Best Practices for JSON Extraction

### 1. Define Clear Schemas
Always provide explicit schema definitions with data types:
```python
# Good: explicit schema with types
good_schema = {
    "title": "string",
    "price": "number (USD)",
    "published_date": "string (ISO 8601 format)",
    "available": "boolean"
}

# Avoid: vague schema
bad_schema = {
    "data": "various fields"
}
```
### 2. Use Type Validation
Validate the extracted JSON to ensure data quality:
```python
import jsonschema

def validate_extracted_data(data, validation_schema):
    """Validate extracted JSON against a JSON Schema"""
    try:
        jsonschema.validate(instance=data, schema=validation_schema)
        return True
    except jsonschema.exceptions.ValidationError as e:
        print(f"Validation error: {e.message}")
        return False

# JSON Schema for validation
validation_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "price": {"type": "number", "minimum": 0},
        "in_stock": {"type": "boolean"}
    },
    "required": ["name", "price"]
}

extracted = extract_json_with_deepseek(html, schema)
if validate_extracted_data(extracted, validation_schema):
    print("Data is valid!")
```
### 3. Handle Missing or Null Values
Specify how to handle missing data in your schema:
```python
schema_with_nulls = {
    "title": "string (required)",
    "subtitle": "string or null if not present",
    "price": "number or null if not available",
    "discount_price": "number or null if no discount"
}
```
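On the client side you can then coerce absent fields to explicit `None` values so downstream code always sees a fixed shape. A small sketch (the helper name is our own; the default field list mirrors the schema above):

```python
def normalize_nullable_fields(data, nullable_fields=("subtitle", "price", "discount_price")):
    """Return a copy of data in which every expected nullable field is
    present, defaulting absent fields to None."""
    return {**{field: None for field in nullable_fields}, **data}
```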
### 4. Optimize Token Usage
To reduce costs when extracting JSON from large HTML pages:
```python
from bs4 import BeautifulSoup

def preprocess_html(html_content):
    """Strip unnecessary HTML before sending it to Deepseek"""
    soup = BeautifulSoup(html_content, 'html.parser')

    # Remove script, style, meta, and link tags
    for element in soup(['script', 'style', 'meta', 'link']):
        element.decompose()

    # Keep only the relevant section
    main_content = soup.find('main') or soup.find('body')
    return str(main_content)

# Use the preprocessed HTML
clean_html = preprocess_html(raw_html)
extracted_data = extract_json_with_deepseek(clean_html, schema)
```
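You can also cap the payload with a rough character budget. A common rule of thumb is roughly four characters per token for English text; the numbers below are assumptions to tune for your model, and the helper counts characters, not real tokens:

```python
def truncate_for_budget(html, max_tokens=8000, chars_per_token=4):
    """Crudely cap prompt size using a characters-per-token heuristic."""
    limit = max_tokens * chars_per_token
    return html if len(html) <= limit else html[:limit]
```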
### 5. Implement Error Handling and Retries
Always implement robust error handling:
```python
import json

import requests
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=10)
)
def extract_with_retry(html, schema):
    """Extract JSON with automatic retries on failure"""
    try:
        result = extract_json_with_deepseek(html, schema)
        # Verify we got a valid JSON object
        if not isinstance(result, dict):
            raise ValueError("Response is not a valid JSON object")
        return result
    except json.JSONDecodeError as e:
        print(f"JSON parsing error: {e}")
        raise
    except requests.exceptions.RequestException as e:
        print(f"API request error: {e}")
        raise
```
## Comparing Deepseek with Traditional JSON Extraction
Traditional web scraping relies on CSS selectors or XPath to extract data, which can be brittle:
```python
# Traditional approach with BeautifulSoup
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
traditional_data = {
    'name': soup.select_one('.product h1').text,
    'price': float(soup.select_one('.price').text.strip('$')),
    # Breaks if the HTML structure changes
}

# Deepseek approach - more resilient to HTML changes
deepseek_data = extract_json_with_deepseek(html, schema)
```
The Deepseek approach is more maintainable for sites that frequently change their HTML structure, because the model extracts data by meaning rather than by its position in the markup.
## Real-World Use Cases

### E-commerce Product Scraping

```python
ecommerce_schema = {
    "products": [
        {
            "name": "string",
            "brand": "string",
            "price": "number",
            "currency": "string",
            "rating": "number (0-5)",
            "review_count": "integer",
            "availability": "string (in stock/out of stock)",
            "specifications": "object with key-value pairs"
        }
    ]
}
```
### News Article Extraction

```python
article_schema = {
    "headline": "string",
    "subheadline": "string or null",
    "author": "string or array of strings",
    "publish_date": "string (ISO format)",
    "last_updated": "string or null",
    "categories": "array of strings",
    "content": "string (article body)",
    "image_urls": "array of strings"
}
```
### Job Listing Aggregation

```python
job_schema = {
    "jobs": [
        {
            "title": "string",
            "company": "string",
            "location": "string",
            "salary_range": "string or null",
            "employment_type": "string (full-time/part-time/contract)",
            "posted_date": "string",
            "requirements": "array of strings",
            "description": "string"
        }
    ]
}
```
## Conclusion
Deepseek provides a powerful and flexible approach to JSON extraction from web pages. By leveraging its natural language understanding capabilities, you can create more maintainable web scraping solutions that are resilient to HTML structure changes. The key to success is defining clear schemas, implementing proper error handling, and optimizing your prompts for consistent results.
Whether you're building a data pipeline, aggregating content, or creating a web scraping service, Deepseek's JSON extraction capabilities can significantly reduce development time and improve data quality compared to traditional parsing methods.