How do I use Deepseek for JSON output in web scraping?

Deepseek is a powerful large language model (LLM) that excels at extracting structured data from unstructured web content. When web scraping, getting data in JSON format is essential for downstream processing, storage, and integration with other systems. This guide shows you how to configure Deepseek to return clean, structured JSON output from your web scraping tasks.

Why Use Deepseek for JSON Extraction?

Traditional web scraping relies on CSS selectors and XPath to extract data, which break when websites change their HTML structure. Deepseek offers several advantages:

  • Schema flexibility: Define custom JSON schemas without writing complex parsing logic
  • Resilience to layout changes: Understands content semantically, not just structurally
  • Natural language instructions: Describe what you want in plain English
  • Complex data extraction: Handles nested structures, relationships, and context-aware extraction
  • Cost-effective: Deepseek pricing is competitive compared to other LLMs

Basic JSON Output with Deepseek API

The Deepseek API supports structured output through its chat completion endpoint. Here's how to configure it to return JSON:

Python Example

import requests
import json

# Deepseek API configuration
API_KEY = "your-deepseek-api-key"
API_URL = "https://api.deepseek.com/v1/chat/completions"

# HTML content to scrape (simplified example)
html_content = """
<div class="product">
    <h1>Wireless Headphones</h1>
    <span class="price">$79.99</span>
    <p class="description">Premium noise-canceling headphones with 30-hour battery life.</p>
    <div class="rating">4.5 stars (234 reviews)</div>
</div>
"""

# Define the JSON schema you want
json_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "price": {"type": "number"},
        "description": {"type": "string"},
        "rating": {"type": "number"},
        "review_count": {"type": "integer"}
    },
    "required": ["name", "price", "description"]
}

# Create the prompt
prompt = f"""
Extract product information from the following HTML and return it as JSON matching this schema:
{json.dumps(json_schema, indent=2)}

HTML:
{html_content}

Return only valid JSON, no additional text.
"""

# Make API request
headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json"
}

payload = {
    "model": "deepseek-chat",
    "messages": [
        {
            "role": "system",
            "content": "You are a data extraction assistant. Always return valid JSON matching the requested schema."
        },
        {
            "role": "user",
            "content": prompt
        }
    ],
    "response_format": {"type": "json_object"},
    "temperature": 0.0  # Lower temperature for consistent output
}

response = requests.post(API_URL, headers=headers, json=payload)
result = response.json()

# Parse the JSON output
extracted_data = json.loads(result['choices'][0]['message']['content'])
print(json.dumps(extracted_data, indent=2))
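With the sample HTML above, a successful call would typically print something like the following (values inferred from the markup; the exact output depends on the model):

```json
{
  "name": "Wireless Headphones",
  "price": 79.99,
  "description": "Premium noise-canceling headphones with 30-hour battery life.",
  "rating": 4.5,
  "review_count": 234
}
```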

JavaScript/Node.js Example

const axios = require('axios');

const API_KEY = 'your-deepseek-api-key';
const API_URL = 'https://api.deepseek.com/v1/chat/completions';

async function extractProductData(html) {
    const jsonSchema = {
        type: "object",
        properties: {
            name: { type: "string" },
            price: { type: "number" },
            description: { type: "string" },
            rating: { type: "number" },
            review_count: { type: "integer" }
        },
        required: ["name", "price", "description"]
    };

    const prompt = `
Extract product information from the following HTML and return it as JSON matching this schema:
${JSON.stringify(jsonSchema, null, 2)}

HTML:
${html}

Return only valid JSON, no additional text.
    `;

    try {
        const response = await axios.post(
            API_URL,
            {
                model: "deepseek-chat",
                messages: [
                    {
                        role: "system",
                        content: "You are a data extraction assistant. Always return valid JSON matching the requested schema."
                    },
                    {
                        role: "user",
                        content: prompt
                    }
                ],
                response_format: { type: "json_object" },
                temperature: 0.0
            },
            {
                headers: {
                    'Authorization': `Bearer ${API_KEY}`,
                    'Content-Type': 'application/json'
                }
            }
        );

        const extractedData = JSON.parse(response.data.choices[0].message.content);
        return extractedData;
    } catch (error) {
        console.error('Error extracting data:', error.message);
        throw error;
    }
}

// Usage example
const htmlContent = `
<div class="product">
    <h1>Wireless Headphones</h1>
    <span class="price">$79.99</span>
    <p class="description">Premium noise-canceling headphones with 30-hour battery life.</p>
    <div class="rating">4.5 stars (234 reviews)</div>
</div>
`;

extractProductData(htmlContent)
    .then(data => console.log(JSON.stringify(data, null, 2)))
    .catch(error => console.error(error));

Advanced JSON Extraction Techniques

Using JSON Schema for Complex Structures

For more complex data structures like arrays and nested objects, you can define detailed JSON schemas:

# Define a complex schema for e-commerce listings
complex_schema = {
    "type": "object",
    "properties": {
        "products": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "id": {"type": "string"},
                    "name": {"type": "string"},
                    "price": {"type": "number"},
                    "currency": {"type": "string"},
                    "availability": {"type": "boolean"},
                    "specs": {
                        "type": "object",
                        "properties": {
                            "brand": {"type": "string"},
                            "color": {"type": "string"},
                            "weight": {"type": "string"}
                        }
                    },
                    "reviews": {
                        "type": "object",
                        "properties": {
                            "average_rating": {"type": "number"},
                            "total_count": {"type": "integer"}
                        }
                    }
                },
                "required": ["name", "price"]
            }
        }
    }
}
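Responses should be checked against the schema's required fields before use. Here is a minimal stdlib sketch (the helper name and sample data are illustrative; the jsonschema package covers the full specification):

```python
# Minimal stdlib sanity check for the "required" fields of the item schema.
# This only walks one level; use the jsonschema package for full validation.
item_required = ["name", "price"]  # mirrors "required" in the schema above

def missing_required(data: dict) -> list:
    """Return (index, field) pairs for products missing a required field."""
    gaps = []
    for i, product in enumerate(data.get("products", [])):
        for field in item_required:
            if field not in product:
                gaps.append((i, field))
    return gaps

sample = {
    "products": [
        {"name": "Wireless Headphones", "price": 79.99},
        {"name": "USB-C Cable"},  # price missing
    ]
}
print(missing_required(sample))  # [(1, 'price')]
```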

Combining Deepseek with Traditional Scraping

For optimal results, combine traditional web scraping libraries with Deepseek for JSON extraction. First, fetch the HTML content, then use Deepseek to parse it:

import requests
from bs4 import BeautifulSoup

def scrape_and_extract(url):
    # Fetch the page content
    response = requests.get(url, headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    })

    # Parse HTML and extract relevant section
    soup = BeautifulSoup(response.text, 'html.parser')
    product_section = soup.find('div', class_='product-details')

    # Send to Deepseek for structured extraction
    extracted_json = extract_with_deepseek(str(product_section))
    return extracted_json

def extract_with_deepseek(html_content):
    # Use the Deepseek API code from earlier examples
    # ... (API call implementation)
    pass
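One way to fill in that stub, reusing the payload shape from the basic example (the schema here is the earlier one, trimmed for brevity):

```python
import json

API_KEY = "your-deepseek-api-key"
API_URL = "https://api.deepseek.com/v1/chat/completions"

# Same schema as the basic example, trimmed for brevity
json_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "price": {"type": "number"},
        "description": {"type": "string"},
    },
    "required": ["name", "price", "description"],
}

def build_payload(html_content: str) -> dict:
    """Assemble the chat-completion payload used throughout this guide."""
    prompt = (
        "Extract product information from the following HTML and return it as "
        f"JSON matching this schema:\n{json.dumps(json_schema, indent=2)}\n\n"
        f"HTML:\n{html_content}\n\nReturn only valid JSON, no additional text."
    )
    return {
        "model": "deepseek-chat",
        "messages": [
            {"role": "system",
             "content": "You are a data extraction assistant. Always return "
                        "valid JSON matching the requested schema."},
            {"role": "user", "content": prompt},
        ],
        "response_format": {"type": "json_object"},
        "temperature": 0.0,
    }

def extract_with_deepseek(html_content: str) -> dict:
    import requests  # imported here so build_payload stays dependency-free
    headers = {"Authorization": f"Bearer {API_KEY}",
               "Content-Type": "application/json"}
    response = requests.post(API_URL, headers=headers,
                             json=build_payload(html_content))
    response.raise_for_status()
    return json.loads(response.json()["choices"][0]["message"]["content"])
```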

When scraping JavaScript-heavy sites, you may need to use browser automation tools to handle dynamic content rendering before passing the HTML to Deepseek.

Best Practices for JSON Output

1. Use Low Temperature Settings

Set temperature to 0.0 or close to it for deterministic, consistent JSON output:

payload = {
    "model": "deepseek-chat",
    "temperature": 0.0,  # Ensures consistent output
    # ... other parameters
}

2. Validate JSON Output

Always validate the returned JSON against your expected schema:

import jsonschema

def validate_json_output(data, schema):
    try:
        jsonschema.validate(instance=data, schema=schema)
        return True
    except jsonschema.exceptions.ValidationError as e:
        print(f"Validation error: {e.message}")
        return False

# Usage
if validate_json_output(extracted_data, json_schema):
    print("Data is valid!")
else:
    print("Data validation failed")

3. Handle Parsing Errors Gracefully

Implement robust error handling for JSON parsing:

function safeJsonParse(content) {
    try {
        return JSON.parse(content);
    } catch (error) {
        console.error('Failed to parse JSON:', error.message);
        console.error('Raw content:', content);

        // Attempt to extract JSON from markdown code blocks
        const jsonMatch = content.match(/```(?:json)?\s*\n([\s\S]*?)\n```/);
        if (jsonMatch) {
            try {
                return JSON.parse(jsonMatch[1]);
            } catch (e) {
                console.error('Failed to parse extracted JSON:', e.message);
            }
        }

        return null;
    }
}

4. Optimize Token Usage

To reduce costs, pre-process HTML to remove unnecessary elements:

from bs4 import BeautifulSoup, Comment

def clean_html_for_llm(html):
    soup = BeautifulSoup(html, 'html.parser')

    # Remove script and style tags
    for tag in soup(['script', 'style', 'meta', 'link']):
        tag.decompose()

    # Remove comments
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        comment.extract()

    # Return the cleaned HTML
    return str(soup).strip()
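To see what this buys you, a rough character-based estimate is enough (about four characters per token is a common heuristic for English text; the true count depends on the model's tokenizer):

```python
def rough_token_estimate(text: str) -> int:
    # ~4 characters per token is a common heuristic for English text
    return max(1, len(text) // 4)

raw_html = ("<html><head><script>trackUser();</script>"
            "<style>.x{color:red}</style></head>"
            "<body><p>Hello world</p></body></html>")
stripped = "<p>Hello world</p>"  # roughly what clean_html_for_llm would keep

print(rough_token_estimate(raw_html), "tokens vs", rough_token_estimate(stripped))
```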

5. Use System Prompts Effectively

Craft clear system prompts to guide Deepseek's behavior:

system_prompt = """You are a precise data extraction assistant specializing in web scraping.

Rules:
1. Always return valid JSON matching the provided schema
2. Extract data accurately from HTML without hallucinating
3. Use null for missing optional fields
4. Convert prices to numbers (remove currency symbols)
5. Parse dates to ISO 8601 format when possible
6. If data cannot be found, return an empty object: {}
"""

Working with Dynamic Content

When scraping dynamic websites that load content via JavaScript, you'll need to render the page first. While Deepseek handles the extraction, you can use browser automation tools for rendering:

from playwright.sync_api import sync_playwright

def scrape_spa_with_deepseek(url):
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)

        # Wait for dynamic content to load
        page.wait_for_selector('.product-list')

        # Get rendered HTML
        html_content = page.content()
        browser.close()

        # Extract JSON using Deepseek
        return extract_with_deepseek(html_content)

Batch Processing Multiple Pages

For scraping multiple pages, implement batch processing with rate limiting:

import time
from typing import List, Dict

def batch_scrape_to_json(urls: List[str], delay: float = 1.0) -> List[Dict]:
    results = []

    for i, url in enumerate(urls):
        try:
            print(f"Processing {i+1}/{len(urls)}: {url}")

            # Fetch and extract (fetch_page is your HTTP helper, e.g. a requests.get wrapper)
            html = fetch_page(url)
            json_data = extract_with_deepseek(html)
            results.append(json_data)

            # Rate limiting
            if i < len(urls) - 1:
                time.sleep(delay)

        except Exception as e:
            print(f"Error processing {url}: {e}")
            results.append(None)

    return results

Error Handling and Retries

Implement retry logic for API failures:

import time
import requests
from typing import Optional

def extract_with_retry(html: str, max_retries: int = 3) -> Optional[dict]:
    for attempt in range(max_retries):
        try:
            return extract_with_deepseek(html)
        except requests.exceptions.RequestException as e:
            if attempt < max_retries - 1:
                wait_time = 2 ** attempt  # Exponential backoff
                print(f"Attempt {attempt + 1} failed. Retrying in {wait_time}s...")
                time.sleep(wait_time)
            else:
                print(f"All {max_retries} attempts failed")
                raise

Monitoring and Debugging

Log all API interactions for debugging:

import logging
import json

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def extract_with_logging(html: str) -> dict:
    logger.info("Starting extraction")
    logger.debug(f"HTML length: {len(html)} characters")

    try:
        result = extract_with_deepseek(html)
        logger.info(f"Extraction successful: {len(result)} fields")
        logger.debug(f"Result: {json.dumps(result, indent=2)}")
        return result
    except Exception as e:
        logger.error(f"Extraction failed: {e}")
        raise

Conclusion

Deepseek provides a powerful way to extract structured JSON data from web pages without relying on fragile CSS selectors. By following these best practices—using JSON schemas, setting low temperature values, validating output, and implementing proper error handling—you can build robust web scraping pipelines that convert unstructured HTML into clean, structured JSON data.

The combination of Deepseek's natural language understanding with traditional scraping techniques creates a flexible system that adapts to website changes while maintaining consistent data quality. Whether you're scraping product catalogs, news articles, or research data, Deepseek's JSON output capabilities streamline the entire extraction process.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
