How do I Get Structured Output from Deepseek for Data Extraction?

Getting structured output from Deepseek for data extraction involves using the API's JSON mode or carefully crafted prompts to ensure the model returns data in a consistent, parseable format. Deepseek models support structured output through JSON formatting, making them ideal for web scraping tasks where you need consistent data extraction from HTML content.

Understanding Deepseek's Structured Output Capabilities

Deepseek models, particularly DeepSeek-V3 and DeepSeek-R1, excel at extracting structured data from unstructured content. The key to getting reliable structured output is using the response_format parameter with type: "json_object" or providing clear instructions in your prompts that specify the exact JSON schema you expect.

Using JSON Mode for Structured Output

The most reliable way to get structured output from Deepseek is to use JSON mode. This ensures the model always returns valid JSON that you can parse programmatically. Note that when JSON mode is enabled, DeepSeek expects your prompt to explicitly instruct the model to produce JSON, which the examples below do.

Python Example with JSON Mode

import requests
import json

def extract_structured_data(html_content, schema_description):
    """
    Extract structured data from HTML using Deepseek API
    """
    url = "https://api.deepseek.com/v1/chat/completions"

    headers = {
        "Authorization": "Bearer YOUR_DEEPSEEK_API_KEY",
        "Content-Type": "application/json"
    }

    prompt = f"""Extract the following information from this HTML and return it as JSON:

{schema_description}

HTML Content:
{html_content}

Return only valid JSON with no additional text."""

    payload = {
        "model": "deepseek-chat",
        "messages": [
            {
                "role": "system",
                "content": "You are a data extraction assistant. Always return valid JSON."
            },
            {
                "role": "user",
                "content": prompt
            }
        ],
        "response_format": {
            "type": "json_object"
        },
        "temperature": 0.1  # Low temperature for consistent output
    }

    response = requests.post(url, headers=headers, json=payload)
    response.raise_for_status()

    result = response.json()
    extracted_data = json.loads(result['choices'][0]['message']['content'])

    return extracted_data

# Example usage
html = """
<div class="product">
    <h1>Wireless Bluetooth Headphones</h1>
    <span class="price">$79.99</span>
    <p class="description">Premium noise-canceling headphones</p>
    <span class="rating">4.5 stars</span>
    <div class="stock">In Stock</div>
</div>
"""

schema = """
{
    "title": "product name",
    "price": "numeric price value",
    "description": "product description",
    "rating": "numeric rating",
    "in_stock": "boolean availability"
}
"""

product_data = extract_structured_data(html, schema)
print(json.dumps(product_data, indent=2))

JavaScript Example with JSON Mode

const axios = require('axios');

async function extractStructuredData(htmlContent, schemaDescription) {
    const url = 'https://api.deepseek.com/v1/chat/completions';

    const prompt = `Extract the following information from this HTML and return it as JSON:

${schemaDescription}

HTML Content:
${htmlContent}

Return only valid JSON with no additional text.`;

    const payload = {
        model: 'deepseek-chat',
        messages: [
            {
                role: 'system',
                content: 'You are a data extraction assistant. Always return valid JSON.'
            },
            {
                role: 'user',
                content: prompt
            }
        ],
        response_format: {
            type: 'json_object'
        },
        temperature: 0.1
    };

    const response = await axios.post(url, payload, {
        headers: {
            'Authorization': `Bearer ${process.env.DEEPSEEK_API_KEY}`,
            'Content-Type': 'application/json'
        }
    });

    const extractedData = JSON.parse(response.data.choices[0].message.content);
    return extractedData;
}

// Example usage
const html = `
<article class="blog-post">
    <h2>Understanding Web Scraping Ethics</h2>
    <span class="author">Jane Doe</span>
    <time datetime="2024-01-15">January 15, 2024</time>
    <p class="excerpt">A comprehensive guide to ethical web scraping practices...</p>
</article>
`;

const schema = `
{
    "title": "article title",
    "author": "author name",
    "date": "publication date in ISO format",
    "excerpt": "article excerpt"
}
`;

extractStructuredData(html, schema)
    .then(data => console.log(JSON.stringify(data, null, 2)))
    .catch(error => console.error('Error:', error));

Advanced Structured Output with Schema Validation

For more complex extraction tasks, you can define detailed JSON schemas and validate the output:

import os
import requests
import json
from jsonschema import validate, ValidationError

def extract_with_validation(html_content, json_schema):
    """
    Extract data with schema validation
    """
    url = "https://api.deepseek.com/v1/chat/completions"

    headers = {
        "Authorization": f"Bearer {os.getenv('DEEPSEEK_API_KEY')}",
        "Content-Type": "application/json"
    }

    prompt = f"""Extract data from the HTML below according to this JSON schema:

Schema:
{json.dumps(json_schema, indent=2)}

HTML:
{html_content}

Return a JSON object that matches the schema exactly."""

    payload = {
        "model": "deepseek-chat",
        "messages": [
            {
                "role": "system",
                "content": "You are a precise data extraction system. Return only valid JSON that matches the provided schema."
            },
            {
                "role": "user",
                "content": prompt
            }
        ],
        "response_format": {
            "type": "json_object"
        },
        "temperature": 0
    }

    response = requests.post(url, headers=headers, json=payload)
    response.raise_for_status()

    extracted_data = json.loads(response.json()['choices'][0]['message']['content'])

    # Validate against schema
    try:
        validate(instance=extracted_data, schema=json_schema)
        return extracted_data
    except ValidationError as e:
        raise ValueError(f"Extracted data doesn't match schema: {e.message}")

# Define a complex schema
product_schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "price": {"type": "number"},
        "currency": {"type": "string"},
        "availability": {"type": "boolean"},
        "rating": {
            "type": "object",
            "properties": {
                "score": {"type": "number", "minimum": 0, "maximum": 5},
                "count": {"type": "integer"}
            },
            "required": ["score", "count"]
        },
        "features": {
            "type": "array",
            "items": {"type": "string"}
        }
    },
    "required": ["title", "price", "availability"]
}

html = """
<div class="product-detail">
    <h1>Premium Laptop Stand</h1>
    <div class="price">$89.99 USD</div>
    <span class="stock">Available</span>
    <div class="rating">4.7/5 (234 reviews)</div>
    <ul class="features">
        <li>Adjustable height</li>
        <li>Aluminum construction</li>
        <li>Cable management</li>
    </ul>
</div>
"""

result = extract_with_validation(html, product_schema)
print(json.dumps(result, indent=2))
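
Even with temperature 0, models sometimes return numbers or booleans as strings ("$89.99 USD", "Available") that fail schema validation. A best-effort coercion pass before calling validate() can normalize the common cases. This helper is an illustrative sketch, not part of the Deepseek API; it assumes the flat JSON Schema shape used in product_schema above:

```python
import re

def coerce_types(data, schema):
    """Best-effort coercion of string values toward the types declared
    in schema['properties'], applied before jsonschema validation."""
    coerced = dict(data)
    for key, spec in schema.get("properties", {}).items():
        value = coerced.get(key)
        if not isinstance(value, str):
            continue
        expected = spec.get("type")
        if expected == "number":
            # Pull the first numeric token out of strings like "$89.99 USD"
            match = re.search(r"-?\d+(?:\.\d+)?", value)
            if match:
                coerced[key] = float(match.group())
        elif expected == "integer":
            match = re.search(r"-?\d+", value)
            if match:
                coerced[key] = int(match.group())
        elif expected == "boolean":
            coerced[key] = value.strip().lower() in ("true", "yes", "in stock", "available")
    return coerced
```

Run coerce_types(extracted_data, product_schema) on the model's output before validating, so a stray string price becomes a float instead of a ValidationError.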

Extracting Multiple Items (Arrays)

When scraping lists of items, you need structured output that returns arrays. JSON mode expects the top-level value to be an object, so wrap the array in a named key (products in the example below):

async function extractProductList(html) {
    const url = 'https://api.deepseek.com/v1/chat/completions';

    const prompt = `Extract all products from this HTML and return as a JSON array.

For each product, extract:
- title (string)
- price (number, without currency symbol)
- url (string)
- image_url (string)

HTML:
${html}

Return format:
{
    "products": [
        {"title": "...", "price": 0.0, "url": "...", "image_url": "..."},
        ...
    ]
}`;

    const response = await axios.post(url, {
        model: 'deepseek-chat',
        messages: [
            {
                role: 'system',
                content: 'Extract structured data from HTML. Return valid JSON arrays.'
            },
            {
                role: 'user',
                content: prompt
            }
        ],
        response_format: {
            type: 'json_object'
        },
        temperature: 0
    }, {
        headers: {
            'Authorization': `Bearer ${process.env.DEEPSEEK_API_KEY}`,
            'Content-Type': 'application/json'
        }
    });

    return JSON.parse(response.data.choices[0].message.content);
}
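
Once the parsed {"products": [...]} object comes back, it's safer to filter out malformed entries than to trust every element the model emits. A minimal sketch in pure Python (field names assumed from the prompt above):

```python
def filter_valid_products(payload):
    """Keep only product entries that are dicts with a non-empty string
    title and a numeric price; drop anything else silently."""
    valid = []
    for item in payload.get("products", []):
        if not isinstance(item, dict):
            continue
        title = item.get("title")
        price = item.get("price")
        if isinstance(title, str) and title.strip() and isinstance(price, (int, float)):
            valid.append(item)
    return valid
```

This keeps downstream code (database inserts, CSV export) from choking on a single hallucinated or truncated entry.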

Handling Dynamic Content with Structured Output

When dealing with JavaScript-rendered pages, you'll need to combine Deepseek with a browser automation tool that renders the page (including AJAX-loaded content) before you extract structured data:

import os
import requests
import json
from playwright.sync_api import sync_playwright

def scrape_dynamic_page_structured(url, data_schema):
    """
    Scrape dynamic page and extract structured data
    """
    # First, render the page with Playwright
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        page.wait_for_load_state('networkidle')
        html_content = page.content()
        browser.close()

    # Then extract structured data with Deepseek
    deepseek_url = "https://api.deepseek.com/v1/chat/completions"

    prompt = f"""Extract data from this HTML according to the schema:

Schema: {json.dumps(data_schema, indent=2)}

HTML: {html_content}

Return valid JSON matching the schema."""

    response = requests.post(deepseek_url,
        headers={
            "Authorization": f"Bearer {os.getenv('DEEPSEEK_API_KEY')}",
            "Content-Type": "application/json"
        },
        json={
            "model": "deepseek-chat",
            "messages": [
                {"role": "system", "content": "Extract structured data. Return only JSON."},
                {"role": "user", "content": prompt}
            ],
            "response_format": {"type": "json_object"},
            "temperature": 0
        }
    )

    return json.loads(response.json()['choices'][0]['message']['content'])

# Example usage
schema = {
    "listings": [
        {
            "title": "string",
            "location": "string",
            "price_per_night": "number",
            "rating": "number"
        }
    ]
}

data = scrape_dynamic_page_structured("https://example.com/listings", schema)

Best Practices for Structured Output

1. Use Low Temperature Settings

Set temperature to 0 or 0.1 for consistent, deterministic output:

payload = {
    "model": "deepseek-chat",
    "temperature": 0,  # Maximum consistency
    # ... other parameters
}

2. Provide Clear Schema Examples

Include example output in your prompt:

prompt = """Extract product data and return as JSON.

Example output format:
{
    "name": "Product Name",
    "price": 29.99,
    "in_stock": true
}

HTML to extract from:
{html_content}
"""

3. Handle Errors Gracefully

Always validate and handle parsing errors:

async function safeExtractData(html, schema) {
    try {
        const result = await extractStructuredData(html, schema);

        // Validate the result has expected keys
        const requiredKeys = ['title', 'price'];
        const hasAllKeys = requiredKeys.every(key => key in result);

        if (!hasAllKeys) {
            throw new Error('Missing required fields in extracted data');
        }

        return result;
    } catch (error) {
        console.error('Extraction failed:', error);
        return null;
    }
}
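
The same defensive approach helps in Python, where a common failure mode is the model wrapping its JSON in markdown code fences despite instructions. This parser is an illustrative sketch that strips a surrounding fence before giving up:

```python
import json
import re

def parse_model_json(raw):
    """Parse model output as JSON, tolerating ```json ... ``` fences
    and surrounding whitespace. Returns None if nothing parses."""
    text = raw.strip()
    # Strip a leading/trailing markdown fence if present
    fence = re.match(r"^```(?:json)?\s*(.*?)\s*```$", text, re.DOTALL)
    if fence:
        text = fence.group(1)
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None
```

A None return is your signal to retry the request or log the raw response for inspection, rather than crashing the pipeline.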

4. Optimize Token Usage

For large HTML documents, extract only the relevant sections before sending to the API:

from bs4 import BeautifulSoup

def extract_relevant_html(full_html, selector):
    """
    Extract only relevant portions of HTML to reduce token usage
    """
    soup = BeautifulSoup(full_html, 'html.parser')
    relevant_elements = soup.select(selector)

    # Return only the relevant HTML
    return '\n'.join(str(elem) for elem in relevant_elements)

# Extract only product containers
html = extract_relevant_html(page_html, '.product-card')

# Then send to Deepseek for structured extraction
result = extract_structured_data(html, schema)
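
Token limits also apply to the combined prompt, so when a page yields many fragments it can help to batch them under a rough budget and make one API call per batch. A hedged sketch, using character count as a crude proxy for tokens:

```python
def batch_fragments(fragments, max_chars=20000):
    """Group HTML fragments into batches whose combined length stays
    under max_chars, so each batch fits comfortably in one prompt.
    A fragment longer than the budget gets a batch of its own."""
    batches, current, size = [], [], 0
    for frag in fragments:
        if current and size + len(frag) > max_chars:
            batches.append(current)
            current, size = [], 0
        current.append(frag)
        size += len(frag)
    if current:
        batches.append(current)
    return batches
```

You would then call extract_structured_data('\n'.join(batch), schema) once per batch and merge the results.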

Combining Deepseek with Web Scraping APIs

For production use cases, consider combining Deepseek's structured output capabilities with specialized web scraping services that handle browser sessions and anti-bot measures:

import os
import requests
import json

def scrape_with_api_and_deepseek(target_url, extraction_schema):
    """
    Use WebScraping.AI to fetch content, then Deepseek to extract structured data
    """
    # Fetch HTML with WebScraping.AI
    scraping_response = requests.get(
        'https://api.webscraping.ai/html',
        params={
            'url': target_url,
            'api_key': os.getenv('WEBSCRAPING_AI_KEY')
        }
    )

    html_content = scraping_response.text

    # Extract structured data with Deepseek
    deepseek_response = requests.post(
        'https://api.deepseek.com/v1/chat/completions',
        headers={
            'Authorization': f"Bearer {os.getenv('DEEPSEEK_API_KEY')}",
            'Content-Type': 'application/json'
        },
        json={
            'model': 'deepseek-chat',
            'messages': [
                {
                    'role': 'system',
                    'content': 'Extract structured data. Return only valid JSON.'
                },
                {
                    'role': 'user',
                    'content': f"Extract data matching this schema: {json.dumps(extraction_schema)}\n\nHTML: {html_content}"
                }
            ],
            'response_format': {'type': 'json_object'},
            'temperature': 0
        }
    )

    return json.loads(deepseek_response.json()['choices'][0]['message']['content'])

Conclusion

Deepseek provides powerful capabilities for extracting structured data from web pages through its JSON mode and well-crafted prompts. By using the response_format parameter, providing clear schemas, and implementing proper validation, you can build reliable data extraction pipelines. Remember to use low temperature settings for consistency, handle errors gracefully, and optimize your HTML input to reduce token usage and costs.

For complex scenarios involving dynamic content, combine Deepseek with browser automation tools to ensure you're working with fully rendered HTML before extraction. This combination of technologies provides a robust solution for modern web scraping challenges.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
