What is Structured Data Extraction with LLM and How Does Deepseek Support It?

Structured data extraction with Large Language Models (LLMs) is a transformative approach to converting unstructured or semi-structured web content into well-defined, machine-readable formats like JSON, XML, or CSV. Unlike traditional web scraping methods that rely on brittle CSS selectors or XPath expressions, LLM-based extraction uses natural language understanding to identify and extract relevant information from HTML content, making it more resilient to website layout changes.

Deepseek, a powerful and cost-effective LLM, provides robust support for structured data extraction through function calling, JSON mode, and advanced prompt engineering capabilities. This guide explores how structured data extraction works with LLMs and demonstrates practical implementation patterns using Deepseek.

Understanding Structured Data Extraction with LLMs

Traditional web scraping requires developers to manually identify HTML selectors for each piece of data they want to extract. This approach becomes problematic when:

  • Website layouts change frequently
  • Data appears in inconsistent formats across pages
  • Content is dynamically generated or embedded in complex structures
  • You need to extract semantic meaning rather than just raw text

LLM-based structured data extraction solves these challenges by using the model's natural language understanding to:

  1. Parse unstructured content and identify relevant information
  2. Transform data into predefined schemas
  3. Handle variations in content structure automatically
  4. Extract semantic relationships between data points
  5. Validate and normalize extracted data
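
For example, given a snippet like <div class="item"><h2>USB-C Hub</h2><span>$39</span></div>, the model can return {"name": "USB-C Hub", "price": 39} directly, with no selectors involved.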

How Deepseek Supports Structured Data Extraction

Deepseek offers several powerful features for structured data extraction:

1. Function Calling (Tool Use)

Deepseek supports function calling, allowing you to define JSON schemas that the model will populate with extracted data. This is the most reliable method for structured extraction.

import os
import requests
import json

def extract_product_data(html_content):
    """Extract structured product data using Deepseek function calling"""

    # Define the schema for product data
    tools = [{
        "type": "function",
        "function": {
            "name": "save_product",
            "description": "Save extracted product information",
            "parameters": {
                "type": "object",
                "properties": {
                    "name": {
                        "type": "string",
                        "description": "Product name"
                    },
                    "price": {
                        "type": "number",
                        "description": "Product price in USD"
                    },
                    "rating": {
                        "type": "number",
                        "description": "Product rating (0-5)"
                    },
                    "availability": {
                        "type": "string",
                        "enum": ["in_stock", "out_of_stock", "preorder"],
                        "description": "Product availability status"
                    },
                    "features": {
                        "type": "array",
                        "items": {"type": "string"},
                        "description": "List of key product features"
                    }
                },
                "required": ["name", "price", "availability"]
            }
        }
    }]

    response = requests.post(
        "https://api.deepseek.com/v1/chat/completions",
        headers={
            "Authorization": f"Bearer YOUR_API_KEY",
            "Content-Type": "application/json"
        },
        json={
            "model": "deepseek-chat",
            "messages": [
                {
                    "role": "user",
                    "content": f"Extract product information from this HTML:\n\n{html_content}"
                }
            ],
            "tools": tools,
            "tool_choice": "auto"
        }
    )

    result = response.json()

    # Extract the function call arguments
    if result["choices"][0]["message"].get("tool_calls"):
        tool_call = result["choices"][0]["message"]["tool_calls"][0]
        product_data = json.loads(tool_call["function"]["arguments"])
        return product_data

    return None

# Example usage
html = """
<div class="product">
    <h1>Premium Wireless Headphones</h1>
    <span class="price">$299.99</span>
    <div class="rating">4.5 stars</div>
    <p class="stock">In Stock</p>
    <ul class="features">
        <li>Active Noise Cancellation</li>
        <li>40-hour battery life</li>
        <li>Bluetooth 5.0</li>
    </ul>
</div>
"""

product = extract_product_data(html)
print(json.dumps(product, indent=2))
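
Run against the sample HTML, this prints something like the following (the exact output can vary slightly between runs):

{
  "name": "Premium Wireless Headphones",
  "price": 299.99,
  "rating": 4.5,
  "availability": "in_stock",
  "features": [
    "Active Noise Cancellation",
    "40-hour battery life",
    "Bluetooth 5.0"
  ]
}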

2. JSON Mode for Structured Output

Deepseek also supports a JSON response format that ensures the model returns valid JSON:

const axios = require('axios');

async function extractArticleMetadata(htmlContent) {
    const response = await axios.post(
        'https://api.deepseek.com/v1/chat/completions',
        {
            model: 'deepseek-chat',
            messages: [
                {
                    role: 'system',
                    content: `You are a data extraction assistant. Extract article metadata and return it as JSON with these fields:
                    - title (string)
                    - author (string)
                    - publishDate (ISO date string)
                    - category (string)
                    - tags (array of strings)
                    - wordCount (number)
                    - excerpt (string, max 200 chars)`
                },
                {
                    role: 'user',
                    content: `Extract metadata from this HTML:\n\n${htmlContent}`
                }
            ],
            response_format: { type: 'json_object' },
            temperature: 0.1
        },
        {
            headers: {
                'Authorization': `Bearer ${process.env.DEEPSEEK_API_KEY}`,
                'Content-Type': 'application/json'
            }
        }
    );

    return JSON.parse(response.data.choices[0].message.content);
}

// Example usage
const html = `
<article>
    <h1>Understanding AI in Web Scraping</h1>
    <div class="author">By Jane Smith</div>
    <time datetime="2025-01-15">January 15, 2025</time>
    <div class="category">Technology</div>
    <div class="tags">AI, Web Scraping, Machine Learning</div>
    <p>Artificial intelligence is revolutionizing how we extract data from websites...</p>
</article>
`;

extractArticleMetadata(html).then(metadata => {
    console.log(JSON.stringify(metadata, null, 2));
});

Advanced Prompt Engineering for Structured Extraction

The quality of structured data extraction heavily depends on prompt engineering. Here are proven patterns for Deepseek:

Pattern 1: Schema-First Extraction

Provide the exact schema upfront in your system prompt:

def extract_with_schema(html_content, schema):
    """Extract data according to a predefined schema"""

    system_prompt = f"""You are a precise data extraction system.
Extract information from HTML and return ONLY valid JSON matching this exact schema:

{json.dumps(schema, indent=2)}

Rules:
- Use null for missing values
- Convert dates to ISO 8601 format
- Extract numbers without currency symbols or commas
- Preserve arrays even if empty []
- Do not add fields not in the schema"""

    response = requests.post(
        "https://api.deepseek.com/v1/chat/completions",
        headers={
            "Authorization": f"Bearer YOUR_API_KEY",
            "Content-Type": "application/json"
        },
        json={
            "model": "deepseek-chat",
            "messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": f"HTML:\n{html_content}"}
            ],
            "response_format": {"type": "json_object"},
            "temperature": 0.0
        }
    )

    # Parse the JSON response into a dict for downstream use
    return json.loads(response.json()["choices"][0]["message"]["content"])

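A quick usage sketch (the schema fields here are illustrative, not a fixed standard):

article_schema = {
    "title": "string",
    "author": "string or null",
    "published": "ISO 8601 date string or null",
    "tags": ["string"]
}

data = extract_with_schema(html, article_schema)
print(data["title"])
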
Pattern 2: Few-Shot Learning for Complex Extractions

For complex extraction tasks, provide examples:

def extract_with_examples(html_content):
    """Use few-shot learning for better extraction accuracy"""

    messages = [
        {
            "role": "system",
            "content": "Extract event information from HTML and return structured JSON."
        },
        {
            "role": "user",
            "content": '<div><h2>Tech Conference 2025</h2><p>Date: March 15-17, 2025</p><p>Location: San Francisco, CA</p></div>'
        },
        {
            "role": "assistant",
            "content": json.dumps({
                "name": "Tech Conference 2025",
                "startDate": "2025-03-15",
                "endDate": "2025-03-17",
                "location": {
                    "city": "San Francisco",
                    "state": "CA",
                    "country": "USA"
                }
            })
        },
        {
            "role": "user",
            "content": html_content
        }
    ]

    response = requests.post(
        "https://api.deepseek.com/v1/chat/completions",
        headers={
            "Authorization": f"Bearer YOUR_API_KEY",
            "Content-Type": "application/json"
        },
        json={
            "model": "deepseek-chat",
            "messages": messages,
            "response_format": {"type": "json_object"},
            "temperature": 0.1
        }
    )

    return json.loads(response.json()["choices"][0]["message"]["content"])

Handling Batch Extraction

When extracting data from multiple similar elements (like product listings), use this pattern:

async function extractMultipleProducts(htmlContent) {
    const response = await axios.post(
        'https://api.deepseek.com/v1/chat/completions',
        {
            model: 'deepseek-chat',
            messages: [
                {
                    role: 'system',
                    content: `Extract ALL products from the HTML. Return a JSON object with a "products" array.
Each product should have: id, name, price, imageUrl, rating, reviewCount.
If a field is missing, use null.`
                },
                {
                    role: 'user',
                    content: htmlContent
                }
            ],
            response_format: { type: 'json_object' },
            temperature: 0.0
        },
        {
            headers: {
                'Authorization': `Bearer ${process.env.DEEPSEEK_API_KEY}`,
                'Content-Type': 'application/json'
            }
        }
    );

    const data = JSON.parse(response.data.choices[0].message.content);
    return data.products || [];
}
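
Very long listing pages can exceed the model's context window. One mitigation, shown here as a minimal Python sketch, is to split the product cards into chunks and extract each chunk separately (it assumes the cards share a product class and reuses extract_with_schema from above with a hypothetical schema that wraps results in a "products" array):

from bs4 import BeautifulSoup

def extract_products_chunked(html, schema, chunk_size=20):
    """Extract products from a long listing page in manageable chunks."""
    soup = BeautifulSoup(html, "html.parser")
    cards = soup.select(".product")  # Adjust the selector for your target site
    products = []
    for i in range(0, len(cards), chunk_size):
        chunk_html = "".join(str(card) for card in cards[i:i + chunk_size])
        result = extract_with_schema(chunk_html, schema)
        products.extend(result.get("products", []))
    return products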

Error Handling and Validation

Always validate extracted data and handle potential errors:

from pydantic import BaseModel, ValidationError, Field
from typing import List, Optional
from datetime import date

class Product(BaseModel):
    name: str
    price: float = Field(gt=0)
    rating: Optional[float] = Field(None, ge=0, le=5)
    availability: str
    features: List[str] = []

def extract_and_validate(html_content):
    """Extract data with validation"""
    try:
        # Extract using Deepseek
        raw_data = extract_product_data(html_content)

        # Validate with Pydantic
        product = Product(**raw_data)
        return product.model_dump()  # use product.dict() on Pydantic v1

    except ValidationError as e:
        print(f"Validation error: {e}")
        return None
    except Exception as e:
        print(f"Extraction error: {e}")
        return None

Combining LLM Extraction with Traditional Scraping

For optimal results, combine traditional web scraping techniques with LLM-based extraction:

from bs4 import BeautifulSoup

def hybrid_extraction(url):
    """Combine traditional scraping with LLM extraction"""

    # Step 1: Fetch HTML with traditional tools
    html = fetch_html(url)  # Your fetching logic

    # Step 2: Pre-process with BeautifulSoup to reduce token usage
    soup = BeautifulSoup(html, 'html.parser')
    main_content = soup.find('main') or soup.find('article')

    if main_content:
        # Only send relevant content to LLM
        cleaned_html = str(main_content)
    else:
        cleaned_html = html

    # Step 3: Extract structured data with Deepseek
    structured_data = extract_with_schema(cleaned_html, YOUR_SCHEMA)

    return structured_data
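
The fetch_html placeholder can be as simple as a requests call; a minimal sketch (production scrapers typically add proxies, retries, and JavaScript rendering on top):

def fetch_html(url):
    """Fetch raw HTML with a plain HTTP GET."""
    response = requests.get(
        url,
        headers={"User-Agent": "Mozilla/5.0"},  # Some sites reject default clients
        timeout=30,
    )
    response.raise_for_status()
    return response.text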

Best Practices for Deepseek Structured Extraction

  1. Minimize Token Usage: Pre-clean HTML to remove scripts, styles, and irrelevant elements
  2. Use Low Temperature: Set temperature to 0.0-0.1 for consistent, deterministic extraction
  3. Define Strict Schemas: Use function calling with detailed parameter descriptions
  4. Implement Retry Logic: Handle API errors and rate limits gracefully (see the sketch after this list)
  5. Validate Output: Always validate extracted JSON against your schema
  6. Cache Results: Cache extraction results to minimize API costs
  7. Monitor Costs: Track token usage and implement budget controls
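
For point 4, a minimal retry sketch with exponential backoff (call_deepseek stands for any function wrapping the requests.post calls shown earlier):

import time

def call_with_retries(call_deepseek, max_retries=3):
    """Retry a Deepseek API call with exponential backoff."""
    for attempt in range(max_retries):
        try:
            response = call_deepseek()
            response.raise_for_status()
            return response.json()
        except requests.RequestException:
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)  # Back off 1s, 2s, 4s, ...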

Cost Optimization Strategies

Deepseek is significantly cheaper than alternatives like GPT-4, but costs can still add up:

def estimate_extraction_cost(html_content, price_per_million_tokens=0.14):
    """Estimate cost before extraction"""

    # Rough estimation: 1 token ≈ 4 characters of English text/HTML
    input_tokens = len(html_content) / 4
    estimated_output_tokens = 500  # Adjust based on schema
    # Input and output tokens are priced differently in practice;
    # a single blended rate keeps this estimate simple

    total_tokens = input_tokens + estimated_output_tokens
    cost = (total_tokens / 1_000_000) * price_per_million_tokens

    return {
        'estimated_tokens': int(total_tokens),
        'estimated_cost_usd': round(cost, 6)
    }
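
For the small sample product HTML from earlier, this estimates a few hundred tokens and a small fraction of a cent per extraction:

print(estimate_extraction_cost(html))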

When to Use LLM-Based vs Traditional Extraction

Use LLM extraction when:

  • Website structures vary significantly
  • You need semantic understanding of content
  • Data appears in natural language (reviews, descriptions)
  • Layout changes frequently

Use traditional extraction when:

  • Website structure is consistent
  • You need real-time, low-latency extraction
  • Budget is extremely constrained
  • Data is in clearly defined HTML elements

For complex scenarios, consider integrating both approaches with tools that handle dynamic content alongside LLM-based extraction for optimal results.

Conclusion

Structured data extraction with LLMs like Deepseek represents a paradigm shift in web scraping, offering flexibility and resilience that traditional methods cannot match. By leveraging function calling, JSON mode, and sophisticated prompt engineering, developers can build extraction systems that adapt to changes automatically while maintaining high accuracy.

Deepseek's competitive pricing and strong performance make it an excellent choice for production web scraping workflows, especially when combined with proper validation, error handling, and cost optimization strategies.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
