How Do I Perform LLM Data Extraction Using Deepseek?

LLM data extraction with Deepseek involves using the Deepseek API to intelligently parse and extract structured information from unstructured web content. Unlike traditional web scraping that relies on CSS selectors or XPath, Deepseek's language models can understand context, handle varying HTML structures, and extract data based on semantic meaning.

What is LLM Data Extraction?

LLM (Large Language Model) data extraction leverages AI to understand and extract information from web pages without requiring rigid selectors. This approach is particularly useful when:

  • Web page structures frequently change
  • Data isn't consistently formatted
  • You need to extract information based on context rather than exact position
  • Multiple variations of the same type of content exist
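
To make the contrast concrete, here is a minimal sketch with two invented HTML snippets: a CSS selector such as span.price written for the first layout would miss the second, while a single semantic prompt covers both (the API call itself is shown in the sections that follow).

# Two pages presenting the same product with different markup
page_a = '<h1>Desk Lamp</h1><span class="price">$24.99</span>'
page_b = '<div class="item">Desk Lamp - now only 24.99 USD</div>'

# One prompt describes the data semantically instead of positionally
prompt_template = "Extract the product name and price as JSON from: {html}"

for html in (page_a, page_b):
    print(prompt_template.format(html=html))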

Setting Up Deepseek for Data Extraction

Getting Your API Key

First, obtain your Deepseek API key from the Deepseek platform. You'll need this for authentication.

Installation

Python:

pip install openai  # Deepseek uses OpenAI-compatible API

JavaScript/Node.js:

npm install openai
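
Rather than hard-coding the key as in the examples below, you can read it from an environment variable. DEEPSEEK_API_KEY here is an arbitrary name chosen for this sketch, not one the SDK picks up automatically:

import os
from openai import OpenAI

# Keep the key out of source control by reading it from the environment
client = OpenAI(
    api_key=os.environ["DEEPSEEK_API_KEY"],
    base_url="https://api.deepseek.com"
)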

Basic Data Extraction with Deepseek

Python Example

Here's a complete example of extracting product information from HTML:

import json

import requests
from openai import OpenAI

# Initialize Deepseek client
client = OpenAI(
    api_key="your-deepseek-api-key",
    base_url="https://api.deepseek.com"
)

# Fetch HTML content
url = "https://example.com/product-page"
html_content = requests.get(url).text

# Create the extraction prompt; truncate the HTML to stay within token limits
prompt = f"""
Extract the following information from this product page:
- Product name
- Price
- Description
- Availability status
- Customer rating

Return the data in JSON format.

HTML:
{html_content[:8000]}
"""

# Call Deepseek API
response = client.chat.completions.create(
    model="deepseek-chat",
    messages=[
        {"role": "system", "content": "You are a data extraction assistant that returns valid JSON."},
        {"role": "user", "content": prompt}
    ],
    temperature=0.0,  # Lower temperature for consistent extraction
    response_format={"type": "json_object"}  # Ensures JSON output
)

# Parse the extracted data
extracted_data = json.loads(response.choices[0].message.content)
print(extracted_data)
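
Because response_format is set to json_object, the message content is a JSON string. For a typical product page the parsed result might look like this (hypothetical values, shown only to illustrate the shape):

{
    "product_name": "Wireless Mouse",
    "price": 29.99,
    "description": "Ergonomic 2.4 GHz wireless mouse",
    "availability": "In Stock",
    "rating": 4.5
}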

JavaScript Example

const OpenAI = require('openai');
const axios = require('axios');

// Initialize Deepseek client
const client = new OpenAI({
    apiKey: 'your-deepseek-api-key',
    baseURL: 'https://api.deepseek.com'
});

async function extractProductData(url) {
    // Fetch HTML content
    const response = await axios.get(url);
    const htmlContent = response.data;

    // Create extraction prompt
    const prompt = `
    Extract the following information from this product page:
    - Product name
    - Price
    - Description
    - Availability status
    - Customer rating

    Return the data in JSON format.

    HTML:
    ${htmlContent.substring(0, 8000)}
    `;

    // Call Deepseek API
    const completion = await client.chat.completions.create({
        model: 'deepseek-chat',
        messages: [
            {
                role: 'system',
                content: 'You are a data extraction assistant that returns valid JSON.'
            },
            {
                role: 'user',
                content: prompt
            }
        ],
        temperature: 0.0,
        response_format: { type: 'json_object' }
    });

    // Parse and return extracted data
    const extractedData = JSON.parse(completion.choices[0].message.content);
    return extractedData;
}

// Usage
extractProductData('https://example.com/product-page')
    .then(data => console.log(data))
    .catch(error => console.error('Extraction error:', error));

Advanced Data Extraction Techniques

Using Few-Shot Examples

Providing examples improves extraction accuracy:

prompt = f"""
I need to extract product information. Here are examples of the expected format:

Example 1:
Input: <h1>Wireless Mouse</h1><span class="price">$29.99</span>
Output: {{"name": "Wireless Mouse", "price": 29.99}}

Example 2:
Input: <div class="product">Gaming Keyboard - $89.99</div>
Output: {{"name": "Gaming Keyboard", "price": 89.99}}

Now extract from this HTML:
{html_content[:8000]}

Return only the JSON object.
"""

Structured Output with Function Calling

Deepseek supports OpenAI-style function calling for structured extraction. In the current API this is expressed through the tools parameter:

import json

# Define the extraction function as a tool
extraction_tool = {
    "type": "function",
    "function": {
        "name": "extract_product_data",
        "description": "Extract product information from HTML",
        "parameters": {
            "type": "object",
            "properties": {
                "product_name": {"type": "string", "description": "The name of the product"},
                "price": {"type": "number", "description": "Price in USD"},
                "currency": {"type": "string", "description": "Currency code"},
                "in_stock": {"type": "boolean", "description": "Whether product is available"},
                "rating": {"type": "number", "description": "Customer rating out of 5"},
                "review_count": {"type": "integer", "description": "Number of reviews"}
            },
            "required": ["product_name", "price"]
        }
    }
}

response = client.chat.completions.create(
    model="deepseek-chat",
    messages=[
        {"role": "user", "content": f"Extract product data from: {html_content[:8000]}"}
    ],
    tools=[extraction_tool],
    tool_choice={"type": "function", "function": {"name": "extract_product_data"}}
)

# Read the structured arguments from the tool call the model made
tool_call = response.choices[0].message.tool_calls[0]
function_args = json.loads(tool_call.function.arguments)
print(function_args)

Handling Large HTML Documents

When dealing with pages that exceed token limits:

from bs4 import BeautifulSoup

def preprocess_html(html_content, max_length=8000):
    """Clean and reduce HTML to essential content"""
    soup = BeautifulSoup(html_content, 'html.parser')

    # Remove script, style, and navigation elements
    for element in soup(['script', 'style', 'nav', 'footer', 'header']):
        element.decompose()

    # Serialize the cleaned DOM back to a smaller HTML string
    clean_html = str(soup)

    # Truncate if still too long
    if len(clean_html) > max_length:
        clean_html = clean_html[:max_length]

    return clean_html

# Use preprocessed HTML
clean_html = preprocess_html(html_content)
response = client.chat.completions.create(
    model="deepseek-chat",
    messages=[
        {"role": "user", "content": f"Extract data from: {clean_html}"}
    ]
)
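
Truncation can silently drop the fields you care about. An alternative is to split the cleaned HTML into chunks, extract from each, and merge the partial results. A minimal sketch, assuming the target is a flat JSON object; the merge simply keeps the first non-null value per key:

import json

def extract_in_chunks(html, chunk_size=8000):
    """Extract from each chunk and merge flat JSON results"""
    merged = {}
    for start in range(0, len(html), chunk_size):
        chunk = html[start:start + chunk_size]
        response = client.chat.completions.create(
            model="deepseek-chat",
            messages=[{"role": "user", "content": f"Extract product data as JSON from: {chunk}"}],
            temperature=0.0,
            response_format={"type": "json_object"}
        )
        partial = json.loads(response.choices[0].message.content)
        # Keep the first non-null value seen for each key
        for key, value in partial.items():
            if value is not None and key not in merged:
                merged[key] = value
    return merged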

Batch Processing Multiple Pages

For extracting data from multiple URLs:

from concurrent.futures import ThreadPoolExecutor

def extract_from_urls(urls, extraction_prompt_template):
    """Extract data from multiple URLs using a thread pool"""

    def extract_single(url):
        html = requests.get(url).text
        prompt = extraction_prompt_template.format(html=html[:8000])

        response = client.chat.completions.create(
            model="deepseek-chat",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.0,
            response_format={"type": "json_object"}
        )

        return {
            "url": url,
            "data": response.choices[0].message.content
        }

    # The API calls are I/O-bound, so threads give easy parallelism
    with ThreadPoolExecutor(max_workers=5) as executor:
        results = list(executor.map(extract_single, urls))

    return results

# Usage
urls = [
    "https://example.com/product1",
    "https://example.com/product2",
    "https://example.com/product3"
]

template = """
Extract product name and price from this HTML:
{html}
"""

results = extract_from_urls(urls, template)
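
Parallel extraction makes it easy to hit the API's rate limits. A simple exponential backoff wrapper is a common defense; a minimal sketch (the retry count and delays are arbitrary choices):

import time

from openai import RateLimitError

def call_with_backoff(prompt, retries=4):
    """Retry a rate-limited API call with exponentially growing delays"""
    for attempt in range(retries):
        try:
            return client.chat.completions.create(
                model="deepseek-chat",
                messages=[{"role": "user", "content": prompt}],
                temperature=0.0
            )
        except RateLimitError:
            if attempt == retries - 1:
                raise
            # Wait 1s, 2s, 4s, ... before retrying
            time.sleep(2 ** attempt)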

Integrating with Web Scraping Tools

Deepseek LLM extraction works well when combined with traditional scraping tools. For JavaScript-heavy sites, you can first render the page with browser automation tools and then use Deepseek for extraction:

from playwright.sync_api import sync_playwright

def scrape_with_browser_and_llm(url):
    """Render JavaScript page, then extract with Deepseek"""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)

        # Wait for content to load
        page.wait_for_load_state('networkidle')

        # Get rendered HTML
        html_content = page.content()
        browser.close()

    # Extract with Deepseek
    response = client.chat.completions.create(
        model="deepseek-chat",
        messages=[{
            "role": "user",
            "content": f"Extract all product listings from: {html_content[:8000]}"
        }],
        response_format={"type": "json_object"}
    )

    return response.choices[0].message.content
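
Usage is the same as any synchronous function:

listings = scrape_with_browser_and_llm("https://example.com/products")
print(listings)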

Error Handling and Validation

Implement robust error handling for production use:

import json
from jsonschema import validate, ValidationError

def extract_with_validation(html_content, schema):
    """Extract data and validate against schema"""
    max_retries = 3

    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model="deepseek-chat",
                messages=[{
                    "role": "user",
                    "content": f"Extract data following this schema: {schema}\n\nHTML: {html_content[:8000]}"
                }],
                temperature=0.0,
                response_format={"type": "json_object"}
            )

            data = json.loads(response.choices[0].message.content)

            # Validate against schema
            validate(instance=data, schema=schema)

            return data

        except (json.JSONDecodeError, ValidationError) as e:
            if attempt == max_retries - 1:
                raise Exception(f"Failed to extract valid data after {max_retries} attempts: {e}")
            continue

    return None

# Define validation schema
schema = {
    "type": "object",
    "properties": {
        "product_name": {"type": "string"},
        "price": {"type": "number"}
    },
    "required": ["product_name", "price"]
}

result = extract_with_validation(html_content, schema)

Cost Optimization Tips

LLM data extraction can be expensive at scale. Here are optimization strategies:

  1. Preprocess HTML: Remove unnecessary content before sending it to the API
  2. Cache results: Store extracted data to avoid reprocessing (see the caching sketch below)
  3. Use cheaper models: Start with deepseek-chat before reaching for the pricier deepseek-reasoner
  4. Batch requests: Group multiple extractions when possible
  5. Set token limits: Use the max_tokens parameter to cap response length, as in this example:

response = client.chat.completions.create(
    model="deepseek-chat",
    messages=[{"role": "user", "content": prompt}],
    max_tokens=500,  # Limit response length
    temperature=0.0
)
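
For tip 2, a cache can be as simple as keying responses on a hash of the prompt. A minimal in-memory sketch (a persistent store such as Redis or SQLite would follow the same pattern):

import hashlib

_cache = {}

def extract_cached(prompt):
    """Return a cached result when an identical prompt was already processed"""
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key not in _cache:
        response = client.chat.completions.create(
            model="deepseek-chat",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.0
        )
        _cache[key] = response.choices[0].message.content
    return _cache[key]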

When to Use LLM Data Extraction

LLM extraction with Deepseek is ideal for:

  • Unstructured content: News articles, blog posts, product descriptions
  • Varying layouts: Sites with inconsistent HTML structure
  • Semantic extraction: Getting information based on meaning, not position
  • Complex data: Extracting relationships and context between elements

For structured, consistent pages with predictable layouts, traditional CSS selectors or XPath may be more cost-effective and faster.
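
For example, on a page with a fixed, known layout, the same fields can be pulled with a couple of lines of BeautifulSoup at no per-page API cost (the selectors below are hypothetical):

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_content, "html.parser")
# Hypothetical selectors for a page whose markup never changes
name = soup.select_one("h1.product-title").get_text(strip=True)
price = soup.select_one("span.price").get_text(strip=True)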

Conclusion

Deepseek provides powerful LLM capabilities for intelligent data extraction from web pages. By combining it with proper preprocessing, validation, and error handling, you can build robust extraction pipelines that handle varying HTML structures gracefully. When dealing with dynamic content and AJAX requests, pairing Deepseek with browser automation creates a comprehensive scraping solution.

Remember to always respect robots.txt, rate limits, and website terms of service when extracting data at scale. For production workloads requiring higher reliability and built-in proxy rotation, consider using a dedicated web scraping API service.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
