How Do I Extract Data from Websites Using Deepseek?

Deepseek is a powerful large language model (LLM) that can be used to extract structured data from websites by interpreting HTML content and converting it into JSON or other formats. Unlike traditional web scraping that relies on brittle CSS selectors or XPath expressions, Deepseek can understand the semantic meaning of content and extract data even when page structure changes.

Understanding Deepseek for Web Scraping

Deepseek offers several models optimized for different tasks:

  • Deepseek-V3: The latest general-purpose model with excellent reasoning capabilities
  • Deepseek-Coder: Specialized for code generation and technical content
  • Deepseek-R1: Enhanced reasoning model for complex extraction tasks

For web scraping, you'll typically use the Deepseek API to send HTML content along with instructions about what data to extract, and receive structured output in return.
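
Under the hood this is an ordinary chat-completions HTTP call; here is a minimal sketch with plain requests (Deepseek's API is OpenAI-compatible, so the payload shape below follows that convention):

import requests

# Minimal raw API call; replace the placeholder key with your own
resp = requests.post(
    "https://api.deepseek.com/chat/completions",
    headers={"Authorization": "Bearer your-deepseek-api-key"},
    json={
        "model": "deepseek-chat",
        "messages": [
            {"role": "user",
             "content": "Extract the product name from this HTML as JSON: <h1>Widget</h1>"}
        ]
    },
)
print(resp.json()["choices"][0]["message"]["content"])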

Prerequisites

Before you start, you'll need:

  1. A Deepseek API key (obtain from platform.deepseek.com)
  2. Python 3.7+ or Node.js 14+ installed
  3. HTTP client library for making requests
  4. HTML fetching capability (requests, axios, or a web scraping API)
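
A quick way to confirm your key and connectivity before writing any scraping code (a sketch assuming the key is stored in a DEEPSEEK_API_KEY environment variable):

import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DEEPSEEK_API_KEY"],  # assumed environment variable
    base_url="https://api.deepseek.com"
)
# Listing models is a cheap smoke test that the key and endpoint work
print([m.id for m in client.models.list()])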

Method 1: Using Python with Deepseek

Here's a complete example of extracting product data from an e-commerce page:

import json

import requests
from openai import OpenAI

# Initialize Deepseek client (compatible with OpenAI SDK)
client = OpenAI(
    api_key="your-deepseek-api-key",
    base_url="https://api.deepseek.com"
)

# Fetch HTML content
url = "https://example.com/product-page"
response = requests.get(url)
response.raise_for_status()  # fail fast instead of sending an error page to the LLM
html_content = response.text

# Define extraction schema
extraction_prompt = """
Extract the following information from the HTML:
- Product name
- Price
- Description
- Availability status
- Customer rating

Return the data as a JSON object.
"""

# Call Deepseek API
completion = client.chat.completions.create(
    model="deepseek-chat",
    messages=[
        {
            "role": "system",
            "content": "You are a data extraction assistant. Extract structured data from HTML and return valid JSON."
        },
        {
            "role": "user",
            # Truncate the HTML to stay within the model's context window
            "content": f"{extraction_prompt}\n\nHTML:\n{html_content[:8000]}"
        }
    ],
    response_format={"type": "json_object"}
)

# Parse the extracted data
extracted_data = json.loads(completion.choices[0].message.content)
print(json.dumps(extracted_data, indent=2))

Output Example

{
  "product_name": "Wireless Bluetooth Headphones",
  "price": "$79.99",
  "description": "Premium noise-canceling headphones with 30-hour battery life",
  "availability": "In Stock",
  "rating": 4.5
}
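
Note that the model returned the price as a display string ("$79.99"). If you need a numeric value, normalize it after parsing; a small hedged helper:

import re

def parse_price(price_str):
    """Convert a display price like '$79.99' or '1,299.00' into a float"""
    match = re.search(r'[\d.]+', price_str.replace(',', ''))
    return float(match.group()) if match else None

print(parse_price(extracted_data['price']))  # 79.99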

Method 2: Using JavaScript/Node.js with Deepseek

For JavaScript developers, here's how to integrate Deepseek with a web scraping workflow:

const axios = require('axios');

async function extractDataWithDeepseek(html, schema) {
  const apiKey = 'your-deepseek-api-key';

  const response = await axios.post(
    'https://api.deepseek.com/v1/chat/completions',
    {
      model: 'deepseek-chat',
      messages: [
        {
          role: 'system',
          content: 'You are a web scraping assistant. Extract data from HTML and return structured JSON.'
        },
        {
          role: 'user',
          // Truncate the HTML to stay within the model's context window
          content: `Extract the following fields: ${schema.join(', ')}\n\nHTML:\n${html.slice(0, 8000)}`
        }
      ],
      response_format: { type: 'json_object' },
      temperature: 0.1
    },
    {
      headers: {
        'Authorization': `Bearer ${apiKey}`,
        'Content-Type': 'application/json'
      }
    }
  );

  return JSON.parse(response.data.choices[0].message.content);
}

// Example usage
async function scrapeWebsite() {
  const url = 'https://example.com/articles';
  const htmlResponse = await axios.get(url);

  const schema = ['title', 'author', 'publish_date', 'content', 'tags'];
  const extractedData = await extractDataWithDeepseek(
    htmlResponse.data,
    schema
  );

  console.log(extractedData);
}

scrapeWebsite().catch(console.error);

Advanced Techniques

Handling Large HTML Documents

Deepseek has token limits, so for large pages, extract only relevant sections:

from bs4 import BeautifulSoup

def extract_relevant_html(full_html, selector):
    """Extract only the relevant portion of HTML"""
    soup = BeautifulSoup(full_html, 'html.parser')
    relevant_section = soup.select_one(selector)
    return str(relevant_section) if relevant_section else full_html

# Usage (url as defined in the earlier examples)
html = requests.get(url).text
focused_html = extract_relevant_html(html, '.product-details')

# Now send only focused_html to Deepseek
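
If the focused section is still too large, you can split it into chunks and extract from each one separately. A rough sketch; the ~4-characters-per-token ratio and the 32,000-character cap are assumptions, not official limits:

def chunk_html(html, max_chars=32000):
    """Split HTML into pieces that fit within the model's context window.

    32,000 chars is roughly 8k tokens at ~4 chars/token (an approximation).
    """
    return [html[i:i + max_chars] for i in range(0, len(html), max_chars)]

# Send each chunk to Deepseek separately, then merge the partial results
chunks = chunk_html(focused_html)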

Batch Processing Multiple Pages

For scraping multiple pages efficiently:

import asyncio
import json

import requests
from openai import AsyncOpenAI

async def extract_batch(urls, client):
    """Extract data from multiple URLs concurrently"""
    async def process_url(url):
        # Run the blocking fetch in a thread so it doesn't stall the event loop
        response = await asyncio.to_thread(requests.get, url)
        html = response.text
        completion = await client.chat.completions.create(
            model="deepseek-chat",
            messages=[
                {"role": "system", "content": "Extract product data as JSON."},
                {"role": "user", "content": f"HTML: {html[:8000]}"}
            ],
            response_format={"type": "json_object"}
        )
        return json.loads(completion.choices[0].message.content)

    tasks = [process_url(url) for url in urls]
    return await asyncio.gather(*tasks)

# Usage
client = AsyncOpenAI(
    api_key="your-key",
    base_url="https://api.deepseek.com"
)

urls = ['https://example.com/product1', 'https://example.com/product2']
results = asyncio.run(extract_batch(urls, client))
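
When scraping many URLs, it's worth capping concurrency so you don't exhaust API rate limits or hammer the target site. A sketch using a semaphore; the default of 5 is an arbitrary example value:

import asyncio

async def extract_batch_limited(urls, client, max_concurrent=5):
    """Same as extract_batch above, but caps how many requests run at once"""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def limited(url):
        async with semaphore:
            # Reuse the per-URL extraction from extract_batch by batching one URL
            return (await extract_batch([url], client))[0]

    return await asyncio.gather(*(limited(url) for url in urls))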

Structured Output with Schema Validation

Define precise JSON schemas for consistent extraction:

schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "price": {"type": "number"},
        "currency": {"type": "string"},
        "in_stock": {"type": "boolean"},
        "images": {
            "type": "array",
            "items": {"type": "string"}
        }
    },
    "required": ["title", "price"]
}

prompt = f"""
Extract data matching this JSON schema:
{json.dumps(schema, indent=2)}

Only return valid JSON matching this exact structure.
"""

Combining Deepseek with Browser Automation

For JavaScript-heavy websites, combine Deepseek with browser automation tools such as Playwright (shown below) or Puppeteer:

from playwright.sync_api import sync_playwright

def scrape_dynamic_page(url):
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)

        # Wait for dynamic content to load
        page.wait_for_selector('.product-data')

        # Get rendered HTML
        html = page.content()
        browser.close()

        # Use Deepseek to extract data; extract_with_deepseek is a helper
        # wrapping the Method 1 API call
        extracted = extract_with_deepseek(html)
        return extracted

This approach is particularly useful when dealing with AJAX requests and dynamically loaded content.
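
If no stable selector is available to wait on, Playwright's load-state waits are an alternative; a sketch using 'networkidle' (a real Playwright wait state, though it can be unreliable on sites that poll continuously):

from playwright.sync_api import sync_playwright

def get_rendered_html(url):
    """Fetch fully rendered HTML when there's no obvious selector to wait for"""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        # Wait until network traffic goes quiet before grabbing the DOM
        page.wait_for_load_state('networkidle')
        html = page.content()
        browser.close()
        return html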

Error Handling and Validation

Always implement robust error handling:

import time

def safe_extract(html, max_retries=3):
    """Extract data with retry logic and validation (uses the client from Method 1)"""
    for attempt in range(max_retries):
        try:
            completion = client.chat.completions.create(
                model="deepseek-chat",
                messages=[
                    {"role": "system", "content": "Extract data as JSON."},
                    {"role": "user", "content": html}
                ],
                timeout=30
            )

            data = json.loads(completion.choices[0].message.content)

            # Validate required fields
            required_fields = ['title', 'price']
            if all(field in data for field in required_fields):
                return data
            else:
                raise ValueError("Missing required fields")

        except json.JSONDecodeError:
            if attempt < max_retries - 1:
                continue
            raise
        except Exception as e:
            print(f"Attempt {attempt + 1} failed: {e}")
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)  # Exponential backoff
            else:
                raise

    return None
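
A short usage sketch (html_content as fetched in Method 1; the field names match the validation above):

# Returns validated data, or raises after max_retries failed attempts
data = safe_extract(html_content[:8000])
print(data['title'], data['price'])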

Cost Optimization Tips

Deepseek is cost-effective, but you can optimize further:

  1. Preprocess HTML: Remove unnecessary tags, scripts, and styles before sending to the API (see clean_html below)
  2. Use focused selectors: Extract only relevant sections using BeautifulSoup or similar
  3. Cache results: Store extracted data to avoid re-processing identical pages (a caching sketch follows the code below)
  4. Batch requests: Process multiple extractions in a single API call when possible
  5. Set lower temperature: Use temperature=0 or 0.1 for deterministic, focused extraction

Here's what the preprocessing step might look like:

from bs4 import BeautifulSoup

def clean_html(raw_html):
    """Strip non-content elements and return visible text to reduce token usage"""
    soup = BeautifulSoup(raw_html, 'html.parser')

    # Remove scripts, styles, and page chrome that rarely contain target data
    for tag in soup(['script', 'style', 'nav', 'footer', 'header']):
        tag.decompose()

    # Return visible text only; markup is dropped entirely to save tokens
    return soup.get_text(separator=' ', strip=True)
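
Caching (tip 3) can be as simple as keying results by a hash of the cleaned HTML. A minimal in-memory sketch; a production pipeline would more likely use Redis or a database:

import hashlib

_extraction_cache = {}

def cached_extract(html, extract_fn):
    """Memoize extraction results so identical pages are processed only once"""
    key = hashlib.sha256(html.encode('utf-8')).hexdigest()
    if key not in _extraction_cache:
        _extraction_cache[key] = extract_fn(html)
    return _extraction_cache[key]

# Usage: cached_extract(clean_html(raw_html), extract_with_deepseek)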

Handling Authentication and Sessions

For pages requiring authentication, fetch HTML with proper session handling:

session = requests.Session()
session.headers.update({'User-Agent': 'Mozilla/5.0...'})

# Login
session.post('https://example.com/login', data={
    'username': 'user',
    'password': 'pass'
})

# Scrape authenticated page
html = session.get('https://example.com/protected-data').text
extracted = extract_with_deepseek(html)

For complex authentication scenarios with browser sessions, consider using browser automation tools before passing HTML to Deepseek.
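
For example, you could log in once with Playwright and hand its cookies to requests; a sketch where the form selectors are hypothetical placeholders:

import requests
from playwright.sync_api import sync_playwright

def login_and_get_session(login_url):
    """Log in via a real browser, then transfer cookies to a requests.Session"""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(login_url)
        # Selector names below are hypothetical placeholders for your login form
        page.fill('#username', 'user')
        page.fill('#password', 'pass')
        page.click('#login-button')
        page.wait_for_load_state('networkidle')
        cookies = page.context.cookies()
        browser.close()

    session = requests.Session()
    for c in cookies:
        session.cookies.set(c['name'], c['value'], domain=c['domain'])
    return session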

When to Use Deepseek vs Traditional Scraping

Use Deepseek when:

  • Page structure changes frequently
  • Data is presented in varied formats
  • You need semantic understanding (e.g., extracting sentiment or categorizing content)
  • Working with unstructured content like articles or reviews

Use traditional CSS/XPath when:

  • Page structure is stable and predictable
  • You need maximum speed and minimal cost
  • Extracting simple, well-structured data
  • Processing millions of pages at scale
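
The two approaches also combine well: try a cheap CSS selector first and fall back to Deepseek only when it misses. A minimal sketch, with an illustrative selector:

from bs4 import BeautifulSoup

def hybrid_extract_title(html):
    """Fast path via CSS selector; LLM fallback when the selector breaks"""
    soup = BeautifulSoup(html, 'html.parser')
    node = soup.select_one('h1.product-title')  # illustrative selector
    if node and node.get_text(strip=True):
        return node.get_text(strip=True)
    # Fall back to the Deepseek-based helper used throughout this article
    return extract_with_deepseek(html).get('title')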

Conclusion

Deepseek provides a flexible, AI-powered approach to web scraping that can adapt to changing page structures and extract semantic information that traditional methods struggle with. By combining Deepseek with traditional web scraping tools and proper preprocessing, you can build robust data extraction pipelines that are both intelligent and cost-effective.

The key to success is finding the right balance between AI-powered extraction and traditional methods, preprocessing HTML to reduce token usage, and implementing proper error handling and validation throughout your scraping workflow.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
