How do I use LLM for web scraping with Deepseek?

Using Deepseek LLM for web scraping represents a paradigm shift from traditional selector-based extraction to intelligent, context-aware data parsing. Deepseek's powerful language models can understand page structure, extract relevant information, and transform unstructured HTML into structured data without requiring brittle XPath or CSS selectors.

What is LLM-based web scraping?

LLM-based web scraping uses large language models to interpret and extract data from web pages. Instead of writing complex selectors that break when page layouts change, you describe what data you want in natural language, and the LLM intelligently extracts it. Deepseek offers several models optimized for this task, including DeepSeek-V3 and DeepSeek-R1.

Setting up Deepseek for web scraping

Prerequisites

Before you begin, you'll need:

  1. A Deepseek API key (get one from platform.deepseek.com)
  2. Python 3.7+ or Node.js 14+ installed
  3. A library for making HTTP requests (requests, axios, or similar)

Installation

Python:

pip install openai requests beautifulsoup4

JavaScript:

npm install openai axios cheerio

Basic LLM web scraping workflow

The typical workflow for LLM-based web scraping with Deepseek involves:

  1. Fetch the HTML content from the target page
  2. Optionally clean or simplify the HTML
  3. Send the HTML to Deepseek with a prompt describing what to extract
  4. Parse the structured response

Python implementation

Here's a complete example using Python:

from openai import OpenAI
import requests
import json

# Initialize Deepseek client
client = OpenAI(
    api_key="your-deepseek-api-key",
    base_url="https://api.deepseek.com"
)

def scrape_with_deepseek(url, extraction_prompt):
    """
    Scrape a webpage using Deepseek LLM

    Args:
        url: Target webpage URL
        extraction_prompt: Natural language description of data to extract

    Returns:
        Extracted data as a dictionary
    """
    # Fetch the webpage
    response = requests.get(url, headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    }, timeout=30)
    response.raise_for_status()
    html_content = response.text

    # Create the prompt for Deepseek
    system_prompt = """You are a web scraping assistant. Extract the requested
    information from the HTML and return it as valid JSON. Be precise and only
    extract data that is clearly present in the HTML."""

    # Truncate the HTML to stay within token limits
    truncated_html = html_content[:8000]

    user_prompt = f"""Extract the following information from this HTML:
    {extraction_prompt}

    HTML Content:
    {truncated_html}

    Return the data as a JSON object."""

    # Call Deepseek API
    completion = client.chat.completions.create(
        model="deepseek-chat",  # or "deepseek-reasoner" for complex tasks
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ],
        response_format={'type': 'json_object'},  # Ensure JSON output
        temperature=0.1  # Lower temperature for more consistent extraction
    )

    # Parse and return the result
    result = json.loads(completion.choices[0].message.content)
    return result

# Example usage
url = "https://example.com/product-page"
prompt = """Extract:
- Product title
- Price
- Description
- Availability status
- Customer rating"""

data = scrape_with_deepseek(url, prompt)
print(json.dumps(data, indent=2))

JavaScript implementation

Here's the equivalent implementation in Node.js:

const OpenAI = require('openai');
const axios = require('axios');

// Initialize Deepseek client
const client = new OpenAI({
    apiKey: 'your-deepseek-api-key',
    baseURL: 'https://api.deepseek.com'
});

async function scrapeWithDeepseek(url, extractionPrompt) {
    try {
        // Fetch the webpage
        const response = await axios.get(url, {
            headers: {
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
            }
        });

        const htmlContent = response.data;

        // Create the prompt
        const systemPrompt = `You are a web scraping assistant. Extract the requested
        information from the HTML and return it as valid JSON. Be precise and only
        extract data that is clearly present in the HTML.`;

        const userPrompt = `Extract the following information from this HTML:
        ${extractionPrompt}

        HTML Content:
        ${htmlContent.substring(0, 8000)}

        Return the data as a JSON object.`;

        // Call Deepseek API
        const completion = await client.chat.completions.create({
            model: 'deepseek-chat',
            messages: [
                { role: 'system', content: systemPrompt },
                { role: 'user', content: userPrompt }
            ],
            response_format: { type: 'json_object' },
            temperature: 0.1
        });

        // Parse and return result
        const result = JSON.parse(completion.choices[0].message.content);
        return result;

    } catch (error) {
        console.error('Scraping error:', error.message);
        throw error;
    }
}

// Example usage
const url = 'https://example.com/product-page';
const prompt = `Extract:
- Product title
- Price
- Description
- Availability status
- Customer rating`;

scrapeWithDeepseek(url, prompt)
    .then(data => console.log(JSON.stringify(data, null, 2)))
    .catch(error => console.error(error));

Advanced techniques

Using DeepSeek-R1 for complex reasoning

For pages with complex layouts or when you need the model to perform reasoning, use the deepseek-reasoner model:

completion = client.chat.completions.create(
    model="deepseek-reasoner",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt}
    ],
    response_format={'type': 'json_object'}
)

# Access the reasoning process
reasoning = completion.choices[0].message.reasoning_content
result = json.loads(completion.choices[0].message.content)

print("Reasoning:", reasoning)
print("Result:", result)

Handling pagination and dynamic content

When scraping pages with dynamic content or AJAX requests, combine Deepseek with browser automation:

from playwright.sync_api import sync_playwright

def scrape_dynamic_page_with_llm(url, extraction_prompt):
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()

        # Navigate and wait for content
        page.goto(url)
        page.wait_for_load_state('networkidle')

        # Get rendered HTML
        html_content = page.content()
        browser.close()

        # Now use Deepseek to extract data
        return extract_with_deepseek(html_content, extraction_prompt)
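
This snippet calls extract_with_deepseek, a variant of scrape_with_deepseek that accepts already-fetched HTML instead of a URL. Here's a minimal sketch of that helper, reusing the client and prompts from the earlier Python example (the body is an assumed adaptation, not code from the original):

def extract_with_deepseek(html_content, extraction_prompt):
    """Extract structured data from raw HTML using Deepseek."""
    # Truncate the HTML to stay within token limits
    truncated_html = html_content[:8000]

    user_prompt = f"""Extract the following information from this HTML:
    {extraction_prompt}

    HTML Content:
    {truncated_html}

    Return the data as a JSON object."""

    completion = client.chat.completions.create(
        model="deepseek-chat",
        messages=[
            {"role": "system", "content": "You are a web scraping assistant. "
                "Extract the requested information from the HTML and return it as valid JSON."},
            {"role": "user", "content": user_prompt}
        ],
        response_format={'type': 'json_object'},
        temperature=0.1
    )
    return json.loads(completion.choices[0].message.content)

For paginated listings, apply the same pattern in a loop: render each page URL with Playwright, then pass the resulting HTML through this helper.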

Extracting data from multiple pages

For scraping multiple pages efficiently:

import concurrent.futures

def scrape_multiple_urls(urls, extraction_prompt):
    """
    Scrape multiple URLs in parallel using Deepseek
    """
    def scrape_single(url):
        try:
            return {
                'url': url,
                'data': scrape_with_deepseek(url, extraction_prompt),
                'success': True
            }
        except Exception as e:
            return {
                'url': url,
                'error': str(e),
                'success': False
            }

    # Use ThreadPoolExecutor for parallel processing
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        results = list(executor.map(scrape_single, urls))

    return results

# Example
urls = [
    'https://example.com/product/1',
    'https://example.com/product/2',
    'https://example.com/product/3'
]

results = scrape_multiple_urls(urls, "Extract product title and price")

Optimizing HTML for LLM processing

To reduce token usage and improve accuracy, clean the HTML before sending it to Deepseek:

from bs4 import BeautifulSoup, Comment

def clean_html_for_llm(html_content):
    """
    Remove unnecessary elements to reduce token count
    """
    soup = BeautifulSoup(html_content, 'html.parser')

    # Remove script and style elements
    for element in soup(['script', 'style', 'noscript', 'iframe']):
        element.decompose()

    # Remove HTML comments
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        comment.extract()

    # Return the cleaned HTML with its structure preserved
    return str(soup)

# Use in scraping
html_content = requests.get(url).text
cleaned_html = clean_html_for_llm(html_content)
# Now send cleaned_html to Deepseek

Handling errors and retries

Implement robust error handling when working with LLMs:

from tenacity import retry, stop_after_attempt, wait_exponential

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=4, max=10)
)
def scrape_with_retry(url, extraction_prompt):
    """
    Scrape with automatic retries on failure
    """
    try:
        result = scrape_with_deepseek(url, extraction_prompt)

        # Validate the result before returning it
        if not result:
            raise ValueError("Empty result from LLM")

        return result

    except Exception as e:
        print(f"Error scraping {url}: {str(e)}")
        raise

Cost optimization strategies

Deepseek offers competitive pricing, but for large-scale scraping, consider these optimizations:

  1. Pre-filter HTML: Extract only relevant sections before sending to the LLM
  2. Batch requests: Group multiple small extraction tasks into a single API call (sketched after the caching example below)
  3. Cache results: Store extracted data to avoid re-processing identical pages (example below)
  4. Use appropriate models: DeepSeek-Chat for simple extraction, DeepSeek-R1 only when reasoning is needed
Here's an example implementing the caching strategy:

import hashlib
import json
from pathlib import Path

class CachedScraper:
    def __init__(self, cache_dir='./scrape_cache'):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(exist_ok=True)

    def get_cache_key(self, url, prompt):
        """Generate cache key from URL and prompt"""
        combined = f"{url}:{prompt}"
        return hashlib.md5(combined.encode()).hexdigest()

    def scrape_cached(self, url, extraction_prompt):
        """Scrape with caching"""
        cache_key = self.get_cache_key(url, extraction_prompt)
        cache_file = self.cache_dir / f"{cache_key}.json"

        # Check cache
        if cache_file.exists():
            with open(cache_file, 'r') as f:
                return json.load(f)

        # Scrape and cache
        result = scrape_with_deepseek(url, extraction_prompt)

        with open(cache_file, 'w') as f:
            json.dump(result, f)

        return result
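
For the batching strategy, several small extraction tasks can share one API call. A hedged sketch, assuming the snippets fit comfortably within the context window; batch_extract and its page-numbering scheme are illustrative conventions, not a Deepseek API feature:

def batch_extract(html_snippets, extraction_prompt):
    """Extract the same fields from several small HTML snippets in one call."""
    # Number each snippet so the model can key its answers by page index
    numbered = "\n\n".join(
        f"--- PAGE {i} ---\n{html[:2000]}"
        for i, html in enumerate(html_snippets)
    )

    user_prompt = f"""For each page below, {extraction_prompt}.
Return a JSON object mapping each page number to its extracted data.

{numbered}"""

    completion = client.chat.completions.create(
        model="deepseek-chat",
        messages=[{"role": "user", "content": user_prompt}],
        response_format={'type': 'json_object'},
        temperature=0.1
    )
    return json.loads(completion.choices[0].message.content)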

Combining traditional scraping with LLM extraction

For best results, you can navigate to specific sections using traditional tools, then use Deepseek for intelligent extraction:

def hybrid_scraping_approach(url):
    """
    Use BeautifulSoup to isolate relevant sections,
    then Deepseek to extract structured data
    """
    # Fetch page
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')

    # Find the product container using traditional selectors
    product_section = soup.find('div', {'class': 'product-details'})

    if product_section:
        # Now use LLM to extract from this specific section
        section_html = str(product_section)

        result = extract_with_deepseek(
            section_html,
            "Extract product title, price, and specifications"
        )
        return result

    return None

Best practices

  1. Be specific in prompts: Clearly describe the exact data you want and the expected format
  2. Use JSON mode: Enable response_format={'type': 'json_object'} for structured output
  3. Set low temperature: Use temperature around 0.1 for consistent extraction
  4. Validate outputs: Always validate the LLM's response before using it (see the sketch after this list)
  5. Handle failures gracefully: Implement retries and fallback mechanisms
  6. Monitor token usage: Track API costs and optimize HTML preprocessing
  7. Respect rate limits: Implement proper rate limiting in your scraping code
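
As an example of point 4, here's a minimal validation helper; the required field names are illustrative and should match your extraction prompt:

def validate_extraction(result, required_fields=('title', 'price')):
    """Ensure the LLM returned a JSON object containing the expected fields."""
    if not isinstance(result, dict):
        raise ValueError(f"Expected a JSON object, got {type(result).__name__}")
    missing = [field for field in required_fields if not result.get(field)]
    if missing:
        raise ValueError(f"Missing or empty fields: {missing}")
    return result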

Conclusion

Using Deepseek LLM for web scraping provides a flexible, maintainable alternative to traditional selector-based approaches. While it may be more expensive per request than simple parsing, the ability to handle layout changes, extract semantically similar data, and process unstructured content makes it invaluable for complex scraping tasks. Start with simple extractions, optimize your prompts and HTML preprocessing, and scale up as you become familiar with the model's capabilities.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
