How can I use AI web scraping with Deepseek?

AI web scraping with Deepseek combines traditional web scraping techniques with advanced language model capabilities to extract, parse, and structure data from websites intelligently. Unlike conventional web scraping that relies on rigid CSS selectors or XPath expressions, Deepseek can understand page context, extract relevant information from unstructured HTML, and transform it into structured formats.
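
For comparison, a conventional selector-based scrape of the same kind of page might look like the sketch below (the class names are hypothetical). The extraction silently fails the moment the site renames a class:

import requests
from bs4 import BeautifulSoup

# Conventional approach: tied to the page's exact markup
html = requests.get("https://example.com/products/laptop", timeout=30).text
soup = BeautifulSoup(html, "html.parser")

# Hypothetical selectors; they return None if the markup changes
name = soup.select_one(".product-title")
price = soup.select_one("span.price")
print(name.text if name else None, price.text if price else None)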

What is Deepseek and Why Use It for Web Scraping?

Deepseek is a powerful large language model (LLM) that offers competitive performance at a fraction of the cost of other major AI providers. For web scraping tasks, Deepseek excels at:

  • Understanding unstructured HTML and extracting meaningful data without precise selectors
  • Handling layout changes gracefully since it interprets content semantically
  • Extracting multiple fields from complex pages in a single API call
  • Normalizing data automatically into consistent formats
  • Understanding context to distinguish between similar elements on a page

Setting Up Deepseek for Web Scraping

Prerequisites

First, you'll need a Deepseek API key. Sign up at deepseek.com and obtain your API credentials.
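
To keep credentials out of source code, a common pattern is to read the key from an environment variable. A minimal sketch (the variable name DEEPSEEK_API_KEY is our convention, not something the API requires):

import os

# Assumes DEEPSEEK_API_KEY has been exported in your shell
api_key = os.environ.get("DEEPSEEK_API_KEY")
if not api_key:
    raise RuntimeError("DEEPSEEK_API_KEY environment variable is not set")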

Install the required dependencies:

Python:

pip install openai requests beautifulsoup4

JavaScript/Node.js:

npm install openai axios cheerio

Basic Integration Pattern

The typical workflow for AI web scraping with Deepseek involves:

  1. Fetch the HTML content using traditional HTTP requests or a headless browser
  2. Clean and prepare the HTML (optional but recommended to reduce token usage)
  3. Send the HTML to Deepseek with extraction instructions
  4. Parse the structured response

Python Implementation

Here's a complete example using Python:

import requests
from bs4 import BeautifulSoup
from openai import OpenAI
import json

# Initialize Deepseek client
client = OpenAI(
    api_key="your-deepseek-api-key",
    base_url="https://api.deepseek.com"
)

def fetch_html(url):
    """Fetch HTML content from a URL"""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    }
    response = requests.get(url, headers=headers, timeout=30)  # Avoid hanging on unresponsive servers
    response.raise_for_status()
    return response.text

def clean_html(html):
    """Remove scripts, styles, and unnecessary tags to reduce tokens"""
    soup = BeautifulSoup(html, 'html.parser')

    # Remove script and style elements
    for element in soup(['script', 'style', 'nav', 'footer', 'header']):
        element.decompose()

    # Get text or minimal HTML
    return str(soup)

def extract_data_with_deepseek(html, extraction_prompt):
    """Use Deepseek to extract structured data from HTML"""

    system_prompt = """You are a web scraping assistant. Extract data from HTML
    according to the user's instructions. Return the data as valid JSON only,
    with no additional text or explanation."""

    response = client.chat.completions.create(
        model="deepseek-chat",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"HTML:\n{html}\n\nInstructions:\n{extraction_prompt}"}
        ],
        temperature=0.0,  # Use 0 for deterministic extraction
        response_format={"type": "json_object"}  # Force JSON output
    )

    return json.loads(response.choices[0].message.content)

# Example usage
url = "https://example.com/products/laptop"
html = fetch_html(url)
cleaned_html = clean_html(html)

extraction_instructions = """
Extract the following fields from this product page:
- product_name: The name of the product
- price: The current price (as a number)
- currency: The currency symbol or code
- rating: The average customer rating
- reviews_count: Number of reviews
- availability: Whether the item is in stock (boolean)
- features: List of key product features
"""

data = extract_data_with_deepseek(cleaned_html, extraction_instructions)
print(json.dumps(data, indent=2))

JavaScript Implementation

Here's the equivalent implementation in Node.js:

const axios = require('axios');
const cheerio = require('cheerio');
const OpenAI = require('openai');

// Initialize Deepseek client
const client = new OpenAI({
  apiKey: 'your-deepseek-api-key',
  baseURL: 'https://api.deepseek.com'
});

async function fetchHTML(url) {
  const response = await axios.get(url, {
    headers: {
      'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    },
    timeout: 30000  // Avoid hanging on unresponsive servers
  });
  return response.data;
}

function cleanHTML(html) {
  const $ = cheerio.load(html);

  // Remove unnecessary elements
  $('script, style, nav, footer, header').remove();

  return $.html();
}

async function extractDataWithDeepseek(html, extractionPrompt) {
  const systemPrompt = `You are a web scraping assistant. Extract data from HTML
  according to the user's instructions. Return the data as valid JSON only,
  with no additional text or explanation.`;

  const response = await client.chat.completions.create({
    model: 'deepseek-chat',
    messages: [
      { role: 'system', content: systemPrompt },
      { role: 'user', content: `HTML:\n${html}\n\nInstructions:\n${extractionPrompt}` }
    ],
    temperature: 0.0,
    response_format: { type: 'json_object' }
  });

  return JSON.parse(response.choices[0].message.content);
}

// Example usage
(async () => {
  const url = 'https://example.com/products/laptop';
  const html = await fetchHTML(url);
  const cleanedHTML = cleanHTML(html);

  const extractionInstructions = `
  Extract the following fields from this product page:
  - product_name: The name of the product
  - price: The current price (as a number)
  - currency: The currency symbol or code
  - rating: The average customer rating
  - reviews_count: Number of reviews
  - availability: Whether the item is in stock (boolean)
  - features: List of key product features
  `;

  const data = await extractDataWithDeepseek(cleanedHTML, extractionInstructions);
  console.log(JSON.stringify(data, null, 2));
})();

Advanced Techniques

Handling JavaScript-Rendered Content

For pages that require JavaScript execution, combine Deepseek with a headless browser:

from playwright.sync_api import sync_playwright

def fetch_dynamic_html(url):
    """Fetch HTML from JavaScript-rendered pages"""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until='networkidle')
        html = page.content()
        browser.close()
        return html

# Use with Deepseek extraction
url = "https://example.com/spa-application"
html = fetch_dynamic_html(url)
data = extract_data_with_deepseek(html, extraction_instructions)

When working with dynamic single-page applications, you may need to wait for specific content to load before extracting the HTML for AI processing.
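
For example, you can block until a specific element appears before grabbing the page content. A minimal sketch reusing sync_playwright (the .product-details selector is hypothetical):

from playwright.sync_api import sync_playwright

def fetch_dynamic_html_waiting(url, selector):
    """Fetch HTML after a specific element has rendered"""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        # Block until the element we care about is in the DOM (timeout in ms)
        page.wait_for_selector(selector, timeout=15000)
        html = page.content()
        browser.close()
        return html

html = fetch_dynamic_html_waiting("https://example.com/spa-application", ".product-details")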

Batch Processing Multiple Pages

Process multiple pages efficiently by batching requests:

from concurrent.futures import ThreadPoolExecutor

def scrape_page(url):
    """Scrape a single page"""
    try:
        html = fetch_html(url)
        cleaned = clean_html(html)
        return extract_data_with_deepseek(cleaned, extraction_instructions)
    except Exception as e:
        print(f"Error scraping {url}: {e}")
        return None

# Scrape multiple URLs in parallel
urls = [
    "https://example.com/product/1",
    "https://example.com/product/2",
    "https://example.com/product/3",
]

with ThreadPoolExecutor(max_workers=5) as executor:
    results = list(executor.map(scrape_page, urls))

# Filter out None values (failed requests)
successful_results = [r for r in results if r is not None]

Reducing Token Usage and Costs

Since Deepseek charges based on tokens, optimize your HTML before sending:

from bs4 import Comment  # Required for the comment-stripping step below

def extract_relevant_content(html, css_selector=None):
    """Extract only the relevant portion of the page"""
    soup = BeautifulSoup(html, 'html.parser')

    if css_selector:
        # Extract only the specified section
        relevant_section = soup.select_one(css_selector)
        if relevant_section:
            return str(relevant_section)

    # Otherwise, clean the full page
    for element in soup(['script', 'style', 'svg', 'nav', 'footer', 'header', 'aside']):
        element.decompose()

    # Remove comments
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        comment.extract()

    # Remove empty tags
    for tag in soup.find_all():
        if not tag.contents and not tag.string:
            tag.decompose()

    return str(soup)

# Use it
html = fetch_html(url)
relevant_html = extract_relevant_content(html, css_selector='main.product-details')
data = extract_data_with_deepseek(relevant_html, extraction_instructions)

Error Handling and Retries

Implement robust error handling for production use:

from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=2, max=10))
def extract_with_retry(html, prompt):
    """Extract data with automatic retries on failure"""
    try:
        return extract_data_with_deepseek(html, prompt)
    except Exception as e:
        print(f"Extraction failed: {e}")
        raise

# Validate extracted data
def validate_product_data(data):
    """Ensure extracted data has required fields"""
    required_fields = ['product_name', 'price']

    for field in required_fields:
        if field not in data or not data[field]:
            raise ValueError(f"Missing required field: {field}")

    return True

# Use in scraping pipeline
try:
    html = fetch_html(url)
    cleaned = clean_html(html)
    data = extract_with_retry(cleaned, extraction_instructions)

    if validate_product_data(data):
        # Process valid data
        print("Successfully extracted:", data)
except Exception as e:
    print(f"Failed to extract data: {e}")

Using Deepseek with Specialized Web Scraping APIs

For production workloads, combine Deepseek with a specialized web scraping API that handles proxies, JavaScript rendering, and anti-bot measures:

import requests

def scrape_with_api_and_deepseek(url, api_key):
    """Use WebScraping.AI API + Deepseek for robust scraping"""

    # Fetch HTML using scraping API
    api_url = "https://api.webscraping.ai/html"
    params = {
        'url': url,
        'api_key': api_key,
        'js': 'true',  # Enable JavaScript rendering
        'timeout': 10000
    }

    response = requests.get(api_url, params=params)
    response.raise_for_status()
    html = response.text

    # Extract data with Deepseek
    cleaned = clean_html(html)
    return extract_data_with_deepseek(cleaned, extraction_instructions)

# Use it
data = scrape_with_api_and_deepseek(
    url="https://example.com/product",
    api_key="your-webscraping-ai-key"
)

Best Practices

1. Use Temperature 0 for Consistent Extraction

For data extraction tasks, set temperature to 0.0 to get the most consistent, repeatable results (this minimizes sampling variability, although LLM output is not strictly guaranteed to be deterministic):

response = client.chat.completions.create(
    model="deepseek-chat",
    temperature=0.0,  # Deterministic output
    # ... other parameters
)

2. Provide Clear, Structured Prompts

Be explicit about the expected output format:

extraction_prompt = """
Extract the following information and return as JSON:

{
  "title": "string - the article title",
  "author": "string - author name",
  "published_date": "string - ISO 8601 format (YYYY-MM-DD)",
  "content": "string - main article text",
  "tags": ["array", "of", "strings"],
  "read_time": "number - estimated reading time in minutes"
}

If a field is not found, use null for the value.
"""

3. Handle Rate Limits

Implement rate limiting to avoid API throttling:

import time
from datetime import datetime, timedelta

class RateLimiter:
    def __init__(self, max_requests_per_minute=60):
        self.max_requests = max_requests_per_minute
        self.requests = []

    def wait_if_needed(self):
        now = datetime.now()
        # Keep only requests made within the last minute
        self.requests = [req for req in self.requests
                         if now - req < timedelta(minutes=1)]

        if len(self.requests) >= self.max_requests:
            # Sleep until the oldest request exits the 1-minute window
            sleep_time = 60 - (now - self.requests[0]).seconds
            time.sleep(max(sleep_time, 0))
            now = datetime.now()  # Refresh the timestamp after sleeping

        self.requests.append(now)

# Use it
limiter = RateLimiter(max_requests_per_minute=50)

for url in urls:
    limiter.wait_if_needed()
    data = scrape_page(url)

4. Monitor Costs

Track token usage to manage costs effectively:

def extract_with_cost_tracking(html, prompt):
    """Extract data and track API costs"""
    system_prompt = ("You are a web scraping assistant. Extract data from HTML "
                     "according to the user's instructions. Return valid JSON only.")

    response = client.chat.completions.create(
        model="deepseek-chat",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"HTML:\n{html}\n\nInstructions:\n{prompt}"}
        ],
        temperature=0.0
    )

    # Track usage
    usage = response.usage
    print(f"Tokens used - Input: {usage.prompt_tokens}, Output: {usage.completion_tokens}")

    # Deepseek pricing (example rates)
    input_cost = usage.prompt_tokens * 0.00014 / 1000  # $0.14 per 1M tokens
    output_cost = usage.completion_tokens * 0.00028 / 1000  # $0.28 per 1M tokens
    total_cost = input_cost + output_cost
    print(f"Estimated cost: ${total_cost:.6f}")

    return json.loads(response.choices[0].message.content)

Conclusion

AI web scraping with Deepseek offers a powerful, cost-effective approach to extracting structured data from websites. By combining traditional web scraping techniques with Deepseek's language understanding capabilities, you can build robust scrapers that handle layout changes, extract complex data, and process unstructured content intelligently.

The key to success is optimizing your HTML input, providing clear extraction instructions, implementing proper error handling, and monitoring costs. When dealing with complex authentication scenarios or JavaScript-heavy sites, combine Deepseek with headless browsers for the best results.

Start with simple extraction tasks, monitor the quality of results, and gradually expand to more complex use cases as you refine your prompts and preprocessing pipeline.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data (the -g flag stops curl from treating the square brackets as a glob pattern):

curl -g "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page%20title&fields[price]=Product%20price&api_key=YOUR_API_KEY"
