Deepseek API Tutorial for Web Scraping Beginners

This comprehensive tutorial will guide you through using the Deepseek API for web scraping tasks, from basic setup to advanced data extraction techniques. Deepseek offers powerful language models that can parse HTML, extract structured data, and understand complex web page layouts without requiring traditional CSS selectors or XPath expressions.

What is Deepseek?

Deepseek is a family of large language models (LLMs) that excel at understanding and processing structured and unstructured data. For web scraping, Deepseek models can intelligently extract information from HTML content by understanding context and semantics, making them ideal for scenarios where traditional parsing methods fall short.

Prerequisites

Before starting this tutorial, you should have:

  • Basic knowledge of Python or JavaScript
  • An API key from Deepseek (sign up at deepseek.com)
  • Python 3.7+ or Node.js 14+ installed
  • A text editor or IDE

Step 1: Getting Your Deepseek API Key

  1. Visit the Deepseek platform
  2. Sign up for an account
  3. Navigate to the API section
  4. Generate a new API key
  5. Store your API key securely (never commit it to version control)
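A common way to keep the key out of your code is to store it in an environment variable and read it at runtime. A minimal sketch (the variable name `DEEPSEEK_API_KEY` is just a convention; use whatever name fits your setup):

```python
import os

def load_api_key(var_name="DEEPSEEK_API_KEY"):
    """Read the API key from an environment variable instead of hardcoding it."""
    key = os.environ.get(var_name)
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first")
    return key
```

You can then pass `load_api_key()` to the client constructor instead of a literal string, and the key never appears in version control.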

Step 2: Installation and Setup

Python Setup

First, install the required packages:

pip install openai requests beautifulsoup4

The Deepseek API is compatible with the OpenAI SDK, making integration straightforward.

Create a new Python file and set up your environment:

import os
from openai import OpenAI
import requests
from bs4 import BeautifulSoup

# Set your Deepseek API key
client = OpenAI(
    api_key="YOUR_DEEPSEEK_API_KEY",
    base_url="https://api.deepseek.com"
)

JavaScript Setup

Install the necessary packages:

npm install openai axios cheerio

Set up your JavaScript environment:

const OpenAI = require('openai');
const axios = require('axios');
const cheerio = require('cheerio');

const client = new OpenAI({
    apiKey: 'YOUR_DEEPSEEK_API_KEY',
    baseURL: 'https://api.deepseek.com'
});

Step 3: Basic Web Scraping with Deepseek

Fetching HTML Content

Before using Deepseek, you need to fetch the HTML content from your target website:

Python:

def fetch_html(url):
    """Fetch HTML content from a URL"""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    }
    response = requests.get(url, headers=headers)
    response.raise_for_status()
    return response.text

# Example usage
html_content = fetch_html('https://example.com/products')

JavaScript:

async function fetchHTML(url) {
    const response = await axios.get(url, {
        headers: {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        }
    });
    return response.data;
}

// Example usage (await must run inside an async function in CommonJS)
const htmlContent = await fetchHTML('https://example.com/products');

Extracting Data with Deepseek

Now, use Deepseek to extract structured data from the HTML:

Python:

def extract_data_with_deepseek(html_content, extraction_prompt):
    """Extract structured data using Deepseek"""
    response = client.chat.completions.create(
        model="deepseek-chat",
        messages=[
            {
                "role": "system",
                "content": "You are a web scraping assistant. Extract data from HTML and return it in JSON format."
            },
            {
                "role": "user",
                "content": f"{extraction_prompt}\n\nHTML:\n{html_content}"
            }
        ],
        temperature=0.0,  # Low temperature for more consistent output (not strictly deterministic)
        response_format={"type": "json_object"}
    )

    # The model returns a JSON string; parse it with json.loads() if you need a dict
    return response.choices[0].message.content

# Example: Extract product information
prompt = """
Extract all products from this e-commerce page. For each product, extract:
- Product name
- Price
- Description
- Availability status

Return as a JSON array with key 'products'.
"""

result = extract_data_with_deepseek(html_content, prompt)
print(result)

JavaScript:

async function extractDataWithDeepseek(htmlContent, extractionPrompt) {
    const response = await client.chat.completions.create({
        model: 'deepseek-chat',
        messages: [
            {
                role: 'system',
                content: 'You are a web scraping assistant. Extract data from HTML and return it in JSON format.'
            },
            {
                role: 'user',
                content: `${extractionPrompt}\n\nHTML:\n${htmlContent}`
            }
        ],
        temperature: 0.0,
        response_format: { type: 'json_object' }
    });

    return JSON.parse(response.choices[0].message.content);
}

// Example usage (inside an async function)
const prompt = `
Extract all products from this e-commerce page. For each product, extract:
- Product name
- Price
- Description
- Availability status

Return as a JSON array with key 'products'.
`;

const result = await extractDataWithDeepseek(htmlContent, prompt);
console.log(result);

Step 4: Advanced Techniques

Cleaning HTML Before Extraction

For better results and to reduce token usage, clean the HTML before sending it to Deepseek:

Python:

def clean_html(html_content):
    """Remove unnecessary elements from HTML"""
    soup = BeautifulSoup(html_content, 'html.parser')

    # Remove script and style elements
    for element in soup(['script', 'style', 'nav', 'footer', 'header']):
        element.decompose()

    # Return the cleaned HTML as a string
    return str(soup)

# Use cleaned HTML
cleaned_html = clean_html(html_content)
result = extract_data_with_deepseek(cleaned_html, prompt)

JavaScript:

function cleanHTML(htmlContent) {
    const $ = cheerio.load(htmlContent);

    // Remove unnecessary elements
    $('script, style, nav, footer, header').remove();

    return $.html();
}

// Use cleaned HTML (await requires an async context)
const cleanedHTML = cleanHTML(htmlContent);
const result = await extractDataWithDeepseek(cleanedHTML, prompt);

Handling Large Pages

When working with large HTML documents, you may exceed token limits. Here's how to handle this:

Python:

def extract_relevant_section(html_content, css_selector):
    """Extract only relevant section of the page"""
    soup = BeautifulSoup(html_content, 'html.parser')
    section = soup.select_one(css_selector)
    return str(section) if section else html_content

# Extract only the main content area
relevant_html = extract_relevant_section(html_content, '.product-list')
result = extract_data_with_deepseek(relevant_html, prompt)
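If no single selector isolates the data, another option is to split the cleaned HTML into chunks that fit the context window, run the extraction once per chunk, and merge the results. A minimal sketch (the chunk size here is an arbitrary character budget, not an exact token count):

```python
def split_html(html_content, max_chars=100_000):
    """Split HTML into consecutive chunks under a character budget.

    Character count is only a rough proxy for tokens; adjust max_chars
    to stay safely under the model's context limit.
    """
    return [html_content[i:i + max_chars]
            for i in range(0, len(html_content), max_chars)]

# Each chunk would then be passed to extract_data_with_deepseek()
# and the per-chunk 'products' arrays merged into one list.
```

Note that a naive character split can cut an item in half at a chunk boundary; splitting on element boundaries (for example, one chunk per product card) is safer when the page structure allows it.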

Structured Output with Function Calling

Use function calling for more reliable structured output:

Python:

def extract_with_function_calling(html_content):
    """Use function calling for structured extraction"""
    tools = [{
        "type": "function",
        "function": {
            "name": "extract_products",
            "description": "Extract product information from HTML",
            "parameters": {
                "type": "object",
                "properties": {
                    "products": {
                        "type": "array",
                        "items": {
                            "type": "object",
                            "properties": {
                                "name": {"type": "string"},
                                "price": {"type": "number"},
                                "description": {"type": "string"},
                                "in_stock": {"type": "boolean"}
                            },
                            "required": ["name", "price"]
                        }
                    }
                },
                "required": ["products"]
            }
        }
    }]

    response = client.chat.completions.create(
        model="deepseek-chat",
        messages=[
            {
                "role": "user",
                "content": f"Extract all products from this HTML:\n{html_content}"
            }
        ],
        tools=tools,
        tool_choice={"type": "function", "function": {"name": "extract_products"}}
    )

    # function.arguments is a JSON string; parse it with json.loads() before use
    return response.choices[0].message.tool_calls[0].function.arguments

Step 5: Error Handling and Best Practices

Implementing Retry Logic

Python:

import time
from openai import APIError, RateLimitError

def extract_with_retry(html_content, prompt, max_retries=3):
    """Extract data with retry logic"""
    for attempt in range(max_retries):
        try:
            return extract_data_with_deepseek(html_content, prompt)
        except RateLimitError:
            if attempt < max_retries - 1:
                wait_time = 2 ** attempt  # Exponential backoff
                print(f"Rate limited. Waiting {wait_time} seconds...")
                time.sleep(wait_time)
            else:
                raise
        except APIError as e:
            print(f"API error: {e}")
            raise

Cost Optimization

Monitor and optimize your token usage:

Python:

def extract_with_cost_tracking(html_content, prompt):
    """Track token usage and estimated costs"""
    response = client.chat.completions.create(
        model="deepseek-chat",
        messages=[
            {"role": "system", "content": "Extract data as JSON."},
            {"role": "user", "content": f"{prompt}\n\n{html_content}"}
        ],
        temperature=0.0
    )

    usage = response.usage
    print(f"Tokens used: {usage.total_tokens}")
    print(f"Prompt tokens: {usage.prompt_tokens}")
    print(f"Completion tokens: {usage.completion_tokens}")

    # Approximate Deepseek pricing per million tokens (as of 2025; check current rates)
    cost = (usage.prompt_tokens * 0.14 + usage.completion_tokens * 0.28) / 1_000_000
    print(f"Estimated cost: ${cost:.6f}")

    return response.choices[0].message.content

Step 6: Complete Working Example

Here's a complete example that scrapes product data from an e-commerce site:

Python:

import os
import json
from openai import OpenAI
import requests
from bs4 import BeautifulSoup

class DeepseekScraper:
    def __init__(self, api_key):
        self.client = OpenAI(
            api_key=api_key,
            base_url="https://api.deepseek.com"
        )

    def fetch_page(self, url):
        """Fetch and clean HTML content"""
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        }
        response = requests.get(url, headers=headers)
        response.raise_for_status()

        # Clean HTML
        soup = BeautifulSoup(response.text, 'html.parser')
        for element in soup(['script', 'style', 'nav', 'footer']):
            element.decompose()

        return str(soup)

    def extract_data(self, html_content, schema_description):
        """Extract structured data using Deepseek"""
        response = self.client.chat.completions.create(
            model="deepseek-chat",
            messages=[
                {
                    "role": "system",
                    "content": "You are a web scraping expert. Extract data accurately and return valid JSON."
                },
                {
                    "role": "user",
                    "content": f"{schema_description}\n\nHTML:\n{html_content}"
                }
            ],
            temperature=0.0,
            response_format={"type": "json_object"}
        )

        return json.loads(response.choices[0].message.content)

    def scrape(self, url, schema_description):
        """Complete scraping pipeline"""
        print(f"Fetching {url}...")
        html = self.fetch_page(url)

        print("Extracting data with Deepseek...")
        data = self.extract_data(html, schema_description)

        return data

# Usage example
if __name__ == "__main__":
    scraper = DeepseekScraper(api_key=os.environ.get("DEEPSEEK_API_KEY"))

    schema = """
    Extract all product listings from this page. For each product, extract:
    - name: Product name (string)
    - price: Numeric price value (number)
    - currency: Currency symbol or code (string)
    - rating: Customer rating if available (number or null)
    - image_url: Main product image URL (string or null)

    Return as: {"products": [...]}
    """

    results = scraper.scrape("https://example.com/products", schema)
    print(json.dumps(results, indent=2))

Comparing Deepseek to Traditional Methods

While traditional web scraping tools like Beautiful Soup and Selenium rely on CSS selectors and DOM traversal, Deepseek offers several advantages:

  • No selector maintenance: Extract data without writing fragile CSS or XPath selectors
  • Semantic understanding: Understands context and can handle layout variations
  • Natural language queries: Describe what you want in plain English
  • Adaptive to changes: More resilient to minor HTML structure changes

However, for simple, high-volume scraping tasks, traditional methods are still more cost-effective and faster.
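For contrast, the same product extraction done the traditional way with Beautiful Soup might look like the sketch below. The class names are hypothetical and would have to match the target page's markup exactly, which is precisely the maintenance burden the LLM approach avoids:

```python
from bs4 import BeautifulSoup

def extract_products_traditional(html_content):
    """Selector-based extraction: fast and cheap, but tied to exact markup."""
    soup = BeautifulSoup(html_content, "html.parser")
    products = []
    for card in soup.select(".product-card"):  # hypothetical class names
        name = card.select_one(".product-name")
        price = card.select_one(".product-price")
        if name and price:
            products.append({"name": name.get_text(strip=True),
                             "price": price.get_text(strip=True)})
    return products
```

If the site renames `.product-card`, this function silently returns an empty list, whereas a semantic prompt to the model often keeps working.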

Integration with WebScraping.AI

For production web scraping that combines the power of LLMs with traditional scraping infrastructure, consider using WebScraping.AI's API which handles:

  • Proxy rotation and residential proxies
  • JavaScript rendering and AJAX handling
  • CAPTCHA bypassing
  • Rate limiting and retry logic
  • LLM-powered data extraction

This allows you to focus on data extraction while the infrastructure handles the complexities of modern web scraping.

Common Pitfalls to Avoid

  1. Sending entire HTML documents: Always clean and extract relevant sections first to reduce costs
  2. Ignoring rate limits: Implement proper retry logic with exponential backoff
  3. Vague prompts: Be specific about the data structure you want
  4. No validation: Always validate the extracted data before using it
  5. Ignoring costs: Monitor token usage to avoid unexpected bills
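Validation can be as simple as checking that the parsed response has the expected shape before passing it downstream. A minimal sketch, matching the `{"products": [...]}` structure used earlier (adapt the required keys to your own prompt):

```python
def validate_products(data):
    """Return the product list if the structure looks right, else raise."""
    if not isinstance(data, dict) or "products" not in data:
        raise ValueError("Response is missing the 'products' key")
    products = data["products"]
    if not isinstance(products, list):
        raise ValueError("'products' should be a list")
    for item in products:
        if not isinstance(item, dict) or "name" not in item:
            raise ValueError(f"Malformed product entry: {item!r}")
    return products
```

Failing loudly here is usually better than letting a half-parsed response propagate into your data pipeline.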

Next Steps

Now that you understand the basics of using Deepseek for web scraping, you can:

  • Experiment with different model parameters (temperature, top_p)
  • Build scrapers for specific use cases (e-commerce, news, real estate)
  • Combine Deepseek with traditional scraping tools for optimal results
  • Explore batch processing for scraping multiple pages efficiently
  • Implement caching to reduce API calls for frequently accessed pages
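Caching, for example, can be as simple as keying extraction results by a hash of the cleaned HTML plus the prompt, so repeat runs over unchanged pages skip the API call entirely. A minimal in-memory sketch (a production version would persist the cache to disk or a database):

```python
import hashlib

_cache = {}

def cached_extract(html_content, prompt, extract_fn):
    """Call extract_fn only when this (html, prompt) pair hasn't been seen."""
    key = hashlib.sha256((prompt + html_content).encode("utf-8")).hexdigest()
    if key not in _cache:
        _cache[key] = extract_fn(html_content, prompt)
    return _cache[key]
```

Here `extract_fn` would be a function like `extract_data_with_deepseek` from earlier in this tutorial.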

Conclusion

The Deepseek API provides a powerful, AI-driven approach to web scraping that can handle complex extraction tasks with minimal code. By following this tutorial, you should now be able to set up a basic scraping pipeline, extract structured data, and implement best practices for production use.

Remember to always respect website terms of service, implement rate limiting, and use proxies when scraping at scale. For complex scraping needs requiring browser automation, consider learning about handling AJAX requests and working with dynamic content.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl -G "https://api.webscraping.ai/ai/question" --data-urlencode "url=https://example.com" --data-urlencode "question=What is the main topic?" --data-urlencode "api_key=YOUR_API_KEY"

Extract structured data:

curl -G "https://api.webscraping.ai/ai/fields" --data-urlencode "url=https://example.com" --data-urlencode "fields[title]=Page title" --data-urlencode "fields[price]=Product price" --data-urlencode "api_key=YOUR_API_KEY"
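The same fields request can be made from Python with requests; this sketch simply mirrors the curl example above, taking its parameter names as given:

```python
import requests

def build_fields_params(url, fields, api_key):
    """Build query parameters for the /ai/fields endpoint shown above."""
    params = {"url": url, "api_key": api_key}
    # The bracket syntax matches the fields[...] query parameters in the curl call
    for name, description in fields.items():
        params[f"fields[{name}]"] = description
    return params

def extract_fields(url, fields, api_key):
    """Call the endpoint; requests handles URL encoding of the parameters."""
    params = build_fields_params(url, fields, api_key)
    response = requests.get("https://api.webscraping.ai/ai/fields", params=params)
    response.raise_for_status()
    return response.json()
```

For example, `extract_fields("https://example.com", {"title": "Page title"}, "YOUR_API_KEY")` would return the extracted fields as a dict.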
