Where Can I Find Comprehensive API Documentation for Deepseek Web Scraping?

The Deepseek API documentation is available at platform.deepseek.com/api-docs, providing complete reference materials for integrating Deepseek's large language models into your web scraping workflows. This comprehensive guide will help you navigate the documentation and implement Deepseek effectively for data extraction tasks.

Official Deepseek API Documentation Resources

Primary Documentation Sources

  1. Official API Documentation: https://platform.deepseek.com/api-docs

    • Complete endpoint references
    • Authentication methods
    • Request/response schemas
    • Rate limits and pricing
  2. Deepseek Platform: https://platform.deepseek.com

    • API key management
    • Usage dashboard
    • Billing information
    • Model selection
  3. GitHub Repository: https://github.com/deepseek-ai

    • Code examples
    • SDK libraries
    • Community contributions
    • Issue tracking

Key API Endpoints for Web Scraping

The Deepseek API follows an OpenAI-compatible structure, making it easy to integrate if you're familiar with other LLM APIs.
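
Because of this compatibility, you can also point the official openai Python SDK at Deepseek instead of hand-rolling HTTP requests. A minimal sketch, assuming the openai package (v1+) is installed and DEEPSEEK_API_KEY is set in the environment:

import os
from openai import OpenAI

# Deepseek exposes an OpenAI-compatible endpoint, so the official SDK
# works once base_url points at Deepseek's API.
client = OpenAI(
    api_key=os.environ["DEEPSEEK_API_KEY"],
    base_url="https://api.deepseek.com"
)

response = client.chat.completions.create(
    model="deepseek-chat",
    messages=[{"role": "user", "content": "Say hello"}]
)
print(response.choices[0].message.content)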

Chat Completions Endpoint

The primary endpoint for data extraction is the chat completions API:

POST https://api.deepseek.com/v1/chat/completions

Python Implementation

Here's a complete example of using Deepseek for web scraping data extraction:

import requests
import json

# Your Deepseek API key
API_KEY = "your_deepseek_api_key"
API_URL = "https://api.deepseek.com/v1/chat/completions"

def extract_data_with_deepseek(html_content, extraction_schema):
    """
    Extract structured data from HTML using Deepseek API

    Args:
        html_content: Raw HTML string
        extraction_schema: JSON schema describing desired output

    Returns:
        Extracted data as JSON
    """
    headers = {
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json"
    }

    prompt = f"""Extract the following information from this HTML:

Schema:
{json.dumps(extraction_schema, indent=2)}

HTML:
{html_content}

Return only valid JSON matching the schema."""

    payload = {
        "model": "deepseek-chat",
        "messages": [
            {
                "role": "system",
                "content": "You are a data extraction expert. Extract structured data from HTML and return valid JSON."
            },
            {
                "role": "user",
                "content": prompt
            }
        ],
        "temperature": 0.1,
        "max_tokens": 4096,
        "response_format": {"type": "json_object"}
    }

    response = requests.post(API_URL, headers=headers, json=payload, timeout=60)
    response.raise_for_status()

    result = response.json()
    extracted_data = json.loads(result["choices"][0]["message"]["content"])

    return extracted_data

# Example usage
html = """
<div class="product">
    <h1>Wireless Headphones</h1>
    <span class="price">$99.99</span>
    <p class="description">Premium noise-canceling headphones</p>
    <div class="rating">4.5 stars</div>
</div>
"""

schema = {
    "product_name": "string",
    "price": "number",
    "description": "string",
    "rating": "number"
}

product_data = extract_data_with_deepseek(html, schema)
print(json.dumps(product_data, indent=2))

JavaScript/Node.js Implementation

const axios = require('axios');

const DEEPSEEK_API_KEY = 'your_deepseek_api_key';
const API_URL = 'https://api.deepseek.com/v1/chat/completions';

async function extractDataWithDeepseek(htmlContent, extractionSchema) {
    const prompt = `Extract the following information from this HTML:

Schema:
${JSON.stringify(extractionSchema, null, 2)}

HTML:
${htmlContent}

Return only valid JSON matching the schema.`;

    try {
        const response = await axios.post(
            API_URL,
            {
                model: 'deepseek-chat',
                messages: [
                    {
                        role: 'system',
                        content: 'You are a data extraction expert. Extract structured data from HTML and return valid JSON.'
                    },
                    {
                        role: 'user',
                        content: prompt
                    }
                ],
                temperature: 0.1,
                max_tokens: 4096,
                response_format: { type: 'json_object' }
            },
            {
                headers: {
                    'Authorization': `Bearer ${DEEPSEEK_API_KEY}`,
                    'Content-Type': 'application/json'
                }
            }
        );

        const extractedData = JSON.parse(
            response.data.choices[0].message.content
        );

        return extractedData;
    } catch (error) {
        console.error('Deepseek API error:', error.response?.data || error.message);
        throw error;
    }
}

// Example usage with Puppeteer for dynamic content
const puppeteer = require('puppeteer');

async function scrapeWithDeepseek(url) {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();

    await page.goto(url, { waitUntil: 'networkidle0' });
    const htmlContent = await page.content();

    await browser.close();

    const schema = {
        title: 'string',
        price: 'number',
        availability: 'string',
        reviews_count: 'number'
    };

    const data = await extractDataWithDeepseek(htmlContent, schema);
    return data;
}

Understanding API Parameters

Essential Request Parameters

| Parameter | Type | Description | Recommended for Scraping |
|-----------|------|-------------|--------------------------|
| model | string | Model identifier | deepseek-chat or deepseek-coder |
| messages | array | Conversation history | System + user message with HTML |
| temperature | float | Randomness (0-2) | 0.1-0.3 for consistent extraction |
| max_tokens | integer | Maximum response length | 2048-4096 for data extraction |
| response_format | object | Output format | {"type": "json_object"} for structured data |
| stream | boolean | Enable streaming | false for scraping |

Temperature Settings for Data Extraction

For web scraping tasks, use low temperature values to ensure consistent, deterministic output:

# Configuration for web scraping tasks
scraping_config = {
    "temperature": 0.1,  # Very low for consistency
    "max_tokens": 4096,
    "top_p": 0.95,
    "frequency_penalty": 0.0,
    "presence_penalty": 0.0
}
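
These settings drop straight into the request payload. A quick sketch, reusing scraping_config from above (messages stands in for whatever system/user prompt you built):

# Merge the scraping configuration into a chat completions payload
payload = {
    "model": "deepseek-chat",
    "messages": messages,  # system + user messages built elsewhere
    **scraping_config      # temperature, max_tokens, top_p, penalties
}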

Authentication and API Key Management

Obtaining Your API Key

  1. Sign up at platform.deepseek.com
  2. Navigate to API Keys section
  3. Create a new API key
  4. Store securely (never commit to version control)

Secure API Key Storage

Environment Variables (Recommended):

# .env file
DEEPSEEK_API_KEY=your_api_key_here

Python with python-dotenv:

from dotenv import load_dotenv
import os

load_dotenv()
api_key = os.getenv('DEEPSEEK_API_KEY')

Node.js with dotenv:

require('dotenv').config();
const apiKey = process.env.DEEPSEEK_API_KEY;

Rate Limits and Pricing

Understanding Rate Limits

Deepseek implements rate limiting to ensure fair usage:

  • Requests per minute (RPM): Varies by tier
  • Tokens per minute (TPM): Model-dependent
  • Concurrent requests: Check your dashboard

Handling Rate Limits

import time
import requests
from functools import wraps

def retry_with_backoff(max_retries=3, base_delay=1):
    """Decorator to handle rate limiting with exponential backoff"""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except requests.exceptions.HTTPError as e:
                    if e.response.status_code == 429:  # Rate limit
                        if attempt < max_retries - 1:
                            delay = base_delay * (2 ** attempt)
                            print(f"Rate limited. Retrying in {delay}s...")
                            time.sleep(delay)
                        else:
                            raise
                    else:
                        raise
            return None
        return wrapper
    return decorator

@retry_with_backoff(max_retries=5, base_delay=2)
def call_deepseek_api(payload):
    headers = {
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json"
    }
    response = requests.post(API_URL, headers=headers, json=payload, timeout=60)
    response.raise_for_status()
    return response.json()

Advanced Features for Web Scraping

Function Calling for Structured Extraction

Deepseek supports OpenAI-style function calling (the tools parameter), which is excellent for web scraping:

def extract_with_function_calling(html_content):
    """Use tool calling for guaranteed structured output"""
    tools = [
        {
            "type": "function",
            "function": {
                "name": "extract_product_data",
                "description": "Extract product information from HTML",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "product_name": {
                            "type": "string",
                            "description": "The product name or title"
                        },
                        "price": {
                            "type": "number",
                            "description": "Price in dollars"
                        },
                        "in_stock": {
                            "type": "boolean",
                            "description": "Whether product is in stock"
                        },
                        "categories": {
                            "type": "array",
                            "items": {"type": "string"},
                            "description": "Product categories"
                        }
                    },
                    "required": ["product_name", "price"]
                }
            }
        }
    ]

    headers = {
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json"
    }

    payload = {
        "model": "deepseek-chat",
        "messages": [
            {
                "role": "user",
                "content": f"Extract product data from this HTML:\n{html_content}"
            }
        ],
        "tools": tools,
        # Force the model to call our function (OpenAI-compatible tool_choice)
        "tool_choice": {"type": "function", "function": {"name": "extract_product_data"}}
    }

    response = requests.post(API_URL, headers=headers, json=payload, timeout=60)
    response.raise_for_status()
    result = response.json()

    # Parse the arguments of the tool call returned by the model
    tool_call = result["choices"][0]["message"]["tool_calls"][0]
    function_args = json.loads(tool_call["function"]["arguments"])

    return function_args
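
Using the product HTML from the earlier example, a call is a one-liner:

product = extract_with_function_calling(html)  # html snippet defined earlier
print(product["product_name"], product["price"])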

Batch Processing for Large-Scale Scraping

When scraping multiple pages, implement efficient batch processing:

import asyncio
import json

import aiohttp

async def async_extract_data(session, html_content, schema):
    """Async function for parallel API calls"""
    async with session.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "model": "deepseek-chat",
            "messages": [
                {"role": "system", "content": "Extract data as JSON"},
                {"role": "user", "content": f"Schema: {schema}\nHTML: {html_content}"}
            ],
            "temperature": 0.1,
            "response_format": {"type": "json_object"}
        }
    ) as response:
        result = await response.json()
        return json.loads(result["choices"][0]["message"]["content"])

async def batch_extract(html_pages, schema):
    """Process multiple pages concurrently"""
    async with aiohttp.ClientSession() as session:
        tasks = [
            async_extract_data(session, html, schema)
            for html in html_pages
        ]
        results = await asyncio.gather(*tasks, return_exceptions=True)
        return results

# Usage
html_pages = ["<html>...</html>", "<html>...</html>", ...]
schema = {"title": "string", "price": "number"}
results = asyncio.run(batch_extract(html_pages, schema))
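
Note that asyncio.gather with no cap fires every request at once, which is an easy way to hit 429s on large jobs. A common fix is to bound concurrency with a semaphore. A sketch, where MAX_CONCURRENCY is a hypothetical knob you would tune to your tier's rate limits:

# Bound concurrent API calls so large batches stay under rate limits
MAX_CONCURRENCY = 5  # hypothetical value; tune to your tier's RPM

async def bounded_extract(semaphore, session, html, schema):
    async with semaphore:  # at most MAX_CONCURRENCY calls in flight
        return await async_extract_data(session, html, schema)

async def batch_extract_bounded(html_pages, schema):
    semaphore = asyncio.Semaphore(MAX_CONCURRENCY)
    async with aiohttp.ClientSession() as session:
        tasks = [
            bounded_extract(semaphore, session, html, schema)
            for html in html_pages
        ]
        return await asyncio.gather(*tasks, return_exceptions=True)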

Integration with Web Scraping Tools

Combining with BeautifulSoup

from bs4 import BeautifulSoup
import requests

def scrape_and_extract(url):
    """Fetch HTML and extract data with Deepseek"""
    # Fetch HTML
    response = requests.get(url, timeout=30)
    soup = BeautifulSoup(response.content, 'html.parser')

    # Extract relevant section (reduces token usage)
    main_content = soup.find('main') or soup.find('body')
    clean_html = str(main_content)

    # Extract with Deepseek
    schema = {
        "headline": "string",
        "author": "string",
        "publish_date": "string",
        "article_text": "string"
    }

    data = extract_data_with_deepseek(clean_html, schema)
    return data
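
The token savings can go further: script, style, and inline SVG tags rarely contain extractable data but inflate the prompt. A minimal pre-cleaning helper (a sketch; run main_content through it before calling the extractor):

import re

def clean_html_for_llm(soup):
    """Strip elements that inflate token count without adding data."""
    for tag in soup.find_all(['script', 'style', 'svg', 'noscript']):
        tag.decompose()  # remove the tag and its contents in place
    # Collapse whitespace runs left behind by the removed nodes
    return re.sub(r'\s+', ' ', str(soup))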

Using with Selenium for Dynamic Content

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def scrape_dynamic_page(url):
    """Scrape JavaScript-rendered content"""
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')
    driver = webdriver.Chrome(options=options)

    try:
        driver.get(url)
        # Wait for dynamic content to load
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.ID, 'content'))
        )

        html_content = driver.page_source

        schema = {
            "items": [
                {
                    "name": "string",
                    "price": "number",
                    "rating": "number"
                }
            ]
        }

        return extract_data_with_deepseek(html_content, schema)
    finally:
        driver.quit()

When working with browser automation tools for dynamic content, you might find it helpful to understand how to handle AJAX requests using Puppeteer or how to handle timeouts in Puppeteer for more robust scraping implementations.

Error Handling and Debugging

Common API Errors

def handle_deepseek_errors(url, headers, payload):
    """Comprehensive error handling for a Deepseek API call"""
    try:
        response = requests.post(url, headers=headers, json=payload, timeout=60)
        response.raise_for_status()
        return response.json()
    except requests.exceptions.HTTPError as e:
        status_code = e.response.status_code

        if status_code == 401:
            raise Exception("Invalid API key. Check your credentials.")
        elif status_code == 429:
            raise Exception("Rate limit exceeded. Implement backoff strategy.")
        elif status_code == 500:
            raise Exception("Deepseek server error. Retry after delay.")
        elif status_code == 400:
            error_detail = e.response.json()
            raise Exception(f"Bad request: {error_detail.get('error', {}).get('message')}")
        else:
            raise Exception(f"API error {status_code}: {e.response.text}")
    except requests.exceptions.Timeout:
        raise Exception("Request timeout. Increase timeout or retry.")
    except requests.exceptions.ConnectionError:
        raise Exception("Connection error. Check network connectivity.")

Logging API Usage

import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def extract_with_logging(html_content, schema):
    """Extract data with comprehensive logging"""
    logger.info(f"Starting extraction. HTML length: {len(html_content)}")

    start_time = time.time()

    try:
        result = extract_data_with_deepseek(html_content, schema)

        duration = time.time() - start_time
        logger.info(f"Extraction successful. Duration: {duration:.2f}s")

        return result
    except Exception as e:
        logger.error(f"Extraction failed: {str(e)}")
        raise
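
Duration is only half the picture for cost control: the chat completions response also includes a usage object (prompt_tokens, completion_tokens, total_tokens in the OpenAI-compatible schema). A small helper to log it from the raw response dict:

def log_token_usage(result):
    """Log the usage block from a raw chat completions response dict."""
    usage = result.get("usage", {})
    logger.info(
        "Token usage: prompt=%s completion=%s total=%s",
        usage.get("prompt_tokens"),
        usage.get("completion_tokens"),
        usage.get("total_tokens")
    )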

Additional Resources and Community Support

Official Resources

  • API Documentation: https://platform.deepseek.com/api-docs
  • Developer Platform: https://platform.deepseek.com
  • GitHub Organization: https://github.com/deepseek-ai

Community and Support

  • GitHub Issues: Report bugs and request features
  • Discord/Slack: Join community channels
  • Stack Overflow: Tag questions with deepseek-api
  • Email Support: For account and billing issues

Best Practices Documentation

When building production web scraping systems with Deepseek:

  1. Implement robust error handling with retries and exponential backoff
  2. Monitor API usage to avoid unexpected costs
  3. Cache results when appropriate to reduce API calls (a minimal caching sketch follows this list)
  4. Use precise prompts with clear schemas for better extraction accuracy
  5. Validate extracted data before storage or further processing
  6. Respect rate limits and implement queuing for large-scale scraping
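
For point 3, a minimal in-memory cache keyed on a hash of the HTML and schema is often enough; treat this as a sketch (no eviction, no persistence) rather than a production cache:

import hashlib
import json

_extraction_cache = {}

def cached_extract(html_content, schema):
    """Memoize extraction results keyed on the exact HTML + schema pair."""
    key = hashlib.sha256(
        (html_content + json.dumps(schema, sort_keys=True)).encode()
    ).hexdigest()
    if key not in _extraction_cache:
        _extraction_cache[key] = extract_data_with_deepseek(html_content, schema)
    return _extraction_cache[key]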

Conclusion

The Deepseek API documentation provides all the necessary information to integrate powerful LLM-based data extraction into your web scraping workflows. By combining Deepseek's natural language understanding with traditional scraping tools, you can build robust systems that handle complex, unstructured web data with ease. Start with the official documentation at platform.deepseek.com/api-docs, experiment with the examples provided above, and iterate based on your specific scraping requirements.

For dynamic content extraction scenarios, consider exploring tools like Puppeteer for crawling single page applications to complement your Deepseek-powered extraction pipeline.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
