Where Can I Find Comprehensive API Documentation for Deepseek Web Scraping?
The Deepseek API documentation is available at api-docs.deepseek.com, providing complete reference material for integrating Deepseek's large language models into your web scraping workflows. This guide will help you navigate the documentation and implement Deepseek effectively for data extraction tasks.
Official Deepseek API Documentation Resources
Primary Documentation Sources
Official API Documentation: https://api-docs.deepseek.com
- Complete endpoint references
- Authentication methods
- Request/response schemas
- Rate limits and pricing
Deepseek Platform: https://platform.deepseek.com
- API key management
- Usage dashboard
- Billing information
- Model selection
GitHub Repository: https://github.com/deepseek-ai
- Code examples
- SDK libraries
- Community contributions
- Issue tracking
Key API Endpoints for Web Scraping
The Deepseek API follows an OpenAI-compatible structure, making it easy to integrate if you're familiar with other LLM APIs.
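Because of this compatibility, you can also call Deepseek through the official OpenAI Python SDK by overriding the base URL. A minimal sketch, assuming the `openai` package is installed and your key is in the `DEEPSEEK_API_KEY` environment variable:

```python
# Minimal sketch: reuse the OpenAI SDK against Deepseek's
# OpenAI-compatible endpoint (assumes `pip install openai` and a
# DEEPSEEK_API_KEY environment variable).
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.getenv("DEEPSEEK_API_KEY"),
    base_url="https://api.deepseek.com",
)

response = client.chat.completions.create(
    model="deepseek-chat",
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)
```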
Chat Completions Endpoint
The primary endpoint for data extraction is the chat completions API:
POST https://api.deepseek.com/v1/chat/completions
Python Implementation
Here's a complete example of using Deepseek for web scraping data extraction:
```python
import requests
import json

# Your Deepseek API key
API_KEY = "your_deepseek_api_key"
API_URL = "https://api.deepseek.com/v1/chat/completions"

def extract_data_with_deepseek(html_content, extraction_schema):
    """
    Extract structured data from HTML using the Deepseek API.

    Args:
        html_content: Raw HTML string
        extraction_schema: JSON schema describing the desired output

    Returns:
        Extracted data as a dict parsed from the model's JSON response
    """
    headers = {
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json"
    }

    prompt = f"""Extract the following information from this HTML:

Schema:
{json.dumps(extraction_schema, indent=2)}

HTML:
{html_content}

Return only valid JSON matching the schema."""

    payload = {
        "model": "deepseek-chat",
        "messages": [
            {
                "role": "system",
                "content": "You are a data extraction expert. Extract structured data from HTML and return valid JSON."
            },
            {
                "role": "user",
                "content": prompt
            }
        ],
        "temperature": 0.1,
        "max_tokens": 4096,
        "response_format": {"type": "json_object"}
    }

    response = requests.post(API_URL, headers=headers, json=payload)
    response.raise_for_status()

    result = response.json()
    extracted_data = json.loads(result["choices"][0]["message"]["content"])
    return extracted_data

# Example usage
html = """
<div class="product">
    <h1>Wireless Headphones</h1>
    <span class="price">$99.99</span>
    <p class="description">Premium noise-canceling headphones</p>
    <div class="rating">4.5 stars</div>
</div>
"""

schema = {
    "product_name": "string",
    "price": "number",
    "description": "string",
    "rating": "number"
}

product_data = extract_data_with_deepseek(html, schema)
print(json.dumps(product_data, indent=2))
```
JavaScript/Node.js Implementation
```javascript
const axios = require('axios');

const DEEPSEEK_API_KEY = 'your_deepseek_api_key';
const API_URL = 'https://api.deepseek.com/v1/chat/completions';

async function extractDataWithDeepseek(htmlContent, extractionSchema) {
  const prompt = `Extract the following information from this HTML:

Schema:
${JSON.stringify(extractionSchema, null, 2)}

HTML:
${htmlContent}

Return only valid JSON matching the schema.`;

  try {
    const response = await axios.post(
      API_URL,
      {
        model: 'deepseek-chat',
        messages: [
          {
            role: 'system',
            content: 'You are a data extraction expert. Extract structured data from HTML and return valid JSON.'
          },
          {
            role: 'user',
            content: prompt
          }
        ],
        temperature: 0.1,
        max_tokens: 4096,
        response_format: { type: 'json_object' }
      },
      {
        headers: {
          'Authorization': `Bearer ${DEEPSEEK_API_KEY}`,
          'Content-Type': 'application/json'
        }
      }
    );

    const extractedData = JSON.parse(
      response.data.choices[0].message.content
    );
    return extractedData;
  } catch (error) {
    console.error('Deepseek API error:', error.response?.data || error.message);
    throw error;
  }
}

// Example usage with Puppeteer for dynamic content
const puppeteer = require('puppeteer');

async function scrapeWithDeepseek(url) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'networkidle0' });

  const htmlContent = await page.content();
  await browser.close();

  const schema = {
    title: 'string',
    price: 'number',
    availability: 'string',
    reviews_count: 'number'
  };

  const data = await extractDataWithDeepseek(htmlContent, schema);
  return data;
}
```
Understanding API Parameters
Essential Request Parameters
| Parameter | Type | Description | Recommended for Scraping |
|-----------|------|-------------|--------------------------|
| `model` | string | Model identifier | `deepseek-chat` or `deepseek-coder` |
| `messages` | array | Conversation history | System + user message with HTML |
| `temperature` | float | Randomness (0-2) | 0.1-0.3 for consistent extraction |
| `max_tokens` | integer | Maximum response length | 2048-4096 for data extraction |
| `response_format` | object | Output format | `{"type": "json_object"}` for structured data |
| `stream` | boolean | Enable streaming | `false` for scraping |
Temperature Settings for Data Extraction
For web scraping tasks, use low temperature values to ensure consistent, deterministic output:
```python
# Configuration for web scraping tasks
scraping_config = {
    "temperature": 0.1,       # Very low for consistency
    "max_tokens": 4096,
    "top_p": 0.95,
    "frequency_penalty": 0.0,
    "presence_penalty": 0.0
}
```
Authentication and API Key Management
Obtaining Your API Key
- Sign up at platform.deepseek.com
- Navigate to API Keys section
- Create a new API key
- Store securely (never commit to version control)
Secure API Key Storage
Environment Variables (Recommended):
```bash
# .env file
DEEPSEEK_API_KEY=your_api_key_here
```
Python with python-dotenv:
```python
from dotenv import load_dotenv
import os

load_dotenv()
api_key = os.getenv('DEEPSEEK_API_KEY')
```
Node.js with dotenv:
```javascript
require('dotenv').config();
const apiKey = process.env.DEEPSEEK_API_KEY;
```
Rate Limits and Pricing
Understanding Rate Limits
Deepseek implements rate limiting to ensure fair usage:
- Requests per minute (RPM): Varies by tier
- Tokens per minute (TPM): Model-dependent
- Concurrent requests: Check your dashboard
Handling Rate Limits
```python
import time
import requests
from functools import wraps

def retry_with_backoff(max_retries=3, base_delay=1):
    """Decorator to handle rate limiting with exponential backoff"""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except requests.exceptions.HTTPError as e:
                    if e.response.status_code == 429:  # Rate limited
                        if attempt < max_retries - 1:
                            delay = base_delay * (2 ** attempt)
                            print(f"Rate limited. Retrying in {delay}s...")
                            time.sleep(delay)
                        else:
                            raise
                    else:
                        raise
            return None
        return wrapper
    return decorator

@retry_with_backoff(max_retries=5, base_delay=2)
def call_deepseek_api(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    response.raise_for_status()
    return response.json()
```
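As an optional refinement (an assumption rather than documented Deepseek behavior): some APIs include a `Retry-After` header on 429 responses, and honoring it usually beats a blind exponential delay. A sketch of a delay helper you could drop into the decorator above:

```python
def backoff_delay(response, attempt, base_delay=1):
    """Prefer the server's Retry-After hint (if any) over computed backoff."""
    retry_after = response.headers.get("Retry-After")
    if retry_after and retry_after.isdigit():
        return int(retry_after)
    return base_delay * (2 ** attempt)
```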
Advanced Features for Web Scraping
Function Calling for Structured Extraction
Deepseek supports OpenAI-style function calling via the `tools` parameter, which is excellent for web scraping:
```python
def extract_with_function_calling(html_content):
    """Use function calling (the OpenAI-style `tools` parameter) for structured output"""
    tools = [
        {
            "type": "function",
            "function": {
                "name": "extract_product_data",
                "description": "Extract product information from HTML",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "product_name": {
                            "type": "string",
                            "description": "The product name or title"
                        },
                        "price": {
                            "type": "number",
                            "description": "Price in dollars"
                        },
                        "in_stock": {
                            "type": "boolean",
                            "description": "Whether product is in stock"
                        },
                        "categories": {
                            "type": "array",
                            "items": {"type": "string"},
                            "description": "Product categories"
                        }
                    },
                    "required": ["product_name", "price"]
                }
            }
        }
    ]

    payload = {
        "model": "deepseek-chat",
        "messages": [
            {
                "role": "user",
                "content": f"Extract product data from this HTML:\n{html_content}"
            }
        ],
        "tools": tools,
        "tool_choice": {"type": "function", "function": {"name": "extract_product_data"}}
    }

    response = requests.post(API_URL, headers=headers, json=payload)
    response.raise_for_status()
    result = response.json()

    # Parse the arguments of the tool call returned by the model
    function_args = json.loads(
        result["choices"][0]["message"]["tool_calls"][0]["function"]["arguments"]
    )
    return function_args
```
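Calling the helper looks the same as before; with the tool choice forced, the parsed arguments should match the declared parameter schema:

```python
# Hypothetical usage with the sample product HTML from earlier
product = extract_with_function_calling(html)
print(product["product_name"], product["price"])
```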
Batch Processing for Large-Scale Scraping
When scraping multiple pages, implement efficient batch processing:
```python
import asyncio
import json
import aiohttp

async def async_extract_data(session, html_content, schema):
    """Async function for parallel API calls"""
    async with session.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "model": "deepseek-chat",
            "messages": [
                {"role": "system", "content": "Extract data as JSON"},
                {"role": "user", "content": f"Schema: {schema}\nHTML: {html_content}"}
            ],
            "temperature": 0.1,
            "response_format": {"type": "json_object"}
        }
    ) as response:
        result = await response.json()
        return json.loads(result["choices"][0]["message"]["content"])

async def batch_extract(html_pages, schema):
    """Process multiple pages concurrently"""
    async with aiohttp.ClientSession() as session:
        tasks = [
            async_extract_data(session, html, schema)
            for html in html_pages
        ]
        results = await asyncio.gather(*tasks, return_exceptions=True)
        return results

# Usage
html_pages = ["<html>...</html>", "<html>...</html>"]  # your fetched pages
schema = {"title": "string", "price": "number"}
results = asyncio.run(batch_extract(html_pages, schema))
```
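An unbounded `asyncio.gather` can blow straight past your RPM limit on large batches. A minimal sketch that caps concurrency with a semaphore (the limit of 5 is an illustrative assumption; tune it to your account's tier):

```python
async def bounded_extract(semaphore, session, html, schema):
    # Only `max_concurrency` extractions run at any one time
    async with semaphore:
        return await async_extract_data(session, html, schema)

async def batch_extract_bounded(html_pages, schema, max_concurrency=5):
    semaphore = asyncio.Semaphore(max_concurrency)
    async with aiohttp.ClientSession() as session:
        tasks = [
            bounded_extract(semaphore, session, html, schema)
            for html in html_pages
        ]
        return await asyncio.gather(*tasks, return_exceptions=True)
```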
Integration with Web Scraping Tools
Combining with BeautifulSoup
```python
from bs4 import BeautifulSoup
import requests

def scrape_and_extract(url):
    """Fetch HTML and extract data with Deepseek"""
    # Fetch HTML
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    # Extract relevant section (reduces token usage)
    main_content = soup.find('main') or soup.find('body')
    clean_html = str(main_content)

    # Extract with Deepseek
    schema = {
        "headline": "string",
        "author": "string",
        "publish_date": "string",
        "article_text": "string"
    }

    data = extract_data_with_deepseek(clean_html, schema)
    return data
```
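Sending whole pages wastes tokens on markup the model ignores. A small, hypothetical `trim_html` helper that strips non-content tags before the API call can cut costs noticeably:

```python
def trim_html(soup):
    """Remove tags that carry no extractable data before sending to the API."""
    for tag in soup(["script", "style", "svg", "noscript", "iframe"]):
        tag.decompose()
    return str(soup)
```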
Using with Selenium for Dynamic Content
```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def scrape_dynamic_page(url):
    """Scrape JavaScript-rendered content"""
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')
    driver = webdriver.Chrome(options=options)

    try:
        driver.get(url)

        # Wait for dynamic content to load
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.ID, 'content'))
        )

        html_content = driver.page_source

        schema = {
            "items": [
                {
                    "name": "string",
                    "price": "number",
                    "rating": "number"
                }
            ]
        }

        return extract_data_with_deepseek(html_content, schema)
    finally:
        driver.quit()
```
When working with browser automation tools for dynamic content, you might find it helpful to understand how to handle AJAX requests using Puppeteer or how to handle timeouts in Puppeteer for more robust scraping implementations.
Error Handling and Debugging
Common API Errors
```python
import requests

def handle_deepseek_errors(response):
    """Comprehensive error handling for Deepseek API responses"""
    try:
        response.raise_for_status()
        return response.json()
    except requests.exceptions.HTTPError as e:
        status_code = e.response.status_code
        if status_code == 401:
            raise Exception("Invalid API key. Check your credentials.")
        elif status_code == 429:
            raise Exception("Rate limit exceeded. Implement a backoff strategy.")
        elif status_code == 500:
            raise Exception("Deepseek server error. Retry after a delay.")
        elif status_code == 400:
            error_detail = e.response.json()
            raise Exception(f"Bad request: {error_detail.get('error', {}).get('message')}")
        else:
            raise Exception(f"API error {status_code}: {e.response.text}")
    except requests.exceptions.Timeout:
        raise Exception("Request timeout. Increase the timeout or retry.")
    except requests.exceptions.ConnectionError:
        raise Exception("Connection error. Check network connectivity.")
```
Logging API Usage
```python
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def extract_with_logging(html_content, schema):
    """Extract data with comprehensive logging"""
    logger.info(f"Starting extraction. HTML length: {len(html_content)}")
    start_time = time.time()

    try:
        result = extract_data_with_deepseek(html_content, schema)
        duration = time.time() - start_time
        logger.info(f"Extraction successful. Duration: {duration:.2f}s")
        return result
    except Exception as e:
        logger.error(f"Extraction failed: {str(e)}")
        raise
```
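If you keep the raw API response around, its `usage` object (part of the OpenAI-compatible response format) lets you log token spend per extraction:

```python
def log_token_usage(api_response):
    """Log the token counts reported in the response's `usage` field."""
    usage = api_response.get("usage", {})
    logger.info(
        "Tokens used - prompt: %s, completion: %s, total: %s",
        usage.get("prompt_tokens"),
        usage.get("completion_tokens"),
        usage.get("total_tokens"),
    )
```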
Additional Resources and Community Support
Official Resources
- Documentation: https://api-docs.deepseek.com
- Status Page: Check for API status and incidents
- Blog: https://www.deepseek.com/blog
- Pricing Calculator: Available on the platform
Community and Support
- GitHub Issues: Report bugs and request features
- Discord/Slack: Join community channels
- Stack Overflow: Tag questions with `deepseek-api`
- Email Support: For account and billing issues
Best Practices Documentation
When building production web scraping systems with Deepseek:
- Implement robust error handling with retries and exponential backoff
- Monitor API usage to avoid unexpected costs
- Cache results when appropriate to reduce API calls (see the sketch after this list)
- Use precise prompts with clear schemas for better extraction accuracy
- Validate extracted data before storage or further processing
- Respect rate limits and implement queuing for large-scale scraping
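For the caching point above, a minimal sketch: key the cache on a hash of the HTML plus the schema so an identical page never triggers a second paid call (an in-memory dict for illustration; swap in Redis or a database for production):

```python
import hashlib
import json

_cache = {}

def cached_extract(html_content, schema):
    """Return a cached extraction if this exact input was already processed."""
    key = hashlib.sha256(
        (html_content + json.dumps(schema, sort_keys=True)).encode()
    ).hexdigest()
    if key not in _cache:
        _cache[key] = extract_data_with_deepseek(html_content, schema)
    return _cache[key]
```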
Conclusion
The Deepseek API documentation provides all the information needed to integrate LLM-based data extraction into your web scraping workflows. By combining Deepseek's natural language understanding with traditional scraping tools, you can build robust systems that handle complex, unstructured web data. Start with the official documentation at api-docs.deepseek.com, experiment with the examples above, and iterate based on your specific scraping requirements.
For dynamic content extraction scenarios, consider exploring tools like Puppeteer for crawling single page applications to complement your Deepseek-powered extraction pipeline.