How Much Does It Cost to Use Deepseek for Web Scraping?
The cost of using Deepseek for web scraping is significantly lower than that of most competing AI models, making it an attractive option for large-scale data extraction projects. Deepseek offers competitive pricing while delivering extraction quality comparable to GPT-4o and Claude, with rates ranging from $0.14 to $2.19 per million tokens depending on the model and whether prompt caching applies.
Understanding Deepseek's pricing structure is essential for budgeting your web scraping projects effectively, especially when processing thousands or millions of pages.
Deepseek API Pricing Structure
Deepseek charges based on tokens—text units that roughly correspond to 4 characters or 0.75 words in English. Like other LLM APIs, both input tokens (your HTML content and prompts) and output tokens (the extracted data) are counted and billed separately.
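If you want a quick budget estimate before sending anything to the API, the 4-characters-per-token heuristic is usually close enough. Here is a minimal sketch (an approximation only; the exact count depends on Deepseek's tokenizer and the language of the text):

```python
def estimate_tokens(text: str) -> int:
    # Rough estimate using the ~4 characters/token rule of thumb;
    # for budgeting only, not an exact tokenizer count
    return max(1, len(text) // 4)

page_html = "<html><body>...</body></html>"  # hypothetical page content
print(f"~{estimate_tokens(page_html)} input tokens")
```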
Current Pricing by Model (as of 2025)
| Model | Input Tokens (per 1M) | Output Tokens (per 1M) | Cache Hits (per 1M) |
|-------|----------------------|------------------------|---------------------|
| Deepseek-V3 | $0.27 | $1.10 | $0.014 |
| Deepseek-R1 | $0.55 | $2.19 | $0.014 |
| Deepseek-Chat | $0.14 | $0.28 | $0.07 |
Key advantages:

- 70-90% cheaper than GPT-4o and Claude 3.5 Sonnet
- Cache hits cost up to 95% less than regular input tokens
- No minimum commitment or subscription required
- Pay-as-you-go pricing model
For web scraping tasks, Deepseek-Chat offers the best cost-to-performance ratio for structured data extraction, while Deepseek-V3 provides superior accuracy for complex extraction scenarios.
Cost Calculation for Web Scraping
The total cost depends on several factors:
- HTML size: Larger pages consume more input tokens
- Extraction complexity: Complex prompts increase token usage
- Response format: JSON outputs are typically more concise
- Model selection: Different models have different pricing tiers
- Cache efficiency: Repeated prompts benefit from caching
Example Cost Calculation
Let's calculate the cost to scrape 10,000 product pages using Deepseek-Chat:
Assumptions:

- Average HTML page size: 50 KB (compressed to ~12,500 tokens)
- System prompt size: ~300 tokens (cached after first use)
- User prompt size: ~200 tokens
- Output JSON: ~250 tokens
Cost per page (first request):

- Input: 13,000 tokens × $0.14 / 1,000,000 = $0.00182
- Output: 250 tokens × $0.28 / 1,000,000 = $0.00007
- Total: $0.00189
Cost per page (with cached prompt):

- New input: 12,700 tokens × $0.14 / 1,000,000 = $0.00178
- Cached: 300 tokens × $0.07 / 1,000,000 = $0.000021
- Output: 250 tokens × $0.28 / 1,000,000 = $0.00007
- Total: $0.00187
Cost for 10,000 pages: ~$18.70
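The same arithmetic, expressed as a small helper using the Deepseek-Chat rates from the pricing table above and the per-page token counts assumed in this example:

```python
# Deepseek-Chat rates in USD per 1M tokens (from the pricing table above)
INPUT_RATE, OUTPUT_RATE, CACHE_RATE = 0.14, 0.28, 0.07

def page_cost(input_tokens, output_tokens, cached_tokens=0):
    # Cached tokens are billed at the discounted cache-hit rate;
    # the rest of the input is billed at the normal input rate
    billable_input = input_tokens - cached_tokens
    return (billable_input * INPUT_RATE
            + cached_tokens * CACHE_RATE
            + output_tokens * OUTPUT_RATE) / 1_000_000

print(page_cost(13_000, 250))                     # first request: ~$0.00189
print(page_cost(13_000, 250, cached_tokens=300))  # cached prompt: ~$0.00187
print(10_000 * page_cost(13_000, 250, cached_tokens=300))  # ~$18.70
```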
Comparison with Competing Models
For the same 10,000-page scraping project:
| Model | Cost per Page | Total Cost (10K pages) | Relative Cost |
|-------|--------------|------------------------|---------------|
| Deepseek-Chat | $0.00187 | $18.70 | 1x (baseline) |
| Deepseek-V3 | $0.00213 | $21.30 | 1.14x |
| GPT-4o-mini | $0.00207 | $20.70 | 1.11x |
| GPT-4o | $0.00525 | $52.50 | 2.81x |
| Claude 3.5 Sonnet | $0.00415 | $41.50 | 2.22x |
Deepseek-Chat delivers roughly 55-65% cost savings compared to GPT-4o and Claude 3.5 Sonnet while maintaining competitive accuracy.
Practical Python Implementation with Cost Tracking
Here's a production-ready implementation with comprehensive cost tracking:
```python
import json

import requests
from bs4 import BeautifulSoup, Comment


class DeepseekScraper:
    def __init__(self, api_key, model="deepseek-chat"):
        self.api_key = api_key
        self.model = model
        self.api_url = "https://api.deepseek.com/v1/chat/completions"

        # Track token usage across all requests
        self.total_input_tokens = 0
        self.total_output_tokens = 0
        self.total_cached_tokens = 0
        self.requests_made = 0

        # Pricing per 1M tokens (USD)
        self.pricing = {
            "deepseek-chat": {"input": 0.14, "output": 0.28, "cache": 0.07},
            "deepseek-v3": {"input": 0.27, "output": 1.10, "cache": 0.014},
            "deepseek-r1": {"input": 0.55, "output": 2.19, "cache": 0.014},
        }

        # System prompt (cached by the API after first use)
        self.system_prompt = """You are a precise data extraction assistant.
Extract structured information from HTML content and return ONLY valid JSON.
Never include explanations, markdown formatting, or extra text."""

    def clean_html(self, html):
        """Remove unnecessary elements to reduce token count."""
        soup = BeautifulSoup(html, 'html.parser')

        # Remove scripts, styles, and other non-content elements
        for tag in soup(['script', 'style', 'nav', 'footer', 'header', 'iframe', 'noscript']):
            tag.decompose()

        # Remove HTML comments
        for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
            comment.extract()

        # Narrow down to the main content if it is identifiable
        main_content = (
            soup.find('main') or
            soup.find('article') or
            soup.find(class_=['content', 'main-content', 'product']) or
            soup.body
        )
        return str(main_content) if main_content else str(soup)

    def extract_data(self, html, schema, url=None):
        """Extract structured data from HTML using Deepseek."""
        # Clean HTML to reduce tokens
        cleaned_html = self.clean_html(html)

        # Truncate if still too large (optional)
        max_chars = 40000  # ~10k tokens
        if len(cleaned_html) > max_chars:
            cleaned_html = cleaned_html[:max_chars] + "..."

        # Build the extraction prompt
        user_prompt = f"""Extract the following information from this HTML:

{schema}

HTML Content:
{cleaned_html}

Return ONLY valid JSON matching the schema."""

        # Make the API request
        headers = {
            "Authorization": f"Bearer {self.api_key}",
            "Content-Type": "application/json"
        }
        payload = {
            "model": self.model,
            "messages": [
                {"role": "system", "content": self.system_prompt},
                {"role": "user", "content": user_prompt}
            ],
            "temperature": 0.1,  # Low temperature for consistent extraction
            "max_tokens": 4000,
            "response_format": {"type": "json_object"}
        }

        try:
            response = requests.post(
                self.api_url,
                headers=headers,
                json=payload,
                timeout=30
            )
            response.raise_for_status()
            result = response.json()

            # Track token usage
            usage = result.get('usage', {})
            self.total_input_tokens += usage.get('prompt_tokens', 0)
            self.total_output_tokens += usage.get('completion_tokens', 0)
            # Track cached tokens when the API reports them
            self.total_cached_tokens += usage.get('prompt_cache_hit_tokens', 0)
            self.requests_made += 1

            # Parse and return the extracted data
            extracted = json.loads(result['choices'][0]['message']['content'])
            return {
                "success": True,
                "data": extracted,
                "url": url,
                "tokens_used": {
                    "input": usage.get('prompt_tokens', 0),
                    "output": usage.get('completion_tokens', 0),
                    "cached": usage.get('prompt_cache_hit_tokens', 0)
                }
            }
        except requests.exceptions.RequestException as e:
            return {"success": False, "error": str(e), "url": url}
        except json.JSONDecodeError as e:
            return {"success": False, "error": f"Invalid JSON response: {e}", "url": url}

    def get_cost_summary(self):
        """Calculate and return a detailed cost breakdown."""
        pricing = self.pricing[self.model]

        # Assuming prompt_tokens includes cache hits, bill cache misses at
        # the full input rate and cache hits at the discounted cache rate
        missed_tokens = self.total_input_tokens - self.total_cached_tokens
        input_cost = (missed_tokens / 1_000_000) * pricing["input"]
        output_cost = (self.total_output_tokens / 1_000_000) * pricing["output"]
        cache_cost = (self.total_cached_tokens / 1_000_000) * pricing["cache"]

        # What the cached tokens would have cost at the full input rate
        cache_savings = (self.total_cached_tokens / 1_000_000) * (pricing["input"] - pricing["cache"])
        total_cost = input_cost + output_cost + cache_cost

        return {
            "model": self.model,
            "total_cost": total_cost,
            "cost_breakdown": {
                "input_tokens_cost": input_cost,
                "output_tokens_cost": output_cost,
                "cached_tokens_cost": cache_cost,
                "cache_savings": cache_savings
            },
            "token_usage": {
                "input_tokens": self.total_input_tokens,
                "output_tokens": self.total_output_tokens,
                "cached_tokens": self.total_cached_tokens,
                "total_tokens": self.total_input_tokens + self.total_output_tokens
            },
            "requests": self.requests_made,
            "average_cost_per_request": total_cost / self.requests_made if self.requests_made > 0 else 0
        }

    def print_cost_summary(self):
        """Print a formatted cost summary."""
        summary = self.get_cost_summary()
        print(f"\n{'=' * 50}")
        print("DEEPSEEK SCRAPING COST SUMMARY")
        print('=' * 50)
        print(f"Model: {summary['model']}")
        print(f"Total Requests: {summary['requests']}")
        print("\nToken Usage:")
        print(f"  Input Tokens: {summary['token_usage']['input_tokens']:,}")
        print(f"  Output Tokens: {summary['token_usage']['output_tokens']:,}")
        print(f"  Cached Tokens: {summary['token_usage']['cached_tokens']:,}")
        print(f"  Total Tokens: {summary['token_usage']['total_tokens']:,}")
        print("\nCosts:")
        print(f"  Input Cost: ${summary['cost_breakdown']['input_tokens_cost']:.4f}")
        print(f"  Output Cost: ${summary['cost_breakdown']['output_tokens_cost']:.4f}")
        print(f"  Cache Cost: ${summary['cost_breakdown']['cached_tokens_cost']:.4f}")
        print(f"  Cache Savings: ${summary['cost_breakdown']['cache_savings']:.4f}")
        print(f"  Total Cost: ${summary['total_cost']:.4f}")
        print(f"  Average per Request: ${summary['average_cost_per_request']:.6f}")
        print(f"{'=' * 50}\n")


# Usage Example
if __name__ == "__main__":
    # Initialize the scraper
    scraper = DeepseekScraper(
        api_key="your_deepseek_api_key",
        model="deepseek-chat"
    )

    # Define the extraction schema
    schema = """
    {
        "product_name": "string",
        "price": "number",
        "currency": "string",
        "rating": "number or null",
        "reviews_count": "integer",
        "availability": "boolean",
        "description": "string",
        "features": ["array of strings"]
    }
    """

    # Example HTML (in production, fetch from a URL)
    sample_html = """
    <div class="product-page">
        <h1 class="product-title">Premium Wireless Headphones</h1>
        <div class="price">$299.99</div>
        <div class="rating">4.5 stars from 1,234 reviews</div>
        <p class="description">High-quality noise-canceling headphones with 30-hour battery life.</p>
        <span class="stock">In Stock</span>
        <ul class="features">
            <li>Active noise cancellation</li>
            <li>30-hour battery life</li>
            <li>Bluetooth 5.0</li>
        </ul>
    </div>
    """

    # Extract data
    result = scraper.extract_data(
        html=sample_html,
        schema=schema,
        url="https://example.com/product/1"
    )

    if result["success"]:
        print("Extracted Data:")
        print(json.dumps(result["data"], indent=2))
        print(f"\nTokens used: Input={result['tokens_used']['input']}, Output={result['tokens_used']['output']}")
    else:
        print(f"Error: {result['error']}")

    # Print the cost summary
    scraper.print_cost_summary()
```
JavaScript/Node.js Implementation
```javascript
import axios from 'axios';
import * as cheerio from 'cheerio';

class DeepseekScraper {
  constructor(apiKey, model = 'deepseek-chat') {
    this.apiKey = apiKey;
    this.model = model;
    this.apiUrl = 'https://api.deepseek.com/v1/chat/completions';

    // Track usage across all requests
    this.totalInputTokens = 0;
    this.totalOutputTokens = 0;
    this.totalCachedTokens = 0;
    this.requestsMade = 0;

    // Pricing per 1M tokens (USD)
    this.pricing = {
      'deepseek-chat': { input: 0.14, output: 0.28, cache: 0.07 },
      'deepseek-v3': { input: 0.27, output: 1.10, cache: 0.014 },
      'deepseek-r1': { input: 0.55, output: 2.19, cache: 0.014 }
    };

    this.systemPrompt = `You are a precise data extraction assistant.
Extract structured information from HTML content and return ONLY valid JSON.
Never include explanations, markdown formatting, or extra text.`;
  }

  cleanHtml(html) {
    const $ = cheerio.load(html);

    // Remove unwanted elements
    $('script, style, nav, footer, header, iframe, noscript').remove();

    // Narrow down to the main content, falling back to the full page
    const mainContent = $('main').html() ||
      $('article').html() ||
      $('.content, .main-content, .product').html() ||
      $('body').html();
    return mainContent || html;
  }

  async extractData(html, schema, url = null) {
    // Clean the HTML and cap its size (~10k tokens)
    const cleanedHtml = this.cleanHtml(html).substring(0, 40000);

    const userPrompt = `Extract the following information from this HTML:

${schema}

HTML Content:
${cleanedHtml}

Return ONLY valid JSON matching the schema.`;

    try {
      const response = await axios.post(
        this.apiUrl,
        {
          model: this.model,
          messages: [
            { role: 'system', content: this.systemPrompt },
            { role: 'user', content: userPrompt }
          ],
          temperature: 0.1,
          max_tokens: 4000,
          response_format: { type: 'json_object' }
        },
        {
          headers: {
            'Authorization': `Bearer ${this.apiKey}`,
            'Content-Type': 'application/json'
          },
          timeout: 30000
        }
      );

      const usage = response.data.usage || {};
      this.totalInputTokens += usage.prompt_tokens || 0;
      this.totalOutputTokens += usage.completion_tokens || 0;
      this.totalCachedTokens += usage.prompt_cache_hit_tokens || 0;
      this.requestsMade++;

      return {
        success: true,
        data: JSON.parse(response.data.choices[0].message.content),
        url,
        tokensUsed: {
          input: usage.prompt_tokens || 0,
          output: usage.completion_tokens || 0,
          cached: usage.prompt_cache_hit_tokens || 0
        }
      };
    } catch (error) {
      return {
        success: false,
        error: error.message,
        url
      };
    }
  }

  getCostSummary() {
    const pricing = this.pricing[this.model];

    // Assuming prompt_tokens includes cache hits, bill cache misses at
    // the full input rate and cache hits at the discounted cache rate
    const missedTokens = this.totalInputTokens - this.totalCachedTokens;
    const inputCost = (missedTokens / 1_000_000) * pricing.input;
    const outputCost = (this.totalOutputTokens / 1_000_000) * pricing.output;
    const cacheCost = (this.totalCachedTokens / 1_000_000) * pricing.cache;
    const cacheSavings = (this.totalCachedTokens / 1_000_000) * (pricing.input - pricing.cache);
    const totalCost = inputCost + outputCost + cacheCost;

    return {
      model: this.model,
      totalCost,
      costBreakdown: {
        inputTokensCost: inputCost,
        outputTokensCost: outputCost,
        cachedTokensCost: cacheCost,
        cacheSavings
      },
      tokenUsage: {
        inputTokens: this.totalInputTokens,
        outputTokens: this.totalOutputTokens,
        cachedTokens: this.totalCachedTokens,
        totalTokens: this.totalInputTokens + this.totalOutputTokens
      },
      requests: this.requestsMade,
      averageCostPerRequest: this.requestsMade > 0 ? totalCost / this.requestsMade : 0
    };
  }

  printCostSummary() {
    const summary = this.getCostSummary();
    console.log('\n' + '='.repeat(50));
    console.log('DEEPSEEK SCRAPING COST SUMMARY');
    console.log('='.repeat(50));
    console.log(`Model: ${summary.model}`);
    console.log(`Total Requests: ${summary.requests}`);
    console.log('\nToken Usage:');
    console.log(`  Input Tokens: ${summary.tokenUsage.inputTokens.toLocaleString()}`);
    console.log(`  Output Tokens: ${summary.tokenUsage.outputTokens.toLocaleString()}`);
    console.log(`  Cached Tokens: ${summary.tokenUsage.cachedTokens.toLocaleString()}`);
    console.log('\nCosts:');
    console.log(`  Input Cost: $${summary.costBreakdown.inputTokensCost.toFixed(4)}`);
    console.log(`  Output Cost: $${summary.costBreakdown.outputTokensCost.toFixed(4)}`);
    console.log(`  Cache Cost: $${summary.costBreakdown.cachedTokensCost.toFixed(4)}`);
    console.log(`  Cache Savings: $${summary.costBreakdown.cacheSavings.toFixed(4)}`);
    console.log(`  Total Cost: $${summary.totalCost.toFixed(4)}`);
    console.log(`  Average per Request: $${summary.averageCostPerRequest.toFixed(6)}`);
    console.log('='.repeat(50) + '\n');
  }
}

// Usage (top-level await requires an ES module context)
const scraper = new DeepseekScraper('your_api_key', 'deepseek-chat');

const schema = `{
  "product_name": "string",
  "price": "number",
  "currency": "string",
  "rating": "number or null",
  "availability": "boolean"
}`;

const html = `<div class="product">
  <h1>Premium Headphones</h1>
  <span class="price">$299.99</span>
  <div class="rating">4.5 stars</div>
  <span class="stock">In Stock</span>
</div>`;

const result = await scraper.extractData(html, schema, 'https://example.com/product/1');
console.log(JSON.stringify(result.data, null, 2));
scraper.printCostSummary();
```
Cost Optimization Strategies
1. Leverage Prompt Caching
Deepseek's caching can reduce costs by up to 95% for repeated prompts. Structure your code to reuse system prompts:
```python
# System prompt is cached automatically after first use
system_prompt = "You are a data extraction expert..."  # Cached

# Only the HTML content changes per request
for url in urls:
    html = fetch_html(url)  # placeholder for your own fetch logic
    result = scraper.extract_data(html, schema)  # Reuses the cached prompt
```
2. Clean HTML Aggressively
Remove unnecessary content before sending to the API:
```python
def aggressive_clean(html, text_only=False):
    soup = BeautifulSoup(html, 'html.parser')

    # Remove all attributes except class and id
    for tag in soup.find_all(True):
        tag.attrs = {k: v for k, v in tag.attrs.items() if k in ('class', 'id')}

    # Drop the markup entirely only when structure isn't needed
    if text_only:
        return soup.get_text(separator=' ', strip=True)
    return str(soup)
```
3. Batch Similar Pages
Process multiple similar pages in one request:
```python
# html1, html2, html3: cleaned HTML from three similar pages
prompt = f"""Extract data from these 3 product pages.
Return as an array of JSON objects.

Page 1:
{html1}

Page 2:
{html2}

Page 3:
{html3}"""
```
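A minimal helper for building such a batch prompt, assuming the cleaned pages are small enough that the combined prompt fits within the model's context window (`build_batch_prompt` is a hypothetical helper, not part of the class above):

```python
def build_batch_prompt(pages, schema):
    # Combine several cleaned HTML pages into one extraction prompt;
    # assumes the result stays within the model's context window
    sections = "\n\n".join(
        f"Page {i}:\n{html}" for i, html in enumerate(pages, start=1)
    )
    return (
        "Extract data matching this schema from each page below.\n"
        "Return a JSON array with one object per page, in order.\n\n"
        f"{schema}\n\n{sections}"
    )
```

Batching reduces the number of requests and lets all pages share one cached system prompt, but a single malformed response then affects several pages, so keep batches small.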
4. Choose the Right Model
- Deepseek-Chat: Simple product pages, listings, articles (50% cheaper)
- Deepseek-V3: Complex layouts, tables, nested data (best accuracy)
- Deepseek-R1: When reasoning is needed (most expensive, use sparingly)
5. Combine with Traditional Scraping
Use traditional tools for navigation and structure extraction, then use Deepseek only for complex fields. Learn more about handling AJAX requests with Puppeteer for efficient page loading.
```python
from bs4 import BeautifulSoup

# Use Puppeteer/Selenium for page interaction; browser.get_html is a
# placeholder for your own browser-automation wrapper
html = browser.get_html(url)

# Narrow the HTML to the relevant section with a plain CSS selector
soup = BeautifulSoup(html, 'html.parser')
product_section = soup.select_one('.product-details')

# Use Deepseek only for the relevant section
data = scraper.extract_data(str(product_section), schema)
```
Real-World Cost Examples
Example 1: E-commerce Product Scraping
Scenario: Scraping 50,000 product pages monthly
- Model: Deepseek-Chat
- Average tokens per page: 13,000 input, 300 output
- Monthly cost: ~$93
Example 2: News Article Extraction
Scenario: Scraping 10,000 articles monthly
- Model: Deepseek-V3 (for better content understanding)
- Average tokens per article: 20,000 input, 500 output
- Monthly cost: ~$60
Example 3: Real Estate Listings
Scenario: Scraping 5,000 property listings monthly
- Model: Deepseek-Chat
- Average tokens per listing: 8,000 input, 400 output
- Monthly cost: ~$6
Cost Comparison Table
| Use Case | Pages/Month | Deepseek | GPT-4o | Claude 3.5 | Savings vs GPT-4o |
|----------|-------------|----------|--------|------------|-------------------|
| Small (1K pages) | 1,000 | $1.87 | $5.25 | $4.15 | 64% |
| Medium (10K pages) | 10,000 | $18.70 | $52.50 | $41.50 | 64% |
| Large (100K pages) | 100,000 | $187 | $525 | $415 | 64% |
| Enterprise (1M pages) | 1,000,000 | $1,870 | $5,250 | $4,150 | 64% |
When to Use Deepseek for Web Scraping
Ideal scenarios:

- Large-scale scraping projects (10K+ pages/month)
- Budget-conscious projects requiring AI extraction
- Multi-language content extraction
- Sites with frequently changing layouts
- Extracting unstructured or semi-structured data
Consider alternatives when:

- Scraping fewer than 100 pages/month (traditional methods may be simpler)
- Real-time extraction with sub-second latency is required
- Dealing with simple, predictable HTML structures
- Sites already provide structured APIs
Monitoring and Budget Management
Set up cost tracking and alerts:
```python
class BudgetManager:
    def __init__(self, daily_limit, monthly_limit):
        self.daily_limit = daily_limit
        self.monthly_limit = monthly_limit
        self.daily_cost = 0
        self.monthly_cost = 0

    def check_budget(self, cost):
        """Add the cost of one request and enforce the limits."""
        self.daily_cost += cost
        self.monthly_cost += cost

        if self.daily_cost > self.daily_limit * 0.9:
            print("⚠️ Warning: 90% of daily budget used")
        if self.daily_cost >= self.daily_limit:
            raise Exception("Daily budget limit reached")
        if self.monthly_cost >= self.monthly_limit:
            raise Exception("Monthly budget limit reached")
        return True


# Usage: pass the incremental cost of each request, not the running total
budget = BudgetManager(daily_limit=50.00, monthly_limit=1000.00)

previous_total = 0
for url in urls:
    html = fetch_html(url)  # placeholder for your own fetch logic
    result = scraper.extract_data(html, schema)
    total = scraper.get_cost_summary()['total_cost']
    budget.check_budget(total - previous_total)  # cost of this request only
    previous_total = total
```
Advanced: Combining Deepseek Models
Use different models for different extraction tasks:
```python
class HybridScraper:
    def __init__(self, api_key):
        self.chat_scraper = DeepseekScraper(api_key, "deepseek-chat")
        self.v3_scraper = DeepseekScraper(api_key, "deepseek-v3")

    def extract(self, html, schema, complexity='simple'):
        if complexity == 'simple':
            # Use the cheaper model for straightforward extraction
            return self.chat_scraper.extract_data(html, schema)
        # Use the more capable model for complex scenarios
        return self.v3_scraper.extract_data(html, schema)
```
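A quick usage sketch (`listing_html` and `spec_html` stand in for pages you have already fetched; `schema` is defined as in the earlier example):

```python
hybrid = HybridScraper("your_deepseek_api_key")

# Simple listing page: route to the cheaper Deepseek-Chat
result = hybrid.extract(listing_html, schema, complexity='simple')

# Deeply nested spec table: route to Deepseek-V3
result = hybrid.extract(spec_html, schema, complexity='complex')
```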
Getting Started with Deepseek
- Sign up at platform.deepseek.com
- Get API key from your dashboard
- Add credits (starting from $5)
- Install dependencies:
```bash
# Python
pip install requests beautifulsoup4

# JavaScript
npm install axios cheerio
```
- Set environment variable:
```bash
export DEEPSEEK_API_KEY="your_api_key_here"
```
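Then read the key from the environment instead of hard-coding it; a small sketch using the class defined earlier:

```python
import os

api_key = os.environ["DEEPSEEK_API_KEY"]  # raises KeyError if unset
scraper = DeepseekScraper(api_key=api_key, model="deepseek-chat")
```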
For more details on Deepseek's capabilities, explore what Deepseek V3 offers for data extraction.
Conclusion
Deepseek offers one of the most cost-effective AI-powered web scraping options available today, with costs 50-70% lower than GPT-4o and Claude 3.5 Sonnet while maintaining competitive extraction quality. For projects scraping 10,000 pages monthly, you can expect costs of around $18-25/month compared to $40-55 with competing models.
Key takeaways:

- Deepseek-Chat: Best for high-volume, straightforward extraction ($0.14/$0.28 per 1M tokens)
- Deepseek-V3: Ideal for complex data structures ($0.27/$1.10 per 1M tokens)
- Cache optimization: Can reduce costs by up to 95% for repeated prompts
- HTML cleaning: Reduces token usage by 50-70%
- Hybrid approaches: Combine with browser automation for optimal results
For production web scraping with predictable costs and managed infrastructure, consider specialized web scraping APIs that handle proxies, browser automation, and data extraction in one platform.