How Much Does It Cost to Use Deepseek for Web Scraping?

The cost of using Deepseek for web scraping is significantly lower than most competing AI models, making it an attractive option for large-scale data extraction projects. Deepseek offers competitive pricing while delivering performance comparable to GPT-4 and Claude, with costs ranging from $0.14 to $2.19 per million tokens depending on the model and caching configuration.

Understanding Deepseek's pricing structure is essential for budgeting your web scraping projects effectively, especially when processing thousands or millions of pages.

Deepseek API Pricing Structure

Deepseek charges based on tokens—text units that roughly correspond to 4 characters or 0.75 words in English. Like other LLM APIs, both input tokens (your HTML content and prompts) and output tokens (the extracted data) are counted and billed separately.
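As a rough sanity check before sending a request, you can estimate token counts locally. Deepseek uses its own tokenizer, so treat these figures as approximations; this sketch uses the ~4-characters-per-token rule of thumb, with tiktoken's cl100k_base encoding (OpenAI's tokenizer) as an optional stand-in:

import tiktoken

def estimate_tokens(text, use_tiktoken=True):
    """Rough token estimate; Deepseek's actual tokenizer may differ slightly."""
    if use_tiktoken:
        # cl100k_base is OpenAI's encoding, used here only as an approximation
        encoding = tiktoken.get_encoding("cl100k_base")
        return len(encoding.encode(text))
    # Fallback: the ~4 characters per token heuristic
    return len(text) // 4

html = "<div class='product'><h1>Premium Headphones</h1>...</div>"
print(f"Estimated input tokens: {estimate_tokens(html)}")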

Current Pricing by Model (as of 2025)

| Model | Input Tokens (per 1M) | Output Tokens (per 1M) | Cache Hits (per 1M) |
|-------|----------------------|------------------------|---------------------|
| Deepseek-V3 | $0.27 | $1.10 | $0.014 |
| Deepseek-R1 | $0.55 | $2.19 | $0.014 |
| Deepseek-Chat | $0.14 | $0.28 | $0.07 |

Key advantages:

  • 70-90% cheaper than GPT-4 and Claude 3 Opus
  • Cache hits cost up to 95% less than regular input tokens
  • No minimum commitment or subscription required
  • Pay-as-you-go pricing model

For web scraping tasks, Deepseek-Chat offers the best cost-to-performance ratio for structured data extraction, while Deepseek-V3 provides superior accuracy for complex extraction scenarios.

Cost Calculation for Web Scraping

The total cost depends on several factors:

  1. HTML size: Larger pages consume more input tokens
  2. Extraction complexity: Complex prompts increase token usage
  3. Response format: JSON outputs are typically more concise
  4. Model selection: Different models have different pricing tiers
  5. Cache efficiency: Repeated prompts benefit from caching

Example Cost Calculation

Let's calculate the cost to scrape 10,000 product pages using Deepseek-Chat:

Assumptions:

  • Average HTML page size: 50 KB (compressed to ~12,500 tokens)
  • System prompt size: ~300 tokens (cached after first use)
  • User prompt size: ~200 tokens
  • Output JSON: ~250 tokens

Cost per page (first request):

  • Input: 13,000 tokens × $0.14 / 1,000,000 = $0.00182
  • Output: 250 tokens × $0.28 / 1,000,000 = $0.00007
  • Total: $0.00189

Cost per page (with cached prompt):

  • New input: 12,700 tokens × $0.14 / 1,000,000 = $0.00178
  • Cached: 300 tokens × $0.07 / 1,000,000 = $0.000021
  • Output: 250 tokens × $0.28 / 1,000,000 = $0.00007
  • Total: $0.00187

Cost for 10,000 pages: ~$18.70
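The arithmetic above is easy to reproduce in code. Here is a minimal sketch; the default token counts and prices mirror the assumptions above, so adjust them for your own workload:

def estimate_scraping_cost(pages, input_tokens=12_700, cached_tokens=300,
                           output_tokens=250, input_price=0.14,
                           cache_price=0.07, output_price=0.28):
    """Estimate total cost in USD using per-1M-token prices (deepseek-chat)."""
    per_page = (
        (input_tokens / 1_000_000) * input_price +
        (cached_tokens / 1_000_000) * cache_price +
        (output_tokens / 1_000_000) * output_price
    )
    return per_page * pages

print(f"10,000 pages: ${estimate_scraping_cost(10_000):.2f}")  # ≈ the ~$18.70 figure above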

Comparison with Competing Models

For the same 10,000-page scraping project:

| Model | Cost per Page | Total Cost (10K pages) | Relative Cost |
|-------|--------------|----------------------|---------------|
| Deepseek-Chat | $0.00187 | $18.70 | 1x (baseline) |
| Deepseek-V3 | $0.00213 | $21.30 | 1.14x |
| GPT-4o-mini | $0.00207 | $20.70 | 1.11x |
| GPT-4o | $0.00525 | $52.50 | 2.81x |
| Claude 3.5 Sonnet | $0.00415 | $41.50 | 2.22x |

Deepseek-Chat delivers roughly 55-65% cost savings compared to GPT-4o and Claude 3.5 Sonnet while maintaining competitive accuracy.

Practical Python Implementation with Cost Tracking

Here's a production-ready implementation with comprehensive cost tracking:

import requests
import json
from bs4 import BeautifulSoup, Comment

class DeepseekScraper:
    def __init__(self, api_key, model="deepseek-chat"):
        self.api_key = api_key
        self.model = model
        self.api_url = "https://api.deepseek.com/v1/chat/completions"

        # Track token usage
        self.total_input_tokens = 0
        self.total_output_tokens = 0
        self.total_cached_tokens = 0
        self.requests_made = 0

        # Pricing per 1M tokens (USD)
        self.pricing = {
            "deepseek-chat": {
                "input": 0.14,
                "output": 0.28,
                "cache": 0.07
            },
            "deepseek-v3": {
                "input": 0.27,
                "output": 1.10,
                "cache": 0.014
            },
            "deepseek-r1": {
                "input": 0.55,
                "output": 2.19,
                "cache": 0.014
            }
        }

        # System prompt (will be cached after first use)
        self.system_prompt = """You are a precise data extraction assistant.
Extract structured information from HTML content and return ONLY valid JSON.
Never include explanations, markdown formatting, or extra text."""

    def clean_html(self, html):
        """Remove unnecessary elements to reduce token count"""
        soup = BeautifulSoup(html, 'html.parser')

        # Remove scripts, styles, and other non-content elements
        for tag in soup(['script', 'style', 'nav', 'footer', 'header', 'iframe', 'noscript']):
            tag.decompose()

        # Remove HTML comments
        for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
            comment.extract()

        # Get main content if identifiable
        main_content = (
            soup.find('main') or
            soup.find('article') or
            soup.find(class_=['content', 'main-content', 'product']) or
            soup.body
        )

        return str(main_content) if main_content else str(soup)

    def extract_data(self, html, schema, url=None):
        """Extract structured data from HTML using Deepseek"""

        # Clean HTML to reduce tokens
        cleaned_html = self.clean_html(html)

        # Truncate if still too large (optional)
        max_chars = 40000  # ~10k tokens
        if len(cleaned_html) > max_chars:
            cleaned_html = cleaned_html[:max_chars] + "..."

        # Build extraction prompt
        user_prompt = f"""Extract the following information from this HTML:

{schema}

HTML Content:
{cleaned_html}

Return ONLY valid JSON matching the schema."""

        # Make API request
        headers = {
            "Authorization": f"Bearer {self.api_key}",
            "Content-Type": "application/json"
        }

        payload = {
            "model": self.model,
            "messages": [
                {"role": "system", "content": self.system_prompt},
                {"role": "user", "content": user_prompt}
            ],
            "temperature": 0.1,  # Low temperature for consistent extraction
            "max_tokens": 4000,
            "response_format": {"type": "json_object"}
        }

        try:
            response = requests.post(
                self.api_url,
                headers=headers,
                json=payload,
                timeout=30
            )
            response.raise_for_status()
            result = response.json()

            # Track token usage
            usage = result.get('usage', {})
            self.total_input_tokens += usage.get('prompt_tokens', 0)
            self.total_output_tokens += usage.get('completion_tokens', 0)

            # Track cached tokens if available
            if 'prompt_cache_hit_tokens' in usage:
                self.total_cached_tokens += usage['prompt_cache_hit_tokens']

            self.requests_made += 1

            # Parse and return extracted data
            extracted = json.loads(result['choices'][0]['message']['content'])

            return {
                "success": True,
                "data": extracted,
                "url": url,
                "tokens_used": {
                    "input": usage.get('prompt_tokens', 0),
                    "output": usage.get('completion_tokens', 0),
                    "cached": usage.get('prompt_cache_hit_tokens', 0)
                }
            }

        except requests.exceptions.RequestException as e:
            return {
                "success": False,
                "error": str(e),
                "url": url
            }
        except json.JSONDecodeError as e:
            return {
                "success": False,
                "error": f"Invalid JSON response: {str(e)}",
                "url": url
            }

    def get_cost_summary(self):
        """Calculate and return detailed cost breakdown"""
        pricing = self.pricing[self.model]

        # prompt_tokens includes cache hits, so bill only the misses at full price
        billable_input = self.total_input_tokens - self.total_cached_tokens
        input_cost = (billable_input / 1_000_000) * pricing["input"]
        output_cost = (self.total_output_tokens / 1_000_000) * pricing["output"]
        cache_cost = (self.total_cached_tokens / 1_000_000) * pricing["cache"]

        # What the cached tokens would have cost at the full input rate
        cache_savings = (self.total_cached_tokens / 1_000_000) * (pricing["input"] - pricing["cache"])

        total_cost = input_cost + output_cost + cache_cost

        return {
            "model": self.model,
            "total_cost": total_cost,
            "cost_breakdown": {
                "input_tokens_cost": input_cost,
                "output_tokens_cost": output_cost,
                "cached_tokens_cost": cache_cost,
                "cache_savings": cache_savings
            },
            "token_usage": {
                "input_tokens": self.total_input_tokens,
                "output_tokens": self.total_output_tokens,
                "cached_tokens": self.total_cached_tokens,
                "total_tokens": self.total_input_tokens + self.total_output_tokens
            },
            "requests": self.requests_made,
            "average_cost_per_request": total_cost / self.requests_made if self.requests_made > 0 else 0
        }

    def print_cost_summary(self):
        """Print formatted cost summary"""
        summary = self.get_cost_summary()

        print(f"\n{'='*50}")
        print(f"DEEPSEEK SCRAPING COST SUMMARY")
        print(f"{'='*50}")
        print(f"Model: {summary['model']}")
        print(f"Total Requests: {summary['requests']}")
        print(f"\nToken Usage:")
        print(f"  Input Tokens: {summary['token_usage']['input_tokens']:,}")
        print(f"  Output Tokens: {summary['token_usage']['output_tokens']:,}")
        print(f"  Cached Tokens: {summary['token_usage']['cached_tokens']:,}")
        print(f"  Total Tokens: {summary['token_usage']['total_tokens']:,}")
        print(f"\nCosts:")
        print(f"  Input Cost: ${summary['cost_breakdown']['input_tokens_cost']:.4f}")
        print(f"  Output Cost: ${summary['cost_breakdown']['output_tokens_cost']:.4f}")
        print(f"  Cache Cost: ${summary['cost_breakdown']['cached_tokens_cost']:.4f}")
        print(f"  Cache Savings: ${summary['cost_breakdown']['cache_savings']:.4f}")
        print(f"  Total Cost: ${summary['total_cost']:.4f}")
        print(f"  Average per Request: ${summary['average_cost_per_request']:.6f}")
        print(f"{'='*50}\n")


# Usage Example
if __name__ == "__main__":
    # Initialize scraper
    scraper = DeepseekScraper(
        api_key="your_deepseek_api_key",
        model="deepseek-chat"
    )

    # Define extraction schema
    schema = """
{
  "product_name": "string",
  "price": "number",
  "currency": "string",
  "rating": "number or null",
  "reviews_count": "integer",
  "availability": "boolean",
  "description": "string",
  "features": ["array of strings"]
}
"""

    # Example HTML (in production, fetch from URL)
    sample_html = """
    <div class="product-page">
        <h1 class="product-title">Premium Wireless Headphones</h1>
        <div class="price">$299.99</div>
        <div class="rating">4.5 stars from 1,234 reviews</div>
        <p class="description">High-quality noise-canceling headphones with 30-hour battery life.</p>
        <span class="stock">In Stock</span>
        <ul class="features">
            <li>Active noise cancellation</li>
            <li>30-hour battery life</li>
            <li>Bluetooth 5.0</li>
        </ul>
    </div>
    """

    # Extract data
    result = scraper.extract_data(
        html=sample_html,
        schema=schema,
        url="https://example.com/product/1"
    )

    if result["success"]:
        print("Extracted Data:")
        print(json.dumps(result["data"], indent=2))
        print(f"\nTokens used: Input={result['tokens_used']['input']}, Output={result['tokens_used']['output']}")
    else:
        print(f"Error: {result['error']}")

    # Print cost summary
    scraper.print_cost_summary()

JavaScript/Node.js Implementation

import axios from 'axios';
import * as cheerio from 'cheerio';

class DeepseekScraper {
    constructor(apiKey, model = 'deepseek-chat') {
        this.apiKey = apiKey;
        this.model = model;
        this.apiUrl = 'https://api.deepseek.com/v1/chat/completions';

        // Track usage
        this.totalInputTokens = 0;
        this.totalOutputTokens = 0;
        this.totalCachedTokens = 0;
        this.requestsMade = 0;

        // Pricing
        this.pricing = {
            'deepseek-chat': { input: 0.14, output: 0.28, cache: 0.07 },
            'deepseek-v3': { input: 0.27, output: 1.10, cache: 0.014 },
            'deepseek-r1': { input: 0.55, output: 2.19, cache: 0.014 }
        };

        this.systemPrompt = `You are a precise data extraction assistant.
Extract structured information from HTML content and return ONLY valid JSON.
Never include explanations, markdown formatting, or extra text.`;
    }

    cleanHtml(html) {
        const $ = cheerio.load(html);

        // Remove unwanted elements
        $('script, style, nav, footer, header, iframe, noscript').remove();

        // Get main content
        const mainContent = $('main').html() ||
                          $('article').html() ||
                          $('.content, .main-content, .product').html() ||
                          $('body').html();

        return mainContent || html;
    }

    async extractData(html, schema, url = null) {
        // Clean HTML
        const cleanedHtml = this.cleanHtml(html).substring(0, 40000);

        const userPrompt = `Extract the following information from this HTML:

${schema}

HTML Content:
${cleanedHtml}

Return ONLY valid JSON matching the schema.`;

        try {
            const response = await axios.post(
                this.apiUrl,
                {
                    model: this.model,
                    messages: [
                        { role: 'system', content: this.systemPrompt },
                        { role: 'user', content: userPrompt }
                    ],
                    temperature: 0.1,
                    max_tokens: 4000,
                    response_format: { type: 'json_object' }
                },
                {
                    headers: {
                        'Authorization': `Bearer ${this.apiKey}`,
                        'Content-Type': 'application/json'
                    },
                    timeout: 30000
                }
            );

            const usage = response.data.usage || {};
            this.totalInputTokens += usage.prompt_tokens || 0;
            this.totalOutputTokens += usage.completion_tokens || 0;
            this.totalCachedTokens += usage.prompt_cache_hit_tokens || 0;
            this.requestsMade++;

            return {
                success: true,
                data: JSON.parse(response.data.choices[0].message.content),
                url,
                tokensUsed: {
                    input: usage.prompt_tokens || 0,
                    output: usage.completion_tokens || 0,
                    cached: usage.prompt_cache_hit_tokens || 0
                }
            };

        } catch (error) {
            return {
                success: false,
                error: error.message,
                url
            };
        }
    }

    getCostSummary() {
        const pricing = this.pricing[this.model];

        // prompt_tokens includes cache hits, so bill only the misses at full price
        const billableInput = this.totalInputTokens - this.totalCachedTokens;
        const inputCost = (billableInput / 1_000_000) * pricing.input;
        const outputCost = (this.totalOutputTokens / 1_000_000) * pricing.output;
        const cacheCost = (this.totalCachedTokens / 1_000_000) * pricing.cache;
        const cacheSavings = (this.totalCachedTokens / 1_000_000) * (pricing.input - pricing.cache);

        const totalCost = inputCost + outputCost + cacheCost;

        return {
            model: this.model,
            totalCost,
            costBreakdown: {
                inputTokensCost: inputCost,
                outputTokensCost: outputCost,
                cachedTokensCost: cacheCost,
                cacheSavings
            },
            tokenUsage: {
                inputTokens: this.totalInputTokens,
                outputTokens: this.totalOutputTokens,
                cachedTokens: this.totalCachedTokens,
                totalTokens: this.totalInputTokens + this.totalOutputTokens
            },
            requests: this.requestsMade,
            averageCostPerRequest: this.requestsMade > 0 ? totalCost / this.requestsMade : 0
        };
    }

    printCostSummary() {
        const summary = this.getCostSummary();

        console.log('\n' + '='.repeat(50));
        console.log('DEEPSEEK SCRAPING COST SUMMARY');
        console.log('='.repeat(50));
        console.log(`Model: ${summary.model}`);
        console.log(`Total Requests: ${summary.requests}`);
        console.log('\nToken Usage:');
        console.log(`  Input Tokens: ${summary.tokenUsage.inputTokens.toLocaleString()}`);
        console.log(`  Output Tokens: ${summary.tokenUsage.outputTokens.toLocaleString()}`);
        console.log(`  Cached Tokens: ${summary.tokenUsage.cachedTokens.toLocaleString()}`);
        console.log('\nCosts:');
        console.log(`  Input Cost: $${summary.costBreakdown.inputTokensCost.toFixed(4)}`);
        console.log(`  Output Cost: $${summary.costBreakdown.outputTokensCost.toFixed(4)}`);
        console.log(`  Cache Cost: $${summary.costBreakdown.cachedTokensCost.toFixed(4)}`);
        console.log(`  Cache Savings: $${summary.costBreakdown.cacheSavings.toFixed(4)}`);
        console.log(`  Total Cost: $${summary.totalCost.toFixed(4)}`);
        console.log(`  Average per Request: $${summary.averageCostPerRequest.toFixed(6)}`);
        console.log('='.repeat(50) + '\n');
    }
}

// Usage
const scraper = new DeepseekScraper('your_api_key', 'deepseek-chat');

const schema = `{
  "product_name": "string",
  "price": "number",
  "currency": "string",
  "rating": "number or null",
  "availability": "boolean"
}`;

const html = `<div class="product">
    <h1>Premium Headphones</h1>
    <span class="price">$299.99</span>
    <div class="rating">4.5 stars</div>
    <span class="stock">In Stock</span>
</div>`;

const result = await scraper.extractData(html, schema, 'https://example.com/product/1');
console.log(JSON.stringify(result.data, null, 2));
scraper.printCostSummary();

Cost Optimization Strategies

1. Leverage Prompt Caching

Deepseek's caching can reduce costs by up to 95% for repeated prompts. Structure your code to reuse system prompts:

# The system prompt is cached on Deepseek's side after first use
system_prompt = "You are a data extraction expert..."  # Cached

# Only the HTML content changes per request
for url in urls:
    html = fetch_html(url)  # your own fetch helper (e.g. requests.get)
    result = scraper.extract_data(html, schema)  # reuses the cached system prompt

2. Clean HTML Aggressively

Remove unnecessary content before sending to the API:

def aggressive_clean(html):
    soup = BeautifulSoup(html, 'html.parser')

    # Remove all attributes except class and id (keeps selectors usable)
    for tag in soup.find_all(True):
        tag.attrs = {k: v for k, v in tag.attrs.items() if k in ['class', 'id']}

    # Keep the stripped markup if the LLM needs structure...
    return str(soup)
    # ...or collapse to plain text when structure isn't needed:
    # return soup.get_text(separator=' ', strip=True)
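To see what cleaning buys you, compare sizes before and after; with the ~4 characters per token heuristic, the reduction translates directly into input cost. A quick check (product_page.html is a hypothetical saved page):

raw_html = open("product_page.html").read()  # hypothetical saved page
cleaned = aggressive_clean(raw_html)

# ~4 characters per token heuristic
raw_tokens, clean_tokens = len(raw_html) // 4, len(cleaned) // 4
print(f"Tokens: {raw_tokens:,} -> {clean_tokens:,} "
      f"({100 * (1 - clean_tokens / max(raw_tokens, 1)):.0f}% reduction)")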

3. Batch Similar Pages

Process multiple similar pages in one request:

prompt = f"""Extract data from these 3 product pages.
Return as array of JSON objects.

Page 1:
{html1}

Page 2:
{html2}

Page 3:
{html3}"""

4. Choose the Right Model

  • Deepseek-Chat: Simple product pages, listings, articles (50% cheaper)
  • Deepseek-V3: Complex layouts, tables, nested data (best accuracy)
  • Deepseek-R1: When reasoning is needed (most expensive, use sparingly)

5. Combine with Traditional Scraping

Use traditional tools for navigation and structure extraction, then use Deepseek only for complex fields. Learn more about handling AJAX requests with Puppeteer for efficient page loading.

# Render the page with Puppeteer/Selenium (not shown), then isolate the
# relevant section with a CSS selector before involving the LLM
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')  # html rendered by your browser tool
product_section = str(soup.select_one('.product-details'))

# Use Deepseek only for the section that actually needs AI extraction
data = scraper.extract_data(product_section, schema)

Real-World Cost Examples

Example 1: E-commerce Product Scraping

Scenario: Scraping 50,000 product pages monthly

  • Model: Deepseek-Chat
  • Average tokens per page: 13,000 input, 300 output
  • Monthly cost: ~$95

Example 2: News Article Extraction

Scenario: Scraping 10,000 articles monthly

  • Model: Deepseek-V3 (for better content understanding)
  • Average tokens per article: 20,000 input, 500 output
  • Monthly cost: ~$60

Example 3: Real Estate Listings

Scenario: Scraping 5,000 property listings monthly

  • Model: Deepseek-Chat
  • Average tokens per listing: 8,000 input, 400 output
  • Monthly cost: ~$6
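These figures fall out of the same per-1M-token arithmetic as before. A quick check, ignoring cache discounts for simplicity (the token counts mirror the scenarios above):

def monthly_cost(pages, in_tokens, out_tokens, in_price, out_price):
    """Monthly cost in USD at per-1M-token prices, without cache discounts."""
    return pages * ((in_tokens / 1e6) * in_price + (out_tokens / 1e6) * out_price)

print(f"E-commerce:  ${monthly_cost(50_000, 13_000, 300, 0.14, 0.28):.0f}")  # ~$95
print(f"News:        ${monthly_cost(10_000, 20_000, 500, 0.27, 1.10):.0f}")  # ~$60
print(f"Real estate: ${monthly_cost(5_000, 8_000, 400, 0.14, 0.28):.0f}")    # ~$6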

Cost Comparison Table

| Use Case | Pages/Month | Deepseek | GPT-4o | Claude 3.5 | Savings vs GPT-4o |
|----------|-------------|----------|--------|------------|-------------------|
| Small (1K pages) | 1,000 | $1.87 | $5.25 | $4.15 | 64% |
| Medium (10K pages) | 10,000 | $18.70 | $52.50 | $41.50 | 64% |
| Large (100K pages) | 100,000 | $187 | $525 | $415 | 64% |
| Enterprise (1M pages) | 1,000,000 | $1,870 | $5,250 | $4,150 | 64% |

When to Use Deepseek for Web Scraping

Ideal scenarios:

  • Large-scale scraping projects (10K+ pages/month)
  • Budget-conscious projects requiring AI extraction
  • Multi-language content extraction
  • Sites with frequently changing layouts
  • Extracting unstructured or semi-structured data

Consider alternatives when:

  • Scraping fewer than 100 pages/month (traditional methods may be simpler)
  • Real-time extraction with sub-second latency is required
  • Dealing with simple, predictable HTML structures
  • Sites already provide structured APIs

Monitoring and Budget Management

Set up cost tracking and alerts:

class BudgetManager:
    def __init__(self, daily_limit, monthly_limit):
        self.daily_limit = daily_limit
        self.monthly_limit = monthly_limit
        self.daily_cost = 0.0
        self.monthly_cost = 0.0

    def check_budget(self, cost):
        """Add the incremental cost of the latest request and enforce limits.
        (Resetting the counters at day/month boundaries is left to the caller.)"""
        self.daily_cost += cost
        self.monthly_cost += cost

        if self.daily_cost > self.daily_limit * 0.9:
            print("⚠️  Warning: 90% of daily budget used")

        if self.daily_cost >= self.daily_limit:
            raise Exception("Daily budget limit reached")

        if self.monthly_cost >= self.monthly_limit:
            raise Exception("Monthly budget limit reached")

        return True

# Usage
budget = BudgetManager(daily_limit=50.00, monthly_limit=1000.00)

last_total = 0.0
for url in urls:
    html = fetch_html(url)  # your own fetch helper
    result = scraper.extract_data(html, schema, url=url)
    total = scraper.get_cost_summary()['total_cost']
    budget.check_budget(total - last_total)  # pass only the incremental cost
    last_total = total

Advanced: Combining Deepseek Models

Use different models for different extraction tasks:

class HybridScraper:
    def __init__(self, api_key):
        self.chat_scraper = DeepseekScraper(api_key, "deepseek-chat")
        self.v3_scraper = DeepseekScraper(api_key, "deepseek-v3")

    def extract(self, html, schema, complexity='simple'):
        if complexity == 'simple':
            # Use the cheaper model for straightforward extraction
            return self.chat_scraper.extract_data(html, schema)
        else:
            # Use the more powerful model for complex scenarios
            return self.v3_scraper.extract_data(html, schema)
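A possible routing heuristic (the table-count threshold here is just an illustrative assumption; substitute whatever signal distinguishes your simple and complex pages):

hybrid = HybridScraper("your_deepseek_api_key")

# Route pages with multiple data tables to the stronger model
complexity = 'complex' if html.count('<table') >= 2 else 'simple'
result = hybrid.extract(html, schema, complexity=complexity)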

Getting Started with Deepseek

  1. Sign up at platform.deepseek.com
  2. Get API key from your dashboard
  3. Add credits (starting from $5)
  4. Install dependencies:

# Python
pip install requests beautifulsoup4

# JavaScript
npm install axios cheerio

  5. Set environment variable:

export DEEPSEEK_API_KEY="your_api_key_here"
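Then load the key from the environment instead of hardcoding it (reusing the DeepseekScraper class from earlier):

import os

scraper = DeepseekScraper(api_key=os.environ["DEEPSEEK_API_KEY"])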

For more details on Deepseek's capabilities, explore what Deepseek V3 offers for data extraction.

Conclusion

Deepseek offers one of the most cost-effective AI-powered web scraping solutions available today, with costs roughly 55-65% lower than GPT-4o and Claude 3.5 Sonnet while maintaining competitive extraction quality. For projects scraping 10,000 pages monthly, you can expect costs around $18-25/month compared to $40-55 with competing models.

Key takeaways:

  • Deepseek-Chat: Best for high-volume, straightforward extraction ($0.14/$0.28 per 1M tokens)
  • Deepseek-V3: Ideal for complex data structures ($0.27/$1.10 per 1M tokens)
  • Cache optimization: Can reduce costs by up to 95% for repeated prompts
  • HTML cleaning: Reduces token usage by 50-70%
  • Hybrid approaches: Combine with browser automation for optimal results

For production web scraping with predictable costs and managed infrastructure, consider specialized web scraping APIs that handle proxies, browser automation, and data extraction in one platform.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
