Deepseek API Tutorial for Web Scraping Beginners
This comprehensive tutorial will guide you through using the Deepseek API for web scraping tasks, from basic setup to advanced data extraction techniques. Deepseek offers powerful language models that can parse HTML, extract structured data, and understand complex web page layouts without requiring traditional CSS selectors or XPath expressions.
What is Deepseek?
Deepseek is a family of large language models (LLMs) that excel at understanding and processing structured and unstructured data. For web scraping, Deepseek models can intelligently extract information from HTML content by understanding context and semantics, making them ideal for scenarios where traditional parsing methods fall short.
Prerequisites
Before starting this tutorial, you should have:
- Basic knowledge of Python or JavaScript
- An API key from Deepseek (sign up at platform.deepseek.com)
- Python 3.8+ or Node.js 18+ installed (required by recent versions of the OpenAI SDK)
- A text editor or IDE
Step 1: Getting Your Deepseek API Key
- Visit the Deepseek platform
- Sign up for an account
- Navigate to the API section
- Generate a new API key
- Store your API key securely (never commit it to version control)
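For example, export the key as an environment variable and read it at runtime rather than hardcoding it (shown in Python; the setup code in Step 2 follows the same pattern):

import os

api_key = os.environ.get("DEEPSEEK_API_KEY")
if not api_key:
    raise RuntimeError("Set the DEEPSEEK_API_KEY environment variable first")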
Step 2: Installation and Setup
Python Setup
First, install the required packages:
pip install openai requests beautifulsoup4
The Deepseek API is compatible with the OpenAI SDK, making integration straightforward.
Create a new Python file and set up your environment:
import os
from openai import OpenAI
import requests
from bs4 import BeautifulSoup

# Read your Deepseek API key from the environment (see Step 1)
client = OpenAI(
    api_key=os.environ.get("DEEPSEEK_API_KEY"),
    base_url="https://api.deepseek.com"
)
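Before going further, you can verify the key and base URL with a tiny test call (it costs only a few tokens):

# Quick connectivity check
ping = client.chat.completions.create(
    model="deepseek-chat",
    messages=[{"role": "user", "content": "Reply with the single word: ready"}],
    max_tokens=5
)
print(ping.choices[0].message.content)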
JavaScript Setup
Install the necessary packages:
npm install openai axios cheerio
Set up your JavaScript environment:
const OpenAI = require('openai');
const axios = require('axios');
const cheerio = require('cheerio');

// Read your Deepseek API key from the environment (see Step 1)
const client = new OpenAI({
  apiKey: process.env.DEEPSEEK_API_KEY,
  baseURL: 'https://api.deepseek.com'
});
Step 3: Basic Web Scraping with Deepseek
Fetching HTML Content
Before using Deepseek, you need to fetch the HTML content from your target website:
Python:
def fetch_html(url):
    """Fetch HTML content from a URL"""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    }
    response = requests.get(url, headers=headers)
    response.raise_for_status()
    return response.text

# Example usage
html_content = fetch_html('https://example.com/products')
JavaScript:
async function fetchHTML(url) {
  const response = await axios.get(url, {
    headers: {
      'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    }
  });
  return response.data;
}

// Example usage (inside an async function, or an ES module with top-level await)
const htmlContent = await fetchHTML('https://example.com/products');
Extracting Data with Deepseek
Now, use Deepseek to extract structured data from the HTML:
Python:
def extract_data_with_deepseek(html_content, extraction_prompt):
    """Extract structured data using Deepseek"""
    response = client.chat.completions.create(
        model="deepseek-chat",
        messages=[
            {
                "role": "system",
                "content": "You are a web scraping assistant. Extract data from HTML and return it in JSON format."
            },
            {
                "role": "user",
                "content": f"{extraction_prompt}\n\nHTML:\n{html_content}"
            }
        ],
        temperature=0.0,  # Use 0 for deterministic output
        response_format={"type": "json_object"}  # JSON mode; the prompt must mention JSON
    )
    # The content is a JSON string; parse it with json.loads() before use
    return response.choices[0].message.content

# Example: Extract product information
prompt = """
Extract all products from this e-commerce page. For each product, extract:
- Product name
- Price
- Description
- Availability status
Return a JSON object with a 'products' array.
"""

result = extract_data_with_deepseek(html_content, prompt)
print(result)
JavaScript:
async function extractDataWithDeepseek(htmlContent, extractionPrompt) {
  const response = await client.chat.completions.create({
    model: 'deepseek-chat',
    messages: [
      {
        role: 'system',
        content: 'You are a web scraping assistant. Extract data from HTML and return it in JSON format.'
      },
      {
        role: 'user',
        content: `${extractionPrompt}\n\nHTML:\n${htmlContent}`
      }
    ],
    temperature: 0.0,
    response_format: { type: 'json_object' }
  });
  return JSON.parse(response.choices[0].message.content);
}

// Example usage (inside an async function)
const prompt = `
Extract all products from this e-commerce page. For each product, extract:
- Product name
- Price
- Description
- Availability status
Return a JSON object with a 'products' array.
`;

const result = await extractDataWithDeepseek(htmlContent, prompt);
console.log(result);
Step 4: Advanced Techniques
Cleaning HTML Before Extraction
For better results and to reduce token usage, clean the HTML before sending it to Deepseek:
Python:
def clean_html(html_content):
    """Remove unnecessary elements from HTML"""
    soup = BeautifulSoup(html_content, 'html.parser')
    # Remove script, style, and boilerplate layout elements
    for element in soup(['script', 'style', 'nav', 'footer', 'header']):
        element.decompose()
    # Return the cleaned HTML as a string
    return str(soup)

# Use cleaned HTML
cleaned_html = clean_html(html_content)
result = extract_data_with_deepseek(cleaned_html, prompt)
JavaScript:
function cleanHTML(htmlContent) {
  const $ = cheerio.load(htmlContent);
  // Remove unnecessary elements
  $('script, style, nav, footer, header').remove();
  return $.html();
}

// Use cleaned HTML
const cleanedHTML = cleanHTML(htmlContent);
const result = await extractDataWithDeepseek(cleanedHTML, prompt);
Handling Large Pages
When working with large HTML documents, you may exceed token limits. Here's how to handle this:
Python:
def extract_relevant_section(html_content, css_selector):
    """Extract only the relevant section of the page"""
    soup = BeautifulSoup(html_content, 'html.parser')
    section = soup.select_one(css_selector)
    return str(section) if section else html_content

# Extract only the main content area
relevant_html = extract_relevant_section(html_content, '.product-list')
result = extract_data_with_deepseek(relevant_html, prompt)
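If no single selector isolates the relevant content, a fallback is to split the oversized HTML into chunks, run the extraction on each, and merge the results. A minimal sketch, assuming the page is a flat list of products and a chunk size that comfortably fits the model's context window (both assumptions to adjust for your case):

Python:
import json

def extract_in_chunks(html_content, prompt, chunk_size=50_000):
    """Split oversized HTML into character chunks, extract from each, merge."""
    chunks = [html_content[i:i + chunk_size]
              for i in range(0, len(html_content), chunk_size)]
    merged = []
    for chunk in chunks:
        # extract_data_with_deepseek (Step 3) returns a JSON string
        raw = extract_data_with_deepseek(chunk, prompt)
        merged.extend(json.loads(raw).get('products', []))
    return {'products': merged}

Note that naive character slicing can cut an element in half at a chunk boundary; for production use, prefer splitting on repeated container tags.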
Structured Output with Function Calling
Use function calling for more reliable structured output:
Python:
def extract_with_function_calling(html_content):
    """Use function calling for structured extraction"""
    tools = [{
        "type": "function",
        "function": {
            "name": "extract_products",
            "description": "Extract product information from HTML",
            "parameters": {
                "type": "object",
                "properties": {
                    "products": {
                        "type": "array",
                        "items": {
                            "type": "object",
                            "properties": {
                                "name": {"type": "string"},
                                "price": {"type": "number"},
                                "description": {"type": "string"},
                                "in_stock": {"type": "boolean"}
                            },
                            "required": ["name", "price"]
                        }
                    }
                },
                "required": ["products"]
            }
        }
    }]

    response = client.chat.completions.create(
        model="deepseek-chat",
        messages=[
            {
                "role": "user",
                "content": f"Extract all products from this HTML:\n{html_content}"
            }
        ],
        tools=tools,
        tool_choice={"type": "function", "function": {"name": "extract_products"}}
    )
    # arguments is a JSON string matching the schema above
    return response.choices[0].message.tool_calls[0].function.arguments
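Because the arguments field comes back as a JSON string matching the schema, parse it before use:

import json

products = json.loads(extract_with_function_calling(cleaned_html))["products"]
print(f"Extracted {len(products)} products")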
Step 5: Error Handling and Best Practices
Implementing Retry Logic
Python:
import time
from openai import APIError, RateLimitError

def extract_with_retry(html_content, prompt, max_retries=3):
    """Extract data with retry logic"""
    for attempt in range(max_retries):
        try:
            return extract_data_with_deepseek(html_content, prompt)
        except RateLimitError:
            if attempt < max_retries - 1:
                wait_time = 2 ** attempt  # Exponential backoff
                print(f"Rate limited. Waiting {wait_time} seconds...")
                time.sleep(wait_time)
            else:
                raise
        except APIError as e:
            print(f"API error: {e}")
            raise
Cost Optimization
Monitor and optimize your token usage:
Python:
def extract_with_cost_tracking(html_content, prompt):
    """Track token usage and estimated costs"""
    response = client.chat.completions.create(
        model="deepseek-chat",
        messages=[
            {"role": "system", "content": "Extract data as JSON."},
            {"role": "user", "content": f"{prompt}\n\n{html_content}"}
        ],
        temperature=0.0
    )

    usage = response.usage
    print(f"Tokens used: {usage.total_tokens}")
    print(f"Prompt tokens: {usage.prompt_tokens}")
    print(f"Completion tokens: {usage.completion_tokens}")

    # Illustrative Deepseek rates in USD per million tokens; check current pricing
    cost = (usage.prompt_tokens * 0.14 + usage.completion_tokens * 0.28) / 1_000_000
    print(f"Estimated cost: ${cost:.6f}")

    return response.choices[0].message.content
Step 6: Complete Working Example
Here's a complete example that scrapes product data from an e-commerce site:
Python:
import os
import json
from openai import OpenAI
import requests
from bs4 import BeautifulSoup

class DeepseekScraper:
    def __init__(self, api_key):
        self.client = OpenAI(
            api_key=api_key,
            base_url="https://api.deepseek.com"
        )

    def fetch_page(self, url):
        """Fetch and clean HTML content"""
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        }
        response = requests.get(url, headers=headers)
        response.raise_for_status()

        # Clean HTML
        soup = BeautifulSoup(response.text, 'html.parser')
        for element in soup(['script', 'style', 'nav', 'footer']):
            element.decompose()
        return str(soup)

    def extract_data(self, html_content, schema_description):
        """Extract structured data using Deepseek"""
        response = self.client.chat.completions.create(
            model="deepseek-chat",
            messages=[
                {
                    "role": "system",
                    "content": "You are a web scraping expert. Extract data accurately and return valid JSON."
                },
                {
                    "role": "user",
                    "content": f"{schema_description}\n\nHTML:\n{html_content}"
                }
            ],
            temperature=0.0,
            response_format={"type": "json_object"}
        )
        return json.loads(response.choices[0].message.content)

    def scrape(self, url, schema_description):
        """Complete scraping pipeline"""
        print(f"Fetching {url}...")
        html = self.fetch_page(url)
        print("Extracting data with Deepseek...")
        data = self.extract_data(html, schema_description)
        return data

# Usage example
if __name__ == "__main__":
    scraper = DeepseekScraper(api_key=os.environ.get("DEEPSEEK_API_KEY"))

    schema = """
    Extract all product listings from this page. For each product, extract:
    - name: Product name (string)
    - price: Numeric price value (number)
    - currency: Currency symbol or code (string)
    - rating: Customer rating if available (number or null)
    - image_url: Main product image URL (string or null)
    Return as: {"products": [...]}
    """

    results = scraper.scrape("https://example.com/products", schema)
    print(json.dumps(results, indent=2))
Comparing Deepseek to Traditional Methods
While traditional web scraping tools like Beautiful Soup and Selenium rely on CSS selectors and DOM traversal, Deepseek offers several advantages:
- No selector maintenance: Extract data without writing fragile CSS or XPath selectors
- Semantic understanding: Understands context and can handle layout variations
- Natural language queries: Describe what you want in plain English
- Adaptive to changes: More resilient to minor HTML structure changes
However, for simple, high-volume scraping tasks, traditional methods are still more cost-effective and faster.
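One practical middle ground is a hybrid scraper: try cheap CSS-selector extraction first, and fall back to Deepseek only when the selectors come up empty. A sketch reusing extract_data_with_deepseek from Step 3 (the '.product-card', '.title', and '.price' selectors are hypothetical placeholders for your target site):

Python:
import json
from bs4 import BeautifulSoup

def hybrid_extract(html_content, prompt):
    """Try CSS selectors first; fall back to LLM extraction if they fail."""
    soup = BeautifulSoup(html_content, 'html.parser')
    products = []
    for card in soup.select('.product-card'):  # hypothetical selector
        name = card.select_one('.title')
        price = card.select_one('.price')
        if name and price:
            products.append({'name': name.get_text(strip=True),
                             'price': price.get_text(strip=True)})
    if products:
        return {'products': products}
    # Selectors found nothing (layout changed?); fall back to the LLM
    return json.loads(extract_data_with_deepseek(html_content, prompt))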
Integration with WebScraping.AI
For production web scraping that combines the power of LLMs with traditional scraping infrastructure, consider using WebScraping.AI's API which handles:
- Proxy rotation and residential proxies
- JavaScript rendering and AJAX handling
- CAPTCHA bypassing
- Rate limiting and retry logic
- LLM-powered data extraction
This allows you to focus on data extraction while the infrastructure handles the complexities of modern web scraping.
Common Pitfalls to Avoid
- Sending entire HTML documents: Always clean and extract relevant sections first to reduce costs
- Ignoring rate limits: Implement proper retry logic with exponential backoff
- Vague prompts: Be specific about the data structure you want
- No validation: Always validate the extracted data before using it (see the sketch after this list)
- Ignoring costs: Monitor token usage to avoid unexpected bills
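Even at temperature 0, LLM output can occasionally drift from the requested schema, so validate before use. A minimal sketch with plain-Python checks (a schema library such as pydantic works just as well):

Python:
def validate_products(data):
    """Keep only records with a non-empty name and a sensible price."""
    products = data.get('products')
    if not isinstance(products, list):
        raise ValueError("Expected a 'products' list in the response")
    valid = []
    for p in products:
        if (isinstance(p.get('name'), str) and p['name'].strip()
                and isinstance(p.get('price'), (int, float)) and p['price'] >= 0):
            valid.append(p)
    return valid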
Next Steps
Now that you understand the basics of using Deepseek for web scraping, you can:
- Experiment with different model parameters (temperature, top_p)
- Build scrapers for specific use cases (e-commerce, news, real estate)
- Combine Deepseek with traditional scraping tools for optimal results
- Explore batch processing for scraping multiple pages efficiently
- Implement caching to reduce API calls for frequently accessed pages (sketched below)
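As a starting point for the caching idea, here is a minimal sketch that keys a local JSON cache on a hash of the prompt and page content (the cache directory and the lack of an expiry policy are simplifying assumptions):

Python:
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path('.deepseek_cache')
CACHE_DIR.mkdir(exist_ok=True)

def cached_extract(html_content, prompt):
    """Return a cached extraction if this page and prompt were seen before."""
    key = hashlib.sha256((prompt + html_content).encode()).hexdigest()
    cache_file = CACHE_DIR / f"{key}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text())
    result = json.loads(extract_data_with_deepseek(html_content, prompt))
    cache_file.write_text(json.dumps(result))
    return result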
Conclusion
The Deepseek API provides a powerful, AI-driven approach to web scraping that can handle complex extraction tasks with minimal code. By following this tutorial, you should now be able to set up a basic scraping pipeline, extract structured data, and implement best practices for production use.
Remember to always respect website terms of service, implement rate limiting, and use proxies when scraping at scale. For complex scraping needs requiring browser automation, consider learning about handling AJAX requests and working with dynamic content.