How do I Extract Data from Websites Using Deepseek?
Deepseek is a family of powerful large language models (LLMs) that can be used to extract structured data from websites by interpreting HTML content and converting it into JSON or other formats. Unlike traditional web scraping, which relies on brittle CSS selectors or XPath expressions, Deepseek can understand the semantic meaning of content and extract data even when the page structure changes.
Understanding Deepseek for Web Scraping
Deepseek offers several models optimized for different tasks:
- Deepseek-V3: The latest general-purpose model with excellent reasoning capabilities
- Deepseek-Coder: Specialized for code generation and technical content
- Deepseek-R1: Enhanced reasoning model for complex extraction tasks
For web scraping, you'll typically use the Deepseek API to send HTML content along with instructions about what data to extract, and receive structured output in return.
Prerequisites
Before you start, you'll need:
- A Deepseek API key (obtain from platform.deepseek.com)
- Python 3.7+ or Node.js 14+ installed
- HTTP client library for making requests
- HTML fetching capability (requests, axios, or a web scraping API)
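If you plan to follow the Python examples, the third-party packages they rely on are openai (Deepseek's API is OpenAI-compatible), requests, beautifulsoup4, and playwright, all installable with pip.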
Method 1: Using Python with Deepseek
Here's a complete example of extracting product data from an e-commerce page:
import json

import requests
from openai import OpenAI

# Initialize Deepseek client (compatible with OpenAI SDK)
client = OpenAI(
    api_key="your-deepseek-api-key",
    base_url="https://api.deepseek.com"
)
# Fetch HTML content
url = "https://example.com/product-page"
response = requests.get(url)
html_content = response.text
# Define extraction schema
extraction_prompt = """
Extract the following information from the HTML:
- Product name
- Price
- Description
- Availability status
- Customer rating
Return the data as a JSON object.
"""
# Call Deepseek API
completion = client.chat.completions.create(
    model="deepseek-chat",
    messages=[
        {
            "role": "system",
            "content": "You are a data extraction assistant. Extract structured data from HTML and return valid JSON."
        },
        {
            "role": "user",
            "content": f"{extraction_prompt}\n\nHTML:\n{html_content[:8000]}"
        }
    ],
    response_format={"type": "json_object"}
)
# Parse the extracted data
extracted_data = json.loads(completion.choices[0].message.content)
print(json.dumps(extracted_data, indent=2))
Output Example
{
  "product_name": "Wireless Bluetooth Headphones",
  "price": "$79.99",
  "description": "Premium noise-canceling headphones with 30-hour battery life",
  "availability": "In Stock",
  "rating": 4.5
}
Method 2: Using JavaScript/Node.js with Deepseek
For JavaScript developers, here's how to integrate Deepseek with a web scraping workflow:
const axios = require('axios');

async function extractDataWithDeepseek(html, schema) {
  const apiKey = 'your-deepseek-api-key';
  const response = await axios.post(
    'https://api.deepseek.com/v1/chat/completions',
    {
      model: 'deepseek-chat',
      messages: [
        {
          role: 'system',
          content: 'You are a web scraping assistant. Extract data from HTML and return structured JSON.'
        },
        {
          role: 'user',
          // Truncate the HTML to stay within token limits (see "Handling Large HTML Documents" below)
          content: `Extract the following fields: ${schema.join(', ')}\n\nHTML:\n${html.slice(0, 8000)}`
        }
      ],
      response_format: { type: 'json_object' },
      temperature: 0.1
    },
    {
      headers: {
        'Authorization': `Bearer ${apiKey}`,
        'Content-Type': 'application/json'
      }
    }
  );
  return JSON.parse(response.data.choices[0].message.content);
}

// Example usage
async function scrapeWebsite() {
  const url = 'https://example.com/articles';
  const htmlResponse = await axios.get(url);
  const schema = ['title', 'author', 'publish_date', 'content', 'tags'];
  const extractedData = await extractDataWithDeepseek(htmlResponse.data, schema);
  console.log(extractedData);
}

scrapeWebsite().catch(console.error);
Advanced Techniques
Handling Large HTML Documents
Deepseek has token limits, so for large pages, extract only relevant sections:
from bs4 import BeautifulSoup

def extract_relevant_html(full_html, selector):
    """Extract only the relevant portion of HTML"""
    soup = BeautifulSoup(full_html, 'html.parser')
    relevant_section = soup.select_one(selector)
    # Fall back to the full page if the selector matches nothing
    return str(relevant_section) if relevant_section else full_html

# Usage
html = requests.get(url).text
focused_html = extract_relevant_html(html, '.product-details')
# Now send only focused_html to Deepseek
Batch Processing Multiple Pages
For scraping multiple pages efficiently:
import asyncio
import json

import requests
from openai import AsyncOpenAI

async def extract_batch(urls, client):
    """Extract data from multiple URLs concurrently"""
    async def process_url(url):
        # requests is blocking, so run it in a worker thread to keep the event loop free
        loop = asyncio.get_running_loop()
        page = await loop.run_in_executor(None, requests.get, url)
        html = page.text
        response = await client.chat.completions.create(
            model="deepseek-chat",
            messages=[
                {"role": "system", "content": "Extract product data as JSON."},
                {"role": "user", "content": f"HTML: {html[:8000]}"}
            ],
            response_format={"type": "json_object"}
        )
        return json.loads(response.choices[0].message.content)

    tasks = [process_url(url) for url in urls]
    results = await asyncio.gather(*tasks)
    return results

# Usage
client = AsyncOpenAI(
    api_key="your-key",
    base_url="https://api.deepseek.com"
)
urls = ['https://example.com/product1', 'https://example.com/product2']
results = asyncio.run(extract_batch(urls, client))
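One caveat: the sketch above launches every request at once, which can hit API rate limits on larger URL lists. A common fix is to cap concurrency with asyncio.Semaphore. Here's a minimal sketch, assuming a limit of five in-flight calls:

import asyncio

async def gather_limited(coros, limit=5):
    """Run coroutines with at most `limit` awaited at any one time."""
    semaphore = asyncio.Semaphore(limit)

    async def run(coro):
        # The coroutine only starts executing once the semaphore is acquired
        async with semaphore:
            return await coro

    return await asyncio.gather(*(run(c) for c in coros))

# Inside extract_batch, swap asyncio.gather(*tasks) for:
# results = await gather_limited(tasks, limit=5)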
Structured Output with Schema Validation
Define precise JSON schemas for consistent extraction:
schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "price": {"type": "number"},
        "currency": {"type": "string"},
        "in_stock": {"type": "boolean"},
        "images": {
            "type": "array",
            "items": {"type": "string"}
        }
    },
    "required": ["title", "price"]
}

prompt = f"""
Extract data matching this JSON schema:
{json.dumps(schema, indent=2)}

Only return valid JSON matching this exact structure.
"""
Combining Deepseek with Browser Automation
For JavaScript-heavy websites, combine Deepseek with a browser automation tool such as Puppeteer or Playwright. The example below uses Playwright's Python API:
from playwright.sync_api import sync_playwright

def scrape_dynamic_page(url):
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        # Wait for dynamic content to load
        page.wait_for_selector('.product-data')
        # Get rendered HTML
        html = page.content()
        browser.close()
    # Now use Deepseek to extract data (extract_with_deepseek wraps the API call shown earlier)
    extracted = extract_with_deepseek(html)
    return extracted
This approach is particularly useful when dealing with AJAX requests and dynamically loaded content.
Error Handling and Validation
Always implement robust error handling:
import json
import time

def safe_extract(html, max_retries=3):
    """Extract data with retry logic and validation"""
    for attempt in range(max_retries):
        try:
            completion = client.chat.completions.create(
                model="deepseek-chat",
                messages=[
                    {"role": "system", "content": "Extract data as JSON."},
                    {"role": "user", "content": html}
                ],
                response_format={"type": "json_object"},
                timeout=30
            )
            data = json.loads(completion.choices[0].message.content)
            # Validate required fields
            required_fields = ['title', 'price']
            if all(field in data for field in required_fields):
                return data
            else:
                raise ValueError("Missing required fields")
        except json.JSONDecodeError:
            if attempt < max_retries - 1:
                continue
            raise
        except Exception as e:
            print(f"Attempt {attempt + 1} failed: {e}")
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)  # Exponential backoff
            else:
                raise
    return None
Cost Optimization Tips
Deepseek is cost-effective, but you can optimize further:
- Preprocess HTML: Remove unnecessary tags, scripts, and styles before sending to the API
- Use focused selectors: Extract only relevant sections using BeautifulSoup or similar
- Cache results: Store extracted data to avoid re-processing identical pages (a caching sketch follows the cleanup helper below)
- Batch requests: Process multiple extractions in a single API call when possible
- Set lower temperature: Use temperature=0 or 0.1 for deterministic, focused extraction
def clean_html(raw_html):
    """Remove unnecessary elements to reduce token usage"""
    soup = BeautifulSoup(raw_html, 'html.parser')
    # Remove scripts, styles, and navigation chrome
    for tag in soup(['script', 'style', 'nav', 'footer', 'header']):
        tag.decompose()
    # Return plain text; markup structure is lost, but token usage drops sharply
    return soup.get_text(separator=' ', strip=True)
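To act on the caching tip from the list above, here's a minimal in-memory sketch keyed on a hash of the page content. It reuses extract_with_deepseek, the API wrapper referenced throughout this article; a production pipeline might persist the cache to disk or a database instead:

import hashlib

_extraction_cache = {}

def extract_with_cache(html):
    """Skip the API call entirely when identical content was already processed."""
    key = hashlib.sha256(html.encode('utf-8')).hexdigest()
    if key not in _extraction_cache:
        # extract_with_deepseek wraps the Deepseek API call shown earlier
        _extraction_cache[key] = extract_with_deepseek(html)
    return _extraction_cache[key]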
Handling Authentication and Sessions
For pages requiring authentication, fetch HTML with proper session handling:
session = requests.Session()
session.headers.update({'User-Agent': 'Mozilla/5.0...'})

# Login
session.post('https://example.com/login', data={
    'username': 'user',
    'password': 'pass'
})

# Scrape authenticated page
html = session.get('https://example.com/protected-data').text
extracted = extract_with_deepseek(html)
For complex authentication scenarios with browser sessions, consider using browser automation tools before passing HTML to Deepseek.
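A minimal sketch of that pattern with Playwright, assuming a simple form login (the URL and selectors are placeholders to adapt to the target site):

from playwright.sync_api import sync_playwright

def fetch_authenticated_html(login_url, target_url, username, password):
    """Log in through a real browser, then return the rendered HTML of a protected page."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(login_url)
        # Placeholder selectors -- adjust to the site's actual login form
        page.fill('input[name="username"]', username)
        page.fill('input[name="password"]', password)
        page.click('button[type="submit"]')
        page.wait_for_load_state('networkidle')
        page.goto(target_url)
        html = page.content()
        browser.close()
    return html  # pass this to extract_with_deepseek as before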
When to Use Deepseek vs Traditional Scraping
Use Deepseek when:
- Page structure changes frequently
- Data is presented in varied formats
- You need semantic understanding (e.g., extracting sentiment or categorizing content)
- Working with unstructured content like articles or reviews

Use traditional CSS/XPath when:
- Page structure is stable and predictable
- You need maximum speed and minimal cost
- Extracting simple, well-structured data
- Processing millions of pages at scale
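The two approaches also combine well. Here's a minimal hybrid sketch that tries a cheap CSS selector first and falls back to Deepseek only when the selector comes up empty (the selector and field name are placeholders):

from bs4 import BeautifulSoup

def extract_title(html):
    """Try a fast CSS selector first; fall back to the LLM when it fails."""
    soup = BeautifulSoup(html, 'html.parser')
    node = soup.select_one('h1.product-title')  # placeholder selector
    if node and node.get_text(strip=True):
        return node.get_text(strip=True)
    # Selector missed: let Deepseek interpret the page semantically
    return extract_with_deepseek(html).get('title')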
Conclusion
Deepseek provides a flexible, AI-powered approach to web scraping that can adapt to changing page structures and extract semantic information that traditional methods struggle with. By combining Deepseek with traditional web scraping tools and proper preprocessing, you can build robust data extraction pipelines that are both intelligent and cost-effective.
The key to success is finding the right balance between AI-powered extraction and traditional methods, preprocessing HTML to reduce token usage, and implementing proper error handling and validation throughout your scraping workflow.