How do I use an LLM for web scraping with Deepseek?
Using Deepseek LLM for web scraping represents a paradigm shift from traditional selector-based extraction to intelligent, context-aware data parsing. Deepseek's powerful language models can understand page structure, extract relevant information, and transform unstructured HTML into structured data without requiring brittle XPath or CSS selectors.
What is LLM-based web scraping?
LLM-based web scraping uses large language models to interpret and extract data from web pages. Instead of writing complex selectors that break when page layouts change, you describe what data you want in natural language, and the LLM intelligently extracts it. Deepseek offers several models optimized for this task, including DeepSeek-V3 and DeepSeek-R1.
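As a rough illustration of the difference (the HTML, selector, and prompt here are made up for the example):
from bs4 import BeautifulSoup

html = '<div class="product-info"><span class="price">$19.99</span></div>'  # toy HTML for illustration

# Traditional scraping: a brittle, layout-dependent selector
soup = BeautifulSoup(html, 'html.parser')
price = soup.select_one('div.product-info span.price').get_text(strip=True)

# LLM-based scraping: describe what you want; the HTML plus this prompt is sent
# to the model, which returns structured JSON (full workflow shown below)
extraction_prompt = "Extract the product's price and currency as JSON."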
Setting up Deepseek for web scraping
Prerequisites
Before you begin, you'll need:
- A Deepseek API key (get one from platform.deepseek.com)
- Python 3.7+ or Node.js 14+ installed
- A library for making HTTP requests (requests, axios, or similar)
Installation
Python:
pip install openai requests beautifulsoup4
JavaScript:
npm install openai axios cheerio
Basic LLM web scraping workflow
The typical workflow for LLM-based web scraping with Deepseek involves:
- Fetch the HTML content from the target page
- Optionally clean or simplify the HTML
- Send the HTML to Deepseek with a prompt describing what to extract
- Parse the structured response
Python implementation
Here's a complete example using Python:
from openai import OpenAI
import requests
import json
# Initialize Deepseek client
client = OpenAI(
    api_key="your-deepseek-api-key",
    base_url="https://api.deepseek.com"
)

def scrape_with_deepseek(url, extraction_prompt):
    """
    Scrape a webpage using Deepseek LLM.

    Args:
        url: Target webpage URL
        extraction_prompt: Natural language description of data to extract

    Returns:
        Extracted data as a dictionary
    """
    # Fetch the webpage
    response = requests.get(url, headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    })
    html_content = response.text

    # Create the prompt for Deepseek
    system_prompt = """You are a web scraping assistant. Extract the requested
information from the HTML and return it as valid JSON. Be precise and only
extract data that is clearly present in the HTML."""

    # Truncate the HTML to stay within token limits
    user_prompt = f"""Extract the following information from this HTML:
{extraction_prompt}

HTML Content:
{html_content[:8000]}

Return the data as a JSON object."""

    # Call Deepseek API
    completion = client.chat.completions.create(
        model="deepseek-chat",  # or "deepseek-reasoner" for complex tasks
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ],
        response_format={'type': 'json_object'},  # Ensure JSON output
        temperature=0.1  # Lower temperature for more consistent extraction
    )

    # Parse and return the result
    result = json.loads(completion.choices[0].message.content)
    return result
# Example usage
url = "https://example.com/product-page"
prompt = """Extract:
- Product title
- Price
- Description
- Availability status
- Customer rating"""
data = scrape_with_deepseek(url, prompt)
print(json.dumps(data, indent=2))
JavaScript implementation
Here's the equivalent implementation in Node.js:
const OpenAI = require('openai');
const axios = require('axios');
// Initialize Deepseek client
const client = new OpenAI({
  apiKey: 'your-deepseek-api-key',
  baseURL: 'https://api.deepseek.com'
});

async function scrapeWithDeepseek(url, extractionPrompt) {
  try {
    // Fetch the webpage
    const response = await axios.get(url, {
      headers: {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
      }
    });
    const htmlContent = response.data;

    // Create the prompt
    const systemPrompt = `You are a web scraping assistant. Extract the requested
information from the HTML and return it as valid JSON. Be precise and only
extract data that is clearly present in the HTML.`;

    const userPrompt = `Extract the following information from this HTML:
${extractionPrompt}

HTML Content:
${htmlContent.substring(0, 8000)}

Return the data as a JSON object.`;

    // Call Deepseek API
    const completion = await client.chat.completions.create({
      model: 'deepseek-chat',
      messages: [
        { role: 'system', content: systemPrompt },
        { role: 'user', content: userPrompt }
      ],
      response_format: { type: 'json_object' },
      temperature: 0.1
    });

    // Parse and return result
    const result = JSON.parse(completion.choices[0].message.content);
    return result;
  } catch (error) {
    console.error('Scraping error:', error.message);
    throw error;
  }
}
// Example usage
const url = 'https://example.com/product-page';
const prompt = `Extract:
- Product title
- Price
- Description
- Availability status
- Customer rating`;
scrapeWithDeepseek(url, prompt)
  .then(data => console.log(JSON.stringify(data, null, 2)))
  .catch(error => console.error(error));
Advanced techniques
Using DeepSeek-R1 for complex reasoning
For pages with complex layouts, or when you need the model to perform reasoning, use the deepseek-reasoner model:
completion = client.chat.completions.create(
    model="deepseek-reasoner",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt}
    ],
    response_format={'type': 'json_object'}
)

# Access the reasoning process
reasoning = completion.choices[0].message.reasoning_content
result = json.loads(completion.choices[0].message.content)

print("Reasoning:", reasoning)
print("Result:", result)
Handling pagination and dynamic content
When scraping pages with dynamic content or AJAX requests, combine Deepseek with browser automation:
from playwright.sync_api import sync_playwright

def scrape_dynamic_page_with_llm(url, extraction_prompt):
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()

        # Navigate and wait for content
        page.goto(url)
        page.wait_for_load_state('networkidle')

        # Get rendered HTML
        html_content = page.content()
        browser.close()

    # Now use Deepseek to extract data
    return extract_with_deepseek(html_content, extraction_prompt)
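Both this example and the hybrid example later call extract_with_deepseek, which works on HTML you have already fetched rather than on a URL. A minimal sketch, reusing the Deepseek client and prompts from scrape_with_deepseek above:
def extract_with_deepseek(html_content, extraction_prompt):
    """Sketch: like scrape_with_deepseek, but for HTML that has already been fetched."""
    system_prompt = """You are a web scraping assistant. Extract the requested
information from the HTML and return it as valid JSON. Be precise and only
extract data that is clearly present in the HTML."""

    user_prompt = f"""Extract the following information from this HTML:
{extraction_prompt}

HTML Content:
{html_content[:8000]}

Return the data as a JSON object."""

    completion = client.chat.completions.create(
        model="deepseek-chat",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ],
        response_format={'type': 'json_object'},
        temperature=0.1
    )
    return json.loads(completion.choices[0].message.content)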
Extracting data from multiple pages
For scraping multiple pages efficiently:
import concurrent.futures

def scrape_multiple_urls(urls, extraction_prompt):
    """
    Scrape multiple URLs in parallel using Deepseek
    """
    def scrape_single(url):
        try:
            return {
                'url': url,
                'data': scrape_with_deepseek(url, extraction_prompt),
                'success': True
            }
        except Exception as e:
            return {
                'url': url,
                'error': str(e),
                'success': False
            }

    # Use ThreadPoolExecutor for parallel processing
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        results = list(executor.map(scrape_single, urls))

    return results

# Example
urls = [
    'https://example.com/product/1',
    'https://example.com/product/2',
    'https://example.com/product/3'
]

results = scrape_multiple_urls(urls, "Extract product title and price")
Optimizing HTML for LLM processing
To reduce token usage and improve accuracy, clean the HTML before sending it to Deepseek:
from bs4 import BeautifulSoup, Comment

def clean_html_for_llm(html_content):
    """
    Remove unnecessary elements to reduce token count
    """
    soup = BeautifulSoup(html_content, 'html.parser')

    # Remove script and style elements
    for element in soup(['script', 'style', 'noscript', 'iframe']):
        element.decompose()

    # Remove HTML comments
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        comment.extract()

    # Return the markup with its structure preserved
    return str(soup)

# Use in scraping
html_content = requests.get(url).text
cleaned_html = clean_html_for_llm(html_content)
# Now send cleaned_html to Deepseek
Handling errors and retries
Implement robust error handling when working with LLMs:
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=4, max=10)
)
def scrape_with_retry(url, extraction_prompt):
    """
    Scrape with automatic retries on failure
    """
    try:
        result = scrape_with_deepseek(url, extraction_prompt)

        # Validate the result
        if not result or len(result) == 0:
            raise ValueError("Empty result from LLM")

        return result
    except Exception as e:
        print(f"Error scraping {url}: {str(e)}")
        raise
Cost optimization strategies
Deepseek offers competitive pricing, but for large-scale scraping, consider these optimizations:
- Pre-filter HTML: Extract only relevant sections before sending to the LLM
- Batch requests: Group multiple small extraction tasks into a single API call (a sketch follows this list)
- Cache results: Store extracted data to avoid re-processing identical pages
- Use appropriate models: DeepSeek-Chat for simple extraction, DeepSeek-R1 only when reasoning is needed
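The batching point can be sketched roughly as follows, assuming you already have a list of small, pre-cleaned HTML snippets (for example, individual product cards) and the client defined earlier; the per-snippet character cap is an arbitrary safeguard, not a Deepseek requirement:
def batch_extract_with_deepseek(html_snippets, extraction_prompt):
    """Rough sketch: extract from several small HTML snippets in a single API call."""
    # Number the snippets so the model can return one result per snippet, in order
    numbered = "\n\n".join(
        f"--- Snippet {i + 1} ---\n{snippet[:2000]}"  # arbitrary per-snippet cap
        for i, snippet in enumerate(html_snippets)
    )
    user_prompt = f"""For each numbered snippet below, do the following: {extraction_prompt}
Return a JSON object with a "results" array containing one entry per snippet, in order.

{numbered}"""

    completion = client.chat.completions.create(
        model="deepseek-chat",
        messages=[
            {"role": "system", "content": "You are a web scraping assistant. Return valid JSON."},
            {"role": "user", "content": user_prompt}
        ],
        response_format={'type': 'json_object'},
        temperature=0.1
    )
    return json.loads(completion.choices[0].message.content)["results"]
The caching point is illustrated below with a simple file-based cache: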
import hashlib
import json
from pathlib import Path

class CachedScraper:
    def __init__(self, cache_dir='./scrape_cache'):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(exist_ok=True)

    def get_cache_key(self, url, prompt):
        """Generate cache key from URL and prompt"""
        combined = f"{url}:{prompt}"
        return hashlib.md5(combined.encode()).hexdigest()

    def scrape_cached(self, url, extraction_prompt):
        """Scrape with caching"""
        cache_key = self.get_cache_key(url, extraction_prompt)
        cache_file = self.cache_dir / f"{cache_key}.json"

        # Check cache
        if cache_file.exists():
            with open(cache_file, 'r') as f:
                return json.load(f)

        # Scrape and cache
        result = scrape_with_deepseek(url, extraction_prompt)
        with open(cache_file, 'w') as f:
            json.dump(result, f)

        return result
Combining traditional scraping with LLM extraction
For best results, you can navigate to specific sections using traditional tools, then use Deepseek for intelligent extraction:
def hybrid_scraping_approach(url):
    """
    Use BeautifulSoup to isolate relevant sections,
    then Deepseek to extract structured data
    """
    # Fetch page
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')

    # Find the product container using traditional selectors
    product_section = soup.find('div', {'class': 'product-details'})

    if product_section:
        # Now use LLM to extract from this specific section
        section_html = str(product_section)
        result = extract_with_deepseek(
            section_html,
            "Extract product title, price, and specifications"
        )
        return result

    return None
Best practices
- Be specific in prompts: Clearly describe the exact data you want and the expected format
- Use JSON mode: Enable response_format={'type': 'json_object'} for structured output
- Set low temperature: Use a temperature around 0.1 for consistent extraction
- Validate outputs: Always validate the LLM's response before using it
- Handle failures gracefully: Implement retries and fallback mechanisms
- Monitor token usage: Track API costs and optimize HTML preprocessing
- Respect rate limits: Implement proper rate limiting in your scraping code (see the sketch after this list)
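A minimal sketch of the rate-limiting point, reusing scrape_with_deepseek from earlier; the one-second delay is an arbitrary placeholder rather than a documented Deepseek limit:
import time

def scrape_urls_politely(urls, extraction_prompt, delay_seconds=1.0):
    """Sketch: sequential scraping with a fixed pause between requests."""
    results = []
    for url in urls:
        results.append(scrape_with_deepseek(url, extraction_prompt))
        time.sleep(delay_seconds)  # crude rate limiting; tune to the target site and your API tier
    return results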
Conclusion
Using Deepseek LLM for web scraping provides a flexible, maintainable alternative to traditional selector-based approaches. While it may be more expensive per request than simple parsing, the ability to handle layout changes, extract semantically similar data, and process unstructured content makes it invaluable for complex scraping tasks. Start with simple extractions, optimize your prompts and HTML preprocessing, and scale up as you become familiar with the model's capabilities.