How do I use Deepseek with Python for web scraping?
Using Deepseek with Python for web scraping combines the semantic understanding of a large language model (LLM) with traditional scraping techniques to extract structured data from HTML. Deepseek is a cost-effective LLM that excels at understanding and parsing unstructured web data, making it ideal for complex scraping tasks where traditional CSS selectors or XPath fall short.
Why Use Deepseek for Web Scraping?
Deepseek offers several advantages for Python web scraping projects:
- Cost-effective: Significantly cheaper than GPT-4 and other premium LLMs
- Semantic understanding: Extracts data based on meaning rather than rigid selectors
- Adaptive parsing: Handles layout changes and inconsistent HTML structures
- Structured output: Converts messy HTML into clean JSON data
- OpenAI-compatible API: Easy to integrate with existing Python code
Traditional web scraping relies on CSS selectors that break when websites change their structure. AI web scraping with Deepseek provides a more resilient solution by understanding content semantically rather than structurally.
Prerequisites and Installation
Before you start, ensure you have Python 3.7 or higher installed. You'll need to install the necessary libraries:
# Install the OpenAI library (Deepseek uses OpenAI-compatible API)
pip install openai
# Install web scraping libraries
pip install requests beautifulsoup4
# Optional: For handling dynamic websites
pip install selenium playwright
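Playwright ships without browser binaries, so if you plan to use it you'll also need to download a browser once (assuming the standard Playwright CLI installed by the package above):
# Download the Chromium build that Playwright will drive
playwright install chromium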
You'll also need a Deepseek API key. If you don't have one yet, check out our guide on how to get a Deepseek API key.
Basic Python Setup with Deepseek
Configuring Your Environment
Store your API key securely using environment variables:
import os
from openai import OpenAI
# Configure the Deepseek client
client = OpenAI(
api_key=os.environ.get("DEEPSEEK_API_KEY"), # Never hardcode API keys
base_url="https://api.deepseek.com"
)
For local development, create a .env file:
DEEPSEEK_API_KEY=your-api-key-here
Then load it in your Python script:
from dotenv import load_dotenv
load_dotenv()
Simple Web Scraping Example
Here's a basic example of extracting product information from a webpage:
import os
import requests
from openai import OpenAI
import json
client = OpenAI(
api_key=os.environ.get("DEEPSEEK_API_KEY"),
base_url="https://api.deepseek.com"
)
def scrape_product_page(url):
"""Extract product data from a webpage using Deepseek"""
# Fetch the HTML content
response = requests.get(url, headers={
'User-Agent': 'Mozilla/5.0 (compatible; WebScraper/1.0)'
})
html_content = response.text
# Create a prompt for Deepseek
prompt = f"""
Extract the following product information from this HTML and return as JSON:
- name: Product name
- price: Price as a number
- currency: Currency code (USD, EUR, etc.)
- description: Product description
- availability: In stock status (true/false)
- images: Array of image URLs
HTML Content:
{html_content[:8000]}
Return ONLY valid JSON, no additional text.
"""
# Call Deepseek API
completion = client.chat.completions.create(
model="deepseek-chat",
messages=[
{
"role": "system",
"content": "You are a web scraping assistant that extracts structured data from HTML. Always return valid JSON."
},
{
"role": "user",
"content": prompt
}
],
temperature=0.0 # Use 0 for consistent, deterministic output
)
# Parse the JSON response
product_data = json.loads(completion.choices[0].message.content)
return product_data
# Example usage
url = "https://example.com/product/12345"
product = scrape_product_page(url)
print(json.dumps(product, indent=2))
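Deepseek's OpenAI-compatible API also exposes a JSON output mode, which makes the "return only JSON" instruction more robust than relying on the prompt alone. A minimal sketch, assuming your API version accepts the OpenAI-style response_format parameter (and reusing the prompt built above):

# Optional: ask the API to return a guaranteed JSON object.
# Assumes JSON mode is available; if not, keep the plain prompt and parse the text yourself.
completion = client.chat.completions.create(
    model="deepseek-chat",
    messages=[
        {"role": "system", "content": "Extract product data. Reply with a JSON object only."},
        {"role": "user", "content": prompt},
    ],
    response_format={"type": "json_object"},
    temperature=0.0
)
product_data = json.loads(completion.choices[0].message.content)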
Advanced Web Scraping Techniques
Preprocessing HTML for Better Results
To optimize token usage and improve accuracy, clean the HTML before sending it to Deepseek:
from bs4 import BeautifulSoup, Comment
def clean_html_for_llm(html_content):
"""Remove unnecessary elements to reduce tokens and improve extraction"""
soup = BeautifulSoup(html_content, 'html.parser')
# Remove script tags, styles, and navigation elements
for element in soup(['script', 'style', 'nav', 'footer', 'header', 'aside']):
element.decompose()
    # Remove HTML comments
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        comment.extract()
# Get cleaner HTML or just text
return str(soup)
def scrape_with_preprocessing(url):
"""Scrape with HTML preprocessing"""
response = requests.get(url)
cleaned_html = clean_html_for_llm(response.text)
# Now use cleaned HTML with Deepseek
completion = client.chat.completions.create(
model="deepseek-chat",
messages=[
{
"role": "user",
"content": f"Extract article title, author, date, and content as JSON:\n\n{cleaned_html[:8000]}"
}
],
temperature=0.0
)
return json.loads(completion.choices[0].message.content)
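If you only need prose (article text, product descriptions, reviews) rather than attributes such as image URLs, you can go a step further and send plain text instead of HTML, which usually cuts token usage substantially. A minimal sketch building on the same idea as clean_html_for_llm:

def html_to_text_for_llm(html_content):
    """Return visible text only - far fewer tokens, but drops URLs and attributes"""
    soup = BeautifulSoup(html_content, 'html.parser')
    for element in soup(['script', 'style', 'nav', 'footer', 'header', 'aside']):
        element.decompose()
    # separator/strip keep the text readable while collapsing layout whitespace
    return soup.get_text(separator='\n', strip=True)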
Using Function Calling for Structured Output
Function calling constrains the model to a predefined schema, making the output far more reliable than free-form JSON. This is particularly useful when you need to get structured output from an LLM:
def scrape_with_function_calling(url):
"""Extract data using function calling for guaranteed structure"""
html = requests.get(url).text
# Define the expected output structure
tools = [
{
"type": "function",
"function": {
"name": "extract_product_info",
"description": "Extract product information from HTML",
"parameters": {
"type": "object",
"properties": {
"name": {
"type": "string",
"description": "Product name"
},
"price": {
"type": "number",
"description": "Product price as a number"
},
"currency": {
"type": "string",
"description": "Currency code (USD, EUR, GBP, etc.)"
},
"in_stock": {
"type": "boolean",
"description": "Whether the product is in stock"
},
"rating": {
"type": "number",
"description": "Product rating (0-5 scale)"
},
"reviews_count": {
"type": "integer",
"description": "Number of customer reviews"
},
"images": {
"type": "array",
"items": {"type": "string"},
"description": "Array of product image URLs"
}
},
"required": ["name", "price", "currency"]
}
}
}
]
completion = client.chat.completions.create(
model="deepseek-chat",
messages=[
{
"role": "user",
"content": f"Extract product data from this HTML:\n\n{html[:8000]}"
}
],
tools=tools,
tool_choice={"type": "function", "function": {"name": "extract_product_info"}}
)
# Extract the structured data
function_args = json.loads(
completion.choices[0].message.tool_calls[0].function.arguments
)
return function_args
# Example usage
product_data = scrape_with_function_calling("https://example.com/product/12345")
print(f"Product: {product_data['name']}")
print(f"Price: {product_data['currency']} {product_data['price']}")
Batch Processing Multiple Pages
When scraping multiple pages, use concurrent processing to improve performance:
import concurrent.futures
from typing import List, Dict
def extract_data_from_page(url: str) -> Dict:
"""Extract data from a single page"""
try:
html = requests.get(url, timeout=10).text
cleaned = clean_html_for_llm(html)
completion = client.chat.completions.create(
model="deepseek-chat",
messages=[
{
"role": "user",
"content": f"Extract key data points as JSON:\n\n{cleaned[:8000]}"
}
],
temperature=0.0
)
return {
"url": url,
"data": json.loads(completion.choices[0].message.content),
"success": True
}
except Exception as e:
return {
"url": url,
"error": str(e),
"success": False
}
def scrape_multiple_pages(urls: List[str], max_workers: int = 5) -> List[Dict]:
"""Scrape multiple pages concurrently"""
results = []
with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
# Submit all tasks
future_to_url = {
executor.submit(extract_data_from_page, url): url
for url in urls
}
# Collect results as they complete
for future in concurrent.futures.as_completed(future_to_url):
results.append(future.result())
return results
# Example: Scrape product listing pages
base_url = "https://example.com/products"
urls = [f"{base_url}?page={i}" for i in range(1, 11)]
all_products = scrape_multiple_pages(urls, max_workers=5)
successful = [r for r in all_products if r['success']]
print(f"Successfully scraped {len(successful)}/{len(urls)} pages")
Combining Deepseek with Dynamic Content Scraping
For JavaScript-heavy websites, combine Deepseek with browser automation so the HTML you feed the model reflects the fully rendered page; this is the standard approach for handling dynamic websites with LLM-based scraping:
Using Selenium with Deepseek
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
def scrape_dynamic_page_with_selenium(url):
"""Scrape JavaScript-rendered content with Selenium and Deepseek"""
# Configure headless Chrome
chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--disable-dev-shm-usage")
driver = webdriver.Chrome(options=chrome_options)
try:
# Load the page
driver.get(url)
# Wait for JavaScript to render content
WebDriverWait(driver, 10).until(
EC.presence_of_element_located((By.TAG_NAME, "body"))
)
# Optional: Wait for specific elements
# WebDriverWait(driver, 10).until(
# EC.presence_of_element_located((By.CLASS_NAME, "product-listing"))
# )
# Get fully rendered HTML
html_content = driver.page_source
# Process with Deepseek
cleaned_html = clean_html_for_llm(html_content)
completion = client.chat.completions.create(
model="deepseek-chat",
messages=[
{
"role": "user",
"content": f"Extract all product listings as a JSON array:\n\n{cleaned_html[:8000]}"
}
],
temperature=0.0
)
return json.loads(completion.choices[0].message.content)
finally:
driver.quit()
# Example usage
products = scrape_dynamic_page_with_selenium("https://example.com/search?q=laptop")
print(f"Found {len(products)} products")
Using Playwright with Deepseek
Playwright is a modern alternative to Selenium:
from playwright.sync_api import sync_playwright
def scrape_with_playwright(url):
"""Scrape with Playwright and Deepseek"""
with sync_playwright() as p:
# Launch browser
browser = p.chromium.launch(headless=True)
page = browser.new_page()
# Navigate to URL
page.goto(url, wait_until="networkidle")
# Optional: Wait for specific content
page.wait_for_selector(".product-grid", timeout=10000)
# Get rendered HTML
html_content = page.content()
browser.close()
# Extract data with Deepseek
cleaned = clean_html_for_llm(html_content)
completion = client.chat.completions.create(
model="deepseek-chat",
messages=[
{
"role": "user",
"content": f"Extract product data as JSON:\n\n{cleaned[:8000]}"
}
],
temperature=0.0
)
return json.loads(completion.choices[0].message.content)
# Example usage
data = scrape_with_playwright("https://example.com/products")
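Many listing pages lazy-load items as you scroll, so the initial HTML may only contain the first screen of products. Before calling page.content(), you can scroll the page a few times so the rendered HTML includes everything you want to extract; a short sketch (the scroll distance and iteration count are illustrative and should be tuned per site):

# Inside scrape_with_playwright, after page.goto(...) and before page.content():
for _ in range(5):
    page.mouse.wheel(0, 2000)      # scroll down by 2000 pixels
    page.wait_for_timeout(1000)    # give lazy-loaded content time to render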
Error Handling and Retry Logic
Robust error handling is crucial for production web scraping: both the HTTP request and the Deepseek call can fail, so it's worth knowing which error handling strategies to use when scraping with LLMs:
import time
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type
class ScrapingError(Exception):
"""Custom exception for scraping errors"""
pass
@retry(
stop=stop_after_attempt(3),
wait=wait_exponential(multiplier=1, min=4, max=10),
retry=retry_if_exception_type((requests.RequestException, json.JSONDecodeError))
)
def scrape_with_retry(url: str) -> Dict:
"""Scrape with automatic retry logic"""
try:
# Fetch content
response = requests.get(url, timeout=15, headers={
'User-Agent': 'Mozilla/5.0 (compatible; WebScraper/1.0)'
})
response.raise_for_status()
html_content = response.text
cleaned = clean_html_for_llm(html_content)
# Extract with Deepseek
completion = client.chat.completions.create(
model="deepseek-chat",
messages=[
{
"role": "system",
"content": "Extract data and return valid JSON only."
},
{
"role": "user",
"content": f"Extract structured data:\n\n{cleaned[:8000]}"
}
],
temperature=0.0,
timeout=30.0
)
response_text = completion.choices[0].message.content
# Parse JSON response
try:
data = json.loads(response_text)
except json.JSONDecodeError:
# Try to extract JSON from response if wrapped in markdown
import re
json_match = re.search(r'\{.*\}|\[.*\]', response_text, re.DOTALL)
if json_match:
data = json.loads(json_match.group())
else:
raise ScrapingError("Could not parse JSON from LLM response")
return {
"url": url,
"data": data,
"success": True
}
except requests.RequestException as e:
print(f"Request error for {url}: {e}")
raise
except Exception as e:
print(f"Extraction error for {url}: {e}")
return {
"url": url,
"error": str(e),
"success": False
}
# Example usage with retry
result = scrape_with_retry("https://example.com/product/12345")
if result['success']:
print("Scraping successful:", result['data'])
else:
print("Scraping failed:", result['error'])
Optimizing Token Usage and Costs
Since Deepseek charges based on tokens, optimizing usage is important. Learn more about optimizing LLM costs when scraping:
def estimate_tokens(text: str) -> int:
"""Rough token estimation (1 token ≈ 4 characters for English)"""
return len(text) // 4
def chunk_html_content(html: str, max_tokens: int = 6000) -> List[str]:
"""Split large HTML into smaller chunks"""
max_chars = max_tokens * 4
chunks = []
soup = BeautifulSoup(html, 'html.parser')
# Split by main sections
sections = soup.find_all(['article', 'section', 'div'], class_=True)
current_chunk = ""
for section in sections:
section_html = str(section)
if len(current_chunk) + len(section_html) < max_chars:
current_chunk += section_html
else:
if current_chunk:
chunks.append(current_chunk)
current_chunk = section_html
if current_chunk:
chunks.append(current_chunk)
return chunks
def scrape_large_page(url: str) -> List[Dict]:
"""Handle large pages by chunking"""
html = requests.get(url).text
chunks = chunk_html_content(html, max_tokens=6000)
results = []
for i, chunk in enumerate(chunks):
print(f"Processing chunk {i+1}/{len(chunks)}")
completion = client.chat.completions.create(
model="deepseek-chat",
messages=[
{
"role": "user",
"content": f"Extract data from this HTML section:\n\n{chunk}"
}
],
temperature=0.0
)
results.append(json.loads(completion.choices[0].message.content))
# Rate limiting
time.sleep(0.5)
return results
Complete Production-Ready Example
Here's a full example combining all best practices:
import os
import json
import time
import logging
from typing import Dict, List, Optional
from dataclasses import dataclass
import requests
from bs4 import BeautifulSoup
from openai import OpenAI
from tenacity import retry, stop_after_attempt, wait_exponential
from dotenv import load_dotenv
# Configure logging
logging.basicConfig(
level=logging.INFO,
format='%(asctime)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)
# Load environment variables
load_dotenv()
@dataclass
class ScrapingResult:
"""Data class for scraping results"""
url: str
data: Optional[Dict]
success: bool
error: Optional[str] = None
tokens_used: Optional[int] = None
class DeepseekScraper:
"""Production-ready web scraper using Deepseek"""
def __init__(self, api_key: Optional[str] = None):
self.client = OpenAI(
api_key=api_key or os.environ.get("DEEPSEEK_API_KEY"),
base_url="https://api.deepseek.com"
)
self.session = requests.Session()
self.session.headers.update({
'User-Agent': 'Mozilla/5.0 (compatible; WebScraper/1.0)'
})
def clean_html(self, html: str) -> str:
"""Remove unnecessary HTML elements"""
soup = BeautifulSoup(html, 'html.parser')
for element in soup(['script', 'style', 'nav', 'footer', 'header']):
element.decompose()
return str(soup)
@retry(
stop=stop_after_attempt(3),
wait=wait_exponential(multiplier=1, min=2, max=10)
)
def extract_data(self, html: str, extraction_schema: Dict) -> Dict:
"""Extract data using Deepseek with retry logic"""
cleaned = self.clean_html(html)
# Truncate to avoid token limits
max_chars = 30000
if len(cleaned) > max_chars:
cleaned = cleaned[:max_chars]
logger.warning(f"HTML truncated to {max_chars} characters")
prompt = f"""
Extract data matching this schema and return ONLY valid JSON:
Schema: {json.dumps(extraction_schema, indent=2)}
Rules:
- Return only JSON, no markdown or explanations
- Use null for missing values
- Maintain exact field names from schema
HTML:
{cleaned}
"""
completion = self.client.chat.completions.create(
model="deepseek-chat",
messages=[
{
"role": "system",
"content": "You are a data extraction assistant. Return only valid JSON."
},
{
"role": "user",
"content": prompt
}
],
temperature=0.0
)
response_text = completion.choices[0].message.content
# Extract JSON from response
import re
json_match = re.search(r'\{.*\}|\[.*\]', response_text, re.DOTALL)
if json_match:
return json.loads(json_match.group())
else:
return json.loads(response_text)
def scrape_page(self, url: str, schema: Dict) -> ScrapingResult:
"""Scrape a single page"""
logger.info(f"Scraping: {url}")
try:
# Fetch HTML
response = self.session.get(url, timeout=15)
response.raise_for_status()
# Extract data
data = self.extract_data(response.text, schema)
logger.info(f"Successfully scraped: {url}")
return ScrapingResult(
url=url,
data=data,
success=True
)
except Exception as e:
logger.error(f"Error scraping {url}: {str(e)}")
return ScrapingResult(
url=url,
data=None,
success=False,
error=str(e)
)
def scrape_multiple(
self,
urls: List[str],
schema: Dict,
delay: float = 1.0
) -> List[ScrapingResult]:
"""Scrape multiple URLs with rate limiting"""
results = []
for i, url in enumerate(urls):
result = self.scrape_page(url, schema)
results.append(result)
# Rate limiting
if i < len(urls) - 1:
time.sleep(delay)
return results
# Example usage
if __name__ == "__main__":
# Initialize scraper
scraper = DeepseekScraper()
# Define extraction schema
product_schema = {
"name": "string",
"price": "number",
"currency": "string",
"description": "string",
"in_stock": "boolean",
"images": ["array of strings"]
}
# Scrape single page
result = scraper.scrape_page(
"https://example.com/product/12345",
product_schema
)
if result.success:
print("Product data:")
print(json.dumps(result.data, indent=2))
else:
print(f"Error: {result.error}")
# Scrape multiple pages
urls = [
"https://example.com/product/1",
"https://example.com/product/2",
"https://example.com/product/3"
]
results = scraper.scrape_multiple(urls, product_schema, delay=1.0)
successful = [r for r in results if r.success]
print(f"\nSuccessfully scraped {len(successful)}/{len(urls)} pages")
Best Practices Summary
When using Deepseek with Python for web scraping:
- Always preprocess HTML - Remove unnecessary elements to reduce token usage
- Use function calling - Ensures consistent structured output
- Implement retry logic - Handle API failures gracefully
- Respect rate limits - Add delays between requests
- Monitor token usage - Track costs and optimize prompts
- Validate outputs - Always validate JSON responses (a short validation and token-accounting sketch follows this list)
- Use environment variables - Never hardcode API keys
- Log everything - Maintain detailed logs for debugging
- Handle errors gracefully - Return meaningful error messages
- Test incrementally - Start small and scale up
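Two of these points are easy to wire into the scraper above. The sketch below validates extracted data against a schema using the jsonschema package (an extra dependency, installable with pip install jsonschema) and logs the token counts that the OpenAI-compatible response object exposes via its usage field:

from jsonschema import validate, ValidationError  # pip install jsonschema

PRODUCT_JSON_SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "price": {"type": "number"},
        "currency": {"type": "string"},
    },
    "required": ["name", "price", "currency"],
}

def is_valid_product(data: dict) -> bool:
    """Return True if the extracted data matches the expected schema"""
    try:
        validate(instance=data, schema=PRODUCT_JSON_SCHEMA)
        return True
    except ValidationError as e:
        logger.warning(f"Invalid extraction: {e.message}")
        return False

# Token accounting: 'completion' is the object returned by client.chat.completions.create
usage = completion.usage
logger.info(f"Prompt tokens: {usage.prompt_tokens}, total tokens: {usage.total_tokens}")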
Conclusion
Deepseek provides a powerful, cost-effective solution for Python web scraping projects. By combining it with traditional scraping tools like BeautifulSoup and Selenium, you can build robust data extraction pipelines that handle complex, dynamic websites intelligently. The key is to optimize token usage, implement proper error handling, and follow best practices for production deployments.
Whether you're extracting product data, parsing news articles, or monitoring competitor websites, Deepseek's semantic understanding capabilities make it an excellent choice for modern web scraping challenges.