How do I use function calling with Deepseek LLM for web scraping?
Function calling with Deepseek LLM enables structured, reliable web scraping by allowing the model to invoke predefined functions for data extraction. This approach combines the intelligence of large language models with the precision of traditional programming, making it ideal for complex web scraping tasks where you need both flexibility and consistency.
Understanding Function Calling in Deepseek
Function calling is a feature in modern LLMs that allows the model to generate structured function calls based on natural language instructions. Instead of returning unstructured text, Deepseek can identify when to call specific functions and return properly formatted JSON parameters that your code can execute.
For web scraping, this means you can:
- Define extraction functions with specific schemas
- Let Deepseek determine which function to call based on page content
- Receive structured data in a predictable format
- Chain multiple extraction steps together
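When the model decides to call one of your functions, the API returns the call as structured data rather than prose. A response message looks roughly like this (the values are illustrative):
{
  "role": "assistant",
  "content": null,
  "function_call": {
    "name": "extract_product_data",
    "arguments": "{\"name\": \"Example Widget\", \"price\": 19.99, \"currency\": \"USD\"}"
  }
}
Note that arguments arrives as a JSON-encoded string, which is why the examples below parse it with json.loads.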
Setting Up Function Calling with Deepseek
Prerequisites
First, install the required dependencies:
pip install openai requests beautifulsoup4
The Deepseek API is compatible with the OpenAI SDK, which makes integration straightforward.
Defining Scraping Functions
Start by defining the functions you want Deepseek to call. Here's an example for extracting product information:
import openai
import requests
from bs4 import BeautifulSoup
# Configure Deepseek API
client = openai.OpenAI(
api_key="your-deepseek-api-key",
base_url="https://api.deepseek.com"
)
# Define function schemas
functions = [
{
"name": "extract_product_data",
"description": "Extract structured product information from HTML",
"parameters": {
"type": "object",
"properties": {
"name": {
"type": "string",
"description": "The product name"
},
"price": {
"type": "number",
"description": "The product price as a number"
},
"currency": {
"type": "string",
"description": "The currency code (e.g., USD, EUR)"
},
"availability": {
"type": "string",
"enum": ["in_stock", "out_of_stock", "preorder"],
"description": "Product availability status"
},
"rating": {
"type": "number",
"description": "Product rating (0-5)"
},
"reviews_count": {
"type": "integer",
"description": "Number of customer reviews"
}
},
"required": ["name", "price", "currency"]
}
},
{
"name": "extract_article_metadata",
"description": "Extract metadata from article or blog post pages",
"parameters": {
"type": "object",
"properties": {
"title": {"type": "string"},
"author": {"type": "string"},
"publish_date": {"type": "string"},
"tags": {
"type": "array",
"items": {"type": "string"}
},
"summary": {"type": "string"}
},
"required": ["title"]
}
}
]
Implementing the Scraping Workflow
Basic Function Calling Example
Here's a complete example that fetches a webpage and uses Deepseek to extract structured data:
def scrape_with_function_calling(url):
# Fetch the HTML content
response = requests.get(url, headers={
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
})
html_content = response.text
# Use BeautifulSoup to extract clean text
soup = BeautifulSoup(html_content, 'html.parser')
# Remove script and style elements
for script in soup(["script", "style"]):
script.decompose()
page_text = soup.get_text(separator='\n', strip=True)
# Truncate if too long (Deepseek has token limits)
max_chars = 10000
if len(page_text) > max_chars:
page_text = page_text[:max_chars]
# Call Deepseek with function definitions
messages = [
{
"role": "system",
"content": "You are a web scraping assistant. Extract data from the provided HTML content using the available functions."
},
{
"role": "user",
"content": f"Extract all relevant information from this page:\n\n{page_text}"
}
]
response = client.chat.completions.create(
model="deepseek-chat",
messages=messages,
functions=functions,
function_call="auto" # Let the model decide which function to call
)
return response
# Example usage
result = scrape_with_function_calling("https://example.com/product/12345")
message = result.choices[0].message
if message.function_call:
print(f"Function called: {message.function_call.name}")
print(f"Arguments: {message.function_call.arguments}")
Processing Function Call Results
After Deepseek returns a function call, you need to process the results:
import json
def process_extraction_result(response):
message = response.choices[0].message
if not message.function_call:
# No function was called, return the text response
return {"type": "text", "content": message.content}
# Parse the function call
function_name = message.function_call.name
arguments = json.loads(message.function_call.arguments)
# Execute the extraction based on function name
if function_name == "extract_product_data":
return {
"type": "product",
"data": arguments
}
elif function_name == "extract_article_metadata":
return {
"type": "article",
"data": arguments
}
return arguments
# Process the result
extracted_data = process_extraction_result(result)
print(json.dumps(extracted_data, indent=2))
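For a product page, the processed result might look something like this (sample values for illustration only):
{
  "type": "product",
  "data": {
    "name": "Example Widget",
    "price": 19.99,
    "currency": "USD",
    "availability": "in_stock"
  }
}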
JavaScript Implementation
Here's how to implement function calling with Deepseek in JavaScript:
const OpenAI = require('openai');
const axios = require('axios');
const cheerio = require('cheerio');
const client = new OpenAI({
apiKey: 'your-deepseek-api-key',
baseURL: 'https://api.deepseek.com'
});
const functions = [
{
name: 'extract_product_data',
description: 'Extract structured product information from HTML',
parameters: {
type: 'object',
properties: {
name: { type: 'string', description: 'The product name' },
price: { type: 'number', description: 'The product price' },
currency: { type: 'string', description: 'Currency code' },
availability: {
type: 'string',
enum: ['in_stock', 'out_of_stock', 'preorder']
}
},
required: ['name', 'price', 'currency']
}
}
];
async function scrapeWithFunctionCalling(url) {
// Fetch HTML
const response = await axios.get(url, {
headers: {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}
});
// Parse HTML
const $ = cheerio.load(response.data);
// Remove scripts and styles
$('script, style').remove();
const pageText = $('body').text().trim().substring(0, 10000);
// Call Deepseek API
const completion = await client.chat.completions.create({
model: 'deepseek-chat',
messages: [
{
role: 'system',
content: 'Extract product data from the provided page content.'
},
{
role: 'user',
content: `Extract information from this page:\n\n${pageText}`
}
],
functions: functions,
function_call: 'auto'
});
const message = completion.choices[0].message;
if (message.function_call) {
return {
function: message.function_call.name,
data: JSON.parse(message.function_call.arguments)
};
}
return { type: 'text', content: message.content };
}
// Usage
scrapeWithFunctionCalling('https://example.com/product/12345')
.then(result => console.log(JSON.stringify(result, null, 2)))
.catch(error => console.error('Error:', error));
Advanced Patterns for Web Scraping
Multi-Step Extraction with Function Chaining
For complex pages, you might need multiple function calls:
def multi_step_scraping(url):
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
page_text = soup.get_text(separator='\n', strip=True)[:10000]
messages = [
{
"role": "system",
"content": "You are a web scraping expert. Analyze pages and extract data systematically."
},
{
"role": "user",
"content": f"First, identify what type of page this is:\n\n{page_text}"
}
]
# First call: Identify page type
response = client.chat.completions.create(
model="deepseek-chat",
messages=messages,
functions=[
{
"name": "identify_page_type",
"description": "Identify the type of webpage",
"parameters": {
"type": "object",
"properties": {
"page_type": {
"type": "string",
"enum": ["product", "article", "listing", "other"]
}
},
"required": ["page_type"]
}
}
],
function_call={"name": "identify_page_type"}
)
page_type_data = json.loads(response.choices[0].message.function_call.arguments)
page_type = page_type_data["page_type"]
# Second call: Extract data based on page type
    messages.append(response.choices[0].message)  # keep the assistant's function call in the history
messages.append({
"role": "user",
"content": f"Now extract all {page_type} data from the page."
})
# Select appropriate function based on page type
extraction_functions = {
"product": functions[0], # extract_product_data
"article": functions[1] # extract_article_metadata
}
final_response = client.chat.completions.create(
model="deepseek-chat",
messages=messages,
functions=[extraction_functions.get(page_type, functions[0])],
function_call="auto"
)
return process_extraction_result(final_response)
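Usage mirrors the single-step example; the URL below is a placeholder:
result = multi_step_scraping("https://example.com/some-page")
print(json.dumps(result, indent=2))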
Batch Processing Multiple Pages
When scraping multiple pages, implement batch processing with rate limiting:
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
def scrape_multiple_urls(urls, max_workers=3, delay=1):
results = []
def scrape_single(url):
try:
result = scrape_with_function_calling(url)
extracted = process_extraction_result(result)
time.sleep(delay) # Rate limiting
return {"url": url, "data": extracted, "success": True}
except Exception as e:
return {"url": url, "error": str(e), "success": False}
with ThreadPoolExecutor(max_workers=max_workers) as executor:
futures = {executor.submit(scrape_single, url): url for url in urls}
for future in as_completed(futures):
results.append(future.result())
return results
# Example
urls = [
"https://example.com/product/1",
"https://example.com/product/2",
"https://example.com/product/3"
]
batch_results = scrape_multiple_urls(urls)
for result in batch_results:
if result["success"]:
print(f"Successfully scraped {result['url']}")
print(json.dumps(result["data"], indent=2))
Error Handling and Validation
Implement robust error handling for production use:
def safe_function_calling(url, max_retries=3):
for attempt in range(max_retries):
try:
# Fetch content
response = requests.get(url, timeout=10)
response.raise_for_status()
# Parse HTML
soup = BeautifulSoup(response.text, 'html.parser')
page_text = soup.get_text(separator='\n', strip=True)[:10000]
# Call Deepseek
completion = client.chat.completions.create(
model="deepseek-chat",
messages=[
{"role": "system", "content": "Extract structured data."},
{"role": "user", "content": f"Extract data:\n\n{page_text}"}
],
functions=functions,
function_call="auto",
temperature=0 # Use 0 for deterministic results
)
message = completion.choices[0].message
if not message.function_call:
raise ValueError("No function call returned")
# Validate the extracted data
data = json.loads(message.function_call.arguments)
# Basic validation
if message.function_call.name == "extract_product_data":
if not data.get("name") or not data.get("price"):
raise ValueError("Missing required product fields")
return {
"success": True,
"function": message.function_call.name,
"data": data
}
except requests.RequestException as e:
if attempt == max_retries - 1:
return {"success": False, "error": f"HTTP error: {str(e)}"}
time.sleep(2 ** attempt) # Exponential backoff
except json.JSONDecodeError as e:
return {"success": False, "error": f"JSON parsing error: {str(e)}"}
except Exception as e:
if attempt == max_retries - 1:
return {"success": False, "error": f"Extraction error: {str(e)}"}
time.sleep(1)
return {"success": False, "error": "Max retries exceeded"}
Best Practices
1. Design Clear Function Schemas
Make your function parameters specific and well-documented:
{
"name": "extract_contact_info",
"description": "Extract contact information from a business or contact page",
"parameters": {
"type": "object",
"properties": {
"email": {
"type": "string",
"description": "Email address in standard format (e.g., contact@example.com)"
},
"phone": {
"type": "string",
"description": "Phone number with country code if available"
},
"address": {
"type": "object",
"properties": {
"street": {"type": "string"},
"city": {"type": "string"},
"country": {"type": "string"},
"postal_code": {"type": "string"}
}
}
}
}
}
2. Optimize Token Usage
Reduce costs by sending only relevant content to the API:
def extract_relevant_content(html, content_type="product"):
soup = BeautifulSoup(html, 'html.parser')
# Target specific sections based on common patterns
if content_type == "product":
# Look for product containers
product_section = soup.find('div', {'class': ['product', 'item', 'product-detail']})
if product_section:
return product_section.get_text(separator='\n', strip=True)
# Fallback to general extraction
for tag in soup(['script', 'style', 'nav', 'footer', 'header']):
tag.decompose()
return soup.get_text(separator='\n', strip=True)[:8000]
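Here's a brief sketch of wiring this into the extraction call (placeholder URL; assumes the client and functions defined earlier):
response = requests.get("https://example.com/product/12345")
relevant_text = extract_relevant_content(response.text, content_type="product")
completion = client.chat.completions.create(
    model="deepseek-chat",
    messages=[
        {"role": "system", "content": "Extract structured data."},
        {"role": "user", "content": f"Extract data:\n\n{relevant_text}"}
    ],
    functions=functions,
    function_call="auto",
    temperature=0
)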
3. Use Temperature=0 for Consistency
For web scraping, you want extraction results that are as repeatable as possible; a temperature of 0 minimizes run-to-run variation:
completion = client.chat.completions.create(
model="deepseek-chat",
messages=messages,
functions=functions,
function_call="auto",
temperature=0 # Ensures consistent extraction
)
4. Combine with Traditional Scraping
Use traditional web scraping tools for structure and Deepseek for understanding:
def hybrid_scraping(url):
# Use requests/BeautifulSoup for structure
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
# Extract structured parts traditionally
title = soup.find('h1')
price_element = soup.find('span', {'class': 'price'})
# Use Deepseek for complex parts
    description_html = soup.find('div', {'class': 'description'})
    features_data = {}  # fallback so the return below works when no description is found
    if description_html:
description_text = description_html.get_text()
# Use function calling for intelligent extraction
features_response = client.chat.completions.create(
model="deepseek-chat",
messages=[{
"role": "user",
"content": f"Extract key features as a list:\n{description_text}"
}],
functions=[{
"name": "extract_features",
"parameters": {
"type": "object",
"properties": {
"features": {
"type": "array",
"items": {"type": "string"}
}
}
}
}],
function_call={"name": "extract_features"}
)
features_data = json.loads(
features_response.choices[0].message.function_call.arguments
)
return {
"title": title.text if title else None,
"price": price_element.text if price_element else None,
"features": features_data.get("features", [])
}
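As before, usage is a single call (placeholder URL):
product = hybrid_scraping("https://example.com/product/12345")
print(json.dumps(product, indent=2))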
When to Use Function Calling vs. Other Methods
Function calling with Deepseek is ideal when:
- You need structured, validated output
- The page structure varies but semantic content is consistent
- You're extracting complex entities that require understanding
- You want to avoid maintaining brittle CSS selectors
For simpler tasks, traditional parsing is usually faster and cheaper; and for pages that render their content with JavaScript, you will also need a headless browser (e.g., Playwright or Puppeteer) to obtain the HTML in the first place.
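For comparison, here's a purely selector-based version of the product extraction; the CSS classes are hypothetical and would need to be rewritten for every target site, which is exactly the brittleness function calling avoids:
import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com/product/12345").text
soup = BeautifulSoup(html, 'html.parser')
# Hypothetical selectors -- adjust for the actual markup of the target site
name_el = soup.select_one('.product-title')
price_el = soup.select_one('.price')
product = {
    "name": name_el.get_text(strip=True) if name_el else None,
    "price": price_el.get_text(strip=True) if price_el else None,
}
print(product)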
Conclusion
Function calling with Deepseek LLM provides a powerful middle ground between fully manual parsing and completely AI-driven extraction. By defining clear function schemas and combining LLM intelligence with traditional scraping techniques, you can build robust, maintainable web scraping solutions that handle real-world complexity while maintaining structure and reliability.
The key to success is thoughtful function design, proper error handling, and knowing when to use AI versus traditional methods. Start with clear, simple functions and gradually expand as you understand your data extraction needs better.