# What are Effective Prompt Templates for Deepseek Web Scraping?
When using Deepseek for web scraping and data extraction, crafting effective prompts is crucial for getting accurate, structured results. A well-designed prompt template can dramatically improve extraction quality, reduce hallucinations, and ensure consistent output formatting. This guide covers battle-tested prompt templates specifically optimized for Deepseek's architecture.
## Understanding Deepseek's Prompt Requirements
Deepseek models excel at following structured instructions when prompts are clear, specific, and provide concrete examples. Unlike traditional web scraping tools that rely on CSS selectors or XPath, Deepseek uses natural language understanding to identify and extract data patterns from HTML or text content.
The key principles for effective Deepseek prompting include:
- Specificity: Clearly define what data to extract
- Structure: Specify the exact output format (usually JSON)
- Examples: Provide sample inputs and expected outputs
- Constraints: Define data types, required fields, and validation rules
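As a toy illustration of the four principles working together (the field names and wording here are hypothetical, not a fixed Deepseek schema), a prompt might be assembled like this:

```python
import json

def build_prompt(html: str) -> str:
    """Assemble a prompt applying the four principles above.

    Field names are illustrative only.
    """
    fields = {"title": "string", "price": "number"}       # Specificity
    example = {"title": "Demo Widget", "price": 9.99}     # Examples
    return (
        "Extract these fields and return valid JSON only.\n"  # Structure
        f"Fields: {json.dumps(fields)}\n"
        f"Example output: {json.dumps(example)}\n"
        "Use null for missing values.\n"                      # Constraints
        f"HTML:\n{html}\nJSON:"
    )

prompt = build_prompt("<h1>Demo Widget</h1>")
```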
## Basic Extraction Prompt Template
Here's a foundational template for extracting structured data with Deepseek:
```python
import requests
import json

def create_extraction_prompt(html_content, fields):
    prompt = f"""Extract the following information from the HTML content below and return it as valid JSON.

Required fields:
{json.dumps(fields, indent=2)}

Rules:
- Return only valid JSON, no additional text
- Use null for missing values
- Ensure all field names match exactly as specified
- Extract text content, not HTML tags

HTML Content:
{html_content}

JSON Output:"""
    return prompt

# Example usage
fields_to_extract = {
    "title": "string - The main product title",
    "price": "number - The price as a numeric value",
    "availability": "string - In stock status",
    "rating": "number - Average rating (0-5)"
}

html = """
<div class="product">
    <h1>Premium Wireless Headphones</h1>
    <span class="price">$149.99</span>
    <div class="stock">In Stock</div>
    <span class="rating">4.5 stars</span>
</div>
"""

prompt = create_extraction_prompt(html, fields_to_extract)

# Call Deepseek API
response = requests.post(
    "https://api.deepseek.com/v1/chat/completions",
    headers={
        "Authorization": "Bearer YOUR_API_KEY",
        "Content-Type": "application/json"
    },
    json={
        "model": "deepseek-chat",
        "messages": [
            {"role": "user", "content": prompt}
        ],
        "temperature": 0.1
    }
)

result = json.loads(response.json()['choices'][0]['message']['content'])
print(json.dumps(result, indent=2))
```
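One practical caveat: chat models sometimes wrap their JSON reply in a markdown code fence, which makes a bare `json.loads` fail. A small defensive parser (this failure mode is an assumption about model behavior, not part of the Deepseek API) makes the parsing step more robust:

```python
import json
import re

def parse_model_json(text: str):
    """Parse JSON from a model reply, tolerating a ```json ... ``` fence."""
    text = text.strip()
    # Strip a surrounding markdown fence if the model added one
    match = re.match(r"^```(?:json)?\s*(.*?)\s*```$", text, re.DOTALL)
    if match:
        text = match.group(1)
    return json.loads(text)
```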
## List Extraction Template
For extracting multiple items (like product listings or search results):
```javascript
const axios = require('axios');

async function extractList(html, itemSchema) {
  const prompt = `Extract all items from the HTML below that match this schema. Return a JSON array.

Item Schema:
${JSON.stringify(itemSchema, null, 2)}

Instructions:
- Return a JSON array of all matching items
- Each item must follow the schema exactly
- Preserve the order of items as they appear
- Skip items with insufficient data

HTML Content:
${html}

JSON Array:`;

  const response = await axios.post(
    'https://api.deepseek.com/v1/chat/completions',
    {
      model: 'deepseek-chat',
      messages: [
        { role: 'user', content: prompt }
      ],
      temperature: 0.1,
      max_tokens: 4000
    },
    {
      headers: {
        'Authorization': 'Bearer YOUR_API_KEY',
        'Content-Type': 'application/json'
      }
    }
  );

  return JSON.parse(response.data.choices[0].message.content);
}

// Example usage
const itemSchema = {
  name: "string - Product name",
  price: "number - Price in USD",
  url: "string - Relative or absolute URL"
};

const htmlList = `
<div class="product-grid">
  <div class="item">
    <a href="/product/1">Laptop Pro</a>
    <span class="price">$999</span>
  </div>
  <div class="item">
    <a href="/product/2">Mouse Wireless</a>
    <span class="price">$29.99</span>
  </div>
</div>
`;

extractList(htmlList, itemSchema).then(items => {
  console.log(JSON.stringify(items, null, 2));
});
```
## Few-Shot Learning Template
Providing examples significantly improves accuracy, especially for complex or ambiguous data:
```python
def create_few_shot_prompt(html):
    # The examples contain literal JSON braces, so the HTML is appended
    # directly instead of using str.format (which would choke on the braces).
    prompt = """Extract structured data from HTML content following these examples.

Example 1:
Input HTML: <div class="product"><h2>Blue Shirt</h2><p class="price">$25.00</p></div>
Output JSON: {"name": "Blue Shirt", "price": 25.00}

Example 2:
Input HTML: <article><h1>Red Shoes</h1><span class="cost">$79.99</span></article>
Output JSON: {"name": "Red Shoes", "price": 79.99}

Example 3:
Input HTML: <section><div class="title">Green Hat</div><div class="amount">$15</div></section>
Output JSON: {"name": "Green Hat", "price": 15.00}

Now extract from this HTML following the same pattern:
Input HTML: """ + html + """
Output JSON:"""
    return prompt
```
## Table Extraction Template
Tables require special handling to preserve row-column relationships:
```python
def extract_table(html_table):
    prompt = f"""Extract the data from this HTML table and return it as a JSON array of objects.

Requirements:
- First row is headers (use as JSON keys)
- Convert headers to lowercase snake_case
- Each subsequent row becomes an object
- Detect and convert data types (numbers, booleans, strings)
- Handle empty cells as null

HTML Table:
{html_table}

Return only the JSON array, no additional text.
JSON Array:"""
    # API call here
    return prompt
```
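The lowercase snake_case rule can also be enforced client-side rather than trusted to the model. A minimal helper (hypothetical, not part of any library used here) might look like:

```python
import re

def to_snake_case(header: str) -> str:
    """Normalize a table header to lowercase snake_case.

    e.g. 'Review Count' -> 'review_count', 'unitPrice' -> 'unit_price'.
    """
    header = re.sub(r"[^0-9a-zA-Z]+", "_", header.strip())   # non-alphanumerics -> underscores
    header = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", header)  # split camelCase boundaries
    return header.strip("_").lower()
```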
## Multi-Field Validation Template
For critical data that requires validation and multiple attempts:
```javascript
function createValidatedPrompt(html, schema) {
  return `Extract data from HTML and validate against schema.

Schema with validation rules:
${JSON.stringify(schema, null, 2)}

Validation requirements:
- email: must be valid email format
- url: must start with http:// or https://
- phone: must contain 10+ digits
- date: must be parseable date format
- price: must be positive number

HTML Content:
${html}

Return JSON with:
{
  "data": {extracted fields},
  "validation": {
    "valid": true/false,
    "errors": ["list of validation errors if any"]
  }
}

JSON Response:`;
}

// Example with validation schema
const schema = {
  email: "string (email format)",
  website: "string (URL format)",
  phone: "string (phone number)",
  established: "string (year)",
  revenue: "number (positive)"
};
```
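Even when the prompt asks the model to validate, it is safer to re-check the same rules client-side before trusting the result. A rough sketch of those checks (the regexes are deliberate simplifications, not RFC-complete validators):

```python
import re

def validate_record(record: dict) -> list:
    """Re-apply the prompt's validation rules locally; returns error messages."""
    errors = []
    if record.get("email") and not re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", record["email"]):
        errors.append("email: invalid format")
    if record.get("website") and not record["website"].startswith(("http://", "https://")):
        errors.append("website: must start with http:// or https://")
    if record.get("phone") and len(re.sub(r"\D", "", record["phone"])) < 10:
        errors.append("phone: fewer than 10 digits")
    if record.get("revenue") is not None and not (
        isinstance(record["revenue"], (int, float)) and record["revenue"] > 0
    ):
        errors.append("revenue: must be a positive number")
    return errors
```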
## Nested Data Extraction Template
For hierarchical data structures like product categories with items:
```python
def extract_nested_data(html):
    prompt = f"""Extract hierarchical data from this HTML, preserving parent-child relationships.

Return JSON with this structure:
{{
    "categories": [
        {{
            "category_name": "string",
            "items": [
                {{
                    "name": "string",
                    "details": "string"
                }}
            ]
        }}
    ]
}}

HTML Content:
{html}

JSON Output:"""
    return prompt

# Example
html_nested = """
<div class="category">
    <h2>Electronics</h2>
    <div class="item">Laptop - $999</div>
    <div class="item">Phone - $699</div>
</div>
<div class="category">
    <h2>Books</h2>
    <div class="item">Python Guide - $39</div>
</div>
"""
```
## Handling Dynamic Content
When scraping JavaScript-rendered content, you may need to combine Deepseek with a headless browser such as Puppeteer or Playwright to handle AJAX requests and capture the fully rendered HTML before passing it to Deepseek:
```python
from playwright.sync_api import sync_playwright

def scrape_and_extract_dynamic(url, extraction_prompt):
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        # Wait for dynamic content
        page.wait_for_selector('.product-list')
        html = page.content()
        browser.close()

    # Now use Deepseek to extract from the rendered HTML
    prompt = f"""{extraction_prompt}

HTML Content:
{html}

JSON Output:"""
    # call_deepseek_api is assumed to wrap the chat completions request shown earlier
    return call_deepseek_api(prompt)
```
## Error Handling and Retry Template
Build resilience into your prompts:
```javascript
async function robustExtraction(html, schema, maxRetries = 3) {
  const prompt = `Extract data from HTML. If any field cannot be found, use null.

Schema:
${JSON.stringify(schema, null, 2)}

Critical rules:
- Return ONLY valid JSON
- Use null for missing data, never omit fields
- Never return empty strings, use null instead
- Verify all required fields are present

HTML:
${html}

Valid JSON only:`;

  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      const response = await callDeepseekAPI(prompt);
      const parsed = JSON.parse(response);

      // Validate against schema
      const hasAllFields = Object.keys(schema).every(
        key => key in parsed
      );

      if (hasAllFields) {
        return parsed;
      }
    } catch (error) {
      if (attempt === maxRetries - 1) throw error;
      // Add delay before retry
      await new Promise(resolve => setTimeout(resolve, 1000));
    }
  }

  // Every attempt parsed but was missing required fields
  throw new Error(`Extraction failed after ${maxRetries} attempts`);
}
```
## Specialized Templates for Common Scenarios
### E-commerce Product Extraction
```python
ECOMMERCE_PROMPT = """Extract product information from this e-commerce page.

Required fields:
- product_name: string
- brand: string or null
- price: number (current price)
- original_price: number or null (if on sale)
- currency: string (USD, EUR, etc.)
- in_stock: boolean
- rating: number (0-5) or null
- review_count: integer or null
- images: array of image URLs
- description: string (first 200 chars)

HTML:
{html}

Return valid JSON with all fields:"""
```
### Article/Blog Post Extraction
```python
ARTICLE_PROMPT = """Extract article metadata and content.

Required structure:
{{
    "title": "string",
    "author": "string or null",
    "publish_date": "ISO date string or null",
    "category": "string or null",
    "tags": ["array of strings"],
    "content": "full article text",
    "featured_image": "URL or null",
    "word_count": integer
}}

HTML:
{html}

JSON Output:"""
```
### Contact Information Extraction
```python
CONTACT_PROMPT = """Extract all contact information from this page.

Fields to extract:
- company_name: string
- email: string or null
- phone: string or null
- address: string or null
- social_media: object with platforms as keys and URLs as values

HTML:
{html}

JSON Response:"""
```
## Best Practices for Prompt Optimization
- Keep temperature low (0.1-0.3) for consistent, deterministic outputs
- Use system messages to set extraction behavior globally
- Specify output format explicitly - JSON schema, field types
- Include edge case handling in your prompts
- Test with diverse HTML structures before production use
- Monitor token usage - compress HTML when possible by removing unnecessary tags
- Implement validation on extracted data
- Use retries with exponential backoff for failed extractions
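The token-saving tip above can be as simple as stripping non-content tags and collapsing whitespace before the HTML goes into the prompt. A crude regex-based sketch (adequate for preprocessing, not a general-purpose HTML parser):

```python
import re

def compress_html(html: str) -> str:
    """Shrink HTML before prompting: drop scripts, styles, and comments, collapse whitespace."""
    html = re.sub(r"<(script|style)\b.*?</\1>", "", html, flags=re.DOTALL | re.IGNORECASE)
    html = re.sub(r"<!--.*?-->", "", html, flags=re.DOTALL)  # HTML comments
    html = re.sub(r"\s+", " ", html)                          # collapse whitespace runs
    return html.strip()
```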
## Integrating with Web Scraping Workflows
Deepseek works best when combined with traditional scraping tools for page navigation and HTML retrieval. You can monitor network requests in Puppeteer to capture API responses and then use Deepseek to parse the JSON or HTML payloads.
Here's a complete workflow example:
```python
import json

import requests
from bs4 import BeautifulSoup

class DeepseekExtractor:
    def __init__(self, api_key):
        self.api_key = api_key
        self.base_url = "https://api.deepseek.com/v1/chat/completions"

    def create_prompt(self, template, **kwargs):
        return template.format(**kwargs)

    def extract(self, prompt, temperature=0.1):
        response = requests.post(
            self.base_url,
            headers={
                "Authorization": f"Bearer {self.api_key}",
                "Content-Type": "application/json"
            },
            json={
                "model": "deepseek-chat",
                "messages": [{"role": "user", "content": prompt}],
                "temperature": temperature
            }
        )
        content = response.json()['choices'][0]['message']['content']
        return json.loads(content)

    def scrape_and_extract(self, url, extraction_template):
        # Fetch HTML
        html_response = requests.get(url)

        # Optional: clean HTML with BeautifulSoup
        soup = BeautifulSoup(html_response.text, 'html.parser')

        # Remove scripts, styles, and other non-content elements
        for element in soup(['script', 'style', 'nav', 'footer']):
            element.decompose()

        clean_html = str(soup)

        # Create prompt and extract
        prompt = self.create_prompt(extraction_template, html=clean_html)
        return self.extract(prompt)

# Usage
extractor = DeepseekExtractor("your-api-key")
data = extractor.scrape_and_extract(
    "https://example.com/product",
    ECOMMERCE_PROMPT
)
print(data)
```
## Conclusion
Effective prompt engineering for Deepseek web scraping requires a balance of clear instructions, structured output requirements, and validation. Start with basic templates and iterate based on your specific use cases. Always validate extracted data, implement error handling, and consider combining Deepseek with traditional scraping tools like Puppeteer for handling browser sessions and dynamic content.
The templates provided here are starting points: customize them for your specific domains and data structures to achieve the best results. Remember that prompt optimization is an iterative process; monitor accuracy and adjust your templates based on real-world performance.