How do I do Prompt Engineering with Deepseek for Better Data Extraction?
Prompt engineering is crucial for maximizing the accuracy and reliability of Deepseek when extracting data from web pages. Unlike traditional web scraping tools that rely on CSS selectors or XPath, Deepseek uses natural language instructions to understand and extract data, making prompt quality essential for successful data extraction.
Understanding Deepseek's Prompt Structure
Deepseek models excel at following structured instructions when properly formatted. The key to effective prompt engineering is clarity, specificity, and providing context about the data you want to extract.
Basic Prompt Template
Here's a foundational template for data extraction prompts:
prompt = """
Extract the following information from the provided HTML:
1. [Field name 1]: [Description of what to extract]
2. [Field name 2]: [Description of what to extract]
3. [Field name 3]: [Description of what to extract]
Return the data as a JSON object with these exact keys: field1, field2, field3.
If a field is not found, use null as the value.
"""
Best Practices for Deepseek Prompt Engineering
1. Be Explicit About Output Format
Always specify the exact format you want. Deepseek performs better when you explicitly define the structure:
import requests
import json

def extract_product_data(html_content):
    prompt = """
    Extract product information from this HTML and return a JSON object with these fields:
    {
        "name": "product name as a string",
        "price": "price as a number (without currency symbols)",
        "currency": "currency code (USD, EUR, etc.)",
        "availability": "in_stock or out_of_stock",
        "rating": "rating as a number between 0-5, or null if not available",
        "review_count": "number of reviews as an integer, or null if not available"
    }
    Important: Return ONLY valid JSON, no additional text or explanations.
    """

    api_payload = {
        "model": "deepseek-chat",
        "messages": [
            {
                "role": "system",
                "content": "You are a precise data extraction assistant. Extract only the requested information and return valid JSON."
            },
            {
                "role": "user",
                "content": f"{prompt}\n\nHTML:\n{html_content}"
            }
        ],
        "temperature": 0.1
    }

    response = requests.post(
        "https://api.deepseek.com/v1/chat/completions",
        headers={"Authorization": f"Bearer {DEEPSEEK_API_KEY}"},
        json=api_payload
    )
    return json.loads(response.json()['choices'][0]['message']['content'])
2. Use Few-Shot Learning for Complex Extraction
When dealing with inconsistent HTML structures, provide examples of the expected output:
const axios = require('axios');

async function extractArticleData(html) {
  const prompt = `
Extract article metadata from the HTML. Here are examples of the expected format:

Example 1:
Input: <article><h1>Sample Title</h1><span class="date">2024-01-15</span></article>
Output: {"title": "Sample Title", "date": "2024-01-15", "author": null}

Example 2:
Input: <div class="post"><h2>Another Article</h2><p class="meta">By John Doe on 2024-02-20</p></div>
Output: {"title": "Another Article", "date": "2024-02-20", "author": "John Doe"}

Now extract the same information from this HTML:
${html}

Return ONLY the JSON object, nothing else.
`;

  const response = await axios.post(
    'https://api.deepseek.com/v1/chat/completions',
    {
      model: 'deepseek-chat',
      messages: [
        {
          role: 'system',
          content: 'You are a data extraction specialist. Follow the examples precisely and return only valid JSON.'
        },
        {
          role: 'user',
          content: prompt
        }
      ],
      temperature: 0.0
    },
    {
      headers: {
        'Authorization': `Bearer ${process.env.DEEPSEEK_API_KEY}`,
        'Content-Type': 'application/json'
      }
    }
  );

  return JSON.parse(response.data.choices[0].message.content);
}
3. Handle Edge Cases Explicitly
Define how to handle missing data, multiple values, or special formats:
def extract_listing_data(html_content):
    prompt = """
    Extract real estate listing information with these rules:
    1. price: Extract numeric value only. If range (e.g., "$100k-$150k"), take the lower value.
    2. bedrooms: Extract as integer. If "Studio", return 0.
    3. bathrooms: Extract as float (e.g., "2.5" for 2.5 baths).
    4. address: Full address as single string.
    5. features: Array of strings. Common features: "parking", "pool", "gym", etc.

    Edge cases:
    - If price says "Contact for price" or similar, return null
    - If bedrooms/bathrooms not specified, return null
    - If multiple addresses found, use the first one
    - Normalize feature names to lowercase

    Return format:
    {
        "price": number or null,
        "bedrooms": integer or null,
        "bathrooms": float or null,
        "address": string or null,
        "features": array of strings (empty array if none)
    }

    HTML content:
    """ + html_content

    # API call with low temperature for consistency
    return call_deepseek_api(prompt, temperature=0.0)
4. Optimize for Token Usage
When scraping multiple pages, reduce token consumption by preprocessing HTML:
import requests
from bs4 import BeautifulSoup

def preprocess_html_for_extraction(raw_html):
    """Remove unnecessary elements before sending to Deepseek."""
    soup = BeautifulSoup(raw_html, 'html.parser')

    # Remove script, style, and boilerplate layout elements
    for element in soup(['script', 'style', 'nav', 'footer', 'header']):
        element.decompose()

    # Keep only the main content area if identifiable
    main_content = soup.find('main') or soup.find('article') or soup.find(id='content')
    if main_content:
        return str(main_content)
    return str(soup)

def efficient_extraction(url):
    # First, scrape the page
    response = requests.get(url)

    # Preprocess to reduce tokens
    clean_html = preprocess_html_for_extraction(response.text)

    prompt = """
    Extract product details: name, price, description (first 200 chars), and image URL.
    Return as JSON with keys: name, price, description, image_url.
    """
    return extract_with_deepseek(clean_html, prompt)
Advanced Prompt Engineering Techniques
Chain-of-Thought Prompting
For complex extraction tasks, guide Deepseek through the reasoning process:
chain_of_thought_prompt = """
Extract the contact information from this business listing HTML.
Think through this step-by-step:
1. First, identify all text that looks like phone numbers (formats: (123) 456-7890, 123-456-7890, +1-123-456-7890)
2. Then, find email addresses (look for @ symbol and valid email format)
3. Next, locate physical addresses (typically include street, city, state, zip)
4. Finally, find social media links (Facebook, Twitter, LinkedIn URLs)
After analyzing, return a JSON object:
{
"phone": "primary phone number in format +1-XXX-XXX-XXXX or null",
"email": "primary email address or null",
"address": "full address as string or null",
"social_media": {
"facebook": "URL or null",
"twitter": "URL or null",
"linkedin": "URL or null"
}
}
HTML:
[HTML_CONTENT]
"""
Validation and Error Handling
Implement validation to ensure Deepseek returns the expected format:
async function extractWithValidation(html, maxRetries = 3) {
  // `let`, not `const`: the prompt is extended with extra instructions on retry
  let prompt = `
Extract e-commerce product data:
- product_id: alphanumeric string
- name: product name
- price: number (positive)
- category: string
Return valid JSON only.

HTML: ${html}
`;

  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      // Assumes a helper that sends the prompt and returns the raw completion text
      const response = await callDeepseekAPI(prompt);
      const data = JSON.parse(response);

      // Validate required fields
      if (!data.product_id || !data.name || typeof data.price !== 'number') {
        throw new Error('Invalid data structure');
      }

      // Validate data types and constraints
      if (data.price <= 0) {
        throw new Error('Invalid price value');
      }

      return data;
    } catch (error) {
      console.log(`Attempt ${attempt} failed: ${error.message}`);
      if (attempt === maxRetries) {
        throw new Error('Max retries exceeded for data extraction');
      }
      // Add more specific instructions for the retry
      prompt += `\n\nPrevious attempt failed. Ensure all fields are present and price is a positive number.`;
    }
  }
}
Combining Deepseek with Traditional Web Scraping
For optimal results, combine LLM-based extraction with traditional scraping methods. Whether you are handling AJAX requests using Puppeteer or rendering pages with Selenium as shown below, the pattern is the same: extract the fully rendered HTML and pass it to Deepseek for intelligent parsing:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def scrape_dynamic_content(url):
    # Use Selenium for dynamic content
    driver = webdriver.Chrome()
    try:
        driver.get(url)

        # Wait for dynamic content to load
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CLASS_NAME, "product-details"))
        )

        # Get rendered HTML
        html_content = driver.page_source
    finally:
        driver.quit()

    # Use Deepseek to extract structured data
    prompt = """
    This HTML is from a dynamically loaded product page.
    Extract: product name, price, specifications (as object), availability.
    Return JSON format:
    {
        "name": string,
        "price": number,
        "specifications": object,
        "in_stock": boolean
    }
    """
    return extract_with_deepseek(html_content, prompt)
Temperature and Parameter Tuning
For data extraction, use low temperature values to ensure consistency:
import requests

def extract_with_optimal_settings(html, prompt):
    """
    Optimal Deepseek settings for data extraction:
    - temperature: 0.0-0.1 for deterministic output
    - top_p: 0.95 for focused responses
    - max_tokens: estimate based on expected output size
    """
    response = requests.post(
        "https://api.deepseek.com/v1/chat/completions",
        headers={
            "Authorization": f"Bearer {DEEPSEEK_API_KEY}",
            "Content-Type": "application/json"
        },
        json={
            "model": "deepseek-chat",
            "messages": [
                {
                    "role": "system",
                    "content": "You are a precise data extraction system. Return only valid JSON."
                },
                {
                    "role": "user",
                    "content": f"{prompt}\n\nHTML:\n{html}"
                }
            ],
            "temperature": 0.0,   # Deterministic output
            "top_p": 0.95,        # Focus on high-probability tokens
            "max_tokens": 1000,   # Adjust based on expected output
            "response_format": {"type": "json_object"}  # JSON mode, if supported
        }
    )
    # Returns the full API response; pull choices[0].message.content for the JSON payload
    return response.json()
Prompt Templates for Common Scenarios
E-commerce Product Extraction
PRODUCT_EXTRACTION_PROMPT = """
Extract product information from this e-commerce page.
Required fields:
- title: Product name/title
- price: Numeric price value (without currency symbol)
- currency: Currency code (USD, EUR, GBP, etc.)
- sku: Product SKU/ID if available
- description: Product description (max 500 characters)
- images: Array of image URLs
- variants: Array of variant objects if multiple options exist (size, color, etc.)
- rating: Average rating (0-5) or null
- reviews_count: Number of reviews or null
Return as valid JSON. Use null for unavailable fields.
"""
Article/Blog Post Extraction
ARTICLE_EXTRACTION_PROMPT = """
Extract article metadata and content.
Fields to extract:
- headline: Main article title
- author: Author name(s)
- published_date: Publication date in ISO 8601 format (YYYY-MM-DD)
- modified_date: Last modified date or null
- categories: Array of category/tag strings
- content: Full article text (preserve paragraphs, remove ads/navigation)
- featured_image: Main article image URL or null
- word_count: Approximate word count
Return valid JSON only.
"""
Testing and Iteration
Always test your prompts with various HTML structures. The harness below references a PRICE_EXTRACTION_PROMPT constant that is not defined elsewhere in this guide; a plausible definition, in the same style as the templates above:
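PRICE_EXTRACTION_PROMPT = """
Extract pricing information from this HTML snippet.
Return a JSON object: {"price": number or null, "currency": "currency code or null"}.
If no concrete numeric price is present (e.g., "Contact us"), return null for both fields.
"""

With that constant in place, a simple test harness looks like this: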
def test_prompt_effectiveness():
    test_cases = [
        {
            "html": "<div class='price'>$99.99</div>",
            "expected": {"price": 99.99, "currency": "USD"}
        },
        {
            "html": "<span class='cost'>€75.50</span>",
            "expected": {"price": 75.50, "currency": "EUR"}
        },
        {
            "html": "<p>Price: Contact us</p>",
            "expected": {"price": None, "currency": None}
        }
    ]

    for i, test in enumerate(test_cases):
        result = extract_with_deepseek(test["html"], PRICE_EXTRACTION_PROMPT)
        assert result == test["expected"], f"Test {i+1} failed"
        print(f"Test {i+1} passed ✓")
Conclusion
Effective prompt engineering with Deepseek for web scraping requires clarity, structure, and iteration. By following these best practices—specifying exact output formats, handling edge cases, using appropriate temperature settings, and combining with traditional scraping tools when needed—you can achieve highly accurate and reliable data extraction.
Remember to preprocess HTML to reduce token costs, implement validation for extracted data, and continuously refine your prompts based on real-world results. When working with dynamic content, consider monitoring network requests in Puppeteer to better understand the data flow before crafting your extraction prompts.
Start with simple, clear prompts and gradually add complexity as needed. The key is balancing specificity with flexibility to handle varying HTML structures while maintaining consistent, structured output.