# What are Effective Prompt Templates for Deepseek Web Scraping?
When using Deepseek for web scraping and data extraction, crafting effective prompts is crucial for getting accurate, structured results. A well-designed prompt template can dramatically improve extraction quality, reduce hallucinations, and ensure consistent output formatting. This guide covers battle-tested prompt templates specifically optimized for Deepseek's architecture.
## Understanding Deepseek's Prompt Requirements
Deepseek models excel at following structured instructions when prompts are clear, specific, and provide concrete examples. Unlike traditional web scraping tools that rely on CSS selectors or XPath, Deepseek uses natural language understanding to identify and extract data patterns from HTML or text content.
The key principles for effective Deepseek prompting include:
- Specificity: Clearly define what data to extract
- Structure: Specify the exact output format (usually JSON)
- Examples: Provide sample inputs and expected outputs
- Constraints: Define data types, required fields, and validation rules
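As a toy illustration of the four principles working together (the field names and wording here are hypothetical, not a fixed Deepseek schema), a prompt might be assembled like this:

```python
import json

def build_prompt(html: str) -> str:
    """Assemble a prompt applying the four principles above.

    Field names are illustrative only.
    """
    fields = {"title": "string", "price": "number"}       # Specificity
    example = {"title": "Demo Widget", "price": 9.99}     # Examples
    return (
        "Extract these fields and return valid JSON only.\n"  # Structure
        f"Fields: {json.dumps(fields)}\n"
        f"Example output: {json.dumps(example)}\n"
        "Use null for missing values.\n"                      # Constraints
        f"HTML:\n{html}\nJSON:"
    )

prompt = build_prompt("<h1>Demo Widget</h1>")
```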
## Basic Extraction Prompt Template
Here's a foundational template for extracting structured data with Deepseek:
```python
import requests
import json

def create_extraction_prompt(html_content, fields):
    prompt = f"""Extract the following information from the HTML content below and return it as valid JSON.

Required fields:
{json.dumps(fields, indent=2)}

Rules:
- Return only valid JSON, no additional text
- Use null for missing values
- Ensure all field names match exactly as specified
- Extract text content, not HTML tags

HTML Content:
{html_content}

JSON Output:"""
    return prompt

# Example usage
fields_to_extract = {
    "title": "string - The main product title",
    "price": "number - The price as a numeric value",
    "availability": "string - In stock status",
    "rating": "number - Average rating (0-5)"
}

html = """
<div class="product">
    <h1>Premium Wireless Headphones</h1>
    <span class="price">$149.99</span>
    <div class="stock">In Stock</div>
    <span class="rating">4.5 stars</span>
</div>
"""

prompt = create_extraction_prompt(html, fields_to_extract)

# Call Deepseek API
response = requests.post(
    "https://api.deepseek.com/v1/chat/completions",
    headers={
        "Authorization": "Bearer YOUR_API_KEY",
        "Content-Type": "application/json"
    },
    json={
        "model": "deepseek-chat",
        "messages": [
            {"role": "user", "content": prompt}
        ],
        "temperature": 0.1
    }
)

result = json.loads(response.json()['choices'][0]['message']['content'])
print(json.dumps(result, indent=2))
```
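One practical caveat: chat models sometimes wrap their JSON reply in a markdown code fence, which makes a bare `json.loads` fail. A small defensive parser (this failure mode is an assumption about model behavior, not part of the Deepseek API) makes the parsing step more robust:

```python
import json
import re

def parse_model_json(text: str):
    """Parse JSON from a model reply, tolerating a ```json ... ``` fence."""
    text = text.strip()
    # Strip a surrounding markdown fence if the model added one
    match = re.match(r"^```(?:json)?\s*(.*?)\s*```$", text, re.DOTALL)
    if match:
        text = match.group(1)
    return json.loads(text)
```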
## List Extraction Template
For extracting multiple items (like product listings or search results):
```javascript
const axios = require('axios');

async function extractList(html, itemSchema) {
  const prompt = `Extract all items from the HTML below that match this schema. Return a JSON array.

Item Schema:
${JSON.stringify(itemSchema, null, 2)}

Instructions:
- Return a JSON array of all matching items
- Each item must follow the schema exactly
- Preserve the order of items as they appear
- Skip items with insufficient data

HTML Content:
${html}

JSON Array:`;

  const response = await axios.post(
    'https://api.deepseek.com/v1/chat/completions',
    {
      model: 'deepseek-chat',
      messages: [
        { role: 'user', content: prompt }
      ],
      temperature: 0.1,
      max_tokens: 4000
    },
    {
      headers: {
        'Authorization': 'Bearer YOUR_API_KEY',
        'Content-Type': 'application/json'
      }
    }
  );

  return JSON.parse(response.data.choices[0].message.content);
}

// Example usage
const itemSchema = {
  name: "string - Product name",
  price: "number - Price in USD",
  url: "string - Relative or absolute URL"
};

const htmlList = `
<div class="product-grid">
  <div class="item">
    <a href="/product/1">Laptop Pro</a>
    <span class="price">$999</span>
  </div>
  <div class="item">
    <a href="/product/2">Mouse Wireless</a>
    <span class="price">$29.99</span>
  </div>
</div>
`;

extractList(htmlList, itemSchema).then(items => {
  console.log(JSON.stringify(items, null, 2));
});
```
## Few-Shot Learning Template
Providing examples significantly improves accuracy, especially for complex or ambiguous data:
```python
def create_few_shot_prompt(html):
    # The examples contain literal JSON braces, so the HTML is appended
    # directly instead of using str.format (which would choke on the braces).
    prompt = """Extract structured data from HTML content following these examples.

Example 1:
Input HTML: <div class="product"><h2>Blue Shirt</h2><p class="price">$25.00</p></div>
Output JSON: {"name": "Blue Shirt", "price": 25.00}

Example 2:
Input HTML: <article><h1>Red Shoes</h1><span class="cost">$79.99</span></article>
Output JSON: {"name": "Red Shoes", "price": 79.99}

Example 3:
Input HTML: <section><div class="title">Green Hat</div><div class="amount">$15</div></section>
Output JSON: {"name": "Green Hat", "price": 15.00}

Now extract from this HTML following the same pattern:
Input HTML: """ + html + """
Output JSON:"""
    return prompt
```
## Table Extraction Template
Tables require special handling to preserve row-column relationships:
```python
def extract_table(html_table):
    prompt = f"""Extract the data from this HTML table and return it as a JSON array of objects.

Requirements:
- First row is headers (use as JSON keys)
- Convert headers to lowercase snake_case
- Each subsequent row becomes an object
- Detect and convert data types (numbers, booleans, strings)
- Handle empty cells as null

HTML Table:
{html_table}

Return only the JSON array, no additional text.
JSON Array:"""
    # API call here
    return prompt
```
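The lowercase snake_case rule can also be enforced client-side rather than trusted to the model. A minimal helper (hypothetical, not part of any library used here) might look like:

```python
import re

def to_snake_case(header: str) -> str:
    """Normalize a table header to lowercase snake_case.

    e.g. 'Review Count' -> 'review_count', 'unitPrice' -> 'unit_price'.
    """
    header = re.sub(r"[^0-9a-zA-Z]+", "_", header.strip())   # non-alphanumerics -> underscores
    header = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", header)  # split camelCase boundaries
    return header.strip("_").lower()
```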
## Multi-Field Validation Template
For critical data that requires validation and multiple attempts:
```javascript
function createValidatedPrompt(html, schema) {
  return `Extract data from HTML and validate against schema.

Schema with validation rules:
${JSON.stringify(schema, null, 2)}

Validation requirements:
- email: must be valid email format
- url: must start with http:// or https://
- phone: must contain 10+ digits
- date: must be parseable date format
- price: must be positive number

HTML Content:
${html}

Return JSON with:
{
  "data": {extracted fields},
  "validation": {
    "valid": true/false,
    "errors": ["list of validation errors if any"]
  }
}

JSON Response:`;
}

// Example with validation schema
const schema = {
  email: "string (email format)",
  website: "string (URL format)",
  phone: "string (phone number)",
  established: "string (year)",
  revenue: "number (positive)"
};
```
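Even when the prompt asks the model to validate, it is safer to re-check the same rules client-side before trusting the result. A rough sketch of those checks (the regexes are deliberate simplifications, not RFC-complete validators):

```python
import re

def validate_record(record: dict) -> list:
    """Re-apply the prompt's validation rules locally; returns error messages."""
    errors = []
    if record.get("email") and not re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", record["email"]):
        errors.append("email: invalid format")
    if record.get("website") and not record["website"].startswith(("http://", "https://")):
        errors.append("website: must start with http:// or https://")
    if record.get("phone") and len(re.sub(r"\D", "", record["phone"])) < 10:
        errors.append("phone: fewer than 10 digits")
    if record.get("revenue") is not None and not (
        isinstance(record["revenue"], (int, float)) and record["revenue"] > 0
    ):
        errors.append("revenue: must be a positive number")
    return errors
```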
## Nested Data Extraction Template
For hierarchical data structures like product categories with items:
```python
def extract_nested_data(html):
    prompt = f"""Extract hierarchical data from this HTML, preserving parent-child relationships.

Return JSON with this structure:
{{
    "categories": [
        {{
            "category_name": "string",
            "items": [
                {{
                    "name": "string",
                    "details": "string"
                }}
            ]
        }}
    ]
}}

HTML Content:
{html}

JSON Output:"""
    return prompt

# Example
html_nested = """
<div class="category">
    <h2>Electronics</h2>
    <div class="item">Laptop - $999</div>
    <div class="item">Phone - $699</div>
</div>
<div class="category">
    <h2>Books</h2>
    <div class="item">Python Guide - $39</div>
</div>
"""
```
## Handling Dynamic Content
When scraping JavaScript-rendered content, you may need to combine Deepseek with a headless browser such as Puppeteer or Playwright to handle AJAX requests and capture the fully rendered HTML before passing it to Deepseek:
```python
from playwright.sync_api import sync_playwright

def scrape_and_extract_dynamic(url, extraction_prompt):
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        # Wait for dynamic content
        page.wait_for_selector('.product-list')
        html = page.content()
        browser.close()

    # Now use Deepseek to extract from the rendered HTML
    prompt = f"""{extraction_prompt}

HTML Content:
{html}

JSON Output:"""
    # call_deepseek_api is assumed to wrap the chat completions request shown earlier
    return call_deepseek_api(prompt)
```
## Error Handling and Retry Template
Build resilience into your prompts:
```javascript
async function robustExtraction(html, schema, maxRetries = 3) {
  const prompt = `Extract data from HTML. If any field cannot be found, use null.

Schema:
${JSON.stringify(schema, null, 2)}

Critical rules:
- Return ONLY valid JSON
- Use null for missing data, never omit fields
- Never return empty strings, use null instead
- Verify all required fields are present

HTML:
${html}

Valid JSON only:`;

  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      const response = await callDeepseekAPI(prompt);
      const parsed = JSON.parse(response);

      // Validate against schema
      const hasAllFields = Object.keys(schema).every(
        key => key in parsed
      );

      if (hasAllFields) {
        return parsed;
      }
    } catch (error) {
      if (attempt === maxRetries - 1) throw error;
      // Add delay before retry
      await new Promise(resolve => setTimeout(resolve, 1000));
    }
  }

  // Every attempt parsed but was missing required fields
  throw new Error(`Extraction failed after ${maxRetries} attempts`);
}
```
## Specialized Templates for Common Scenarios
### E-commerce Product Extraction
```python
ECOMMERCE_PROMPT = """Extract product information from this e-commerce page.

Required fields:
- product_name: string
- brand: string or null
- price: number (current price)
- original_price: number or null (if on sale)
- currency: string (USD, EUR, etc.)
- in_stock: boolean
- rating: number (0-5) or null
- review_count: integer or null
- images: array of image URLs
- description: string (first 200 chars)

HTML:
{html}

Return valid JSON with all fields:"""
```
### Article/Blog Post Extraction
```python
ARTICLE_PROMPT = """Extract article metadata and content.

Required structure:
{{
    "title": "string",
    "author": "string or null",
    "publish_date": "ISO date string or null",
    "category": "string or null",
    "tags": ["array of strings"],
    "content": "full article text",
    "featured_image": "URL or null",
    "word_count": integer
}}

HTML:
{html}

JSON Output:"""
```
### Contact Information Extraction
```python
CONTACT_PROMPT = """Extract all contact information from this page.

Fields to extract:
- company_name: string
- email: string or null
- phone: string or null
- address: string or null
- social_media: object with platforms as keys and URLs as values

HTML:
{html}

JSON Response:"""
```
## Best Practices for Prompt Optimization
- Keep temperature low (0.1-0.3) for consistent, deterministic outputs
- Use system messages to set extraction behavior globally
- Specify output format explicitly - JSON schema, field types
- Include edge case handling in your prompts
- Test with diverse HTML structures before production use
- Monitor token usage - compress HTML when possible by removing unnecessary tags
- Implement validation on extracted data
- Use retries with exponential backoff for failed extractions
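The token-saving tip above can be as simple as stripping non-content tags and collapsing whitespace before the HTML goes into the prompt. A crude regex-based sketch (adequate for preprocessing, not a general-purpose HTML parser):

```python
import re

def compress_html(html: str) -> str:
    """Shrink HTML before prompting: drop scripts, styles, and comments, collapse whitespace."""
    html = re.sub(r"<(script|style)\b.*?</\1>", "", html, flags=re.DOTALL | re.IGNORECASE)
    html = re.sub(r"<!--.*?-->", "", html, flags=re.DOTALL)  # HTML comments
    html = re.sub(r"\s+", " ", html)                          # collapse whitespace runs
    return html.strip()
```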
## Integrating with Web Scraping Workflows
Deepseek works best when combined with traditional scraping tools for page navigation and HTML retrieval. You can monitor network requests in Puppeteer to capture API responses and then use Deepseek to parse the JSON or HTML payloads.
Here's a complete workflow example:
```python
import json

import requests
from bs4 import BeautifulSoup

class DeepseekExtractor:
    def __init__(self, api_key):
        self.api_key = api_key
        self.base_url = "https://api.deepseek.com/v1/chat/completions"

    def create_prompt(self, template, **kwargs):
        return template.format(**kwargs)

    def extract(self, prompt, temperature=0.1):
        response = requests.post(
            self.base_url,
            headers={
                "Authorization": f"Bearer {self.api_key}",
                "Content-Type": "application/json"
            },
            json={
                "model": "deepseek-chat",
                "messages": [{"role": "user", "content": prompt}],
                "temperature": temperature
            }
        )
        content = response.json()['choices'][0]['message']['content']
        return json.loads(content)

    def scrape_and_extract(self, url, extraction_template):
        # Fetch HTML
        html_response = requests.get(url)

        # Optional: clean HTML with BeautifulSoup
        soup = BeautifulSoup(html_response.text, 'html.parser')

        # Remove scripts, styles, and other non-content elements
        for element in soup(['script', 'style', 'nav', 'footer']):
            element.decompose()

        clean_html = str(soup)

        # Create prompt and extract
        prompt = self.create_prompt(extraction_template, html=clean_html)
        return self.extract(prompt)

# Usage
extractor = DeepseekExtractor("your-api-key")
data = extractor.scrape_and_extract(
    "https://example.com/product",
    ECOMMERCE_PROMPT
)
print(data)
```
## Conclusion
Effective prompt engineering for Deepseek web scraping requires a balance of clear instructions, structured output requirements, and validation. Start with basic templates and iterate based on your specific use cases. Always validate extracted data, implement error handling, and consider combining Deepseek with traditional scraping tools like Puppeteer for handling browser sessions and dynamic content.
The templates provided here are starting points: customize them for your specific domains and data structures to achieve the best results. Remember that prompt optimization is an iterative process; monitor accuracy and adjust your templates based on real-world performance.