How do I do Prompt Engineering with Deepseek for Better Data Extraction?
Prompt engineering is crucial for maximizing the accuracy and reliability of Deepseek when extracting data from web pages. Unlike traditional web scraping tools that rely on CSS selectors or XPath, Deepseek uses natural language instructions to understand and extract data, making prompt quality essential for successful data extraction.
Understanding Deepseek's Prompt Structure
Deepseek models excel at following structured instructions when properly formatted. The key to effective prompt engineering is clarity, specificity, and providing context about the data you want to extract.
Basic Prompt Template
Here's a foundational template for data extraction prompts:
prompt = """
Extract the following information from the provided HTML:
1. [Field name 1]: [Description of what to extract]
2. [Field name 2]: [Description of what to extract]
3. [Field name 3]: [Description of what to extract]
Return the data as a JSON object with these exact keys: field1, field2, field3.
If a field is not found, use null as the value.
"""
Best Practices for Deepseek Prompt Engineering
1. Be Explicit About Output Format
Always specify the exact format you want. Deepseek performs better when you explicitly define the structure:
import requests
import json

def extract_product_data(html_content):
    prompt = """
    Extract product information from this HTML and return a JSON object with these fields:
    {
        "name": "product name as a string",
        "price": "price as a number (without currency symbols)",
        "currency": "currency code (USD, EUR, etc.)",
        "availability": "in_stock or out_of_stock",
        "rating": "rating as a number between 0-5, or null if not available",
        "review_count": "number of reviews as an integer, or null if not available"
    }
    Important: Return ONLY valid JSON, no additional text or explanations.
    """

    api_payload = {
        "model": "deepseek-chat",
        "messages": [
            {
                "role": "system",
                "content": "You are a precise data extraction assistant. Extract only the requested information and return valid JSON."
            },
            {
                "role": "user",
                "content": f"{prompt}\n\nHTML:\n{html_content}"
            }
        ],
        "temperature": 0.1
    }

    response = requests.post(
        "https://api.deepseek.com/v1/chat/completions",
        headers={"Authorization": f"Bearer {DEEPSEEK_API_KEY}"},
        json=api_payload
    )
    return json.loads(response.json()['choices'][0]['message']['content'])
2. Use Few-Shot Learning for Complex Extraction
When dealing with inconsistent HTML structures, provide examples of the expected output:
const axios = require('axios');

async function extractArticleData(html) {
  const prompt = `
Extract article metadata from the HTML. Here are examples of the expected format:

Example 1:
Input: <article><h1>Sample Title</h1><span class="date">2024-01-15</span></article>
Output: {"title": "Sample Title", "date": "2024-01-15", "author": null}

Example 2:
Input: <div class="post"><h2>Another Article</h2><p class="meta">By John Doe on 2024-02-20</p></div>
Output: {"title": "Another Article", "date": "2024-02-20", "author": "John Doe"}

Now extract the same information from this HTML:
${html}

Return ONLY the JSON object, nothing else.
`;

  const response = await axios.post(
    'https://api.deepseek.com/v1/chat/completions',
    {
      model: 'deepseek-chat',
      messages: [
        {
          role: 'system',
          content: 'You are a data extraction specialist. Follow the examples precisely and return only valid JSON.'
        },
        {
          role: 'user',
          content: prompt
        }
      ],
      temperature: 0.0
    },
    {
      headers: {
        'Authorization': `Bearer ${process.env.DEEPSEEK_API_KEY}`,
        'Content-Type': 'application/json'
      }
    }
  );

  return JSON.parse(response.data.choices[0].message.content);
}
3. Handle Edge Cases Explicitly
Define how to handle missing data, multiple values, or special formats:
def extract_listing_data(html_content):
    prompt = """
    Extract real estate listing information with these rules:
    1. price: Extract numeric value only. If range (e.g., "$100k-$150k"), take the lower value.
    2. bedrooms: Extract as integer. If "Studio", return 0.
    3. bathrooms: Extract as float (e.g., "2.5" for 2.5 baths).
    4. address: Full address as single string.
    5. features: Array of strings. Common features: "parking", "pool", "gym", etc.

    Edge cases:
    - If price says "Contact for price" or similar, return null
    - If bedrooms/bathrooms not specified, return null
    - If multiple addresses found, use the first one
    - Normalize feature names to lowercase

    Return format:
    {
        "price": number or null,
        "bedrooms": integer or null,
        "bathrooms": float or null,
        "address": string or null,
        "features": array of strings (empty array if none)
    }

    HTML content:
    """ + html_content

    # API call with low temperature for consistency
    return call_deepseek_api(prompt, temperature=0.0)
4. Optimize for Token Usage
When scraping multiple pages, reduce token consumption by preprocessing HTML:
import requests
from bs4 import BeautifulSoup

def preprocess_html_for_extraction(raw_html):
    """Remove unnecessary elements before sending to Deepseek."""
    soup = BeautifulSoup(raw_html, 'html.parser')

    # Remove script, style, and boilerplate layout elements
    for element in soup(['script', 'style', 'nav', 'footer', 'header']):
        element.decompose()

    # Keep only the main content area if identifiable
    main_content = soup.find('main') or soup.find('article') or soup.find(id='content')
    if main_content:
        return str(main_content)
    return str(soup)

def efficient_extraction(url):
    # First, scrape the page
    response = requests.get(url)

    # Preprocess to reduce tokens
    clean_html = preprocess_html_for_extraction(response.text)

    prompt = """
    Extract product details: name, price, description (first 200 chars), and image URL.
    Return as JSON with keys: name, price, description, image_url.
    """
    return extract_with_deepseek(clean_html, prompt)
Advanced Prompt Engineering Techniques
Chain-of-Thought Prompting
For complex extraction tasks, guide Deepseek through the reasoning process:
chain_of_thought_prompt = """
Extract the contact information from this business listing HTML.
Think through this step-by-step:
1. First, identify all text that looks like phone numbers (formats: (123) 456-7890, 123-456-7890, +1-123-456-7890)
2. Then, find email addresses (look for @ symbol and valid email format)
3. Next, locate physical addresses (typically include street, city, state, zip)
4. Finally, find social media links (Facebook, Twitter, LinkedIn URLs)
After analyzing, return a JSON object:
{
"phone": "primary phone number in format +1-XXX-XXX-XXXX or null",
"email": "primary email address or null",
"address": "full address as string or null",
"social_media": {
"facebook": "URL or null",
"twitter": "URL or null",
"linkedin": "URL or null"
}
}
HTML:
[HTML_CONTENT]
"""
Validation and Error Handling
Implement validation to ensure Deepseek returns the expected format:
async function extractWithValidation(html, maxRetries = 3) {
  // `let`, not `const`: the prompt is extended with extra instructions on retry
  let prompt = `
Extract e-commerce product data:
- product_id: alphanumeric string
- name: product name
- price: number (positive)
- category: string
Return valid JSON only.

HTML: ${html}
`;

  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      // Assumes a helper that sends the prompt and returns the raw completion text
      const response = await callDeepseekAPI(prompt);
      const data = JSON.parse(response);

      // Validate required fields
      if (!data.product_id || !data.name || typeof data.price !== 'number') {
        throw new Error('Invalid data structure');
      }

      // Validate data types and constraints
      if (data.price <= 0) {
        throw new Error('Invalid price value');
      }

      return data;
    } catch (error) {
      console.log(`Attempt ${attempt} failed: ${error.message}`);
      if (attempt === maxRetries) {
        throw new Error('Max retries exceeded for data extraction');
      }
      // Add more specific instructions for the retry
      prompt += `\n\nPrevious attempt failed. Ensure all fields are present and price is a positive number.`;
    }
  }
}
Combining Deepseek with Traditional Web Scraping
For optimal results, combine LLM-based extraction with traditional scraping methods. Whether you are handling AJAX requests using Puppeteer or rendering pages with Selenium as shown below, the pattern is the same: extract the fully rendered HTML and pass it to Deepseek for intelligent parsing:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def scrape_dynamic_content(url):
    # Use Selenium for dynamic content
    driver = webdriver.Chrome()
    try:
        driver.get(url)

        # Wait for dynamic content to load
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CLASS_NAME, "product-details"))
        )

        # Get rendered HTML
        html_content = driver.page_source
    finally:
        driver.quit()

    # Use Deepseek to extract structured data
    prompt = """
    This HTML is from a dynamically loaded product page.
    Extract: product name, price, specifications (as object), availability.
    Return JSON format:
    {
        "name": string,
        "price": number,
        "specifications": object,
        "in_stock": boolean
    }
    """
    return extract_with_deepseek(html_content, prompt)
Temperature and Parameter Tuning
For data extraction, use low temperature values to ensure consistency:
import requests

def extract_with_optimal_settings(html, prompt):
    """
    Optimal Deepseek settings for data extraction:
    - temperature: 0.0-0.1 for deterministic output
    - top_p: 0.95 for focused responses
    - max_tokens: estimate based on expected output size
    """
    response = requests.post(
        "https://api.deepseek.com/v1/chat/completions",
        headers={
            "Authorization": f"Bearer {DEEPSEEK_API_KEY}",
            "Content-Type": "application/json"
        },
        json={
            "model": "deepseek-chat",
            "messages": [
                {
                    "role": "system",
                    "content": "You are a precise data extraction system. Return only valid JSON."
                },
                {
                    "role": "user",
                    "content": f"{prompt}\n\nHTML:\n{html}"
                }
            ],
            "temperature": 0.0,   # Deterministic output
            "top_p": 0.95,        # Focus on high-probability tokens
            "max_tokens": 1000,   # Adjust based on expected output
            "response_format": {"type": "json_object"}  # JSON mode, if supported
        }
    )
    # Returns the full API response; pull choices[0].message.content for the JSON payload
    return response.json()
Prompt Templates for Common Scenarios
E-commerce Product Extraction
PRODUCT_EXTRACTION_PROMPT = """
Extract product information from this e-commerce page.
Required fields:
- title: Product name/title
- price: Numeric price value (without currency symbol)
- currency: Currency code (USD, EUR, GBP, etc.)
- sku: Product SKU/ID if available
- description: Product description (max 500 characters)
- images: Array of image URLs
- variants: Array of variant objects if multiple options exist (size, color, etc.)
- rating: Average rating (0-5) or null
- reviews_count: Number of reviews or null
Return as valid JSON. Use null for unavailable fields.
"""
Article/Blog Post Extraction
ARTICLE_EXTRACTION_PROMPT = """
Extract article metadata and content.
Fields to extract:
- headline: Main article title
- author: Author name(s)
- published_date: Publication date in ISO 8601 format (YYYY-MM-DD)
- modified_date: Last modified date or null
- categories: Array of category/tag strings
- content: Full article text (preserve paragraphs, remove ads/navigation)
- featured_image: Main article image URL or null
- word_count: Approximate word count
Return valid JSON only.
"""
Testing and Iteration
Always test your prompts with various HTML structures. The harness below references a PRICE_EXTRACTION_PROMPT constant that is not defined elsewhere in this guide; a plausible definition, in the same style as the templates above:
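PRICE_EXTRACTION_PROMPT = """
Extract pricing information from this HTML snippet.
Return a JSON object: {"price": number or null, "currency": "currency code or null"}.
If no concrete numeric price is present (e.g., "Contact us"), return null for both fields.
"""

With that constant in place, a simple test harness looks like this: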
def test_prompt_effectiveness():
    test_cases = [
        {
            "html": "<div class='price'>$99.99</div>",
            "expected": {"price": 99.99, "currency": "USD"}
        },
        {
            "html": "<span class='cost'>€75.50</span>",
            "expected": {"price": 75.50, "currency": "EUR"}
        },
        {
            "html": "<p>Price: Contact us</p>",
            "expected": {"price": None, "currency": None}
        }
    ]

    for i, test in enumerate(test_cases):
        result = extract_with_deepseek(test["html"], PRICE_EXTRACTION_PROMPT)
        assert result == test["expected"], f"Test {i+1} failed"
        print(f"Test {i+1} passed ✓")
Conclusion
Effective prompt engineering with Deepseek for web scraping requires clarity, structure, and iteration. By following these best practices—specifying exact output formats, handling edge cases, using appropriate temperature settings, and combining with traditional scraping tools when needed—you can achieve highly accurate and reliable data extraction.
Remember to preprocess HTML to reduce token costs, implement validation for extracted data, and continuously refine your prompts based on real-world results. When working with dynamic content, consider monitoring network requests in Puppeteer to better understand the data flow before crafting your extraction prompts.
Start with simple, clear prompts and gradually add complexity as needed. The key is balancing specificity with flexibility to handle varying HTML structures while maintaining consistent, structured output.