How do I Parse HTML Using Deepseek for Data Extraction?

Parsing HTML with Deepseek offers a powerful AI-driven approach to data extraction that goes beyond traditional CSS selectors or XPath. Deepseek can understand HTML structure contextually, making it ideal for extracting data from complex, dynamic, or inconsistent web pages.

Understanding Deepseek for HTML Parsing

Deepseek is a large language model (LLM) that can process HTML content and extract structured data based on natural language instructions. Unlike traditional parsing libraries that require you to specify exact selectors, Deepseek can intelligently identify and extract relevant information from HTML markup.

This approach is particularly useful when:

- HTML structure changes frequently
- You need to extract semantic information rather than just text
- The data isn't consistently formatted across pages
- You want to extract relationships between data elements
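To make the contrast concrete, here is a minimal sketch (the markup and selectors are hypothetical) comparing selector-based extraction with prompt-based extraction; parse_html_with_deepseek is defined in the next section:

from bs4 import BeautifulSoup

html = '<div class="product"><h2>Wireless Headphones</h2><span class="price">$79.99</span></div>'

# Selector-based extraction: breaks when class names or structure change
soup = BeautifulSoup(html, 'html.parser')
name = soup.select_one('div.product h2').get_text()
price = soup.select_one('span.price').get_text()

# Prompt-based extraction: the same instruction survives markup changes
extraction_prompt = "Extract product_name and price as JSON."
# data = parse_html_with_deepseek(html, extraction_prompt)  # defined below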

Basic HTML Parsing with Deepseek API

Python Example

Here's how to parse HTML using Deepseek's API in Python:

import requests
import json

def parse_html_with_deepseek(html_content, extraction_prompt):
    """
    Parse HTML content using Deepseek API

    Args:
        html_content: Raw HTML string
        extraction_prompt: Instructions for what to extract

    Returns:
        Extracted data as JSON
    """
    api_key = "your-deepseek-api-key"
    api_url = "https://api.deepseek.com/v1/chat/completions"

    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}"
    }

    # Construct the prompt
    system_prompt = """You are an HTML parsing assistant. Extract data from HTML
    and return it as valid JSON. Be precise and only extract the requested information."""

    user_prompt = f"""Extract the following from this HTML:
    {extraction_prompt}

    HTML:
    {html_content}

    Return only valid JSON."""

    payload = {
        "model": "deepseek-chat",
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ],
        "temperature": 0.1,  # Low temperature for consistent extraction
        "response_format": {"type": "json_object"}
    }

    response = requests.post(api_url, headers=headers, json=payload)
    response.raise_for_status()
    result = response.json()

    # Parse the JSON response
    extracted_data = json.loads(result['choices'][0]['message']['content'])
    return extracted_data

# Example usage
html = """
<div class="product">
    <h2>Wireless Headphones</h2>
    <span class="price">$79.99</span>
    <div class="rating">4.5 stars</div>
    <p class="description">Premium noise-canceling headphones</p>
</div>
"""

extraction_instructions = """
- product_name: The product title
- price: The numeric price value
- rating: The rating value as a number
- description: The product description text
"""

data = parse_html_with_deepseek(html, extraction_instructions)
print(json.dumps(data, indent=2))

JavaScript/Node.js Example

Here's the equivalent implementation in JavaScript:

const axios = require('axios');

async function parseHtmlWithDeepseek(htmlContent, extractionPrompt) {
    const apiKey = 'your-deepseek-api-key';
    const apiUrl = 'https://api.deepseek.com/v1/chat/completions';

    const systemPrompt = `You are an HTML parsing assistant. Extract data from HTML
    and return it as valid JSON. Be precise and only extract the requested information.`;

    const userPrompt = `Extract the following from this HTML:
    ${extractionPrompt}

    HTML:
    ${htmlContent}

    Return only valid JSON.`;

    try {
        const response = await axios.post(apiUrl, {
            model: 'deepseek-chat',
            messages: [
                { role: 'system', content: systemPrompt },
                { role: 'user', content: userPrompt }
            ],
            temperature: 0.1,
            response_format: { type: 'json_object' }
        }, {
            headers: {
                'Content-Type': 'application/json',
                'Authorization': `Bearer ${apiKey}`
            }
        });

        const extractedData = JSON.parse(
            response.data.choices[0].message.content
        );
        return extractedData;

    } catch (error) {
        console.error('Error parsing HTML:', error.message);
        throw error;
    }
}

// Example usage
const html = `
<article class="blog-post">
    <h1>10 Web Scraping Tips</h1>
    <span class="author">John Doe</span>
    <time>2025-01-15</time>
    <div class="content">Learn essential web scraping techniques...</div>
</article>
`;

const instructions = `
- title: The article title
- author: The author name
- publish_date: The publication date
- content_preview: First sentence of the content
`;

parseHtmlWithDeepseek(html, instructions)
    .then(data => console.log(JSON.stringify(data, null, 2)))
    .catch(err => console.error(err));

Advanced HTML Parsing Techniques

Extracting Lists and Tables

Deepseek excels at extracting structured data from tables and lists:

def extract_table_data(html_table):
    """Extract data from HTML table using Deepseek"""

    prompt = """
    Extract all rows from this HTML table as an array of objects.
    Each object should have keys matching the table headers.
    """

    api_key = "your-deepseek-api-key"

    response = requests.post(
        "https://api.deepseek.com/v1/chat/completions",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json"
        },
        json={
            "model": "deepseek-chat",
            "messages": [
                {"role": "system", "content": "You are an HTML table parser."},
                {"role": "user", "content": f"{prompt}\n\nHTML:\n{html_table}"}
            ],
            "temperature": 0,
            "response_format": {"type": "json_object"}
        }
    )

    response.raise_for_status()
    return json.loads(response.json()['choices'][0]['message']['content'])

# Example with product table
table_html = """
<table class="products">
    <thead>
        <tr><th>Name</th><th>Price</th><th>Stock</th></tr>
    </thead>
    <tbody>
        <tr><td>Laptop</td><td>$999</td><td>In Stock</td></tr>
        <tr><td>Mouse</td><td>$29</td><td>Low Stock</td></tr>
    </tbody>
</table>
"""

table_data = extract_table_data(table_html)
print(table_data)

Handling Complex Nested Structures

For deeply nested HTML, Deepseek can understand hierarchical relationships:

def extract_nested_data(html_content):
    """Extract data from complex nested HTML structures"""

    prompt = """
    From this HTML, extract:
    1. All comment threads with their replies
    2. Include comment author, timestamp, text, and all nested replies
    3. Maintain the hierarchical structure in the returned JSON
    """

    # Reuse the helper defined earlier; the key is crafting clear prompts
    # that specify the desired nested structure
    return parse_html_with_deepseek(html_content, prompt)

Combining Deepseek with Traditional Web Scraping

For optimal results, combine Deepseek with traditional scraping tools. First, use a headless browser or HTTP client to fetch the HTML, then use Deepseek for intelligent parsing:

import requests
from bs4 import BeautifulSoup

def scrape_and_parse_with_deepseek(url):
    """
    Fetch HTML and parse with Deepseek
    """
    # Step 1: Fetch the HTML
    response = requests.get(url, headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    })
    html_content = response.text

    # Step 2: Optional - Clean HTML with BeautifulSoup
    soup = BeautifulSoup(html_content, 'html.parser')

    # Remove script and style elements
    for script in soup(["script", "style"]):
        script.decompose()

    # Extract the main content area
    main_content = soup.find('main') or soup.find('article') or soup.body
    cleaned_html = str(main_content)

    # Step 3: Parse with Deepseek
    extraction_prompt = """
    Extract all product listings with:
    - name
    - price
    - availability
    - image URL
    """

    return parse_html_with_deepseek(cleaned_html, extraction_prompt)

For more complex scenarios involving JavaScript-rendered content, you might want to use browser automation tools to handle AJAX requests before passing the HTML to Deepseek.
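As a rough sketch, assuming Playwright is installed (pip install playwright, then playwright install chromium), the hand-off might look like this:

from playwright.sync_api import sync_playwright

def fetch_rendered_html(url):
    """Render a JavaScript-heavy page and return the final HTML."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        # Wait until network activity settles so AJAX-loaded content is present
        page.goto(url, wait_until="networkidle")
        html = page.content()
        browser.close()
    return html

# rendered_html = fetch_rendered_html("https://example.com/products")
# data = parse_html_with_deepseek(rendered_html, extraction_prompt)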

Best Practices for HTML Parsing with Deepseek

1. Optimize Token Usage

HTML can be verbose. Reduce token consumption by stripping non-content elements before sending markup to the API:

def clean_html_for_parsing(html_content):
    """Remove unnecessary elements to reduce token usage"""
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html_content, 'html.parser')

    # Remove non-content elements
    for tag in soup(['script', 'style', 'nav', 'footer', 'header']):
        tag.decompose()

    # Remove empty tags (keep images and any tags that contain images)
    for tag in soup.find_all():
        if tag.name != 'img' and not tag.find_all('img') and len(tag.get_text(strip=True)) == 0:
            tag.decompose()

    # Remove attributes that aren't needed for extraction
    for tag in soup.find_all(True):
        # Keep only class and id attributes
        tag.attrs = {k: v for k, v in tag.attrs.items() if k in ['class', 'id']}

    return str(soup)

2. Use Specific Prompts

Be explicit about what you want to extract:

# Good prompt
prompt = """
Extract product information as JSON with these exact fields:
- product_name (string): The product title
- price (number): Numeric price value without currency symbol
- in_stock (boolean): true if available, false otherwise
- rating (number): Star rating as decimal (e.g., 4.5)
"""

# Avoid vague prompts like:
# "Get the product data" - too ambiguous

3. Handle Errors Gracefully

Implement robust error handling:

import time

def safe_parse_html(html_content, extraction_prompt, max_retries=3):
    """Parse HTML with retry logic and error handling"""

    for attempt in range(max_retries):
        try:
            result = parse_html_with_deepseek(html_content, extraction_prompt)

            # Validate the response
            if not result or not isinstance(result, dict):
                raise ValueError("Invalid response format")

            return result

        except json.JSONDecodeError as e:
            print(f"JSON parsing error (attempt {attempt + 1}): {e}")
            if attempt == max_retries - 1:
                raise

        except requests.exceptions.RequestException as e:
            print(f"API request error (attempt {attempt + 1}): {e}")
            if attempt == max_retries - 1:
                raise

        # Wait before retrying
        time.sleep(2 ** attempt)  # Exponential backoff

    return None

4. Set Appropriate Temperature

Use low temperature values for consistent extraction:

payload = {
    "model": "deepseek-chat",
    "messages": messages,
    "temperature": 0.1,  # Low for factual extraction
    "response_format": {"type": "json_object"}
}

Batch Processing Multiple Pages

For scraping multiple pages, implement batch processing:

import asyncio
import aiohttp

async def parse_html_async(session, html_content, prompt):
    """Async version for parallel processing"""

    api_key = "your-deepseek-api-key"
    api_url = "https://api.deepseek.com/v1/chat/completions"

    payload = {
        "model": "deepseek-chat",
        "messages": [
            {"role": "system", "content": "You are an HTML parser."},
            {"role": "user", "content": f"{prompt}\n\nHTML:\n{html_content}"}
        ],
        "temperature": 0.1,
        "response_format": {"type": "json_object"}
    }

    async with session.post(
        api_url,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json"
        },
        json=payload
    ) as response:
        result = await response.json()
        return result['choices'][0]['message']['content']

async def batch_parse_pages(html_pages, prompt):
    """Parse multiple HTML pages in parallel"""

    async with aiohttp.ClientSession() as session:
        tasks = [
            parse_html_async(session, html, prompt)
            for html in html_pages
        ]
        results = await asyncio.gather(*tasks)
        return results

# Usage
html_pages = [page1_html, page2_html, page3_html]  # previously fetched HTML strings
extraction_prompt = "Extract all article titles and dates"

results = asyncio.run(batch_parse_pages(html_pages, extraction_prompt))

Cost Optimization Strategies

1. Cache Results

import hashlib
import json

def get_cache_key(html_content, prompt):
    """Generate cache key from HTML and prompt"""
    content = f"{html_content}{prompt}"
    return hashlib.md5(content.encode()).hexdigest()

def parse_with_cache(html_content, prompt, cache_file='parse_cache.json'):
    """Parse HTML with caching to avoid duplicate API calls"""

    cache_key = get_cache_key(html_content, prompt)

    # Load cache
    try:
        with open(cache_file, 'r') as f:
            cache = json.load(f)
    except FileNotFoundError:
        cache = {}

    # Check cache
    if cache_key in cache:
        print(f"Cache hit for key: {cache_key}")
        return cache[cache_key]

    # Parse with Deepseek
    result = parse_html_with_deepseek(html_content, prompt)

    # Save to cache
    cache[cache_key] = result
    with open(cache_file, 'w') as f:
        json.dump(cache, f)

    return result

2. Pre-filter Content

Extract only relevant sections before sending to Deepseek:

from bs4 import BeautifulSoup

def extract_relevant_section(html_content, section_selector):
    """Extract only the relevant section before parsing"""

    soup = BeautifulSoup(html_content, 'html.parser')
    relevant_section = soup.select_one(section_selector)

    if not relevant_section:
        return html_content

    return str(relevant_section)

# Example: Only parse product listings
html_section = extract_relevant_section(full_html, 'div.product-grid')
data = parse_html_with_deepseek(html_section, extraction_prompt)

Monitoring and Debugging

When working with complex pages, you may need to monitor network requests to ensure you're capturing the complete HTML, especially for dynamically loaded content.
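One way to do this, sketched below with Playwright as an assumed tooling choice, is to log every network response the page triggers, so you can confirm the AJAX calls carrying your data completed before capturing the HTML:

from playwright.sync_api import sync_playwright

def fetch_with_network_log(url):
    """Fetch a page while logging every network response it triggers."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        # Print each response's status and URL so missing data is easy to spot
        page.on("response", lambda r: print(f"{r.status} {r.url}"))
        page.goto(url, wait_until="networkidle")
        html = page.content()
        browser.close()
    return html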

Add logging to track your parsing operations:

import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def parse_with_logging(html_content, prompt):
    """Parse HTML with detailed logging"""

    logger.info(f"Starting HTML parse - Content length: {len(html_content)}")
    logger.info(f"Extraction prompt: {prompt[:100]}...")

    try:
        result = parse_html_with_deepseek(html_content, prompt)
        logger.info(f"Parse successful - Extracted {len(result)} fields")
        return result

    except Exception as e:
        logger.error(f"Parse failed: {str(e)}")
        logger.debug(f"HTML content: {html_content[:500]}...")
        raise

Conclusion

Parsing HTML with Deepseek provides a flexible, AI-powered alternative to traditional parsing methods. By combining Deepseek's natural language understanding with proper HTML preprocessing and error handling, you can build robust data extraction pipelines that adapt to changing website structures.

Key takeaways:

- Use low temperature settings (0.1-0.2) for consistent extraction
- Clean and minimize HTML before sending to the API to reduce costs
- Implement caching and batch processing for large-scale scraping
- Combine Deepseek with traditional tools for optimal results
- Always validate and handle errors in extracted data

For production web scraping applications, consider using specialized APIs that handle both HTML fetching and parsing, providing a complete solution for your data extraction needs.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
