How do I use Deepseek with Python for web scraping?

Using Deepseek with Python for web scraping combines the power of artificial intelligence with traditional scraping techniques to extract structured data from HTML content. Deepseek is a cost-effective large language model (LLM) that excels at understanding and parsing unstructured web data, making it ideal for complex scraping tasks where traditional CSS selectors or XPath may fall short.

Why Use Deepseek for Web Scraping?

Deepseek offers several advantages for Python web scraping projects:

  • Cost-effective: Significantly cheaper than GPT-4 and other premium LLMs
  • Semantic understanding: Extracts data based on meaning rather than rigid selectors
  • Adaptive parsing: Handles layout changes and inconsistent HTML structures
  • Structured output: Converts messy HTML into clean JSON data
  • OpenAI-compatible API: Easy to integrate with existing Python code

Traditional web scraping relies on CSS selectors that break when websites change their structure. AI web scraping with Deepseek provides a more resilient solution by understanding content semantically rather than structurally.
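
To make the contrast concrete, here is a minimal sketch (the class name is hypothetical, invented for illustration) showing how a selector-based extractor is tied to one markup pattern, while a prompt simply describes the data you want:

from bs4 import BeautifulSoup

# Hypothetical snippet - the class name is made up for illustration
html = '<div class="pdp-price-v2"><span>$49.99</span></div>'

# Selector-based: silently breaks the moment the class name changes
soup = BeautifulSoup(html, 'html.parser')
price_tag = soup.select_one('.pdp-price-v2 span')
price = price_tag.text if price_tag else None

# LLM-based: the prompt describes the meaning of the data instead of its markup
prompt = f"Return the product price from this HTML as JSON: {html}"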

Prerequisites and Installation

Before you start, ensure you have Python 3.7 or higher installed. You'll need to install the necessary libraries:

# Install the OpenAI library (Deepseek uses OpenAI-compatible API)
pip install openai

# Install web scraping libraries
pip install requests beautifulsoup4

# For loading .env files and for the retry examples later in this guide
pip install python-dotenv tenacity

# Optional: For handling dynamic websites
pip install selenium playwright

You'll also need a Deepseek API key. If you don't have one yet, check out our guide on how to get a Deepseek API key.

Basic Python Setup with Deepseek

Configuring Your Environment

Store your API key securely using environment variables:

import os
from openai import OpenAI

# Configure the Deepseek client
client = OpenAI(
    api_key=os.environ.get("DEEPSEEK_API_KEY"),  # Never hardcode API keys
    base_url="https://api.deepseek.com"
)

For local development, create a .env file:

DEEPSEEK_API_KEY=your-api-key-here

Then load it at the top of your Python script, before creating the client, so the key is available in os.environ:

from dotenv import load_dotenv
load_dotenv()

Simple Web Scraping Example

Here's a basic example of extracting product information from a webpage:

import os
import json

import requests
from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get("DEEPSEEK_API_KEY"),
    base_url="https://api.deepseek.com"
)

def scrape_product_page(url):
    """Extract product data from a webpage using Deepseek"""

    # Fetch the HTML content
    response = requests.get(url, headers={
        'User-Agent': 'Mozilla/5.0 (compatible; WebScraper/1.0)'
    })
    html_content = response.text

    # Create a prompt for Deepseek
    prompt = f"""
    Extract the following product information from this HTML and return as JSON:
    - name: Product name
    - price: Price as a number
    - currency: Currency code (USD, EUR, etc.)
    - description: Product description
    - availability: In stock status (true/false)
    - images: Array of image URLs

    HTML Content:
    {html_content[:8000]}

    Return ONLY valid JSON, no additional text.
    """

    # Call Deepseek API
    completion = client.chat.completions.create(
        model="deepseek-chat",
        messages=[
            {
                "role": "system",
                "content": "You are a web scraping assistant that extracts structured data from HTML. Always return valid JSON."
            },
            {
                "role": "user",
                "content": prompt
            }
        ],
        temperature=0.0  # Use 0 for consistent, deterministic output
    )

    # Parse the JSON response
    product_data = json.loads(completion.choices[0].message.content)
    return product_data

# Example usage
url = "https://example.com/product/12345"
product = scrape_product_page(url)
print(json.dumps(product, indent=2))
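
If Deepseek occasionally wraps its answer in markdown code fences, one option worth trying is OpenAI-style JSON mode via the response_format parameter. This is a minimal sketch; verify the parameter's behavior against the current Deepseek API documentation:

# Assumes Deepseek supports OpenAI-style JSON mode via response_format
prompt = "Extract the product name and price from this HTML as JSON: <html>...</html>"

completion = client.chat.completions.create(
    model="deepseek-chat",
    messages=[
        {"role": "system", "content": "Return valid JSON only."},
        {"role": "user", "content": prompt}
    ],
    response_format={"type": "json_object"},  # verify support in the Deepseek docs
    temperature=0.0
)
product_data = json.loads(completion.choices[0].message.content)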

Advanced Web Scraping Techniques

Preprocessing HTML for Better Results

To optimize token usage and improve accuracy, clean the HTML before sending it to Deepseek:

from bs4 import BeautifulSoup, Comment

def clean_html_for_llm(html_content):
    """Remove unnecessary elements to reduce tokens and improve extraction"""

    soup = BeautifulSoup(html_content, 'html.parser')

    # Remove script tags, styles, and navigation elements
    for element in soup(['script', 'style', 'nav', 'footer', 'header', 'aside']):
        element.decompose()

    # Remove HTML comments
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        comment.extract()

    # Get cleaner HTML or just text
    return str(soup)

def scrape_with_preprocessing(url):
    """Scrape with HTML preprocessing"""

    response = requests.get(url)
    cleaned_html = clean_html_for_llm(response.text)

    # Now use cleaned HTML with Deepseek
    completion = client.chat.completions.create(
        model="deepseek-chat",
        messages=[
            {
                "role": "user",
                "content": f"Extract article title, author, date, and content as JSON:\n\n{cleaned_html[:8000]}"
            }
        ],
        temperature=0.0
    )

    return json.loads(completion.choices[0].message.content)
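
If you only need the visible text rather than the markup (for example, article content), you can cut tokens further by sending plain text instead of HTML. This is a small variant of the cleaner above using BeautifulSoup's get_text():

def html_to_text(html_content):
    """Extract visible text only - a much smaller payload than full HTML"""
    soup = BeautifulSoup(html_content, 'html.parser')
    for element in soup(['script', 'style', 'nav', 'footer', 'header', 'aside']):
        element.decompose()
    # Collapse to readable plain text, one block per line
    return soup.get_text(separator='\n', strip=True)

The trade-off is that attributes such as image URLs and link targets are lost, so use this only when the fields you want live in the visible text.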

Using Function Calling for Structured Output

Function calling ensures you always get properly structured data. This is particularly useful when you need to get structured output from an LLM:

def scrape_with_function_calling(url):
    """Extract data using function calling for guaranteed structure"""

    html = requests.get(url).text

    # Define the expected output structure
    tools = [
        {
            "type": "function",
            "function": {
                "name": "extract_product_info",
                "description": "Extract product information from HTML",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "name": {
                            "type": "string",
                            "description": "Product name"
                        },
                        "price": {
                            "type": "number",
                            "description": "Product price as a number"
                        },
                        "currency": {
                            "type": "string",
                            "description": "Currency code (USD, EUR, GBP, etc.)"
                        },
                        "in_stock": {
                            "type": "boolean",
                            "description": "Whether the product is in stock"
                        },
                        "rating": {
                            "type": "number",
                            "description": "Product rating (0-5 scale)"
                        },
                        "reviews_count": {
                            "type": "integer",
                            "description": "Number of customer reviews"
                        },
                        "images": {
                            "type": "array",
                            "items": {"type": "string"},
                            "description": "Array of product image URLs"
                        }
                    },
                    "required": ["name", "price", "currency"]
                }
            }
        }
    ]

    completion = client.chat.completions.create(
        model="deepseek-chat",
        messages=[
            {
                "role": "user",
                "content": f"Extract product data from this HTML:\n\n{html[:8000]}"
            }
        ],
        tools=tools,
        tool_choice={"type": "function", "function": {"name": "extract_product_info"}}
    )

    # Extract the structured data (guard against the model skipping the tool call)
    tool_calls = completion.choices[0].message.tool_calls
    if not tool_calls:
        raise ValueError("Deepseek did not return a tool call for this page")

    function_args = json.loads(tool_calls[0].function.arguments)

    return function_args

# Example usage
product_data = scrape_with_function_calling("https://example.com/product/12345")
print(f"Product: {product_data['name']}")
print(f"Price: {product_data['currency']} {product_data['price']}")

Batch Processing Multiple Pages

When scraping multiple pages, use concurrent processing to improve performance:

import concurrent.futures
from typing import List, Dict

def extract_data_from_page(url: str) -> Dict:
    """Extract data from a single page"""
    try:
        html = requests.get(url, timeout=10).text
        cleaned = clean_html_for_llm(html)

        completion = client.chat.completions.create(
            model="deepseek-chat",
            messages=[
                {
                    "role": "user",
                    "content": f"Extract key data points as JSON:\n\n{cleaned[:8000]}"
                }
            ],
            temperature=0.0
        )

        return {
            "url": url,
            "data": json.loads(completion.choices[0].message.content),
            "success": True
        }
    except Exception as e:
        return {
            "url": url,
            "error": str(e),
            "success": False
        }

def scrape_multiple_pages(urls: List[str], max_workers: int = 5) -> List[Dict]:
    """Scrape multiple pages concurrently"""

    results = []

    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Submit all tasks
        future_to_url = {
            executor.submit(extract_data_from_page, url): url
            for url in urls
        }

        # Collect results as they complete
        for future in concurrent.futures.as_completed(future_to_url):
            results.append(future.result())

    return results

# Example: Scrape product listing pages
base_url = "https://example.com/products"
urls = [f"{base_url}?page={i}" for i in range(1, 11)]

all_products = scrape_multiple_pages(urls, max_workers=5)
successful = [r for r in all_products if r['success']]
print(f"Successfully scraped {len(successful)}/{len(urls)} pages")

Combining Deepseek with Dynamic Content Scraping

For JavaScript-heavy websites, combine Deepseek with browser automation. This approach is essential when you need to handle dynamic websites with LLM-based scraping:

Using Selenium with Deepseek

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def scrape_dynamic_page_with_selenium(url):
    """Scrape JavaScript-rendered content with Selenium and Deepseek"""

    # Configure headless Chrome
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    chrome_options.add_argument("--no-sandbox")
    chrome_options.add_argument("--disable-dev-shm-usage")

    driver = webdriver.Chrome(options=chrome_options)

    try:
        # Load the page
        driver.get(url)

        # Wait for JavaScript to render content
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.TAG_NAME, "body"))
        )

        # Optional: Wait for specific elements
        # WebDriverWait(driver, 10).until(
        #     EC.presence_of_element_located((By.CLASS_NAME, "product-listing"))
        # )

        # Get fully rendered HTML
        html_content = driver.page_source

        # Process with Deepseek
        cleaned_html = clean_html_for_llm(html_content)

        completion = client.chat.completions.create(
            model="deepseek-chat",
            messages=[
                {
                    "role": "user",
                    "content": f"Extract all product listings as a JSON array:\n\n{cleaned_html[:8000]}"
                }
            ],
            temperature=0.0
        )

        return json.loads(completion.choices[0].message.content)

    finally:
        driver.quit()

# Example usage
products = scrape_dynamic_page_with_selenium("https://example.com/search?q=laptop")
print(f"Found {len(products)} products")

Using Playwright with Deepseek

Playwright is a modern alternative to Selenium:

from playwright.sync_api import sync_playwright

def scrape_with_playwright(url):
    """Scrape with Playwright and Deepseek"""

    with sync_playwright() as p:
        # Launch browser
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()

        # Navigate to URL
        page.goto(url, wait_until="networkidle")

        # Optional: Wait for specific content
        page.wait_for_selector(".product-grid", timeout=10000)

        # Get rendered HTML
        html_content = page.content()
        browser.close()

        # Extract data with Deepseek
        cleaned = clean_html_for_llm(html_content)

        completion = client.chat.completions.create(
            model="deepseek-chat",
            messages=[
                {
                    "role": "user",
                    "content": f"Extract product data as JSON:\n\n{cleaned[:8000]}"
                }
            ],
            temperature=0.0
        )

        return json.loads(completion.choices[0].message.content)

# Example usage
data = scrape_with_playwright("https://example.com/products")

Error Handling and Retry Logic

Robust error handling is crucial for production web scraping, so it's worth understanding which error handling strategies to use when scraping with LLMs:

import time
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type

class ScrapingError(Exception):
    """Custom exception for scraping errors"""
    pass

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=4, max=10),
    retry=retry_if_exception_type((requests.RequestException, json.JSONDecodeError))
)
def scrape_with_retry(url: str) -> Dict:
    """Scrape with automatic retry logic"""

    try:
        # Fetch content
        response = requests.get(url, timeout=15, headers={
            'User-Agent': 'Mozilla/5.0 (compatible; WebScraper/1.0)'
        })
        response.raise_for_status()

        html_content = response.text
        cleaned = clean_html_for_llm(html_content)

        # Extract with Deepseek
        completion = client.chat.completions.create(
            model="deepseek-chat",
            messages=[
                {
                    "role": "system",
                    "content": "Extract data and return valid JSON only."
                },
                {
                    "role": "user",
                    "content": f"Extract structured data:\n\n{cleaned[:8000]}"
                }
            ],
            temperature=0.0,
            timeout=30.0
        )

        response_text = completion.choices[0].message.content

        # Parse JSON response
        try:
            data = json.loads(response_text)
        except json.JSONDecodeError:
            # Try to extract JSON from response if wrapped in markdown
            import re
            json_match = re.search(r'\{.*\}|\[.*\]', response_text, re.DOTALL)
            if json_match:
                data = json.loads(json_match.group())
            else:
                raise ScrapingError("Could not parse JSON from LLM response")

        return {
            "url": url,
            "data": data,
            "success": True
        }

    except requests.RequestException as e:
        print(f"Request error for {url}: {e}")
        raise
    except Exception as e:
        print(f"Extraction error for {url}: {e}")
        return {
            "url": url,
            "error": str(e),
            "success": False
        }

# Example usage with retry
result = scrape_with_retry("https://example.com/product/12345")
if result['success']:
    print("Scraping successful:", result['data'])
else:
    print("Scraping failed:", result['error'])

Optimizing Token Usage and Costs

Since Deepseek charges based on tokens, optimizing usage is important. Learn more about optimizing LLM costs when scraping:

def estimate_tokens(text: str) -> int:
    """Rough token estimation (1 token ≈ 4 characters for English)"""
    return len(text) // 4

def chunk_html_content(html: str, max_tokens: int = 6000) -> List[str]:
    """Split large HTML into smaller chunks"""

    max_chars = max_tokens * 4
    chunks = []

    soup = BeautifulSoup(html, 'html.parser')

    # Split by main sections
    sections = soup.find_all(['article', 'section', 'div'], class_=True)

    current_chunk = ""
    for section in sections:
        section_html = str(section)

        if len(current_chunk) + len(section_html) < max_chars:
            current_chunk += section_html
        else:
            if current_chunk:
                chunks.append(current_chunk)
            current_chunk = section_html

    if current_chunk:
        chunks.append(current_chunk)

    return chunks

def scrape_large_page(url: str) -> List[Dict]:
    """Handle large pages by chunking"""

    html = requests.get(url).text
    chunks = chunk_html_content(html, max_tokens=6000)

    results = []
    for i, chunk in enumerate(chunks):
        print(f"Processing chunk {i+1}/{len(chunks)}")

        completion = client.chat.completions.create(
            model="deepseek-chat",
            messages=[
                {
                    "role": "user",
                    "content": f"Extract data from this HTML section:\n\n{chunk}"
                }
            ],
            temperature=0.0
        )

        results.append(json.loads(completion.choices[0].message.content))

        # Rate limiting
        time.sleep(0.5)

    return results
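
The four-characters-per-token rule is only a rough guide. For a closer (though still approximate) count you can tokenize the text locally; the sketch below uses the tiktoken package with the cl100k_base encoding as a stand-in, since Deepseek uses its own tokenizer:

# pip install tiktoken
import tiktoken

def estimate_tokens_tiktoken(text: str) -> int:
    """Approximate token count using a local tokenizer.
    cl100k_base is not Deepseek's tokenizer, so treat the result as an estimate."""
    encoding = tiktoken.get_encoding("cl100k_base")
    return len(encoding.encode(text))

html = requests.get("https://example.com/products").text
print(f"Character-based estimate: {estimate_tokens(html)} tokens")
print(f"Tokenizer-based estimate: {estimate_tokens_tiktoken(html)} tokens")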

Complete Production-Ready Example

Here's a full example combining all best practices:

import os
import json
import time
import logging
from typing import Dict, List, Optional
from dataclasses import dataclass
import requests
from bs4 import BeautifulSoup
from openai import OpenAI
from tenacity import retry, stop_after_attempt, wait_exponential
from dotenv import load_dotenv

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

# Load environment variables
load_dotenv()

@dataclass
class ScrapingResult:
    """Data class for scraping results"""
    url: str
    data: Optional[Dict]
    success: bool
    error: Optional[str] = None
    tokens_used: Optional[int] = None

class DeepseekScraper:
    """Production-ready web scraper using Deepseek"""

    def __init__(self, api_key: Optional[str] = None):
        self.client = OpenAI(
            api_key=api_key or os.environ.get("DEEPSEEK_API_KEY"),
            base_url="https://api.deepseek.com"
        )
        self.session = requests.Session()
        self.session.headers.update({
            'User-Agent': 'Mozilla/5.0 (compatible; WebScraper/1.0)'
        })

    def clean_html(self, html: str) -> str:
        """Remove unnecessary HTML elements"""
        soup = BeautifulSoup(html, 'html.parser')

        for element in soup(['script', 'style', 'nav', 'footer', 'header']):
            element.decompose()

        return str(soup)

    @retry(
        stop=stop_after_attempt(3),
        wait=wait_exponential(multiplier=1, min=2, max=10)
    )
    def extract_data(self, html: str, extraction_schema: Dict) -> Dict:
        """Extract data using Deepseek with retry logic"""

        cleaned = self.clean_html(html)

        # Truncate to avoid token limits
        max_chars = 30000
        if len(cleaned) > max_chars:
            cleaned = cleaned[:max_chars]
            logger.warning(f"HTML truncated to {max_chars} characters")

        prompt = f"""
        Extract data matching this schema and return ONLY valid JSON:

        Schema: {json.dumps(extraction_schema, indent=2)}

        Rules:
        - Return only JSON, no markdown or explanations
        - Use null for missing values
        - Maintain exact field names from schema

        HTML:
        {cleaned}
        """

        completion = self.client.chat.completions.create(
            model="deepseek-chat",
            messages=[
                {
                    "role": "system",
                    "content": "You are a data extraction assistant. Return only valid JSON."
                },
                {
                    "role": "user",
                    "content": prompt
                }
            ],
            temperature=0.0
        )

        response_text = completion.choices[0].message.content

        # Extract JSON from response
        import re
        json_match = re.search(r'\{.*\}|\[.*\]', response_text, re.DOTALL)
        if json_match:
            return json.loads(json_match.group())
        else:
            return json.loads(response_text)

    def scrape_page(self, url: str, schema: Dict) -> ScrapingResult:
        """Scrape a single page"""

        logger.info(f"Scraping: {url}")

        try:
            # Fetch HTML
            response = self.session.get(url, timeout=15)
            response.raise_for_status()

            # Extract data
            data = self.extract_data(response.text, schema)

            logger.info(f"Successfully scraped: {url}")
            return ScrapingResult(
                url=url,
                data=data,
                success=True
            )

        except Exception as e:
            logger.error(f"Error scraping {url}: {str(e)}")
            return ScrapingResult(
                url=url,
                data=None,
                success=False,
                error=str(e)
            )

    def scrape_multiple(
        self,
        urls: List[str],
        schema: Dict,
        delay: float = 1.0
    ) -> List[ScrapingResult]:
        """Scrape multiple URLs with rate limiting"""

        results = []

        for i, url in enumerate(urls):
            result = self.scrape_page(url, schema)
            results.append(result)

            # Rate limiting
            if i < len(urls) - 1:
                time.sleep(delay)

        return results

# Example usage
if __name__ == "__main__":
    # Initialize scraper
    scraper = DeepseekScraper()

    # Define extraction schema
    product_schema = {
        "name": "string",
        "price": "number",
        "currency": "string",
        "description": "string",
        "in_stock": "boolean",
        "images": ["array of strings"]
    }

    # Scrape single page
    result = scraper.scrape_page(
        "https://example.com/product/12345",
        product_schema
    )

    if result.success:
        print("Product data:")
        print(json.dumps(result.data, indent=2))
    else:
        print(f"Error: {result.error}")

    # Scrape multiple pages
    urls = [
        "https://example.com/product/1",
        "https://example.com/product/2",
        "https://example.com/product/3"
    ]

    results = scraper.scrape_multiple(urls, product_schema, delay=1.0)

    successful = [r for r in results if r.success]
    print(f"\nSuccessfully scraped {len(successful)}/{len(urls)} pages")

Best Practices Summary

When using Deepseek with Python for web scraping:

  1. Always preprocess HTML - Remove unnecessary elements to reduce token usage
  2. Use function calling - Ensures consistent structured output
  3. Implement retry logic - Handle API failures gracefully
  4. Respect rate limits - Add delays between requests
  5. Monitor token usage - Track costs and optimize prompts
  6. Validate outputs - Always validate JSON responses (see the validation sketch after this list)
  7. Use environment variables - Never hardcode API keys
  8. Log everything - Maintain detailed logs for debugging
  9. Handle errors gracefully - Return meaningful error messages
  10. Test incrementally - Start small and scale up
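
As a concrete example of point 6, you can validate the parsed response against a schema before using it. This is a minimal sketch using the jsonschema package (not installed elsewhere in this guide, so add it separately):

# pip install jsonschema
import json
from jsonschema import validate, ValidationError

product_json_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "price": {"type": "number"},
        "currency": {"type": "string"}
    },
    "required": ["name", "price", "currency"]
}

def parse_and_validate(raw_response: str) -> dict:
    """Parse the LLM output and reject anything that doesn't match the schema."""
    data = json.loads(raw_response)
    try:
        validate(instance=data, schema=product_json_schema)
    except ValidationError as e:
        raise ValueError(f"LLM output failed schema validation: {e.message}")
    return data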

Conclusion

Deepseek provides a powerful, cost-effective solution for Python web scraping projects. By combining it with traditional scraping tools like BeautifulSoup and Selenium, you can build robust data extraction pipelines that handle complex, dynamic websites intelligently. The key is to optimize token usage, implement proper error handling, and follow best practices for production deployments.

Whether you're extracting product data, parsing news articles, or monitoring competitor websites, Deepseek's semantic understanding capabilities make it an excellent choice for modern web scraping challenges.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
