What is a Good Deepseek Tutorial for Beginners in Web Scraping?

Deepseek is an advanced AI language model that can be leveraged for intelligent web scraping tasks, particularly when dealing with unstructured data or complex HTML layouts. This tutorial will guide you through using Deepseek for web scraping, from basic setup to advanced data extraction techniques.

Understanding Deepseek for Web Scraping

Deepseek is a large language model (LLM) that excels at understanding and extracting structured information from unstructured content. Unlike traditional web scraping tools that rely on CSS selectors or XPath, Deepseek can intelligently parse HTML content and extract relevant data based on natural language instructions.
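
For example, instead of maintaining a CSS selector that breaks whenever the markup changes, you describe the target data in a prompt. Both lines below are illustrative:

# Traditional approach: breaks whenever the markup changes
# price = soup.select_one("div.product > span.price").get_text()

# LLM approach: describe the goal; markup changes rarely matter
prompt = 'Return the product price from the page below as JSON: {"price": "..."}'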

Why Use Deepseek for Web Scraping?

  • Flexible parsing: Works with changing HTML structures without brittle selectors
  • Natural language queries: Describe what you want to extract in plain English
  • Complex data extraction: Handles nested structures and contextual relationships
  • Cost-effective: Competitive pricing compared to other LLM providers
  • High accuracy: Strong performance on data extraction tasks

Getting Started: Setting Up Deepseek

Step 1: Obtain API Access

First, you'll need to get a Deepseek API key:

  1. Visit the Deepseek platform website
  2. Create an account or sign in
  3. Navigate to the API section
  4. Generate your API key
  5. Note your API endpoint URL
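
Once you have a key, avoid hard-coding it in scripts. A common pattern is to read it from an environment variable; here is a minimal sketch, assuming the key is exported as DEEPSEEK_API_KEY:

import os
from openai import OpenAI

# Assumes: export DEEPSEEK_API_KEY="sk-..." in your shell
client = OpenAI(
    api_key=os.environ["DEEPSEEK_API_KEY"],
    base_url="https://api.deepseek.com"
)

# Quick smoke test to confirm the key works
reply = client.chat.completions.create(
    model="deepseek-chat",
    messages=[{"role": "user", "content": "Say OK"}]
)
print(reply.choices[0].message.content)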

Step 2: Install Required Libraries

For Python:

pip install requests beautifulsoup4 openai

For JavaScript/Node.js:

npm install axios cheerio openai

Basic Deepseek Web Scraping Tutorial

Python Example: Extracting Product Information

Here's a complete example of using Deepseek to extract product data from an e-commerce page:

import requests
from bs4 import BeautifulSoup
from openai import OpenAI

# Initialize Deepseek client
client = OpenAI(
    api_key="your-deepseek-api-key",
    base_url="https://api.deepseek.com"
)

def scrape_with_deepseek(url, extraction_prompt):
    # Fetch the webpage
    response = requests.get(url)
    html_content = response.text

    # Optional: reduce token usage by stripping scripts/styles and keeping visible text
    soup = BeautifulSoup(html_content, 'html.parser')
    # Remove scripts and styles
    for script in soup(["script", "style"]):
        script.decompose()
    clean_text = soup.get_text()

    # Use Deepseek to extract data
    completion = client.chat.completions.create(
        model="deepseek-chat",
        messages=[
            {
                "role": "system",
                "content": "You are a web scraping assistant. Extract structured data from HTML content as requested."
            },
            {
                "role": "user",
                "content": f"{extraction_prompt}\n\nHTML Content:\n{clean_html[:8000]}"
            }
        ],
        response_format={"type": "json_object"}
    )

    return completion.choices[0].message.content

# Example usage
url = "https://example.com/product/123"
prompt = """
Extract the following product information and return as JSON:
- product_name
- price
- description
- availability
- rating
"""

result = scrape_with_deepseek(url, prompt)
print(result)
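
Because response_format is set to json_object, the returned string is valid JSON and can be parsed directly. The field names follow whatever your prompt requested:

import json

data = json.loads(result)
print(data.get("product_name"), data.get("price"))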

JavaScript Example: Extracting Article Data

const axios = require('axios');
const cheerio = require('cheerio');
const OpenAI = require('openai');

const client = new OpenAI({
    apiKey: 'your-deepseek-api-key',
    baseURL: 'https://api.deepseek.com'
});

async function scrapeWithDeepseek(url, extractionPrompt) {
    // Fetch the webpage
    const response = await axios.get(url);
    const html = response.data;

    // Clean HTML with Cheerio
    const $ = cheerio.load(html);
    $('script, style').remove();
    const cleanText = $('body').text();

    // Use Deepseek for extraction
    const completion = await client.chat.completions.create({
        model: 'deepseek-chat',
        messages: [
            {
                role: 'system',
                content: 'You are a web scraping assistant. Extract structured data from HTML content as requested.'
            },
            {
                role: 'user',
                content: `${extractionPrompt}\n\nHTML Content:\n${cleanText.substring(0, 8000)}`
            }
        ],
        response_format: { type: 'json_object' }
    });

    return JSON.parse(completion.choices[0].message.content);
}

// Example usage
const url = 'https://example.com/article/456';
const prompt = `
Extract the following article information and return as JSON:
- title
- author
- publish_date
- content
- tags
`;

scrapeWithDeepseek(url, prompt)
    .then(result => console.log(result))
    .catch(err => console.error(err));

Advanced Techniques

Working with Dynamic Content

When scraping JavaScript-rendered pages, combine Deepseek with a browser automation tool such as Playwright (used below) to render the page before extraction:

from playwright.sync_api import sync_playwright
from openai import OpenAI

client = OpenAI(
    api_key="your-deepseek-api-key",
    base_url="https://api.deepseek.com"
)

def scrape_dynamic_page(url, extraction_prompt):
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)

        # Wait for content to load
        page.wait_for_load_state('networkidle')

        # Get rendered HTML
        html_content = page.content()
        browser.close()

        # Extract data with Deepseek
        completion = client.chat.completions.create(
            model="deepseek-chat",
            messages=[
                {
                    "role": "user",
                    "content": f"{extraction_prompt}\n\nHTML:\n{html_content[:8000]}"
                }
            ],
            response_format={"type": "json_object"}
        )

        return completion.choices[0].message.content

# Usage
url = "https://example.com/dynamic-page"
prompt = "Extract all product listings with name, price, and image URL as JSON array"
result = scrape_dynamic_page(url, prompt)

Batch Processing Multiple Pages

For scraping multiple pages efficiently:

import concurrent.futures
import requests
from openai import OpenAI

client = OpenAI(
    api_key="your-deepseek-api-key",
    base_url="https://api.deepseek.com"
)

def extract_data(html_content, prompt):
    completion = client.chat.completions.create(
        model="deepseek-chat",
        messages=[
            {"role": "user", "content": f"{prompt}\n\n{html_content}"}
        ],
        response_format={"type": "json_object"}
    )
    return completion.choices[0].message.content

def scrape_multiple_pages(urls, extraction_prompt, max_workers=5):
    results = []

    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Fetch all pages
        future_to_url = {
            executor.submit(requests.get, url): url
            for url in urls
        }

        html_contents = []
        for future in concurrent.futures.as_completed(future_to_url):
            response = future.result()
            html_contents.append(response.text[:8000])

        # Extract data from all pages
        extraction_futures = [
            executor.submit(extract_data, html, extraction_prompt)
            for html in html_contents
        ]

        for future in concurrent.futures.as_completed(extraction_futures):
            results.append(future.result())

    return results

# Example
urls = [
    "https://example.com/page1",
    "https://example.com/page2",
    "https://example.com/page3"
]
prompt = "Extract product title, price, and rating as JSON"
results = scrape_multiple_pages(urls, prompt)

Handling Pagination

When dealing with paginated content, you can combine traditional scraping with Deepseek:

const axios = require('axios');
const OpenAI = require('openai');

const client = new OpenAI({
    apiKey: 'your-deepseek-api-key',
    baseURL: 'https://api.deepseek.com'
});

async function scrapePaginatedSite(baseUrl, maxPages = 10) {
    const allResults = [];

    for (let page = 1; page <= maxPages; page++) {
        const url = `${baseUrl}?page=${page}`;
        const response = await axios.get(url);

        const completion = await client.chat.completions.create({
            model: 'deepseek-chat',
            messages: [
                {
                    role: 'user',
                    content: `Extract all items from this page as a JSON object with an "items" array; each item has fields: title, price, url\n\n${response.data.substring(0, 8000)}`
                }
            ],
            response_format: { type: 'json_object' }
        });

        const pageResults = JSON.parse(completion.choices[0].message.content);
        allResults.push(...(pageResults.items || []));

        // Check if there's a next page
        const hasNextPage = response.data.includes('next-page') ||
                           response.data.includes(`page=${page + 1}`);

        if (!hasNextPage) break;

        // Rate limiting
        await new Promise(resolve => setTimeout(resolve, 1000));
    }

    return allResults;
}

Best Practices for Deepseek Web Scraping

1. Optimize HTML Input

Reduce token usage by cleaning HTML before sending to Deepseek:

from bs4 import BeautifulSoup

def clean_html(html):
    soup = BeautifulSoup(html, 'html.parser')

    # Remove unnecessary elements
    for tag in soup(['script', 'style', 'nav', 'footer', 'header']):
        tag.decompose()

    # Remove attributes to reduce size
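    # Caveat: this also drops href/src, so keep any attributes you need for links or images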
    for tag in soup.find_all(True):
        tag.attrs = {}

    return str(soup)
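
To see the savings, compare payload sizes before and after cleaning (illustrative; the URL is a placeholder):

import requests

raw = requests.get("https://example.com").text
cleaned = clean_html(raw)
print(f"raw: {len(raw)} chars, cleaned: {len(cleaned)} chars")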

2. Use Structured Output

Always request JSON format for consistent parsing:

prompt = """
Extract product information in the following JSON format:
{
    "products": [
        {
            "name": "string",
            "price": "number",
            "currency": "string",
            "in_stock": "boolean"
        }
    ]
}
"""

3. Implement Error Handling

import time
import requests

def safe_scrape(url, prompt, max_retries=3):
    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()

            completion = client.chat.completions.create(
                model="deepseek-chat",
                messages=[
                    {"role": "user", "content": f"{prompt}\n\n{response.text[:8000]}"}
                ],
                response_format={"type": "json_object"}
            )

            return completion.choices[0].message.content

        except requests.RequestException as e:
            print(f"Request failed (attempt {attempt + 1}): {e}")
        except Exception as e:
            print(f"Extraction failed (attempt {attempt + 1}): {e}")

        if attempt < max_retries - 1:
            time.sleep(2 ** attempt)  # Exponential backoff

    return None

4. Monitor Token Usage and Costs

def scrape_with_cost_tracking(html, prompt):
    completion = client.chat.completions.create(
        model="deepseek-chat",
        messages=[
            {"role": "user", "content": f"{prompt}\n\n{html}"}
        ],
        response_format={"type": "json_object"}
    )

    usage = completion.usage
    print(f"Tokens used - Input: {usage.prompt_tokens}, Output: {usage.completion_tokens}")
    print(f"Total tokens: {usage.total_tokens}")

    return completion.choices[0].message.content
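
To turn those counts into a dollar estimate, multiply by your per-token rates. The rates below are placeholders, not actual Deepseek prices; check the official pricing page for current values:

# Placeholder rates in USD per 1M tokens -- NOT actual Deepseek pricing
INPUT_PRICE_PER_1M = 0.27
OUTPUT_PRICE_PER_1M = 1.10

def estimate_cost(usage):
    return (usage.prompt_tokens * INPUT_PRICE_PER_1M
            + usage.completion_tokens * OUTPUT_PRICE_PER_1M) / 1_000_000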

Combining Deepseek with Traditional Tools

For optimal results, combine Deepseek with traditional scraping tools. Use Playwright for browser automation to handle dynamic content, then use Deepseek for intelligent extraction:

from playwright.sync_api import sync_playwright
from openai import OpenAI

client = OpenAI(
    api_key="your-deepseek-api-key",
    base_url="https://api.deepseek.com"
)

def hybrid_scraping_approach(url):
    # Step 1: Use Playwright for navigation and dynamic content
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        page.wait_for_selector('.product-list')
        html = page.content()
        browser.close()

    # Step 2: Use Deepseek for intelligent extraction
    completion = client.chat.completions.create(
        model="deepseek-chat",
        messages=[
            {
                "role": "user",
                "content": f"Extract all products with name, price, and rating as JSON array\n\n{html[:8000]}"
            }
        ],
        response_format={"type": "json_object"}
    )

    return completion.choices[0].message.content

Conclusion

Deepseek offers a powerful, cost-effective approach to web scraping, especially for complex or frequently-changing websites. By combining Deepseek's AI capabilities with traditional scraping tools like Playwright for handling dynamic content, you can build robust scraping solutions that adapt to layout changes without constant maintenance.

Remember to always respect robots.txt files, implement rate limiting, and follow the website's terms of service when scraping. Start with small projects to understand token usage and costs, then scale up as you become more comfortable with the Deepseek API.
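
Python's standard library can handle the robots.txt check; here is a minimal sketch (the user agent string and URLs are placeholders):

import time
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

if rp.can_fetch("MyScraperBot", "https://example.com/product/123"):
    # safe to fetch; pause between requests as simple rate limiting
    time.sleep(1)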

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
