How do I use function calling with Deepseek LLM for web scraping?

Function calling with Deepseek LLM enables structured, reliable web scraping by allowing the model to invoke predefined functions for data extraction. This approach combines the intelligence of large language models with the precision of traditional programming, making it ideal for complex web scraping tasks where you need both flexibility and consistency.

Understanding Function Calling in Deepseek

Function calling is a feature in modern LLMs that allows the model to generate structured function calls based on natural language instructions. Instead of returning unstructured text, Deepseek can identify when to call specific functions and return properly formatted JSON parameters that your code can execute.
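
To make that concrete, here is the rough shape of the message an OpenAI-compatible chat API returns when the model decides to call a function (the field values below are invented for illustration):

# Illustrative only: the message object returned when the model calls a
# function. The values here are made up for the example.
example_message = {
    "role": "assistant",
    "content": None,
    "function_call": {
        "name": "extract_product_data",
        "arguments": '{"name": "Acme Widget", "price": 19.99, "currency": "USD"}'
    }
}

Note that "arguments" arrives as a JSON string, which is why the examples below parse it with json.loads.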

For web scraping, this means you can:

  • Define extraction functions with specific schemas
  • Let Deepseek determine which function to call based on page content
  • Receive structured data in a predictable format
  • Chain multiple extraction steps together

Setting Up Function Calling with Deepseek

Prerequisites

First, install the required dependencies:

pip install openai requests beautifulsoup4

The Deepseek API is compatible with the OpenAI SDK, which makes integration straightforward.

Defining Scraping Functions

Start by defining the functions you want Deepseek to call. Here's an example for extracting product information:

import openai
import requests
from bs4 import BeautifulSoup

# Configure Deepseek API
client = openai.OpenAI(
    api_key="your-deepseek-api-key",
    base_url="https://api.deepseek.com"
)

# Define function schemas
functions = [
    {
        "name": "extract_product_data",
        "description": "Extract structured product information from HTML",
        "parameters": {
            "type": "object",
            "properties": {
                "name": {
                    "type": "string",
                    "description": "The product name"
                },
                "price": {
                    "type": "number",
                    "description": "The product price as a number"
                },
                "currency": {
                    "type": "string",
                    "description": "The currency code (e.g., USD, EUR)"
                },
                "availability": {
                    "type": "string",
                    "enum": ["in_stock", "out_of_stock", "preorder"],
                    "description": "Product availability status"
                },
                "rating": {
                    "type": "number",
                    "description": "Product rating (0-5)"
                },
                "reviews_count": {
                    "type": "integer",
                    "description": "Number of customer reviews"
                }
            },
            "required": ["name", "price", "currency"]
        }
    },
    {
        "name": "extract_article_metadata",
        "description": "Extract metadata from article or blog post pages",
        "parameters": {
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "author": {"type": "string"},
                "publish_date": {"type": "string"},
                "tags": {
                    "type": "array",
                    "items": {"type": "string"}
                },
                "summary": {"type": "string"}
            },
            "required": ["title"]
        }
    }
]

Implementing the Scraping Workflow

Basic Function Calling Example

Here's a complete example that fetches a webpage and uses Deepseek to extract structured data:

def scrape_with_function_calling(url):
    # Fetch the HTML content
    response = requests.get(url, headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    })
    html_content = response.text

    # Use BeautifulSoup to extract clean text
    soup = BeautifulSoup(html_content, 'html.parser')

    # Remove script and style elements
    for script in soup(["script", "style"]):
        script.decompose()

    page_text = soup.get_text(separator='\n', strip=True)

    # Truncate if too long (Deepseek has token limits)
    max_chars = 10000
    if len(page_text) > max_chars:
        page_text = page_text[:max_chars]

    # Call Deepseek with function definitions
    messages = [
        {
            "role": "system",
            "content": "You are a web scraping assistant. Extract data from the provided HTML content using the available functions."
        },
        {
            "role": "user",
            "content": f"Extract all relevant information from this page:\n\n{page_text}"
        }
    ]

    response = client.chat.completions.create(
        model="deepseek-chat",
        messages=messages,
        functions=functions,
        function_call="auto"  # Let the model decide which function to call
    )

    return response

# Example usage
result = scrape_with_function_calling("https://example.com/product/12345")
message = result.choices[0].message

if message.function_call:
    print(f"Function called: {message.function_call.name}")
    print(f"Arguments: {message.function_call.arguments}")

Processing Function Call Results

After Deepseek returns a function call, you need to process the results:

import json

def process_extraction_result(response):
    message = response.choices[0].message

    if not message.function_call:
        # No function was called, return the text response
        return {"type": "text", "content": message.content}

    # Parse the function call
    function_name = message.function_call.name
    arguments = json.loads(message.function_call.arguments)

    # Execute the extraction based on function name
    if function_name == "extract_product_data":
        return {
            "type": "product",
            "data": arguments
        }
    elif function_name == "extract_article_metadata":
        return {
            "type": "article",
            "data": arguments
        }

    return arguments

# Process the result
extracted_data = process_extraction_result(result)
print(json.dumps(extracted_data, indent=2))

JavaScript Implementation

Here's how to implement function calling with Deepseek in JavaScript:

const OpenAI = require('openai');
const axios = require('axios');
const cheerio = require('cheerio');

const client = new OpenAI({
  apiKey: 'your-deepseek-api-key',
  baseURL: 'https://api.deepseek.com'
});

const functions = [
  {
    name: 'extract_product_data',
    description: 'Extract structured product information from HTML',
    parameters: {
      type: 'object',
      properties: {
        name: { type: 'string', description: 'The product name' },
        price: { type: 'number', description: 'The product price' },
        currency: { type: 'string', description: 'Currency code' },
        availability: {
          type: 'string',
          enum: ['in_stock', 'out_of_stock', 'preorder']
        }
      },
      required: ['name', 'price', 'currency']
    }
  }
];

async function scrapeWithFunctionCalling(url) {
  // Fetch HTML
  const response = await axios.get(url, {
    headers: {
      'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    }
  });

  // Parse HTML
  const $ = cheerio.load(response.data);

  // Remove scripts and styles
  $('script, style').remove();
  const pageText = $('body').text().trim().substring(0, 10000);

  // Call Deepseek API
  const completion = await client.chat.completions.create({
    model: 'deepseek-chat',
    messages: [
      {
        role: 'system',
        content: 'Extract product data from the provided page content.'
      },
      {
        role: 'user',
        content: `Extract information from this page:\n\n${pageText}`
      }
    ],
    functions: functions,
    function_call: 'auto'
  });

  const message = completion.choices[0].message;

  if (message.function_call) {
    return {
      function: message.function_call.name,
      data: JSON.parse(message.function_call.arguments)
    };
  }

  return { type: 'text', content: message.content };
}

// Usage
scrapeWithFunctionCalling('https://example.com/product/12345')
  .then(result => console.log(JSON.stringify(result, null, 2)))
  .catch(error => console.error('Error:', error));

Advanced Patterns for Web Scraping

Multi-Step Extraction with Function Chaining

For complex pages, you might need multiple function calls:

def multi_step_scraping(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    page_text = soup.get_text(separator='\n', strip=True)[:10000]

    messages = [
        {
            "role": "system",
            "content": "You are a web scraping expert. Analyze pages and extract data systematically."
        },
        {
            "role": "user",
            "content": f"First, identify what type of page this is:\n\n{page_text}"
        }
    ]

    # First call: Identify page type
    response = client.chat.completions.create(
        model="deepseek-chat",
        messages=messages,
        functions=[
            {
                "name": "identify_page_type",
                "description": "Identify the type of webpage",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "page_type": {
                            "type": "string",
                            "enum": ["product", "article", "listing", "other"]
                        }
                    },
                    "required": ["page_type"]
                }
            }
        ],
        function_call={"name": "identify_page_type"}
    )

    page_type_data = json.loads(response.choices[0].message.function_call.arguments)
    page_type = page_type_data["page_type"]

    # Second call: Extract data based on page type
    messages.append(response.choices[0].message)  # Assistant turn containing the function_call
    messages.append({
        "role": "user",
        "content": f"Now extract all {page_type} data from the page."
    })

    # Select appropriate function based on page type
    extraction_functions = {
        "product": functions[0],  # extract_product_data
        "article": functions[1]    # extract_article_metadata
    }

    final_response = client.chat.completions.create(
        model="deepseek-chat",
        messages=messages,
        functions=[extraction_functions.get(page_type, functions[0])],
        function_call="auto"
    )

    return process_extraction_result(final_response)
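
A quick usage sketch (the URL is a placeholder): the first call classifies the page, the second extracts with the matching schema.

result = multi_step_scraping("https://example.com/blog/how-to-scrape")
print(json.dumps(result, indent=2))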

Batch Processing Multiple Pages

When scraping multiple pages, implement batch processing with rate limiting:

import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def scrape_multiple_urls(urls, max_workers=3, delay=1):
    results = []

    def scrape_single(url):
        try:
            result = scrape_with_function_calling(url)
            extracted = process_extraction_result(result)
            time.sleep(delay)  # Per-worker delay to avoid hammering the API
            return {"url": url, "data": extracted, "success": True}
        except Exception as e:
            return {"url": url, "error": str(e), "success": False}

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(scrape_single, url): url for url in urls}

        for future in as_completed(futures):
            results.append(future.result())

    return results

# Example
urls = [
    "https://example.com/product/1",
    "https://example.com/product/2",
    "https://example.com/product/3"
]

batch_results = scrape_multiple_urls(urls)
for result in batch_results:
    if result["success"]:
        print(f"Successfully scraped {result['url']}")
        print(json.dumps(result["data"], indent=2))
    else:
        print(f"Failed to scrape {result['url']}: {result['error']}")

Error Handling and Validation

Implement robust error handling for production use:

def safe_function_calling(url, max_retries=3):
    for attempt in range(max_retries):
        try:
            # Fetch content
            response = requests.get(url, timeout=10)
            response.raise_for_status()

            # Parse HTML
            soup = BeautifulSoup(response.text, 'html.parser')
            page_text = soup.get_text(separator='\n', strip=True)[:10000]

            # Call Deepseek
            completion = client.chat.completions.create(
                model="deepseek-chat",
                messages=[
                    {"role": "system", "content": "Extract structured data."},
                    {"role": "user", "content": f"Extract data:\n\n{page_text}"}
                ],
                functions=functions,
                function_call="auto",
                temperature=0  # Low temperature for more consistent extraction
            )

            message = completion.choices[0].message

            if not message.function_call:
                raise ValueError("No function call returned")

            # Validate the extracted data
            data = json.loads(message.function_call.arguments)

            # Basic validation
            if message.function_call.name == "extract_product_data":
                if not data.get("name") or not data.get("price"):
                    raise ValueError("Missing required product fields")

            return {
                "success": True,
                "function": message.function_call.name,
                "data": data
            }

        except requests.RequestException as e:
            if attempt == max_retries - 1:
                return {"success": False, "error": f"HTTP error: {str(e)}"}
            time.sleep(2 ** attempt)  # Exponential backoff

        except json.JSONDecodeError as e:
            return {"success": False, "error": f"JSON parsing error: {str(e)}"}

        except Exception as e:
            if attempt == max_retries - 1:
                return {"success": False, "error": f"Extraction error: {str(e)}"}
            time.sleep(1)

    return {"success": False, "error": "Max retries exceeded"}

Best Practices

1. Design Clear Function Schemas

Make your function parameters specific and well-documented:

{
    "name": "extract_contact_info",
    "description": "Extract contact information from a business or contact page",
    "parameters": {
        "type": "object",
        "properties": {
            "email": {
                "type": "string",
                "description": "Email address in standard format (e.g., contact@example.com)"
            },
            "phone": {
                "type": "string",
                "description": "Phone number with country code if available"
            },
            "address": {
                "type": "object",
                "properties": {
                    "street": {"type": "string"},
                    "city": {"type": "string"},
                    "country": {"type": "string"},
                    "postal_code": {"type": "string"}
                }
            }
        }
    }
}

2. Optimize Token Usage

Reduce costs by sending only relevant content to the API:

def extract_relevant_content(html, content_type="product"):
    soup = BeautifulSoup(html, 'html.parser')

    # Target specific sections based on common patterns
    if content_type == "product":
        # Look for product containers
        product_section = soup.find('div', {'class': ['product', 'item', 'product-detail']})
        if product_section:
            return product_section.get_text(separator='\n', strip=True)

    # Fallback to general extraction
    for tag in soup(['script', 'style', 'nav', 'footer', 'header']):
        tag.decompose()

    return soup.get_text(separator='\n', strip=True)[:8000]
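
If you need a rough budget rather than exact counts, a common heuristic is about four characters per token for English text; for precise numbers you would use the provider's tokenizer. A small helper based on that approximation:

def truncate_to_token_budget(text, max_tokens=2500, chars_per_token=4):
    # Rough heuristic: ~4 characters per token for English text.
    # This is an approximation, not an exact token count.
    return text[:max_tokens * chars_per_token]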

3. Use Temperature=0 for Consistency

For web scraping, you want deterministic results:

completion = client.chat.completions.create(
    model="deepseek-chat",
    messages=messages,
    functions=functions,
    function_call="auto",
    temperature=0  # Minimizes run-to-run variation in extraction
)

4. Combine with Traditional Scraping

Use traditional web scraping tools for structure and Deepseek for understanding:

def hybrid_scraping(url):
    # Use requests/BeautifulSoup for structure
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')

    # Extract structured parts traditionally
    title = soup.find('h1')
    price_element = soup.find('span', {'class': 'price'})

    # Use Deepseek for complex parts
    description_html = soup.find('div', {'class': 'description'})
    features_data = {}  # Fallback if no description section is found

    if description_html:
        description_text = description_html.get_text()

        # Use function calling for intelligent extraction
        features_response = client.chat.completions.create(
            model="deepseek-chat",
            messages=[{
                "role": "user",
                "content": f"Extract key features as a list:\n{description_text}"
            }],
            functions=[{
                "name": "extract_features",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "features": {
                            "type": "array",
                            "items": {"type": "string"}
                        }
                    }
                }
            }],
            function_call={"name": "extract_features"}
        )

        features_data = json.loads(
            features_response.choices[0].message.function_call.arguments
        )

    return {
        "title": title.text if title else None,
        "price": price_element.text if price_element else None,
        "features": features_data.get("features", [])
    }

When to Use Function Calling vs. Other Methods

Function calling with Deepseek is ideal when:

  • You need structured, validated output
  • The page structure varies but semantic content is consistent
  • You're extracting complex entities that require understanding
  • You want to avoid maintaining brittle CSS selectors

For simpler pages with stable markup, plain CSS selectors are usually faster and cheaper. Note that neither approach renders JavaScript: for dynamic pages you still need a rendering layer, such as a headless browser, to produce the HTML before any extraction method can work.

Conclusion

Function calling with Deepseek LLM provides a powerful middle ground between fully manual parsing and completely AI-driven extraction. By defining clear function schemas and combining LLM intelligence with traditional scraping techniques, you can build robust, maintainable web scraping solutions that handle real-world complexity while maintaining structure and reliability.

The key to success is thoughtful function design, proper error handling, and knowing when to use AI versus traditional methods. Start with clear, simple functions and gradually expand as you understand your data extraction needs better.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
