What is Function Calling in LLMs and Why is It Useful for Data Extraction?
Function calling (also known as tool calling) is a feature in modern Large Language Models that enables the model to generate structured outputs conforming to predefined schemas. For data extraction and web scraping, this means you can instruct an LLM to extract information in a specific format with guaranteed structure, type safety, and consistency—eliminating the common problem of unreliable or malformed responses.
Instead of hoping the LLM returns valid JSON in free-form text, function calling ensures the model's output matches your exact data schema. This makes it invaluable for production web scraping pipelines, API integrations, and automated data processing workflows where reliability and consistency are critical.
Understanding Function Calling
Function calling allows you to describe one or more functions with specific parameters and data types to the LLM. The model then intelligently extracts information from the provided content and structures it to match those function parameters. Essentially, the model treats data extraction as "calling a function" with the extracted values as arguments.
Key Capabilities of Function Calling
The function calling mechanism provides several powerful capabilities, illustrated in the compact schema sketch after this list:
- Guaranteed Schema Compliance: Output always matches your predefined JSON schema
- Type Validation: Fields are validated as specific types (string, number, boolean, array, object)
- Required Fields: Enforce that critical data points must be present in the output
- Nested Structures: Support complex data hierarchies with nested objects and arrays
- Enum Constraints: Limit values to predefined options for classification tasks
- Multiple Items: Reliably extract arrays of objects (like product lists or search results)
- Production Reliability: Consistent output format enables automated processing without manual validation
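For orientation, here is a compact, hypothetical schema that exercises several of these capabilities at once (typed fields, required fields, an enum, and an array of nested objects). The field and function names are purely illustrative:

# Illustrative schema showing typed, required, enum, nested, and array fields
review_schema = {
    "name": "extract_reviews",
    "description": "Extract customer reviews from page content",
    "parameters": {
        "type": "object",
        "properties": {
            "product_name": {"type": "string"},      # type validation
            "reviews": {                              # multiple items
                "type": "array",
                "items": {                            # nested structure
                    "type": "object",
                    "properties": {
                        "author": {"type": "string"},
                        "rating": {"type": "number"},
                        "sentiment": {                # enum constraint
                            "type": "string",
                            "enum": ["positive", "negative", "neutral"]
                        }
                    },
                    "required": ["rating"]
                }
            }
        },
        "required": ["product_name", "reviews"]       # required fields
    }
}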
How Function Calling Works
The process involves three main steps:
1. Define the function schema: Describe the structure of data you want to extract, including field names, types, descriptions, and constraints
2. Provide content: Send the content to analyze (HTML, text, JSON, or any data)
3. Receive structured data: Get back data that perfectly matches your schema
The LLM analyzes the content and generates output that conforms to the defined schema, effectively "calling" your function with the extracted data as arguments.
Basic Function Calling for Data Extraction
Python Example: Extracting Product Information
Here's a practical example of using function calling with OpenAI's API to extract product data from a webpage:
from openai import OpenAI
import requests
from bs4 import BeautifulSoup
import json
client = OpenAI(api_key='your-api-key')
# Fetch webpage content
url = 'https://example.com/product'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
# Get clean text content
content = soup.get_text(separator=' ', strip=True)[:4000]
# Define the function schema for extraction
tools = [
{
"type": "function",
"function": {
"name": "extract_product_data",
"description": "Extract product information from webpage content",
"parameters": {
"type": "object",
"properties": {
"product_name": {
"type": "string",
"description": "The name or title of the product"
},
"price": {
"type": "number",
"description": "Product price as a numeric value"
},
"currency": {
"type": "string",
"description": "Currency code like USD, EUR, GBP"
},
"in_stock": {
"type": "boolean",
"description": "Whether the product is currently available"
},
"rating": {
"type": "number",
"description": "Average customer rating (0-5)"
},
"description": {
"type": "string",
"description": "Brief product description"
}
},
"required": ["product_name", "price", "currency"]
}
}
}
]
# Call the API with function calling enabled
completion = client.chat.completions.create(
model="gpt-4",
messages=[
{
"role": "system",
"content": "You are a data extraction assistant. Extract product information accurately."
},
{
"role": "user",
"content": f"Extract product data from this content:\n\n{content}"
}
],
tools=tools,
tool_choice={"type": "function", "function": {"name": "extract_product_data"}}
)
# Parse the function call result
tool_call = completion.choices[0].message.tool_calls[0]
product_data = json.loads(tool_call.function.arguments)
print(json.dumps(product_data, indent=2))
Expected Output:
{
"product_name": "Premium Wireless Headphones",
"price": 299.99,
"currency": "USD",
"in_stock": true,
"rating": 4.5,
"description": "High-quality wireless headphones with active noise cancellation"
}
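In practice, tool_calls can occasionally be empty (for example, when the page content has nothing matching the schema), so a small defensive check before parsing is prudent. This sketch reuses the completion object from the example above:

# Guard against a missing tool call before parsing (minimal sketch)
message = completion.choices[0].message
if message.tool_calls:
    product_data = json.loads(message.tool_calls[0].function.arguments)
else:
    # The model answered in plain text instead of calling the function
    print("No structured output returned:", message.content)
    product_data = None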
JavaScript Example: Extracting Article Metadata
Using Anthropic's Claude API for article data extraction:
const Anthropic = require('@anthropic-ai/sdk');
const axios = require('axios');
const cheerio = require('cheerio');
const anthropic = new Anthropic({
apiKey: process.env.ANTHROPIC_API_KEY
});
async function extractArticleData(url) {
// Fetch webpage
const response = await axios.get(url);
const $ = cheerio.load(response.data);
const content = $('body').text().trim().substring(0, 4000);
// Define extraction schema using Claude's tool use
const tools = [
{
name: "extract_article",
description: "Extract structured article information from webpage content",
input_schema: {
type: "object",
properties: {
title: {
type: "string",
description: "The article headline or title"
},
author: {
type: "string",
description: "Author name"
},
publish_date: {
type: "string",
description: "Publication date in ISO format"
},
summary: {
type: "string",
description: "Brief summary of the article content"
},
tags: {
type: "array",
items: { type: "string" },
description: "Article tags or categories"
},
word_count: {
type: "number",
description: "Approximate word count"
}
},
required: ["title", "author"]
}
}
];
// Make API call with tool use
const message = await anthropic.messages.create({
model: 'claude-3-5-sonnet-20241022',
max_tokens: 1024,
    tools: tools,
    // Force the model to use the extraction tool rather than reply in plain text
    tool_choice: { type: 'tool', name: 'extract_article' },
messages: [{
role: 'user',
content: `Extract article information from this content:\n\n${content}`
}]
});
// Parse the tool use result
const toolUse = message.content.find(block => block.type === 'tool_use');
return toolUse.input;
}
// Usage
extractArticleData('https://example.com/blog/article')
.then(data => console.log(JSON.stringify(data, null, 2)))
.catch(error => console.error('Error:', error));
Advanced Use Cases
Extracting Arrays of Items
When scraping listing pages or search results, you need to extract multiple items. Function calling excels at this:
from openai import OpenAI
import requests
from bs4 import BeautifulSoup
import json
client = OpenAI(api_key='your-api-key')
# Fetch product listing page
url = 'https://example.com/products'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
content = soup.get_text(separator=' ', strip=True)[:8000]
# Define schema for multiple products
tools = [
{
"type": "function",
"function": {
"name": "extract_product_list",
"description": "Extract all products from a listing page",
"parameters": {
"type": "object",
"properties": {
"products": {
"type": "array",
"description": "List of all products found",
"items": {
"type": "object",
"properties": {
"name": {
"type": "string",
"description": "Product name"
},
"price": {
"type": "number",
"description": "Price as numeric value"
},
"currency": {
"type": "string",
"description": "Currency code"
},
"availability": {
"type": "string",
"enum": ["in_stock", "out_of_stock", "preorder", "discontinued"],
"description": "Product availability status"
},
"rating": {
"type": "number",
"description": "Customer rating (0-5)"
}
},
"required": ["name", "price", "currency"]
}
},
"total_products": {
"type": "number",
"description": "Total number of products found"
},
"page_number": {
"type": "number",
"description": "Current page number if pagination exists"
}
},
"required": ["products", "total_products"]
}
}
}
]
completion = client.chat.completions.create(
model="gpt-4",
messages=[
{"role": "system", "content": "Extract all products from the listing page."},
{"role": "user", "content": f"Extract products:\n\n{content}"}
],
tools=tools,
tool_choice={"type": "function", "function": {"name": "extract_product_list"}}
)
tool_call = completion.choices[0].message.tool_calls[0]
result = json.loads(tool_call.function.arguments)
print(f"Found {result['total_products']} products:")
for product in result['products']:
status = product.get('availability', 'unknown')
print(f"- {product['name']}: {product['price']} {product['currency']} ({status})")
Nested Object Extraction
For complex data with nested structures, like products with specifications:
const OpenAI = require('openai');
const puppeteer = require('puppeteer');
const openai = new OpenAI({ apiKey: 'your-api-key' });
async function extractComplexProduct(url) {
// Use Puppeteer for dynamic content
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto(url, { waitUntil: 'networkidle2' });
const content = await page.evaluate(() => document.body.innerText);
await browser.close();
const tools = [
{
type: "function",
function: {
name: "extract_product_details",
description: "Extract comprehensive product details with nested specifications",
parameters: {
type: "object",
properties: {
basic_info: {
type: "object",
properties: {
name: { type: "string" },
brand: { type: "string" },
model: { type: "string" },
price: { type: "number" },
currency: { type: "string" }
},
required: ["name", "price"]
},
specifications: {
type: "object",
properties: {
dimensions: {
type: "object",
properties: {
length: { type: "number" },
width: { type: "number" },
height: { type: "number" },
unit: { type: "string" }
}
},
weight: {
type: "object",
properties: {
value: { type: "number" },
unit: { type: "string" }
}
},
features: {
type: "array",
items: { type: "string" }
}
}
},
ratings: {
type: "object",
properties: {
average: { type: "number" },
count: { type: "number" },
distribution: {
type: "object",
properties: {
five_star: { type: "number" },
four_star: { type: "number" },
three_star: { type: "number" },
two_star: { type: "number" },
one_star: { type: "number" }
}
}
}
}
},
required: ["basic_info"]
}
}
}
];
const completion = await openai.chat.completions.create({
model: "gpt-4",
messages: [
{ role: "system", content: "Extract detailed product information." },
{ role: "user", content: `Extract data:\n\n${content.substring(0, 6000)}` }
],
tools: tools,
tool_choice: { type: "function", function: { name: "extract_product_details" } }
});
const toolCall = completion.choices[0].message.tool_calls[0];
return JSON.parse(toolCall.function.arguments);
}
When working with dynamic web applications, combining browser automation with function calling ensures both complete page rendering and reliable structured data extraction.
Classification and Entity Extraction
Use enums to classify content and extract entities:
from openai import OpenAI
import requests
import json
client = OpenAI(api_key='your-api-key')
tools = [
{
"type": "function",
"function": {
"name": "analyze_and_classify",
"description": "Classify content and extract key entities",
"parameters": {
"type": "object",
"properties": {
"content_type": {
"type": "string",
"enum": ["product_page", "article", "review", "forum_post",
"documentation", "news", "blog_post"],
"description": "Type of content on the page"
},
"sentiment": {
"type": "string",
"enum": ["positive", "negative", "neutral", "mixed"],
"description": "Overall sentiment of the content"
},
"primary_topic": {
"type": "string",
"description": "Main topic or subject matter"
},
"entities": {
"type": "object",
"properties": {
"people": {
"type": "array",
"items": {"type": "string"},
"description": "Names of people mentioned"
},
"organizations": {
"type": "array",
"items": {"type": "string"},
"description": "Companies or organizations mentioned"
},
"products": {
"type": "array",
"items": {"type": "string"},
"description": "Products or services mentioned"
},
"locations": {
"type": "array",
"items": {"type": "string"},
"description": "Geographic locations mentioned"
}
}
},
"key_facts": {
"type": "array",
"items": {"type": "string"},
"description": "Important facts or claims made"
}
},
"required": ["content_type", "sentiment", "primary_topic"]
}
}
}
]
response = requests.get('https://example.com/page')
content = response.text[:4000]
completion = client.chat.completions.create(
model="gpt-4",
messages=[
{"role": "user", "content": f"Analyze and classify:\n\n{content}"}
],
tools=tools,
tool_choice={"type": "function", "function": {"name": "analyze_and_classify"}}
)
tool_call = completion.choices[0].message.tool_calls[0]
analysis = json.loads(tool_call.function.arguments)
print(f"Content Type: {analysis['content_type']}")
print(f"Sentiment: {analysis['sentiment']}")
print(f"Topic: {analysis['primary_topic']}")
if 'entities' in analysis:
print(f"Organizations: {', '.join(analysis['entities'].get('organizations', []))}")
Why Function Calling is Superior for Data Extraction
1. Guaranteed Structure
Traditional LLM prompting might return inconsistent formats:
# Without function calling - unreliable
response = "The product is called Widget Pro and costs $299.99 USD"
# or
response = '{"name": "Widget Pro", price: "299.99", "currency": "USD"}' # Invalid JSON
# or
response = "{'name': 'Widget Pro', 'price': 299.99}" # Python dict format
With function calling, you always get valid, structured JSON:
# With function calling - guaranteed structure
{
"name": "Widget Pro",
"price": 299.99,
"currency": "USD"
}
2. Type Safety
Function calling enforces data types:
# Schema defines types
"price": {"type": "number"} # Must be a number, not string
"in_stock": {"type": "boolean"} # Must be true/false
"tags": {"type": "array", "items": {"type": "string"}} # Must be array of strings
This eliminates type conversion errors and validation logic in your code.
3. Required Field Enforcement
Specify which fields are mandatory:
"required": ["name", "price", "currency"]
The model will always include these fields in its output, which prevents incomplete records; if a value genuinely cannot be found it may guess or return a placeholder, so a lightweight downstream check is still worthwhile (see the validation sketch below).
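One way to express that check, assuming the pydantic package is available, is a model that mirrors the extraction schema. The class, field names, and the extracted_data variable here are illustrative:

from typing import Optional
from pydantic import BaseModel, ValidationError

# Illustrative model mirroring the extraction schema's types and required fields
class Product(BaseModel):
    name: str
    price: float
    currency: str
    in_stock: Optional[bool] = None

try:
    # extracted_data: the parsed function-call arguments from the API response
    product = Product(**extracted_data)
except ValidationError as err:
    # Log and skip (or retry) records that fail type/required-field checks
    print("Extraction failed validation:", err)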
4. Production Reliability
For automated pipelines, function calling provides the consistency needed:
def process_scraped_data(url):
    # Extract data with function calling
    # (extract_with_function_calling is a placeholder for your extraction helper)
    data = extract_with_function_calling(url)

    # No need for extensive validation - structure is guaranteed
    # Direct database insertion or API calls (database is a placeholder client)
    database.insert_product(
        name=data['name'],
        price=data['price'],
        currency=data['currency']
    )
Best Practices for Function Calling
1. Design Clear, Specific Schemas
Make your schemas descriptive and unambiguous:
# ❌ Vague schema
{
"name": "get_data",
"parameters": {
"type": "object",
"properties": {
"info": {"type": "string"}
}
}
}
# ✅ Clear and specific
{
"name": "extract_product_pricing",
"description": "Extract pricing information from an e-commerce product page",
"parameters": {
"type": "object",
"properties": {
"base_price": {
"type": "number",
"description": "The regular price as a decimal number without currency symbols"
},
"sale_price": {
"type": "number",
"description": "The discounted price if on sale, null otherwise"
},
"currency_code": {
"type": "string",
"description": "Three-letter ISO 4217 currency code (e.g., USD, EUR, GBP)"
},
"discount_percentage": {
"type": "number",
"description": "Percentage discount if on sale (0-100)"
}
},
"required": ["base_price", "currency_code"]
}
}
2. Optimize Content Size
Clean and minimize content before sending to reduce tokens and costs:
from bs4 import BeautifulSoup
import re
def prepare_content_for_extraction(html, max_chars=8000):
"""Optimize HTML for LLM extraction."""
soup = BeautifulSoup(html, 'html.parser')
# Remove unnecessary elements
for element in soup(['script', 'style', 'nav', 'footer',
'header', 'aside', 'iframe', 'noscript']):
element.decompose()
# Extract main content area if identifiable
main_content = (soup.find('main') or
soup.find('article') or
soup.find(class_='content') or
soup.body)
# Get clean text
text = main_content.get_text(separator=' ', strip=True)
# Remove excessive whitespace
text = re.sub(r'\s+', ' ', text).strip()
# Truncate if needed
return text[:max_chars]
3. Implement Error Handling
Always handle potential failures gracefully:
const OpenAI = require('openai');
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

async function extractWithRetry(content, schema, maxRetries = 3) {
for (let attempt = 0; attempt < maxRetries; attempt++) {
try {
const completion = await openai.chat.completions.create({
model: "gpt-4",
messages: [
{ role: "system", content: "Extract structured data accurately." },
{ role: "user", content: `Extract data:\n\n${content}` }
],
tools: [{ type: "function", function: schema }],
tool_choice: { type: "function", function: { name: schema.name } },
temperature: 0 // Deterministic output
});
const toolCall = completion.choices[0].message.tool_calls[0];
const result = JSON.parse(toolCall.function.arguments);
// Validate result has required fields
if (validateResult(result, schema)) {
return result;
}
} catch (error) {
console.error(`Attempt ${attempt + 1} failed:`, error.message);
if (attempt === maxRetries - 1) {
throw new Error(`Failed after ${maxRetries} attempts: ${error.message}`);
}
// Exponential backoff
await new Promise(resolve => setTimeout(resolve, Math.pow(2, attempt) * 1000));
}
  }
  throw new Error(`Extraction returned incomplete data after ${maxRetries} attempts`);
}
function validateResult(result, schema) {
const required = schema.parameters.required || [];
return required.every(field => field in result && result[field] !== null);
}
4. Monitor Costs and Token Usage
Track your API usage to optimize spending:
import tiktoken
def estimate_extraction_cost(content, schema, model="gpt-4"):
"""Estimate the cost of a function calling extraction."""
encoding = tiktoken.encoding_for_model(model)
# Count input tokens (content + schema + system message)
content_tokens = len(encoding.encode(content))
schema_tokens = len(encoding.encode(str(schema)))
system_tokens = 50 # Approximate
input_tokens = content_tokens + schema_tokens + system_tokens
# Estimate output tokens (varies by extraction complexity)
estimated_output_tokens = 200 # Conservative estimate
# Pricing (check current rates)
if model == "gpt-4":
input_cost_per_1k = 0.03
output_cost_per_1k = 0.06
else: # gpt-3.5-turbo
input_cost_per_1k = 0.0015
output_cost_per_1k = 0.002
total_cost = (
(input_tokens / 1000 * input_cost_per_1k) +
(estimated_output_tokens / 1000 * output_cost_per_1k)
)
return {
"input_tokens": input_tokens,
"estimated_output_tokens": estimated_output_tokens,
"estimated_cost_usd": round(total_cost, 4)
}
# Before extraction
cost_info = estimate_extraction_cost(content, schema)
print(f"Estimated cost: ${cost_info['estimated_cost_usd']}")
print(f"Input tokens: {cost_info['input_tokens']}")
Combining Function Calling with Web Scraping Workflows
Here's a complete production example integrating browser automation with function calling:
from playwright.sync_api import sync_playwright
from openai import OpenAI
import json
import time
from typing import Dict, Optional
class IntelligentScraper:
def __init__(self, openai_api_key: str):
self.client = OpenAI(api_key=openai_api_key)
def scrape_with_browser(self, url: str) -> str:
"""Fetch dynamic content using Playwright."""
with sync_playwright() as p:
browser = p.chromium.launch(headless=True)
page = browser.new_page()
# Navigate and wait for content
page.goto(url)
page.wait_for_load_state('networkidle')
# Get rendered content
content = page.content()
browser.close()
return content
def extract_structured_data(
self,
content: str,
schema: Dict,
max_retries: int = 3
) -> Optional[Dict]:
"""Extract data using function calling."""
# Clean and truncate content
from bs4 import BeautifulSoup
soup = BeautifulSoup(content, 'html.parser')
for tag in soup(['script', 'style', 'nav', 'footer']):
tag.decompose()
clean_content = soup.get_text(separator=' ', strip=True)[:8000]
tools = [{"type": "function", "function": schema}]
for attempt in range(max_retries):
try:
completion = self.client.chat.completions.create(
model="gpt-4",
messages=[
{
"role": "system",
"content": "Extract structured data accurately from the provided content."
},
{
"role": "user",
"content": f"Extract data:\n\n{clean_content}"
}
],
tools=tools,
tool_choice={"type": "function", "function": {"name": schema["name"]}},
temperature=0
)
tool_call = completion.choices[0].message.tool_calls[0]
return json.loads(tool_call.function.arguments)
except Exception as e:
print(f"Attempt {attempt + 1} failed: {e}")
if attempt < max_retries - 1:
time.sleep(2 ** attempt)
else:
raise
return None
def scrape_and_extract(self, url: str, schema: Dict) -> Optional[Dict]:
"""Complete pipeline: scrape and extract."""
try:
# Step 1: Fetch content with browser
print(f"Fetching {url}...")
html = self.scrape_with_browser(url)
# Step 2: Extract structured data with function calling
print("Extracting data...")
data = self.extract_structured_data(html, schema)
return data
except Exception as e:
print(f"Error processing {url}: {e}")
return None
# Usage example
if __name__ == "__main__":
scraper = IntelligentScraper(openai_api_key='your-api-key')
# Define extraction schema
product_schema = {
"name": "extract_product",
"description": "Extract comprehensive product information",
"parameters": {
"type": "object",
"properties": {
"name": {"type": "string"},
"brand": {"type": "string"},
"price": {"type": "number"},
"currency": {"type": "string"},
"in_stock": {"type": "boolean"},
"rating": {"type": "number"},
"review_count": {"type": "number"},
"features": {
"type": "array",
"items": {"type": "string"}
}
},
"required": ["name", "price", "currency"]
}
}
# Scrape product page
result = scraper.scrape_and_extract(
'https://example.com/product',
product_schema
)
if result:
print(json.dumps(result, indent=2))
Comparison: Function Calling vs Standard Prompting
| Aspect | Standard Prompting | Function Calling |
|--------|-------------------|------------------|
| Output Structure | Variable, may be malformed | Guaranteed schema compliance |
| Type Enforcement | None | Strong type validation |
| Required Fields | Cannot enforce | Schema-enforced requirements |
| Parse Errors | Common, needs validation | Rare, pre-validated by model |
| Production Readiness | Requires extensive validation | Production-ready outputs |
| Array Consistency | Inconsistent formats | Consistent array structures |
| Debugging | Difficult to trace issues | Clear schema validation errors |
| Development Speed | Slower due to validation code | Faster, less boilerplate |
Model Support and Limitations
Supported Models
Function calling is available in:
- OpenAI: GPT-4, GPT-4 Turbo, GPT-3.5-turbo (with reduced reliability)
- Anthropic: Claude 3 Opus, Claude 3.5 Sonnet (via tool use)
- Google: Gemini Pro and Ultra models
- Azure OpenAI: GPT-4 and GPT-3.5-turbo deployments
Limitations to Consider
Token Overhead: Function schemas add 100-500 tokens per request depending on complexity. Factor this into your content size limits.
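To see the overhead for a particular schema, one quick approach (assuming the tiktoken package, as used in the cost estimator earlier) is to encode the serialized schema and count the tokens:

import json
import tiktoken

def schema_token_overhead(schema, model="gpt-4"):
    """Rough token count for a function schema (sketch)."""
    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(json.dumps(schema)))

# e.g. schema_token_overhead(product_schema) typically lands in the low hundreds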
Context Windows: Very large pages may need chunking:
# GPT-4: ~8K tokens safe limit
# GPT-4 Turbo: ~120K tokens
# Claude 3.5 Sonnet: ~180K tokens
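One straightforward chunking approach, sketched below with a hypothetical extract_chunk helper (any function-calling extraction like the ones above), is to split the cleaned text into overlapping character windows, extract from each, and merge the partial results:

# Sketch: chunk long content and merge per-chunk extractions
def chunk_text(text, size=12000, overlap=500):
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

def extract_from_long_page(text):
    items = []
    for chunk in chunk_text(text):
        # extract_chunk is a placeholder for your function-calling extraction;
        # here it is assumed to return a dict with a 'products' array
        result = extract_chunk(chunk)
        items.extend(result.get('products', []))
    # De-duplicate by name in case items appear in overlapping chunks
    seen, merged = set(), []
    for item in items:
        if item['name'] not in seen:
            seen.add(item['name'])
            merged.append(item)
    return merged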
Complex Schemas: Deeply nested schemas (5+ levels) may reduce extraction accuracy. Keep schemas reasonable.
Cost: Function calling uses the same pricing as regular API calls, but remember to account for schema tokens in your cost calculations.
Conclusion
Function calling transforms data extraction from an unpredictable process into a reliable, type-safe operation. By defining clear schemas and letting the LLM conform to them, you eliminate the uncertainty of free-form responses and create production-ready extraction pipelines.
For web scraping workflows, function calling provides the reliability needed to process data at scale without extensive validation logic. Combined with traditional browser automation for navigation and modern web scraping APIs for infrastructure, it creates a powerful, maintainable stack that adapts to changing website structures while delivering consistent, structured results.
Whether you're extracting single items or processing arrays of complex nested objects, function calling ensures your data arrives in exactly the format your application expects—every single time.