What is Function Calling in LLMs and Why is It Useful for Data Extraction?
Function calling (also known as tool calling) is a feature in modern Large Language Models that enables the model to generate structured outputs conforming to predefined schemas. For data extraction and web scraping, this means you can instruct an LLM to extract information in a specific format with guaranteed structure, type safety, and consistency—eliminating the common problem of unreliable or malformed responses.
Instead of hoping the LLM returns valid JSON in free-form text, function calling ensures the model's output matches your exact data schema. This makes it invaluable for production web scraping pipelines, API integrations, and automated data processing workflows where reliability and consistency are critical.
Understanding Function Calling
Function calling allows you to describe one or more functions with specific parameters and data types to the LLM. The model then intelligently extracts information from the provided content and structures it to match those function parameters. Essentially, the model treats data extraction as "calling a function" with the extracted values as arguments.
Key Capabilities of Function Calling
The function calling mechanism provides several powerful capabilities, illustrated in the compact schema sketch after this list:
- Guaranteed Schema Compliance: Output always matches your predefined JSON schema
- Type Validation: Fields are validated as specific types (string, number, boolean, array, object)
- Required Fields: Enforce that critical data points must be present in the output
- Nested Structures: Support complex data hierarchies with nested objects and arrays
- Enum Constraints: Limit values to predefined options for classification tasks
- Multiple Items: Reliably extract arrays of objects (like product lists or search results)
- Production Reliability: Consistent output format enables automated processing without manual validation
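For orientation, here is a compact, hypothetical schema that exercises several of these capabilities at once (typed fields, required fields, an enum, and an array of nested objects). The field and function names are purely illustrative:

# Illustrative schema showing typed, required, enum, nested, and array fields
review_schema = {
    "name": "extract_reviews",
    "description": "Extract customer reviews from page content",
    "parameters": {
        "type": "object",
        "properties": {
            "product_name": {"type": "string"},      # type validation
            "reviews": {                              # multiple items
                "type": "array",
                "items": {                            # nested structure
                    "type": "object",
                    "properties": {
                        "author": {"type": "string"},
                        "rating": {"type": "number"},
                        "sentiment": {                # enum constraint
                            "type": "string",
                            "enum": ["positive", "negative", "neutral"]
                        }
                    },
                    "required": ["rating"]
                }
            }
        },
        "required": ["product_name", "reviews"]       # required fields
    }
}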
How Function Calling Works
The process involves three main steps:
1. Define the function schema: Describe the structure of data you want to extract, including field names, types, descriptions, and constraints
2. Provide content: Send the content to analyze (HTML, text, JSON, or any data)
3. Receive structured data: Get back data that perfectly matches your schema
The LLM analyzes the content and generates output that conforms to the defined schema, effectively "calling" your function with the extracted data as arguments.
Basic Function Calling for Data Extraction
Python Example: Extracting Product Information
Here's a practical example of using function calling with OpenAI's API to extract product data from a webpage:
from openai import OpenAI
import requests
from bs4 import BeautifulSoup
import json
client = OpenAI(api_key='your-api-key')
# Fetch webpage content
url = 'https://example.com/product'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
# Get clean text content
content = soup.get_text(separator=' ', strip=True)[:4000]
# Define the function schema for extraction
tools = [
{
"type": "function",
"function": {
"name": "extract_product_data",
"description": "Extract product information from webpage content",
"parameters": {
"type": "object",
"properties": {
"product_name": {
"type": "string",
"description": "The name or title of the product"
},
"price": {
"type": "number",
"description": "Product price as a numeric value"
},
"currency": {
"type": "string",
"description": "Currency code like USD, EUR, GBP"
},
"in_stock": {
"type": "boolean",
"description": "Whether the product is currently available"
},
"rating": {
"type": "number",
"description": "Average customer rating (0-5)"
},
"description": {
"type": "string",
"description": "Brief product description"
}
},
"required": ["product_name", "price", "currency"]
}
}
}
]
# Call the API with function calling enabled
completion = client.chat.completions.create(
model="gpt-4",
messages=[
{
"role": "system",
"content": "You are a data extraction assistant. Extract product information accurately."
},
{
"role": "user",
"content": f"Extract product data from this content:\n\n{content}"
}
],
tools=tools,
tool_choice={"type": "function", "function": {"name": "extract_product_data"}}
)
# Parse the function call result
tool_call = completion.choices[0].message.tool_calls[0]
product_data = json.loads(tool_call.function.arguments)
print(json.dumps(product_data, indent=2))
Expected Output:
{
"product_name": "Premium Wireless Headphones",
"price": 299.99,
"currency": "USD",
"in_stock": true,
"rating": 4.5,
"description": "High-quality wireless headphones with active noise cancellation"
}
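In practice, tool_calls can occasionally be empty (for example, when the page content has nothing matching the schema), so a small defensive check before parsing is prudent. This sketch reuses the completion object from the example above:

# Guard against a missing tool call before parsing (minimal sketch)
message = completion.choices[0].message
if message.tool_calls:
    product_data = json.loads(message.tool_calls[0].function.arguments)
else:
    # The model answered in plain text instead of calling the function
    print("No structured output returned:", message.content)
    product_data = None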
JavaScript Example: Extracting Article Metadata
Using Anthropic's Claude API for article data extraction:
const Anthropic = require('@anthropic-ai/sdk');
const axios = require('axios');
const cheerio = require('cheerio');
const anthropic = new Anthropic({
apiKey: process.env.ANTHROPIC_API_KEY
});
async function extractArticleData(url) {
// Fetch webpage
const response = await axios.get(url);
const $ = cheerio.load(response.data);
const content = $('body').text().trim().substring(0, 4000);
// Define extraction schema using Claude's tool use
const tools = [
{
name: "extract_article",
description: "Extract structured article information from webpage content",
input_schema: {
type: "object",
properties: {
title: {
type: "string",
description: "The article headline or title"
},
author: {
type: "string",
description: "Author name"
},
publish_date: {
type: "string",
description: "Publication date in ISO format"
},
summary: {
type: "string",
description: "Brief summary of the article content"
},
tags: {
type: "array",
items: { type: "string" },
description: "Article tags or categories"
},
word_count: {
type: "number",
description: "Approximate word count"
}
},
required: ["title", "author"]
}
}
];
// Make API call with tool use
const message = await anthropic.messages.create({
model: 'claude-3-5-sonnet-20241022',
max_tokens: 1024,
    tools: tools,
    // Force the model to use the extraction tool rather than reply in plain text
    tool_choice: { type: 'tool', name: 'extract_article' },
messages: [{
role: 'user',
content: `Extract article information from this content:\n\n${content}`
}]
});
// Parse the tool use result
const toolUse = message.content.find(block => block.type === 'tool_use');
return toolUse.input;
}
// Usage
extractArticleData('https://example.com/blog/article')
.then(data => console.log(JSON.stringify(data, null, 2)))
.catch(error => console.error('Error:', error));
Advanced Use Cases
Extracting Arrays of Items
When scraping listing pages or search results, you need to extract multiple items. Function calling excels at this:
from openai import OpenAI
import requests
from bs4 import BeautifulSoup
import json
client = OpenAI(api_key='your-api-key')
# Fetch product listing page
url = 'https://example.com/products'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
content = soup.get_text(separator=' ', strip=True)[:8000]
# Define schema for multiple products
tools = [
{
"type": "function",
"function": {
"name": "extract_product_list",
"description": "Extract all products from a listing page",
"parameters": {
"type": "object",
"properties": {
"products": {
"type": "array",
"description": "List of all products found",
"items": {
"type": "object",
"properties": {
"name": {
"type": "string",
"description": "Product name"
},
"price": {
"type": "number",
"description": "Price as numeric value"
},
"currency": {
"type": "string",
"description": "Currency code"
},
"availability": {
"type": "string",
"enum": ["in_stock", "out_of_stock", "preorder", "discontinued"],
"description": "Product availability status"
},
"rating": {
"type": "number",
"description": "Customer rating (0-5)"
}
},
"required": ["name", "price", "currency"]
}
},
"total_products": {
"type": "number",
"description": "Total number of products found"
},
"page_number": {
"type": "number",
"description": "Current page number if pagination exists"
}
},
"required": ["products", "total_products"]
}
}
}
]
completion = client.chat.completions.create(
model="gpt-4",
messages=[
{"role": "system", "content": "Extract all products from the listing page."},
{"role": "user", "content": f"Extract products:\n\n{content}"}
],
tools=tools,
tool_choice={"type": "function", "function": {"name": "extract_product_list"}}
)
tool_call = completion.choices[0].message.tool_calls[0]
result = json.loads(tool_call.function.arguments)
print(f"Found {result['total_products']} products:")
for product in result['products']:
status = product.get('availability', 'unknown')
print(f"- {product['name']}: {product['price']} {product['currency']} ({status})")
Nested Object Extraction
For complex data with nested structures, like products with specifications:
const OpenAI = require('openai');
const puppeteer = require('puppeteer');
const openai = new OpenAI({ apiKey: 'your-api-key' });
async function extractComplexProduct(url) {
// Use Puppeteer for dynamic content
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto(url, { waitUntil: 'networkidle2' });
const content = await page.evaluate(() => document.body.innerText);
await browser.close();
const tools = [
{
type: "function",
function: {
name: "extract_product_details",
description: "Extract comprehensive product details with nested specifications",
parameters: {
type: "object",
properties: {
basic_info: {
type: "object",
properties: {
name: { type: "string" },
brand: { type: "string" },
model: { type: "string" },
price: { type: "number" },
currency: { type: "string" }
},
required: ["name", "price"]
},
specifications: {
type: "object",
properties: {
dimensions: {
type: "object",
properties: {
length: { type: "number" },
width: { type: "number" },
height: { type: "number" },
unit: { type: "string" }
}
},
weight: {
type: "object",
properties: {
value: { type: "number" },
unit: { type: "string" }
}
},
features: {
type: "array",
items: { type: "string" }
}
}
},
ratings: {
type: "object",
properties: {
average: { type: "number" },
count: { type: "number" },
distribution: {
type: "object",
properties: {
five_star: { type: "number" },
four_star: { type: "number" },
three_star: { type: "number" },
two_star: { type: "number" },
one_star: { type: "number" }
}
}
}
}
},
required: ["basic_info"]
}
}
}
];
const completion = await openai.chat.completions.create({
model: "gpt-4",
messages: [
{ role: "system", content: "Extract detailed product information." },
{ role: "user", content: `Extract data:\n\n${content.substring(0, 6000)}` }
],
tools: tools,
tool_choice: { type: "function", function: { name: "extract_product_details" } }
});
const toolCall = completion.choices[0].message.tool_calls[0];
return JSON.parse(toolCall.function.arguments);
}
When working with dynamic web applications, combining browser automation with function calling ensures both complete page rendering and reliable structured data extraction.
Classification and Entity Extraction
Use enums to classify content and extract entities:
from openai import OpenAI
import requests
import json
client = OpenAI(api_key='your-api-key')
tools = [
{
"type": "function",
"function": {
"name": "analyze_and_classify",
"description": "Classify content and extract key entities",
"parameters": {
"type": "object",
"properties": {
"content_type": {
"type": "string",
"enum": ["product_page", "article", "review", "forum_post",
"documentation", "news", "blog_post"],
"description": "Type of content on the page"
},
"sentiment": {
"type": "string",
"enum": ["positive", "negative", "neutral", "mixed"],
"description": "Overall sentiment of the content"
},
"primary_topic": {
"type": "string",
"description": "Main topic or subject matter"
},
"entities": {
"type": "object",
"properties": {
"people": {
"type": "array",
"items": {"type": "string"},
"description": "Names of people mentioned"
},
"organizations": {
"type": "array",
"items": {"type": "string"},
"description": "Companies or organizations mentioned"
},
"products": {
"type": "array",
"items": {"type": "string"},
"description": "Products or services mentioned"
},
"locations": {
"type": "array",
"items": {"type": "string"},
"description": "Geographic locations mentioned"
}
}
},
"key_facts": {
"type": "array",
"items": {"type": "string"},
"description": "Important facts or claims made"
}
},
"required": ["content_type", "sentiment", "primary_topic"]
}
}
}
]
response = requests.get('https://example.com/page')
content = response.text[:4000]
completion = client.chat.completions.create(
model="gpt-4",
messages=[
{"role": "user", "content": f"Analyze and classify:\n\n{content}"}
],
tools=tools,
tool_choice={"type": "function", "function": {"name": "analyze_and_classify"}}
)
tool_call = completion.choices[0].message.tool_calls[0]
analysis = json.loads(tool_call.function.arguments)
print(f"Content Type: {analysis['content_type']}")
print(f"Sentiment: {analysis['sentiment']}")
print(f"Topic: {analysis['primary_topic']}")
if 'entities' in analysis:
print(f"Organizations: {', '.join(analysis['entities'].get('organizations', []))}")
Why Function Calling is Superior for Data Extraction
1. Guaranteed Structure
Traditional LLM prompting might return inconsistent formats:
# Without function calling - unreliable
response = "The product is called Widget Pro and costs $299.99 USD"
# or
response = '{"name": "Widget Pro", price: "299.99", "currency": "USD"}' # Invalid JSON
# or
response = "{'name': 'Widget Pro', 'price': 299.99}" # Python dict format
With function calling, you always get valid, structured JSON:
# With function calling - guaranteed structure
{
"name": "Widget Pro",
"price": 299.99,
"currency": "USD"
}
2. Type Safety
Function calling enforces data types:
# Schema defines types
"price": {"type": "number"} # Must be a number, not string
"in_stock": {"type": "boolean"} # Must be true/false
"tags": {"type": "array", "items": {"type": "string"}} # Must be array of strings
This eliminates type conversion errors and validation logic in your code.
3. Required Field Enforcement
Specify which fields are mandatory:
"required": ["name", "price", "currency"]
The model will always include these fields in its output, which prevents incomplete records; if a value genuinely cannot be found it may guess or return a placeholder, so a lightweight downstream check is still worthwhile (see the validation sketch below).
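One way to express that check, assuming the pydantic package is available, is a model that mirrors the extraction schema. The class, field names, and the extracted_data variable here are illustrative:

from typing import Optional
from pydantic import BaseModel, ValidationError

# Illustrative model mirroring the extraction schema's types and required fields
class Product(BaseModel):
    name: str
    price: float
    currency: str
    in_stock: Optional[bool] = None

try:
    # extracted_data: the parsed function-call arguments from the API response
    product = Product(**extracted_data)
except ValidationError as err:
    # Log and skip (or retry) records that fail type/required-field checks
    print("Extraction failed validation:", err)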
4. Production Reliability
For automated pipelines, function calling provides the consistency needed:
def process_scraped_data(url):
    # Extract data with function calling
    # (extract_with_function_calling is a placeholder for your extraction helper)
    data = extract_with_function_calling(url)

    # No need for extensive validation - structure is guaranteed
    # Direct database insertion or API calls (database is a placeholder client)
    database.insert_product(
        name=data['name'],
        price=data['price'],
        currency=data['currency']
    )
Best Practices for Function Calling
1. Design Clear, Specific Schemas
Make your schemas descriptive and unambiguous:
# ❌ Vague schema
{
"name": "get_data",
"parameters": {
"type": "object",
"properties": {
"info": {"type": "string"}
}
}
}
# ✅ Clear and specific
{
"name": "extract_product_pricing",
"description": "Extract pricing information from an e-commerce product page",
"parameters": {
"type": "object",
"properties": {
"base_price": {
"type": "number",
"description": "The regular price as a decimal number without currency symbols"
},
"sale_price": {
"type": "number",
"description": "The discounted price if on sale, null otherwise"
},
"currency_code": {
"type": "string",
"description": "Three-letter ISO 4217 currency code (e.g., USD, EUR, GBP)"
},
"discount_percentage": {
"type": "number",
"description": "Percentage discount if on sale (0-100)"
}
},
"required": ["base_price", "currency_code"]
}
}
2. Optimize Content Size
Clean and minimize content before sending to reduce tokens and costs:
from bs4 import BeautifulSoup
import re
def prepare_content_for_extraction(html, max_chars=8000):
"""Optimize HTML for LLM extraction."""
soup = BeautifulSoup(html, 'html.parser')
# Remove unnecessary elements
for element in soup(['script', 'style', 'nav', 'footer',
'header', 'aside', 'iframe', 'noscript']):
element.decompose()
# Extract main content area if identifiable
main_content = (soup.find('main') or
soup.find('article') or
soup.find(class_='content') or
soup.body)
# Get clean text
text = main_content.get_text(separator=' ', strip=True)
# Remove excessive whitespace
text = re.sub(r'\s+', ' ', text).strip()
# Truncate if needed
return text[:max_chars]
3. Implement Error Handling
Always handle potential failures gracefully:
const OpenAI = require('openai');
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

async function extractWithRetry(content, schema, maxRetries = 3) {
for (let attempt = 0; attempt < maxRetries; attempt++) {
try {
const completion = await openai.chat.completions.create({
model: "gpt-4",
messages: [
{ role: "system", content: "Extract structured data accurately." },
{ role: "user", content: `Extract data:\n\n${content}` }
],
tools: [{ type: "function", function: schema }],
tool_choice: { type: "function", function: { name: schema.name } },
temperature: 0 // Deterministic output
});
const toolCall = completion.choices[0].message.tool_calls[0];
const result = JSON.parse(toolCall.function.arguments);
// Validate result has required fields
if (validateResult(result, schema)) {
return result;
}
} catch (error) {
console.error(`Attempt ${attempt + 1} failed:`, error.message);
if (attempt === maxRetries - 1) {
throw new Error(`Failed after ${maxRetries} attempts: ${error.message}`);
}
// Exponential backoff
await new Promise(resolve => setTimeout(resolve, Math.pow(2, attempt) * 1000));
}
  }
  throw new Error(`Extraction returned incomplete data after ${maxRetries} attempts`);
}
function validateResult(result, schema) {
const required = schema.parameters.required || [];
return required.every(field => field in result && result[field] !== null);
}
4. Monitor Costs and Token Usage
Track your API usage to optimize spending:
import tiktoken
def estimate_extraction_cost(content, schema, model="gpt-4"):
"""Estimate the cost of a function calling extraction."""
encoding = tiktoken.encoding_for_model(model)
# Count input tokens (content + schema + system message)
content_tokens = len(encoding.encode(content))
schema_tokens = len(encoding.encode(str(schema)))
system_tokens = 50 # Approximate
input_tokens = content_tokens + schema_tokens + system_tokens
# Estimate output tokens (varies by extraction complexity)
estimated_output_tokens = 200 # Conservative estimate
# Pricing (check current rates)
if model == "gpt-4":
input_cost_per_1k = 0.03
output_cost_per_1k = 0.06
else: # gpt-3.5-turbo
input_cost_per_1k = 0.0015
output_cost_per_1k = 0.002
total_cost = (
(input_tokens / 1000 * input_cost_per_1k) +
(estimated_output_tokens / 1000 * output_cost_per_1k)
)
return {
"input_tokens": input_tokens,
"estimated_output_tokens": estimated_output_tokens,
"estimated_cost_usd": round(total_cost, 4)
}
# Before extraction
cost_info = estimate_extraction_cost(content, schema)
print(f"Estimated cost: ${cost_info['estimated_cost_usd']}")
print(f"Input tokens: {cost_info['input_tokens']}")
Combining Function Calling with Web Scraping Workflows
Here's a complete production example integrating browser automation with function calling:
from playwright.sync_api import sync_playwright
from openai import OpenAI
import json
import time
from typing import Dict, Optional
class IntelligentScraper:
def __init__(self, openai_api_key: str):
self.client = OpenAI(api_key=openai_api_key)
def scrape_with_browser(self, url: str) -> str:
"""Fetch dynamic content using Playwright."""
with sync_playwright() as p:
browser = p.chromium.launch(headless=True)
page = browser.new_page()
# Navigate and wait for content
page.goto(url)
page.wait_for_load_state('networkidle')
# Get rendered content
content = page.content()
browser.close()
return content
def extract_structured_data(
self,
content: str,
schema: Dict,
max_retries: int = 3
) -> Optional[Dict]:
"""Extract data using function calling."""
# Clean and truncate content
from bs4 import BeautifulSoup
soup = BeautifulSoup(content, 'html.parser')
for tag in soup(['script', 'style', 'nav', 'footer']):
tag.decompose()
clean_content = soup.get_text(separator=' ', strip=True)[:8000]
tools = [{"type": "function", "function": schema}]
for attempt in range(max_retries):
try:
completion = self.client.chat.completions.create(
model="gpt-4",
messages=[
{
"role": "system",
"content": "Extract structured data accurately from the provided content."
},
{
"role": "user",
"content": f"Extract data:\n\n{clean_content}"
}
],
tools=tools,
tool_choice={"type": "function", "function": {"name": schema["name"]}},
temperature=0
)
tool_call = completion.choices[0].message.tool_calls[0]
return json.loads(tool_call.function.arguments)
except Exception as e:
print(f"Attempt {attempt + 1} failed: {e}")
if attempt < max_retries - 1:
time.sleep(2 ** attempt)
else:
raise
return None
def scrape_and_extract(self, url: str, schema: Dict) -> Optional[Dict]:
"""Complete pipeline: scrape and extract."""
try:
# Step 1: Fetch content with browser
print(f"Fetching {url}...")
html = self.scrape_with_browser(url)
# Step 2: Extract structured data with function calling
print("Extracting data...")
data = self.extract_structured_data(html, schema)
return data
except Exception as e:
print(f"Error processing {url}: {e}")
return None
# Usage example
if __name__ == "__main__":
scraper = IntelligentScraper(openai_api_key='your-api-key')
# Define extraction schema
product_schema = {
"name": "extract_product",
"description": "Extract comprehensive product information",
"parameters": {
"type": "object",
"properties": {
"name": {"type": "string"},
"brand": {"type": "string"},
"price": {"type": "number"},
"currency": {"type": "string"},
"in_stock": {"type": "boolean"},
"rating": {"type": "number"},
"review_count": {"type": "number"},
"features": {
"type": "array",
"items": {"type": "string"}
}
},
"required": ["name", "price", "currency"]
}
}
# Scrape product page
result = scraper.scrape_and_extract(
'https://example.com/product',
product_schema
)
if result:
print(json.dumps(result, indent=2))
Comparison: Function Calling vs Standard Prompting
| Aspect | Standard Prompting | Function Calling |
|--------|-------------------|------------------|
| Output Structure | Variable, may be malformed | Guaranteed schema compliance |
| Type Enforcement | None | Strong type validation |
| Required Fields | Cannot enforce | Schema-enforced requirements |
| Parse Errors | Common, needs validation | Rare, pre-validated by model |
| Production Readiness | Requires extensive validation | Production-ready outputs |
| Array Consistency | Inconsistent formats | Consistent array structures |
| Debugging | Difficult to trace issues | Clear schema validation errors |
| Development Speed | Slower due to validation code | Faster, less boilerplate |
Model Support and Limitations
Supported Models
Function calling is available in:
- OpenAI: GPT-4, GPT-4 Turbo, GPT-3.5-turbo (with reduced reliability)
- Anthropic: Claude 3 Opus, Claude 3.5 Sonnet (via tool use)
- Google: Gemini Pro and Ultra models
- Azure OpenAI: GPT-4 and GPT-3.5-turbo deployments
Limitations to Consider
Token Overhead: Function schemas add 100-500 tokens per request depending on complexity. Factor this into your content size limits.
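To see the overhead for a particular schema, one quick approach (assuming the tiktoken package, as used in the cost estimator earlier) is to encode the serialized schema and count the tokens:

import json
import tiktoken

def schema_token_overhead(schema, model="gpt-4"):
    """Rough token count for a function schema (sketch)."""
    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(json.dumps(schema)))

# e.g. schema_token_overhead(product_schema) typically lands in the low hundreds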
Context Windows: Very large pages may need chunking:
# GPT-4: ~8K tokens safe limit
# GPT-4 Turbo: ~120K tokens
# Claude 3.5 Sonnet: ~180K tokens
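One straightforward chunking approach, sketched below with a hypothetical extract_chunk helper (any function-calling extraction like the ones above), is to split the cleaned text into overlapping character windows, extract from each, and merge the partial results:

# Sketch: chunk long content and merge per-chunk extractions
def chunk_text(text, size=12000, overlap=500):
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

def extract_from_long_page(text):
    items = []
    for chunk in chunk_text(text):
        # extract_chunk is a placeholder for your function-calling extraction;
        # here it is assumed to return a dict with a 'products' array
        result = extract_chunk(chunk)
        items.extend(result.get('products', []))
    # De-duplicate by name in case items appear in overlapping chunks
    seen, merged = set(), []
    for item in items:
        if item['name'] not in seen:
            seen.add(item['name'])
            merged.append(item)
    return merged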
Complex Schemas: Deeply nested schemas (5+ levels) may reduce extraction accuracy. Keep schemas reasonable.
Cost: Function calling uses the same pricing as regular API calls, but remember to account for schema tokens in your cost calculations.
Conclusion
Function calling transforms data extraction from an unpredictable process into a reliable, type-safe operation. By defining clear schemas and letting the LLM conform to them, you eliminate the uncertainty of free-form responses and create production-ready extraction pipelines.
For web scraping workflows, function calling provides the reliability needed to process data at scale without extensive validation logic. Combined with traditional browser automation for navigation and modern web scraping APIs for infrastructure, it creates a powerful, maintainable stack that adapts to changing website structures while delivering consistent, structured results.
Whether you're extracting single items or processing arrays of complex nested objects, function calling ensures your data arrives in exactly the format your application expects—every single time.