What is OpenAI Function Calling and How Does It Work with Web Scraping?
OpenAI function calling is a powerful feature that allows GPT models to generate structured outputs conforming to predefined JSON schemas. For web scraping, this means you can extract data from HTML with guaranteed structure, type safety, and consistency—eliminating the common problem of unreliable or malformed LLM responses.
Instead of hoping the LLM returns valid JSON in free-form text, function calling ensures the model's output matches your exact data schema, making it ideal for production web scraping pipelines where reliability is critical.
Understanding OpenAI Function Calling
Function calling (also known as tool calling in newer API versions) enables you to describe functions with specific parameters and types to the model. The model then intelligently extracts and structures data to match those function parameters, essentially treating data extraction as "calling a function" with the extracted values as arguments.
Key Benefits for Web Scraping
- Guaranteed Structure: Output always matches your predefined schema
- Type Safety: Fields are validated as strings, numbers, booleans, arrays, or objects
- Required Fields: Enforce that critical data must be present
- Array Handling: Extract multiple items (like product lists) reliably
- Reduced Parsing Errors: No need to parse free-form text or fix malformed JSON
- Production Ready: Consistent output format enables automated processing
How Function Calling Works
The process involves three steps:
- Define the function schema: Describe what data structure you want to extract
- Send scraped content: Provide the HTML or text to analyze
- Receive structured data: Get back data that matches your schema exactly
The model analyzes the content and "calls" your function by providing arguments (the extracted data) that conform to the defined schema.
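For orientation, the assistant message returned by the API looks roughly like this (simplified; the values are illustrative). Note that function.arguments arrives as a JSON-encoded string, which is why the examples below parse it with json.loads:
{
  "role": "assistant",
  "content": null,
  "tool_calls": [
    {
      "id": "call_abc123",
      "type": "function",
      "function": {
        "name": "extract_product",
        "arguments": "{\"name\": \"Premium Wireless Headphones\", \"price\": 299.99, \"currency\": \"USD\"}"
      }
    }
  ]
}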
Basic Function Calling for Web Scraping
Python Example: Extracting Product Information
from openai import OpenAI
import requests
from bs4 import BeautifulSoup
import json
client = OpenAI(api_key='your-api-key-here')
# Step 1: Scrape the webpage
url = 'https://example.com/product'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
page_text = soup.get_text(separator=' ', strip=True)
# Step 2: Define the function schema
tools = [
{
"type": "function",
"function": {
"name": "extract_product",
"description": "Extract product information from a webpage",
"parameters": {
"type": "object",
"properties": {
"name": {
"type": "string",
"description": "The product name"
},
"price": {
"type": "number",
"description": "The product price as a number"
},
"currency": {
"type": "string",
"description": "The currency code (USD, EUR, etc.)"
},
"in_stock": {
"type": "boolean",
"description": "Whether the product is in stock"
},
"rating": {
"type": "number",
"description": "Product rating out of 5"
},
"description": {
"type": "string",
"description": "Product description"
}
},
"required": ["name", "price", "currency"]
}
}
}
]
# Step 3: Call the API with function calling
completion = client.chat.completions.create(
model="gpt-4",
messages=[
{
"role": "system",
"content": "You are a data extraction assistant. Extract product information from the provided content."
},
{
"role": "user",
"content": f"Extract product data from this content:\n\n{html_content[:4000]}"
}
],
tools=tools,
tool_choice={"type": "function", "function": {"name": "extract_product"}}
)
# Step 4: Parse the function call result
tool_call = completion.choices[0].message.tool_calls[0]
product_data = json.loads(tool_call.function.arguments)
print(json.dumps(product_data, indent=2))
Output:
{
"name": "Premium Wireless Headphones",
"price": 299.99,
"currency": "USD",
"in_stock": true,
"rating": 4.5,
"description": "High-quality wireless headphones with noise cancellation"
}
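Even with tool_choice forcing a specific function, it is worth guarding against a missing or malformed tool call before parsing. A minimal defensive sketch, continuing the Python example above (the fallback behavior is an assumption you can adapt):
# Defensive parsing: confirm a tool call was returned before reading it
message = completion.choices[0].message
if not message.tool_calls:
    # Fallback (assumption): log and skip this page instead of crashing
    print("No tool call returned; raw content:", message.content)
    product_data = None
else:
    try:
        product_data = json.loads(message.tool_calls[0].function.arguments)
    except json.JSONDecodeError as e:
        print(f"Malformed arguments from model: {e}")
        product_data = None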
JavaScript Example: Extracting Article Data
const OpenAI = require('openai');
const axios = require('axios');
const cheerio = require('cheerio');
const openai = new OpenAI({
apiKey: 'your-api-key-here'
});
async function scrapeArticle(url) {
// Fetch and parse the webpage
const response = await axios.get(url);
const $ = cheerio.load(response.data);
const content = $('body').text().trim();
// Define the function schema
const tools = [
{
type: "function",
function: {
name: "extract_article",
description: "Extract article information from webpage content",
parameters: {
type: "object",
properties: {
title: {
type: "string",
description: "The article title"
},
author: {
type: "string",
description: "The article author"
},
publish_date: {
type: "string",
description: "Publication date in ISO format"
},
summary: {
type: "string",
description: "A brief summary of the article"
},
tags: {
type: "array",
items: { type: "string" },
description: "Article tags or categories"
}
},
required: ["title", "author"]
}
}
}
];
// Call OpenAI with function calling
const completion = await openai.chat.completions.create({
model: "gpt-4",
messages: [
{
role: "system",
content: "Extract article information from the provided content."
},
{
role: "user",
content: `Extract article data:\n\n${content.substring(0, 4000)}`
}
],
tools: tools,
tool_choice: { type: "function", function: { name: "extract_article" } }
});
// Parse the result
const toolCall = completion.choices[0].message.tool_calls[0];
const articleData = JSON.parse(toolCall.function.arguments);
return articleData;
}
// Usage
scrapeArticle('https://example.com/blog/article')
.then(data => console.log(JSON.stringify(data, null, 2)))
.catch(error => console.error('Error:', error));
Advanced Use Cases
Extracting Multiple Items (Arrays)
When scraping lists of products, articles, or search results, you need to extract arrays of structured data:
from openai import OpenAI
import requests
from bs4 import BeautifulSoup
import json
client = OpenAI(api_key='your-api-key-here')
# Define schema for multiple products
tools = [
{
"type": "function",
"function": {
"name": "extract_product_list",
"description": "Extract a list of products from a webpage",
"parameters": {
"type": "object",
"properties": {
"products": {
"type": "array",
"description": "Array of product objects",
"items": {
"type": "object",
"properties": {
"name": {"type": "string"},
"price": {"type": "number"},
"currency": {"type": "string"},
"url": {"type": "string"},
"availability": {
"type": "string",
"enum": ["in_stock", "out_of_stock", "preorder"]
}
},
"required": ["name", "price"]
}
},
"total_count": {
"type": "number",
"description": "Total number of products found"
}
},
"required": ["products", "total_count"]
}
}
}
]
# Scrape product listing page
url = 'https://example.com/products'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
content = soup.get_text(separator=' ', strip=True)
completion = client.chat.completions.create(
model="gpt-4",
messages=[
{"role": "system", "content": "Extract all products from the page."},
{"role": "user", "content": f"Extract products:\n\n{content[:8000]}"}
],
tools=tools,
tool_choice={"type": "function", "function": {"name": "extract_product_list"}}
)
tool_call = completion.choices[0].message.tool_calls[0]
result = json.loads(tool_call.function.arguments)
print(f"Found {result['total_count']} products:")
for product in result['products']:
print(f"- {product['name']}: {product['price']} {product.get('currency', 'USD')}")
Nested Object Extraction
For complex data structures like product reviews with nested ratings:
const OpenAI = require('openai');
const puppeteer = require('puppeteer');
const openai = new OpenAI({ apiKey: 'your-api-key-here' });
async function scrapeProductWithReviews(url) {
// Use Puppeteer for dynamic content
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto(url, { waitUntil: 'networkidle2' });
const content = await page.evaluate(() => document.body.innerText);
await browser.close();
const tools = [
{
type: "function",
function: {
name: "extract_product_with_reviews",
description: "Extract product details including reviews",
parameters: {
type: "object",
properties: {
product: {
type: "object",
properties: {
name: { type: "string" },
price: { type: "number" },
overall_rating: { type: "number" }
},
required: ["name"]
},
reviews: {
type: "array",
items: {
type: "object",
properties: {
author: { type: "string" },
rating: { type: "number" },
title: { type: "string" },
comment: { type: "string" },
verified_purchase: { type: "boolean" },
helpful_votes: { type: "number" }
},
required: ["author", "rating"]
}
}
},
required: ["product", "reviews"]
}
}
}
];
const completion = await openai.chat.completions.create({
model: "gpt-4",
messages: [
{ role: "system", content: "Extract product and review data." },
{ role: "user", content: `Extract data:\n\n${content.substring(0, 6000)}` }
],
tools: tools,
tool_choice: { type: "function", function: { name: "extract_product_with_reviews" } }
});
const toolCall = completion.choices[0].message.tool_calls[0];
return JSON.parse(toolCall.function.arguments);
}
When handling dynamic content with browser automation, combining Puppeteer with function calling ensures both complete page rendering and reliable data extraction.
Enum Values for Classification
Use enums to classify scraped content into predefined categories:
from openai import OpenAI
import requests
import json
client = OpenAI(api_key='your-api-key-here')
tools = [
{
"type": "function",
"function": {
"name": "classify_and_extract",
"description": "Classify content type and extract relevant data",
"parameters": {
"type": "object",
"properties": {
"content_type": {
"type": "string",
"enum": ["product", "article", "review", "forum_post", "documentation"],
"description": "The type of content on the page"
},
"sentiment": {
"type": "string",
"enum": ["positive", "negative", "neutral"],
"description": "Overall sentiment of the content"
},
"key_entities": {
"type": "array",
"items": {"type": "string"},
"description": "Important entities mentioned (brands, products, people)"
},
"main_topic": {
"type": "string",
"description": "The main topic or subject"
}
},
"required": ["content_type", "sentiment"]
}
}
}
]
response = requests.get('https://example.com/content')
content = response.text[:4000]
completion = client.chat.completions.create(
model="gpt-4",
messages=[
{"role": "user", "content": f"Classify and extract from:\n\n{content}"}
],
tools=tools,
tool_choice={"type": "function", "function": {"name": "classify_and_extract"}}
)
tool_call = completion.choices[0].message.tool_calls[0]
classification = json.loads(tool_call.function.arguments)
print(f"Content Type: {classification['content_type']}")
print(f"Sentiment: {classification['sentiment']}")
print(f"Entities: {', '.join(classification['key_entities'])}")
Combining Function Calling with Web Scraping Workflows
Complete Production Example
Here's a production-ready scraper using function calling with error handling, caching, and retry logic:
import requests
from bs4 import BeautifulSoup
from openai import OpenAI
import json
import hashlib
import os
import time
from typing import Dict, List, Optional
class FunctionCallingScraper:
def __init__(self, api_key: str, cache_dir: str = 'scraping_cache'):
self.client = OpenAI(api_key=api_key)
self.cache_dir = cache_dir
os.makedirs(cache_dir, exist_ok=True)
def fetch_and_clean(self, url: str) -> str:
"""Fetch webpage and clean HTML."""
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()
soup = BeautifulSoup(response.content, 'html.parser')
# Remove noise
for element in soup(['script', 'style', 'nav', 'footer', 'header', 'aside']):
element.decompose()
return soup.get_text(separator=' ', strip=True)
def get_cache_key(self, content: str, schema: Dict) -> str:
"""Generate cache key from content and schema."""
combined = f"{content}{json.dumps(schema, sort_keys=True)}"
return hashlib.md5(combined.encode()).hexdigest()
def extract_with_function_calling(
self,
content: str,
function_schema: Dict,
max_retries: int = 3
) -> Optional[Dict]:
"""Extract data using function calling with retry logic."""
# Check cache
cache_key = self.get_cache_key(content, function_schema)
cache_file = os.path.join(self.cache_dir, f"{cache_key}.json")
if os.path.exists(cache_file):
with open(cache_file, 'r') as f:
return json.load(f)
# Truncate content to fit token limits (~4000 tokens)
content = content[:16000]
tools = [{"type": "function", "function": function_schema}]
for attempt in range(max_retries):
try:
completion = self.client.chat.completions.create(
model="gpt-4",
messages=[
{
"role": "system",
"content": "Extract structured data from the provided content."
},
{
"role": "user",
"content": f"Extract data from:\n\n{content}"
}
],
tools=tools,
tool_choice={"type": "function", "function": {"name": function_schema["name"]}},
temperature=0
)
tool_call = completion.choices[0].message.tool_calls[0]
result = json.loads(tool_call.function.arguments)
# Cache the result
with open(cache_file, 'w') as f:
json.dump(result, f, indent=2)
return result
except Exception as e:
print(f"Attempt {attempt + 1} failed: {e}")
if attempt < max_retries - 1:
time.sleep(2 ** attempt) # Exponential backoff
else:
raise
return None
def scrape_url(self, url: str, function_schema: Dict) -> Optional[Dict]:
"""Complete scraping pipeline."""
try:
content = self.fetch_and_clean(url)
return self.extract_with_function_calling(content, function_schema)
except Exception as e:
print(f"Error scraping {url}: {e}")
return None
# Usage example
if __name__ == "__main__":
scraper = FunctionCallingScraper(api_key='your-api-key-here')
# Define extraction schema
product_schema = {
"name": "extract_ecommerce_product",
"description": "Extract product information from an e-commerce page",
"parameters": {
"type": "object",
"properties": {
"name": {"type": "string", "description": "Product name"},
"price": {"type": "number", "description": "Price as a number"},
"currency": {"type": "string", "description": "Currency code"},
"brand": {"type": "string", "description": "Brand name"},
"category": {"type": "string", "description": "Product category"},
"in_stock": {"type": "boolean", "description": "Availability status"},
"specs": {
"type": "object",
"description": "Technical specifications",
"additionalProperties": {"type": "string"}
},
"images": {
"type": "array",
"items": {"type": "string"},
"description": "Image URLs"
}
},
"required": ["name", "price", "currency"]
}
}
# Scrape multiple URLs
urls = [
'https://example.com/product1',
'https://example.com/product2',
'https://example.com/product3'
]
for url in urls:
print(f"\nScraping {url}...")
data = scraper.scrape_url(url, product_schema)
if data:
print(json.dumps(data, indent=2))
Best Practices for Function Calling in Web Scraping
1. Design Clear, Specific Schemas
Make your function schemas as specific as possible:
# ❌ Too vague
{
"name": "extract_data",
"parameters": {
"type": "object",
"properties": {
"data": {"type": "string"}
}
}
}
# ✅ Specific and structured
{
"name": "extract_product",
"description": "Extract product details from an e-commerce page",
"parameters": {
"type": "object",
"properties": {
"name": {
"type": "string",
"description": "The product name or title"
},
"price": {
"type": "number",
"description": "Price as a decimal number without currency symbols"
},
"currency": {
"type": "string",
"description": "ISO 4217 currency code (USD, EUR, GBP, etc.)"
}
},
"required": ["name", "price"]
}
}
2. Use Enums for Controlled Values
Constrain outputs to specific values when possible:
{
type: "object",
properties: {
condition: {
type: "string",
enum: ["new", "like_new", "good", "acceptable", "poor"],
description: "Product condition"
},
shipping_speed: {
type: "string",
enum: ["standard", "express", "overnight", "international"],
description: "Available shipping speed"
}
}
}
3. Optimize Content Before Extraction
Clean and reduce HTML to minimize tokens and costs:
from bs4 import BeautifulSoup
import re
def optimize_for_extraction(html: str, target_selector: str = None) -> str:
"""Optimize HTML content for LLM extraction."""
soup = BeautifulSoup(html, 'html.parser')
# If target selector provided, extract only that section
if target_selector:
target = soup.select_one(target_selector)
if target:
soup = target
# Remove unwanted elements
for element in soup(['script', 'style', 'nav', 'footer', 'header',
'iframe', 'noscript', 'svg']):
element.decompose()
# Get text with some structure preserved
text = soup.get_text(separator=' ', strip=True)
# Clean excessive whitespace
text = re.sub(r'\s+', ' ', text)
return text.strip()
4. Handle Partial Data Gracefully
Not all required fields may be present on every page:
# Make only critical fields required
{
"name": "extract_listing",
"parameters": {
"type": "object",
"properties": {
"title": {"type": "string"},
"price": {"type": "number"},
"description": {"type": "string"},
"optional_fields": {
"type": "object",
"properties": {
"rating": {"type": "number"},
"review_count": {"type": "number"},
"seller": {"type": "string"}
}
}
},
"required": ["title"] # Only title is mandatory
}
}
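After extraction, read optional fields defensively instead of assuming they exist. A small sketch using the field names from the schema above (it assumes a parsed tool_call as in the earlier examples):
# Defensive access: required fields can be read directly, optional ones need defaults
listing = json.loads(tool_call.function.arguments)

title = listing["title"]                       # required by the schema
price = listing.get("price")                   # may be None if absent
optional = listing.get("optional_fields", {})  # nested optional object
rating = optional.get("rating")
seller = optional.get("seller", "unknown")

print(f"{title}: {price if price is not None else 'price not found'} (seller: {seller})")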
5. Monitor Token Usage and Costs
Track your API usage to optimize costs:
import tiktoken
def estimate_tokens(text: str, model: str = "gpt-4") -> int:
"""Estimate token count for text."""
encoding = tiktoken.encoding_for_model(model)
return len(encoding.encode(text))
def estimate_cost(input_tokens: int, output_tokens: int, model: str = "gpt-4") -> float:
"""Estimate API call cost."""
# Prices as of 2024 (check current pricing)
prices = {
"gpt-4": {"input": 0.03, "output": 0.06}, # per 1K tokens
"gpt-3.5-turbo": {"input": 0.0015, "output": 0.002}
}
price = prices.get(model, prices["gpt-4"])
input_cost = (input_tokens / 1000) * price["input"]
output_cost = (output_tokens / 1000) * price["output"]
return input_cost + output_cost
# Before making API call
content = optimize_for_extraction(html)
estimated_tokens = estimate_tokens(content)
print(f"Estimated tokens: {estimated_tokens}")
print(f"Estimated cost: ${estimate_cost(estimated_tokens, 200):.4f}")
Comparing Function Calling vs. Standard Prompting
| Aspect | Standard Prompting | Function Calling |
|--------|-------------------|------------------|
| Structure | May return inconsistent JSON | Guaranteed schema compliance |
| Type Safety | No type enforcement | Strong type validation |
| Reliability | Requires parsing and validation | Direct structured output |
| Required Fields | Cannot enforce | Enforced by schema |
| Arrays | May vary in format | Consistent array structure |
| Production Use | Needs extensive error handling | Production-ready outputs |
| Debugging | Harder to track issues | Clear schema validation errors |
Limitations and Considerations
Token Limits
Function calling adds tokens to your request (for the schema definition). Monitor total token usage:
# Schema adds ~200-500 tokens depending on complexity
# Content should stay under 3000-4000 tokens for GPT-4
# Total input tokens should be under 8000 for safety
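Because character counts only approximate token counts, you can truncate by tokens instead. A sketch using tiktoken, in the spirit of the estimate_tokens helper above; the 3500-token budget is an assumption, not an official limit:
import tiktoken

def truncate_to_tokens(text: str, max_tokens: int = 3500, model: str = "gpt-4") -> str:
    """Truncate text to a token budget rather than a character count."""
    encoding = tiktoken.encoding_for_model(model)
    tokens = encoding.encode(text)
    if len(tokens) <= max_tokens:
        return text
    return encoding.decode(tokens[:max_tokens])

# Usage: leave headroom for the schema definition and the model's output
content = truncate_to_tokens(content, max_tokens=3500)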
Model Support
Function calling is supported in:
- GPT-4 and GPT-4 Turbo
- GPT-3.5-turbo (with slightly less reliability)
Not all older models support it.
Complex Schemas
Very complex nested schemas may reduce reliability. Keep schemas reasonable:
# ✅ Good: 2-3 levels of nesting
# ❌ Avoid: 5+ levels of deeply nested objects
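If a schema grows too deep, one option is to split the extraction into two simpler calls, for example product details first and reviews second, then merge the results. A rough sketch reusing the FunctionCallingScraper class from the production example; product_details_schema and reviews_schema are hypothetical, simpler schemas you would define separately:
# Sketch: two focused extractions instead of one deeply nested schema
scraper = FunctionCallingScraper(api_key='your-api-key-here')
content = scraper.fetch_and_clean('https://example.com/product')

# product_details_schema and reviews_schema are assumed to be defined elsewhere
product = scraper.extract_with_function_calling(content, product_details_schema)
reviews = scraper.extract_with_function_calling(content, reviews_schema)

combined = {**(product or {}), "reviews": (reviews or {}).get("reviews", [])}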
Conclusion
OpenAI function calling transforms web scraping from an unreliable, parse-heavy process into a type-safe, structured data extraction workflow. By defining clear schemas and letting the model conform to them, you eliminate the uncertainty of free-form LLM responses.
For production web scraping systems, function calling provides the reliability needed to process data at scale. Combined with traditional browser automation techniques for navigation and modern web scraping APIs for infrastructure, it creates a powerful, maintainable scraping stack that can adapt to changing website structures while delivering consistent results.
Start with simple schemas for single-page extraction, then scale to complex multi-item arrays and nested objects as you gain confidence. The combination of guaranteed structure and intelligent extraction makes function calling an essential tool for modern web scraping applications.