How do I get structured output from an LLM for web scraping?
Getting structured output from Large Language Models (LLMs) for web scraping involves defining a schema, prompting the LLM to extract data in a specific format, and validating the results. Modern LLM APIs like OpenAI's GPT, Anthropic's Claude, and Google's Gemini provide built-in features for structured data extraction, making it easier to parse web content into consistent, typed data structures.
Why Use LLMs for Structured Data Extraction?
Traditional web scraping relies on CSS selectors or XPath to extract data from HTML. While effective for static pages, these methods struggle with:
- Dynamic or inconsistent HTML structures - Pages that change layouts frequently
- Unstructured text content - Data embedded in paragraphs or lists without clear patterns
- Complex data relationships - Information spread across multiple elements
- Natural language processing - Extracting meaning from text, not just raw values
LLMs can understand context, interpret natural language, and extract structured data from messy HTML or text without brittle selectors.
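For example, a selector-based extraction is tightly coupled to the page's markup; a small illustration (assuming html holds a fetched page and BeautifulSoup is installed):
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
# Breaks silently (returns None) if the site renames the "price" class
price_tag = soup.select_one("span.price")
price = price_tag.get_text() if price_tag else None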
Methods for Getting Structured Output
1. JSON Mode and Structured Output APIs
Most modern LLM providers offer native support for structured outputs through JSON mode or schema-based extraction.
OpenAI GPT with JSON Mode
OpenAI's API provides a response_format parameter to enforce JSON output:
import openai
import json
# Initialize OpenAI client
client = openai.OpenAI(api_key="your-api-key")
# HTML content from web scraping
html_content = """
<div class="product">
    <h1>Wireless Headphones</h1>
    <span class="price">$99.99</span>
    <p>Premium noise-canceling headphones with 30-hour battery life.</p>
    <div class="rating">4.5 stars (1,234 reviews)</div>
</div>
"""
# Define extraction prompt
prompt = f"""
Extract product information from this HTML and return it as JSON with the following schema:
{{
    "name": "product name",
    "price": "numeric price value",
    "description": "product description",
    "rating": "numeric rating",
    "review_count": "number of reviews"
}}

HTML:
{html_content}
"""
# Make API call with JSON mode
response = client.chat.completions.create(
    model="gpt-4-turbo-preview",
    messages=[
        {"role": "system", "content": "You are a data extraction assistant. Always respond with valid JSON."},
        {"role": "user", "content": prompt}
    ],
    response_format={"type": "json_object"}
)
# Parse structured output
product_data = json.loads(response.choices[0].message.content)
print(json.dumps(product_data, indent=2))
Output:
{
  "name": "Wireless Headphones",
  "price": 99.99,
  "description": "Premium noise-canceling headphones with 30-hour battery life.",
  "rating": 4.5,
  "review_count": 1234
}
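For stricter guarantees, newer OpenAI models also support Structured Outputs, where the API validates the response against a JSON Schema you supply. A minimal sketch, reusing the client and prompt from above and assuming a model with json_schema support, such as gpt-4o:
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "product",
            "strict": True,
            # Strict mode requires every property to be listed in "required"
            # and additionalProperties to be disabled
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "price": {"type": "number"}
                },
                "required": ["name", "price"],
                "additionalProperties": False
            }
        }
    }
)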
OpenAI Function Calling for Structured Extraction
Function calling, exposed through the tools parameter in current OpenAI SDKs, provides stronger type validation and schema enforcement:
import openai
import json
client = openai.OpenAI(api_key="your-api-key")
# Define extraction schema as a function
extraction_function = {
    "name": "extract_product_data",
    "description": "Extract structured product information from HTML",
    "parameters": {
        "type": "object",
        "properties": {
            "name": {
                "type": "string",
                "description": "Product name"
            },
            "price": {
                "type": "number",
                "description": "Price in USD"
            },
            "description": {
                "type": "string",
                "description": "Product description"
            },
            "rating": {
                "type": "number",
                "description": "Rating from 0 to 5"
            },
            "review_count": {
                "type": "integer",
                "description": "Number of customer reviews"
            },
            "features": {
                "type": "array",
                "items": {"type": "string"},
                "description": "List of product features"
            }
        },
        "required": ["name", "price", "description"]
    }
}
html_content = """
<div class="product">
    <h1>Wireless Headphones</h1>
    <span class="price">$99.99</span>
    <p>Premium noise-canceling headphones with 30-hour battery life.</p>
    <ul class="features">
        <li>Active noise cancellation</li>
        <li>30-hour battery life</li>
        <li>Bluetooth 5.0</li>
    </ul>
    <div class="rating">4.5 stars (1,234 reviews)</div>
</div>
"""
response = client.chat.completions.create(
    model="gpt-4-turbo-preview",
    messages=[
        {"role": "user", "content": f"Extract product data from this HTML:\n{html_content}"}
    ],
    tools=[{"type": "function", "function": extraction_function}],
    tool_choice={"type": "function", "function": {"name": "extract_product_data"}}
)

# Extract the tool call arguments as structured data
function_args = json.loads(response.choices[0].message.tool_calls[0].function.arguments)
print(json.dumps(function_args, indent=2))
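The parsed arguments will look something like this (exact values depend on the model run):
{
  "name": "Wireless Headphones",
  "price": 99.99,
  "description": "Premium noise-canceling headphones with 30-hour battery life.",
  "rating": 4.5,
  "review_count": 1234,
  "features": [
    "Active noise cancellation",
    "30-hour battery life",
    "Bluetooth 5.0"
  ]
}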
2. Anthropic Claude with Structured Prompting
Anthropic's Claude API excels at following structured output instructions through well-crafted prompts:
import anthropic
import json
client = anthropic.Anthropic(api_key="your-api-key")
html_content = """
<article>
    <h1>Breaking: New AI Model Released</h1>
    <time datetime="2024-03-15">March 15, 2024</time>
    <span class="author">by Jane Smith</span>
    <p>Researchers have unveiled a groundbreaking AI model...</p>
</article>
"""
prompt = f"""Extract the following information from the HTML article and return ONLY a valid JSON object with no additional text:
Required fields:
- title (string)
- date (ISO 8601 format)
- author (string)
- summary (string, first 100 characters of content)
HTML:
{html_content}
JSON:"""
message = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": prompt}
    ]
)
# Parse JSON response
article_data = json.loads(message.content[0].text)
print(json.dumps(article_data, indent=2))
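Prompt-only JSON can occasionally arrive wrapped in extra prose. For stricter schema enforcement, Claude also supports tool use, which returns arguments matching a declared input_schema. A minimal sketch reusing the article fields above (record_article is an arbitrary tool name):
message = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    tools=[{
        "name": "record_article",
        "description": "Record structured article data",
        "input_schema": {
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "date": {"type": "string"},
                "author": {"type": "string"}
            },
            "required": ["title"]
        }
    }],
    tool_choice={"type": "tool", "name": "record_article"},
    messages=[
        {"role": "user", "content": f"Extract article data from this HTML:\n{html_content}"}
    ]
)

# The structured arguments arrive as a tool_use content block
article_data = next(block.input for block in message.content if block.type == "tool_use")
print(json.dumps(article_data, indent=2))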
3. Google Gemini with Schema-Guided Generation
Google's Gemini API supports schema-based output control:
import google.generativeai as genai
import json
genai.configure(api_key="your-api-key")
# Define schema for structured output
schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "price": {"type": "number"},
        "availability": {"type": "string", "enum": ["in_stock", "out_of_stock", "preorder"]},
        "specifications": {
            "type": "object",
            "properties": {
                "brand": {"type": "string"},
                "model": {"type": "string"},
                "color": {"type": "string"}
            }
        }
    },
    "required": ["title", "price", "availability"]
}
model = genai.GenerativeModel('gemini-1.5-pro')
html_content = """
<div class="product-detail">
    <h2>Samsung Galaxy S24</h2>
    <p class="price">$799.99</p>
    <span class="stock">In Stock</span>
    <dl>
        <dt>Brand:</dt><dd>Samsung</dd>
        <dt>Model:</dt><dd>Galaxy S24</dd>
        <dt>Color:</dt><dd>Titanium Gray</dd>
    </dl>
</div>
"""
prompt = f"""Extract product information from this HTML according to the specified schema.
Return only valid JSON matching the schema.
HTML:
{html_content}
"""
response = model.generate_content(
    prompt,
    generation_config=genai.GenerationConfig(
        response_mime_type="application/json",
        response_schema=schema
    )
)
product_data = json.loads(response.text)
print(json.dumps(product_data, indent=2))
JavaScript Implementation
For Node.js environments, here's an example using OpenAI with web scraping:
import OpenAI from 'openai';
import axios from 'axios';
import * as cheerio from 'cheerio';
const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY
});

// Fetch and scrape webpage
async function scrapeAndExtract(url) {
  // Fetch HTML
  const response = await axios.get(url);
  const $ = cheerio.load(response.data);

  // Extract main content (reduce token usage)
  const mainContent = $('main, article, .content').html() || $('body').html();

  // Define extraction schema
  const extractionFunction = {
    name: 'extract_article_data',
    description: 'Extract structured article information',
    parameters: {
      type: 'object',
      properties: {
        headline: { type: 'string' },
        author: { type: 'string' },
        publishDate: { type: 'string' },
        category: { type: 'string' },
        tags: {
          type: 'array',
          items: { type: 'string' }
        },
        summary: { type: 'string' }
      },
      required: ['headline', 'summary']
    }
  };

  // Call OpenAI with function calling (current tools API)
  const completion = await openai.chat.completions.create({
    model: 'gpt-4-turbo-preview',
    messages: [
      {
        role: 'user',
        content: `Extract article data from this HTML:\n${mainContent}`
      }
    ],
    tools: [{ type: 'function', function: extractionFunction }],
    tool_choice: { type: 'function', function: { name: 'extract_article_data' } }
  });

  // Parse structured output from the tool call
  const structuredData = JSON.parse(
    completion.choices[0].message.tool_calls[0].function.arguments
  );

  return structuredData;
}

// Usage (top-level await requires an ES module)
const articleData = await scrapeAndExtract('https://example.com/article');
console.log(articleData);
Best Practices for Structured LLM Extraction
1. Define Clear Schemas
Always specify the exact structure you need with:
- Field names and types
- Required vs optional fields
- Validation constraints (min/max values, enums)
- Nested object structures
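For instance, a schema fragment combining these ideas (standard JSON Schema keywords; the field names are illustrative):
product_schema = {
    "type": "object",
    "properties": {
        "price": {"type": "number", "minimum": 0},
        "currency": {"type": "string", "enum": ["USD", "EUR", "GBP"]},
        "dimensions": {  # nested object structure; optional
            "type": "object",
            "properties": {
                "width_cm": {"type": "number"},
                "height_cm": {"type": "number"}
            }
        }
    },
    "required": ["price", "currency"]
}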
2. Pre-process HTML Content
Reduce token usage and improve accuracy by stripping boilerplate markup before sending HTML to the model:
from bs4 import BeautifulSoup

def clean_html_for_llm(html_content):
    """Remove unnecessary HTML elements before LLM processing"""
    soup = BeautifulSoup(html_content, 'html.parser')

    # Remove scripts, styles, and navigation
    for tag in soup(['script', 'style', 'nav', 'header', 'footer', 'aside']):
        tag.decompose()

    # Get main content area
    main_content = soup.find('main') or soup.find('article') or soup.find('body')

    return str(main_content)
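A quick usage sketch (the URL is a placeholder):
import requests

raw_html = requests.get("https://example.com/product").text
cleaned = clean_html_for_llm(raw_html)
print(f"Reduced from {len(raw_html):,} to {len(cleaned):,} characters")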
3. Validate LLM Output
Always validate the structured output:
from pydantic import BaseModel, ValidationError
from typing import List, Optional

class Product(BaseModel):
    name: str
    price: float
    description: str
    rating: Optional[float] = None
    review_count: Optional[int] = None
    features: List[str] = []

def extract_and_validate(html_content):
    # get_llm_extraction stands in for any of the extraction calls shown earlier
    llm_output = get_llm_extraction(html_content)

    try:
        # Validate with Pydantic
        product = Product(**llm_output)
        return product.model_dump()
    except ValidationError as e:
        print(f"Validation error: {e}")
        return None
4. Handle Batch Extraction
For multiple items (e.g., product listings), structure your schema accordingly:
extraction_function = {
    "name": "extract_product_list",
    "description": "Extract multiple products from a listing page",
    "parameters": {
        "type": "object",
        "properties": {
            "products": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "name": {"type": "string"},
                        "price": {"type": "number"},
                        "url": {"type": "string"}
                    },
                    "required": ["name", "price"]
                }
            }
        },
        "required": ["products"]
    }
}
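Very long listing pages can exceed the model's context window. One approach is to split the page into chunks of items and merge the results; a rough sketch, assuming BeautifulSoup, a hypothetical .product-card selector, and an extract_products wrapper around any of the extraction calls above:
from bs4 import BeautifulSoup

def extract_in_chunks(html_content, chunk_size=20):
    soup = BeautifulSoup(html_content, "html.parser")
    cards = soup.select(".product-card")  # hypothetical per-item selector
    all_products = []
    for i in range(0, len(cards), chunk_size):
        chunk_html = "\n".join(str(card) for card in cards[i:i + chunk_size])
        result = extract_products(chunk_html)  # hypothetical wrapper around an LLM call
        all_products.extend(result["products"])
    return all_products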
Combining Traditional Scraping with LLM Extraction
For optimal results, combine traditional scraping methods with LLM extraction. Use browser automation tools such as Selenium or Puppeteer to render dynamic content (including data loaded via AJAX), then pass the rendered HTML to an LLM for intelligent extraction:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import openai
import json

def scrape_dynamic_page_with_llm(url):
    # Use Selenium to handle dynamic content
    driver = webdriver.Chrome()
    driver.get(url)

    # Wait for content to load
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CLASS_NAME, "product-grid"))
    )

    # Get rendered HTML
    html_content = driver.page_source
    driver.quit()

    # Extract structured data with the LLM
    # product_list_schema is the "extract_product_list" schema defined in the
    # batch extraction section above
    client = openai.OpenAI()
    response = client.chat.completions.create(
        model="gpt-4-turbo-preview",
        messages=[
            {"role": "user", "content": f"Extract all products from this HTML:\n{html_content}"}
        ],
        tools=[{"type": "function", "function": product_list_schema}],
        tool_choice={"type": "function", "function": {"name": "extract_product_list"}}
    )

    return json.loads(response.choices[0].message.tool_calls[0].function.arguments)
Error Handling and Retry Logic
Implement robust error handling for LLM API calls:
import time
import json
from openai import OpenAI, APIError, RateLimitError

def extract_with_retry(html_content, max_retries=3):
    client = OpenAI()

    for attempt in range(max_retries):
        try:
            # JSON mode requires the word "JSON" to appear in the messages
            response = client.chat.completions.create(
                model="gpt-4-turbo-preview",
                messages=[
                    {"role": "user", "content": f"Extract data as JSON:\n{html_content}"}
                ],
                response_format={"type": "json_object"}
            )

            # Validate JSON
            data = json.loads(response.choices[0].message.content)
            return data

        except RateLimitError:
            wait_time = 2 ** attempt  # Exponential backoff
            print(f"Rate limited. Waiting {wait_time}s...")
            time.sleep(wait_time)
        except json.JSONDecodeError:
            print(f"Invalid JSON on attempt {attempt + 1}")
            if attempt == max_retries - 1:
                raise
        except APIError as e:
            print(f"API error: {e}")
            raise

    return None
Cost Optimization
LLM API calls can be expensive. Optimize costs by:
- Reduce input tokens - Extract only relevant HTML sections
- Use smaller models - GPT-3.5 or Claude Haiku for simple extractions
- Batch processing - Extract multiple items in one API call
- Cache results - Store extracted data to avoid re-processing (see the sketch after this list)
- Fallback to traditional methods - Use LLMs only when selectors fail
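The caching point can be as simple as keying extractions by a hash of the HTML (an in-memory sketch; swap the dict for Redis or SQLite in production):
import hashlib

_extraction_cache = {}

def cached_extract(html_content):
    # Identical pages produce identical keys and skip the API call
    key = hashlib.sha256(html_content.encode("utf-8")).hexdigest()
    if key not in _extraction_cache:
        _extraction_cache[key] = extract_with_retry(html_content)  # from the previous section
    return _extraction_cache[key]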
Conclusion
Getting structured output from LLMs for web scraping combines the reliability of schema validation with the intelligence of natural language understanding. By using JSON mode, function calling, or schema-guided generation, you can extract consistent, typed data from complex web pages. When handling dynamic content with browser automation, LLMs provide a powerful alternative to brittle CSS selectors, especially for pages with inconsistent structures or natural language content.
For production use, always validate LLM outputs, implement error handling, and monitor API costs. The combination of traditional web scraping tools and LLM-powered extraction provides the best balance of reliability, accuracy, and cost-effectiveness.