How Do I Extract Structured Data from HTML Using LLMs?
Large Language Models (LLMs) have revolutionized web scraping by offering a more flexible and intelligent approach to extracting structured data from HTML. Unlike traditional parsing methods that rely on rigid selectors, LLMs can understand context, handle layout variations, and extract data even when the HTML structure changes.
Understanding LLM-Based Data Extraction
Traditional web scraping uses CSS selectors or XPath to target specific HTML elements. While effective, this approach breaks when websites update their layouts. LLMs can analyze HTML content contextually and extract data based on semantic understanding rather than element paths.
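For contrast, here is a minimal selector-based scraper (the URL and class names are hypothetical, purely for illustration); the moment the site renames a class, it stops working:
import requests
from bs4 import BeautifulSoup

# Hypothetical page and class names, for illustration only.
html_content = requests.get("https://example-store.com/product/laptop").text
soup = BeautifulSoup(html_content, 'html.parser')

# Breaks with an AttributeError as soon as the site changes these selectors.
name = soup.select_one('.product-title').get_text(strip=True)
price = soup.select_one('.price').get_text(strip=True)
print(name, price)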
Key Advantages
- Layout resilience: Works even when HTML structure changes
- Semantic understanding: Identifies data based on meaning, not just position
- Reduced maintenance: Less brittle than selector-based scraping
- Complex data handling: Better at extracting nested or related data points
Basic LLM-Based Extraction with OpenAI
Here's a basic example using the OpenAI Python SDK (v1 or later) with a GPT model to extract product information from HTML:
import requests
from bs4 import BeautifulSoup
from openai import OpenAI

client = OpenAI(api_key="your-api-key")

# Fetch HTML content
response = requests.get("https://example-store.com/product/laptop")
html_content = response.text

# Clean and prepare HTML (optional but recommended)
soup = BeautifulSoup(html_content, 'html.parser')
main_content = soup.find('main') or soup.body
cleaned_html = str(main_content)[:4000]  # Truncate to stay within token limits

# Create extraction prompt
prompt = f"""
Extract the following product information from this HTML:
- Product name
- Price
- Description
- Availability status
- Specifications

HTML:
{cleaned_html}

Return the data as JSON.
"""

# Call the Chat Completions API
completion = client.chat.completions.create(
    model="gpt-4-turbo-preview",
    messages=[
        {"role": "system", "content": "You are a data extraction assistant. Always return valid JSON."},
        {"role": "user", "content": prompt}
    ],
    response_format={"type": "json_object"}
)

extracted_data = completion.choices[0].message.content
print(extracted_data)
Structured Output with Function Calling
Modern LLM APIs support function calling (now exposed as tool calling), which constrains the model to return structured output matching a specific JSON schema:
import json
from openai import OpenAI

client = OpenAI(api_key="your-api-key")

# Define the schema for the extracted data as a tool
tools = [
    {
        "type": "function",
        "function": {
            "name": "extract_product_data",
            "description": "Extract product information from HTML",
            "parameters": {
                "type": "object",
                "properties": {
                    "name": {"type": "string", "description": "Product name"},
                    "price": {"type": "number", "description": "Product price"},
                    "currency": {"type": "string", "description": "Currency code"},
                    "description": {"type": "string", "description": "Product description"},
                    "in_stock": {"type": "boolean", "description": "Availability status"},
                    "specifications": {
                        "type": "object",
                        "description": "Product specifications",
                        "additionalProperties": {"type": "string"}
                    }
                },
                "required": ["name", "price", "currency"]
            }
        }
    }
]

response = client.chat.completions.create(
    model="gpt-4-turbo-preview",
    messages=[
        {"role": "user", "content": f"Extract product data from this HTML: {html_content[:4000]}"}
    ],
    tools=tools,
    tool_choice={"type": "function", "function": {"name": "extract_product_data"}}
)

# Parse the tool call arguments
tool_call = response.choices[0].message.tool_calls[0]
function_args = json.loads(tool_call.function.arguments)
print(json.dumps(function_args, indent=2))
JavaScript Implementation with Claude API
Here's how to extract structured data using Anthropic's Claude API in JavaScript:
const Anthropic = require('@anthropic-ai/sdk');
const axios = require('axios');
const { JSDOM } = require('jsdom');

const anthropic = new Anthropic({
  apiKey: process.env.ANTHROPIC_API_KEY
});

async function extractDataWithClaude(url) {
  // Fetch and clean HTML
  const response = await axios.get(url);
  const dom = new JSDOM(response.data);
  const mainContent = dom.window.document.querySelector('main')?.innerHTML
    || dom.window.document.body.innerHTML;

  // Truncate to avoid token limits
  const truncatedHtml = mainContent.substring(0, 10000);

  const message = await anthropic.messages.create({
    model: 'claude-3-5-sonnet-20241022',
    max_tokens: 1024,
    messages: [
      {
        role: 'user',
        content: `Extract the product information from this HTML and return it as valid JSON with fields: name, price, currency, description, inStock, and specifications (as an object). Return only the JSON, with no surrounding text.

HTML:
${truncatedHtml}`
      }
    ]
  });

  const extractedText = message.content[0].text;
  // The model may still wrap the JSON in extra text; parse just the outermost object
  const jsonStart = extractedText.indexOf('{');
  const jsonEnd = extractedText.lastIndexOf('}');
  return JSON.parse(extractedText.slice(jsonStart, jsonEnd + 1));
}

// Usage
extractDataWithClaude('https://example-store.com/product/laptop')
  .then(data => console.log(data))
  .catch(error => console.error(error));
Using LangChain for Advanced Extraction
LangChain provides a powerful framework for building LLM-powered web scraping workflows:
from typing import Dict

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import PydanticOutputParser
from pydantic import BaseModel, Field

# Define data structure
class Product(BaseModel):
    name: str = Field(description="Product name")
    price: float = Field(description="Product price")
    currency: str = Field(description="Currency code")
    description: str = Field(description="Product description")
    in_stock: bool = Field(description="Whether product is in stock")
    specifications: Dict[str, str] = Field(description="Product specifications")

    model_config = {
        "json_schema_extra": {
            "example": {
                "name": "MacBook Pro",
                "price": 1999.99,
                "currency": "USD",
                "description": "Powerful laptop",
                "in_stock": True,
                "specifications": {"RAM": "16GB", "Storage": "512GB"}
            }
        }
    }

# Initialize parser and LLM
parser = PydanticOutputParser(pydantic_object=Product)
llm = ChatOpenAI(model="gpt-4-turbo-preview", temperature=0)

# Create prompt template
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are an expert at extracting structured product data from HTML."),
    ("user", "{format_instructions}\n\nExtract product data from this HTML:\n{html_content}")
])

# Create chain
chain = prompt | llm | parser

# Execute extraction
result = chain.invoke({
    "format_instructions": parser.get_format_instructions(),
    "html_content": html_content[:4000]
})

print(result.model_dump())
Handling Multiple Items with LLMs
When extracting lists of items (like search results or product listings), structure your schema accordingly:
from typing import List, Optional
from pydantic import BaseModel

class ProductListing(BaseModel):
    name: str
    price: float
    url: str
    rating: Optional[float] = None

class SearchResults(BaseModel):
    products: List[ProductListing]
    total_count: int
    page: int

parser = PydanticOutputParser(pydantic_object=SearchResults)

# search_results_html holds the HTML of the listings page you already fetched
prompt = f"""
Extract all product listings from this search results page.

{parser.get_format_instructions()}

HTML:
{search_results_html}
"""

# Process with your preferred LLM
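For example, a minimal way to run that prompt, reusing the ChatOpenAI llm and the parser defined above (a sketch, not a complete pipeline):
# Sketch: `llm` and `parser` come from the code above; `search_results_html`
# is assumed to be the HTML of a listings page you have already fetched.
response = llm.invoke(prompt)
results = parser.parse(response.content)
for product in results.products:
    print(product.name, product.price)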
Optimizing HTML for LLM Processing
To reduce token usage and improve accuracy:
1. Remove Unnecessary Elements
from bs4 import BeautifulSoup, Comment

def clean_html(html_content):
    soup = BeautifulSoup(html_content, 'html.parser')

    # Remove scripts, styles, and boilerplate sections
    for element in soup(['script', 'style', 'nav', 'footer', 'header']):
        element.decompose()

    # Remove HTML comments
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        comment.extract()

    # Strip attributes except data-* and class to save tokens
    for tag in soup.find_all(True):
        for attr in list(tag.attrs):
            if not attr.startswith('data-') and attr != 'class':
                del tag.attrs[attr]

    return str(soup)
2. Convert to Markdown
Many LLMs work better with markdown than raw HTML:
from markdownify import markdownify as md
cleaned_html = clean_html(html_content)
markdown_content = md(cleaned_html, heading_style="ATX")
# Use markdown in your prompt instead of HTML
Cost Optimization Strategies
LLM-based extraction can be expensive. Here are optimization techniques:
1. Cache Results
import hashlib

# Simple in-memory cache keyed by a hash of the page content
extraction_cache = {}

def extract_with_cache(html_content):
    html_hash = hashlib.md5(html_content.encode()).hexdigest()
    if html_hash not in extraction_cache:
        extraction_cache[html_hash] = extract_with_llm(html_content)  # your extraction logic here
    return extraction_cache[html_hash]

# Usage
result = extract_with_cache(html_content)
2. Use Smaller Models for Simple Tasks
# Use GPT-3.5 for straightforward extractions
model = "gpt-3.5-turbo" if simple_structure else "gpt-4-turbo-preview"
3. Batch Processing
Process multiple pages in a single API call when possible:
prompt = f"""
Extract data from these 5 product pages. Return an array of JSON objects.
Page 1:
{html_1}
Page 2:
{html_2}
...
"""
Error Handling and Validation
Always validate the model's output, since LLMs can hallucinate or mangle values:
from pydantic import ValidationError

try:
    result = parser.parse(llm_response)

    # Additional business logic validation
    if result.price < 0:
        raise ValueError("Price cannot be negative")
    if result.currency not in ['USD', 'EUR', 'GBP']:
        raise ValueError(f"Invalid currency: {result.currency}")

except ValidationError as e:
    print(f"Validation error: {e}")
    # Retry or use fallback method
except ValueError as e:
    print(f"Business logic error: {e}")
    # Handle invalid data
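When validation fails, a simple retry loop is often enough. The sketch below reuses the Pydantic parser above and assumes a hypothetical call_extraction_llm helper that returns the model's raw JSON text:
from pydantic import ValidationError

# Sketch: retry the whole extraction a few times before giving up.
# `call_extraction_llm` is a hypothetical helper returning the model's raw JSON text.
def extract_with_retries(html_content, max_attempts=3):
    last_error = None
    for _ in range(max_attempts):
        try:
            raw_output = call_extraction_llm(html_content)
            return parser.parse(raw_output)  # Pydantic validation as shown above
        except (ValidationError, ValueError) as error:
            last_error = error
    raise RuntimeError(f"Extraction failed after {max_attempts} attempts") from last_error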
Hybrid Approach: Combining Selectors and LLMs
For production systems, combine traditional selectors with LLM extraction:
def hybrid_extraction(html_content):
    soup = BeautifulSoup(html_content, 'html.parser')

    # Try traditional extraction first (faster and cheaper)
    try:
        name = soup.select_one('.product-name').text.strip()
        price_text = soup.select_one('.price').text
        price = float(price_text.replace('$', '').replace(',', '').strip())
        return {
            'name': name,
            'price': price,
            'method': 'selector'
        }
    except (AttributeError, ValueError):
        # Fall back to LLM extraction
        return extract_with_llm(html_content)
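The extract_with_llm fallback can be a thin wrapper around the Chat Completions call shown earlier. A minimal sketch, assuming the client object from the first example:
import json

# Sketch: LLM fallback used by hybrid_extraction; assumes `client` is the OpenAI client from above.
def extract_with_llm(html_content):
    completion = client.chat.completions.create(
        model="gpt-4-turbo-preview",
        messages=[
            {"role": "system", "content": "Return valid JSON with the fields: name, price."},
            {"role": "user", "content": f"Extract product data from this HTML:\n{html_content[:4000]}"}
        ],
        response_format={"type": "json_object"}
    )
    data = json.loads(completion.choices[0].message.content)
    data['method'] = 'llm'
    return data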
Real-World Example: E-commerce Scraper
Here's a complete example scraping product data:
import json
import requests
from bs4 import BeautifulSoup
from openai import OpenAI

class LLMProductScraper:
    def __init__(self, api_key):
        self.client = OpenAI(api_key=api_key)

    def fetch_page(self, url):
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        }
        response = requests.get(url, headers=headers)
        response.raise_for_status()
        return response.text

    def clean_html(self, html):
        soup = BeautifulSoup(html, 'html.parser')
        main = soup.find('main') or soup.find(id='content') or soup.body
        for tag in main(['script', 'style', 'nav', 'footer']):
            tag.decompose()
        return str(main)[:8000]  # Limit tokens

    def extract_data(self, html):
        cleaned = self.clean_html(html)
        response = self.client.chat.completions.create(
            model="gpt-4-turbo-preview",
            messages=[
                {
                    "role": "system",
                    "content": "Extract product data and return valid JSON with fields: name, price, currency, description, inStock, images (array), specifications (object)."
                },
                {"role": "user", "content": f"HTML:\n{cleaned}"}
            ],
            response_format={"type": "json_object"}
        )
        return json.loads(response.choices[0].message.content)

    def scrape(self, url):
        html = self.fetch_page(url)
        return self.extract_data(html)

# Usage
scraper = LLMProductScraper(api_key="your-api-key")
product_data = scraper.scrape("https://example.com/product/123")
print(json.dumps(product_data, indent=2))
Choosing the Right LLM for Data Extraction
Different LLMs have different strengths for structured data extraction tasks:
- GPT-4: Best accuracy, handles complex nested data
- GPT-3.5-Turbo: Good balance of speed and cost
- Claude 3: Excellent at following instructions, large context window
- Llama 2: Cost-effective for self-hosting
Conclusion
LLM-based HTML extraction offers a more resilient and intelligent approach to web scraping. While it comes with higher costs and latency compared to traditional methods, the reduced maintenance burden and improved reliability often justify the investment. For production systems, consider a hybrid approach that combines the speed of traditional selectors with the flexibility of LLMs as a fallback.
Start with small-scale experiments to understand token usage and costs, then scale up once you've optimized your prompts and data structures. With careful implementation, LLM-based extraction can handle even challenging pages that defeat selector-based scrapers, with minimal ongoing code maintenance.