How do I convert unstructured web page content into JSON using an LLM?
Converting unstructured web page content into structured JSON is one of the most powerful applications of Large Language Models (LLMs) in web scraping. Instead of writing complex parsing logic with XPath or CSS selectors, you can use LLMs to intelligently understand and extract data from any HTML content, transforming it into clean, structured JSON format.
This guide will show you how to leverage LLMs like GPT-4, Claude, and other models to automate the conversion of unstructured web content into structured data.
Why Use LLMs for JSON Conversion?
Traditional web scraping requires you to:
- Manually inspect HTML structure
- Write brittle CSS selectors or XPath expressions
- Update code when website layouts change
- Handle variations in data formats manually
LLMs eliminate these pain points by:
- Understanding context: Semantically interpreting content regardless of HTML structure
- Adapting to changes: Working even when page layouts change
- Handling variations: Processing different formats and edge cases automatically
- Reducing maintenance: Requiring minimal code updates over time
Basic Approach: Fetching and Converting
The fundamental workflow involves three steps:
- Fetch the HTML content from the target webpage
- Send it to an LLM with instructions to extract specific data
- Receive structured JSON output
Example Using Python with OpenAI GPT-4
import json
import requests
from openai import OpenAI

def convert_webpage_to_json(url, fields):
    # Step 1: Fetch the HTML content
    response = requests.get(url)
    html_content = response.text

    # Step 2: Initialize OpenAI client
    client = OpenAI(api_key='your-api-key')

    # Step 3: Create prompt for JSON conversion
    prompt = f"""Extract the following information from this HTML and return it as valid JSON:

Fields to extract: {', '.join(fields)}

HTML content:
{html_content}

Return only valid JSON with no additional text."""

    # Step 4: Call the LLM
    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {
                "role": "system",
                "content": "You are a data extraction assistant that converts HTML to JSON."
            },
            {
                "role": "user",
                "content": prompt
            }
        ],
        temperature=0  # Lower temperature for more consistent output
    )

    # Step 5: Parse and return JSON
    result = json.loads(completion.choices[0].message.content)
    return result

# Usage example
data = convert_webpage_to_json(
    'https://example.com/product/laptop',
    ['product_name', 'price', 'rating', 'description', 'availability']
)

print(json.dumps(data, indent=2))
Output:
{
  "product_name": "Dell XPS 13 Laptop",
  "price": 999.99,
  "rating": 4.5,
  "description": "Ultra-thin 13-inch laptop with Intel Core i7 processor",
  "availability": "In Stock"
}
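Even with an explicit "return only valid JSON" instruction, models occasionally wrap the answer in a Markdown code fence, which makes json.loads fail. A small, hypothetical helper that strips fences before parsing is a cheap safeguard you can drop into any of the examples in this guide:

import json
import re

def parse_llm_json(text):
    """Parse JSON from an LLM reply, tolerating ```json ... ``` fences."""
    match = re.search(r"```(?:json)?\s*(.*?)\s*```", text, re.DOTALL)
    if match:
        text = match.group(1)
    return json.loads(text.strip())

# Usage: result = parse_llm_json(completion.choices[0].message.content)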
Example Using JavaScript with Claude API
const Anthropic = require('@anthropic-ai/sdk');
const axios = require('axios');

async function convertWebpageToJSON(url, fields) {
  // Step 1: Fetch HTML content
  const response = await axios.get(url);
  const htmlContent = response.data;

  // Step 2: Initialize Claude client
  const anthropic = new Anthropic({
    apiKey: process.env.ANTHROPIC_API_KEY
  });

  // Step 3: Create extraction prompt
  const prompt = `Extract the following fields from this HTML and return as valid JSON:

Fields: ${fields.join(', ')}

HTML:
${htmlContent}

Return only the JSON object, no other text.`;

  // Step 4: Call Claude API
  const message = await anthropic.messages.create({
    model: 'claude-3-5-sonnet-20241022',
    max_tokens: 2048,
    temperature: 0,
    messages: [{
      role: 'user',
      content: prompt
    }]
  });

  // Step 5: Parse JSON response
  const jsonText = message.content[0].text;
  const data = JSON.parse(jsonText);
  return data;
}

// Usage example
convertWebpageToJSON(
  'https://example.com/article',
  ['title', 'author', 'publish_date', 'content', 'tags']
)
  .then(data => console.log(JSON.stringify(data, null, 2)))
  .catch(error => console.error('Error:', error));
Advanced Techniques for Better Results
1. Using Structured Output (JSON Schema)
OpenAI's Structured Outputs feature lets you attach a JSON Schema to the request so the model is constrained to valid, type-safe output. Note that strict mode requires every property to appear in "required"; optional fields can be modeled with a nullable type instead:

import json
from openai import OpenAI

client = OpenAI(api_key='your-api-key')

# Define the exact JSON structure you want
# (html_content comes from a fetch step like the ones shown earlier)
response = client.chat.completions.create(
    model="gpt-4o",  # Structured Outputs requires gpt-4o-2024-08-06 or newer
    messages=[
        {
            "role": "system",
            "content": "Extract product information from HTML."
        },
        {
            "role": "user",
            "content": html_content
        }
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "product_data",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "name": {
                        "type": "string",
                        "description": "Product name"
                    },
                    "price": {
                        "type": "number",
                        "description": "Price in USD"
                    },
                    "currency": {
                        "type": "string",
                        "enum": ["USD", "EUR", "GBP"]
                    },
                    "in_stock": {
                        "type": "boolean"
                    },
                    "features": {
                        "type": "array",
                        "items": {"type": "string"}
                    }
                },
                "required": ["name", "price", "currency", "in_stock", "features"],
                "additionalProperties": False
            }
        }
    }
)

product_data = json.loads(response.choices[0].message.content)
This approach guarantees:
- Valid JSON output every time
- Correct data types
- Required fields are always present
- No unexpected fields
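Anthropic's Messages API does not take a response_format parameter, but you can get a comparable guarantee by describing the desired output as a tool whose input_schema is your JSON Schema and forcing the model to call it. A minimal sketch, where the tool name and fields are illustrative and html_content is assumed to have been fetched as in the earlier examples:

import anthropic

client = anthropic.Anthropic(api_key='your-api-key')

product_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "price": {"type": "number"},
        "in_stock": {"type": "boolean"}
    },
    "required": ["name", "price", "in_stock"]
}

message = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    tools=[{
        "name": "record_product",  # illustrative tool name
        "description": "Record the product data extracted from the HTML.",
        "input_schema": product_schema
    }],
    tool_choice={"type": "tool", "name": "record_product"},  # force the tool call
    messages=[{
        "role": "user",
        "content": f"Extract the product data from this HTML:\n{html_content}"
    }]
)

# The structured result arrives as the tool call's input
product_data = next(block.input for block in message.content if block.type == "tool_use")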
2. Preprocessing HTML for Better Results
Clean and reduce HTML before sending to the LLM to save tokens and improve accuracy:
from bs4 import BeautifulSoup, Comment
import requests

def clean_html_for_llm(html):
    """Remove unnecessary elements and extract main content."""
    soup = BeautifulSoup(html, 'html.parser')

    # Remove scripts, styles, navigation, ads
    for element in soup(['script', 'style', 'nav', 'header',
                         'footer', 'aside', 'iframe', 'noscript']):
        element.decompose()

    # Remove comments
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        comment.extract()

    # Focus on main content
    main_content = (soup.find('main') or
                    soup.find('article') or
                    soup.find(class_='content') or
                    soup.body)

    return str(main_content) if main_content else str(soup)

def scrape_with_preprocessing(url):
    # Fetch HTML
    response = requests.get(url)

    # Clean HTML
    cleaned_html = clean_html_for_llm(response.text)

    # Now send cleaned HTML to LLM
    # ... (use previous examples)
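To get a feel for how much cleaning buys you, it can help to compare payload sizes before and after preprocessing. A quick sketch using the function above; the URL is a placeholder:

import requests

url = 'https://example.com/product/laptop'
raw_html = requests.get(url).text
cleaned_html = clean_html_for_llm(raw_html)

# Smaller payloads mean fewer tokens and usually better extraction accuracy
reduction = 100 * (1 - len(cleaned_html) / len(raw_html))
print(f"Raw: {len(raw_html):,} chars, cleaned: {len(cleaned_html):,} chars "
      f"({reduction:.0f}% smaller)")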
3. Batch Processing Multiple Pages
Process multiple pages efficiently by batching requests:
const Anthropic = require('@anthropic-ai/sdk');
const axios = require('axios');

async function batchConvertToJSON(urls, fields) {
  const anthropic = new Anthropic({
    apiKey: process.env.ANTHROPIC_API_KEY
  });

  const results = [];

  // Process in parallel with concurrency limit
  const concurrency = 5;

  for (let i = 0; i < urls.length; i += concurrency) {
    const batch = urls.slice(i, i + concurrency);

    const promises = batch.map(async (url) => {
      try {
        // Fetch HTML
        const response = await axios.get(url);

        // Convert to JSON
        const message = await anthropic.messages.create({
          model: 'claude-3-5-sonnet-20241022',
          max_tokens: 1024,
          messages: [{
            role: 'user',
            content: `Extract ${fields.join(', ')} from:\n${response.data}\nReturn only valid JSON, no other text.`
          }]
        });

        return {
          url: url,
          success: true,
          data: JSON.parse(message.content[0].text)
        };
      } catch (error) {
        return {
          url: url,
          success: false,
          error: error.message
        };
      }
    });

    const batchResults = await Promise.all(promises);
    results.push(...batchResults);

    // Rate limiting delay
    if (i + concurrency < urls.length) {
      await new Promise(resolve => setTimeout(resolve, 1000));
    }
  }

  return results;
}

// Usage
const urls = [
  'https://example.com/product/1',
  'https://example.com/product/2',
  'https://example.com/product/3'
];

batchConvertToJSON(urls, ['name', 'price', 'rating'])
  .then(results => console.log(results));
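If your pipeline is in Python, the same batching pattern with a concurrency cap can be expressed with asyncio. A minimal sketch, assuming the anthropic package's AsyncAnthropic client and the httpx library are available:

import asyncio
import json
import httpx
from anthropic import AsyncAnthropic

async def batch_convert_to_json(urls, fields, concurrency=5):
    """Fetch several pages and convert each to JSON with a concurrency cap."""
    client = AsyncAnthropic()  # reads ANTHROPIC_API_KEY from the environment
    semaphore = asyncio.Semaphore(concurrency)

    async def convert(url):
        async with semaphore:
            try:
                async with httpx.AsyncClient() as http:
                    response = await http.get(url, timeout=10)
                message = await client.messages.create(
                    model="claude-3-5-sonnet-20241022",
                    max_tokens=1024,
                    messages=[{
                        "role": "user",
                        "content": f"Extract {', '.join(fields)} from:\n{response.text}\n"
                                   "Return only valid JSON."
                    }]
                )
                return {"url": url, "success": True,
                        "data": json.loads(message.content[0].text)}
            except Exception as error:
                return {"url": url, "success": False, "error": str(error)}

    return await asyncio.gather(*(convert(url) for url in urls))

# Usage
# results = asyncio.run(batch_convert_to_json(urls, ["name", "price", "rating"]))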
4. Handling Dynamic Content with a Headless Browser
When scraping JavaScript-rendered pages, combine browser automation with LLM conversion. The same idea used when handling AJAX requests using Puppeteer applies here: wait for the dynamic content to load before extracting. The example below uses Playwright for Python:
from playwright.sync_api import sync_playwright
import anthropic
import json

def scrape_dynamic_page_to_json(url, fields):
    with sync_playwright() as p:
        # Launch browser
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()

        # Navigate and wait for content
        page.goto(url)
        page.wait_for_load_state('networkidle')

        # Get fully rendered HTML
        html_content = page.content()
        browser.close()

    # Convert to JSON using Claude
    client = anthropic.Anthropic(api_key='your-api-key')

    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=2048,
        messages=[{
            "role": "user",
            "content": f"""Extract these fields from the HTML: {', '.join(fields)}

HTML:
{html_content}

Return as valid JSON only."""
        }]
    )

    # Parse JSON
    return json.loads(message.content[0].text)

# Usage for SPA or AJAX-heavy sites
data = scrape_dynamic_page_to_json(
    'https://example.com/spa-page',
    ['articles', 'total_count', 'categories']
)

print(json.dumps(data, indent=2))
5. Robust Error Handling
Always implement retry logic and validation:
import requests
from openai import OpenAI
import json
import time
from jsonschema import validate, ValidationError

def robust_html_to_json(url, fields, schema=None, max_retries=3):
    """Convert HTML to JSON with retry logic and validation."""
    client = OpenAI(api_key='your-api-key')

    # Fetch HTML
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    html_content = response.text

    for attempt in range(max_retries):
        try:
            # Call LLM
            completion = client.chat.completions.create(
                model="gpt-4",
                messages=[
                    {
                        "role": "system",
                        "content": "Extract data from HTML and return valid JSON only."
                    },
                    {
                        "role": "user",
                        "content": f"Extract {', '.join(fields)} from:\n{html_content}"
                    }
                ],
                temperature=0
            )

            # Parse JSON
            result = json.loads(completion.choices[0].message.content)

            # Validate against schema if provided
            if schema:
                validate(instance=result, schema=schema)

            # Check required fields
            missing_fields = [f for f in fields if f not in result]
            if missing_fields:
                raise ValueError(f"Missing fields: {missing_fields}")

            return {
                'success': True,
                'data': result,
                'attempt': attempt + 1
            }

        except (json.JSONDecodeError, ValidationError, ValueError) as e:
            print(f"Attempt {attempt + 1} failed: {str(e)}")

            if attempt == max_retries - 1:
                return {
                    'success': False,
                    'error': str(e),
                    'attempt': attempt + 1
                }

            # Exponential backoff
            time.sleep(2 ** attempt)

    return {'success': False, 'error': 'Max retries exceeded'}

# Usage with validation
schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "price": {"type": "number"},
        "rating": {"type": "number", "minimum": 0, "maximum": 5}
    },
    "required": ["title", "price"]
}

result = robust_html_to_json(
    'https://example.com/product',
    ['title', 'price', 'rating'],
    schema=schema
)

if result['success']:
    print("Extracted data:", result['data'])
else:
    print("Extraction failed:", result['error'])
Converting Complex Nested Structures
LLMs excel at handling deeply nested HTML and extracting hierarchical JSON:
const Anthropic = require('@anthropic-ai/sdk');
const axios = require('axios');

async function extractNestedData(url) {
  const response = await axios.get(url);

  const anthropic = new Anthropic({
    apiKey: process.env.ANTHROPIC_API_KEY
  });

  const message = await anthropic.messages.create({
    model: 'claude-3-5-sonnet-20241022',
    max_tokens: 4096,
    messages: [{
      role: 'user',
      content: `Extract a nested JSON structure with this format:
{
  "page_title": "string",
  "categories": [
    {
      "name": "string",
      "products": [
        {
          "name": "string",
          "price": number,
          "specs": {
            "color": "string",
            "size": "string",
            "weight": "string"
          },
          "reviews": [
            {
              "author": "string",
              "rating": number,
              "comment": "string"
            }
          ]
        }
      ]
    }
  ]
}

HTML:
${response.data}

Return only the JSON object, no other text.`
    }]
  });

  return JSON.parse(message.content[0].text);
}

// Usage
extractNestedData('https://example.com/catalog')
  .then(data => {
    console.log('Page Title:', data.page_title);
    console.log('Categories:', data.categories.length);
    data.categories.forEach(cat => {
      console.log(`  ${cat.name}: ${cat.products.length} products`);
    });
  });
Using WebScraping.AI for LLM-Powered JSON Conversion
WebScraping.AI offers built-in LLM-powered extraction and handles browser automation and proxy rotation for you:
import requests

api_key = 'your-webscraping-ai-api-key'

# Field-based extraction (automatically returns JSON)
response = requests.get(
    'https://api.webscraping.ai/fields',
    params={
        'api_key': api_key,
        'url': 'https://example.com/product',
        'fields': 'name,price,description,rating,availability,features'
    }
)

# Already structured as JSON
product_data = response.json()
print(product_data)

The same endpoint works from JavaScript:

const axios = require('axios');

async function scrapeWithAI(url, fields) {
  const response = await axios.get('https://api.webscraping.ai/fields', {
    params: {
      api_key: 'your-api-key',
      url: url,
      fields: fields.join(',')
    }
  });

  return response.data;
}

// Usage
scrapeWithAI(
  'https://example.com/article',
  ['headline', 'author', 'publish_date', 'body', 'tags']
)
  .then(data => console.log(data));
Best Practices for Production Use
1. Optimize Token Usage
from bs4 import BeautifulSoup

def optimize_html_for_tokens(html, max_length=8000):
    """Reduce HTML to fit within token limits."""
    soup = BeautifulSoup(html, 'html.parser')

    # Remove unnecessary elements
    for tag in soup(['script', 'style', 'svg', 'path']):
        tag.decompose()

    # Remove attributes that don't help extraction
    for tag in soup.find_all(True):
        tag.attrs = {k: v for k, v in tag.attrs.items()
                     if k in ['class', 'id', 'href', 'src']}

    # Truncate if still too long
    text = str(soup)
    if len(text) > max_length:
        text = text[:max_length]

    return text
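Character counts are only a rough proxy for tokens. If you want to trim against the model's actual tokenizer, the tiktoken library (assuming it is installed) can count tokens before you send the request:

import tiktoken

def count_tokens(text, model="gpt-4"):
    """Return the number of tokens the given model would see for this text."""
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

# html fetched earlier; optimize_html_for_tokens is defined above
cleaned = optimize_html_for_tokens(html)
print(f"Prompt payload is ~{count_tokens(cleaned)} tokens")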
2. Cache Results
from functools import lru_cache

@lru_cache(maxsize=1000)
def cached_convert_to_json(url, fields_str):
    """Cache LLM responses to avoid duplicate API calls.

    lru_cache keys on the function arguments, so each unique
    url/fields combination only triggers one LLM call.
    """
    return convert_webpage_to_json(url, fields_str.split(','))

# Usage
result = cached_convert_to_json(url, ','.join(fields))
3. Monitor Costs and Performance
import time
import logging
import json
from openai import OpenAI

class LLMScraper:
    def __init__(self, api_key):
        self.client = OpenAI(api_key=api_key)
        self.total_tokens = 0
        self.total_requests = 0
        self.total_cost = 0

    def convert_to_json(self, html, fields):
        start_time = time.time()

        response = self.client.chat.completions.create(
            model="gpt-4",
            messages=[...],  # build the extraction prompt as in the earlier examples
        )

        # Track metrics
        tokens_used = response.usage.total_tokens
        self.total_tokens += tokens_used
        self.total_requests += 1

        # Calculate cost (example rates)
        cost = (tokens_used / 1000) * 0.03  # $0.03 per 1K tokens
        self.total_cost += cost

        duration = time.time() - start_time
        logging.info(f"Request completed in {duration:.2f}s, "
                     f"Tokens: {tokens_used}, Cost: ${cost:.4f}")

        return json.loads(response.choices[0].message.content)

    def get_stats(self):
        return {
            'total_requests': self.total_requests,
            'total_tokens': self.total_tokens,
            'total_cost': self.total_cost,
            'avg_tokens_per_request': self.total_tokens / max(self.total_requests, 1)
        }
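A sketch of how the tracking class above might be wired into a small crawl, once its prompt construction is filled in; the URLs and field names are placeholders, and fetching/cleaning reuse helpers from earlier sections:

import requests

scraper = LLMScraper(api_key='your-api-key')

for url in ['https://example.com/product/1', 'https://example.com/product/2']:
    # clean_html_for_llm is the preprocessing helper defined earlier
    html = clean_html_for_llm(requests.get(url).text)
    data = scraper.convert_to_json(html, ['name', 'price'])
    print(data)

# Review spend and usage after the crawl
print(scraper.get_stats())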
Conclusion
Converting unstructured web page content into JSON using LLMs transforms web scraping from a brittle, maintenance-heavy process into an intelligent, adaptive workflow. By combining traditional web scraping tools for fetching and navigating pages with LLM-powered extraction, you can build robust data pipelines that adapt to changing website structures and handle complex, nested data with ease.
Key takeaways:
- Start simple: Basic LLM API calls can handle most conversion tasks
- Use structured output: JSON schemas guarantee valid, type-safe results
- Preprocess HTML: Clean and optimize content to reduce tokens and costs
- Implement error handling: Retry logic and validation prevent failures
- Monitor performance: Track token usage and costs in production
- Consider managed services: APIs like WebScraping.AI handle infrastructure complexity
As LLM technology continues to improve with faster inference, lower costs, and larger context windows, converting unstructured content to JSON will become even more powerful and accessible for developers building web scraping applications.