How do I convert HTML to JSON using Claude AI?
Converting HTML to JSON with Claude AI uses a large language model (LLM) to extract structured data from web pages without hand-written parsing logic. Claude can read HTML and pull out specific fields based on natural language instructions, which makes it well suited to dynamic or complex web scraping scenarios.
What is Claude AI for HTML to JSON Conversion?
Claude AI is an advanced language model developed by Anthropic that can process and understand HTML content. Instead of using traditional parsing methods like CSS selectors or XPath, you can simply describe what data you want to extract, and Claude will convert the HTML into a clean JSON structure. This approach is particularly useful when dealing with:
- Websites with frequently changing layouts
- Complex nested HTML structures
- Pages where traditional selectors are unreliable
- Data that requires contextual understanding
Setting Up Claude AI for HTML Conversion
Before you can convert HTML to JSON with Claude AI, you'll need:
- API Key: Sign up for an Anthropic API account at console.anthropic.com
- API Client: Install the official Claude SDK or use HTTP requests
- HTML Content: The web page content you want to convert
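As a minimal sketch of the API key prerequisite, the key can be read from an environment variable instead of being written into source code. This assumes the variable is named ANTHROPIC_API_KEY, the same name the JavaScript example in this article reads; the helper name load_api_key is hypothetical and not part of the SDK.

```python
import os


def load_api_key(env_var: str = "ANTHROPIC_API_KEY") -> str:
    """Return the API key stored in the given environment variable.

    Raises RuntimeError with a readable message when the variable is
    missing or empty, so a misconfigured environment fails early.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return key
```

The returned value can then be passed as the api_key argument when constructing anthropic.Anthropic, in place of a hard-coded string.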
Installation
Python:
pip install anthropic
JavaScript/Node.js:
npm install @anthropic-ai/sdk
Basic HTML to JSON Conversion with Claude AI
Python Example
Here's a complete example showing how to convert HTML to JSON using Claude AI in Python:
import anthropic
import json
# Initialize the Claude client
client = anthropic.Anthropic(
    api_key="your-api-key-here"
)
# Sample HTML content
html_content = """
<div class="product">
<h1>Premium Wireless Headphones</h1>
<span class="price">$299.99</span>
<div class="rating">4.5 stars</div>
<p class="description">High-quality wireless headphones with noise cancellation</p>
<ul class="features">
<li>40-hour battery life</li>
<li>Active noise cancellation</li>
<li>Bluetooth 5.0</li>
</ul>
</div>
"""
# Create the conversion prompt
prompt = f"""
Extract the following information from this HTML and return it as JSON:
- product_name
- price (as a number)
- rating (as a number)
- description
- features (as an array)
HTML:
{html_content}
Return only valid JSON, no other text.
"""
# Call Claude API
message = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": prompt}
    ]
)
# Parse the response
response_text = message.content[0].text
product_data = json.loads(response_text)
print(json.dumps(product_data, indent=2))
Output:
{
  "product_name": "Premium Wireless Headphones",
  "price": 299.99,
  "rating": 4.5,
  "description": "High-quality wireless headphones with noise cancellation",
  "features": [
    "40-hour battery life",
    "Active noise cancellation",
    "Bluetooth 5.0"
  ]
}
JavaScript Example
Here's the equivalent implementation in JavaScript:
const Anthropic = require('@anthropic-ai/sdk');

const client = new Anthropic({
  apiKey: process.env.ANTHROPIC_API_KEY,
});

async function convertHtmlToJson(htmlContent) {
  const prompt = `
Extract the following information from this HTML and return it as JSON:
- product_name
- price (as a number)
- rating (as a number)
- description
- features (as an array)
HTML:
${htmlContent}
Return only valid JSON, no other text.
`;
  const message = await client.messages.create({
    model: 'claude-3-5-sonnet-20241022',
    max_tokens: 1024,
    messages: [
      { role: 'user', content: prompt }
    ]
  });
  const responseText = message.content[0].text;
  return JSON.parse(responseText);
}

// Example usage
const htmlContent = `
<div class="product">
  <h1>Premium Wireless Headphones</h1>
  <span class="price">$299.99</span>
  <div class="rating">4.5 stars</div>
  <p class="description">High-quality wireless headphones with noise cancellation</p>
</div>
`;

convertHtmlToJson(htmlContent)
  .then(data => console.log(JSON.stringify(data, null, 2)))
  .catch(error => console.error('Error:', error));
Advanced Techniques
Using Structured Outputs
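Claude AI can produce structured output through tool calling: you define a tool whose input schema is a JSON Schema and force Claude to call it. The tool input then follows the schema's shape, which makes responses far more consistent than free-form text and removes the need to parse JSON out of a string (html_content below is the sample HTML from the basic example):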
import anthropic
import json
client = anthropic.Anthropic(api_key="your-api-key-here")
# Define the expected JSON schema
json_schema = {
    "type": "object",
    "properties": {
        "product_name": {"type": "string"},
        "price": {"type": "number"},
        "rating": {"type": "number"},
        "description": {"type": "string"},
        "features": {
            "type": "array",
            "items": {"type": "string"}
        }
    },
    "required": ["product_name", "price"]
}
# Use tool calling for structured output
message = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    tools=[{
        "name": "extract_product_data",
        "description": "Extracts product information from HTML",
        "input_schema": json_schema
    }],
    tool_choice={"type": "tool", "name": "extract_product_data"},
    messages=[{
        "role": "user",
        "content": f"Extract product data from this HTML:\n{html_content}"
    }]
)
# Extract the structured data
tool_use = next(block for block in message.content if block.type == "tool_use")
product_data = tool_use.input
print(json.dumps(product_data, indent=2))
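Even with a forced tool call, it can be worth checking the returned dictionary before using it. The sketch below is a hand-rolled check that covers only a small subset of JSON Schema ("type", "properties", "required", "items"); check_against_schema is a hypothetical helper, not part of the Anthropic SDK, and a dedicated validation library would be the more complete option. The schema is repeated so the block runs on its own.

```python
def check_against_schema(data, schema):
    """Return a list of problems found when comparing data to a small JSON Schema subset.

    An empty list means no problems were found.
    """
    type_map = {
        "object": dict,
        "array": list,
        "string": str,
        "number": (int, float),
        "boolean": bool,
    }
    problems = []

    def walk(value, node, path):
        expected = node.get("type")
        py_type = type_map.get(expected)
        if py_type is not None:
            # bool is a subclass of int in Python, so reject it explicitly for numbers
            is_bool_as_number = expected == "number" and isinstance(value, bool)
            if not isinstance(value, py_type) or is_bool_as_number:
                problems.append(f"{path}: expected {expected}, got {type(value).__name__}")
                return
        if expected == "object":
            for name in node.get("required", []):
                if name not in value:
                    problems.append(f"{path}: missing required field '{name}'")
            for name, child in node.get("properties", {}).items():
                if name in value:
                    walk(value[name], child, f"{path}.{name}")
        elif expected == "array" and "items" in node:
            for i, item in enumerate(value):
                walk(item, node["items"], f"{path}[{i}]")

    walk(data, schema, "$")
    return problems


# Same shape as the schema used for the tool definition above
product_schema = {
    "type": "object",
    "properties": {
        "product_name": {"type": "string"},
        "price": {"type": "number"},
        "rating": {"type": "number"},
        "description": {"type": "string"},
        "features": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["product_name", "price"],
}

# A price returned as a string is reported as a problem
print(check_against_schema({"product_name": "Premium Wireless Headphones", "price": "299.99"}, product_schema))
```

Returning a list of problems rather than raising lets the caller decide whether to retry the request, fill in defaults, or discard the record.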
Combining with Web Scraping
When working with real websites, you'll need to fetch the HTML first. Here's an example combining HTML fetching with Claude AI conversion:
import anthropic
import requests
import json
def scrape_and_convert(url):
    # Fetch HTML content
    response = requests.get(url, headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    }, timeout=30)
    response.raise_for_status()  # Stop early on HTTP errors (4xx/5xx)
    # Limit to the first 10000 chars to stay within token limits
    html_content = response.text[:10000]
    # Initialize Claude client
    client = anthropic.Anthropic(api_key="your-api-key-here")
    # Convert to JSON
    prompt = f"""
    Extract product information from this e-commerce page.
    Return a JSON object with: name, price, availability, description, and images array.
    HTML:
    {html_content}
    Return only valid JSON.
    """
    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=2048,
        messages=[{"role": "user", "content": prompt}]
    )
    return json.loads(message.content[0].text)
# Usage
product_data = scrape_and_convert("https://example.com/product")
print(json.dumps(product_data, indent=2))
For JavaScript-rendered pages, a headless browser such as Puppeteer can load the page and wait for its AJAX requests to finish, so that the complete HTML is available before you pass it to Claude AI.
Handling Multiple Items and Lists
When extracting multiple products or items from a page, Claude AI can return arrays of objects:
html_listing = """
<div class="products">
<div class="product">
<h3>Laptop Pro</h3>
<span class="price">$1299</span>
</div>
<div class="product">
<h3>Wireless Mouse</h3>
<span class="price">$29.99</span>
</div>
<div class="product">
<h3>USB-C Hub</h3>
<span class="price">$49.99</span>
</div>
</div>
"""
prompt = f"""
Extract all products from this HTML.
Return a JSON object with a "products" array containing name and price for each product.
HTML:
{html_listing}
Return only valid JSON.
"""
message = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    messages=[{"role": "user", "content": prompt}]
)
data = json.loads(message.content[0].text)
print(json.dumps(data, indent=2))
Output:
{
  "products": [
    {
      "name": "Laptop Pro",
      "price": 1299
    },
    {
      "name": "Wireless Mouse",
      "price": 29.99
    },
    {
      "name": "USB-C Hub",
      "price": 49.99
    }
  ]
}
Error Handling and Best Practices
Robust Error Handling
Always implement proper error handling when converting HTML to JSON:
import anthropic
import json
from anthropic import APIError
def safe_html_to_json(html_content, schema_description):
    try:
        client = anthropic.Anthropic(api_key="your-api-key-here")
        prompt = f"""
        {schema_description}
        HTML:
        {html_content}
        Return only valid JSON. If information is missing, use null.
        """
        message = client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=2048,
            messages=[{"role": "user", "content": prompt}]
        )
        response_text = message.content[0].text.strip()
        # Remove markdown code blocks if present
        # ("```json" is 7 characters long, the closing fence is 3)
        if response_text.startswith("```json"):
            response_text = response_text[7:-3].strip()
        elif response_text.startswith("```"):
            response_text = response_text[3:-3].strip()
        return json.loads(response_text)
    except json.JSONDecodeError as e:
        print(f"Invalid JSON response: {e}")
        return None
    except APIError as e:
        print(f"API error: {e}")
        return None
    except Exception as e:
        print(f"Unexpected error: {e}")
        return None
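The prefix checks in safe_html_to_json handle a response that starts and ends with a code fence, but a model may also add a sentence before or after the JSON. The following is a more tolerant sketch under that assumption; extract_json is a hypothetical helper name, and the fence marker is built from single backticks only to keep the listing easy to embed.

```python
import json
import re

FENCE = "`" * 3  # the three-backtick markdown fence marker

# Matches a fenced block with an optional language tag, anywhere in the text
_FENCED_BLOCK = re.compile(
    re.escape(FENCE) + r"[\w-]*\s*\n(.*?)" + re.escape(FENCE),
    re.DOTALL,
)


def extract_json(response_text):
    """Parse JSON from model output that may be wrapped in a code fence or prose."""
    text = response_text.strip()
    fenced = _FENCED_BLOCK.search(text)
    if fenced:
        text = fenced.group(1).strip()
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # Fallback: decode the first JSON value that starts at a "{" or "["
    decoder = json.JSONDecoder()
    for index, char in enumerate(text):
        if char in "{[":
            try:
                value, _ = decoder.raw_decode(text, index)
                return value
            except json.JSONDecodeError:
                continue
    raise ValueError("No JSON value found in response")
```

Raising ValueError when nothing parses keeps the failure visible to the caller, which can then log the raw response or retry the request.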
Best Practices
- Limit HTML Size: Claude has a finite context window and bills per input token. Truncate HTML or extract relevant sections before sending:
# Extract just the main content
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_content, 'html.parser')
# Fall back to the whole document when there is no <main> or <article>
main_content = soup.find('main') or soup.find('article') or soup
cleaned_html = str(main_content)[:10000]
- Be Specific in Prompts: Clearly define the expected JSON structure and field types.
- Use Examples: Provide example output in your prompt for consistent formatting:
# f-string so {html_content} is substituted; literal JSON braces are doubled
prompt = f"""
Extract product data and return it like this example:
{{
  "name": "Product Name",
  "price": 99.99,
  "in_stock": true
}}
HTML:
{html_content}
"""
- Validate Output: Always validate that the JSON structure matches your expectations.
- Handle Rate Limits: Implement retry logic and respect API rate limits.
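The retry advice in the last item can be sketched as a small wrapper with exponential backoff and jitter. This is a generic illustration, not SDK functionality: call_with_retries and its parameters are hypothetical names, and in real use retry_on would typically be narrowed to the exception your SDK version raises for rate limiting rather than every Exception.

```python
import random
import time


def call_with_retries(func, *, max_attempts=4, base_delay=1.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call func() and retry with exponential backoff plus jitter when it raises."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            # Delay doubles each attempt (base, 2*base, 4*base, ...) plus up to 25% jitter
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay * 0.25))
```

A request would be wrapped as call_with_retries(lambda: client.messages.create(...)), with the same arguments used elsewhere in this article. The sleep parameter exists so the delay can be replaced in tests.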
Cost Considerations
Claude AI pricing is based on tokens processed. For HTML to JSON conversion:
- Input tokens: HTML content + prompt
- Output tokens: JSON response
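The arithmetic behind a per-request estimate is simple enough to sketch. The prices below are placeholders chosen for illustration, not actual Anthropic prices, and estimate_cost is a hypothetical helper; real token counts for a request would come from the usage data returned with the API response.

```python
def estimate_cost(input_tokens, output_tokens,
                  input_price_per_mtok, output_price_per_mtok):
    """Return the estimated cost of one request, in the currency of the prices.

    Prices are expressed per million tokens.
    """
    input_cost = input_tokens / 1_000_000 * input_price_per_mtok
    output_cost = output_tokens / 1_000_000 * output_price_per_mtok
    return input_cost + output_cost


# Placeholder prices for illustration only; check the current pricing page for real values
cost = estimate_cost(
    input_tokens=3000,    # HTML content + prompt
    output_tokens=200,    # JSON response
    input_price_per_mtok=2.0,
    output_price_per_mtok=10.0,
)
print(f"Estimated cost per request: {cost:.4f}")
```

Because HTML is verbose, the input side usually dominates, which is why trimming the page before sending it tends to have the largest effect on cost.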
To optimize costs:
- Preprocess HTML to remove unnecessary tags (scripts, styles)
- Extract only relevant sections before sending to Claude
- Use caching for repeated conversions of similar pages
- Choose the appropriate model (Claude Haiku for simple extractions, Sonnet for complex ones)
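The first and third points above can be sketched with the standard library alone. The regular expressions are a heuristic, not a full HTML parser, and the cache only recognizes pages that are identical after cleaning rather than merely similar; strip_noise and cached_convert are hypothetical helper names, and convert stands for whatever function sends the HTML to Claude.

```python
import hashlib
import re

# Heuristic patterns: drop <script>/<style> blocks and HTML comments
_NOISE = re.compile(
    r"<(script|style)\b[^>]*>.*?</\1\s*>|<!--.*?-->",
    re.DOTALL | re.IGNORECASE,
)
_BLANK_RUNS = re.compile(r"\n\s*\n+")


def strip_noise(html):
    """Remove script blocks, style blocks, comments, and runs of blank lines."""
    cleaned = _NOISE.sub("", html)
    return _BLANK_RUNS.sub("\n", cleaned).strip()


_cache = {}


def cached_convert(html, convert):
    """Run convert(cleaned_html) once per distinct cleaned page and reuse the result."""
    cleaned = strip_noise(html)
    key = hashlib.sha256(cleaned.encode("utf-8")).hexdigest()
    if key not in _cache:
        _cache[key] = convert(cleaned)
    return _cache[key]
```

Hashing the cleaned HTML rather than the raw page means that differences confined to scripts, styles, or comments still hit the cache. An in-memory dictionary is used here for brevity; a persistent store would be needed to keep results across runs.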
Alternative: Using WebScraping.AI with Claude
For production web scraping with AI-powered extraction, consider using specialized APIs that combine headless browsing with LLM extraction. When handling browser sessions in Puppeteer, you can capture the rendered HTML and then pass it to Claude for intelligent data extraction.
Conclusion
Converting HTML to JSON using Claude AI offers a flexible, code-light approach to web scraping. It's particularly valuable for:
- Rapid prototyping of scraping projects
- Websites with complex or changing structures
- Extracting data that requires semantic understanding
- Scenarios where traditional parsing is too brittle
While it may have higher per-request costs than traditional parsing, the reduced development time and increased flexibility often make it a worthwhile trade-off, especially for complex extraction tasks.
By combining Claude AI's natural language understanding with proper HTML preprocessing and error handling, you can build robust web scraping solutions that adapt to website changes without requiring constant maintenance of selectors and parsing logic.