How do I convert HTML to JSON using Claude AI?

Claude AI, a large language model (LLM), can read HTML and return the fields you describe in plain language as JSON, so you can extract structured data from web pages without writing complex parsing logic. Because Claude interprets the markup rather than matching fixed selectors, this approach suits dynamic or complex web scraping scenarios.

What is Claude AI for HTML to JSON Conversion?

Claude AI is an advanced language model developed by Anthropic that can process and understand HTML content. Instead of using traditional parsing methods like CSS selectors or XPath, you can simply describe what data you want to extract, and Claude will convert the HTML into a clean JSON structure. This approach is particularly useful when dealing with:

  • Websites with frequently changing layouts
  • Complex nested HTML structures
  • Pages where traditional selectors are unreliable
  • Data that requires contextual understanding

Setting Up Claude AI for HTML Conversion

Before you can convert HTML to JSON with Claude AI, you'll need:

  1. API Key: Sign up for an Anthropic API account at console.anthropic.com
  2. API Client: Install the official Claude SDK or use HTTP requests
  3. HTML Content: The web page content you want to convert

Installation

Python:

pip install anthropic

JavaScript/Node.js:

npm install @anthropic-ai/sdk
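
The JavaScript example later in this article reads the key from the `ANTHROPIC_API_KEY` environment variable rather than hard-coding it. A minimal Python sketch of the same idea follows; `load_api_key` is a hypothetical helper name, not part of the SDK:

```python
import os


def load_api_key(env=os.environ, var_name="ANTHROPIC_API_KEY"):
    """Return the API key from the environment, or raise a clear error."""
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable before running.")
    return key
```

You could then pass the result to the client, for example `anthropic.Anthropic(api_key=load_api_key())`, which keeps the key out of source control.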

Basic HTML to JSON Conversion with Claude AI

Python Example

Here's a complete example showing how to convert HTML to JSON using Claude AI in Python:

import anthropic
import json

# Initialize the Claude client
client = anthropic.Anthropic(
    api_key="your-api-key-here"
)

# Sample HTML content
html_content = """
<div class="product">
    <h1>Premium Wireless Headphones</h1>
    <span class="price">$299.99</span>
    <div class="rating">4.5 stars</div>
    <p class="description">High-quality wireless headphones with noise cancellation</p>
    <ul class="features">
        <li>40-hour battery life</li>
        <li>Active noise cancellation</li>
        <li>Bluetooth 5.0</li>
    </ul>
</div>
"""

# Create the conversion prompt
prompt = f"""
Extract the following information from this HTML and return it as JSON:
- product_name
- price (as a number)
- rating (as a number)
- description
- features (as an array)

HTML:
{html_content}

Return only valid JSON, no other text.
"""

# Call Claude API
message = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": prompt}
    ]
)

# Parse the response
response_text = message.content[0].text
product_data = json.loads(response_text)

print(json.dumps(product_data, indent=2))

Output:

{
  "product_name": "Premium Wireless Headphones",
  "price": 299.99,
  "rating": 4.5,
  "description": "High-quality wireless headphones with noise cancellation",
  "features": [
    "40-hour battery life",
    "Active noise cancellation",
    "Bluetooth 5.0"
  ]
}
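
If you extract different field sets from different pages, the prompt above can be generated instead of written by hand. The sketch below assumes a hypothetical helper, `build_extraction_prompt`, that takes the HTML and a mapping of field name to an optional type hint and reproduces the prompt layout used in the example:

```python
def build_extraction_prompt(html, fields):
    """Build an extraction prompt from a mapping of field name -> type hint.

    `fields` maps each JSON key to an optional hint such as "as a number";
    pass an empty string when no hint is needed.
    """
    lines = []
    for name, hint in fields.items():
        lines.append(f"- {name} ({hint})" if hint else f"- {name}")
    field_list = "\n".join(lines)
    return (
        "Extract the following information from this HTML and return it as JSON:\n"
        f"{field_list}\n\n"
        f"HTML:\n{html}\n\n"
        "Return only valid JSON, no other text."
    )


prompt = build_extraction_prompt(
    "<span class='price'>$299.99</span>",
    {"product_name": "", "price": "as a number"},
)
```

The returned string can be sent as the `content` of the user message exactly as in the example above.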

JavaScript Example

Here's the equivalent implementation in JavaScript:

const Anthropic = require('@anthropic-ai/sdk');

const client = new Anthropic({
  apiKey: process.env.ANTHROPIC_API_KEY,
});

async function convertHtmlToJson(htmlContent) {
  const prompt = `
Extract the following information from this HTML and return it as JSON:
- product_name
- price (as a number)
- rating (as a number)
- description
- features (as an array)

HTML:
${htmlContent}

Return only valid JSON, no other text.
  `;

  const message = await client.messages.create({
    model: 'claude-3-5-sonnet-20241022',
    max_tokens: 1024,
    messages: [
      { role: 'user', content: prompt }
    ]
  });

  const responseText = message.content[0].text;
  return JSON.parse(responseText);
}

// Example usage
const htmlContent = `
<div class="product">
    <h1>Premium Wireless Headphones</h1>
    <span class="price">$299.99</span>
    <div class="rating">4.5 stars</div>
    <p class="description">High-quality wireless headphones with noise cancellation</p>
</div>
`;

convertHtmlToJson(htmlContent)
  .then(data => console.log(JSON.stringify(data, null, 2)))
  .catch(error => console.error('Error:', error));

Advanced Techniques

Using Structured Outputs

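Claude can return structured output through tool use (tool calling): define a tool whose input_schema is a JSON Schema describing the fields you want, then set tool_choice to force that tool. The result arrives as an already-parsed object in the tool_use block rather than as free text, which makes responses more consistent. The example below reuses html_content from the basic Python example: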

import anthropic
import json

client = anthropic.Anthropic(api_key="your-api-key-here")

# Define the expected JSON schema
json_schema = {
    "type": "object",
    "properties": {
        "product_name": {"type": "string"},
        "price": {"type": "number"},
        "rating": {"type": "number"},
        "description": {"type": "string"},
        "features": {
            "type": "array",
            "items": {"type": "string"}
        }
    },
    "required": ["product_name", "price"]
}

# Use tool calling for structured output
message = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    tools=[{
        "name": "extract_product_data",
        "description": "Extracts product information from HTML",
        "input_schema": json_schema
    }],
    tool_choice={"type": "tool", "name": "extract_product_data"},
    messages=[{
        "role": "user",
        "content": f"Extract product data from this HTML:\n{html_content}"
    }]
)

# Extract the structured data
tool_use = next(block for block in message.content if block.type == "tool_use")
product_data = tool_use.input
print(json.dumps(product_data, indent=2))
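
Even with a forced tool call, it may be worth checking the returned object before using it, since strict conformance to the schema is not something this article's example verifies. The following is a minimal, hand-rolled check that mirrors the schema above; `validate_product` is a hypothetical helper, and a full JSON Schema validator library would be a reasonable alternative:

```python
from numbers import Number


def validate_product(data):
    """Return a list of problems found in an extracted product dict.

    Mirrors the schema above: product_name and price are required;
    rating, description, and features are optional.
    """
    if not isinstance(data, dict):
        return ["top-level value is not an object"]

    problems = []
    expected = {
        "product_name": str,
        "price": Number,
        "rating": Number,
        "description": str,
        "features": list,
    }
    for key in ("product_name", "price"):
        if data.get(key) is None:
            problems.append(f"missing required field: {key}")
    for key, expected_type in expected.items():
        value = data.get(key)
        if value is None:
            continue
        # bool is a subclass of int in Python, so reject it explicitly;
        # otherwise True would pass as a number
        if isinstance(value, bool) or not isinstance(value, expected_type):
            problems.append(f"{key} has unexpected type {type(value).__name__}")
    features = data.get("features")
    if isinstance(features, list) and not all(isinstance(f, str) for f in features):
        problems.append("features must contain only strings")
    return problems
```

An empty list means the object passed every check; otherwise you can log the problems and retry or skip the page.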

Combining with Web Scraping

When working with real websites, you'll need to fetch the HTML first. Here's an example combining HTML fetching with Claude AI conversion:

import anthropic
import requests
import json

def scrape_and_convert(url):
    # Fetch HTML content
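    response = requests.get(
        url,
        headers={
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        },
        timeout=30,  # avoid hanging indefinitely on a slow server
    )
    response.raise_for_status()  # fail early on 4xx/5xx instead of parsing an error page
    html_content = response.text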

    # Initialize Claude client
    client = anthropic.Anthropic(api_key="your-api-key-here")

    # Convert to JSON
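    # Limit to the first 10,000 characters to keep the prompt small.
    # (This must happen outside the f-string; a "#" comment inside the
    # string would be sent to the model as part of the prompt.)
    truncated_html = html_content[:10000]

    prompt = f"""
    Extract product information from this e-commerce page.
    Return a JSON object with: name, price, availability, description, and images array.

    HTML:
    {truncated_html}

    Return only valid JSON.
    """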

    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=2048,
        messages=[{"role": "user", "content": prompt}]
    )

    return json.loads(message.content[0].text)

# Usage
product_data = scrape_and_convert("https://example.com/product")
print(json.dumps(product_data, indent=2))

For more complex scenarios involving JavaScript-rendered pages, you can handle AJAX requests using Puppeteer to fetch the complete HTML before passing it to Claude AI.

Handling Multiple Items and Lists

When extracting multiple products or items from a page, Claude AI can return arrays of objects:

html_listing = """
<div class="products">
    <div class="product">
        <h3>Laptop Pro</h3>
        <span class="price">$1299</span>
    </div>
    <div class="product">
        <h3>Wireless Mouse</h3>
        <span class="price">$29.99</span>
    </div>
    <div class="product">
        <h3>USB-C Hub</h3>
        <span class="price">$49.99</span>
    </div>
</div>
"""

prompt = f"""
Extract all products from this HTML.
Return a JSON object with a "products" array containing name and price for each product.

HTML:
{html_listing}

Return only valid JSON.
"""

message = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    messages=[{"role": "user", "content": prompt}]
)

data = json.loads(message.content[0].text)
print(json.dumps(data, indent=2))

Output:

{
  "products": [
    {
      "name": "Laptop Pro",
      "price": 1299
    },
    {
      "name": "Wireless Mouse",
      "price": 29.99
    },
    {
      "name": "USB-C Hub",
      "price": 49.99
    }
  ]
}
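
The prompt above does not state that prices must be numbers, so a model could plausibly return strings such as "$1,299". A defensive post-processing step can normalize the list. This sketch assumes US-style formatting (comma as thousands separator), and `to_number` and `normalize_products` are hypothetical helper names:

```python
import re


def to_number(value):
    """Coerce a price-like value such as "$1,299" to a float, or return None."""
    if isinstance(value, bool):
        return None
    if isinstance(value, (int, float)):
        return float(value)
    if isinstance(value, str):
        # Drop thousands separators, then take the first numeric token
        match = re.search(r"-?\d+(?:\.\d+)?", value.replace(",", ""))
        if match:
            return float(match.group())
    return None


def normalize_products(data):
    """Return the products list with every price coerced to a float (or None)."""
    products = data.get("products", []) if isinstance(data, dict) else []
    return [
        {"name": item.get("name"), "price": to_number(item.get("price"))}
        for item in products
        if isinstance(item, dict)
    ]
```

Entries that are not objects are skipped, and prices that cannot be parsed become `None` so they are easy to spot downstream.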

Error Handling and Best Practices

Robust Error Handling

Always implement proper error handling when converting HTML to JSON:

import anthropic
import json
from anthropic import APIError

def safe_html_to_json(html_content, schema_description):
    try:
        client = anthropic.Anthropic(api_key="your-api-key-here")

        prompt = f"""
        {schema_description}

        HTML:
        {html_content}

        Return only valid JSON. If information is missing, use null.
        """

        message = client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=2048,
            messages=[{"role": "user", "content": prompt}]
        )

        response_text = message.content[0].text.strip()

        # Remove markdown code blocks if present
        # ("```json" is 7 characters long; the closing fence is 3)
        if response_text.startswith("```json"):
            response_text = response_text[7:-3].strip()
        elif response_text.startswith("```"):
            response_text = response_text[3:-3].strip()

        return json.loads(response_text)

    except json.JSONDecodeError as e:
        print(f"Invalid JSON response: {e}")
        return None
    except APIError as e:
        print(f"API error: {e}")
        return None
    except Exception as e:
        print(f"Unexpected error: {e}")
        return None
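
The fence-stripping step in the function above relies on fixed slice offsets, which can cut into the JSON when the fence carries a different language tag or the closing fence is missing. A slightly more tolerant variant, sketched here with a hypothetical helper named `parse_json_response`, uses a regular expression and only unwraps the text when a complete fence is present:

```python
import json
import re

# Matches a reply wrapped in a Markdown code fence with an optional language tag
FENCE_RE = re.compile(r"^```[A-Za-z0-9_-]*\s*\n(.*?)\n?```\s*$", re.DOTALL)


def parse_json_response(text):
    """Parse JSON from a model reply, tolerating an optional Markdown code fence."""
    text = text.strip()
    match = FENCE_RE.match(text)
    if match:
        text = match.group(1).strip()
    return json.loads(text)
```

It still raises `json.JSONDecodeError` on invalid JSON, so it can replace the slicing logic inside `safe_html_to_json` without changing the existing `except` clauses.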

Best Practices

  1. Limit HTML Size: Claude's context window is finite and every input token is billed. Truncate HTML or extract relevant sections before sending:

   # Extract just the main content
   from bs4 import BeautifulSoup

   soup = BeautifulSoup(html_content, 'html.parser')
   # Fall back to the whole document when there is no <main> or <article>
   main_content = soup.find('main') or soup.find('article') or soup
   cleaned_html = str(main_content)[:10000]

  2. Be Specific in Prompts: Clearly define the expected JSON structure and field types.

  3. Use Examples: Provide example output in your prompt for consistent formatting:

   # f-string so that html_content is interpolated; the literal braces
   # of the JSON example are doubled to escape them
   prompt = f"""
   Extract product data and return it like this example:
   {{
     "name": "Product Name",
     "price": 99.99,
     "in_stock": true
   }}

   HTML:
   {html_content}
   """

  4. Validate Output: Always validate that the JSON structure matches your expectations.

  5. Handle Rate Limits: Implement retry logic and respect API rate limits.
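
The retry logic mentioned in the last item can be sketched as a small wrapper with exponential backoff and jitter. `call_with_retries` is a hypothetical helper, and the default delays are illustrative assumptions rather than recommended values:

```python
import random
import time


def call_with_retries(func, max_attempts=4, base_delay=1.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call func(), retrying with exponential backoff and jitter on failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Delay doubles each attempt (base, 2x, 4x, ...) plus up to 25% jitter
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay * 0.25))
```

With the Anthropic Python SDK you might wrap the request as `call_with_retries(lambda: client.messages.create(...), retry_on=(anthropic.RateLimitError,))`, so that only rate-limit errors trigger a retry.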

Cost Considerations

Claude AI pricing is based on tokens processed. For HTML to JSON conversion:

  • Input tokens: HTML content + prompt
  • Output tokens: JSON response

To optimize costs:

  1. Preprocess HTML to remove unnecessary tags (scripts, styles)
  2. Extract only relevant sections before sending to Claude
  3. Use caching for repeated conversions of similar pages
  4. Choose the appropriate model (Claude Haiku for simple extractions, Sonnet for complex ones)
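
The first step, removing scripts and styles, can be sketched with only the Python standard library. This is one possible approach, not the only one (BeautifulSoup would work as well), and `strip_noise` is a hypothetical helper name. It drops `<script>`, `<style>`, and `<noscript>` elements and HTML comments while re-emitting everything else:

```python
from html.parser import HTMLParser


class _Cleaner(HTMLParser):
    """Re-emit HTML while skipping script/style/noscript content and comments."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif not self._skip_depth:
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        if not self._skip_depth and tag not in self.SKIP:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif not self._skip_depth:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)

    def handle_entityref(self, name):
        if not self._skip_depth:
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if not self._skip_depth:
            self.parts.append(f"&#{name};")

    # handle_comment is intentionally not overridden, so comments are dropped


def strip_noise(html):
    """Return html without script/style/noscript elements and comments."""
    cleaner = _Cleaner()
    cleaner.feed(html)
    cleaner.close()
    return "".join(cleaner.parts)
```

Running the page through `strip_noise` before building the prompt reduces input tokens without touching the visible content that the extraction depends on.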

Alternative: Using WebScraping.AI with Claude

For production web scraping with AI-powered extraction, consider using specialized APIs that combine headless browsing with LLM extraction. When handling browser sessions in Puppeteer, you can capture the rendered HTML and then pass it to Claude for intelligent data extraction.

Conclusion

Converting HTML to JSON using Claude AI offers a flexible, code-light approach to web scraping. It's particularly valuable for:

  • Rapid prototyping of scraping projects
  • Websites with complex or changing structures
  • Extracting data that requires semantic understanding
  • Scenarios where traditional parsing is too brittle

While it may have higher per-request costs than traditional parsing, the reduced development time and increased flexibility often make it a worthwhile trade-off, especially for complex extraction tasks.

By combining Claude AI's natural language understanding with proper HTML preprocessing and error handling, you can build robust web scraping solutions that adapt to website changes without requiring constant maintenance of selectors and parsing logic.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl -G "https://api.webscraping.ai/ai/question" --data-urlencode "url=https://example.com" --data-urlencode "question=What is the main topic?" --data-urlencode "api_key=YOUR_API_KEY"

Extract structured data:

curl -G "https://api.webscraping.ai/ai/fields" --data-urlencode "url=https://example.com" --data-urlencode "fields[title]=Page title" --data-urlencode "fields[price]=Product price" --data-urlencode "api_key=YOUR_API_KEY"
