How do I convert unstructured web page content into JSON using an LLM?
Converting unstructured web page content into structured JSON is one of the most powerful applications of Large Language Models (LLMs) in web scraping. Instead of writing complex parsing logic with XPath or CSS selectors, you can use LLMs to intelligently understand and extract data from any HTML content, transforming it into clean, structured JSON format.
This guide will show you how to leverage LLMs like GPT-4, Claude, and other models to automate the conversion of unstructured web content into structured data.
Why Use LLMs for JSON Conversion?
Traditional web scraping requires you to:
- Manually inspect HTML structure
- Write brittle CSS selectors or XPath expressions
- Update code when website layouts change
- Handle variations in data formats manually
LLMs eliminate these pain points by:
- Understanding context: Semantically interpreting content regardless of HTML structure
- Adapting to changes: Working even when page layouts change
- Handling variations: Processing different formats and edge cases automatically
- Reducing maintenance: Requiring minimal code updates over time
Basic Approach: Fetching and Converting
The fundamental workflow involves three steps:
- Fetch the HTML content from the target webpage
- Send it to an LLM with instructions to extract specific data
- Receive structured JSON output
Example Using Python with OpenAI GPT-4
import json
import requests
from openai import OpenAI

def convert_webpage_to_json(url, fields):
    # Step 1: Fetch the HTML content
    response = requests.get(url)
    html_content = response.text

    # Step 2: Initialize OpenAI client
    client = OpenAI(api_key='your-api-key')

    # Step 3: Create prompt for JSON conversion
    prompt = f"""Extract the following information from this HTML and return it as valid JSON:

Fields to extract: {', '.join(fields)}

HTML content:
{html_content}

Return only valid JSON with no additional text."""

    # Step 4: Call the LLM
    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {
                "role": "system",
                "content": "You are a data extraction assistant that converts HTML to JSON."
            },
            {
                "role": "user",
                "content": prompt
            }
        ],
        temperature=0  # Lower temperature for more consistent output
    )

    # Step 5: Parse and return JSON
    result = json.loads(completion.choices[0].message.content)
    return result

# Usage example
data = convert_webpage_to_json(
    'https://example.com/product/laptop',
    ['product_name', 'price', 'rating', 'description', 'availability']
)

print(json.dumps(data, indent=2))
Output:
{
  "product_name": "Dell XPS 13 Laptop",
  "price": 999.99,
  "rating": 4.5,
  "description": "Ultra-thin 13-inch laptop with Intel Core i7 processor",
  "availability": "In Stock"
}
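Even with an explicit "return only valid JSON" instruction, models occasionally wrap the answer in a Markdown code fence, which makes json.loads fail. A small, hypothetical helper that strips fences before parsing is a cheap safeguard you can drop into any of the examples in this guide:

import json
import re

def parse_llm_json(text):
    """Parse JSON from an LLM reply, tolerating ```json ... ``` fences."""
    match = re.search(r"```(?:json)?\s*(.*?)\s*```", text, re.DOTALL)
    if match:
        text = match.group(1)
    return json.loads(text.strip())

# Usage: result = parse_llm_json(completion.choices[0].message.content)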
Example Using JavaScript with Claude API
const Anthropic = require('@anthropic-ai/sdk');
const axios = require('axios');

async function convertWebpageToJSON(url, fields) {
  // Step 1: Fetch HTML content
  const response = await axios.get(url);
  const htmlContent = response.data;

  // Step 2: Initialize Claude client
  const anthropic = new Anthropic({
    apiKey: process.env.ANTHROPIC_API_KEY
  });

  // Step 3: Create extraction prompt
  const prompt = `Extract the following fields from this HTML and return as valid JSON:

Fields: ${fields.join(', ')}

HTML:
${htmlContent}

Return only the JSON object, no other text.`;

  // Step 4: Call Claude API
  const message = await anthropic.messages.create({
    model: 'claude-3-5-sonnet-20241022',
    max_tokens: 2048,
    temperature: 0,
    messages: [{
      role: 'user',
      content: prompt
    }]
  });

  // Step 5: Parse JSON response
  const jsonText = message.content[0].text;
  const data = JSON.parse(jsonText);
  return data;
}

// Usage example
convertWebpageToJSON(
  'https://example.com/article',
  ['title', 'author', 'publish_date', 'content', 'tags']
)
  .then(data => console.log(JSON.stringify(data, null, 2)))
  .catch(error => console.error('Error:', error));
Advanced Techniques for Better Results
1. Using Structured Output (JSON Schema)
OpenAI's Structured Outputs feature lets you attach a JSON Schema to the request so the model is constrained to valid, type-safe output. Note that strict mode requires every property to appear in "required"; optional fields can be modeled with a nullable type instead:

import json
from openai import OpenAI

client = OpenAI(api_key='your-api-key')

# Define the exact JSON structure you want
# (html_content comes from a fetch step like the ones shown earlier)
response = client.chat.completions.create(
    model="gpt-4o",  # Structured Outputs requires gpt-4o-2024-08-06 or newer
    messages=[
        {
            "role": "system",
            "content": "Extract product information from HTML."
        },
        {
            "role": "user",
            "content": html_content
        }
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "product_data",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "name": {
                        "type": "string",
                        "description": "Product name"
                    },
                    "price": {
                        "type": "number",
                        "description": "Price in USD"
                    },
                    "currency": {
                        "type": "string",
                        "enum": ["USD", "EUR", "GBP"]
                    },
                    "in_stock": {
                        "type": "boolean"
                    },
                    "features": {
                        "type": "array",
                        "items": {"type": "string"}
                    }
                },
                "required": ["name", "price", "currency", "in_stock", "features"],
                "additionalProperties": False
            }
        }
    }
)

product_data = json.loads(response.choices[0].message.content)
This approach guarantees:
- Valid JSON output every time
- Correct data types
- Required fields are always present
- No unexpected fields
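Anthropic's Messages API does not take a response_format parameter, but you can get a comparable guarantee by describing the desired output as a tool whose input_schema is your JSON Schema and forcing the model to call it. A minimal sketch, where the tool name and fields are illustrative and html_content is assumed to have been fetched as in the earlier examples:

import anthropic

client = anthropic.Anthropic(api_key='your-api-key')

product_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "price": {"type": "number"},
        "in_stock": {"type": "boolean"}
    },
    "required": ["name", "price", "in_stock"]
}

message = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    tools=[{
        "name": "record_product",  # illustrative tool name
        "description": "Record the product data extracted from the HTML.",
        "input_schema": product_schema
    }],
    tool_choice={"type": "tool", "name": "record_product"},  # force the tool call
    messages=[{
        "role": "user",
        "content": f"Extract the product data from this HTML:\n{html_content}"
    }]
)

# The structured result arrives as the tool call's input
product_data = next(block.input for block in message.content if block.type == "tool_use")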
2. Preprocessing HTML for Better Results
Clean and reduce HTML before sending to the LLM to save tokens and improve accuracy:
from bs4 import BeautifulSoup, Comment
import requests

def clean_html_for_llm(html):
    """Remove unnecessary elements and extract main content."""
    soup = BeautifulSoup(html, 'html.parser')

    # Remove scripts, styles, navigation, ads
    for element in soup(['script', 'style', 'nav', 'header',
                         'footer', 'aside', 'iframe', 'noscript']):
        element.decompose()

    # Remove comments
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        comment.extract()

    # Focus on main content
    main_content = (soup.find('main') or
                    soup.find('article') or
                    soup.find(class_='content') or
                    soup.body)

    return str(main_content) if main_content else str(soup)

def scrape_with_preprocessing(url):
    # Fetch HTML
    response = requests.get(url)

    # Clean HTML
    cleaned_html = clean_html_for_llm(response.text)

    # Now send cleaned HTML to LLM
    # ... (use previous examples)
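To get a feel for how much cleaning buys you, it can help to compare payload sizes before and after preprocessing. A quick sketch using the function above; the URL is a placeholder:

import requests

url = 'https://example.com/product/laptop'
raw_html = requests.get(url).text
cleaned_html = clean_html_for_llm(raw_html)

# Smaller payloads mean fewer tokens and usually better extraction accuracy
reduction = 100 * (1 - len(cleaned_html) / len(raw_html))
print(f"Raw: {len(raw_html):,} chars, cleaned: {len(cleaned_html):,} chars "
      f"({reduction:.0f}% smaller)")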
3. Batch Processing Multiple Pages
Process multiple pages efficiently by batching requests:
const Anthropic = require('@anthropic-ai/sdk');
const axios = require('axios');

async function batchConvertToJSON(urls, fields) {
  const anthropic = new Anthropic({
    apiKey: process.env.ANTHROPIC_API_KEY
  });

  const results = [];

  // Process in parallel with concurrency limit
  const concurrency = 5;

  for (let i = 0; i < urls.length; i += concurrency) {
    const batch = urls.slice(i, i + concurrency);

    const promises = batch.map(async (url) => {
      try {
        // Fetch HTML
        const response = await axios.get(url);

        // Convert to JSON
        const message = await anthropic.messages.create({
          model: 'claude-3-5-sonnet-20241022',
          max_tokens: 1024,
          messages: [{
            role: 'user',
            content: `Extract ${fields.join(', ')} from:\n${response.data}\nReturn only valid JSON, no other text.`
          }]
        });

        return {
          url: url,
          success: true,
          data: JSON.parse(message.content[0].text)
        };
      } catch (error) {
        return {
          url: url,
          success: false,
          error: error.message
        };
      }
    });

    const batchResults = await Promise.all(promises);
    results.push(...batchResults);

    // Rate limiting delay
    if (i + concurrency < urls.length) {
      await new Promise(resolve => setTimeout(resolve, 1000));
    }
  }

  return results;
}

// Usage
const urls = [
  'https://example.com/product/1',
  'https://example.com/product/2',
  'https://example.com/product/3'
];

batchConvertToJSON(urls, ['name', 'price', 'rating'])
  .then(results => console.log(results));
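If your pipeline is in Python, the same batching pattern with a concurrency cap can be expressed with asyncio. A minimal sketch, assuming the anthropic package's AsyncAnthropic client and the httpx library are available:

import asyncio
import json
import httpx
from anthropic import AsyncAnthropic

async def batch_convert_to_json(urls, fields, concurrency=5):
    """Fetch several pages and convert each to JSON with a concurrency cap."""
    client = AsyncAnthropic()  # reads ANTHROPIC_API_KEY from the environment
    semaphore = asyncio.Semaphore(concurrency)

    async def convert(url):
        async with semaphore:
            try:
                async with httpx.AsyncClient() as http:
                    response = await http.get(url, timeout=10)
                message = await client.messages.create(
                    model="claude-3-5-sonnet-20241022",
                    max_tokens=1024,
                    messages=[{
                        "role": "user",
                        "content": f"Extract {', '.join(fields)} from:\n{response.text}\n"
                                   "Return only valid JSON."
                    }]
                )
                return {"url": url, "success": True,
                        "data": json.loads(message.content[0].text)}
            except Exception as error:
                return {"url": url, "success": False, "error": str(error)}

    return await asyncio.gather(*(convert(url) for url in urls))

# Usage
# results = asyncio.run(batch_convert_to_json(urls, ["name", "price", "rating"]))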
4. Handling Dynamic Content with a Headless Browser
When scraping JavaScript-rendered pages, combine browser automation with LLM conversion. The same idea used when handling AJAX requests using Puppeteer applies here: wait for the dynamic content to load before extracting. The example below uses Playwright for Python:
from playwright.sync_api import sync_playwright
import anthropic
import json

def scrape_dynamic_page_to_json(url, fields):
    with sync_playwright() as p:
        # Launch browser
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()

        # Navigate and wait for content
        page.goto(url)
        page.wait_for_load_state('networkidle')

        # Get fully rendered HTML
        html_content = page.content()
        browser.close()

    # Convert to JSON using Claude
    client = anthropic.Anthropic(api_key='your-api-key')

    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=2048,
        messages=[{
            "role": "user",
            "content": f"""Extract these fields from the HTML: {', '.join(fields)}

HTML:
{html_content}

Return as valid JSON only."""
        }]
    )

    # Parse JSON
    return json.loads(message.content[0].text)

# Usage for SPA or AJAX-heavy sites
data = scrape_dynamic_page_to_json(
    'https://example.com/spa-page',
    ['articles', 'total_count', 'categories']
)

print(json.dumps(data, indent=2))
5. Robust Error Handling
Always implement retry logic and validation:
import requests
from openai import OpenAI
import json
import time
from jsonschema import validate, ValidationError

def robust_html_to_json(url, fields, schema=None, max_retries=3):
    """Convert HTML to JSON with retry logic and validation."""
    client = OpenAI(api_key='your-api-key')

    # Fetch HTML
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    html_content = response.text

    for attempt in range(max_retries):
        try:
            # Call LLM
            completion = client.chat.completions.create(
                model="gpt-4",
                messages=[
                    {
                        "role": "system",
                        "content": "Extract data from HTML and return valid JSON only."
                    },
                    {
                        "role": "user",
                        "content": f"Extract {', '.join(fields)} from:\n{html_content}"
                    }
                ],
                temperature=0
            )

            # Parse JSON
            result = json.loads(completion.choices[0].message.content)

            # Validate against schema if provided
            if schema:
                validate(instance=result, schema=schema)

            # Check required fields
            missing_fields = [f for f in fields if f not in result]
            if missing_fields:
                raise ValueError(f"Missing fields: {missing_fields}")

            return {
                'success': True,
                'data': result,
                'attempt': attempt + 1
            }

        except (json.JSONDecodeError, ValidationError, ValueError) as e:
            print(f"Attempt {attempt + 1} failed: {str(e)}")

            if attempt == max_retries - 1:
                return {
                    'success': False,
                    'error': str(e),
                    'attempt': attempt + 1
                }

            # Exponential backoff
            time.sleep(2 ** attempt)

    return {'success': False, 'error': 'Max retries exceeded'}

# Usage with validation
schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "price": {"type": "number"},
        "rating": {"type": "number", "minimum": 0, "maximum": 5}
    },
    "required": ["title", "price"]
}

result = robust_html_to_json(
    'https://example.com/product',
    ['title', 'price', 'rating'],
    schema=schema
)

if result['success']:
    print("Extracted data:", result['data'])
else:
    print("Extraction failed:", result['error'])
Converting Complex Nested Structures
LLMs excel at handling deeply nested HTML and extracting hierarchical JSON:
const Anthropic = require('@anthropic-ai/sdk');
const axios = require('axios');

async function extractNestedData(url) {
  const response = await axios.get(url);

  const anthropic = new Anthropic({
    apiKey: process.env.ANTHROPIC_API_KEY
  });

  const message = await anthropic.messages.create({
    model: 'claude-3-5-sonnet-20241022',
    max_tokens: 4096,
    messages: [{
      role: 'user',
      content: `Extract a nested JSON structure with this format:
{
  "page_title": "string",
  "categories": [
    {
      "name": "string",
      "products": [
        {
          "name": "string",
          "price": number,
          "specs": {
            "color": "string",
            "size": "string",
            "weight": "string"
          },
          "reviews": [
            {
              "author": "string",
              "rating": number,
              "comment": "string"
            }
          ]
        }
      ]
    }
  ]
}

HTML:
${response.data}

Return only the JSON object, no other text.`
    }]
  });

  return JSON.parse(message.content[0].text);
}

// Usage
extractNestedData('https://example.com/catalog')
  .then(data => {
    console.log('Page Title:', data.page_title);
    console.log('Categories:', data.categories.length);
    data.categories.forEach(cat => {
      console.log(`  ${cat.name}: ${cat.products.length} products`);
    });
  });
Using WebScraping.AI for LLM-Powered JSON Conversion
WebScraping.AI offers built-in LLM-powered extraction and handles browser automation and proxy rotation for you:
import requests

api_key = 'your-webscraping-ai-api-key'

# Field-based extraction (automatically returns JSON)
response = requests.get(
    'https://api.webscraping.ai/fields',
    params={
        'api_key': api_key,
        'url': 'https://example.com/product',
        'fields': 'name,price,description,rating,availability,features'
    }
)

# Already structured as JSON
product_data = response.json()
print(product_data)

The same endpoint works from JavaScript:

const axios = require('axios');

async function scrapeWithAI(url, fields) {
  const response = await axios.get('https://api.webscraping.ai/fields', {
    params: {
      api_key: 'your-api-key',
      url: url,
      fields: fields.join(',')
    }
  });

  return response.data;
}

// Usage
scrapeWithAI(
  'https://example.com/article',
  ['headline', 'author', 'publish_date', 'body', 'tags']
)
  .then(data => console.log(data));
Best Practices for Production Use
1. Optimize Token Usage
from bs4 import BeautifulSoup

def optimize_html_for_tokens(html, max_length=8000):
    """Reduce HTML to fit within token limits."""
    soup = BeautifulSoup(html, 'html.parser')

    # Remove unnecessary elements
    for tag in soup(['script', 'style', 'svg', 'path']):
        tag.decompose()

    # Remove attributes that don't help extraction
    for tag in soup.find_all(True):
        tag.attrs = {k: v for k, v in tag.attrs.items()
                     if k in ['class', 'id', 'href', 'src']}

    # Truncate if still too long
    text = str(soup)
    if len(text) > max_length:
        text = text[:max_length]

    return text
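Character counts are only a rough proxy for tokens. If you want to trim against the model's actual tokenizer, the tiktoken library (assuming it is installed) can count tokens before you send the request:

import tiktoken

def count_tokens(text, model="gpt-4"):
    """Return the number of tokens the given model would see for this text."""
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

# html fetched earlier; optimize_html_for_tokens is defined above
cleaned = optimize_html_for_tokens(html)
print(f"Prompt payload is ~{count_tokens(cleaned)} tokens")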
2. Cache Results
from functools import lru_cache

@lru_cache(maxsize=1000)
def cached_convert_to_json(url, fields_str):
    """Cache LLM responses to avoid duplicate API calls.

    lru_cache keys on the function arguments, so each unique
    url/fields combination only triggers one LLM call.
    """
    return convert_webpage_to_json(url, fields_str.split(','))

# Usage
result = cached_convert_to_json(url, ','.join(fields))
3. Monitor Costs and Performance
import time
import logging
import json
from openai import OpenAI

class LLMScraper:
    def __init__(self, api_key):
        self.client = OpenAI(api_key=api_key)
        self.total_tokens = 0
        self.total_requests = 0
        self.total_cost = 0

    def convert_to_json(self, html, fields):
        start_time = time.time()

        response = self.client.chat.completions.create(
            model="gpt-4",
            messages=[...],  # build the extraction prompt as in the earlier examples
        )

        # Track metrics
        tokens_used = response.usage.total_tokens
        self.total_tokens += tokens_used
        self.total_requests += 1

        # Calculate cost (example rates)
        cost = (tokens_used / 1000) * 0.03  # $0.03 per 1K tokens
        self.total_cost += cost

        duration = time.time() - start_time
        logging.info(f"Request completed in {duration:.2f}s, "
                     f"Tokens: {tokens_used}, Cost: ${cost:.4f}")

        return json.loads(response.choices[0].message.content)

    def get_stats(self):
        return {
            'total_requests': self.total_requests,
            'total_tokens': self.total_tokens,
            'total_cost': self.total_cost,
            'avg_tokens_per_request': self.total_tokens / max(self.total_requests, 1)
        }
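A sketch of how the tracking class above might be wired into a small crawl, once its prompt construction is filled in; the URLs and field names are placeholders, and fetching/cleaning reuse helpers from earlier sections:

import requests

scraper = LLMScraper(api_key='your-api-key')

for url in ['https://example.com/product/1', 'https://example.com/product/2']:
    # clean_html_for_llm is the preprocessing helper defined earlier
    html = clean_html_for_llm(requests.get(url).text)
    data = scraper.convert_to_json(html, ['name', 'price'])
    print(data)

# Review spend and usage after the crawl
print(scraper.get_stats())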
Conclusion
Converting unstructured web page content into JSON using LLMs transforms web scraping from a brittle, maintenance-heavy process into an intelligent, adaptive workflow. By combining traditional web scraping tools for fetching and navigating pages with LLM-powered extraction, you can build robust data pipelines that adapt to changing website structures and handle complex, nested data with ease.
Key takeaways:
- Start simple: Basic LLM API calls can handle most conversion tasks
- Use structured output: JSON schemas guarantee valid, type-safe results
- Preprocess HTML: Clean and optimize content to reduce tokens and costs
- Implement error handling: Retry logic and validation prevent failures
- Monitor performance: Track token usage and costs in production
- Consider managed services: APIs like WebScraping.AI handle infrastructure complexity
As LLM technology continues to improve with faster inference, lower costs, and larger context windows, converting unstructured content to JSON will become even more powerful and accessible for developers building web scraping applications.