How do I Extract Structured Data from HTML Using LLMs?

Large Language Models (LLMs) have revolutionized web scraping by offering a more flexible and intelligent approach to extracting structured data from HTML. Unlike traditional parsing methods that rely on rigid selectors, LLMs can understand context, handle layout variations, and extract data even when the HTML structure changes.

Understanding LLM-Based Data Extraction

Traditional web scraping uses CSS selectors or XPath to target specific HTML elements. While effective, this approach breaks when websites update their layouts. LLMs can analyze HTML content contextually and extract data based on semantic understanding rather than element paths.
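
For contrast, here's a minimal selector-based extraction (the markup and class names are hypothetical) that silently breaks as soon as a class is renamed or the layout shifts:

from bs4 import BeautifulSoup

# Hypothetical markup: the selectors below depend on these exact class names
html = '<div class="product-title">Gaming Laptop</div><span class="price">$1,299</span>'
soup = BeautifulSoup(html, 'html.parser')

name = soup.select_one('.product-title').text   # breaks if the class is renamed
price = soup.select_one('.price').text          # breaks if the price markup changes
print(name, price)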

Key Advantages

  • Layout resilience: Works even when HTML structure changes
  • Semantic understanding: Identifies data based on meaning, not just position
  • Reduced maintenance: Less brittle than selector-based scraping
  • Complex data handling: Better at extracting nested or related data points

Basic LLM-Based Extraction with OpenAI

Here's a fundamental example using OpenAI's GPT models to extract product information from HTML:

import requests
from bs4 import BeautifulSoup
from openai import OpenAI

client = OpenAI(api_key="your-api-key")

# Fetch HTML content
response = requests.get("https://example-store.com/product/laptop")
html_content = response.text

# Clean and prepare HTML (optional but recommended)
soup = BeautifulSoup(html_content, 'html.parser')
main_content = soup.find('main') or soup.body
cleaned_html = str(main_content)[:4000]  # Truncate to stay within token limits

# Create extraction prompt
prompt = f"""
Extract the following product information from this HTML:
- Product name
- Price
- Description
- Availability status
- Specifications

HTML:
{cleaned_html}

Return the data as JSON.
"""

# Call the OpenAI API
completion = client.chat.completions.create(
    model="gpt-4-turbo-preview",
    messages=[
        {"role": "system", "content": "You are a data extraction assistant. Always return valid JSON."},
        {"role": "user", "content": prompt}
    ],
    response_format={"type": "json_object"}
)

extracted_data = completion.choices[0].message.content
print(extracted_data)

Structured Output with Function Calling

Modern LLM APIs support function calling (also known as tool calling), which constrains the model's output to a defined schema:

import json

from openai import OpenAI

client = OpenAI(api_key="your-api-key")

# Define the schema for extracted data as a tool
tools = [
    {
        "type": "function",
        "function": {
            "name": "extract_product_data",
            "description": "Extract product information from HTML",
            "parameters": {
                "type": "object",
                "properties": {
                    "name": {"type": "string", "description": "Product name"},
                    "price": {"type": "number", "description": "Product price"},
                    "currency": {"type": "string", "description": "Currency code"},
                    "description": {"type": "string", "description": "Product description"},
                    "in_stock": {"type": "boolean", "description": "Availability status"},
                    "specifications": {
                        "type": "object",
                        "description": "Product specifications",
                        "additionalProperties": {"type": "string"}
                    }
                },
                "required": ["name", "price", "currency"]
            }
        }
    }
]

response = client.chat.completions.create(
    model="gpt-4-turbo-preview",
    messages=[
        {"role": "user", "content": f"Extract product data from this HTML: {html_content[:4000]}"}
    ],
    tools=tools,
    tool_choice={"type": "function", "function": {"name": "extract_product_data"}}
)

# Parse the arguments of the forced tool call
tool_call = response.choices[0].message.tool_calls[0]
function_args = json.loads(tool_call.function.arguments)
print(json.dumps(function_args, indent=2))

JavaScript Implementation with Claude API

Here's how to extract structured data using Anthropic's Claude API in JavaScript:

const Anthropic = require('@anthropic-ai/sdk');
const axios = require('axios');
const { JSDOM } = require('jsdom');

const anthropic = new Anthropic({
  apiKey: process.env.ANTHROPIC_API_KEY
});

async function extractDataWithClaude(url) {
  // Fetch and clean HTML
  const response = await axios.get(url);
  const dom = new JSDOM(response.data);
  const mainContent = dom.window.document.querySelector('main')?.innerHTML
    || dom.window.document.body.innerHTML;

  // Truncate to avoid token limits
  const truncatedHtml = mainContent.substring(0, 10000);

  const message = await anthropic.messages.create({
    model: 'claude-3-5-sonnet-20241022',
    max_tokens: 1024,
    messages: [
      {
        role: 'user',
        content: `Extract the product information from this HTML and return it as valid JSON with fields: name, price, currency, description, inStock, and specifications (as an object).

HTML:
${truncatedHtml}`
      }
    ]
  });

  // Claude may wrap the JSON in a markdown code fence; strip it before parsing
  const extractedText = message.content[0].text
    .replace(/^```(?:json)?\s*/, '')
    .replace(/```\s*$/, '');
  return JSON.parse(extractedText);
}

// Usage
extractDataWithClaude('https://example-store.com/product/laptop')
  .then(data => console.log(data))
  .catch(error => console.error(error));

Using LangChain for Advanced Extraction

LangChain provides a powerful framework for building LLM-powered web scraping workflows:

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import PydanticOutputParser
from pydantic import BaseModel, Field
from typing import List, Dict, Optional

# Define data structure
class Product(BaseModel):
    name: str = Field(description="Product name")
    price: float = Field(description="Product price")
    currency: str = Field(description="Currency code")
    description: str = Field(description="Product description")
    in_stock: bool = Field(description="Whether product is in stock")
    specifications: Dict[str, str] = Field(description="Product specifications")

    model_config = {
        "json_schema_extra": {
            "example": {
                "name": "MacBook Pro",
                "price": 1999.99,
                "currency": "USD",
                "description": "Powerful laptop",
                "in_stock": True,
                "specifications": {"RAM": "16GB", "Storage": "512GB"}
            }
        }
    }

# Initialize parser and LLM
parser = PydanticOutputParser(pydantic_object=Product)
llm = ChatOpenAI(model="gpt-4-turbo-preview", temperature=0)

# Create prompt template
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are an expert at extracting structured product data from HTML."),
    ("user", "{format_instructions}\n\nExtract product data from this HTML:\n{html_content}")
])

# Create chain
chain = prompt | llm | parser

# Execute extraction
result = chain.invoke({
    "format_instructions": parser.get_format_instructions(),
    "html_content": html_content[:4000]
})

print(result.model_dump())

Handling Multiple Items with LLMs

When extracting lists of items (like search results or product listings), structure your schema accordingly:

from typing import List, Optional
from pydantic import BaseModel

class ProductListing(BaseModel):
    name: str
    price: float
    url: str
    rating: Optional[float] = None

class SearchResults(BaseModel):
    products: List[ProductListing]
    total_count: int
    page: int

parser = PydanticOutputParser(pydantic_object=SearchResults)

# search_results_html: the listings page HTML, fetched as in the earlier examples
prompt = f"""
Extract all product listings from this search results page.

{parser.get_format_instructions()}

HTML:
{search_results_html}
"""

# Process with your preferred LLM
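
As a sketch of that elided step, you can reuse the llm and parser objects wired up in the LangChain section above:

llm_response = llm.invoke(prompt)
results = parser.parse(llm_response.content)

for product in results.products:
    print(f"{product.name}: {product.price}")
print(f"Total results: {results.total_count} (page {results.page})")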

Optimizing HTML for LLM Processing

To reduce token usage and improve accuracy:

1. Remove Unnecessary Elements

from bs4 import BeautifulSoup

def clean_html(html_content):
    soup = BeautifulSoup(html_content, 'html.parser')

    # Remove scripts, styles, and page chrome (nav, header, footer)
    for element in soup(['script', 'style', 'nav', 'footer', 'header']):
        element.decompose()

    # Remove attributes except data-* and class
    for tag in soup.find_all(True):
        attrs = dict(tag.attrs)
        for attr in attrs:
            if not attr.startswith('data-') and attr != 'class':
                del tag.attrs[attr]

    return str(soup)

2. Convert to Markdown

Many LLMs work better with markdown than raw HTML:

from markdownify import markdownify as md

cleaned_html = clean_html(html_content)
markdown_content = md(cleaned_html, heading_style="ATX")

# Use markdown in your prompt instead of HTML

Cost Optimization Strategies

LLM-based extraction can be expensive. Here are optimization techniques:

1. Cache Results

import hashlib

# Simple in-memory cache keyed by a hash of the HTML
_extraction_cache = {}

def extract_with_cache(html_content):
    key = hashlib.md5(html_content.encode()).hexdigest()
    if key not in _extraction_cache:
        # Only call the LLM for HTML we haven't seen before
        _extraction_cache[key] = extract_data(html_content)  # your LLM extraction function
    return _extraction_cache[key]

# Usage
result = extract_with_cache(html_content)

2. Use Smaller Models for Simple Tasks

# Use GPT-3.5 for straightforward extractions
# (simple_structure is a flag you set per page type, e.g. flat pages with only a few fields)
model = "gpt-3.5-turbo" if simple_structure else "gpt-4-turbo-preview"

3. Batch Processing

Process multiple pages in a single API call when possible:

prompt = f"""
Extract data from these 5 product pages. Return an array of JSON objects.

Page 1:
{html_1}

Page 2:
{html_2}

...
"""

Error Handling and Validation

Always validate LLM output, since models can hallucinate values or return malformed data:

from pydantic import ValidationError

try:
    result = parser.parse(llm_response)

    # Additional business logic validation
    if result.price < 0:
        raise ValueError("Price cannot be negative")

    if result.currency not in ['USD', 'EUR', 'GBP']:
        raise ValueError(f"Invalid currency: {result.currency}")

except ValidationError as e:
    print(f"Validation error: {e}")
    # Retry or use fallback method
except ValueError as e:
    print(f"Business logic error: {e}")
    # Handle invalid data
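
As a sketch of the retry path, wrap the call and validation in a small loop (run_llm_extraction here is a hypothetical wrapper around whichever LLM call you use):

def extract_with_retries(html_content, max_attempts=3):
    for attempt in range(max_attempts):
        llm_response = run_llm_extraction(html_content)  # hypothetical wrapper around your LLM call
        try:
            return parser.parse(llm_response)
        except ValidationError:
            # Malformed or incomplete output: try again, optionally tightening the prompt
            continue
    raise RuntimeError(f"Extraction failed validation after {max_attempts} attempts")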

Hybrid Approach: Combining Selectors and LLMs

For production systems, combine traditional selectors with LLM extraction:

def hybrid_extraction(html_content):
    soup = BeautifulSoup(html_content, 'html.parser')

    # Try traditional extraction first (faster and cheaper)
    try:
        name = soup.select_one('.product-name').text.strip()
        price = float(soup.select_one('.price').text.replace('$', ''))

        return {
            'name': name,
            'price': price,
            'method': 'selector'
        }
    except (AttributeError, ValueError):
        # Fall back to LLM extraction
        return extract_with_llm(html_content)

Real-World Example: E-commerce Scraper

Here's a complete example scraping product data:

import json

import requests
from bs4 import BeautifulSoup
from openai import OpenAI

class LLMProductScraper:
    def __init__(self, api_key):
        self.client = OpenAI(api_key=api_key)

    def fetch_page(self, url):
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        }
        response = requests.get(url, headers=headers)
        return response.text

    def clean_html(self, html):
        soup = BeautifulSoup(html, 'html.parser')
        main = soup.find('main') or soup.find(id='content') or soup.body

        for tag in main(['script', 'style', 'nav', 'footer']):
            tag.decompose()

        return str(main)[:8000]  # Limit tokens

    def extract_data(self, html):
        cleaned = self.clean_html(html)

        response = self.client.chat.completions.create(
            model="gpt-4-turbo-preview",
            messages=[
                {
                    "role": "system",
                    "content": "Extract product data and return valid JSON with fields: name, price, currency, description, inStock, images (array), specifications (object)."
                },
                {"role": "user", "content": f"HTML:\n{cleaned}"}
            ],
            response_format={"type": "json_object"}
        )

        return json.loads(response.choices[0].message.content)

    def scrape(self, url):
        html = self.fetch_page(url)
        return self.extract_data(html)

# Usage
scraper = LLMProductScraper(api_key="your-api-key")
product_data = scraper.scrape("https://example.com/product/123")
print(json.dumps(product_data, indent=2))

Choosing the Right LLM for Data Extraction

Different LLMs have different strengths for structured data extraction tasks (a small model-selection sketch follows the list):

  • GPT-4: Best accuracy, handles complex nested data
  • GPT-3.5-Turbo: Good balance of speed and cost
  • Claude 3: Excellent at following instructions, large context window
  • Llama 2: Cost-effective for self-hosting
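
As a rough illustration of that tradeoff, a scraper can pick a model tier based on how demanding the target schema is (the thresholds below are arbitrary placeholders):

def choose_model(field_count: int, has_nested_data: bool) -> str:
    # Complex, nested schemas benefit from the stronger model; flat ones can use the cheaper tier
    if has_nested_data or field_count > 10:
        return "gpt-4-turbo-preview"
    return "gpt-3.5-turbo"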

Conclusion

LLM-based HTML extraction offers a more resilient and intelligent approach to web scraping. While it comes with higher costs and latency compared to traditional methods, the reduced maintenance burden and improved reliability often justify the investment. For production systems, consider a hybrid approach that combines the speed of traditional selectors with the flexibility of LLMs as a fallback.

Start with small-scale experiments to understand token usage and costs, then scale up once you've optimized your prompts and data structures. With proper implementation, LLM-based extraction can handle the most challenging web scraping scenarios with minimal code maintenance.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
