How do I get structured output from an LLM for web scraping?

Getting structured output from Large Language Models (LLMs) for web scraping involves defining a schema, prompting the LLM to extract data in a specific format, and validating the results. Modern LLM APIs like OpenAI's GPT, Anthropic's Claude, and Google's Gemini provide built-in features for structured data extraction, making it easier to parse web content into consistent, typed data structures.

Why Use LLMs for Structured Data Extraction?

Traditional web scraping relies on CSS selectors or XPath to extract data from HTML. While effective for static pages, these methods struggle with:

  • Dynamic or inconsistent HTML structures - Pages that change layouts frequently
  • Unstructured text content - Data embedded in paragraphs or lists without clear patterns
  • Complex data relationships - Information spread across multiple elements
  • Natural language processing - Extracting meaning from text, not just raw values

LLMs can understand context, interpret natural language, and extract structured data from messy HTML or text without brittle selectors.
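
For contrast, here is a minimal sketch of the selector-based approach breaking. The HTML and class names are made up, but they show how a hard-coded selector silently returns nothing once the markup changes, while an LLM given the same snippet can still find the price from context:

from bs4 import BeautifulSoup

# The site renamed its price class from "price" to "price-v2"
html = '<div class="product"><span class="price-v2">$99.99</span></div>'
soup = BeautifulSoup(html, "html.parser")

# A selector written against the old markup no longer matches anything
price_tag = soup.select_one(".price")
print(price_tag.text if price_tag else "price not found")  # -> price not found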

Methods for Getting Structured Output

1. JSON Mode and Structured Output APIs

Most modern LLM providers offer native support for structured outputs through JSON mode or schema-based extraction.

OpenAI GPT with JSON Mode

OpenAI's API provides a response_format parameter to enforce JSON output:

import openai
import json

# Initialize OpenAI client
client = openai.OpenAI(api_key="your-api-key")

# HTML content from web scraping
html_content = """
<div class="product">
  <h1>Wireless Headphones</h1>
  <span class="price">$99.99</span>
  <p>Premium noise-canceling headphones with 30-hour battery life.</p>
  <div class="rating">4.5 stars (1,234 reviews)</div>
</div>
"""

# Define extraction prompt
prompt = f"""
Extract product information from this HTML and return it as JSON with the following schema:
{{
  "name": "product name",
  "price": "numeric price value",
  "description": "product description",
  "rating": "numeric rating",
  "review_count": "number of reviews"
}}

HTML:
{html_content}
"""

# Make API call with JSON mode
response = client.chat.completions.create(
    model="gpt-4-turbo-preview",
    messages=[
        {"role": "system", "content": "You are a data extraction assistant. Always respond with valid JSON."},
        {"role": "user", "content": prompt}
    ],
    response_format={"type": "json_object"}
)

# Parse structured output
product_data = json.loads(response.choices[0].message.content)
print(json.dumps(product_data, indent=2))

Output:

{
  "name": "Wireless Headphones",
  "price": 99.99,
  "description": "Premium noise-canceling headphones with 30-hour battery life.",
  "rating": 4.5,
  "review_count": 1234
}

OpenAI Function Calling for Structured Extraction

Function calling provides stronger type validation and schema enforcement:

import openai
import json

client = openai.OpenAI(api_key="your-api-key")

# Define extraction schema as a function
extraction_function = {
    "name": "extract_product_data",
    "description": "Extract structured product information from HTML",
    "parameters": {
        "type": "object",
        "properties": {
            "name": {
                "type": "string",
                "description": "Product name"
            },
            "price": {
                "type": "number",
                "description": "Price in USD"
            },
            "description": {
                "type": "string",
                "description": "Product description"
            },
            "rating": {
                "type": "number",
                "description": "Rating from 0 to 5"
            },
            "review_count": {
                "type": "integer",
                "description": "Number of customer reviews"
            },
            "features": {
                "type": "array",
                "items": {"type": "string"},
                "description": "List of product features"
            }
        },
        "required": ["name", "price", "description"]
    }
}

html_content = """
<div class="product">
  <h1>Wireless Headphones</h1>
  <span class="price">$99.99</span>
  <p>Premium noise-canceling headphones with 30-hour battery life.</p>
  <ul class="features">
    <li>Active noise cancellation</li>
    <li>30-hour battery life</li>
    <li>Bluetooth 5.0</li>
  </ul>
  <div class="rating">4.5 stars (1,234 reviews)</div>
</div>
"""

response = client.chat.completions.create(
    model="gpt-4-turbo-preview",
    messages=[
        {"role": "user", "content": f"Extract product data from this HTML:\n{html_content}"}
    ],
    functions=[extraction_function],
    function_call={"name": "extract_product_data"}
)

# Extract function arguments as structured data
function_args = json.loads(response.choices[0].message.function_call.arguments)
print(json.dumps(function_args, indent=2))

2. Anthropic Claude with Structured Prompting

Anthropic's Claude API excels at following structured output instructions through well-crafted prompts:

import anthropic
import json

client = anthropic.Anthropic(api_key="your-api-key")

html_content = """
<article>
  <h1>Breaking: New AI Model Released</h1>
  <time datetime="2024-03-15">March 15, 2024</time>
  <span class="author">by Jane Smith</span>
  <p>Researchers have unveiled a groundbreaking AI model...</p>
</article>
"""

prompt = f"""Extract the following information from the HTML article and return ONLY a valid JSON object with no additional text:

Required fields:
- title (string)
- date (ISO 8601 format)
- author (string)
- summary (string, first 100 characters of content)

HTML:
{html_content}

JSON:"""

message = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": prompt}
    ]
)

# Parse JSON response
article_data = json.loads(message.content[0].text)
print(json.dumps(article_data, indent=2))

3. Google Gemini with Schema-Guided Generation

Google's Gemini API supports schema-based output control:

import google.generativeai as genai
import json

genai.configure(api_key="your-api-key")

# Define schema for structured output
schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "price": {"type": "number"},
        "availability": {"type": "string", "enum": ["in_stock", "out_of_stock", "preorder"]},
        "specifications": {
            "type": "object",
            "properties": {
                "brand": {"type": "string"},
                "model": {"type": "string"},
                "color": {"type": "string"}
            }
        }
    },
    "required": ["title", "price", "availability"]
}

model = genai.GenerativeModel('gemini-1.5-pro')

html_content = """
<div class="product-detail">
  <h2>Samsung Galaxy S24</h2>
  <p class="price">$799.99</p>
  <span class="stock">In Stock</span>
  <dl>
    <dt>Brand:</dt><dd>Samsung</dd>
    <dt>Model:</dt><dd>Galaxy S24</dd>
    <dt>Color:</dt><dd>Titanium Gray</dd>
  </dl>
</div>
"""

prompt = f"""Extract product information from this HTML according to the specified schema.
Return only valid JSON matching the schema.

HTML:
{html_content}
"""

response = model.generate_content(
    prompt,
    generation_config=genai.GenerationConfig(
        response_mime_type="application/json",
        response_schema=schema
    )
)

product_data = json.loads(response.text)
print(json.dumps(product_data, indent=2))

JavaScript Implementation

For Node.js environments, here's an example using OpenAI with web scraping:

import OpenAI from 'openai';
import axios from 'axios';
import * as cheerio from 'cheerio';

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY
});

// Fetch and scrape webpage
async function scrapeAndExtract(url) {
  // Fetch HTML
  const response = await axios.get(url);
  const $ = cheerio.load(response.data);

  // Extract main content (reduce token usage)
  const mainContent = $('main, article, .content').html() || $('body').html();

  // Define extraction schema
  const extractionFunction = {
    name: 'extract_article_data',
    description: 'Extract structured article information',
    parameters: {
      type: 'object',
      properties: {
        headline: { type: 'string' },
        author: { type: 'string' },
        publishDate: { type: 'string' },
        category: { type: 'string' },
        tags: {
          type: 'array',
          items: { type: 'string' }
        },
        summary: { type: 'string' }
      },
      required: ['headline', 'summary']
    }
  };

  // Call OpenAI with function calling
  const completion = await openai.chat.completions.create({
    model: 'gpt-4-turbo-preview',
    messages: [
      {
        role: 'user',
        content: `Extract article data from this HTML:\n${mainContent}`
      }
    ],
    functions: [extractionFunction],
    function_call: { name: 'extract_article_data' }
  });

  // Parse structured output
  const structuredData = JSON.parse(
    completion.choices[0].message.function_call.arguments
  );

  return structuredData;
}

// Usage
const articleData = await scrapeAndExtract('https://example.com/article');
console.log(articleData);

Best Practices for Structured LLM Extraction

1. Define Clear Schemas

Always specify the exact structure you need, including:

  • Field names and types
  • Required vs. optional fields
  • Validation constraints (min/max values, enums)
  • Nested object structures
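
For example, a JSON Schema passed to function calling or schema-guided generation can make each of these explicit. This is a sketch; the field names and bounds are illustrative:

product_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string", "description": "Product name"},
        "price": {"type": "number", "minimum": 0, "description": "Price in USD"},
        "availability": {
            "type": "string",
            "enum": ["in_stock", "out_of_stock", "preorder"]
        },
        # Nested object structure
        "dimensions": {
            "type": "object",
            "properties": {
                "width_cm": {"type": "number"},
                "height_cm": {"type": "number"}
            }
        }
    },
    # Everything not listed here is optional
    "required": ["name", "price"]
}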

2. Pre-process HTML Content

Reduce token usage and improve accuracy by stripping scripts, styles, and other boilerplate before sending HTML to the LLM:

from bs4 import BeautifulSoup

def clean_html_for_llm(html_content):
    """Remove unnecessary HTML elements before LLM processing"""
    soup = BeautifulSoup(html_content, 'html.parser')

    # Remove scripts, styles, and navigation
    for tag in soup(['script', 'style', 'nav', 'header', 'footer', 'aside']):
        tag.decompose()

    # Get main content area
    main_content = soup.find('main') or soup.find('article') or soup.find('body')

    return str(main_content)

3. Validate LLM Output

Always validate the structured output:

from pydantic import BaseModel, ValidationError
from typing import List, Optional

class Product(BaseModel):
    name: str
    price: float
    description: str
    rating: Optional[float] = None
    review_count: Optional[int] = None
    features: List[str] = []

def extract_and_validate(html_content):
    # Get LLM response (previous examples)
    llm_output = get_llm_extraction(html_content)

    try:
        # Validate with Pydantic
        product = Product(**llm_output)
        return product.model_dump()
    except ValidationError as e:
        print(f"Validation error: {e}")
        return None

4. Handle Batch Extraction

For multiple items (e.g., product listings), structure your schema accordingly:

extraction_function = {
    "name": "extract_product_list",
    "description": "Extract multiple products from a listing page",
    "parameters": {
        "type": "object",
        "properties": {
            "products": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "name": {"type": "string"},
                        "price": {"type": "number"},
                        "url": {"type": "string"}
                    },
                    "required": ["name", "price"]
                }
            }
        },
        "required": ["products"]
    }
}

Combining Traditional Scraping with LLM Extraction

For optimal results, combine traditional scraping methods with LLM extraction. Use browser automation tools such as Selenium or Puppeteer to render dynamic content (for example, pages that load data via AJAX requests), then pass the rendered HTML to an LLM for intelligent extraction:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import openai
import json

def scrape_dynamic_page_with_llm(url):
    # Use Selenium to handle dynamic content
    driver = webdriver.Chrome()
    driver.get(url)

    # Wait for content to load
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CLASS_NAME, "product-grid"))
    )

    # Get rendered HTML
    html_content = driver.page_source
    driver.quit()

    # Extract structured data with LLM
    client = openai.OpenAI()
    response = client.chat.completions.create(
        model="gpt-4-turbo-preview",
        messages=[
            {"role": "user", "content": f"Extract all products from this HTML:\n{html_content}"}
        ],
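        # product_list_schema: the "extract_product_list" function schema from the batch extraction example above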
        functions=[product_list_schema],
        function_call={"name": "extract_product_list"}
    )

    return json.loads(response.choices[0].message.function_call.arguments)

Error Handling and Retry Logic

Implement robust error handling for LLM API calls:

import time
import json
from openai import OpenAI, APIError, RateLimitError

def extract_with_retry(html_content, max_retries=3):
    client = OpenAI()

    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model="gpt-4-turbo-preview",
                messages=[
                    {"role": "user", "content": f"Extract data:\n{html_content}"}
                ],
                response_format={"type": "json_object"}
            )

            # Validate JSON
            data = json.loads(response.choices[0].message.content)
            return data

        except RateLimitError:
            wait_time = 2 ** attempt  # Exponential backoff
            print(f"Rate limited. Waiting {wait_time}s...")
            time.sleep(wait_time)

        except json.JSONDecodeError:
            print(f"Invalid JSON on attempt {attempt + 1}")
            if attempt == max_retries - 1:
                raise

        except APIError as e:
            print(f"API error: {e}")
            raise

    return None

Cost Optimization

LLM API calls can be expensive. Optimize costs by:

  1. Reduce input tokens - Extract only relevant HTML sections
  2. Use smaller models - GPT-3.5 or Claude Haiku for simple extractions
  3. Batch processing - Extract multiple items in one API call
  4. Cache results - Store extracted data to avoid re-processing (see the sketch after this list)
  5. Fallback to traditional methods - Use LLMs only when selectors fail
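
As a sketch of the caching point above, a small file-based cache keyed on a hash of the cleaned HTML avoids paying for the same extraction twice. The cache directory name is arbitrary, and extract_fn stands in for any of the extraction functions shown earlier (for example, extract_with_retry):

import hashlib
import json
from pathlib import Path

CACHE_DIR = Path(".extraction_cache")
CACHE_DIR.mkdir(exist_ok=True)

def extract_with_cache(html_content, extract_fn):
    """Return a cached result if this exact HTML was already processed."""
    key = hashlib.sha256(html_content.encode("utf-8")).hexdigest()
    cache_file = CACHE_DIR / f"{key}.json"

    if cache_file.exists():
        # Cache hit: no API call, no cost
        return json.loads(cache_file.read_text())

    # Cache miss: call the LLM and store the result
    data = extract_fn(html_content)
    cache_file.write_text(json.dumps(data))
    return data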

Conclusion

Getting structured output from LLMs for web scraping combines the reliability of schema validation with the intelligence of natural language understanding. By using JSON mode, function calling, or schema-guided generation, you can extract consistent, typed data from complex web pages. When handling dynamic content with browser automation, LLMs provide a powerful alternative to brittle CSS selectors, especially for pages with inconsistent structures or natural language content.

For production use, always validate LLM outputs, implement error handling, and monitor API costs. The combination of traditional web scraping tools and LLM-powered extraction provides the best balance of reliability, accuracy, and cost-effectiveness.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
