Can LLMs Understand and Extract Data from Complex Page Layouts?

Yes, Large Language Models (LLMs) excel at understanding and extracting data from complex page layouts. Unlike traditional web scraping tools that rely on rigid CSS selectors or XPath expressions, LLMs can comprehend page structure semantically, making them particularly effective for intricate layouts, nested elements, and dynamically changing websites.

How LLMs Process Complex Layouts

LLMs analyze web pages differently than traditional parsers. They understand:

  • Semantic relationships between elements (headers, subheaders, content blocks)
  • Visual hierarchy through HTML structure and attributes
  • Contextual patterns that indicate data relationships
  • Nested structures like product cards, comment threads, or table hierarchies
  • Unstructured or semi-structured content that changes between pages

This semantic understanding allows LLMs to extract data even when the HTML structure varies significantly across pages.
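A minimal, self-contained illustration of this difference (no API call involved; the helper names are for demonstration only): a rule keyed to one page's markup fails on a structurally different page, while a single semantic instruction applies to both verbatim.

```python
# Two pages carrying the same fact in different markup conventions
page_a = '<div class="price-box"><span class="price">$19.99</span></div>'
page_b = '<section><p data-field="cost">$19.99</p></section>'

# A rule keyed to page A's markup (a bare class-attribute check standing in
# for a CSS selector) matches page A but not page B
def has_price_class(html):
    return 'class="price"' in html

# The same semantic instruction works for both pages without changes
def build_price_prompt(html):
    return f"Extract the product price from this HTML:\n{html}"

print(has_price_class(page_a), has_price_class(page_b))  # True False
```

The selector-style check breaks as soon as the markup convention changes; the prompt does not mention any selector, so it survives the change.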

Advantages Over Traditional Scraping Methods

1. Layout Flexibility

Traditional scrapers break when websites change their CSS classes or DOM structure. LLMs adapt naturally because they understand content semantically rather than relying on specific selectors.

# Traditional approach - breaks when classes change
from bs4 import BeautifulSoup

html = """<div class="product-v2-card-new">...</div>"""
soup = BeautifulSoup(html, 'html.parser')
product = soup.select('.product-card')  # Returns [] - the class changed to product-v2-card-new

# LLM approach - understands intent
import openai

response = openai.chat.completions.create(
    model="gpt-4",
    messages=[{
        "role": "user",
        "content": f"Extract product name, price, and rating from this HTML:\n{html}"
    }]
)

2. Complex Nested Structures

LLMs can navigate deeply nested HTML without explicit path definitions. For example, extracting reviews with nested replies:

from anthropic import Anthropic

client = Anthropic(api_key="your-api-key")

html_content = """
<div class="review-thread">
    <div class="main-review">
        <p class="reviewer">John Doe</p>
        <p class="rating">5 stars</p>
        <p class="comment">Great product!</p>
        <div class="replies">
            <div class="reply">
                <p class="replier">Support Team</p>
                <p class="reply-text">Thank you for your feedback!</p>
            </div>
        </div>
    </div>
</div>
"""

message = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": f"""Extract all reviews and their nested replies from this HTML.
        Return as JSON with this structure:
        {{
            "reviews": [
                {{
                    "reviewer": "name",
                    "rating": "rating",
                    "comment": "comment text",
                    "replies": [
                        {{"replier": "name", "reply_text": "text"}}
                    ]
                }}
            ]
        }}

        HTML:
        {html_content}
        """
    }]
)

print(message.content[0].text)

3. Multi-Column and Grid Layouts

LLMs can understand complex grid layouts and extract data while preserving relationships between elements:

// Using OpenAI with JavaScript for a complex product comparison table
import OpenAI from 'openai';

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY
});

const complexHTML = `
<div class="comparison-grid">
  <div class="product-col">
    <h3>Product A</h3>
    <div class="specs">
      <span class="price">$299</span>
      <ul class="features">
        <li>Feature 1</li>
        <li>Feature 2</li>
      </ul>
    </div>
  </div>
  <div class="product-col">
    <h3>Product B</h3>
    <div class="specs">
      <span class="price">$399</span>
      <ul class="features">
        <li>Feature 1</li>
        <li>Feature 3</li>
      </ul>
    </div>
  </div>
</div>
`;

async function extractComparison() {
  const completion = await openai.chat.completions.create({
    model: "gpt-4",
    messages: [{
      role: "user",
      content: `Extract product comparison data from this HTML grid layout.
      Return as JSON array with product name, price, and features.

      HTML: ${complexHTML}`
    }],
    response_format: { type: "json_object" }
  });

  console.log(completion.choices[0].message.content);
}

extractComparison();

Best Practices for Complex Layout Extraction

1. Provide Clear Instructions

Be specific about the data structure you want and the relationships between elements:

import google.generativeai as genai

genai.configure(api_key="your-api-key")
model = genai.GenerativeModel('gemini-1.5-pro')

prompt = """
Extract data from this e-commerce page layout. The page has:
- A main product section (name, price, availability)
- A specifications table (key-value pairs)
- Customer reviews (reviewer name, rating, date, comment)
- Related products sidebar (product name, thumbnail URL, price)

Return a structured JSON with all this information, preserving the hierarchy.

HTML:
{html_content}
"""

response = model.generate_content(prompt)
print(response.text)

2. Use Function Calling for Structured Output

Modern LLMs support function calling to ensure consistent output format:

import openai

tools = [{
    "type": "function",
    "function": {
        "name": "extract_product_data",
        "description": "Extract structured product data from HTML",
        "parameters": {
            "type": "object",
            "properties": {
                "product_name": {"type": "string"},
                "price": {"type": "number"},
                "specifications": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "key": {"type": "string"},
                            "value": {"type": "string"}
                        }
                    }
                },
                "reviews": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "reviewer": {"type": "string"},
                            "rating": {"type": "number"},
                            "comment": {"type": "string"}
                        }
                    }
                }
            },
            "required": ["product_name", "price"]
        }
    }
}]

response = openai.chat.completions.create(
    model="gpt-4",
    messages=[{
        "role": "user",
        "content": f"Extract product data from: {html_content}"
    }],
    tools=tools,
    tool_choice={"type": "function", "function": {"name": "extract_product_data"}}
)

3. Preprocess HTML for Better Results

While LLMs handle complex layouts well, preprocessing can improve accuracy and reduce token costs:

from bs4 import BeautifulSoup
import re

def simplify_html(html_content):
    """Remove unnecessary attributes and scripts while preserving structure"""
    soup = BeautifulSoup(html_content, 'html.parser')

    # Remove scripts, styles, and comments
    for element in soup(['script', 'style', 'noscript']):
        element.decompose()

    # Remove most attributes except semantic ones
    for tag in soup.find_all(True):
        tag.attrs = {
            key: value for key, value in tag.attrs.items()
            if key in ['class', 'id', 'role', 'aria-label']
        }

    # Remove extra whitespace
    clean_html = re.sub(r'\s+', ' ', str(soup))

    return clean_html

# Use simplified HTML with LLM
simplified = simplify_html(complex_html)

4. Handle Dynamic Content

For complex JavaScript-rendered pages, combine browser automation with LLM extraction:

from playwright.sync_api import sync_playwright
import anthropic

def scrape_dynamic_layout(url):
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)

        # Wait for dynamic content to load
        page.wait_for_selector('.product-grid', state='visible')

        # Get fully rendered HTML
        html_content = page.content()
        browser.close()

    # Extract with LLM
    client = anthropic.Anthropic(api_key="your-api-key")
    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=2048,
        messages=[{
            "role": "user",
            "content": f"Extract all products from this grid layout:\n{html_content}"
        }]
    )

    return message.content[0].text

result = scrape_dynamic_layout("https://example.com/products")

Limitations and Considerations

Token Limits

Complex pages can exceed token limits. Strategies to handle this:

  1. Extract specific sections: Target only relevant HTML portions
  2. Chunk processing: Split large pages into sections
  3. Summarize first: Use LLM to identify relevant sections, then extract data

from bs4 import BeautifulSoup
import re

def chunk_html_by_sections(html_content, max_chars=10000):
    """Split HTML into semantic chunks"""
    soup = BeautifulSoup(html_content, 'html.parser')
    chunks = []

    # Split by major sections
    for section in soup.find_all(['section', 'article', 'div'], class_=re.compile('(main|content|product)')):
        chunk_html = str(section)
        if len(chunk_html) < max_chars:
            chunks.append(chunk_html)

    return chunks

# Process each chunk
for chunk in chunk_html_by_sections(large_html):
    result = extract_with_llm(chunk)

Cost Considerations

LLM-based extraction costs more than traditional parsing. Optimize by:

  • Using smaller models (GPT-3.5, Claude Haiku) for simple extractions
  • Caching results for static content
  • Combining traditional parsing with LLM extraction

# Hybrid approach: Use traditional parsing first, LLM for complex parts
def hybrid_extraction(html):
    soup = BeautifulSoup(html, 'html.parser')

    # Easy extraction with BeautifulSoup
    title = soup.find('h1', class_='product-title').text
    price = soup.find('span', class_='price').text

    # Use LLM only for complex nested reviews
    reviews_section = str(soup.find('div', id='reviews'))
    reviews = extract_reviews_with_llm(reviews_section)

    return {
        'title': title,
        'price': price,
        'reviews': reviews
    }
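The caching suggestion above can be sketched with a content-hash keyed cache, so repeated scrapes of unchanged HTML skip the LLM call entirely (`extract_fn` here is a stand-in for any of the extraction functions shown earlier):

```python
import hashlib

# Module-level cache keyed by a hash of the raw HTML
_cache = {}

def cached_extract(html, extract_fn):
    """Call the (expensive) extractor only for HTML not seen before."""
    key = hashlib.sha256(html.encode()).hexdigest()
    if key not in _cache:
        _cache[key] = extract_fn(html)
    return _cache[key]
```

For long-running scrapers, the same idea extends naturally to a persistent store (Redis, SQLite) with an expiry so genuinely updated pages are re-extracted.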

Accuracy Validation

Always validate LLM outputs, especially for critical data:

import json
from jsonschema import validate, ValidationError

# Define expected schema
schema = {
    "type": "object",
    "properties": {
        "product_name": {"type": "string", "minLength": 1},
        "price": {"type": "number", "minimum": 0},
        "rating": {"type": "number", "minimum": 0, "maximum": 5}
    },
    "required": ["product_name", "price"]
}

def extract_and_validate(html_content):
    # Extract with LLM
    llm_response = extract_with_llm(html_content)

    try:
        data = json.loads(llm_response)
        validate(instance=data, schema=schema)
        return data
    except (json.JSONDecodeError, ValidationError) as e:
        print(f"Validation failed: {e}")
        return None

Real-World Use Cases

E-commerce Product Catalogs

Extract products from grid layouts with varying structures:

def extract_product_catalog(html):
    prompt = """
    Extract all products from this catalog page. Products may be in grids,
    lists, or cards. Each product should include:
    - Product name
    - Price (current and original if on sale)
    - Image URL
    - Rating (if available)
    - Stock status

    Return as JSON array.
    """

    # Use appropriate LLM API
    return llm_extract(html, prompt)

News Article Aggregation

Extract articles with complex layouts including sidebars, related content, and embedded media:

async function extractNewsArticles(html) {
  const prompt = `
    Extract the main article and related articles from this news page.
    Include:
    - Main article (headline, author, date, content, featured image)
    - Related articles (headline, thumbnail, excerpt)
    - Categories/tags

    Return as structured JSON.
  `;

  const response = await callLLMAPI(html, prompt);
  return JSON.parse(response);
}

Social Media Threads

Extract conversation threads with nested replies and reactions:

def extract_social_thread(html):
    """Extract social media thread with nested structure"""
    prompt = """
    Extract this social media thread preserving the reply hierarchy.
    Structure:
    {
        "main_post": {
            "author": "",
            "content": "",
            "timestamp": "",
            "reactions": {}
        },
        "replies": [
            {
                "author": "",
                "content": "",
                "timestamp": "",
                "nested_replies": [...]
            }
        ]
    }
    """

    return llm_extract(html, prompt)

Conclusion

LLMs are exceptionally capable at understanding and extracting data from complex page layouts. They offer significant advantages over traditional web scraping methods, particularly for:

  • Pages with inconsistent structure
  • Deeply nested content hierarchies
  • Multi-column and grid layouts
  • Sites that frequently change their HTML structure

While there are considerations around cost and token limits, combining LLM-powered extraction with traditional web scraping techniques provides the best of both worlds: the reliability of semantic understanding and the efficiency of rule-based parsing.

For complex layouts that would require extensive XPath expressions or fragile CSS selectors, LLMs offer a more maintainable and adaptable solution that can significantly reduce scraper maintenance overhead.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
