# How do I use LLMs to scrape data from tables and lists?
Large Language Models (LLMs) excel at extracting structured data from tables and lists on web pages, even when the HTML structure is complex or inconsistent. Unlike traditional CSS selectors or XPath, LLMs can understand the semantic meaning of data and adapt to varying layouts without requiring manual selector updates.
## Why Use LLMs for Table and List Scraping?
LLMs offer several advantages when scraping tabular and list-based data:
- Semantic Understanding: LLMs can identify data relationships without relying on specific HTML structure
- Resilience to Changes: When websites update their layouts, LLM-based scrapers often continue working without modifications
- Complex Table Handling: Nested tables, merged cells, and irregular structures are easier to parse
- Data Normalization: LLMs can automatically clean and standardize extracted data
- Missing Data Handling: AI models can intelligently handle incomplete or irregular data patterns (see the prompt sketch after this list)
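The last two points can be exercised simply by stating the rules in the extraction prompt. A minimal sketch of such a prompt; the field names and formatting rules here are illustrative assumptions, not a fixed schema:

```python
# Illustrative extraction prompt that also normalizes values and handles gaps.
# The keys and formatting rules below are example assumptions, not a required schema.
normalization_prompt = """
Extract every row of the table below as a JSON array of objects with the keys
"name", "price" and "release_date".
- Convert prices to plain USD numbers (e.g. "$1,299.00" -> 1299.0).
- Convert dates to ISO 8601 format (YYYY-MM-DD).
- Use null for any value that is missing or cannot be determined.
Return only valid JSON, no explanation.
"""
```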
## Basic Approach: Using LLMs with HTML Input
The most straightforward method involves sending HTML content to an LLM with instructions to extract structured data.
### Python Example with OpenAI API

```python
import openai
import requests
from bs4 import BeautifulSoup
import json

def scrape_table_with_llm(url):
    # Fetch the webpage
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    # Extract the table HTML (or get specific table)
    table_html = str(soup.find('table'))

    # Create the prompt
    prompt = f"""
Extract all data from this HTML table and return it as a JSON array of objects.
Each row should be an object with keys matching the column headers.

HTML:
{table_html}

Return only valid JSON, no explanation.
"""

    # Call OpenAI API
    client = openai.OpenAI(api_key='your-api-key')
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are a data extraction assistant that returns only valid JSON."},
            {"role": "user", "content": prompt}
        ],
        temperature=0
    )

    # Parse the JSON response
    extracted_data = json.loads(response.choices[0].message.content)
    return extracted_data

# Usage
data = scrape_table_with_llm('https://example.com/pricing')
print(json.dumps(data, indent=2))
```
### JavaScript Example with OpenAI API

```javascript
const OpenAI = require('openai');
const axios = require('axios');
const cheerio = require('cheerio');

async function scrapeTableWithLLM(url) {
  // Fetch the webpage
  const response = await axios.get(url);
  const $ = cheerio.load(response.data);

  // Extract the first table's outer HTML (including the <table> tag itself)
  const tableHtml = $.html($('table').first());

  // Create the prompt
  const prompt = `
Extract all data from this HTML table and return it as a JSON array of objects.
Each row should be an object with keys matching the column headers.

HTML:
${tableHtml}

Return only valid JSON, no explanation.
`;

  // Call OpenAI API
  const openai = new OpenAI({ apiKey: 'your-api-key' });
  const completion = await openai.chat.completions.create({
    model: 'gpt-4',
    messages: [
      { role: 'system', content: 'You are a data extraction assistant that returns only valid JSON.' },
      { role: 'user', content: prompt }
    ],
    temperature: 0
  });

  // Parse and return JSON
  const extractedData = JSON.parse(completion.choices[0].message.content);
  return extractedData;
}

// Usage
scrapeTableWithLLM('https://example.com/pricing')
  .then(data => console.log(JSON.stringify(data, null, 2)))
  .catch(console.error);
```
## Advanced Technique: Using Function Calling for Structured Output

Modern LLMs support function calling for structured data extraction, which constrains the response to a defined schema and makes malformed output far less likely.
### Python Example with OpenAI Function Calling

```python
import openai
import requests
from bs4 import BeautifulSoup
import json

def scrape_table_with_functions(url, schema):
    # Fetch HTML
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    table_html = str(soup.find('table'))

    # Define function schema
    functions = [
        {
            "name": "extract_table_data",
            "description": "Extract structured data from an HTML table",
            "parameters": {
                "type": "object",
                "properties": {
                    "rows": {
                        "type": "array",
                        "description": "Array of table rows",
                        "items": {
                            "type": "object",
                            "properties": schema
                        }
                    }
                },
                "required": ["rows"]
            }
        }
    ]

    client = openai.OpenAI(api_key='your-api-key')
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "user", "content": f"Extract all data from this table:\n\n{table_html}"}
        ],
        functions=functions,
        function_call={"name": "extract_table_data"},
        temperature=0
    )

    # Parse function call response
    function_args = json.loads(response.choices[0].message.function_call.arguments)
    return function_args['rows']

# Usage with custom schema
schema = {
    "product": {"type": "string"},
    "price": {"type": "number"},
    "features": {"type": "array", "items": {"type": "string"}}
}
data = scrape_table_with_functions('https://example.com/products', schema)
```
## Scraping Lists with LLMs
Lists (ordered and unordered) can be extracted similarly to tables. Here's an approach for complex nested lists:
### Python Example for List Extraction

```python
import openai
import requests
from bs4 import BeautifulSoup
import json

def scrape_list_with_llm(url, list_selector='ul'):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    # Get the list HTML (select_one accepts CSS selectors such as 'ul.feature-list')
    list_html = str(soup.select_one(list_selector))

    prompt = f"""
Extract all items from this HTML list into a structured JSON format.
Preserve the hierarchy for nested lists.

HTML:
{list_html}

Return a JSON array where nested items are represented as sub-arrays.
Return only valid JSON.
"""

    client = openai.OpenAI(api_key='your-api-key')
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You extract list data and return valid JSON."},
            {"role": "user", "content": prompt}
        ],
        temperature=0
    )

    return json.loads(response.choices[0].message.content)

# Usage
list_data = scrape_list_with_llm('https://example.com/features', 'ul.feature-list')
```
## Using WebScraping.AI with LLM Features
WebScraping.AI provides built-in AI-powered extraction that simplifies scraping tables and lists without managing LLM APIs directly.
### Example with WebScraping.AI Question Endpoint

```python
import requests

def scrape_with_webscraping_ai(url, question):
    # The /ai/question endpoint answers a free-form question about the page
    api_url = "https://api.webscraping.ai/ai/question"
    params = {
        "api_key": "your-api-key",
        "url": url,
        "question": question
    }

    response = requests.get(api_url, params=params)
    return response.text  # the answer is returned as text

# Extract pricing table
pricing_data = scrape_with_webscraping_ai(
    "https://example.com/pricing",
    "Extract all pricing tiers with their features and prices as JSON"
)
print(pricing_data)
```
### Using Field Extraction for Tables

```javascript
const axios = require('axios');

async function extractTableFields(url) {
  // The /ai/fields endpoint extracts named fields described in natural language
  const response = await axios.get('https://api.webscraping.ai/ai/fields', {
    params: {
      api_key: 'your-api-key',
      url: url,
      fields: JSON.stringify({
        product_name: "Extract the product name",
        price: "Extract the price as a number",
        rating: "Extract the rating",
        availability: "Extract if the product is in stock"
      })
    }
  });
  return response.data;
}

extractTableFields('https://example.com/products')
  .then(console.log)
  .catch(console.error);
```
## Best Practices for LLM-Based Table Scraping

### 1. Minimize HTML Sent to LLM
LLMs have token limits, so send only relevant HTML:
```python
from bs4 import BeautifulSoup

def extract_minimal_html(html_content, selector):
    soup = BeautifulSoup(html_content, 'html.parser')
    element = soup.select_one(selector)

    # Remove unnecessary attributes
    for tag in element.find_all(True):
        # Keep only essential attributes
        attrs_to_keep = ['colspan', 'rowspan']
        tag.attrs = {k: v for k, v in tag.attrs.items() if k in attrs_to_keep}

    return str(element)
```
### 2. Use Clear Prompts with Examples
Provide example output format:
```python
prompt = f"""
Extract data from this table and return JSON in this exact format:
[
  {{"name": "Product A", "price": 29.99, "stock": "In Stock"}},
  {{"name": "Product B", "price": 49.99, "stock": "Out of Stock"}}
]

HTML:
{table_html}

Return only the JSON array.
"""
```
### 3. Set Temperature to 0 for Consistency

```python
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
    temperature=0  # Near-deterministic output for consistent extraction
)
```
### 4. Handle Pagination
For multi-page tables:
```python
def scrape_paginated_table(base_url, max_pages=10):
    all_data = []

    for page in range(1, max_pages + 1):
        url = f"{base_url}?page={page}"
        page_data = scrape_table_with_llm(url)

        if not page_data:  # No more data
            break

        all_data.extend(page_data)

    return all_data
```
### 5. Implement Error Handling

```python
import json
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=4, max=10))
def scrape_with_retry(url):
    try:
        result = scrape_table_with_llm(url)
        # Validate that the parsed result is a list of rows
        if isinstance(result, list):
            return result
        raise ValueError("Invalid response format")
    except json.JSONDecodeError:
        raise ValueError("LLM returned invalid JSON")
```
## Comparing LLM Approach vs Traditional Methods
| Aspect | Traditional (XPath/CSS) | LLM-Based |
|--------|-------------------------|-----------|
| Setup Time | Fast for simple tables | Requires API setup |
| Resilience | Breaks with layout changes | Adapts to changes |
| Cost | Free | API costs per request |
| Speed | Very fast | Slower (API latency) |
| Complex Structures | Difficult | Handles well |
| Data Cleaning | Manual | Automatic |
For high-volume scraping with stable layouts, traditional methods are more cost-effective. For dynamic sites or complex data structures, using LLMs for data extraction provides better long-term maintainability.
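A practical middle ground is a hybrid pipeline: parse structurally first and fall back to the LLM only when that fails. Here is a minimal sketch of the idea, assuming the `scrape_table_with_llm` function defined earlier and pandas installed; `pandas.read_html` stands in for the "traditional" path:

```python
import pandas as pd

def scrape_table_hybrid(url):
    """Try cheap structural parsing first; use the LLM only as a fallback."""
    try:
        # pandas.read_html handles well-formed <table> markup with no API cost
        tables = pd.read_html(url)
        if tables:
            return tables[0].to_dict(orient="records")
    except ValueError:
        pass  # No parseable table found; fall through to the LLM path

    # Fallback: semantic extraction via the LLM-based function defined earlier
    return scrape_table_with_llm(url)
```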
## Cost Optimization Tips
- Cache Results: Store extracted data to avoid re-processing identical pages (see the caching sketch after this list)
- Batch Requests: Combine multiple small tables in one prompt when possible
- Use Smaller Models: GPT-3.5-turbo is often sufficient for simple tables
- Pre-filter HTML: Remove scripts, styles, and irrelevant tags before sending to LLM
- Smart Fallbacks: Use traditional parsing first, LLM only for complex cases
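As a sketch of the first tip, a simple file cache keyed on the URL keeps repeat runs from paying for the same extraction twice. The cache directory and hashing choice are illustrative, and it reuses `scrape_table_with_llm` from earlier:

```python
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path(".llm_scrape_cache")
CACHE_DIR.mkdir(exist_ok=True)

def scrape_table_cached(url):
    """Return cached results for a URL if present; otherwise call the LLM once and store them."""
    cache_file = CACHE_DIR / (hashlib.sha256(url.encode()).hexdigest() + ".json")
    if cache_file.exists():
        return json.loads(cache_file.read_text())

    data = scrape_table_with_llm(url)  # LLM call happens only on a cache miss
    cache_file.write_text(json.dumps(data))
    return data
```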
## Conclusion
LLMs transform table and list scraping by understanding data semantically rather than structurally. While they introduce API costs and latency, the benefits of resilience and automatic data normalization often outweigh these drawbacks for complex scraping tasks. Combining traditional methods for simple cases with AI-powered web scraping for challenging scenarios provides the optimal balance of cost, speed, and reliability.