# How do I use LLMs to scrape data from tables and lists?
Large Language Models (LLMs) excel at extracting structured data from tables and lists on web pages, even when the HTML structure is complex or inconsistent. Unlike traditional CSS selectors or XPath, LLMs can understand the semantic meaning of data and adapt to varying layouts without requiring manual selector updates.
## Why Use LLMs for Table and List Scraping?
LLMs offer several advantages when scraping tabular and list-based data:
- Semantic Understanding: LLMs can identify data relationships without relying on specific HTML structure
- Resilience to Changes: When websites update their layouts, LLM-based scrapers often continue working without modifications
- Complex Table Handling: Nested tables, merged cells, and irregular structures are easier to parse
- Data Normalization: LLMs can automatically clean and standardize extracted data
- Missing Data Handling: AI models can intelligently handle incomplete or irregular data patterns (see the prompt sketch after this list)
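The last two points can be exercised simply by stating the rules in the extraction prompt. A minimal sketch of such a prompt; the field names and formatting rules here are illustrative assumptions, not a fixed schema:

```python
# Illustrative extraction prompt that also normalizes values and handles gaps.
# The keys and formatting rules below are example assumptions, not a required schema.
normalization_prompt = """
Extract every row of the table below as a JSON array of objects with the keys
"name", "price" and "release_date".
- Convert prices to plain USD numbers (e.g. "$1,299.00" -> 1299.0).
- Convert dates to ISO 8601 format (YYYY-MM-DD).
- Use null for any value that is missing or cannot be determined.
Return only valid JSON, no explanation.
"""
```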
## Basic Approach: Using LLMs with HTML Input
The most straightforward method involves sending HTML content to an LLM with instructions to extract structured data.
### Python Example with OpenAI API

```python
import openai
import requests
from bs4 import BeautifulSoup
import json

def scrape_table_with_llm(url):
    # Fetch the webpage
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    # Extract the table HTML (or get specific table)
    table_html = str(soup.find('table'))

    # Create the prompt
    prompt = f"""
Extract all data from this HTML table and return it as a JSON array of objects.
Each row should be an object with keys matching the column headers.

HTML:
{table_html}

Return only valid JSON, no explanation.
"""

    # Call OpenAI API
    client = openai.OpenAI(api_key='your-api-key')
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are a data extraction assistant that returns only valid JSON."},
            {"role": "user", "content": prompt}
        ],
        temperature=0
    )

    # Parse the JSON response
    extracted_data = json.loads(response.choices[0].message.content)
    return extracted_data

# Usage
data = scrape_table_with_llm('https://example.com/pricing')
print(json.dumps(data, indent=2))
```
### JavaScript Example with OpenAI API

```javascript
const OpenAI = require('openai');
const axios = require('axios');
const cheerio = require('cheerio');

async function scrapeTableWithLLM(url) {
  // Fetch the webpage
  const response = await axios.get(url);
  const $ = cheerio.load(response.data);

  // Extract the first table's outer HTML (including the <table> tag itself)
  const tableHtml = $.html($('table').first());

  // Create the prompt
  const prompt = `
Extract all data from this HTML table and return it as a JSON array of objects.
Each row should be an object with keys matching the column headers.

HTML:
${tableHtml}

Return only valid JSON, no explanation.
`;

  // Call OpenAI API
  const openai = new OpenAI({ apiKey: 'your-api-key' });
  const completion = await openai.chat.completions.create({
    model: 'gpt-4',
    messages: [
      { role: 'system', content: 'You are a data extraction assistant that returns only valid JSON.' },
      { role: 'user', content: prompt }
    ],
    temperature: 0
  });

  // Parse and return JSON
  const extractedData = JSON.parse(completion.choices[0].message.content);
  return extractedData;
}

// Usage
scrapeTableWithLLM('https://example.com/pricing')
  .then(data => console.log(JSON.stringify(data, null, 2)))
  .catch(console.error);
```
## Advanced Technique: Using Function Calling for Structured Output

Modern LLMs support function calling for structured data extraction, which constrains the response to a defined schema and makes malformed output far less likely.
### Python Example with OpenAI Function Calling

```python
import openai
import requests
from bs4 import BeautifulSoup
import json

def scrape_table_with_functions(url, schema):
    # Fetch HTML
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    table_html = str(soup.find('table'))

    # Define function schema
    functions = [
        {
            "name": "extract_table_data",
            "description": "Extract structured data from an HTML table",
            "parameters": {
                "type": "object",
                "properties": {
                    "rows": {
                        "type": "array",
                        "description": "Array of table rows",
                        "items": {
                            "type": "object",
                            "properties": schema
                        }
                    }
                },
                "required": ["rows"]
            }
        }
    ]

    client = openai.OpenAI(api_key='your-api-key')
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "user", "content": f"Extract all data from this table:\n\n{table_html}"}
        ],
        functions=functions,
        function_call={"name": "extract_table_data"},
        temperature=0
    )

    # Parse function call response
    function_args = json.loads(response.choices[0].message.function_call.arguments)
    return function_args['rows']

# Usage with custom schema
schema = {
    "product": {"type": "string"},
    "price": {"type": "number"},
    "features": {"type": "array", "items": {"type": "string"}}
}
data = scrape_table_with_functions('https://example.com/products', schema)
```
## Scraping Lists with LLMs
Lists (ordered and unordered) can be extracted similarly to tables. Here's an approach for complex nested lists:
### Python Example for List Extraction

```python
import openai
import requests
from bs4 import BeautifulSoup
import json

def scrape_list_with_llm(url, list_selector='ul'):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    # Get the list HTML (select_one accepts CSS selectors such as 'ul.feature-list')
    list_html = str(soup.select_one(list_selector))

    prompt = f"""
Extract all items from this HTML list into a structured JSON format.
Preserve the hierarchy for nested lists.

HTML:
{list_html}

Return a JSON array where nested items are represented as sub-arrays.
Return only valid JSON.
"""

    client = openai.OpenAI(api_key='your-api-key')
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You extract list data and return valid JSON."},
            {"role": "user", "content": prompt}
        ],
        temperature=0
    )

    return json.loads(response.choices[0].message.content)

# Usage
list_data = scrape_list_with_llm('https://example.com/features', 'ul.feature-list')
```
## Using WebScraping.AI with LLM Features
WebScraping.AI provides built-in AI-powered extraction that simplifies scraping tables and lists without managing LLM APIs directly.
### Example with WebScraping.AI Question Endpoint

```python
import requests

def scrape_with_webscraping_ai(url, question):
    # The /ai/question endpoint answers a free-form question about the page
    api_url = "https://api.webscraping.ai/ai/question"
    params = {
        "api_key": "your-api-key",
        "url": url,
        "question": question
    }

    response = requests.get(api_url, params=params)
    return response.text  # the answer is returned as text

# Extract pricing table
pricing_data = scrape_with_webscraping_ai(
    "https://example.com/pricing",
    "Extract all pricing tiers with their features and prices as JSON"
)
print(pricing_data)
```
### Using Field Extraction for Tables

```javascript
const axios = require('axios');

async function extractTableFields(url) {
  // The /ai/fields endpoint extracts named fields described in natural language
  const response = await axios.get('https://api.webscraping.ai/ai/fields', {
    params: {
      api_key: 'your-api-key',
      url: url,
      fields: JSON.stringify({
        product_name: "Extract the product name",
        price: "Extract the price as a number",
        rating: "Extract the rating",
        availability: "Extract if the product is in stock"
      })
    }
  });
  return response.data;
}

extractTableFields('https://example.com/products')
  .then(console.log)
  .catch(console.error);
```
## Best Practices for LLM-Based Table Scraping

### 1. Minimize HTML Sent to LLM
LLMs have token limits, so send only relevant HTML:
```python
from bs4 import BeautifulSoup

def extract_minimal_html(html_content, selector):
    soup = BeautifulSoup(html_content, 'html.parser')
    element = soup.select_one(selector)

    # Remove unnecessary attributes
    for tag in element.find_all(True):
        # Keep only essential attributes
        attrs_to_keep = ['colspan', 'rowspan']
        tag.attrs = {k: v for k, v in tag.attrs.items() if k in attrs_to_keep}

    return str(element)
```
### 2. Use Clear Prompts with Examples
Provide example output format:
```python
prompt = f"""
Extract data from this table and return JSON in this exact format:
[
  {{"name": "Product A", "price": 29.99, "stock": "In Stock"}},
  {{"name": "Product B", "price": 49.99, "stock": "Out of Stock"}}
]

HTML:
{table_html}

Return only the JSON array.
"""
```
### 3. Set Temperature to 0 for Consistency

```python
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
    temperature=0  # Near-deterministic output for consistent extraction
)
```
### 4. Handle Pagination
For multi-page tables:
```python
def scrape_paginated_table(base_url, max_pages=10):
    all_data = []

    for page in range(1, max_pages + 1):
        url = f"{base_url}?page={page}"
        page_data = scrape_table_with_llm(url)

        if not page_data:  # No more data
            break

        all_data.extend(page_data)

    return all_data
```
### 5. Implement Error Handling

```python
import json
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=4, max=10))
def scrape_with_retry(url):
    try:
        result = scrape_table_with_llm(url)
        # Validate that the parsed result is a list of rows
        if isinstance(result, list):
            return result
        raise ValueError("Invalid response format")
    except json.JSONDecodeError:
        raise ValueError("LLM returned invalid JSON")
```
## Comparing LLM Approach vs Traditional Methods
| Aspect | Traditional (XPath/CSS) | LLM-Based |
|--------|-------------------------|-----------|
| Setup Time | Fast for simple tables | Requires API setup |
| Resilience | Breaks with layout changes | Adapts to changes |
| Cost | Free | API costs per request |
| Speed | Very fast | Slower (API latency) |
| Complex Structures | Difficult | Handles well |
| Data Cleaning | Manual | Automatic |
For high-volume scraping with stable layouts, traditional methods are more cost-effective. For dynamic sites or complex data structures, using LLMs for data extraction provides better long-term maintainability.
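A practical middle ground is a hybrid pipeline: parse structurally first and fall back to the LLM only when that fails. Here is a minimal sketch of the idea, assuming the `scrape_table_with_llm` function defined earlier and pandas installed; `pandas.read_html` stands in for the "traditional" path:

```python
import pandas as pd

def scrape_table_hybrid(url):
    """Try cheap structural parsing first; use the LLM only as a fallback."""
    try:
        # pandas.read_html handles well-formed <table> markup with no API cost
        tables = pd.read_html(url)
        if tables:
            return tables[0].to_dict(orient="records")
    except ValueError:
        pass  # No parseable table found; fall through to the LLM path

    # Fallback: semantic extraction via the LLM-based function defined earlier
    return scrape_table_with_llm(url)
```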
## Cost Optimization Tips
- Cache Results: Store extracted data to avoid re-processing identical pages (see the caching sketch after this list)
- Batch Requests: Combine multiple small tables in one prompt when possible
- Use Smaller Models: GPT-3.5-turbo is often sufficient for simple tables
- Pre-filter HTML: Remove scripts, styles, and irrelevant tags before sending to LLM
- Smart Fallbacks: Use traditional parsing first, LLM only for complex cases
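As a sketch of the first tip, a simple file cache keyed on the URL keeps repeat runs from paying for the same extraction twice. The cache directory and hashing choice are illustrative, and it reuses `scrape_table_with_llm` from earlier:

```python
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path(".llm_scrape_cache")
CACHE_DIR.mkdir(exist_ok=True)

def scrape_table_cached(url):
    """Return cached results for a URL if present; otherwise call the LLM once and store them."""
    cache_file = CACHE_DIR / (hashlib.sha256(url.encode()).hexdigest() + ".json")
    if cache_file.exists():
        return json.loads(cache_file.read_text())

    data = scrape_table_with_llm(url)  # LLM call happens only on a cache miss
    cache_file.write_text(json.dumps(data))
    return data
```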
## Conclusion
LLMs transform table and list scraping by understanding data semantically rather than structurally. While they introduce API costs and latency, the benefits of resilience and automatic data normalization often outweigh these drawbacks for complex scraping tasks. Combining traditional methods for simple cases with AI-powered web scraping for challenging scenarios provides the optimal balance of cost, speed, and reliability.