Can LLMs Extract Data from JavaScript-Rendered Pages?
Yes, Large Language Models (LLMs) can extract data from JavaScript-rendered pages, but they cannot directly execute JavaScript. Instead, you must first render the page using a headless browser like Puppeteer, Playwright, or Selenium to get the final HTML output, then pass that rendered HTML to the LLM for data extraction.
This two-step approach combines the strengths of browser automation (handling dynamic content) with LLM capabilities (intelligent data extraction), making it particularly powerful for modern web applications that rely heavily on client-side rendering.
Understanding the Challenge
JavaScript-rendered pages present unique challenges for web scraping:
- Client-side rendering: Content is generated dynamically by JavaScript after the initial page load
- Asynchronous data loading: Data may load via AJAX requests after user interactions
- Complex state management: Modern frameworks like React, Vue, and Angular create dynamic UIs
- Delayed content: Elements may appear only after certain conditions are met
Traditional HTML parsers see only the initial HTML skeleton, missing the dynamically generated content. LLMs face the same limitation—they need the fully rendered HTML to extract meaningful data.
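To see the limitation concretely, here is a minimal sketch (the URL and CSS class are placeholders for a hypothetical product page): a plain HTTP fetch parsed with BeautifulSoup returns only the server-sent skeleton, so selectors for JavaScript-rendered elements typically come back empty.

import requests
from bs4 import BeautifulSoup

# Fetch the raw HTML without executing any JavaScript
response = requests.get('https://example.com/products')
soup = BeautifulSoup(response.text, 'html.parser')

# On a client-side-rendered page this is typically empty, because the
# product cards are injected by JavaScript after the initial load
products = soup.select('.product-item')
print(f"Products found without JS rendering: {len(products)}")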
The Two-Step Solution
Step 1: Render the Page with a Headless Browser
First, use a headless browser to execute JavaScript and capture the fully rendered HTML. Here's how to do it with Puppeteer in Node.js:
const puppeteer = require('puppeteer');

async function getRenderedHTML(url) {
  const browser = await puppeteer.launch({
    headless: 'new',
    args: ['--no-sandbox', '--disable-setuid-sandbox']
  });
  const page = await browser.newPage();

  // Navigate to the page
  await page.goto(url, {
    waitUntil: 'networkidle2', // Wait until network is idle
    timeout: 30000
  });

  // Wait for specific content if needed
  await page.waitForSelector('.product-list', { timeout: 10000 });

  // Get the fully rendered HTML
  const html = await page.content();
  await browser.close();
  return html;
}

// Usage (wrapped in an async IIFE, since top-level await is not available in CommonJS scripts)
(async () => {
  const url = 'https://example.com/products';
  const renderedHTML = await getRenderedHTML(url);
})();
In Python using Playwright:
from playwright.sync_api import sync_playwright

def get_rendered_html(url):
    with sync_playwright() as p:
        # Launch browser
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()

        # Navigate and wait for content
        page.goto(url, wait_until='networkidle')

        # Wait for specific elements
        page.wait_for_selector('.product-list', timeout=10000)

        # Get rendered HTML
        html = page.content()
        browser.close()
        return html

# Usage
url = 'https://example.com/products'
rendered_html = get_rendered_html(url)
For more advanced scenarios like handling AJAX requests using Puppeteer or working with dynamic single-page applications, you may need additional wait strategies.
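For example, when the data arrives through a background XHR or fetch call, you can wait for that specific response instead of waiting for the whole network to go idle. Below is a minimal Playwright sketch of this idea (the /api/products path is a placeholder for whatever endpoint the page actually calls); in Puppeteer, page.waitForResponse serves the same purpose.

from playwright.sync_api import sync_playwright

def get_html_after_ajax(url):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()

        # Start waiting for the data request, then trigger navigation
        with page.expect_response(lambda r: '/api/products' in r.url and r.ok):
            page.goto(url)

        # The data has arrived; wait for the framework to render it
        page.wait_for_selector('.product-list', timeout=10000)
        html = page.content()
        browser.close()
        return html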
Step 2: Extract Data with an LLM
Once you have the rendered HTML, pass it to an LLM with a structured extraction prompt:
import openai
import json

def extract_with_llm(html, extraction_schema):
    """
    Extract structured data from HTML using OpenAI's GPT model

    Args:
        html: The fully rendered HTML content
        extraction_schema: JSON schema describing what to extract
    """
    client = openai.OpenAI(api_key='your-api-key')

    prompt = f"""
    Extract the following data from this HTML content.
    Return the data as valid JSON matching this schema:

    {json.dumps(extraction_schema, indent=2)}

    HTML content:
    {html}

    Return only the JSON data, no additional text.
    """

    response = client.chat.completions.create(
        model="gpt-4-turbo-preview",
        messages=[
            {"role": "system", "content": "You are a data extraction assistant. Extract structured data from HTML and return valid JSON."},
            {"role": "user", "content": prompt}
        ],
        temperature=0,
        response_format={"type": "json_object"}
    )

    return json.loads(response.choices[0].message.content)

# Define what you want to extract
schema = {
    "products": [
        {
            "name": "string",
            "price": "number",
            "rating": "number",
            "availability": "string"
        }
    ]
}

# Extract data
extracted_data = extract_with_llm(rendered_html, schema)
print(json.dumps(extracted_data, indent=2))
Using Claude API for extraction:
import anthropic
import json

def extract_with_claude(html):
    """
    Extract structured data using the Claude API with tool use (function calling)
    """
    client = anthropic.Anthropic(api_key='your-api-key')

    # Define the extraction tool; its input schema describes the data to extract
    tools = [{
        "name": "extract_product_data",
        "description": "Extract product information from HTML",
        "input_schema": {
            "type": "object",
            "properties": {
                "products": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "name": {"type": "string"},
                            "price": {"type": "number"},
                            "rating": {"type": "number"},
                            "availability": {"type": "string"}
                        },
                        "required": ["name", "price"]
                    }
                }
            },
            "required": ["products"]
        }
    }]

    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=4096,
        tools=tools,
        messages=[{
            "role": "user",
            "content": f"Extract all product data from this HTML:\n\n{html}"
        }]
    )

    # Extract the tool call result
    for content in message.content:
        if content.type == "tool_use":
            return content.input
    return None

# Usage
extracted_data = extract_with_claude(rendered_html)
Complete End-to-End Example
Here's a full example combining both steps in Python:
from playwright.sync_api import sync_playwright
import anthropic
import json

def scrape_js_rendered_page_with_llm(url):
    """
    Complete pipeline: Render JS page and extract data with LLM
    """
    # Step 1: Render the page
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until='networkidle')
        page.wait_for_selector('.product-item', timeout=15000)
        html = page.content()
        browser.close()

    # Step 2: Extract with LLM
    client = anthropic.Anthropic(api_key='your-api-key')

    tools = [{
        "name": "extract_products",
        "description": "Extract product listings",
        "input_schema": {
            "type": "object",
            "properties": {
                "products": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "title": {"type": "string"},
                            "price": {"type": "string"},
                            "image_url": {"type": "string"},
                            "description": {"type": "string"}
                        }
                    }
                }
            }
        }
    }]

    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=4096,
        tools=tools,
        messages=[{
            "role": "user",
            "content": f"Extract all products from this e-commerce page HTML:\n\n{html[:50000]}"  # Limit context size
        }]
    )

    for content in message.content:
        if content.type == "tool_use":
            return content.input
    return None

# Run the scraper
products = scrape_js_rendered_page_with_llm('https://example-shop.com/products')
print(json.dumps(products, indent=2))
Advanced Techniques
Waiting for Dynamic Content
When crawling single-page applications using Puppeteer, you may need to wait for specific conditions:
async function waitForDynamicContent(page) {
  // Wait for network to be idle
  await page.waitForNetworkIdle();

  // Wait for specific element
  await page.waitForSelector('.loaded-content');

  // Wait for custom condition
  await page.waitForFunction(() => {
    return document.querySelectorAll('.product-item').length > 10;
  });

  // Additional delay for animations
  await page.waitForTimeout(1000);
}
Handling Infinite Scroll
For pages with infinite scroll:
from playwright.sync_api import sync_playwright

def scrape_infinite_scroll(url):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until='networkidle')

        # Scroll to load more content
        previous_height = 0
        while True:
            # Scroll to bottom
            page.evaluate('window.scrollTo(0, document.body.scrollHeight)')
            page.wait_for_timeout(2000)

            # Check if new content loaded
            current_height = page.evaluate('document.body.scrollHeight')
            if current_height == previous_height:
                break
            previous_height = current_height

        html = page.content()
        browser.close()
        return html
Chunking Large HTML for LLMs
Large pages may exceed LLM context limits. Extract relevant sections first:
from bs4 import BeautifulSoup

def extract_relevant_content(html, selector='.main-content'):
    """
    Extract only the relevant part of HTML before sending to LLM
    """
    soup = BeautifulSoup(html, 'html.parser')

    # Find the main content area
    main_content = soup.select_one(selector)
    if main_content:
        # Remove unnecessary elements
        for tag in main_content.find_all(['script', 'style', 'noscript']):
            tag.decompose()
        return str(main_content)
    return html

# Process before sending to LLM
cleaned_html = extract_relevant_content(rendered_html, '.product-grid')
extracted_data = extract_with_llm(cleaned_html, schema)
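If the relevant section is still larger than the model's context window, a rough chunking sketch like the one below can help. It assumes the extract_with_llm helper and schema defined in Step 2, and uses an arbitrary character budget rather than a real token count:

def extract_in_chunks(cleaned_html, schema, chunk_size=40000):
    """Split oversized HTML into character-based chunks and merge the extracted products."""
    chunks = [cleaned_html[i:i + chunk_size] for i in range(0, len(cleaned_html), chunk_size)]
    all_products = []
    for chunk in chunks:
        result = extract_with_llm(chunk, schema)
        all_products.extend(result.get("products", []))

    # Items split across a chunk boundary may be missed or duplicated,
    # so deduplicate on a reasonably stable field such as the product name
    seen = set()
    deduped = []
    for product in all_products:
        key = product.get("name")
        if key not in seen:
            seen.add(key)
            deduped.append(product)
    return {"products": deduped}

Overlapping the chunks slightly reduces the chance of losing items that straddle a boundary, at the cost of more duplicate handling.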
When to Use LLMs vs Traditional Selectors
Use LLMs when:
- HTML structure changes frequently
- Data isn't in consistent formats
- You need semantic understanding (e.g., identifying product features from descriptions)
- Multiple page layouts need one extractor
- Dealing with unstructured text content
Use traditional CSS/XPath selectors when:
- HTML structure is stable
- Speed is critical (LLMs are slower)
- Cost is a concern (LLM API calls cost money)
- Data is in consistent, predictable locations
- Simple tabular data extraction
Best approach: Combine both methods. Use headless browsers with traditional selectors when possible, and fall back to LLMs for complex or unpredictable content, as in the sketch below.
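Here is a minimal sketch of that hybrid strategy, assuming the extract_with_llm helper and schema from Step 2 and placeholder CSS classes for the target site:

from bs4 import BeautifulSoup

def extract_products_hybrid(rendered_html, schema):
    """Try cheap CSS selectors first; fall back to LLM extraction if they find nothing."""
    soup = BeautifulSoup(rendered_html, 'html.parser')
    products = []
    for item in soup.select('.product-item'):
        name = item.select_one('.product-name')
        price = item.select_one('.product-price')
        if name and price:
            products.append({
                "name": name.get_text(strip=True),
                "price": price.get_text(strip=True),
            })

    if products:
        return {"products": products}

    # Selectors found nothing (the layout changed or the classes differ),
    # so fall back to the more expensive LLM extraction
    return extract_with_llm(rendered_html, schema)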
Using WebScraping.AI API
For a simpler solution, use WebScraping.AI's built-in JavaScript rendering with LLM-powered extraction:
import requests

def scrape_with_api(url, question):
    """
    Use WebScraping.AI API with automatic JS rendering and LLM extraction
    """
    response = requests.get(
        'https://api.webscraping.ai/ai',
        params={
            'api_key': 'YOUR_API_KEY',
            'url': url,
            'question': question,
            'js': True,  # Enable JavaScript rendering
            'wait_for': '.product-list'  # Wait for specific element
        }
    )
    return response.json()

# Extract product data
result = scrape_with_api(
    'https://example.com/products',
    'Extract all products with their names, prices, and ratings as JSON'
)
print(result)
This handles both JavaScript rendering and LLM extraction in a single API call, saving you infrastructure and maintenance overhead.
Best Practices
- Optimize wait strategies: Don't wait longer than necessary; use specific selectors instead of fixed timeouts
- Minimize HTML sent to LLMs: Extract only relevant sections to reduce costs and improve accuracy
- Use structured outputs: Always request JSON with a specific schema for consistent results
- Cache rendered pages: If scraping the same page multiple times, cache the rendered HTML
- Handle rate limits: Both headless browsers and LLM APIs have limits; implement retry logic with backoff (see the sketch after this list)
- Monitor costs: LLM API calls can become expensive at scale; consider traditional parsing for high-volume tasks
- Handle errors: Implement robust error handling for both browser automation and LLM API calls
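As a starting point for the retry logic mentioned above, here is a minimal sketch around the extract_with_llm helper from Step 2; the broad exception handling and delays are illustrative, and in real code you would catch only the client's rate-limit and timeout errors:

import time

def extract_with_retry(html, schema, max_attempts=3, base_delay=2.0):
    """Retry LLM extraction with exponential backoff on transient failures."""
    for attempt in range(1, max_attempts + 1):
        try:
            return extract_with_llm(html, schema)
        except Exception as exc:  # narrow to rate-limit/timeout exceptions in production
            if attempt == max_attempts:
                raise
            delay = base_delay * (2 ** (attempt - 1))
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.0f}s")
            time.sleep(delay)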
Conclusion
LLMs can absolutely extract data from JavaScript-rendered pages, but they require a two-step process: first rendering the page with a headless browser, then using the LLM to extract structured data from the resulting HTML. This combination provides a powerful, flexible solution for modern web scraping, especially when dealing with complex, dynamic websites where traditional selectors fall short.
For production use cases, consider using dedicated APIs like WebScraping.AI that handle both JavaScript rendering and intelligent data extraction, letting you focus on your application logic rather than infrastructure management.