How Do I Handle Dynamic Websites with LLM-Based Web Scraping?
Dynamic websites that rely heavily on JavaScript to render content present unique challenges for traditional web scraping approaches. When combined with Large Language Models (LLMs), you need a two-stage approach: first, render the dynamic content using browser automation tools, then extract and structure the data using LLMs. This hybrid approach leverages the strengths of both technologies to handle even the most complex modern web applications.
Understanding the Challenge
Dynamic websites use JavaScript frameworks like React, Vue, or Angular to render content client-side. When you fetch the HTML directly using standard HTTP libraries, you only get the initial page skeleton without the JavaScript-rendered content. LLMs can't execute JavaScript, so you must first render the page fully before passing it to the LLM for data extraction.
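To see the gap concretely, here is a minimal sketch (the URL and container IDs are placeholders) that fetches a JavaScript-rendered page with a plain HTTP request; on such sites the response is typically just an empty application shell:

import requests
from bs4 import BeautifulSoup

# Plain HTTP fetch: only the initial HTML skeleton comes back
raw_html = requests.get('https://example.com/products', timeout=30).text
soup = BeautifulSoup(raw_html, 'html.parser')

# On client-side rendered sites the app container is usually empty at this point
root = soup.select_one('#root') or soup.select_one('#app')
print(root.get_text(strip=True) if root else 'No app container found')
# Typically prints an empty string -- the products only appear after JavaScript runs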
The Two-Stage Approach
Stage 1: Render Dynamic Content with Browser Automation
Use headless browsers to execute JavaScript and wait for content to load. The most popular tools are Puppeteer (Node.js) and Playwright (multi-language support).
Using Puppeteer with LLMs (JavaScript)
const puppeteer = require('puppeteer');
const OpenAI = require('openai');

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY
});

async function scrapeDynamicSite(url) {
  // Launch browser and render page
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'networkidle0' });

  // Wait for specific dynamic content to load
  await page.waitForSelector('.product-list', { timeout: 10000 });

  // Extract the fully rendered HTML
  const html = await page.content();
  await browser.close();

  // Pass rendered HTML to LLM for extraction
  const completion = await openai.chat.completions.create({
    model: "gpt-4-turbo",
    messages: [
      {
        role: "system",
        content: "You are a data extraction assistant. Extract product information from the HTML and return it as JSON."
      },
      {
        role: "user",
        content: `Extract all products with their names, prices, and ratings from this HTML:\n\n${html}`
      }
    ],
    response_format: { type: "json_object" }
  });

  return JSON.parse(completion.choices[0].message.content);
}

// Usage
scrapeDynamicSite('https://example.com/products')
  .then(data => console.log(data))
  .catch(error => console.error(error));
Using Playwright with LLMs (Python)
from playwright.sync_api import sync_playwright
from openai import OpenAI
import json
client = OpenAI(api_key="your-api-key")
def scrape_dynamic_site(url):
    with sync_playwright() as p:
        # Launch browser
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()

        # Navigate and wait for dynamic content
        page.goto(url, wait_until='networkidle')

        # Wait for specific elements to ensure JavaScript has rendered
        page.wait_for_selector('.product-list', timeout=10000)

        # Scroll to load lazy-loaded content
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        page.wait_for_timeout(2000)

        # Get fully rendered HTML
        html_content = page.content()
        browser.close()

    # Extract data using LLM
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[
            {
                "role": "system",
                "content": "Extract product information from HTML and return as JSON with fields: name, price, rating, availability."
            },
            {
                "role": "user",
                "content": f"Extract all products from this HTML:\n\n{html_content}"
            }
        ],
        response_format={"type": "json_object"}
    )
    return json.loads(response.choices[0].message.content)
# Usage
products = scrape_dynamic_site('https://example.com/products')
print(json.dumps(products, indent=2))
Stage 2: Optimize Content Before LLM Processing
Since LLMs have token limits and processing costs scale with input size, optimize the HTML before sending it to the LLM.
Remove Unnecessary Elements
from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup
def get_cleaned_content(url):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until='networkidle')

        # Wait for content
        page.wait_for_selector('main', timeout=10000)
        html = page.content()
        browser.close()

    # Clean HTML with BeautifulSoup
    soup = BeautifulSoup(html, 'html.parser')

    # Remove scripts, styles, and other noise
    for element in soup(['script', 'style', 'nav', 'footer', 'header', 'iframe', 'noscript']):
        element.decompose()

    # Extract only the main content area
    main_content = soup.find('main') or soup.find('article') or soup.body

    # Return cleaned text or HTML
    return str(main_content) if main_content else str(soup)
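Cleaning like this can shrink the payload considerably. To gauge the savings before calling the model, you can count tokens with tiktoken; this sketch assumes the cl100k_base encoding, which may not match every model exactly:

import tiktoken

def estimate_tokens(text, encoding_name='cl100k_base'):
    # Rough token count for budgeting LLM requests
    encoding = tiktoken.get_encoding(encoding_name)
    return len(encoding.encode(text))

cleaned = get_cleaned_content('https://example.com/products')
print(f"Cleaned content is roughly {estimate_tokens(cleaned)} tokens")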
Advanced Techniques for Dynamic Content
Handling Infinite Scroll
Many modern websites use infinite scroll to load content dynamically. You need to scroll programmatically to trigger content loading:
async function scrapeInfiniteScroll(url, scrolls = 5) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'networkidle0' });

  // Scroll multiple times to load more content
  for (let i = 0; i < scrolls; i++) {
    await page.evaluate(() => {
      window.scrollTo(0, document.body.scrollHeight);
    });
    // Wait for new content to load (plain delay; page.waitForTimeout was removed in recent Puppeteer versions)
    await new Promise(resolve => setTimeout(resolve, 2000));
  }

  const html = await page.content();
  await browser.close();

  // Now pass to LLM for extraction
  return html;
}
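If you prefer Playwright in Python, a similar sketch scrolls until the page height stops growing instead of using a fixed number of passes (a heuristic that works on many, but not all, infinite-scroll pages):

def scroll_until_stable(page, max_scrolls=20, pause_ms=2000):
    # Keep scrolling until the document height stops changing or the limit is hit
    last_height = page.evaluate("document.body.scrollHeight")
    for _ in range(max_scrolls):
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        page.wait_for_timeout(pause_ms)
        new_height = page.evaluate("document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height
    return page.content()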
Handling AJAX and API Calls
Instead of scraping rendered HTML, you can monitor network requests to capture API responses directly:
from playwright.sync_api import sync_playwright
import json
def intercept_api_data(url):
    api_responses = []

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()

        # Collect JSON payloads from API responses as the page loads
        def handle_response(response):
            if 'api' in response.url and response.status == 200:
                try:
                    api_responses.append(response.json())
                except Exception:
                    pass  # ignore non-JSON responses

        page.on('response', handle_response)
        page.goto(url, wait_until='networkidle')
        page.wait_for_timeout(3000)
        browser.close()

    return api_responses
This approach is more efficient because API responses are already structured (usually JSON), requiring less LLM processing.
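For example, the captured payloads can be sent to the LLM with a much smaller prompt than the full HTML. This sketch reuses intercept_api_data above plus the OpenAI client and json import from the earlier Playwright example; the truncation limit is an arbitrary safeguard to tune to your token budget:

def extract_from_api_responses(url):
    # API payloads are already structured, so the prompt stays compact
    responses = intercept_api_data(url)
    completion = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[
            {
                "role": "system",
                "content": "Normalize the product data in this JSON and return JSON with fields: name, price, rating."
            },
            {
                "role": "user",
                "content": json.dumps(responses)[:50000]  # defensive truncation
            }
        ],
        response_format={"type": "json_object"}
    )
    return json.loads(completion.choices[0].message.content)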
Handling Single Page Applications (SPAs)
SPAs require special attention because content changes without page reloads. You need to handle AJAX requests and wait for specific state changes:
async function scrapeSPA(url, navigationSelector) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'networkidle0' });

  // Click navigation and wait for content update
  await page.click(navigationSelector);
  await page.waitForFunction(
    'document.querySelector(".content").innerText.length > 100',
    { timeout: 5000 }
  );

  const html = await page.content();
  await browser.close();
  return html;
}
Combining with LLM Function Calling
Modern LLMs support function calling (also called tool use), which provides more reliable structured output:
from openai import OpenAI
import json
client = OpenAI()
tools = [
    {
        "type": "function",
        "function": {
            "name": "extract_products",
            "description": "Extract product information from webpage",
            "parameters": {
                "type": "object",
                "properties": {
                    "products": {
                        "type": "array",
                        "items": {
                            "type": "object",
                            "properties": {
                                "name": {"type": "string"},
                                "price": {"type": "number"},
                                "currency": {"type": "string"},
                                "rating": {"type": "number"},
                                "reviews": {"type": "integer"},
                                "in_stock": {"type": "boolean"}
                            },
                            "required": ["name", "price"]
                        }
                    }
                },
                "required": ["products"]
            }
        }
    }
]
def extract_with_function_calling(html_content):
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[
            {
                "role": "user",
                "content": f"Extract product data from this HTML:\n\n{html_content}"
            }
        ],
        tools=tools,
        tool_choice={"type": "function", "function": {"name": "extract_products"}}
    )
    tool_call = response.choices[0].message.tool_calls[0]
    return json.loads(tool_call.function.arguments)
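Putting the pieces together, a short usage sketch (the URL is a placeholder) that feeds the cleaned, rendered HTML from get_cleaned_content into the function-calling extractor:

# Usage: render and clean the page, then extract structured data
html = get_cleaned_content('https://example.com/products')
result = extract_with_function_calling(html)
print(json.dumps(result, indent=2))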
Best Practices
1. Wait Strategies
Choose the appropriate wait strategy based on your target site:
# Wait for network idle (all resources loaded)
page.goto(url, wait_until='networkidle')
# Wait for specific element
page.wait_for_selector('.product-card', timeout=10000)
# Wait for custom condition
page.wait_for_function('document.querySelectorAll(".item").length > 10')
# Fixed timeout (use sparingly)
page.wait_for_timeout(3000)
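In practice these strategies are often combined, falling back from a precise check to a coarser one. A small sketch with a hypothetical .product-card selector:

from playwright.sync_api import TimeoutError as PlaywrightTimeoutError

def wait_for_products(page):
    # Prefer an explicit selector; fall back to network idle, then a fixed pause
    try:
        page.wait_for_selector('.product-card', timeout=5000)
    except PlaywrightTimeoutError:
        try:
            page.wait_for_load_state('networkidle', timeout=10000)
        except PlaywrightTimeoutError:
            page.wait_for_timeout(3000)  # last resort: fixed delay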
2. Error Handling and Retries
Dynamic websites can be unpredictable. Implement robust error handling:
import time
from playwright.sync_api import sync_playwright

def scrape_with_retry(url, max_retries=3):
    for attempt in range(max_retries):
        try:
            with sync_playwright() as p:
                browser = p.chromium.launch()
                page = browser.new_page()
                page.set_default_timeout(30000)
                page.goto(url, wait_until='networkidle')
                page.wait_for_selector('.content', timeout=10000)
                html = page.content()
                browser.close()
                return html
        except Exception as e:
            print(f"Attempt {attempt + 1} failed: {e}")
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)  # Exponential backoff
3. Token Optimization
Reduce LLM costs by sending only relevant content:
from bs4 import BeautifulSoup
def extract_relevant_sections(html, selectors):
    """Extract only specific sections instead of entire page"""
    soup = BeautifulSoup(html, 'html.parser')
    relevant_content = []
    for selector in selectors:
        elements = soup.select(selector)
        relevant_content.extend([str(elem) for elem in elements])
    return '\n'.join(relevant_content)

# Usage
cleaned_html = extract_relevant_sections(
    html,
    ['.product-card', '.product-listing', '#main-content']
)
4. Caching Rendered Pages
For frequently accessed dynamic pages, cache the rendered HTML to reduce browser automation overhead:
import hashlib
import os
import time
def get_cached_or_scrape(url, cache_dir='./cache', cache_ttl=3600):
    url_hash = hashlib.md5(url.encode()).hexdigest()
    cache_file = os.path.join(cache_dir, f"{url_hash}.html")

    # Check if cache exists and is fresh
    if os.path.exists(cache_file):
        cache_age = time.time() - os.path.getmtime(cache_file)
        if cache_age < cache_ttl:
            with open(cache_file, 'r') as f:
                return f.read()

    # Scrape and cache (use any function that returns rendered HTML,
    # e.g. scrape_with_retry defined above)
    html = scrape_with_retry(url)
    os.makedirs(cache_dir, exist_ok=True)
    with open(cache_file, 'w') as f:
        f.write(html)
    return html
Using WebScraping.AI API for Dynamic Content
Instead of managing browser automation yourself, you can use a web scraping API that handles JavaScript rendering and can be combined with LLMs:
import requests
from openai import OpenAI
def scrape_with_api(url):
    # Get rendered HTML from WebScraping.AI
    response = requests.get(
        'https://api.webscraping.ai/html',
        params={
            'url': url,
            'api_key': 'YOUR_API_KEY',
            'js': 'true',  # Enable JavaScript rendering
            'wait_for': '.product-list'  # Wait for specific selector
        }
    )
    html = response.text

    # Process with LLM
    client = OpenAI()
    completion = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[
            {"role": "user", "content": f"Extract products from:\n{html}"}
        ]
    )
    return completion.choices[0].message.content
Conclusion
Handling dynamic websites with LLM-based web scraping requires combining browser automation tools with LLM capabilities. The key is to first render the JavaScript-heavy content using tools like Puppeteer or Playwright, then leverage LLMs for intelligent data extraction from the rendered HTML. By implementing proper wait strategies, optimizing content before LLM processing, and using structured outputs through function calling, you can build robust scrapers for even the most complex modern web applications.
This hybrid approach gives you the best of both worlds: the ability to handle dynamic JavaScript content through browser automation, and the intelligent, flexible data extraction capabilities of LLMs. Remember to implement proper error handling, respect rate limits, and optimize for token usage to keep costs manageable while maintaining reliability.