How Do I Handle Dynamic Websites with LLM-Based Web Scraping?
Dynamic websites that rely heavily on JavaScript to render content present unique challenges for traditional web scraping approaches. When combined with Large Language Models (LLMs), you need a two-stage approach: first, render the dynamic content using browser automation tools, then extract and structure the data using LLMs. This hybrid approach leverages the strengths of both technologies to handle even the most complex modern web applications.
Understanding the Challenge
Dynamic websites use JavaScript frameworks like React, Vue, or Angular to render content client-side. When you fetch the HTML directly using standard HTTP libraries, you only get the initial page skeleton without the JavaScript-rendered content. LLMs can't execute JavaScript, so you must first render the page fully before passing it to the LLM for data extraction.
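To see the gap concretely, here is a minimal sketch (the URL and container IDs are placeholders) that fetches a JavaScript-rendered page with a plain HTTP request; on such sites the response is typically just an empty application shell:

import requests
from bs4 import BeautifulSoup

# Plain HTTP fetch: only the initial HTML skeleton comes back
raw_html = requests.get('https://example.com/products', timeout=30).text
soup = BeautifulSoup(raw_html, 'html.parser')

# On client-side rendered sites the app container is usually empty at this point
root = soup.select_one('#root') or soup.select_one('#app')
print(root.get_text(strip=True) if root else 'No app container found')
# Typically prints an empty string -- the products only appear after JavaScript runs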
The Two-Stage Approach
Stage 1: Render Dynamic Content with Browser Automation
Use headless browsers to execute JavaScript and wait for content to load. The most popular tools are Puppeteer (Node.js) and Playwright (multi-language support).
Using Puppeteer with LLMs (JavaScript)
const puppeteer = require('puppeteer');
const OpenAI = require('openai');

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY
});

async function scrapeDynamicSite(url) {
  // Launch browser and render page
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'networkidle0' });

  // Wait for specific dynamic content to load
  await page.waitForSelector('.product-list', { timeout: 10000 });

  // Extract the fully rendered HTML
  const html = await page.content();
  await browser.close();

  // Pass rendered HTML to LLM for extraction
  const completion = await openai.chat.completions.create({
    model: "gpt-4-turbo",
    messages: [
      {
        role: "system",
        content: "You are a data extraction assistant. Extract product information from the HTML and return it as JSON."
      },
      {
        role: "user",
        content: `Extract all products with their names, prices, and ratings from this HTML:\n\n${html}`
      }
    ],
    response_format: { type: "json_object" }
  });

  return JSON.parse(completion.choices[0].message.content);
}

// Usage
scrapeDynamicSite('https://example.com/products')
  .then(data => console.log(data))
  .catch(error => console.error(error));
Using Playwright with LLMs (Python)
from playwright.sync_api import sync_playwright
from openai import OpenAI
import json
client = OpenAI(api_key="your-api-key")
def scrape_dynamic_site(url):
    with sync_playwright() as p:
        # Launch browser
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()

        # Navigate and wait for dynamic content
        page.goto(url, wait_until='networkidle')

        # Wait for specific elements to ensure JavaScript has rendered
        page.wait_for_selector('.product-list', timeout=10000)

        # Scroll to load lazy-loaded content
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        page.wait_for_timeout(2000)

        # Get fully rendered HTML
        html_content = page.content()
        browser.close()

    # Extract data using LLM
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[
            {
                "role": "system",
                "content": "Extract product information from HTML and return as JSON with fields: name, price, rating, availability."
            },
            {
                "role": "user",
                "content": f"Extract all products from this HTML:\n\n{html_content}"
            }
        ],
        response_format={"type": "json_object"}
    )
    return json.loads(response.choices[0].message.content)
# Usage
products = scrape_dynamic_site('https://example.com/products')
print(json.dumps(products, indent=2))
Stage 2: Optimize Content Before LLM Processing
Since LLMs have token limits and processing costs scale with input size, optimize the HTML before sending it to the LLM.
Remove Unnecessary Elements
from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup
def get_cleaned_content(url):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until='networkidle')

        # Wait for content
        page.wait_for_selector('main', timeout=10000)
        html = page.content()
        browser.close()

    # Clean HTML with BeautifulSoup
    soup = BeautifulSoup(html, 'html.parser')

    # Remove scripts, styles, and other noise
    for element in soup(['script', 'style', 'nav', 'footer', 'header', 'iframe', 'noscript']):
        element.decompose()

    # Extract only the main content area
    main_content = soup.find('main') or soup.find('article') or soup.body

    # Return cleaned text or HTML
    return str(main_content) if main_content else str(soup)
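Cleaning like this can shrink the payload considerably. To gauge the savings before calling the model, you can count tokens with tiktoken; this sketch assumes the cl100k_base encoding, which may not match every model exactly:

import tiktoken

def estimate_tokens(text, encoding_name='cl100k_base'):
    # Rough token count for budgeting LLM requests
    encoding = tiktoken.get_encoding(encoding_name)
    return len(encoding.encode(text))

cleaned = get_cleaned_content('https://example.com/products')
print(f"Cleaned content is roughly {estimate_tokens(cleaned)} tokens")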
Advanced Techniques for Dynamic Content
Handling Infinite Scroll
Many modern websites use infinite scroll to load content dynamically. You need to scroll programmatically to trigger content loading:
async function scrapeInfiniteScroll(url, scrolls = 5) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'networkidle0' });

  // Scroll multiple times to load more content
  for (let i = 0; i < scrolls; i++) {
    await page.evaluate(() => {
      window.scrollTo(0, document.body.scrollHeight);
    });
    // Wait for new content to load (plain delay; page.waitForTimeout was removed in recent Puppeteer versions)
    await new Promise(resolve => setTimeout(resolve, 2000));
  }

  const html = await page.content();
  await browser.close();

  // Now pass to LLM for extraction
  return html;
}
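If you prefer Playwright in Python, a similar sketch scrolls until the page height stops growing instead of using a fixed number of passes (a heuristic that works on many, but not all, infinite-scroll pages):

def scroll_until_stable(page, max_scrolls=20, pause_ms=2000):
    # Keep scrolling until the document height stops changing or the limit is hit
    last_height = page.evaluate("document.body.scrollHeight")
    for _ in range(max_scrolls):
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        page.wait_for_timeout(pause_ms)
        new_height = page.evaluate("document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height
    return page.content()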
Handling AJAX and API Calls
Instead of scraping rendered HTML, you can monitor network requests to capture API responses directly:
from playwright.sync_api import sync_playwright
import json
def intercept_api_data(url):
    api_responses = []

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()

        # Collect JSON payloads from API responses as the page loads
        def handle_response(response):
            if 'api' in response.url and response.status == 200:
                try:
                    api_responses.append(response.json())
                except Exception:
                    pass  # ignore non-JSON responses

        page.on('response', handle_response)
        page.goto(url, wait_until='networkidle')
        page.wait_for_timeout(3000)
        browser.close()

    return api_responses
This approach is more efficient because API responses are already structured (usually JSON), requiring less LLM processing.
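For example, the captured payloads can be sent to the LLM with a much smaller prompt than the full HTML. This sketch reuses intercept_api_data above plus the OpenAI client and json import from the earlier Playwright example; the truncation limit is an arbitrary safeguard to tune to your token budget:

def extract_from_api_responses(url):
    # API payloads are already structured, so the prompt stays compact
    responses = intercept_api_data(url)
    completion = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[
            {
                "role": "system",
                "content": "Normalize the product data in this JSON and return JSON with fields: name, price, rating."
            },
            {
                "role": "user",
                "content": json.dumps(responses)[:50000]  # defensive truncation
            }
        ],
        response_format={"type": "json_object"}
    )
    return json.loads(completion.choices[0].message.content)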
Handling Single Page Applications (SPAs)
SPAs require special attention because content changes without page reloads. You need to handle AJAX requests and wait for specific state changes:
async function scrapeSPA(url, navigationSelector) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'networkidle0' });

  // Click navigation and wait for content update
  await page.click(navigationSelector);
  await page.waitForFunction(
    'document.querySelector(".content").innerText.length > 100',
    { timeout: 5000 }
  );

  const html = await page.content();
  await browser.close();
  return html;
}
Combining with LLM Function Calling
Modern LLMs support function calling (also called tool use), which provides more reliable structured output:
from openai import OpenAI
import json
client = OpenAI()
tools = [
    {
        "type": "function",
        "function": {
            "name": "extract_products",
            "description": "Extract product information from webpage",
            "parameters": {
                "type": "object",
                "properties": {
                    "products": {
                        "type": "array",
                        "items": {
                            "type": "object",
                            "properties": {
                                "name": {"type": "string"},
                                "price": {"type": "number"},
                                "currency": {"type": "string"},
                                "rating": {"type": "number"},
                                "reviews": {"type": "integer"},
                                "in_stock": {"type": "boolean"}
                            },
                            "required": ["name", "price"]
                        }
                    }
                },
                "required": ["products"]
            }
        }
    }
]
def extract_with_function_calling(html_content):
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[
            {
                "role": "user",
                "content": f"Extract product data from this HTML:\n\n{html_content}"
            }
        ],
        tools=tools,
        tool_choice={"type": "function", "function": {"name": "extract_products"}}
    )
    tool_call = response.choices[0].message.tool_calls[0]
    return json.loads(tool_call.function.arguments)
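Putting the pieces together, a short usage sketch (the URL is a placeholder) that feeds the cleaned, rendered HTML from get_cleaned_content into the function-calling extractor:

# Usage: render and clean the page, then extract structured data
html = get_cleaned_content('https://example.com/products')
result = extract_with_function_calling(html)
print(json.dumps(result, indent=2))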
Best Practices
1. Wait Strategies
Choose the appropriate wait strategy based on your target site:
# Wait for network idle (all resources loaded)
page.goto(url, wait_until='networkidle')
# Wait for specific element
page.wait_for_selector('.product-card', timeout=10000)
# Wait for custom condition
page.wait_for_function('document.querySelectorAll(".item").length > 10')
# Fixed timeout (use sparingly)
page.wait_for_timeout(3000)
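In practice these strategies are often combined, falling back from a precise check to a coarser one. A small sketch with a hypothetical .product-card selector:

from playwright.sync_api import TimeoutError as PlaywrightTimeoutError

def wait_for_products(page):
    # Prefer an explicit selector; fall back to network idle, then a fixed pause
    try:
        page.wait_for_selector('.product-card', timeout=5000)
    except PlaywrightTimeoutError:
        try:
            page.wait_for_load_state('networkidle', timeout=10000)
        except PlaywrightTimeoutError:
            page.wait_for_timeout(3000)  # last resort: fixed delay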
2. Error Handling and Retries
Dynamic websites can be unpredictable. Implement robust error handling:
import time
from playwright.sync_api import sync_playwright

def scrape_with_retry(url, max_retries=3):
    for attempt in range(max_retries):
        try:
            with sync_playwright() as p:
                browser = p.chromium.launch()
                page = browser.new_page()
                page.set_default_timeout(30000)
                page.goto(url, wait_until='networkidle')
                page.wait_for_selector('.content', timeout=10000)
                html = page.content()
                browser.close()
                return html
        except Exception as e:
            print(f"Attempt {attempt + 1} failed: {e}")
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)  # Exponential backoff
3. Token Optimization
Reduce LLM costs by sending only relevant content:
from bs4 import BeautifulSoup
def extract_relevant_sections(html, selectors):
    """Extract only specific sections instead of entire page"""
    soup = BeautifulSoup(html, 'html.parser')
    relevant_content = []
    for selector in selectors:
        elements = soup.select(selector)
        relevant_content.extend([str(elem) for elem in elements])
    return '\n'.join(relevant_content)

# Usage
cleaned_html = extract_relevant_sections(
    html,
    ['.product-card', '.product-listing', '#main-content']
)
4. Caching Rendered Pages
For frequently accessed dynamic pages, cache the rendered HTML to reduce browser automation overhead:
import hashlib
import os
import time
def get_cached_or_scrape(url, cache_dir='./cache', cache_ttl=3600):
    url_hash = hashlib.md5(url.encode()).hexdigest()
    cache_file = os.path.join(cache_dir, f"{url_hash}.html")

    # Check if cache exists and is fresh
    if os.path.exists(cache_file):
        cache_age = time.time() - os.path.getmtime(cache_file)
        if cache_age < cache_ttl:
            with open(cache_file, 'r') as f:
                return f.read()

    # Scrape and cache (use any function that returns rendered HTML,
    # e.g. scrape_with_retry defined above)
    html = scrape_with_retry(url)
    os.makedirs(cache_dir, exist_ok=True)
    with open(cache_file, 'w') as f:
        f.write(html)
    return html
Using WebScraping.AI API for Dynamic Content
Instead of managing browser automation yourself, you can use a web scraping API that handles JavaScript rendering and can be combined with LLMs:
import requests
from openai import OpenAI
def scrape_with_api(url):
    # Get rendered HTML from WebScraping.AI
    response = requests.get(
        'https://api.webscraping.ai/html',
        params={
            'url': url,
            'api_key': 'YOUR_API_KEY',
            'js': 'true',  # Enable JavaScript rendering
            'wait_for': '.product-list'  # Wait for specific selector
        }
    )
    html = response.text

    # Process with LLM
    client = OpenAI()
    completion = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[
            {"role": "user", "content": f"Extract products from:\n{html}"}
        ]
    )
    return completion.choices[0].message.content
Conclusion
Handling dynamic websites with LLM-based web scraping requires combining browser automation tools with LLM capabilities. The key is to first render the JavaScript-heavy content using tools like Puppeteer or Playwright, then leverage LLMs for intelligent data extraction from the rendered HTML. By implementing proper wait strategies, optimizing content before LLM processing, and using structured outputs through function calling, you can build robust scrapers for even the most complex modern web applications.
This hybrid approach gives you the best of both worlds: the ability to handle dynamic JavaScript content through browser automation, and the intelligent, flexible data extraction capabilities of LLMs. Remember to implement proper error handling, respect rate limits, and optimize for token usage to keep costs manageable while maintaining reliability.