Can LLMs Extract Data from JavaScript-Rendered Pages?
Yes, Large Language Models (LLMs) can extract data from JavaScript-rendered pages, but they cannot directly execute JavaScript. Instead, you must first render the page using a headless browser like Puppeteer, Playwright, or Selenium to get the final HTML output, then pass that rendered HTML to the LLM for data extraction.
This two-step approach combines the strengths of browser automation (handling dynamic content) with LLM capabilities (intelligent data extraction), making it particularly powerful for modern web applications that rely heavily on client-side rendering.
Understanding the Challenge
JavaScript-rendered pages present unique challenges for web scraping:
- Client-side rendering: Content is generated dynamically by JavaScript after the initial page load
- Asynchronous data loading: Data may load via AJAX requests after user interactions
- Complex state management: Modern frameworks like React, Vue, and Angular create dynamic UIs
- Delayed content: Elements may appear only after certain conditions are met
Traditional HTML parsers see only the initial HTML skeleton, missing the dynamically generated content. LLMs face the same limitation—they need the fully rendered HTML to extract meaningful data.
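To see the limitation concretely, here is a minimal sketch (the URL and CSS class are placeholders for a hypothetical product page): a plain HTTP fetch parsed with BeautifulSoup returns only the server-sent skeleton, so selectors for JavaScript-rendered elements typically come back empty.

import requests
from bs4 import BeautifulSoup

# Fetch the raw HTML without executing any JavaScript
response = requests.get('https://example.com/products')
soup = BeautifulSoup(response.text, 'html.parser')

# On a client-side-rendered page this is typically empty, because the
# product cards are injected by JavaScript after the initial load
products = soup.select('.product-item')
print(f"Products found without JS rendering: {len(products)}")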
The Two-Step Solution
Step 1: Render the Page with a Headless Browser
First, use a headless browser to execute JavaScript and capture the fully rendered HTML. Here's how to do it with Puppeteer in Node.js:
const puppeteer = require('puppeteer');

async function getRenderedHTML(url) {
  const browser = await puppeteer.launch({
    headless: 'new',
    args: ['--no-sandbox', '--disable-setuid-sandbox']
  });
  const page = await browser.newPage();

  // Navigate to the page
  await page.goto(url, {
    waitUntil: 'networkidle2', // Wait until network is idle
    timeout: 30000
  });

  // Wait for specific content if needed
  await page.waitForSelector('.product-list', { timeout: 10000 });

  // Get the fully rendered HTML
  const html = await page.content();
  await browser.close();
  return html;
}

// Usage (wrapped in an async IIFE, since top-level await is not available in CommonJS scripts)
(async () => {
  const url = 'https://example.com/products';
  const renderedHTML = await getRenderedHTML(url);
})();
In Python using Playwright:
from playwright.sync_api import sync_playwright

def get_rendered_html(url):
    with sync_playwright() as p:
        # Launch browser
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()

        # Navigate and wait for content
        page.goto(url, wait_until='networkidle')

        # Wait for specific elements
        page.wait_for_selector('.product-list', timeout=10000)

        # Get rendered HTML
        html = page.content()
        browser.close()
        return html

# Usage
url = 'https://example.com/products'
rendered_html = get_rendered_html(url)
For more advanced scenarios like handling AJAX requests using Puppeteer or working with dynamic single-page applications, you may need additional wait strategies.
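For example, when the data arrives through a background XHR or fetch call, you can wait for that specific response instead of waiting for the whole network to go idle. Below is a minimal Playwright sketch of this idea (the /api/products path is a placeholder for whatever endpoint the page actually calls); in Puppeteer, page.waitForResponse serves the same purpose.

from playwright.sync_api import sync_playwright

def get_html_after_ajax(url):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()

        # Start waiting for the data request, then trigger navigation
        with page.expect_response(lambda r: '/api/products' in r.url and r.ok):
            page.goto(url)

        # The data has arrived; wait for the framework to render it
        page.wait_for_selector('.product-list', timeout=10000)
        html = page.content()
        browser.close()
        return html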
Step 2: Extract Data with an LLM
Once you have the rendered HTML, pass it to an LLM with a structured extraction prompt:
import openai
import json

def extract_with_llm(html, extraction_schema):
    """
    Extract structured data from HTML using OpenAI's GPT model

    Args:
        html: The fully rendered HTML content
        extraction_schema: JSON schema describing what to extract
    """
    client = openai.OpenAI(api_key='your-api-key')

    prompt = f"""
    Extract the following data from this HTML content.
    Return the data as valid JSON matching this schema:

    {json.dumps(extraction_schema, indent=2)}

    HTML content:
    {html}

    Return only the JSON data, no additional text.
    """

    response = client.chat.completions.create(
        model="gpt-4-turbo-preview",
        messages=[
            {"role": "system", "content": "You are a data extraction assistant. Extract structured data from HTML and return valid JSON."},
            {"role": "user", "content": prompt}
        ],
        temperature=0,
        response_format={"type": "json_object"}
    )

    return json.loads(response.choices[0].message.content)

# Define what you want to extract
schema = {
    "products": [
        {
            "name": "string",
            "price": "number",
            "rating": "number",
            "availability": "string"
        }
    ]
}

# Extract data
extracted_data = extract_with_llm(rendered_html, schema)
print(json.dumps(extracted_data, indent=2))
Using Claude API for extraction:
import anthropic
import json

def extract_with_claude(html):
    """
    Extract structured data using the Claude API with tool use (function calling)
    """
    client = anthropic.Anthropic(api_key='your-api-key')

    # Define the extraction tool; its input schema describes the data to extract
    tools = [{
        "name": "extract_product_data",
        "description": "Extract product information from HTML",
        "input_schema": {
            "type": "object",
            "properties": {
                "products": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "name": {"type": "string"},
                            "price": {"type": "number"},
                            "rating": {"type": "number"},
                            "availability": {"type": "string"}
                        },
                        "required": ["name", "price"]
                    }
                }
            },
            "required": ["products"]
        }
    }]

    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=4096,
        tools=tools,
        messages=[{
            "role": "user",
            "content": f"Extract all product data from this HTML:\n\n{html}"
        }]
    )

    # Extract the tool call result
    for content in message.content:
        if content.type == "tool_use":
            return content.input
    return None

# Usage
extracted_data = extract_with_claude(rendered_html)
Complete End-to-End Example
Here's a full example combining both steps in Python:
from playwright.sync_api import sync_playwright
import anthropic
import json

def scrape_js_rendered_page_with_llm(url):
    """
    Complete pipeline: Render JS page and extract data with LLM
    """
    # Step 1: Render the page
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until='networkidle')
        page.wait_for_selector('.product-item', timeout=15000)
        html = page.content()
        browser.close()

    # Step 2: Extract with LLM
    client = anthropic.Anthropic(api_key='your-api-key')

    tools = [{
        "name": "extract_products",
        "description": "Extract product listings",
        "input_schema": {
            "type": "object",
            "properties": {
                "products": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "title": {"type": "string"},
                            "price": {"type": "string"},
                            "image_url": {"type": "string"},
                            "description": {"type": "string"}
                        }
                    }
                }
            }
        }
    }]

    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=4096,
        tools=tools,
        messages=[{
            "role": "user",
            "content": f"Extract all products from this e-commerce page HTML:\n\n{html[:50000]}"  # Limit context size
        }]
    )

    for content in message.content:
        if content.type == "tool_use":
            return content.input
    return None

# Run the scraper
products = scrape_js_rendered_page_with_llm('https://example-shop.com/products')
print(json.dumps(products, indent=2))
Advanced Techniques
Waiting for Dynamic Content
When crawling single-page applications using Puppeteer, you may need to wait for specific conditions:
async function waitForDynamicContent(page) {
  // Wait for network to be idle
  await page.waitForNetworkIdle();

  // Wait for specific element
  await page.waitForSelector('.loaded-content');

  // Wait for custom condition
  await page.waitForFunction(() => {
    return document.querySelectorAll('.product-item').length > 10;
  });

  // Additional delay for animations
  await page.waitForTimeout(1000);
}
Handling Infinite Scroll
For pages with infinite scroll:
from playwright.sync_api import sync_playwright

def scrape_infinite_scroll(url):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until='networkidle')

        # Scroll to load more content
        previous_height = 0
        while True:
            # Scroll to bottom
            page.evaluate('window.scrollTo(0, document.body.scrollHeight)')
            page.wait_for_timeout(2000)

            # Check if new content loaded
            current_height = page.evaluate('document.body.scrollHeight')
            if current_height == previous_height:
                break
            previous_height = current_height

        html = page.content()
        browser.close()
        return html
Chunking Large HTML for LLMs
Large pages may exceed LLM context limits. Extract relevant sections first:
from bs4 import BeautifulSoup

def extract_relevant_content(html, selector='.main-content'):
    """
    Extract only the relevant part of HTML before sending to LLM
    """
    soup = BeautifulSoup(html, 'html.parser')

    # Find the main content area
    main_content = soup.select_one(selector)
    if main_content:
        # Remove unnecessary elements
        for tag in main_content.find_all(['script', 'style', 'noscript']):
            tag.decompose()
        return str(main_content)
    return html

# Process before sending to LLM
cleaned_html = extract_relevant_content(rendered_html, '.product-grid')
extracted_data = extract_with_llm(cleaned_html, schema)
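If the relevant section is still larger than the model's context window, a rough chunking sketch like the one below can help. It assumes the extract_with_llm helper and schema defined in Step 2, and uses an arbitrary character budget rather than a real token count:

def extract_in_chunks(cleaned_html, schema, chunk_size=40000):
    """Split oversized HTML into character-based chunks and merge the extracted products."""
    chunks = [cleaned_html[i:i + chunk_size] for i in range(0, len(cleaned_html), chunk_size)]
    all_products = []
    for chunk in chunks:
        result = extract_with_llm(chunk, schema)
        all_products.extend(result.get("products", []))

    # Items split across a chunk boundary may be missed or duplicated,
    # so deduplicate on a reasonably stable field such as the product name
    seen = set()
    deduped = []
    for product in all_products:
        key = product.get("name")
        if key not in seen:
            seen.add(key)
            deduped.append(product)
    return {"products": deduped}

Overlapping the chunks slightly reduces the chance of losing items that straddle a boundary, at the cost of more duplicate handling.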
When to Use LLMs vs Traditional Selectors
Use LLMs when:
- HTML structure changes frequently
- Data isn't in consistent formats
- You need semantic understanding (e.g., identifying product features from descriptions)
- Multiple page layouts need one extractor
- Dealing with unstructured text content
Use traditional CSS/XPath selectors when:
- HTML structure is stable
- Speed is critical (LLMs are slower)
- Cost is a concern (LLM API calls cost money)
- Data is in consistent, predictable locations
- Simple tabular data extraction
Best approach: Combine both methods. Use headless browsers with traditional selectors when possible, and fall back to LLMs for complex or unpredictable content, as in the sketch below.
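Here is a minimal sketch of that hybrid strategy, assuming the extract_with_llm helper and schema from Step 2 and placeholder CSS classes for the target site:

from bs4 import BeautifulSoup

def extract_products_hybrid(rendered_html, schema):
    """Try cheap CSS selectors first; fall back to LLM extraction if they find nothing."""
    soup = BeautifulSoup(rendered_html, 'html.parser')
    products = []
    for item in soup.select('.product-item'):
        name = item.select_one('.product-name')
        price = item.select_one('.product-price')
        if name and price:
            products.append({
                "name": name.get_text(strip=True),
                "price": price.get_text(strip=True),
            })

    if products:
        return {"products": products}

    # Selectors found nothing (the layout changed or the classes differ),
    # so fall back to the more expensive LLM extraction
    return extract_with_llm(rendered_html, schema)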
Using WebScraping.AI API
For a simpler solution, use WebScraping.AI's built-in JavaScript rendering with LLM-powered extraction:
import requests

def scrape_with_api(url, question):
    """
    Use WebScraping.AI API with automatic JS rendering and LLM extraction
    """
    response = requests.get(
        'https://api.webscraping.ai/ai',
        params={
            'api_key': 'YOUR_API_KEY',
            'url': url,
            'question': question,
            'js': True,  # Enable JavaScript rendering
            'wait_for': '.product-list'  # Wait for specific element
        }
    )
    return response.json()

# Extract product data
result = scrape_with_api(
    'https://example.com/products',
    'Extract all products with their names, prices, and ratings as JSON'
)
print(result)
This handles both JavaScript rendering and LLM extraction in a single API call, saving you infrastructure and maintenance overhead.
Best Practices
- Optimize wait strategies: Don't wait longer than necessary; use specific selectors instead of fixed timeouts
- Minimize HTML sent to LLMs: Extract only relevant sections to reduce costs and improve accuracy
- Use structured outputs: Always request JSON with a specific schema for consistent results
- Cache rendered pages: If scraping the same page multiple times, cache the rendered HTML
- Handle rate limits: Both headless browsers and LLM APIs have limits; implement retry logic with backoff (see the sketch after this list)
- Monitor costs: LLM API calls can become expensive at scale; consider traditional parsing for high-volume tasks
- Handle errors: Implement robust error handling for both browser automation and LLM API calls
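As a starting point for the retry logic mentioned above, here is a minimal sketch around the extract_with_llm helper from Step 2; the broad exception handling and delays are illustrative, and in real code you would catch only the client's rate-limit and timeout errors:

import time

def extract_with_retry(html, schema, max_attempts=3, base_delay=2.0):
    """Retry LLM extraction with exponential backoff on transient failures."""
    for attempt in range(1, max_attempts + 1):
        try:
            return extract_with_llm(html, schema)
        except Exception as exc:  # narrow to rate-limit/timeout exceptions in production
            if attempt == max_attempts:
                raise
            delay = base_delay * (2 ** (attempt - 1))
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.0f}s")
            time.sleep(delay)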
Conclusion
LLMs can absolutely extract data from JavaScript-rendered pages, but they require a two-step process: first rendering the page with a headless browser, then using the LLM to extract structured data from the resulting HTML. This combination provides a powerful, flexible solution for modern web scraping, especially when dealing with complex, dynamic websites where traditional selectors fall short.
For production use cases, consider using dedicated APIs like WebScraping.AI that handle both JavaScript rendering and intelligent data extraction, letting you focus on your application logic rather than infrastructure management.