How Do I Wait for JavaScript to Load Before Scraping with Crawlee?

When scraping modern websites, JavaScript often loads content dynamically after the initial page load. Crawlee provides several powerful methods to wait for JavaScript content to render before extracting data, ensuring you capture complete and accurate information.

Understanding JavaScript Rendering in Crawlee

Crawlee offers different crawler types designed for various scraping scenarios:

  • CheerioCrawler: Fast but doesn't execute JavaScript
  • PuppeteerCrawler: Uses headless Chrome to execute JavaScript
  • PlaywrightCrawler: Uses Playwright for JavaScript execution with better browser support

For JavaScript-heavy sites, you'll need to use either PuppeteerCrawler or PlaywrightCrawler to properly wait for dynamic content.
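
To see the difference in practice, here is a minimal CheerioCrawler sketch. It only parses the static HTML the server returns, so elements rendered later by JavaScript never appear in the Cheerio handle (the .product-item selector is illustrative):

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    requestHandler: async ({ $, request, log }) => {
        // $ wraps the raw server response; JS-injected content is absent.
        const title = $('title').text();
        const staticItems = $('.product-item').length; // often 0 on JS-rendered listings
        log.info(`${request.url}: title="${title}", static items=${staticItems}`);
    },
});

await crawler.run(['https://example.com']);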

Basic Wait Strategies in Crawlee

1. Waiting for Network Idle

The most common approach is waiting for network activity to settle, indicating that JavaScript has finished loading content:

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, request, log }) => {
        // Wait for network to be idle
        await page.waitForLoadState('networkidle');

        const title = await page.title();
        log.info(`Title: ${title}`);
    },
});

await crawler.run(['https://example.com']);

These are actually Playwright's load states, not variants of network idle:

  • load: waits for the load event
  • domcontentloaded: waits for the DOMContentLoaded event
  • networkidle: waits until there have been no network connections for at least 500 ms

2. Waiting for Specific Selectors

Often the best approach is to wait for specific elements that are loaded via JavaScript:

import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    requestHandler: async ({ page, request, log }) => {
        // Wait for a specific element to appear
        await page.waitForSelector('.product-list', {
            timeout: 10000, // Wait up to 10 seconds
        });

        // Now extract data
        const products = await page.$$eval('.product-item', (elements) => {
            return elements.map(el => ({
                name: el.querySelector('.product-name')?.textContent,
                price: el.querySelector('.product-price')?.textContent,
            }));
        });

        log.info(`Found ${products.length} products`);
    },
});

await crawler.run(['https://shop.example.com']);

3. Waiting for Multiple Conditions

You can combine multiple wait strategies for more robust scraping:

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, request, log }) => {
        // Wait for DOM to load
        await page.waitForLoadState('domcontentloaded');

        // Wait for specific content
        await page.waitForSelector('#main-content', { state: 'visible' });

        // Wait for JavaScript variable to be defined
        await page.waitForFunction(() => window.dataLayer !== undefined);

        // Additional wait for AJAX to complete
        await page.waitForLoadState('networkidle');

        const content = await page.content();
        log.info('Page fully loaded');
    },
});

await crawler.run(['https://example.com']);

Advanced Waiting Techniques

Waiting for JavaScript Functions

Sometimes you need to wait for specific JavaScript conditions to be met:

import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    requestHandler: async ({ page, request, log }) => {
        // Wait for a custom condition
        await page.waitForFunction(
            () => document.querySelectorAll('.loaded-item').length >= 10,
            { timeout: 15000 }
        );

        // Or wait for an API response to be processed
        await page.waitForFunction(
            () => window.apiDataLoaded === true,
            { timeout: 10000 }
        );

        const items = await page.$$('.loaded-item');
        log.info(`Found ${items.length} loaded items`);
    },
});

await crawler.run(['https://example.com/dynamic']);

Handling Lazy Loading and Infinite Scroll

For pages that load content as you scroll, you can automate scrolling and waiting:

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, request, log }) => {
        let previousHeight = 0;
        let currentHeight = await page.evaluate(() => document.body.scrollHeight);

        // Scroll until no more content loads
        while (previousHeight !== currentHeight) {
            previousHeight = currentHeight;

            // Scroll to bottom
            await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));

            // Wait for new content to load
            await page.waitForTimeout(2000);

            // Get new height
            currentHeight = await page.evaluate(() => document.body.scrollHeight);
        }

        log.info('Finished loading all content');

        const allItems = await page.$$eval('.item', items => items.length);
        log.info(`Total items: ${allItems}`);
    },
});

await crawler.run(['https://example.com/infinite-scroll']);
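
Crawlee also bundles this scroll-and-wait loop as a helper, playwrightUtils.infiniteScroll(). A minimal sketch follows; the option values here are assumptions to tune for your target site:

import { PlaywrightCrawler, playwrightUtils } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, log }) => {
        // Keep scrolling until the page stops growing or the time budget runs out.
        await playwrightUtils.infiniteScroll(page, {
            timeoutSecs: 60, // overall scrolling budget (assumed value)
            waitForSecs: 2,  // stop once no new content arrives for this long
        });

        const total = await page.$$eval('.item', (items) => items.length);
        log.info(`Total items after scrolling: ${total}`);
    },
});

await crawler.run(['https://example.com/infinite-scroll']);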

Waiting for AJAX Requests

When dealing with AJAX requests that load dynamic content, you can monitor network activity:

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, request, log }) => {
        // Wait for a specific API call to complete
        const responsePromise = page.waitForResponse(
            response => response.url().includes('/api/products') && response.status() === 200,
            { timeout: 10000 }
        );

        // Trigger the action that causes the API call
        await page.click('#load-more-button');

        // Wait for the response
        await responsePromise;

        // Wait for DOM to update with new data
        await page.waitForSelector('.new-products', { timeout: 5000 });

        log.info('AJAX content loaded successfully');
    },
});

await crawler.run(['https://example.com/products']);

Using Pre-Navigation Hooks

Crawlee allows you to configure waiting behavior globally using pre-navigation hooks:

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    preNavigationHooks: [
        async ({ page, request }, goToOptions) => {
            // Set default wait condition
            goToOptions.waitUntil = 'networkidle';
            goToOptions.timeout = 30000; // 30 seconds
        },
    ],
    requestHandler: async ({ page, request, log }) => {
        // Page is already loaded with networkidle
        const content = await page.content();
        log.info('Processing page content');
    },
});

await crawler.run(['https://example.com']);

TypeScript Example

Here's a complete TypeScript example demonstrating multiple waiting strategies:

import { PlaywrightCrawler, PlaywrightCrawlingContext, Dataset } from 'crawlee';

interface ProductData {
    name: string;
    price: string;
    availability: string;
}

const crawler = new PlaywrightCrawler({
    maxRequestsPerCrawl: 50,
    requestHandler: async ({ page, request, log }: PlaywrightCrawlingContext) => {
        log.info(`Processing: ${request.url}`);

        try {
            // Wait for initial content
            await page.waitForLoadState('domcontentloaded');

            // Wait for product grid to be visible
            await page.waitForSelector('.product-grid', {
                state: 'visible',
                timeout: 10000,
            });

            // Wait for price elements to load (often loaded via JS)
            await page.waitForFunction(
                () => {
                    const priceElements = document.querySelectorAll('.product-price');
                    return priceElements.length > 0 &&
                           Array.from(priceElements).every(el => el.textContent?.trim());
                },
                { timeout: 8000 }
            );

            // Extract data
            const products: ProductData[] = await page.$$eval('.product-item', (elements) => {
                return elements.map(el => ({
                    name: el.querySelector('.product-name')?.textContent?.trim() || '',
                    price: el.querySelector('.product-price')?.textContent?.trim() || '',
                    availability: el.querySelector('.availability')?.textContent?.trim() || '',
                }));
            });

            log.info(`Extracted ${products.length} products`);

            // Save data
            await Dataset.pushData({
                url: request.url,
                products,
                scrapedAt: new Date().toISOString(),
            });

        } catch (error) {
            log.error(`Error processing ${request.url}: ${error}`);
        }
    },
    failedRequestHandler: async ({ request, log }) => {
        log.error(`Request ${request.url} failed too many times`);
    },
});

await crawler.run([
    'https://example.com/products',
    'https://example.com/products?page=2',
]);

Best Practices for Waiting in Crawlee

1. Choose the Right Waiting Strategy

  • Use waitForSelector() when you know specific elements will appear
  • Use waitForLoadState('networkidle') for heavily dynamic pages
  • Use waitForFunction() for custom conditions
  • Combine multiple strategies for reliability, as in the sketch below
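
For example, a small helper can try the fast, specific signal first and fall back to a broader one on timeout. This is a sketch using plain Playwright calls, not a Crawlee API; the selector and timeout values are placeholders:

import type { Page } from 'playwright';

// Prefer a specific selector; fall back to network idle if it never shows up.
async function waitForContent(page: Page, selector: string): Promise<void> {
    try {
        await page.waitForSelector(selector, { timeout: 5000 });
    } catch {
        // Selector did not appear in time; fall back to a broader signal.
        await page.waitForLoadState('networkidle', { timeout: 15000 });
    }
}

// Inside a requestHandler: await waitForContent(page, '.product-list');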

2. Set Appropriate Timeouts

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, request, log }) => {
        try {
            // Set reasonable timeouts
            await page.waitForSelector('.content', { timeout: 15000 });
        } catch (error) {
            log.warning('Content did not load in time, using fallback');
            // Implement fallback logic
        }
    },
    navigationTimeoutSecs: 60, // Global navigation timeout
});

3. Handle Errors Gracefully

Similar to handling timeouts in Puppeteer, implement proper error handling:

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, request, log }) => {
        try {
            await page.waitForSelector('.main-content', { timeout: 10000 });
        } catch (error) {
            log.warning(`Timeout waiting for selector: ${error.message}`);
            // Continue with available content or skip
            return;
        }

        // Process page
    },
    maxRequestRetries: 3,
});

4. Optimize Performance

Don't wait longer than necessary:

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, request, log }) => {
        // Use Promise.race for multiple possible selectors
        await Promise.race([
            page.waitForSelector('.content-type-a'),
            page.waitForSelector('.content-type-b'),
        ]);

        // Or wait for the first of multiple conditions
        await page.waitForFunction(
            () => document.querySelector('.loaded') ||
                  document.querySelector('.alternative-loaded'),
            { timeout: 8000 }
        );
    },
});
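
Another performance lever is cutting the traffic that keeps networkidle from settling. Crawlee's playwrightUtils.blockRequests() can block resources before navigation; the URL patterns below are illustrative, and browser support may vary, so check the Crawlee docs for your setup:

import { PlaywrightCrawler, playwrightUtils } from 'crawlee';

const crawler = new PlaywrightCrawler({
    preNavigationHooks: [
        async ({ page }) => {
            // Block heavy or irrelevant resources so network activity settles
            // sooner; this must run before navigation starts.
            await playwrightUtils.blockRequests(page, {
                extraUrlPatterns: ['.mp4', 'analytics', 'ads'], // assumed patterns
            });
        },
    ],
    requestHandler: async ({ page, log }) => {
        await page.waitForLoadState('networkidle');
        log.info(`Loaded: ${await page.title()}`);
    },
});

await crawler.run(['https://example.com']);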

Python Example with Crawlee

If you're using Crawlee for Python, the syntax is similar:

from crawlee.playwright_crawler import PlaywrightCrawler

async def main():
    crawler = PlaywrightCrawler()

    @crawler.router.default_handler
    async def request_handler(context):
        page = context.page
        log = context.log

        # Wait for network to be idle
        await page.wait_for_load_state('networkidle')

        # Wait for specific selector
        await page.wait_for_selector('.product-list', timeout=10000)

        # Wait for custom function
        await page.wait_for_function(
            'document.querySelectorAll(".product").length >= 10',
            timeout=15000
        )

        # Extract data
        products = await page.eval_on_selector_all(
            '.product',
            'elements => elements.map(el => el.textContent)'
        )

        log.info(f'Found {len(products)} products')

    await crawler.run(['https://example.com'])

if __name__ == '__main__':
    import asyncio
    asyncio.run(main())

Conclusion

Waiting for JavaScript to load properly is crucial for successful web scraping with Crawlee. By using the appropriate waiting strategies—whether it's waitForSelector(), waitForLoadState(), waitForFunction(), or a combination—you can ensure your scraper captures all dynamically loaded content reliably. When scraping single-page applications, these techniques become even more critical.

Remember to balance thoroughness with performance by setting reasonable timeouts, implementing proper error handling, and choosing the most efficient waiting strategy for your specific use case. With Crawlee's flexible API, you have full control over how your crawler waits for and processes JavaScript-rendered content.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
