How Do I Wait for JavaScript to Load Before Scraping with Crawlee?
When scraping modern websites, JavaScript often loads content dynamically after the initial page load. Crawlee provides several powerful methods to wait for JavaScript content to render before extracting data, ensuring you capture complete and accurate information.
Understanding JavaScript Rendering in Crawlee
Crawlee offers different crawler types designed for various scraping scenarios:
- CheerioCrawler: Fast but doesn't execute JavaScript
- PuppeteerCrawler: Uses headless Chrome to execute JavaScript
- PlaywrightCrawler: Uses Playwright for JavaScript execution with better browser support
For JavaScript-heavy sites, you'll need to use either PuppeteerCrawler or PlaywrightCrawler to properly wait for dynamic content.
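To see the difference, compare what a CheerioCrawler observes against what a real browser renders. A minimal sketch (the .js-rendered-item selector is a hypothetical placeholder):

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    requestHandler: async ({ $, request, log }) => {
        // Cheerio only parses the raw HTML response; elements injected
        // later by client-side JavaScript will not appear here.
        const count = $('.js-rendered-item').length; // hypothetical selector
        log.info(`${request.url}: ${count} items in the static HTML`);
    },
});

await crawler.run(['https://example.com']);

If that count is zero but the items are visible in a real browser, the content is rendered client-side and you need a browser-based crawler.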
Basic Wait Strategies in Crawlee
1. Waiting for Network Idle
The most common approach is waiting for network activity to settle, indicating that JavaScript has finished loading content:
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, request, log }) => {
        // Wait for network to be idle
        await page.waitForLoadState('networkidle');
        const title = await page.title();
        log.info(`Title: ${title}`);
    },
});

await crawler.run(['https://example.com']);
waitForLoadState() accepts the following load states:
- load: waits for the load event
- domcontentloaded: waits for the DOMContentLoaded event
- networkidle: waits until there are no network connections for at least 500 ms
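If you only need the parsed DOM rather than every image and late AJAX call, domcontentloaded is usually the fastest safe choice; a minimal sketch:

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, log }) => {
        // Resolves immediately if the state has already been reached.
        await page.waitForLoadState('domcontentloaded');
        log.info('DOM parsed; dynamic content may still be loading');
    },
});

await crawler.run(['https://example.com']);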
2. Waiting for Specific Selectors
Often the best approach is to wait for specific elements that are loaded via JavaScript:
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    requestHandler: async ({ page, request, log }) => {
        // Wait for a specific element to appear
        await page.waitForSelector('.product-list', {
            timeout: 10000, // Wait up to 10 seconds
        });
        // Now extract data
        const products = await page.$$eval('.product-item', (elements) => {
            return elements.map((el) => ({
                name: el.querySelector('.product-name')?.textContent,
                price: el.querySelector('.product-price')?.textContent,
            }));
        });
        log.info(`Found ${products.length} products`);
    },
});

await crawler.run(['https://shop.example.com']);
3. Waiting for Multiple Conditions
You can combine multiple wait strategies for more robust scraping:
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, request, log }) => {
        // Wait for the DOM to load
        await page.waitForLoadState('domcontentloaded');
        // Wait for specific content
        await page.waitForSelector('#main-content', { state: 'visible' });
        // Wait for a JavaScript variable to be defined
        await page.waitForFunction(() => window.dataLayer !== undefined);
        // Additional wait for AJAX to complete
        await page.waitForLoadState('networkidle');
        const content = await page.content();
        log.info('Page fully loaded');
    },
});

await crawler.run(['https://example.com']);
Advanced Waiting Techniques
Waiting for JavaScript Functions
Sometimes you need to wait for specific JavaScript conditions to be met:
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    requestHandler: async ({ page, request, log }) => {
        // Wait for a custom condition
        await page.waitForFunction(
            () => document.querySelectorAll('.loaded-item').length >= 10,
            { timeout: 15000 },
        );
        // Or wait for an API response to be processed
        await page.waitForFunction(
            () => window.apiDataLoaded === true,
            { timeout: 10000 },
        );
        const items = await page.$$('.loaded-item');
        log.info(`Found ${items.length} loaded items`);
    },
});

await crawler.run(['https://example.com/dynamic']);
Handling Lazy Loading and Infinite Scroll
For pages that load content as you scroll, you can automate scrolling and waiting:
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, request, log }) => {
        let previousHeight = 0;
        let currentHeight = await page.evaluate(() => document.body.scrollHeight);
        // Scroll until no more content loads
        while (previousHeight !== currentHeight) {
            previousHeight = currentHeight;
            // Scroll to the bottom
            await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
            // Wait for new content to load
            await page.waitForTimeout(2000);
            // Get the new height
            currentHeight = await page.evaluate(() => document.body.scrollHeight);
        }
        log.info('Finished loading all content');
        const allItems = await page.$$eval('.item', (items) => items.length);
        log.info(`Total items: ${allItems}`);
    },
});

await crawler.run(['https://example.com/infinite-scroll']);
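Crawlee also ships a utility that implements this scroll-and-wait loop for you. A sketch using playwrightUtils.infiniteScroll (option names may vary between Crawlee versions, so check the reference for yours):

import { PlaywrightCrawler, playwrightUtils } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, log }) => {
        // Scrolls down until the page height stops growing or the time budget runs out.
        await playwrightUtils.infiniteScroll(page, { timeoutSecs: 30 });
        const total = await page.$$eval('.item', (items) => items.length);
        log.info(`Total items after scrolling: ${total}`);
    },
});

await crawler.run(['https://example.com/infinite-scroll']);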
Waiting for AJAX Requests
When dealing with AJAX requests that load dynamic content, you can monitor network activity:
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, request, log }) => {
        // Wait for a specific API call to complete
        const responsePromise = page.waitForResponse(
            (response) => response.url().includes('/api/products') && response.status() === 200,
            { timeout: 10000 },
        );
        // Trigger the action that causes the API call
        await page.click('#load-more-button');
        // Wait for the response
        await responsePromise;
        // Wait for the DOM to update with the new data
        await page.waitForSelector('.new-products', { timeout: 5000 });
        log.info('AJAX content loaded successfully');
    },
});

await crawler.run(['https://example.com/products']);
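When the API response itself contains the data you need, it can be simpler to parse the payload directly instead of waiting for the DOM to re-render. A sketch that assumes the hypothetical /api/products endpoint returns a JSON array:

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, log }) => {
        const responsePromise = page.waitForResponse(
            (response) => response.url().includes('/api/products') && response.ok(),
        );
        await page.click('#load-more-button');
        const response = await responsePromise;
        // Read the JSON payload directly instead of scraping the updated DOM.
        const data = await response.json();
        log.info(`API returned ${Array.isArray(data) ? data.length : 0} records`);
    },
});

await crawler.run(['https://example.com/products']);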
Using Pre-Navigation Hooks
Crawlee allows you to configure waiting behavior globally using pre-navigation hooks:
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    preNavigationHooks: [
        async ({ page, request }, goToOptions) => {
            // Set the default wait condition
            goToOptions.waitUntil = 'networkidle';
            goToOptions.timeout = 30000; // 30 seconds
        },
    ],
    requestHandler: async ({ page, request, log }) => {
        // The page is already loaded with networkidle
        const content = await page.content();
        log.info('Processing page content');
    },
});

await crawler.run(['https://example.com']);
TypeScript Example
Here's a complete TypeScript example demonstrating multiple waiting strategies:
import { Dataset, PlaywrightCrawler, PlaywrightCrawlingContext } from 'crawlee';

interface ProductData {
    name: string;
    price: string;
    availability: string;
}

const crawler = new PlaywrightCrawler({
    maxRequestsPerCrawl: 50,
    requestHandler: async ({ page, request, log }: PlaywrightCrawlingContext) => {
        log.info(`Processing: ${request.url}`);
        try {
            // Wait for initial content
            await page.waitForLoadState('domcontentloaded');
            // Wait for the product grid to be visible
            await page.waitForSelector('.product-grid', {
                state: 'visible',
                timeout: 10000,
            });
            // Wait for price elements to load (often loaded via JS).
            // Note: in Playwright, the options object is the third argument, after `arg`.
            await page.waitForFunction(
                () => {
                    const priceElements = document.querySelectorAll('.product-price');
                    return priceElements.length > 0 &&
                        Array.from(priceElements).every((el) => el.textContent?.trim());
                },
                undefined,
                { timeout: 8000 },
            );
            // Extract data
            const products: ProductData[] = await page.$$eval('.product-item', (elements) => {
                return elements.map((el) => ({
                    name: el.querySelector('.product-name')?.textContent?.trim() || '',
                    price: el.querySelector('.product-price')?.textContent?.trim() || '',
                    availability: el.querySelector('.availability')?.textContent?.trim() || '',
                }));
            });
            log.info(`Extracted ${products.length} products`);
            // Save data
            await Dataset.pushData({
                url: request.url,
                products,
                scrapedAt: new Date().toISOString(),
            });
        } catch (error) {
            log.error(`Error processing ${request.url}: ${error}`);
        }
    },
    failedRequestHandler: async ({ request, log }) => {
        log.error(`Request ${request.url} failed too many times`);
    },
});

await crawler.run([
    'https://example.com/products',
    'https://example.com/products?page=2',
]);
Best Practices for Waiting in Crawlee
1. Choose the Right Waiting Strategy
- Use waitForSelector() when you know specific elements will appear
- Use waitForLoadState('networkidle') for heavily dynamic pages
- Use waitForFunction() for custom conditions
- Combine multiple strategies for reliability, as in the sketch below
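One simple way to combine strategies is to try a fast, specific wait first and fall back to a broader one. A minimal sketch (the waitForContent helper and the .main-content selector are illustrative):

import { PlaywrightCrawler } from 'crawlee';

// Hypothetical helper: prefer a specific selector, fall back to network idle.
async function waitForContent(page, selector) {
    try {
        await page.waitForSelector(selector, { timeout: 5000 });
    } catch {
        // The selector never appeared in time; wait for a broader signal instead.
        await page.waitForLoadState('networkidle');
    }
}

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, log }) => {
        await waitForContent(page, '.main-content');
        log.info('Content ready (specific selector or network idle)');
    },
});

await crawler.run(['https://example.com']);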
2. Set Appropriate Timeouts
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, request, log }) => {
        try {
            // Set reasonable timeouts
            await page.waitForSelector('.content', { timeout: 15000 });
        } catch (error) {
            log.warning('Content did not load in time, using fallback');
            // Implement fallback logic
        }
    },
    navigationTimeoutSecs: 60, // Global navigation timeout
});
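Beyond per-wait timeouts, Crawlee can also cap how long the handler as a whole may run; a sketch using the requestHandlerTimeoutSecs option:

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    navigationTimeoutSecs: 60, // Cap for the navigation itself
    requestHandlerTimeoutSecs: 120, // Cap for the whole requestHandler run
    requestHandler: async ({ page, log }) => {
        await page.waitForSelector('.content', { timeout: 15000 });
        log.info('Content loaded within the time budget');
    },
});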
3. Handle Errors Gracefully
Similar to handling timeouts in Puppeteer, implement proper error handling:
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, request, log }) => {
        try {
            await page.waitForSelector('.main-content', { timeout: 10000 });
        } catch (error) {
            log.warning(`Timeout waiting for selector: ${error.message}`);
            // Continue with available content or skip
            return;
        }
        // Process the page
    },
    maxRequestRetries: 3,
});
4. Optimize Performance
Don't wait longer than necessary:
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, request, log }) => {
        // Use Promise.race for multiple possible selectors
        // (note: the losing waitForSelector keeps running until its own timeout)
        await Promise.race([
            page.waitForSelector('.content-type-a'),
            page.waitForSelector('.content-type-b'),
        ]);
        // Or wait for the first of multiple conditions.
        // In Playwright, the options object is the third argument, after `arg`.
        await page.waitForFunction(
            () => document.querySelector('.loaded') ||
                document.querySelector('.alternative-loaded'),
            undefined,
            { timeout: 8000 },
        );
    },
});
Python Example with Crawlee
If you're using Crawlee for Python, the syntax is similar:
from crawlee.playwright_crawler import PlaywrightCrawler


async def main():
    crawler = PlaywrightCrawler()

    @crawler.router.default_handler
    async def request_handler(context):
        page = context.page
        log = context.log
        # Wait for network to be idle
        await page.wait_for_load_state('networkidle')
        # Wait for a specific selector (timeout is in milliseconds)
        await page.wait_for_selector('.product-list', timeout=10000)
        # Wait for a custom function
        await page.wait_for_function(
            'document.querySelectorAll(".product").length >= 10',
            timeout=15000,
        )
        # Extract data
        products = await page.eval_on_selector_all(
            '.product',
            'elements => elements.map(el => el.textContent)',
        )
        log.info(f'Found {len(products)} products')

    await crawler.run(['https://example.com'])


if __name__ == '__main__':
    import asyncio
    asyncio.run(main())
Conclusion
Waiting for JavaScript to load properly is crucial for successful web scraping with Crawlee. By using the appropriate waiting strategies, whether it's waitForSelector(), waitForLoadState(), waitForFunction(), or a combination, you can ensure your scraper captures all dynamically loaded content reliably. When scraping single-page applications, these techniques become even more critical.
Remember to balance thoroughness with performance by setting reasonable timeouts, implementing proper error handling, and choosing the most efficient waiting strategy for your specific use case. With Crawlee's flexible API, you have full control over how your crawler waits for and processes JavaScript-rendered content.