How does Crawlee handle AJAX-loaded content?
AJAX (Asynchronous JavaScript and XML) has become the backbone of modern web applications, enabling dynamic content loading without full page refreshes. When scraping websites that rely on AJAX, traditional HTTP-based scrapers often fail because they only capture the initial HTML response, missing content that loads asynchronously. Crawlee provides robust solutions for handling AJAX-loaded content through its browser automation capabilities and intelligent waiting mechanisms.
Understanding AJAX Content Loading
AJAX-loaded content presents unique challenges for web scraping:
- Delayed rendering: Content loads after the initial page load
- Dynamic DOM updates: JavaScript modifies the page structure continuously
- Infinite scrolling: New content appears as users scroll
- Event-triggered loading: Content loads in response to user interactions
- API-based data fetching: Data arrives through XHR/Fetch requests
Crawlee addresses these challenges through multiple strategies, depending on which crawler type you use.
Browser-Based Crawlers for AJAX Content
Crawlee's browser automation crawlers (PlaywrightCrawler and PuppeteerCrawler) are specifically designed to handle JavaScript-heavy websites and AJAX content. These crawlers execute JavaScript just like a real browser, allowing AJAX requests to complete and content to render before extraction.
Using PlaywrightCrawler
PlaywrightCrawler is the recommended choice for modern web scraping with full JavaScript support:
```javascript
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, request, log, pushData }) => {
        // Wait for AJAX content to load
        await page.waitForSelector('.ajax-loaded-content', {
            state: 'visible',
            timeout: 10000,
        });

        // Extract data after AJAX completes
        const data = await page.evaluate(() => {
            const items = [];
            document.querySelectorAll('.product-item').forEach((el) => {
                items.push({
                    title: el.querySelector('.title')?.textContent,
                    price: el.querySelector('.price')?.textContent,
                });
            });
            return items;
        });

        log.info(`Scraped ${data.length} items from ${request.url}`);
        await pushData(data);
    },
});

await crawler.run(['https://example.com/products']);
```
Using PuppeteerCrawler
PuppeteerCrawler offers similar functionality with Puppeteer's API:
```javascript
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    requestHandler: async ({ page, request, pushData }) => {
        // Wait for the network to go idle (all AJAX requests complete)
        await page.waitForNetworkIdle({ idleTime: 500 });

        // Or wait for a specific element
        await page.waitForSelector('#dynamic-content');

        const content = await page.$eval('#dynamic-content', (el) => el.innerHTML);
        await pushData({ url: request.url, content });
    },
});

await crawler.run(['https://example.com']);
```
Waiting Strategies for AJAX Content
Crawlee provides multiple strategies to ensure AJAX content has loaded before scraping:
1. Wait for Selectors
The most reliable method is waiting for specific elements that appear after AJAX completes:
```javascript
const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page }) => {
        // Wait for a single element to appear
        await page.waitForSelector('.ajax-results', {
            state: 'visible',
            timeout: 30000,
        });

        // Wait for multiple elements
        await Promise.all([
            page.waitForSelector('.product-list'),
            page.waitForSelector('.pagination'),
            page.waitForSelector('.filters'),
        ]);
    },
});
```
2. Wait for Network Idle
Puppeteer users may know this as networkidle waiting; Crawlee's browser crawlers can likewise wait for network activity to cease before extracting content:
```javascript
const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page }) => {
        // Wait until there has been no network activity
        // for at least 500 ms
        await page.waitForLoadState('networkidle');

        // Now safe to extract AJAX-loaded content
        const data = await page.evaluate(() => {
            return document.querySelector('.ajax-content')?.textContent;
        });
    },
});
```
3. Custom Wait Functions
For complex scenarios, implement custom waiting logic:
```javascript
const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page }) => {
        // Wait for a custom condition: the spinner is gone and the
        // content container has children. Note that in Playwright,
        // options are the third argument of waitForFunction.
        await page.waitForFunction(
            () => {
                const loader = document.querySelector('.loading-spinner');
                const content = document.querySelector('.content-loaded');
                return !loader && content && content.children.length > 0;
            },
            undefined,
            { timeout: 15000 },
        );

        // Content is now ready for extraction
    },
});
```
4. Timed Waits
As a last resort, use fixed delays (though less reliable):
```javascript
const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page }) => {
        // Wait a fixed 2 seconds
        await page.waitForTimeout(2000);

        // Proceed with extraction
    },
});
```
Handling Infinite Scroll and Lazy Loading
Many modern websites use infinite scroll to load content dynamically. Crawlee can simulate scrolling to trigger AJAX requests:
```javascript
const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, log, infiniteScroll, pushData }) => {
        // Use the built-in infinite scroll helper...
        await infiniteScroll({
            maxScrollHeight: 10000,
            waitForSecs: 2,
            scrollDownAndUp: true,
        });

        // ...or implement custom scrolling
        let previousHeight = 0;
        let currentHeight = await page.evaluate(() => document.body.scrollHeight);

        while (previousHeight < currentHeight) {
            // Scroll to the bottom
            await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));

            // Wait for new content
            await page.waitForTimeout(1000);
            await page.waitForLoadState('networkidle');

            previousHeight = currentHeight;
            currentHeight = await page.evaluate(() => document.body.scrollHeight);
            log.info(`Scrolled to ${currentHeight}px`);
        }

        // Extract all loaded content
        const items = await page.$$eval('.item', (elements) =>
            elements.map((el) => ({
                title: el.querySelector('h2')?.textContent,
                description: el.querySelector('p')?.textContent,
            })),
        );

        await pushData(items);
    },
});
```
Monitoring Network Requests
Just as you can monitor network requests directly in Puppeteer, you can intercept and analyze AJAX requests in Crawlee to extract data straight from the API responses:
```javascript
const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, log, pushData }) => {
        const apiData = [];

        // Intercept API responses
        page.on('response', async (response) => {
            const url = response.url();
            // Check if this is the API endpoint we're interested in
            if (url.includes('/api/products')) {
                try {
                    const json = await response.json();
                    apiData.push(...json.products);
                    log.info(`Captured API data: ${json.products.length} items`);
                } catch (e) {
                    log.error(`Failed to parse JSON from ${url}`);
                }
            }
        });

        // The initial navigation happens before the request handler runs,
        // so reload the page to replay its AJAX requests with the
        // listener attached
        await page.reload();

        // Wait for all requests to complete
        await page.waitForLoadState('networkidle');

        // Save the intercepted API data
        await pushData(apiData);
    },
});
```
Handling Click-Triggered AJAX Content
Some content only loads when users interact with the page:
```javascript
const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, log, pushData }) => {
        // Click the "Load More" button to trigger AJAX
        let hasMoreContent = true;
        let clickCount = 0;
        const maxClicks = 10;

        while (hasMoreContent && clickCount < maxClicks) {
            try {
                const loadMoreButton = await page.$('.load-more-button');
                if (!loadMoreButton) {
                    hasMoreContent = false;
                    break;
                }

                await loadMoreButton.click();
                clickCount++;

                // Wait for the new content to appear
                await page.waitForTimeout(1000);
                await page.waitForLoadState('networkidle');
                log.info(`Clicked "Load More" ${clickCount} times`);
            } catch (e) {
                log.info('No more content to load');
                hasMoreContent = false;
            }
        }

        // Extract all loaded items
        const items = await page.$$eval('.item', (elements) =>
            elements.map((el) => el.textContent),
        );

        await pushData(items);
    },
});
```
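When scrolling and a "Load More" button appear together, the built-in infiniteScroll helper can also do the clicking for you via its buttonSelector option, which clicks a matching button whenever one shows up during scrolling. A minimal sketch, with '.load-more-button' as a placeholder selector:

```javascript
const crawler = new PlaywrightCrawler({
    requestHandler: async ({ infiniteScroll }) => {
        // Scroll the page, clicking the button whenever it appears;
        // '.load-more-button' is a placeholder for the site's real selector
        await infiniteScroll({
            buttonSelector: '.load-more-button',
            waitForSecs: 2,
        });
    },
});
```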
Python Implementation
Crawlee for Python offers similar capabilities:
```python
import asyncio

from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    crawler = PlaywrightCrawler()

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        page = context.page

        # Wait for AJAX content
        await page.wait_for_selector('.ajax-loaded-content', state='visible', timeout=10000)

        # Wait for network idle
        await page.wait_for_load_state('networkidle')

        # Extract data
        data = await page.evaluate('''() => {
            return Array.from(document.querySelectorAll('.product')).map((el) => ({
                title: el.querySelector('.title')?.textContent,
                price: el.querySelector('.price')?.textContent,
            }));
        }''')

        await context.push_data(data)

    await crawler.run(['https://example.com/products'])


if __name__ == '__main__':
    asyncio.run(main())
```
Best Practices for AJAX Content Scraping
- Use specific selectors: Wait for unique elements that only appear after AJAX completes
- Combine strategies: Use both selector waiting and network idle detection
- Set reasonable timeouts: Balance between reliability and performance
- Handle errors gracefully: AJAX requests can fail; implement retry logic (see the sketch after this list)
- Respect rate limits: Don't overload servers with rapid requests
- Monitor network traffic: Understanding API endpoints can simplify data extraction
- Test thoroughly: AJAX behavior can vary across different pages and conditions
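On the error-handling point: Crawlee retries failed requests out of the box, so the simplest retry logic is to let the handler throw when expected content never appears. A minimal sketch, with an illustrative selector and retry count:

```javascript
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    // Retry each failed request up to 3 times before giving up
    maxRequestRetries: 3,
    requestHandler: async ({ page }) => {
        // If the AJAX content never appears, this throws, the request
        // is marked as failed, and Crawlee schedules a retry
        await page.waitForSelector('.ajax-results', { timeout: 15000 });
    },
    // Runs once a request has exhausted all of its retries
    failedRequestHandler: async ({ request, log }) => {
        log.error(`${request.url} failed after all retries.`);
    },
});
```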
When to Use CheerioCrawler vs Browser Crawlers
CheerioCrawler is faster but cannot handle AJAX content because it doesn't execute JavaScript. Use it only when:
- The website serves all content in initial HTML
- You've identified API endpoints and can call them directly (see the sketch after this list)
- Performance is critical and JavaScript execution isn't needed
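For the direct-API case, you often don't need HTML parsing at all: a JSON endpoint can be fetched with Crawlee's HttpCrawler and parsed directly. A minimal sketch, assuming a hypothetical endpoint that returns a JSON body with a products array:

```javascript
import { HttpCrawler } from 'crawlee';

const crawler = new HttpCrawler({
    requestHandler: async ({ request, body, log, pushData }) => {
        // Hypothetical endpoint: assumed to return JSON with a `products` array
        const { products } = JSON.parse(body.toString());
        log.info(`Fetched ${products.length} products from ${request.url}`);
        await pushData(products);
    },
});

await crawler.run(['https://example.com/api/products?page=1']);
```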
For AJAX-heavy websites, always use PlaywrightCrawler or PuppeteerCrawler, as they provide the full browser environment needed for JavaScript execution and dynamic content loading.
Conclusion
Crawlee excels at handling AJAX-loaded content through its browser automation crawlers, intelligent waiting mechanisms, and flexible API. By leveraging PlaywrightCrawler or PuppeteerCrawler with proper waiting strategies, you can reliably scrape even the most dynamic, JavaScript-heavy websites. The key is understanding how content loads on your target website and choosing the appropriate waiting strategy—whether that's waiting for specific selectors, network idle states, or custom conditions.
For additional control over browser automation and handling complex single-page applications, Crawlee's integration with Playwright and Puppeteer provides all the tools you need to successfully extract data from modern web applications.