Can Crawlee Handle Single-Page Applications (SPAs)?
Yes, Crawlee can effectively handle single-page applications (SPAs) using its browser-based crawlers: PlaywrightCrawler and PuppeteerCrawler. These crawlers are specifically designed to work with JavaScript-heavy websites where content is dynamically rendered on the client side, making them ideal for scraping React, Vue.js, Angular, and other modern SPA frameworks.
Understanding SPAs and Web Scraping Challenges
Single-page applications differ from traditional multi-page websites in several key ways:
- Dynamic Content Loading: Content is loaded via JavaScript after the initial page load
- Client-Side Routing: Navigation happens without full page reloads
- Asynchronous Data Fetching: Data is often loaded via AJAX/fetch requests
- Virtual DOM Updates: The DOM is updated dynamically without page refreshes
Traditional HTTP-based scrapers like CheerioCrawler cannot execute JavaScript and will only see the initial HTML shell, missing all dynamically loaded content. This is where Crawlee's browser-based crawlers excel.
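A quick way to confirm that a site really is an SPA is to fetch its raw HTML and check how little visible content exists before JavaScript runs. Here is a minimal sketch (the URL https://example-spa.com is a placeholder, and the empty-shell assumption is typical but not universal):

import { CheerioCrawler } from 'crawlee';

const spaChecker = new CheerioCrawler({
    async requestHandler({ $, request }) {
        // For a typical SPA the body is little more than <div id="root"></div>
        // plus script tags, so the visible text length is near zero
        const textLength = $('body').text().trim().length;
        console.log(`${request.url}: ${textLength} characters of visible text`);
    },
});

await spaChecker.run(['https://example-spa.com']);

If this prints a near-zero count while the page looks complete in a browser, you need one of the browser-based crawlers described below.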
Using PlaywrightCrawler for SPAs
PlaywrightCrawler is the recommended choice for scraping SPAs thanks to Playwright's modern API, auto-waiting, and cross-browser support. Here's a comprehensive example:
import { PlaywrightCrawler, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
    // Give slow SPAs more time to finish the initial navigation
    navigationTimeoutSecs: 60,

    async requestHandler({ page, request, enqueueLinks }) {
        console.log(`Processing: ${request.url}`);

        // Wait for SPA content to load
        // Option 1: Wait for a specific selector
        await page.waitForSelector('.product-list', { timeout: 10000 });

        // Option 2: Wait for the network to be idle
        await page.waitForLoadState('networkidle');

        // Option 3: Wait for a fixed amount of time (use as a last resort)
        await page.waitForTimeout(2000);

        // Extract data after JavaScript has rendered the content
        const data = await page.evaluate(() => {
            return Array.from(document.querySelectorAll('.product-item')).map((item) => ({
                title: item.querySelector('h2')?.textContent?.trim(),
                price: item.querySelector('.price')?.textContent?.trim(),
                description: item.querySelector('.description')?.textContent?.trim(),
            }));
        });

        // Save extracted data
        await Dataset.pushData(data);

        // Handle SPA pagination/navigation
        await enqueueLinks({
            selector: 'a.next-page',
            transformRequestFunction: (req) => {
                req.userData = { ...req.userData, pageType: 'listing' };
                return req;
            },
        });

        // Trigger SPA navigation if needed
        const loadMoreButton = await page.$('button.load-more');
        if (loadMoreButton) {
            await loadMoreButton.click();
            await page.waitForLoadState('networkidle');
            // Extract additional content after the click
        }
    },

    // Handle failed requests (note: the error is the second argument)
    failedRequestHandler({ request }, error) {
        console.error(`Request ${request.url} failed: ${error.message}`);
    },
});

// Start with SPA URLs
await crawler.run([
    'https://example-spa.com/products',
    'https://example-spa.com/categories',
]);
Using PuppeteerCrawler for SPAs
PuppeteerCrawler is another excellent option for handling SPAs, and it works much like crawling single-page applications with Puppeteer directly:
import { PuppeteerCrawler, Dataset } from 'crawlee';

const crawler = new PuppeteerCrawler({
    launchContext: {
        launchOptions: {
            headless: true,
            args: ['--no-sandbox', '--disable-setuid-sandbox'],
        },
    },

    async requestHandler({ page, request, enqueueLinks }) {
        // Wait for the SPA to initialize
        await page.waitForSelector('[data-spa-ready]', { timeout: 15000 });

        // Scroll to trigger lazy loading (common in SPAs)
        await page.evaluate(async () => {
            await new Promise((resolve) => {
                let totalHeight = 0;
                const distance = 100;
                const timer = setInterval(() => {
                    const scrollHeight = document.body.scrollHeight;
                    window.scrollBy(0, distance);
                    totalHeight += distance;
                    if (totalHeight >= scrollHeight) {
                        clearInterval(timer);
                        resolve();
                    }
                }, 100);
            });
        });

        // Extract data from the SPA
        const items = await page.$$eval('.item', (elements) => {
            return elements.map((el) => ({
                id: el.getAttribute('data-id'),
                name: el.querySelector('.name')?.textContent,
                status: el.querySelector('.status')?.textContent,
            }));
        });

        await Dataset.pushData({ url: request.url, items });

        // Enqueue links found in the SPA
        await enqueueLinks({
            selector: 'a[href^="/"]',
            baseUrl: 'https://example-spa.com',
        });
    },
});
await crawler.run(['https://example-spa.com']);
Handling SPA-Specific Scenarios
Client-Side Routing
SPAs use client-side routing where URLs change without page reloads. Crawlee handles this effectively:
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ page, request }) {
        // Log requests to track AJAX calls (only requests made after this
        // point are intercepted, since the initial navigation already ran)
        await page.route('**/*', (route) => {
            console.log(`Request: ${route.request().url()}`);
            route.continue();
        });

        // Click on an SPA navigation link
        const navLink = await page.$('a[data-spa-link="/about"]');
        if (navLink) {
            // Wait for the SPA route change
            await Promise.all([
                page.waitForFunction(() => window.location.pathname === '/about'),
                navLink.click(),
            ]);

            // Extract data from the new view
            const content = await page.textContent('.main-content');
            console.log('SPA navigated to:', page.url(), '-', content?.slice(0, 80));
        }
    },
});
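One more routing detail worth knowing: older SPAs route via URL fragments (e.g. /#/about), and Crawlee strips fragments when computing a request's unique key, so hash routes would all collapse into a single request by default. The keepUrlFragment request option preserves them; a short sketch with placeholder URLs:

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ page, request }) {
        console.log('Rendering route:', request.url);
    },
});

// keepUrlFragment makes '#/about' and '#/contact' distinct requests
await crawler.run([
    { url: 'https://example-spa.com/#/about', keepUrlFragment: true },
    { url: 'https://example-spa.com/#/contact', keepUrlFragment: true },
]);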
Infinite Scroll and Lazy Loading
Many SPAs implement infinite scroll, which requires special handling:
import { PlaywrightCrawler, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ page, request }) {
        const allItems = [];
        let previousHeight = 0;
        let noNewContentCount = 0;

        // Keep scrolling until no new content loads three times in a row
        while (noNewContentCount < 3) {
            // Extract the items currently in the DOM
            const items = await page.$$eval('.item', (els) =>
                els.map((el) => ({
                    title: el.querySelector('h3')?.textContent,
                    url: el.querySelector('a')?.href,
                })),
            );
            allItems.push(...items);

            // Scroll to the bottom
            const currentHeight = await page.evaluate(() => {
                window.scrollTo(0, document.body.scrollHeight);
                return document.body.scrollHeight;
            });

            // Wait for potential new content
            await page.waitForTimeout(1500);

            if (currentHeight === previousHeight) {
                noNewContentCount++;
            } else {
                noNewContentCount = 0;
                previousHeight = currentHeight;
            }
        }

        // Remove duplicates (items are re-read on every pass) and save
        const uniqueItems = Array.from(
            new Map(allItems.map((item) => [item.url, item])).values(),
        );
        await Dataset.pushData({ url: request.url, items: uniqueItems });
    },
});
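Crawlee also ships an infiniteScroll helper in its Playwright utilities that can replace the hand-rolled loop above; it keeps scrolling until the page height stops growing or a timeout elapses:

import { PlaywrightCrawler, playwrightUtils, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ page, request }) {
        // Scroll until no new content appears or 30 seconds have passed
        await playwrightUtils.infiniteScroll(page, { timeoutSecs: 30 });

        const items = await page.$$eval('.item', (els) =>
            els.map((el) => ({
                title: el.querySelector('h3')?.textContent,
                url: el.querySelector('a')?.href,
            })),
        );
        await Dataset.pushData({ url: request.url, items });
    },
});

An equivalent puppeteerUtils.infiniteScroll is available for PuppeteerCrawler.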
Waiting for AJAX Requests
SPAs frequently make AJAX requests to load their data. You can wait for specific requests to finish, much as you would when handling AJAX requests with Puppeteer directly:
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ page, request }) {
        // Wait for a specific API endpoint. Note: if the call fires during
        // the initial page load, it may have completed before this handler
        // runs; trigger the action (click, scroll) after this line, or
        // capture responses in a pre-navigation hook instead.
        const apiResponse = await page.waitForResponse(
            (response) => response.url().includes('/api/products') && response.status() === 200,
            { timeout: 10000 },
        );

        // Get the JSON data from the API response
        const apiData = await apiResponse.json();
        console.log('API returned:', apiData);

        // Or wait for all network activity to settle
        await page.waitForLoadState('networkidle');

        // Extract the rendered content
        const renderedData = await page.textContent('.product-container');
    },
});
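When the payload you need arrives as JSON, it is often simpler to capture the API response directly than to parse the rendered DOM. A minimal sketch, assuming a hypothetical /api/products endpoint; the listener is registered in a pre-navigation hook so responses fired during the initial page load are not missed:

import { PlaywrightCrawler, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
    preNavigationHooks: [
        async ({ page }) => {
            // Capture matching API responses as they arrive
            page.on('response', async (response) => {
                if (response.url().includes('/api/products') && response.ok()) {
                    const payload = await response.json().catch(() => null);
                    if (payload) await Dataset.pushData(payload);
                }
            });
        },
    ],
    async requestHandler({ page }) {
        // Let the SPA finish its initial round of API calls
        await page.waitForLoadState('networkidle');
    },
});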
Python: Crawlee for Python with SPAs
Crawlee for Python also supports SPA scraping through its Playwright integration:
import asyncio
from datetime import timedelta

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext


async def main():
    crawler = PlaywrightCrawler(
        max_requests_per_crawl=50,
        # Give slow SPAs more time per request
        request_handler_timeout=timedelta(seconds=60),
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        page = context.page

        # Wait for the SPA to load
        await page.wait_for_selector('.spa-content', timeout=10000)
        await page.wait_for_load_state('networkidle')

        # Extract data
        data = await page.evaluate('''() => {
            return Array.from(document.querySelectorAll('.item')).map(item => ({
                title: item.querySelector('h2')?.textContent,
                price: item.querySelector('.price')?.textContent
            }));
        }''')

        # Save to the dataset
        await context.push_data({'url': context.request.url, 'items': data})

        # Handle SPA pagination
        next_button = await page.query_selector('button.next')
        if next_button:
            await next_button.click()
            await page.wait_for_load_state('networkidle')

        # Enqueue the "new" URLs after SPA navigation
        await context.enqueue_links(selector='a.item-link')

    await crawler.run(['https://example-spa.com'])


if __name__ == '__main__':
    asyncio.run(main())
Best Practices for Scraping SPAs with Crawlee
1. Choose the Right Wait Strategy
Different SPAs require different waiting strategies:
// Wait for a specific selector
await page.waitForSelector('.content-loaded');

// Wait for the network to be idle
await page.waitForLoadState('networkidle');

// Wait for a custom condition
await page.waitForFunction(() => window.appReady === true);

// Combine multiple conditions
await Promise.all([
    page.waitForSelector('.header'),
    page.waitForLoadState('domcontentloaded'),
    page.waitForFunction(() => document.readyState === 'complete'),
]);
2. Handle Browser Context Efficiently
Crawlee pools and reuses browser instances automatically; configure the launch context once and cap the concurrency:

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    launchContext: {
        // Use the installed Google Chrome instead of the bundled Chromium
        useChrome: true,
        launchOptions: {
            headless: true,
        },
    },

    // Crawlee reuses browsers from its pool; this caps how many
    // requests are processed in parallel
    maxConcurrency: 10,

    async requestHandler({ page, request }) {
        // Your SPA scraping logic
    },
});
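If the SPA keeps users logged in, Crawlee's session pool can rotate identities and persist cookies between requests. A brief sketch (the /login redirect check is a placeholder for whatever signals a dead session on your target site):

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    // Rotate sessions and keep each session's cookies across requests
    useSessionPool: true,
    persistCookiesPerSession: true,

    async requestHandler({ page, session }) {
        // If the SPA bounced us to a login page, retire this session
        // so Crawlee stops reusing its cookies
        if (page.url().includes('/login')) {
            session?.retire();
        }
    },
});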
3. Monitor and Debug SPA Behavior
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ page, request }) {
        // Forward console messages from the browser
        page.on('console', (msg) => console.log('Browser log:', msg.text()));

        // Log network traffic
        page.on('request', (req) => console.log('Request:', req.url()));
        page.on('response', (res) => console.log('Response:', res.url(), res.status()));

        // Take a screenshot for debugging
        await page.screenshot({ path: `screenshot-${Date.now()}.png` });
    },
});
Performance Considerations
Browser-based crawling is more resource-intensive than HTTP-only crawling. Optimize performance with these settings:
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    // Limit concurrent browser instances
    maxConcurrency: 5,

    // Set reasonable timeouts
    navigationTimeoutSecs: 30,
    requestHandlerTimeoutSecs: 60,

    launchContext: {
        launchOptions: {
            // Disable unnecessary browser features
            args: [
                '--disable-dev-shm-usage',
                '--disable-gpu',
                '--disable-features=IsolateOrigins,site-per-process',
                '--no-sandbox',
            ],
        },
    },

    // Block unnecessary resources before each navigation
    preNavigationHooks: [
        async ({ page }) => {
            await page.route('**/*', (route) => {
                const resourceType = route.request().resourceType();
                if (['image', 'font', 'media'].includes(resourceType)) {
                    route.abort();
                } else {
                    route.continue();
                }
            });
        },
    ],
});
When to Use Browser Crawlers vs HTTP Crawlers
While Crawlee's browser-based crawlers are excellent for SPAs, consider the trade-offs:
Use PlaywrightCrawler or PuppeteerCrawler when:
- The website relies heavily on JavaScript for content rendering
- Client-side routing is used for navigation
- Content is loaded dynamically via AJAX/fetch
- You need to interact with the page (clicking, scrolling, form submission)
- The site implements lazy loading or infinite scroll

Use CheerioCrawler when:
- Content is server-side rendered
- You need maximum speed and minimal resource usage
- The website doesn't require JavaScript execution
- You're scraping large-scale static content
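When a site mixes both kinds of pages, you can run the two crawler types side by side. A minimal sketch with hypothetical URL patterns (a server-rendered /blog section and a JavaScript-driven /app section):

import { CheerioCrawler, PlaywrightCrawler } from 'crawlee';

// Cheap HTTP crawling for the server-rendered section
const staticCrawler = new CheerioCrawler({
    async requestHandler({ $, request }) {
        console.log(request.url, '->', $('title').text());
    },
});

// A full browser for the SPA section
const spaCrawler = new PlaywrightCrawler({
    async requestHandler({ page, request }) {
        console.log(request.url, '->', await page.title());
    },
});

await staticCrawler.run(['https://example.com/blog']);
await spaCrawler.run(['https://example.com/app']);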
Handling Common SPA Patterns
React Applications
React apps often use data attributes and component lifecycles:
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ page, request }) {
        // Wait for React to render into the root element
        await page.waitForFunction(() => {
            const root = document.querySelector('#root');
            return root && root.children.length > 0;
        });

        // Wait for data to load (React apps often show a loading state first)
        await page.waitForSelector('[data-testid="content-loaded"]');

        // Extract data from React components
        const data = await page.evaluate(() => {
            return window.__REACT_DATA__ || {}; // Some apps expose data globally
        });
    },
});
Vue.js Applications
Vue.js apps can be detected and scraped effectively:
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ page, request }) {
        // Wait for Vue to mount (this global is present in many Vue apps)
        await page.waitForFunction(() => window.__VUE__ !== undefined);

        // Wait for v-cloak attributes to be removed (common Vue pattern)
        await page.waitForFunction(() => {
            return !document.querySelector('[v-cloak]');
        });

        // Extract data
        const vueData = await page.evaluate(() => {
            return window.__INITIAL_STATE__; // Common pattern in Vue SSR apps
        });
    },
});
Angular Applications
Angular apps have their own loading indicators and lifecycle:
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ page, request }) {
        // Wait for Angular to bootstrap and become stable
        await page.waitForFunction(() => {
            return window.getAllAngularTestabilities !== undefined &&
                window.getAllAngularTestabilities()[0]?.isStable();
        });

        // Wait for loading indicators to disappear
        await page.waitForSelector('.loading-spinner', { state: 'hidden' });

        // Extract data from Angular components
        const data = await page.$$eval('[data-component]', (elements) => {
            return elements.map((el) => ({
                component: el.getAttribute('data-component'),
                content: el.textContent,
            }));
        });
    },
});
Troubleshooting SPA Scraping
Content Not Loading
If content doesn't load, try multiple wait strategies:
async requestHandler({ page, request }) {
    try {
        // Try the primary selector with a short timeout
        await page.waitForSelector('.main-content', { timeout: 5000 });
    } catch (error) {
        // Fallback: wait for the network to go idle
        await page.waitForLoadState('networkidle');
        // If there is still no content, wait a little longer
        await page.waitForTimeout(3000);
    }

    // Verify the content actually loaded
    const hasContent = await page.$('.main-content');
    if (!hasContent) {
        throw new Error('Content failed to load');
    }
}
Handling Navigation Timeouts
As with handling timeouts in Puppeteer, implement robust timeout handling. Note that Crawlee performs the navigation itself before your requestHandler runs, so navigation options belong in the crawler configuration and pre-navigation hooks rather than in a manual page.goto() call:
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    navigationTimeoutSecs: 60,
    requestHandlerTimeoutSecs: 120,

    // Tune the built-in navigation instead of calling page.goto() yourself
    preNavigationHooks: [
        async (_crawlingContext, gotoOptions) => {
            gotoOptions.waitUntil = 'domcontentloaded';
        },
    ],

    async requestHandler({ page, request }) {
        try {
            // Extra waits after navigation can still time out on slow SPAs
            await page.waitForLoadState('networkidle', { timeout: 30000 });
        } catch (error) {
            console.log('Network never went idle, continuing anyway...');
            // The page might still be usable
        }
    },
});
Conclusion
Crawlee is exceptionally well-suited for scraping single-page applications through its PlaywrightCrawler and PuppeteerCrawler implementations. These crawlers provide full JavaScript execution, allowing you to interact with SPAs just as a real user would. By properly configuring wait strategies, handling client-side routing, and implementing efficient resource management, you can reliably extract data from even the most complex modern web applications.
The key to success with SPA scraping is understanding the application's behavior, using appropriate wait conditions, and leveraging Crawlee's powerful features for request management, data extraction, and error handling. Whether you're working with React, Vue.js, Angular, or any other modern framework, Crawlee provides the tools necessary to effectively scrape dynamic content.