Does Crawlee work with modern JavaScript frameworks?

Yes, Crawlee works exceptionally well with modern JavaScript frameworks like React, Vue, Angular, and Next.js. Crawlee is designed specifically to handle JavaScript-heavy websites and single-page applications (SPAs), making it a strong choice for scraping modern web applications that rely on client-side rendering.

Understanding Crawlee's Framework Compatibility

Crawlee supports multiple browser automation libraries including Puppeteer, Playwright, and Cheerio, which allows it to handle both server-side rendered (SSR) and client-side rendered (CSR) applications. When working with JavaScript frameworks, Crawlee's browser-based crawlers (PuppeteerCrawler and PlaywrightCrawler) are particularly effective because they execute JavaScript just like a real browser would.
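
A quick way to see the difference (a minimal sketch; the URL is a placeholder) is to compare the static and browser-based crawlers side by side:

import { CheerioCrawler, PlaywrightCrawler } from 'crawlee';

// CheerioCrawler only parses the HTML the server returns: fast,
// but blind to anything rendered client-side
const staticCrawler = new CheerioCrawler({
    requestHandler: async ({ $, request }) => {
        console.log(`${request.url}: ${$('title').text()}`);
    },
});

// PlaywrightCrawler runs a real browser, so framework-rendered content
// is present in the DOM by the time the handler runs
const browserCrawler = new PlaywrightCrawler({
    requestHandler: async ({ page, request }) => {
        console.log(`${request.url}: ${await page.title()}`);
    },
});

await staticCrawler.run(['https://example.com']);
await browserCrawler.run(['https://example.com']);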

Why Crawlee Excels at Modern Frameworks

Modern JavaScript frameworks often use:

- Dynamic content loading via AJAX/Fetch API
- Virtual DOM for efficient updates
- Client-side routing without full page reloads
- Lazy loading of components and resources
- State management that affects content visibility

Crawlee handles all these patterns seamlessly through its browser automation capabilities.

Using Crawlee with React Applications

React applications often render content dynamically after the initial page load. Here's how to scrape a React-based website using Crawlee:

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, request, enqueueLinks }) => {
        // Wait for React components to render
        await page.waitForSelector('.react-component-class', {
            state: 'visible',
            timeout: 30000
        });

        // window.React is only set by development or CDN builds; bundled
        // production apps usually don't expose it, so the selector wait
        // above is the more reliable signal
        await page.waitForFunction(() => {
            return window.React !== undefined;
        });

        // Extract data from React-rendered elements
        const data = await page.evaluate(() => {
            const items = [];
            document.querySelectorAll('.product-card').forEach(card => {
                items.push({
                    title: card.querySelector('h2')?.textContent,
                    price: card.querySelector('.price')?.textContent,
                    description: card.querySelector('.description')?.textContent
                });
            });
            return items;
        });

        console.log(`Scraped ${data.length} items from ${request.url}`);

        // Enqueue links for pagination (common in React apps)
        await enqueueLinks({
            selector: 'a.pagination-link',
            label: 'LISTING',
        });
    },
    maxRequestsPerCrawl: 50,
});

await crawler.run(['https://example-react-app.com']);

Scraping Vue.js Applications

Vue.js applications use reactive data binding, so the content you want may only appear after the app has mounted and finished rendering. Here's an approach for Vue-based sites:

import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    requestHandler: async ({ page, request }) => {
        // Wait for Vue to mount
        await page.waitForFunction(() => {
            return window.__VUE__ !== undefined ||
                   document.querySelector('[data-v-app]') !== null;
        }, { timeout: 15000 });

        // Wait for a site-specific component root (the selector is
        // illustrative; Vue adds no generic data-v-component attribute)
        await page.waitForSelector('[data-v-component]');

        // Handle Vue Router navigation (only works if the site exposes
        // its router on window; by default it lives on the app instance)
        await page.evaluate(() => {
            if (window.$router) {
                window.$router.push('/products');
            }
        });

        // Wait for route transition to complete (page.waitForTimeout was
        // removed in recent Puppeteer versions, so sleep manually)
        await new Promise((resolve) => setTimeout(resolve, 2000));

        // Extract data from the rendered DOM (the old `#app.__vue__`
        // instance handle is a Vue 2 internal and is absent in Vue 3)
        const vueData = await page.evaluate(() => {
            return {
                items: Array.from(document.querySelectorAll('.vue-item')).map(el => ({
                    text: el.textContent.trim()
                }))
            };
        });

        console.log('Vue data:', vueData);
    },
    headless: true,
});

await crawler.run(['https://example-vue-app.com']);

Working with Angular Applications

Angular applications often use zone.js for change detection. When scraping single-page applications built with Angular, you need to wait for Angular to stabilize:

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, request, log }) => {
        // Wait for Angular to bootstrap
        await page.waitForFunction(() => {
            return window.getAllAngularTestabilities !== undefined;
        }, { timeout: 20000 });

        // Wait for Angular to be stable (all pending async operations done);
        // getAllAngularTestabilities() returns one testability per app root
        await page.waitForFunction(() => {
            const testabilities = window.getAllAngularTestabilities();
            if (testabilities && testabilities.length > 0) {
                return testabilities[0].isStable();
            }
            return false;
        }, { timeout: 30000 });

        // Now scrape the fully rendered content
        const data = await page.$$eval('.mat-card', cards => {
            return cards.map(card => ({
                title: card.querySelector('.mat-card-title')?.textContent,
                content: card.querySelector('.mat-card-content')?.textContent
            }));
        });

        log.info(`Extracted ${data.length} Angular components`);
    },
    launchContext: {
        launchOptions: {
            headless: true,
            args: ['--no-sandbox', '--disable-setuid-sandbox']
        }
    }
});

await crawler.run(['https://example-angular-app.com']);

Handling Next.js and Other SSR Frameworks

Next.js and similar frameworks use server-side rendering with client-side hydration. This hybrid approach requires a different strategy:

import { CheerioCrawler, PlaywrightCrawler } from 'crawlee';

// For static/SSR pages, Cheerio is faster
const cheerioCrawler = new CheerioCrawler({
    requestHandler: async ({ $, request, log }) => {
        // Next.js often renders initial content server-side
        const title = $('h1').text();
        const staticContent = $('.static-content').text();

        log.info(`SSR content: ${title}`);

        // The __NEXT_DATA__ script tag marks a Next.js (pages router) page:
        // client-side hydration will run on top of the SSR HTML
        const isNextJsPage = $('#__NEXT_DATA__').length > 0;

        if (isNextJsPage) {
            log.info('Switching to browser crawler for client-side content');
            // Enqueue for the browser-based crawler if needed
            // (see the handoff sketch below)
        }
    },
});
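
The handoff above is only sketched in a comment. One way to implement it (a sketch; the queue name 'browser-pages' is illustrative) is a named RequestQueue that the Cheerio crawler fills and the browser crawler consumes:

import { RequestQueue, CheerioCrawler } from 'crawlee';

// A dedicated queue for pages that turn out to need a real browser
const browserQueue = await RequestQueue.open('browser-pages');

const staticCrawler = new CheerioCrawler({
    requestHandler: async ({ $, request }) => {
        if ($('#__NEXT_DATA__').length > 0) {
            // Hand the page off to the browser-based crawler
            await browserQueue.addRequest({ url: request.url });
        }
    },
});

Passing requestQueue: browserQueue when constructing the browser crawler then makes it process exactly the pages that were flagged.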

// For pages with heavy client-side logic
const playwrightCrawler = new PlaywrightCrawler({
    requestHandler: async ({ page, request }) => {
        // Wait for Next.js hydration (window.next is set by the
        // pages-router runtime; App Router sites may not expose it)
        await page.waitForFunction(() => {
            return window.next !== undefined &&
                   window.next.router !== undefined;
        });

        // Wait for dynamic content
        await page.waitForSelector('[data-testid="dynamic-content"]');

        const data = await page.evaluate(() => {
            return {
                title: document.querySelector('h1')?.textContent,
                dynamicData: Array.from(
                    document.querySelectorAll('.dynamic-item')
                ).map(el => el.textContent)
            };
        });

        console.log('Next.js data:', data);
    },
});

// Run whichever crawler fits the target pages
await cheerioCrawler.run(['https://example-nextjs-app.com']);
await playwrightCrawler.run(['https://example-nextjs-app.com']);

Best Practices for Framework-Specific Scraping

1. Wait for Framework Initialization

Always wait for the framework to fully initialize before extracting data:

// Generic framework detection. Caveat: these globals only exist when the
// site exposes them (CDN or development builds); bundled production apps
// often expose none of them, so prefer waiting for a rendered element
await page.waitForFunction(() => {
    return window.React !== undefined ||     // React
           window.__VUE__ !== undefined ||   // Vue
           window.ng !== undefined ||        // Angular (dev mode)
           window.next !== undefined;        // Next.js
}, { timeout: 15000 });

2. Handle AJAX and API Calls

Modern frameworks fetch data asynchronously. You can monitor network requests to ensure all data has loaded:

const crawler = new PlaywrightCrawler({
    // Crawlee navigates before requestHandler runs, so attach the
    // listeners in a pre-navigation hook to catch every request
    preNavigationHooks: [async ({ page }) => {
        page.pendingApiRequests = 0; // ad-hoc counter attached to the page

        page.on('request', request => {
            if (request.url().includes('/api/')) {
                page.pendingApiRequests++;
            }
        });

        // 'requestfinished'/'requestfailed' also cover failed calls,
        // which a plain 'response' listener would miss
        const done = request => {
            if (request.url().includes('/api/')) {
                page.pendingApiRequests--;
            }
        };
        page.on('requestfinished', done);
        page.on('requestfailed', done);
    }],
    requestHandler: async ({ page }) => {
        // Poll the counter here in Node.js; page.waitForFunction() runs
        // inside the browser and cannot see variables defined out here
        const deadline = Date.now() + 30000;
        while (page.pendingApiRequests > 0 && Date.now() < deadline) {
            await new Promise(resolve => setTimeout(resolve, 100));
        }

        // Now scrape the data
    },
});

3. Handle Infinite Scroll and Lazy Loading

Many modern frameworks use infinite scroll. Crawlee ships an infiniteScroll() utility for exactly this, but the manual loop below shows what such a helper does under the hood:

const crawler = new PuppeteerCrawler({
    requestHandler: async ({ page }) => {
        // Crawlee has already navigated to the request URL by the time
        // this handler runs, so no page.goto() is needed here

        let previousHeight = 0;
        let currentHeight = await page.evaluate(() => document.body.scrollHeight);

        // Scroll until no more content loads
        while (previousHeight !== currentHeight) {
            previousHeight = currentHeight;

            // Scroll to bottom
            await page.evaluate(() => {
                window.scrollTo(0, document.body.scrollHeight);
            });

            // Wait for new content to load (manual sleep; page.waitForTimeout
            // was removed in recent Puppeteer versions)
            await new Promise(resolve => setTimeout(resolve, 2000));

            currentHeight = await page.evaluate(() => document.body.scrollHeight);
        }

        // Extract all loaded items
        const items = await page.$$eval('.item', elements => {
            return elements.map(el => el.textContent);
        });

        console.log(`Loaded ${items.length} items`);
    },
});

4. Handle Client-Side Routing

For SPAs with client-side routing, you might need to inject JavaScript or interact with the router:

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page }) => {
        // Crawlee has already loaded the start URL for us

        // Click on navigation link (triggers client-side route)
        await page.click('a[href="/products"]');

        // Wait for route transition
        await page.waitForURL('**/products');
        await page.waitForLoadState('networkidle');

        // Extract data from new route
        const products = await page.$$eval('.product', els =>
            els.map(el => ({
                name: el.querySelector('h3')?.textContent,
                price: el.querySelector('.price')?.textContent
            }))
        );

        console.log(`Found ${products.length} products`);
    },
});

Performance Optimization

When scraping framework-based sites at scale:

Use Request Interception

Reduce unnecessary resource loading. Block static assets such as images and fonts, but keep scripts and XHR/fetch requests, since the framework needs those to render the content you're after:

const crawler = new PlaywrightCrawler({
    preNavigationHooks: [async ({ page }) => {
        // Block images, fonts, and other non-essential resources
        await page.route('**/*', route => {
            const resourceType = route.request().resourceType();
            if (['image', 'font', 'media'].includes(resourceType)) {
                route.abort();
            } else {
                route.continue();
            }
        });
    }],
    requestHandler: async ({ page }) => {
        // Your scraping logic
    },
});

Implement Smart Waiting

Instead of fixed timeouts, wait for specific conditions:

// Wait for specific element rather than arbitrary timeout
await page.waitForSelector('.data-loaded-indicator');

// Or wait for network to be idle
await page.waitForLoadState('networkidle');
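
When a page can legitimately end up in more than one final state, race two waits so the crawler doesn't hang on the state that never arrives (a sketch; both selectors are illustrative):

// Resolve on whichever renders first: results or an empty-state marker.
// The .catch() keeps the losing wait's eventual timeout from surfacing
// as an unhandled rejection.
await Promise.race([
    page.waitForSelector('.results-list').catch(() => null),
    page.waitForSelector('.empty-state').catch(() => null),
]);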

Conclusion

Crawlee is fully compatible with modern JavaScript frameworks and provides robust tools for scraping React, Vue, Angular, Next.js, and other framework-based applications. By using PlaywrightCrawler or PuppeteerCrawler, you can execute JavaScript, wait for framework initialization, handle dynamic content, and extract data from even the most complex single-page applications.

The key to success is understanding how each framework renders content and using appropriate waiting strategies to ensure all data is loaded before extraction. With Crawlee's flexible API and powerful browser automation capabilities, you can build reliable scrapers for any modern web application.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
