What is the Best Way to Scrape React or Vue.js Websites with Crawlee?

Scraping React and Vue.js websites requires special consideration because these frameworks render content dynamically using JavaScript. Unlike traditional server-side rendered pages, single-page applications (SPAs) built with React or Vue.js load content asynchronously after the initial page load, making them challenging to scrape with simple HTTP requests.

Crawlee provides powerful tools specifically designed for scraping JavaScript-heavy websites through its PlaywrightCrawler and PuppeteerCrawler classes, which use real browser automation to execute JavaScript and wait for dynamic content to load.

Understanding the Challenge with React and Vue.js Websites

React and Vue.js applications differ from traditional websites in several key ways:

  • Client-side rendering: Content is generated in the browser using JavaScript rather than being sent from the server
  • Asynchronous data fetching: Data is often loaded via API calls after the initial page load
  • Dynamic DOM updates: The page structure changes as users interact with it
  • Virtual DOM: React and Vue use a virtual DOM that requires JavaScript execution to render actual HTML
  • Lazy loading: Components and data may load on-demand as users scroll or navigate

Because of these characteristics, standard HTTP-based scrapers like CheerioCrawler will only see an empty skeleton HTML page without the actual content.
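To see the difference concretely, here is a small heuristic (a sketch in plain JavaScript; the regex-based check is illustrative, not production-grade HTML parsing) that detects the empty SPA shell a plain HTTP fetch typically returns:

```javascript
// Heuristic: does this HTML look like an unrendered SPA shell?
// A CheerioCrawler fetching a React/Vue site typically receives markup
// like '<div id="root"></div>' plus script tags, with no visible text.
function looksLikeSpaShell(html) {
    const body = html.match(/<body[^>]*>([\s\S]*)<\/body>/i)?.[1] ?? html;
    // Strip scripts, then all remaining tags and whitespace;
    // if nothing is left, the server sent an empty shell.
    const visibleText = body
        .replace(/<script[\s\S]*?<\/script>/gi, '')
        .replace(/<[^>]+>/g, '')
        .trim();
    return visibleText.length === 0;
}
```

Running this against the HTML returned by a plain HTTP client for a React or Vue site usually yields `true`, which is a quick way to confirm you need a browser-based crawler.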

Choosing the Right Crawlee Crawler

Crawlee offers three main crawler types, but for React and Vue.js applications, you need one that can execute JavaScript:

PlaywrightCrawler (Recommended)

PlaywrightCrawler is the most powerful and recommended option for scraping modern JavaScript frameworks. It uses Playwright, which supports Chromium, Firefox, and WebKit browsers.

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, request, enqueueLinks }) => {
        console.log(`Processing: ${request.url}`);

        // Wait for React/Vue app to load
        await page.waitForSelector('.app-content', { timeout: 30000 });

        // Extract data after JavaScript has rendered
        const data = await page.evaluate(() => {
            return {
                title: document.querySelector('h1')?.textContent,
                items: Array.from(document.querySelectorAll('.item')).map(item => ({
                    name: item.querySelector('.name')?.textContent,
                    price: item.querySelector('.price')?.textContent
                }))
            };
        });

        console.log('Extracted data:', data);
    },
});

await crawler.run(['https://example-react-app.com']);

PuppeteerCrawler

PuppeteerCrawler is another solid choice. It uses Puppeteer, which primarily targets Chromium-based browsers and offers a slightly simpler API than Playwright.

import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    requestHandler: async ({ page, request }) => {
        // Wait for network to be idle (useful for SPAs)
        await page.waitForNetworkIdle({ timeout: 30000 });

        const data = await page.evaluate(() => {
            // Your extraction logic here
            return document.querySelector('.content')?.textContent;
        });
    },
});

Best Practices for Scraping React/Vue.js Websites

1. Wait for Content to Load Properly

The most critical aspect of scraping single-page applications is ensuring all content has loaded before extraction. Crawlee provides several strategies:

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page }) => {
        // Strategy 1: Wait for specific selector
        await page.waitForSelector('.data-loaded-indicator', {
            state: 'visible',
            timeout: 30000
        });

        // Strategy 2: Wait for network to be idle
        await page.waitForLoadState('networkidle');

        // Strategy 3: Wait for an app-specific readiness flag, if the site
        // sets one (window.__REACT_READY__ is a hypothetical example)
        await page.waitForFunction(() => {
            return window.__REACT_READY__ === true;
        });

        // Strategy 4: Custom wait time (use sparingly)
        await page.waitForTimeout(3000);
    },
});
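When none of these built-in signals fits, a generic polling helper can act as a last resort (a sketch in plain JavaScript, independent of Crawlee; `pollUntil` is a hypothetical name):

```javascript
// Poll an async condition until it returns true or the timeout elapses.
// A generic fallback when no selector or network signal is available.
async function pollUntil(condition, { timeoutMs = 30000, intervalMs = 250 } = {}) {
    const deadline = Date.now() + timeoutMs;
    while (Date.now() < deadline) {
        if (await condition()) return true;
        await new Promise((resolve) => setTimeout(resolve, intervalMs));
    }
    throw new Error(`Condition not met within ${timeoutMs} ms`);
}
```

Inside a requestHandler this could poll, for example, `() => page.$('.app-content').then(Boolean)`.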

2. Handle Dynamic Data Loading

React and Vue.js apps often fetch data from APIs after the initial render. Monitor network requests to understand when data loading completes:

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page }) => {
        // Crawlee has already navigated the page by the time requestHandler
        // runs, so register the response listener first, then reload to
        // capture the API call deterministically
        const apiDataPromise = page.waitForResponse(
            response => response.url().includes('/api/products') && response.status() === 200,
            { timeout: 30000 }
        );

        await page.reload();

        // Wait for the API call to complete
        await apiDataPromise;

        // Now extract the data
        const products = await page.$$eval('.product-card', cards =>
            cards.map(card => ({
                title: card.querySelector('.title')?.textContent,
                price: card.querySelector('.price')?.textContent
            }))
        );
    },
});
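Since the response predicate is a plain function of the response object, it can be factored out and unit-tested without a browser (a sketch; the `/api/products` path is carried over from the example above):

```javascript
// Matches a successful response from the example's products API.
// Works on any object exposing url() and status(), so a plain mock
// object is enough to test it without launching a browser.
const isProductsResponse = (response) =>
    response.url().includes('/api/products') && response.status() === 200;
```

In the crawler you can then pass it directly: `page.waitForResponse(isProductsResponse, { timeout: 30000 })`.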

3. Handle Infinite Scroll and Lazy Loading

Many React and Vue.js applications use infinite scroll to load content progressively:

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page }) => {
        // Function to scroll and wait for new content
        async function autoScroll() {
            await page.evaluate(async () => {
                await new Promise((resolve) => {
                    let totalHeight = 0;
                    const distance = 100;
                    const timer = setInterval(() => {
                        const scrollHeight = document.body.scrollHeight;
                        window.scrollBy(0, distance);
                        totalHeight += distance;

                        if (totalHeight >= scrollHeight) {
                            clearInterval(timer);
                            resolve();
                        }
                    }, 100);
                });
            });
        }

        // Scroll to load all content
        await autoScroll();

        // Wait for final content to render
        await page.waitForTimeout(2000);

        // Extract all loaded data
        const allData = await page.$$eval('.item', items =>
            items.map(item => item.textContent)
        );
    },
});
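One caveat with the loop above: on a true infinite feed, `scrollHeight` keeps growing as new items load, so the loop may never terminate. The termination logic can be modeled and tested outside the browser with a safety cap on steps (a sketch; `runScrollLoop` and the cap value are illustrative):

```javascript
// Simulate the scroll loop's termination logic: stop when the accumulated
// scroll distance reaches the (possibly growing) page height, or when a
// safety cap of steps is hit -- important for true infinite feeds.
function runScrollLoop(heights, distance = 100, maxSteps = 1000) {
    let totalHeight = 0;
    let steps = 0;
    while (steps < maxSteps) {
        const scrollHeight = heights(steps); // current document height
        totalHeight += distance;
        steps += 1;
        if (totalHeight >= scrollHeight) break;
    }
    return steps;
}
```

In the real `autoScroll`, the same cap translates to clearing the `setInterval` after a fixed number of ticks, so a feed that never stops growing cannot hang the crawler.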

4. Extract Data from React/Vue.js Component State

Sometimes it's easier to extract data directly from the framework's state rather than parsing the DOM:

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page }) => {
        await page.waitForLoadState('networkidle');

        // Extract data from React component props/state
        const reactData = await page.evaluate(() => {
            // Find React root element
            const rootElement = document.querySelector('#root');
            if (!rootElement) return null;

            // Access React internal properties (React 16+)
            const reactInternalKey = Object.keys(rootElement).find(
                key => key.startsWith('__reactFiber') || key.startsWith('__reactInternalInstance')
            );

            if (reactInternalKey) {
                const fiber = rootElement[reactInternalKey];
                // Navigate the fiber tree to find component state
                // This is implementation-specific
                return fiber?.memoizedProps?.data;
            }

            return null;
        });

        // Or extract from a Vue instance (internal, version-specific API)
        const vueData = await page.evaluate(() => {
            const app = document.querySelector('#app');
            // Vue 3 exposes component internals via the non-public
            // __vueParentComponent property on component root elements
            return app?.__vueParentComponent?.ctx?.data;
        });
    },
});
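The key lookup in the React branch is a plain object scan, so it can be pulled out and verified against a mock element (a sketch; real keys look like `__reactFiber$<random suffix>`):

```javascript
// Find the React-internal property key on a DOM node (or any object).
// Returns null when the node was not rendered by React.
function findReactKey(node) {
    return Object.keys(node).find(
        (key) => key.startsWith('__reactFiber') || key.startsWith('__reactInternalInstance'),
    ) ?? null;
}
```

Because these properties are private implementation details, treat a `null` result (or a changed key prefix after a React upgrade) as the expected failure mode and fall back to DOM parsing.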

5. Optimize Performance with Request Interception

Block unnecessary resources to speed up scraping:

const crawler = new PlaywrightCrawler({
    preNavigationHooks: [
        async ({ page }) => {
            // Block images, fonts, and other non-essential resources
            await page.route('**/*', (route) => {
                const resourceType = route.request().resourceType();
                if (['image', 'font', 'media'].includes(resourceType)) {
                    route.abort();
                } else {
                    route.continue();
                }
            });
        },
    ],
    requestHandler: async ({ page, request }) => {
        // Crawlee has already navigated to request.url at this point
        // Your extraction logic
    },
});
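The blocking decision is a pure function, which keeps the route handler trivial and makes the policy testable on its own (a sketch; the blocked set mirrors the example above):

```javascript
// Resource types to abort; everything else is allowed through.
const BLOCKED_TYPES = new Set(['image', 'font', 'media']);

const shouldBlock = (resourceType) => BLOCKED_TYPES.has(resourceType);
```

The route handler then reduces to feeding `route.request().resourceType()` into `shouldBlock` and calling `route.abort()` or `route.continue()` accordingly.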

Complete Example: Scraping a React E-commerce Site

Here's a comprehensive example that combines all best practices:

import { PlaywrightCrawler, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
    // Increase timeout for slow-loading SPAs
    navigationTimeoutSecs: 60,

    // Use headless browser
    headless: true,

    preNavigationHooks: [
        async ({ page }) => {
            // Block unnecessary resources. Note: blocking stylesheets can
            // break `state: 'visible'` waits on some sites; drop 'stylesheet'
            // from the list if elements never become visible
            await page.route('**/*', (route) => {
                const type = route.request().resourceType();
                if (['image', 'stylesheet', 'font'].includes(type)) {
                    route.abort();
                } else {
                    route.continue();
                }
            });
        },
    ],

    requestHandler: async ({ page, request, enqueueLinks }) => {
        console.log(`Processing: ${request.url}`);

        // Wait for React app to initialize
        await page.waitForSelector('[data-react-root]', { timeout: 30000 });

        // Wait for product grid to load
        await page.waitForSelector('.product-grid', { state: 'visible' });

        // Note: to reliably catch the initial API call with waitForResponse,
        // the listener must be registered before navigation (e.g. in a
        // preNavigationHook); here the selector waits above are sufficient

        // Additional wait for rendering
        await page.waitForTimeout(1000);

        // Extract product data
        const products = await page.$$eval('.product-card', cards => {
            return cards.map(card => ({
                title: card.querySelector('.product-title')?.textContent?.trim(),
                price: card.querySelector('.product-price')?.textContent?.trim(),
                rating: card.querySelector('.product-rating')?.textContent?.trim(),
                url: card.querySelector('a')?.href,
                inStock: !card.querySelector('.out-of-stock')
            }));
        });

        // Save to dataset
        await Dataset.pushData(products);

        // Find and enqueue pagination links
        await enqueueLinks({
            selector: '.pagination a',
            label: 'PRODUCTS',
        });

        console.log(`Extracted ${products.length} products`);
    },

    failedRequestHandler: async ({ request }) => {
        console.log(`Request ${request.url} failed multiple times`);
    },
});

// Start crawling
await crawler.run([
    'https://example-react-shop.com/products',
]);

// Export data
const dataset = await Dataset.open();
await dataset.exportToJSON('products');

Python Example with Crawlee for Python

If you're using Python, Crawlee for Python offers similar functionality:

from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext

async def main():
    crawler = PlaywrightCrawler(
        max_requests_per_crawl=50,
        headless=True,
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        page = context.page

        # Wait for React/Vue app to load
        await page.wait_for_selector('.app-content', timeout=30000)

        # Wait for network to be idle
        await page.wait_for_load_state('networkidle')

        # Extract data
        data = await page.evaluate("""
            () => {
                return Array.from(document.querySelectorAll('.item')).map(item => ({
                    title: item.querySelector('.title')?.textContent,
                    description: item.querySelector('.desc')?.textContent
                }));
            }
        """)

        # Push data to dataset
        await context.push_data(data)

        # Enqueue new links
        await context.enqueue_links(selector='a.next-page')

    await crawler.run(['https://example-vue-app.com'])

if __name__ == '__main__':
    import asyncio
    asyncio.run(main())

Debugging Tips

When scraping React or Vue.js applications, debugging is crucial:

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page }) => {
        // Take screenshots at different stages
        await page.screenshot({ path: 'before-wait.png' });

        await page.waitForSelector('.content');
        await page.screenshot({ path: 'after-wait.png' });

        // Log page content for inspection
        const html = await page.content();
        console.log('Page HTML length:', html.length);

        // Check if specific elements exist
        const elementExists = await page.$('.target-element') !== null;
        console.log('Target element exists:', elementExists);
    },

    // Run in non-headless mode to watch the browser
    headless: false,
});

Conclusion

Scraping React and Vue.js websites with Crawlee requires using browser automation through PlaywrightCrawler or PuppeteerCrawler. The key to success is implementing proper wait strategies to ensure JavaScript has executed and all dynamic content has loaded before extraction. By combining selector waits, network monitoring, and strategic timeouts, you can reliably extract data from even the most complex single-page applications.

Remember to always respect website terms of service, implement rate limiting, and handle errors gracefully to build robust and ethical web scrapers.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
