How to crawl a single page application (SPA) using Puppeteer?

Single Page Applications (SPAs) present unique challenges for web scraping because they dynamically load content using JavaScript, making traditional scraping methods ineffective. Puppeteer solves this by controlling a real browser instance, allowing you to scrape fully rendered content.

Basic SPA Crawling Example

Here's a complete basic example of crawling an SPA:

const puppeteer = require('puppeteer');

async function crawlSPA() {
    const browser = await puppeteer.launch({
        headless: false, // Set to true for production
        args: ['--no-sandbox', '--disable-setuid-sandbox']
    });

    const page = await browser.newPage();

    try {
        // Navigate to the SPA
        await page.goto('https://your-spa-url.com', {
            waitUntil: 'networkidle2', // Wait for network to be idle
            timeout: 30000
        });

        // Wait for dynamic content to load
        await page.waitForSelector('.dynamic-content', { timeout: 10000 });

        // Extract data from the rendered page
        const data = await page.evaluate(() => {
            const items = [];
            document.querySelectorAll('.item').forEach(item => {
                items.push({
                    title: item.querySelector('.title')?.textContent?.trim(),
                    price: item.querySelector('.price')?.textContent?.trim(),
                    url: item.querySelector('a')?.href
                });
            });
            return items;
        });

        console.log('Scraped data:', data);
        return data;

    } catch (error) {
        console.error('Error crawling SPA:', error);
    } finally {
        await browser.close();
    }
}

crawlSPA();

Advanced SPA Crawling Techniques

1. Handling AJAX Requests

Wait for specific network requests to complete before scraping. Note that waitForResponse() only matches responses that arrive after it is called, so start the wait before triggering the navigation or click that fires the request:

async function waitForAjaxRequests(page) {
    // Wait for specific API calls
    await page.waitForResponse(response => 
        response.url().includes('/api/data') && response.status() === 200
    );

    // Or wait for multiple requests
    const responses = await Promise.all([
        page.waitForResponse(resp => resp.url().includes('/api/users')),
        page.waitForResponse(resp => resp.url().includes('/api/products'))
    ]);
}

2. Infinite Scroll Handling

Many SPAs use infinite scroll for loading content:

async function handleInfiniteScroll(page) {
    let previousHeight = 0;
    let currentHeight = await page.evaluate(() => document.body.scrollHeight);

    while (currentHeight > previousHeight) {
        previousHeight = currentHeight;

        // Scroll to bottom
        await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));

        // Wait for new content to load (page.waitForTimeout was removed in
        // recent Puppeteer versions, so use a plain delay)
        await new Promise(resolve => setTimeout(resolve, 2000));

        // Check if more content loaded
        currentHeight = await page.evaluate(() => document.body.scrollHeight);
    }
}

3. Client-Side Routing Navigation

Navigate through SPA routes without full page reloads:

async function navigateSPARoutes(page) {
    // Click navigation links
    await page.click('a[href="/products"]');

    // Wait for route change
    await page.waitForFunction(
        () => window.location.pathname === '/products'
    );

    // Wait for new content
    await page.waitForSelector('.product-list');
}

Key Considerations for SPA Crawling

1. Wait Strategies

  • waitForSelector(): Wait for specific elements
  • waitForFunction(): Wait for custom conditions
  • waitForResponse(): Wait for API calls
  • waitForNavigation(): Wait for page transitions
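
These helpers all poll or subscribe under the hood. As a mental model (a standalone sketch, not Puppeteer code), waitForFunction() amounts to a polling loop like this:

```javascript
// Minimal polling helper mirroring the semantics of page.waitForFunction():
// repeatedly evaluate a predicate until it returns a truthy value or the
// timeout elapses.
async function waitFor(predicate, { timeout = 5000, interval = 100 } = {}) {
    const deadline = Date.now() + timeout;
    while (Date.now() < deadline) {
        const result = await predicate();
        if (result) return result;
        await new Promise(resolve => setTimeout(resolve, interval));
    }
    throw new Error(`waitFor: condition not met within ${timeout} ms`);
}

// Example: resolves once the flag flips to true
let ready = false;
setTimeout(() => { ready = true; }, 200);
waitFor(() => ready).then(() => console.log('condition met'));
```

Puppeteer's real implementation evaluates the predicate inside the browser context, but the timeout-and-retry shape is the same, which is why choosing a predicate that becomes truthy only when your data is actually rendered is the key decision.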

2. Performance Optimization

const page = await browser.newPage();

// Disable images and CSS for faster loading
await page.setRequestInterception(true);
page.on('request', (req) => {
    if (req.resourceType() === 'stylesheet' || req.resourceType() === 'image') {
        req.abort();
    } else {
        req.continue();
    }
});

3. Error Handling

async function robustSPACrawl(url) {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();

    try {
        // Register error listeners before navigating so failures during
        // page load are captured
        page.on('pageerror', error => {
            console.log('Page error:', error.message);
        });

        page.on('requestfailed', request => {
            console.log('Request failed:', request.url());
        });

        await page.goto(url, { waitUntil: 'networkidle2' });

        // Your scraping logic here

    } catch (error) {
        console.error('Crawling failed:', error);
    } finally {
        await browser.close();
    }
}

Common SPA Patterns

React Applications

// Wait for React components to mount (note: window.React and the
// data-reactroot attribute only exist in some builds, and React 17+
// no longer adds data-reactroot; waiting for an app-specific selector
// is more reliable)
await page.waitForFunction(() => 
    window.React && document.querySelector('[data-reactroot]')
);

Vue.js Applications

// Wait for Vue (window.Vue is only set when Vue is loaded via a script
// tag; bundled apps usually don't expose it, so prefer waiting for an
// app-specific selector)
await page.waitForFunction(() => window.Vue);

Angular Applications

// Wait for Angular to bootstrap (getAllAngularRootElements is a debug
// API that is only available in development mode)
await page.waitForFunction(() => 
    window.getAllAngularRootElements().length > 0
);

Best Practices

  1. Use appropriate wait conditions based on your target SPA's loading patterns
  2. Implement retry logic for unreliable network conditions
  3. Monitor network requests to understand when data loading completes
  4. Handle authentication if the SPA requires login
  5. Respect rate limits and implement delays between requests
  6. Use headless mode in production for better performance
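
For point 2, retry logic can be kept separate from the crawl itself. Here's a small generic wrapper (a hypothetical helper, not part of Puppeteer) that retries any async task with a fixed delay between attempts:

```javascript
// Hypothetical retry wrapper: runs an async task, retrying with a fixed
// delay on failure, up to a maximum number of attempts. Rethrows the last
// error if every attempt fails.
async function withRetries(task, { attempts = 3, delayMs = 1000 } = {}) {
    let lastError;
    for (let attempt = 1; attempt <= attempts; attempt++) {
        try {
            return await task(attempt);
        } catch (error) {
            lastError = error;
            if (attempt < attempts) {
                await new Promise(resolve => setTimeout(resolve, delayMs));
            }
        }
    }
    throw lastError;
}

// Usage with the crawler from the basic example above:
// const data = await withRetries(() => crawlSPA(), { attempts: 3, delayMs: 5000 });
```

Because each retry launches a fresh task, this pattern also doubles as a simple way to satisfy point 5: the delay between attempts naturally spaces out requests.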

By following these techniques, you can effectively crawl even the most complex SPAs using Puppeteer, ensuring you capture all dynamically loaded content.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
