Single Page Applications (SPAs) are hard to scrape with plain HTTP clients because most of their content is rendered by JavaScript after the initial HTML arrives, so the raw response is often little more than an empty shell. Puppeteer works around this by driving a real browser instance, which lets you scrape the fully rendered DOM.
Basic SPA Crawling Example
Here's a minimal end-to-end example of crawling an SPA:
const puppeteer = require('puppeteer');

async function crawlSPA() {
  const browser = await puppeteer.launch({
    headless: false, // Set to true for production
    args: ['--no-sandbox', '--disable-setuid-sandbox']
  });
  const page = await browser.newPage();
  try {
    // Navigate to the SPA
    await page.goto('https://your-spa-url.com', {
      waitUntil: 'networkidle2', // Resolve once at most 2 connections stay open for 500 ms
      timeout: 30000
    });
    // Wait for dynamic content to load
    await page.waitForSelector('.dynamic-content', { timeout: 10000 });
    // Extract data from the rendered page (this callback runs inside the browser)
    const data = await page.evaluate(() => {
      const items = [];
      document.querySelectorAll('.item').forEach(item => {
        items.push({
          title: item.querySelector('.title')?.textContent?.trim(),
          price: item.querySelector('.price')?.textContent?.trim(),
          url: item.querySelector('a')?.href
        });
      });
      return items;
    });
    console.log('Scraped data:', data);
    return data;
  } catch (error) {
    console.error('Error crawling SPA:', error);
  } finally {
    // Always release the browser, even when scraping fails
    await browser.close();
  }
}

crawlSPA();
Advanced SPA Crawling Techniques
1. Handling AJAX Requests
Wait for specific network requests to complete before scraping:
// Start this wait together with the action that triggers the requests;
// a response that arrives before the wait is registered is missed.
async function waitForAjaxRequests(page) {
  // Wait for a specific API call to succeed
  await page.waitForResponse(response =>
    response.url().includes('/api/data') && response.status() === 200
  );
  // Or wait for several requests in parallel
  const responses = await Promise.all([
    page.waitForResponse(resp => resp.url().includes('/api/users')),
    page.waitForResponse(resp => resp.url().includes('/api/products'))
  ]);
  return responses;
}
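One way to avoid missing a fast response is to register the wait and trigger the request in the same `Promise.all`, then read the JSON payload directly instead of scraping the rendered DOM. The sketch below assumes a hypothetical helper name (`clickAndCaptureJson`) and placeholder arguments (`triggerSelector`, `urlFragment`); it takes an already opened Puppeteer `page` and is only an illustration of the pattern.

```javascript
// Sketch: capture the JSON body of the API call that a click triggers.
// `triggerSelector` and `urlFragment` are hypothetical placeholders.
async function clickAndCaptureJson(page, triggerSelector, urlFragment, timeout = 10000) {
  const [response] = await Promise.all([
    // Registered first, so the response cannot slip past before we listen
    page.waitForResponse(
      resp => resp.url().includes(urlFragment) && resp.ok(),
      { timeout }
    ),
    page.click(triggerSelector),
  ]);
  // Parse the response body as JSON
  return response.json();
}
```

A call such as `await clickAndCaptureJson(page, '#load-more', '/api/data')` would return the parsed payload, which is often easier to work with than the markup rendered from it.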
2. Infinite Scroll Handling
Many SPAs use infinite scroll for loading content:
async function handleInfiniteScroll(page) {
  let previousHeight = 0;
  let currentHeight = await page.evaluate('document.body.scrollHeight');
  while (currentHeight > previousHeight) {
    previousHeight = currentHeight;
    // Scroll to bottom
    await page.evaluate('window.scrollTo(0, document.body.scrollHeight)');
    // Wait for new content to load (page.waitForTimeout() was removed in Puppeteer v22)
    await new Promise(resolve => setTimeout(resolve, 2000));
    // Check if more content loaded
    currentHeight = await page.evaluate('document.body.scrollHeight');
  }
}
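Feeds that never end would keep a height-based loop running indefinitely, so it can help to cap the number of scroll rounds and to stop on item count rather than page height. The following is a minimal sketch under those assumptions; the function name `scrollUntilNoNewItems` and the `maxRounds` and `delayMs` options are hypothetical, and `itemSelector` must match the target site's markup.

```javascript
// Small promise-based delay helper
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

// Sketch: scroll until no new items appear or maxRounds is reached.
// Returns the number of items found at the end.
async function scrollUntilNoNewItems(page, itemSelector, { maxRounds = 20, delayMs = 1000 } = {}) {
  let count = await page.$$eval(itemSelector, els => els.length);
  for (let round = 0; round < maxRounds; round++) {
    await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
    await sleep(delayMs); // give the SPA time to fetch and render the next batch
    const next = await page.$$eval(itemSelector, els => els.length);
    if (next === count) break; // nothing new appeared, assume the end of the list
    count = next;
  }
  return count;
}
```

A fixed delay is a simplification: on slow connections a batch may arrive after the pause and be mistaken for the end of the list, so the delay should be tuned to the target site.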
3. Client-Side Routing Navigation
Navigate through SPA routes without full page reloads:
async function navigateSPARoutes(page) {
  // Click navigation links
  await page.click('a[href="/products"]');
  // Wait for route change
  await page.waitForFunction(
    () => window.location.pathname === '/products'
  );
  // Wait for new content
  await page.waitForSelector('.product-list');
}
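The same click, wait-for-route, wait-for-content sequence can be repeated over a list of routes. The sketch below is one possible generalization: `crawlRoutes`, the route object shape (`linkSelector`, `path`, `readySelector`) and the `extract` callback are all hypothetical names chosen for illustration. It passes the expected path into the browser as an argument to `waitForFunction`, since the callback runs in the page and cannot see Node.js variables.

```javascript
// Sketch: visit several client-side routes in one page and collect data per route.
async function crawlRoutes(page, routes, extract) {
  const results = {};
  for (const { linkSelector, path, readySelector } of routes) {
    // Trigger the client-side navigation
    await page.click(linkSelector);
    // `path` is serialized and handed to the in-page function as `expected`
    await page.waitForFunction(
      expected => window.location.pathname === expected,
      { timeout: 10000 },
      path
    );
    // Wait until the route's content has rendered
    await page.waitForSelector(readySelector);
    results[path] = await extract(page);
  }
  return results;
}
```

For example, `extract` could be `p => p.$$eval('.item', els => els.length)` to count the items rendered on each route.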
Key Considerations for SPA Crawling
1. Wait Strategies
- waitForSelector(): Wait for specific elements
- waitForFunction(): Wait for custom conditions
- waitForResponse(): Wait for API calls
- waitForNavigation(): Wait for page transitions
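These waits can also be combined. A common case is a view that renders either a result list or an empty-state message, where waiting for only one selector would end in a timeout. The sketch below waits for whichever of several selectors appears first; the helper name `waitForAnySelector` and the example selectors are assumptions, not part of Puppeteer.

```javascript
// Sketch: resolve with the first selector that shows up on the page.
async function waitForAnySelector(page, selectors, timeout = 10000) {
  const waits = selectors.map(selector =>
    page.waitForSelector(selector, { timeout }).then(() => selector)
  );
  // Promise.any fulfils with the first success and rejects only if every wait fails
  return Promise.any(waits);
}
```

Calling `await waitForAnySelector(page, ['.product-list', '.empty-state'])` returns the selector that matched, so the caller can branch on the outcome. `Promise.any` requires Node.js 15 or later.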
2. Performance Optimization
const page = await browser.newPage();
// Disable images and CSS for faster loading
await page.setRequestInterception(true);
page.on('request', (req) => {
  if (req.resourceType() === 'stylesheet' || req.resourceType() === 'image') {
    req.abort();
  } else {
    req.continue();
  }
});
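As the block list grows, it may be easier to keep the decision in a small pure function that can be tested without a browser. The following sketch assumes hypothetical names (`shouldBlock`, `handleRequest`) and a made-up tracker host; the resource type strings are the ones Puppeteer reports from `resourceType()`.

```javascript
// Resource types that rarely matter for data extraction
const BLOCKED_TYPES = new Set(['image', 'stylesheet', 'font', 'media']);
// Hypothetical third-party host to skip; replace with hosts seen on the target site
const BLOCKED_HOSTS = ['analytics.example.com'];

// Pure decision function: true when the request should be aborted
function shouldBlock(resourceType, url) {
  if (BLOCKED_TYPES.has(resourceType)) return true;
  const { hostname } = new URL(url);
  return BLOCKED_HOSTS.some(host => hostname === host || hostname.endsWith('.' + host));
}

// Request handler to register after `await page.setRequestInterception(true)`
function handleRequest(req) {
  if (shouldBlock(req.resourceType(), req.url())) {
    req.abort();
  } else {
    req.continue();
  }
}
```

It would be registered with `page.on('request', handleRequest)`. Blocking stylesheets can change what an SPA renders if its scripts depend on layout, so it is worth checking the scraped output with and without the filter.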
3. Error Handling
async function robustSPACrawl(url) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  try {
    // Register listeners before navigating so errors during the initial load are captured
    page.on('pageerror', error => {
      console.log('Page error:', error.message);
    });
    page.on('requestfailed', request => {
      console.log('Request failed:', request.url());
    });
    await page.goto(url, { waitUntil: 'networkidle2' });
    // Your scraping logic here
  } catch (error) {
    console.error('Crawling failed:', error);
  } finally {
    await browser.close();
  }
}
Common SPA Patterns
React Applications
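// Wait for React to render into its root container.
// '#root' is an assumption (a common default); adjust it to the target site.
// Bundled apps usually expose neither window.React nor a [data-reactroot] attribute.
await page.waitForFunction(() => {
  const root = document.querySelector('#root');
  return root !== null && root.childElementCount > 0;
});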
Vue.js Applications
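// Wait for Vue instance. window.Vue is only defined when Vue is loaded as a global <script> build;
// for bundled apps, wait for a selector that the app renders instead.
await page.waitForFunction(() => window.Vue);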
Angular Applications
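// Wait for Angular to bootstrap (the typeof guard avoids a TypeError while the global is still undefined)
await page.waitForFunction(() =>
  typeof window.getAllAngularRootElements === 'function' &&
  window.getAllAngularRootElements().length > 0
);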
Best Practices
- Use appropriate wait conditions based on your target SPA's loading patterns
- Implement retry logic for unreliable network conditions
- Monitor network requests to understand when data loading completes
- Handle authentication if the SPA requires login
- Respect rate limits and implement delays between requests
- Use headless mode in production for better performance
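The retry and delay advice above can be sketched as a small wrapper with exponential backoff. This is a minimal illustration, not a complete resilience strategy: `withRetry`, `retries` and `baseDelayMs` are hypothetical names, and the wrapper retries on any error, which may be too broad for failures that will never succeed (for example, a selector that does not exist).

```javascript
// Small promise-based delay helper
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

// Sketch: run an async task, retrying with exponential backoff on failure.
// `task` receives the zero-based attempt number.
async function withRetry(task, { retries = 3, baseDelayMs = 500 } = {}) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await task(attempt);
    } catch (error) {
      lastError = error;
      if (attempt < retries) {
        // Delay doubles each time: baseDelayMs, 2x, 4x, ...
        await sleep(baseDelayMs * 2 ** attempt);
      }
    }
  }
  // Every attempt failed, surface the last error to the caller
  throw lastError;
}
```

A navigation could then be wrapped as `await withRetry(() => page.goto(url, { waitUntil: 'networkidle2' }))`, and the same `sleep` helper can space out consecutive requests to respect rate limits.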
By combining these techniques, you can crawl most SPAs reliably with Puppeteer and capture content that only appears after JavaScript has run.