Can Crawlee Handle Single-Page Applications (SPAs)?
Yes, Crawlee can effectively handle single-page applications (SPAs) using its browser-based crawlers: PlaywrightCrawler and PuppeteerCrawler. These crawlers are specifically designed to work with JavaScript-heavy websites where content is dynamically rendered on the client side, making them ideal for scraping React, Vue.js, Angular, and other modern SPA frameworks.
Understanding SPAs and Web Scraping Challenges
Single-page applications differ from traditional multi-page websites in several key ways:
- Dynamic Content Loading: Content is loaded via JavaScript after the initial page load
- Client-Side Routing: Navigation happens without full page reloads
- Asynchronous Data Fetching: Data is often loaded via AJAX/fetch requests
- Virtual DOM Updates: The DOM is updated dynamically without page refreshes
Traditional HTTP-based scrapers like CheerioCrawler cannot execute JavaScript and will only see the initial HTML shell, missing all dynamically loaded content. This is where Crawlee's browser-based crawlers excel.
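A quick way to confirm that a site really is an SPA is to fetch its raw HTML and check how little visible content exists before JavaScript runs. Here is a minimal sketch (the URL https://example-spa.com is a placeholder, and the empty-shell assumption is typical but not universal):

import { CheerioCrawler } from 'crawlee';

const spaChecker = new CheerioCrawler({
    async requestHandler({ $, request }) {
        // For a typical SPA the body is little more than <div id="root"></div>
        // plus script tags, so the visible text length is near zero
        const textLength = $('body').text().trim().length;
        console.log(`${request.url}: ${textLength} characters of visible text`);
    },
});

await spaChecker.run(['https://example-spa.com']);

If this prints a near-zero count while the page looks complete in a browser, you need one of the browser-based crawlers described below.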
Using PlaywrightCrawler for SPAs
PlaywrightCrawler is the recommended choice for scraping SPAs thanks to Playwright's modern API, auto-waiting, and cross-browser support. Here's a comprehensive example:
import { PlaywrightCrawler, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
    // Give slow SPAs more time to finish the initial navigation
    navigationTimeoutSecs: 60,

    async requestHandler({ page, request, enqueueLinks }) {
        console.log(`Processing: ${request.url}`);

        // Wait for SPA content to load
        // Option 1: Wait for a specific selector
        await page.waitForSelector('.product-list', { timeout: 10000 });

        // Option 2: Wait for the network to be idle
        await page.waitForLoadState('networkidle');

        // Option 3: Wait for a fixed amount of time (use as a last resort)
        await page.waitForTimeout(2000);

        // Extract data after JavaScript has rendered the content
        const data = await page.evaluate(() => {
            return Array.from(document.querySelectorAll('.product-item')).map((item) => ({
                title: item.querySelector('h2')?.textContent?.trim(),
                price: item.querySelector('.price')?.textContent?.trim(),
                description: item.querySelector('.description')?.textContent?.trim(),
            }));
        });

        // Save extracted data
        await Dataset.pushData(data);

        // Handle SPA pagination/navigation
        await enqueueLinks({
            selector: 'a.next-page',
            transformRequestFunction: (req) => {
                req.userData = { ...req.userData, pageType: 'listing' };
                return req;
            },
        });

        // Trigger SPA navigation if needed
        const loadMoreButton = await page.$('button.load-more');
        if (loadMoreButton) {
            await loadMoreButton.click();
            await page.waitForLoadState('networkidle');
            // Extract additional content after the click
        }
    },

    // Handle failed requests (note: the error is the second argument)
    failedRequestHandler({ request }, error) {
        console.error(`Request ${request.url} failed: ${error.message}`);
    },
});

// Start with SPA URLs
await crawler.run([
    'https://example-spa.com/products',
    'https://example-spa.com/categories',
]);
Using PuppeteerCrawler for SPAs
PuppeteerCrawler is another excellent option for handling SPAs, and it works much like crawling single-page applications with Puppeteer directly:
import { PuppeteerCrawler, Dataset } from 'crawlee';

const crawler = new PuppeteerCrawler({
    launchContext: {
        launchOptions: {
            headless: true,
            args: ['--no-sandbox', '--disable-setuid-sandbox'],
        },
    },

    async requestHandler({ page, request, enqueueLinks }) {
        // Wait for the SPA to initialize
        await page.waitForSelector('[data-spa-ready]', { timeout: 15000 });

        // Scroll to trigger lazy loading (common in SPAs)
        await page.evaluate(async () => {
            await new Promise((resolve) => {
                let totalHeight = 0;
                const distance = 100;
                const timer = setInterval(() => {
                    const scrollHeight = document.body.scrollHeight;
                    window.scrollBy(0, distance);
                    totalHeight += distance;
                    if (totalHeight >= scrollHeight) {
                        clearInterval(timer);
                        resolve();
                    }
                }, 100);
            });
        });

        // Extract data from the SPA
        const items = await page.$$eval('.item', (elements) => {
            return elements.map((el) => ({
                id: el.getAttribute('data-id'),
                name: el.querySelector('.name')?.textContent,
                status: el.querySelector('.status')?.textContent,
            }));
        });

        await Dataset.pushData({ url: request.url, items });

        // Enqueue links found in the SPA
        await enqueueLinks({
            selector: 'a[href^="/"]',
            baseUrl: 'https://example-spa.com',
        });
    },
});
await crawler.run(['https://example-spa.com']);
Handling SPA-Specific Scenarios
Client-Side Routing
SPAs use client-side routing where URLs change without page reloads. Crawlee handles this effectively:
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ page, request }) {
        // Log requests to track AJAX calls (only requests made after this
        // point are intercepted, since the initial navigation already ran)
        await page.route('**/*', (route) => {
            console.log(`Request: ${route.request().url()}`);
            route.continue();
        });

        // Click on an SPA navigation link
        const navLink = await page.$('a[data-spa-link="/about"]');
        if (navLink) {
            // Wait for the SPA route change
            await Promise.all([
                page.waitForFunction(() => window.location.pathname === '/about'),
                navLink.click(),
            ]);

            // Extract data from the new view
            const content = await page.textContent('.main-content');
            console.log('SPA navigated to:', page.url(), '-', content?.slice(0, 80));
        }
    },
});
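One more routing detail worth knowing: older SPAs route via URL fragments (e.g. /#/about), and Crawlee strips fragments when computing a request's unique key, so hash routes would all collapse into a single request by default. The keepUrlFragment request option preserves them; a short sketch with placeholder URLs:

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ page, request }) {
        console.log('Rendering route:', request.url);
    },
});

// keepUrlFragment makes '#/about' and '#/contact' distinct requests
await crawler.run([
    { url: 'https://example-spa.com/#/about', keepUrlFragment: true },
    { url: 'https://example-spa.com/#/contact', keepUrlFragment: true },
]);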
Infinite Scroll and Lazy Loading
Many SPAs implement infinite scroll, which requires special handling:
import { PlaywrightCrawler, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ page, request }) {
        const allItems = [];
        let previousHeight = 0;
        let noNewContentCount = 0;

        // Keep scrolling until no new content loads three times in a row
        while (noNewContentCount < 3) {
            // Extract the items currently in the DOM
            const items = await page.$$eval('.item', (els) =>
                els.map((el) => ({
                    title: el.querySelector('h3')?.textContent,
                    url: el.querySelector('a')?.href,
                })),
            );
            allItems.push(...items);

            // Scroll to the bottom
            const currentHeight = await page.evaluate(() => {
                window.scrollTo(0, document.body.scrollHeight);
                return document.body.scrollHeight;
            });

            // Wait for potential new content
            await page.waitForTimeout(1500);

            if (currentHeight === previousHeight) {
                noNewContentCount++;
            } else {
                noNewContentCount = 0;
                previousHeight = currentHeight;
            }
        }

        // Remove duplicates (items are re-read on every pass) and save
        const uniqueItems = Array.from(
            new Map(allItems.map((item) => [item.url, item])).values(),
        );
        await Dataset.pushData({ url: request.url, items: uniqueItems });
    },
});
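Crawlee also ships an infiniteScroll helper in its Playwright utilities that can replace the hand-rolled loop above; it keeps scrolling until the page height stops growing or a timeout elapses:

import { PlaywrightCrawler, playwrightUtils, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ page, request }) {
        // Scroll until no new content appears or 30 seconds have passed
        await playwrightUtils.infiniteScroll(page, { timeoutSecs: 30 });

        const items = await page.$$eval('.item', (els) =>
            els.map((el) => ({
                title: el.querySelector('h3')?.textContent,
                url: el.querySelector('a')?.href,
            })),
        );
        await Dataset.pushData({ url: request.url, items });
    },
});

An equivalent puppeteerUtils.infiniteScroll is available for PuppeteerCrawler.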
Waiting for AJAX Requests
SPAs frequently make AJAX requests to load their data. You can wait for specific requests to finish, much as you would when handling AJAX requests with Puppeteer directly:
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ page, request }) {
        // Wait for a specific API endpoint. Note: if the call fires during
        // the initial page load, it may have completed before this handler
        // runs; trigger the action (click, scroll) after this line, or
        // capture responses in a pre-navigation hook instead.
        const apiResponse = await page.waitForResponse(
            (response) => response.url().includes('/api/products') && response.status() === 200,
            { timeout: 10000 },
        );

        // Get the JSON data from the API response
        const apiData = await apiResponse.json();
        console.log('API returned:', apiData);

        // Or wait for all network activity to settle
        await page.waitForLoadState('networkidle');

        // Extract the rendered content
        const renderedData = await page.textContent('.product-container');
    },
});
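When the payload you need arrives as JSON, it is often simpler to capture the API response directly than to parse the rendered DOM. A minimal sketch, assuming a hypothetical /api/products endpoint; the listener is registered in a pre-navigation hook so responses fired during the initial page load are not missed:

import { PlaywrightCrawler, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
    preNavigationHooks: [
        async ({ page }) => {
            // Capture matching API responses as they arrive
            page.on('response', async (response) => {
                if (response.url().includes('/api/products') && response.ok()) {
                    const payload = await response.json().catch(() => null);
                    if (payload) await Dataset.pushData(payload);
                }
            });
        },
    ],
    async requestHandler({ page }) {
        // Let the SPA finish its initial round of API calls
        await page.waitForLoadState('networkidle');
    },
});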
Python: Crawlee for Python with SPAs
Crawlee for Python also supports SPA scraping through its Playwright integration:
import asyncio
from datetime import timedelta

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext


async def main():
    crawler = PlaywrightCrawler(
        max_requests_per_crawl=50,
        # Give slow SPAs more time per request
        request_handler_timeout=timedelta(seconds=60),
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        page = context.page

        # Wait for the SPA to load
        await page.wait_for_selector('.spa-content', timeout=10000)
        await page.wait_for_load_state('networkidle')

        # Extract data
        data = await page.evaluate('''() => {
            return Array.from(document.querySelectorAll('.item')).map(item => ({
                title: item.querySelector('h2')?.textContent,
                price: item.querySelector('.price')?.textContent
            }));
        }''')

        # Save to the dataset
        await context.push_data({'url': context.request.url, 'items': data})

        # Handle SPA pagination
        next_button = await page.query_selector('button.next')
        if next_button:
            await next_button.click()
            await page.wait_for_load_state('networkidle')

        # Enqueue the "new" URLs after SPA navigation
        await context.enqueue_links(selector='a.item-link')

    await crawler.run(['https://example-spa.com'])


if __name__ == '__main__':
    asyncio.run(main())
Best Practices for Scraping SPAs with Crawlee
1. Choose the Right Wait Strategy
Different SPAs require different waiting strategies:
// Wait for a specific selector
await page.waitForSelector('.content-loaded');

// Wait for the network to be idle
await page.waitForLoadState('networkidle');

// Wait for a custom condition
await page.waitForFunction(() => window.appReady === true);

// Combine multiple conditions
await Promise.all([
    page.waitForSelector('.header'),
    page.waitForLoadState('domcontentloaded'),
    page.waitForFunction(() => document.readyState === 'complete'),
]);
2. Handle Browser Context Efficiently
Crawlee pools and reuses browser instances automatically; configure the launch context once and cap the concurrency:

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    launchContext: {
        // Use the installed Google Chrome instead of the bundled Chromium
        useChrome: true,
        launchOptions: {
            headless: true,
        },
    },

    // Crawlee reuses browsers from its pool; this caps how many
    // requests are processed in parallel
    maxConcurrency: 10,

    async requestHandler({ page, request }) {
        // Your SPA scraping logic
    },
});
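If the SPA keeps users logged in, Crawlee's session pool can rotate identities and persist cookies between requests. A brief sketch (the /login redirect check is a placeholder for whatever signals a dead session on your target site):

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    // Rotate sessions and keep each session's cookies across requests
    useSessionPool: true,
    persistCookiesPerSession: true,

    async requestHandler({ page, session }) {
        // If the SPA bounced us to a login page, retire this session
        // so Crawlee stops reusing its cookies
        if (page.url().includes('/login')) {
            session?.retire();
        }
    },
});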
3. Monitor and Debug SPA Behavior
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ page, request }) {
        // Forward console messages from the browser
        page.on('console', (msg) => console.log('Browser log:', msg.text()));

        // Log network traffic
        page.on('request', (req) => console.log('Request:', req.url()));
        page.on('response', (res) => console.log('Response:', res.url(), res.status()));

        // Take a screenshot for debugging
        await page.screenshot({ path: `screenshot-${Date.now()}.png` });
    },
});
Performance Considerations
Browser-based crawling is more resource-intensive than HTTP-only crawling. Optimize performance with these settings:
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    // Limit concurrent browser instances
    maxConcurrency: 5,

    // Set reasonable timeouts
    navigationTimeoutSecs: 30,
    requestHandlerTimeoutSecs: 60,

    launchContext: {
        launchOptions: {
            // Disable unnecessary browser features
            args: [
                '--disable-dev-shm-usage',
                '--disable-gpu',
                '--disable-features=IsolateOrigins,site-per-process',
                '--no-sandbox',
            ],
        },
    },

    // Block unnecessary resources before each navigation
    preNavigationHooks: [
        async ({ page }) => {
            await page.route('**/*', (route) => {
                const resourceType = route.request().resourceType();
                if (['image', 'font', 'media'].includes(resourceType)) {
                    route.abort();
                } else {
                    route.continue();
                }
            });
        },
    ],
});
When to Use Browser Crawlers vs HTTP Crawlers
While Crawlee's browser-based crawlers are excellent for SPAs, consider the trade-offs:
Use PlaywrightCrawler or PuppeteerCrawler when:
- The website relies heavily on JavaScript for content rendering
- Client-side routing is used for navigation
- Content is loaded dynamically via AJAX/fetch
- You need to interact with the page (clicking, scrolling, form submission)
- The site implements lazy loading or infinite scroll

Use CheerioCrawler when:
- Content is server-side rendered
- You need maximum speed and minimal resource usage
- The website doesn't require JavaScript execution
- You're scraping large-scale static content
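When a site mixes both kinds of pages, you can run the two crawler types side by side. A minimal sketch with hypothetical URL patterns (a server-rendered /blog section and a JavaScript-driven /app section):

import { CheerioCrawler, PlaywrightCrawler } from 'crawlee';

// Cheap HTTP crawling for the server-rendered section
const staticCrawler = new CheerioCrawler({
    async requestHandler({ $, request }) {
        console.log(request.url, '->', $('title').text());
    },
});

// A full browser for the SPA section
const spaCrawler = new PlaywrightCrawler({
    async requestHandler({ page, request }) {
        console.log(request.url, '->', await page.title());
    },
});

await staticCrawler.run(['https://example.com/blog']);
await spaCrawler.run(['https://example.com/app']);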
Handling Common SPA Patterns
React Applications
React apps often use data attributes and component lifecycles:
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ page, request }) {
        // Wait for React to render into the root element
        await page.waitForFunction(() => {
            const root = document.querySelector('#root');
            return root && root.children.length > 0;
        });

        // Wait for data to load (React apps often show a loading state first)
        await page.waitForSelector('[data-testid="content-loaded"]');

        // Extract data from React components
        const data = await page.evaluate(() => {
            return window.__REACT_DATA__ || {}; // Some apps expose data globally
        });
    },
});
Vue.js Applications
Vue.js apps can be detected and scraped effectively:
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ page, request }) {
        // Wait for Vue to mount (this global is present in many Vue apps)
        await page.waitForFunction(() => window.__VUE__ !== undefined);

        // Wait for v-cloak attributes to be removed (common Vue pattern)
        await page.waitForFunction(() => {
            return !document.querySelector('[v-cloak]');
        });

        // Extract data
        const vueData = await page.evaluate(() => {
            return window.__INITIAL_STATE__; // Common pattern in Vue SSR apps
        });
    },
});
Angular Applications
Angular apps have their own loading indicators and lifecycle:
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ page, request }) {
        // Wait for Angular to bootstrap and become stable
        await page.waitForFunction(() => {
            return window.getAllAngularTestabilities !== undefined &&
                window.getAllAngularTestabilities()[0]?.isStable();
        });

        // Wait for loading indicators to disappear
        await page.waitForSelector('.loading-spinner', { state: 'hidden' });

        // Extract data from Angular components
        const data = await page.$$eval('[data-component]', (elements) => {
            return elements.map((el) => ({
                component: el.getAttribute('data-component'),
                content: el.textContent,
            }));
        });
    },
});
Troubleshooting SPA Scraping
Content Not Loading
If content doesn't load, try multiple wait strategies:
async requestHandler({ page, request }) {
    try {
        // Try the primary selector with a short timeout
        await page.waitForSelector('.main-content', { timeout: 5000 });
    } catch (error) {
        // Fallback: wait for the network to go idle
        await page.waitForLoadState('networkidle');
        // If there is still no content, wait a little longer
        await page.waitForTimeout(3000);
    }

    // Verify the content actually loaded
    const hasContent = await page.$('.main-content');
    if (!hasContent) {
        throw new Error('Content failed to load');
    }
}
Handling Navigation Timeouts
As with handling timeouts in Puppeteer, implement robust timeout handling. Note that Crawlee performs the navigation itself before your requestHandler runs, so navigation options belong in the crawler configuration and pre-navigation hooks rather than in a manual page.goto() call:
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    navigationTimeoutSecs: 60,
    requestHandlerTimeoutSecs: 120,

    // Tune the built-in navigation instead of calling page.goto() yourself
    preNavigationHooks: [
        async (_crawlingContext, gotoOptions) => {
            gotoOptions.waitUntil = 'domcontentloaded';
        },
    ],

    async requestHandler({ page, request }) {
        try {
            // Extra waits after navigation can still time out on slow SPAs
            await page.waitForLoadState('networkidle', { timeout: 30000 });
        } catch (error) {
            console.log('Network never went idle, continuing anyway...');
            // The page might still be usable
        }
    },
});
Conclusion
Crawlee is exceptionally well-suited for scraping single-page applications through its PlaywrightCrawler and PuppeteerCrawler implementations. These crawlers provide full JavaScript execution, allowing you to interact with SPAs just as a real user would. By properly configuring wait strategies, handling client-side routing, and implementing efficient resource management, you can reliably extract data from even the most complex modern web applications.
The key to success with SPA scraping is understanding the application's behavior, using appropriate wait conditions, and leveraging Crawlee's powerful features for request management, data extraction, and error handling. Whether you're working with React, Vue.js, Angular, or any other modern framework, Crawlee provides the tools necessary to effectively scrape dynamic content.