How Do I Scrape Dynamic Websites with JavaScript Using Crawlee?
Scraping dynamic websites that rely heavily on JavaScript can be challenging with traditional HTTP-based tools. Crawlee provides powerful solutions through its browser-based crawlers—PlaywrightCrawler and PuppeteerCrawler—designed specifically to handle JavaScript-rendered content, AJAX requests, and complex single-page applications (SPAs).
Understanding Dynamic Websites
Dynamic websites use JavaScript to:
- Load content asynchronously after the initial page load
- Render content based on user interactions
- Fetch data from APIs without page refreshes
- Create infinite scroll or pagination
- Handle authentication and session management
Traditional scrapers that only parse HTML won't capture this dynamically loaded content. Crawlee's browser-based crawlers solve this by executing JavaScript just like a real browser.
Choosing the Right Crawler
Crawlee offers two primary options for scraping dynamic websites:
PlaywrightCrawler
Modern, feature-rich, and supports multiple browsers (Chromium, Firefox, WebKit).
PuppeteerCrawler
Lightweight, Chrome/Chromium-focused, with extensive community support.
Both crawlers provide similar APIs and capabilities. Choose based on your browser requirements and performance needs.
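For example, switching PlaywrightCrawler to a non-default engine only takes a launchContext option. Here's a minimal sketch that runs Firefox instead of the default Chromium (it assumes you've installed the Firefox build with npx playwright install firefox):
import { PlaywrightCrawler } from 'crawlee';
import { firefox } from 'playwright';

const crawler = new PlaywrightCrawler({
    launchContext: {
        // Use Firefox instead of the default Chromium
        launcher: firefox,
    },
    async requestHandler({ request, page, log }) {
        log.info(`Loaded ${request.url} in Firefox, title: ${await page.title()}`);
    },
});

await crawler.run(['https://example.com']);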
Basic Setup for Dynamic Website Scraping
Installation
First, install Crawlee with your preferred browser automation library:
# For Playwright (recommended for most use cases)
npm install crawlee playwright
# Or for Puppeteer
npm install crawlee puppeteer
Simple PlaywrightCrawler Example
Here's how to scrape a dynamic website that loads content via JavaScript:
import { PlaywrightCrawler } from 'crawlee';
const crawler = new PlaywrightCrawler({
// Limit concurrent pages for stability
maxConcurrency: 5,
async requestHandler({ request, page, enqueueLinks, pushData, log }) {
log.info(`Processing: ${request.url}`);
// Wait for dynamic content to load
await page.waitForSelector('.product-list');
// Extract data after JavaScript execution
const products = await page.$$eval('.product-item', (elements) => {
return elements.map(el => ({
title: el.querySelector('.product-title')?.textContent?.trim(),
price: el.querySelector('.product-price')?.textContent?.trim(),
image: el.querySelector('.product-image')?.getAttribute('src')
}));
});
log.info(`Found ${products.length} products`);
// Save the extracted data
await pushData(products);
// Find and enqueue pagination links
await enqueueLinks({
selector: '.pagination a',
label: 'LISTING'
});
},
});
// Start the crawler
await crawler.run(['https://example.com/products']);
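The label set by enqueueLinks is metadata attached to each request, and it becomes useful once you split your logic into per-label handlers. The sketch below shows one way to do that with Crawlee's createPlaywrightRouter; the 'LISTING' handler body is illustrative and not part of the example above:
import { createPlaywrightRouter, PlaywrightCrawler } from 'crawlee';

const router = createPlaywrightRouter();

// Runs for requests enqueued with label: 'LISTING'
router.addHandler('LISTING', async ({ request, page, log }) => {
    log.info(`Listing page: ${request.url}`);
    // Extract listing data here...
});

// Runs for the start URLs and any request without a matching label
router.addDefaultHandler(async ({ request, log }) => {
    log.info(`Default handler: ${request.url}`);
});

const crawler = new PlaywrightCrawler({ requestHandler: router });
await crawler.run(['https://example.com/products']);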
PuppeteerCrawler Example
The API is nearly identical for PuppeteerCrawler:
import { PuppeteerCrawler } from 'crawlee';
const crawler = new PuppeteerCrawler({
async requestHandler({ request, page, pushData, log }) {
log.info(`Scraping: ${request.url}`);
// Wait for JavaScript to render content
await page.waitForSelector('.dynamic-content');
// Extract data
const data = await page.evaluate(() => {
const items = [];
document.querySelectorAll('.item').forEach(item => {
items.push({
name: item.querySelector('.name')?.innerText,
value: item.querySelector('.value')?.innerText
});
});
return items;
});
await pushData(data);
},
});
await crawler.run(['https://example.com']);
Handling Common Dynamic Content Patterns
Waiting for AJAX Requests
Many dynamic websites load data through AJAX calls. You need to wait for these requests to complete before extracting data:
const crawler = new PlaywrightCrawler({
async requestHandler({ page, request, pushData, log }) {
log.info(`Processing: ${request.url}`);
// Wait for the network to be idle (no ongoing requests)
await page.waitForLoadState('networkidle');
// Alternatively, wait for a specific API response instead of 'networkidle'.
// Use one or the other: if the response has already arrived by this point,
// a second wait here would simply time out.
// await page.waitForResponse(
//   (response) => response.url().includes('/api/products') && response.status() === 200
// );
// Now extract the data
const data = await page.evaluate(() => {
return JSON.parse(document.querySelector('#app-data').textContent);
});
await pushData(data);
},
});
Similar to how AJAX requests are handled in Puppeteer, Crawlee provides robust methods for waiting on asynchronous content.
Infinite Scroll Pages
Handle infinite scroll by simulating scrolling behavior:
const crawler = new PlaywrightCrawler({
async requestHandler({ page, pushData, log }) {
log.info('Handling infinite scroll...');
let previousHeight = 0;
let currentHeight = await page.evaluate('document.body.scrollHeight');
while (previousHeight !== currentHeight) {
// Scroll to bottom
await page.evaluate('window.scrollTo(0, document.body.scrollHeight)');
// Wait for new content to load
await page.waitForTimeout(2000);
previousHeight = currentHeight;
currentHeight = await page.evaluate('document.body.scrollHeight');
}
// Extract all loaded items
const items = await page.$$eval('.scroll-item', elements => {
return elements.map(el => ({
title: el.querySelector('h3')?.textContent,
description: el.querySelector('.desc')?.textContent
}));
});
await pushData(items);
},
});
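If you'd rather not maintain the scroll loop yourself, Crawlee also ships an infiniteScroll helper in playwrightUtils (with a puppeteerUtils counterpart). The options shown here, such as timeoutSecs, may differ between versions, so treat this as a sketch and check the docs for your release:
import { PlaywrightCrawler, playwrightUtils } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ page, pushData, log }) {
        // Scroll down until no new content appears (or the time limit is hit)
        await playwrightUtils.infiniteScroll(page, { timeoutSecs: 30 });

        const items = await page.$$eval('.scroll-item', (elements) =>
            elements.map((el) => ({
                title: el.querySelector('h3')?.textContent,
                description: el.querySelector('.desc')?.textContent,
            })),
        );
        log.info(`Collected ${items.length} items after scrolling`);
        await pushData(items);
    },
});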
Click-Based Pagination
Some sites require clicking "Load More" buttons:
const crawler = new PlaywrightCrawler({
async requestHandler({ page, pushData, log }) {
// Keep clicking "Load More" until the button disappears
while (true) {
const loadMoreButton = await page.$('button.load-more');
if (!loadMoreButton) {
break; // No more content to load
}
await loadMoreButton.click();
// Wait for new items to appear
await page.waitForTimeout(1000);
}
// Extract every item once, after all content has loaded,
// so previously loaded items aren't collected twice
const allItems = await page.$$eval('.item', elements => {
return elements.map(el => el.textContent.trim());
});
log.info(`Collected ${allItems.length} total items`);
await pushData({ items: allItems });
},
});
Handling JavaScript-Rendered Content with Delays
Some websites render content after variable delays:
const crawler = new PlaywrightCrawler({
async requestHandler({ page, pushData, log }) {
// Wait for specific element with timeout
try {
await page.waitForSelector('.dynamic-element', {
timeout: 10000
});
} catch (error) {
log.error('Element did not appear within timeout');
return;
}
// Alternative: wait for function to return true
await page.waitForFunction(() => {
return document.querySelectorAll('.item').length > 0;
}, { timeout: 10000 });
const data = await page.$$eval('.item', elements => {
return elements.map(el => el.textContent);
});
await pushData(data);
},
});
Understanding how to use the waitFor function is crucial for successfully scraping dynamic content.
Advanced Techniques for Single-Page Applications
Single-page applications (SPAs) built with React, Vue, or Angular require special handling:
import { PlaywrightCrawler } from 'crawlee';
const crawler = new PlaywrightCrawler({
async requestHandler({ page, request, pushData, log }) {
log.info(`Scraping SPA: ${request.url}`);
// Wait for the SPA framework to fully render
await page.waitForLoadState('networkidle');
// Wait for the app root element
await page.waitForSelector('#app-root');
// Additional wait for React/Vue to hydrate
await page.waitForFunction(() => {
// Check if framework has finished rendering
return window.__APP_READY__ === true;
}, { timeout: 15000 }).catch(() => {
log.warning('App ready signal not found, proceeding anyway');
});
// Extract data from the fully rendered SPA
const data = await page.evaluate(() => {
// Access state from framework if exposed
return {
title: document.querySelector('h1')?.textContent,
items: Array.from(document.querySelectorAll('.spa-item')).map(el => ({
id: el.dataset.id,
content: el.textContent.trim()
}))
};
});
await pushData(data);
// SPAs often use client-side routing
// Click links to navigate within the SPA
const links = await page.$$eval('a[data-route]', elements => {
return elements.map(el => el.href);
});
await crawler.addRequests(links);
},
});
await crawler.run(['https://spa-example.com']);
For more insights, see how to crawl single-page applications.
Performance Optimization
Browser Context Reuse
Crawlee automatically manages browser contexts for efficiency, but you can optimize further:
const crawler = new PlaywrightCrawler({
// Reuse browser contexts when possible
useSessionPool: true,
persistCookiesPerSession: true,
// Control browser pool
maxConcurrency: 10,
async requestHandler({ page, log }) {
// Your scraping logic
},
});
Blocking Unnecessary Resources
Speed up scraping by blocking images, stylesheets, and fonts:
const crawler = new PlaywrightCrawler({
preNavigationHooks: [
async ({ page }) => {
// Block resource types
await page.route('**/*', (route) => {
const resourceType = route.request().resourceType();
if (['image', 'stylesheet', 'font', 'media'].includes(resourceType)) {
route.abort();
} else {
route.continue();
}
});
},
],
async requestHandler({ page, log }) {
// Scraping logic
},
});
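Crawlee also offers a blockRequests helper in playwrightUtils that can replace the manual route handler above. It relies on the Chrome DevTools Protocol, so assume it applies only to Chromium-based browsers, and verify the option names (such as urlPatterns) against your Crawlee version; this is a sketch rather than a drop-in recipe:
import { PlaywrightCrawler, playwrightUtils } from 'crawlee';

const crawler = new PlaywrightCrawler({
    preNavigationHooks: [
        async ({ page }) => {
            // Abort any request whose URL contains one of these patterns
            await playwrightUtils.blockRequests(page, {
                urlPatterns: ['.jpg', '.jpeg', '.png', '.svg', '.gif', '.woff', '.css'],
            });
        },
    ],
    async requestHandler({ page, log }) {
        // Scraping logic
    },
});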
Headless Mode
Always run in headless mode for production:
const crawler = new PlaywrightCrawler({
launchContext: {
launchOptions: {
headless: true, // Default, but explicit is better
},
},
});
Error Handling and Retries
Crawlee provides built-in retry mechanisms, but you can customize them:
const crawler = new PlaywrightCrawler({
// Retry failed requests
maxRequestRetries: 3,
// Handle errors gracefully
failedRequestHandler: async ({ request, pushData, log }) => {
log.error(`Request ${request.url} failed after ${request.retryCount} retries`);
// Save failed URLs for later review
await pushData({
url: request.url,
error: 'Failed to scrape',
timestamp: new Date().toISOString()
});
},
async requestHandler({ page, request, log }) {
try {
await page.waitForSelector('.content', { timeout: 10000 });
// Your scraping logic
} catch (error) {
log.error(`Error processing ${request.url}: ${error.message}`);
throw error; // Let Crawlee handle retry
}
},
});
Complete Real-World Example
Here's a comprehensive example scraping a dynamic e-commerce site:
import { PlaywrightCrawler, Dataset } from 'crawlee';
const crawler = new PlaywrightCrawler({
maxConcurrency: 5,
maxRequestRetries: 2,
preNavigationHooks: [
async ({ page }) => {
// Block images to speed up scraping
await page.route('**/*', (route) => {
const type = route.request().resourceType();
if (['image', 'media'].includes(type)) {
route.abort();
} else {
route.continue();
}
});
},
],
async requestHandler({ request, page, enqueueLinks, log }) {
const { label } = request.userData;
log.info(`Processing ${label}: ${request.url}`);
if (label === 'START' || label === 'CATEGORY') {
// Wait for product grid to load
await page.waitForSelector('.product-grid');
await page.waitForLoadState('networkidle');
// Enqueue product detail pages
await enqueueLinks({
selector: '.product-link',
label: 'PRODUCT',
});
// Handle pagination
await enqueueLinks({
selector: '.pagination a.next',
label: 'CATEGORY',
});
} else if (label === 'PRODUCT') {
// Wait for product details to load
await page.waitForSelector('.product-detail');
// Extract product data
const product = await page.evaluate(() => {
return {
name: document.querySelector('.product-name')?.textContent?.trim(),
price: document.querySelector('.product-price')?.textContent?.trim(),
description: document.querySelector('.product-description')?.textContent?.trim(),
rating: document.querySelector('.rating')?.textContent?.trim(),
reviews: Array.from(document.querySelectorAll('.review')).map(review => ({
author: review.querySelector('.author')?.textContent?.trim(),
text: review.querySelector('.text')?.textContent?.trim(),
rating: review.querySelector('.stars')?.textContent?.trim(),
})),
specifications: Array.from(document.querySelectorAll('.spec-row')).map(row => ({
key: row.querySelector('.spec-key')?.textContent?.trim(),
value: row.querySelector('.spec-value')?.textContent?.trim(),
})),
};
});
// Save to dataset
await Dataset.pushData(product);
log.info(`Scraped product: ${product.name}`);
}
},
failedRequestHandler: async ({ request, log }) => {
log.error(`Failed to process ${request.url}`);
},
});
// Start crawling
await crawler.run([
{
url: 'https://example-shop.com/products',
userData: { label: 'START' },
},
]);
console.log('Scraping completed!');
Best Practices
- Always wait for content: Use waitForSelector or waitForLoadState to ensure JavaScript has finished rendering
- Handle timeouts gracefully: Set appropriate timeout values and catch timeout errors
- Use selectors wisely: Prefer data attributes or unique class names over generic selectors
- Monitor network activity: Watch for API calls that signal when data is ready
- Implement rate limiting: Respect target websites by controlling concurrency (see the sketch after this list)
- Test incrementally: Start with a single page before scaling to full site crawls
- Save partial results: Use Crawlee's dataset API to save data as you go
- Log extensively: Use the provided logger to track progress and debug issues
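To put the rate-limiting advice into practice, the throttling knobs live directly on the crawler options. A minimal sketch with illustrative values (tune them to the target site):
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    // Keep the number of parallel browser pages small
    maxConcurrency: 3,
    // Cap overall throughput across all pages
    maxRequestsPerMinute: 60,
    async requestHandler({ request, log }) {
        log.info(`Processing: ${request.url}`);
    },
});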
Conclusion
Crawlee's PlaywrightCrawler and PuppeteerCrawler provide robust solutions for scraping dynamic JavaScript-heavy websites. By leveraging browser automation, waiting for dynamic content, and following best practices, you can effectively extract data from even the most complex modern web applications.
The key is understanding how the target website loads its content and using the appropriate waiting strategies to ensure all data is rendered before extraction. With Crawlee's built-in features like automatic retries, request queueing, and data storage, you can build reliable and scalable web scraping solutions for dynamic websites.