What is the Best Way to Scrape React or Vue.js Websites with Crawlee?
Scraping React and Vue.js websites requires special consideration because these frameworks render content dynamically using JavaScript. Unlike traditional server-side rendered pages, single-page applications (SPAs) built with React or Vue.js load content asynchronously after the initial page load, making them challenging to scrape with simple HTTP requests.
Crawlee provides powerful tools specifically designed for scraping JavaScript-heavy websites through its `PlaywrightCrawler` and `PuppeteerCrawler` classes, which use real browser automation to execute JavaScript and wait for dynamic content to load.
Understanding the Challenge with React and Vue.js Websites
React and Vue.js applications differ from traditional websites in several key ways:
- Client-side rendering: Content is generated in the browser using JavaScript rather than being sent from the server
- Asynchronous data fetching: Data is often loaded via API calls after the initial page load
- Dynamic DOM updates: The page structure changes as users interact with it
- Virtual DOM: React and Vue use a virtual DOM that requires JavaScript execution to render actual HTML
- Lazy loading: Components and data may load on-demand as users scroll or navigate
Because of these characteristics, standard HTTP-based scrapers like `CheerioCrawler` will only see an empty skeleton HTML page without the actual content.
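To see why, here is a minimal sketch of what an HTTP-based crawler actually receives from a typical React SPA. The HTML below is a hypothetical, simplified server response, not any real site:

```javascript
// A simplified example of the HTML an HTTP-based crawler receives
// from a typical React SPA before any JavaScript runs
const spaHtml = `
<!DOCTYPE html>
<html>
  <head><title>Example Shop</title></head>
  <body>
    <div id="root"></div>
    <script src="/static/js/main.js"></script>
  </body>
</html>`;

// All visible content is rendered later by main.js, so the root
// container in the raw HTML is empty
const rootContent = spaHtml.match(/<div id="root">([\s\S]*?)<\/div>/)[1];
console.log(`Root container contents: "${rootContent}"`);
```

Parsing this response with Cheerio would find the `#root` element but none of the products, titles, or prices that a browser user sees, which is why a JavaScript-executing crawler is required.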
Choosing the Right Crawlee Crawler
Crawlee offers three main crawler types, but for React and Vue.js applications, you need one that can execute JavaScript:
PlaywrightCrawler (Recommended)
`PlaywrightCrawler` is the most powerful and recommended option for scraping modern JavaScript frameworks. It uses Playwright, which supports Chromium, Firefox, and WebKit browsers.
```javascript
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, request, enqueueLinks }) => {
        console.log(`Processing: ${request.url}`);

        // Wait for React/Vue app to load
        await page.waitForSelector('.app-content', { timeout: 30000 });

        // Extract data after JavaScript has rendered
        const data = await page.evaluate(() => {
            return {
                title: document.querySelector('h1')?.textContent,
                items: Array.from(document.querySelectorAll('.item')).map(item => ({
                    name: item.querySelector('.name')?.textContent,
                    price: item.querySelector('.price')?.textContent
                }))
            };
        });

        console.log('Extracted data:', data);
    },
});

await crawler.run(['https://example-react-app.com']);
```
PuppeteerCrawler
`PuppeteerCrawler` is another excellent choice. It uses Puppeteer, which only supports Chromium-based browsers but has a slightly simpler API.
```javascript
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    requestHandler: async ({ page, request }) => {
        // Wait for network to be idle (useful for SPAs)
        await page.waitForNetworkIdle({ timeout: 30000 });

        const data = await page.evaluate(() => {
            // Your extraction logic here
            return document.querySelector('.content')?.textContent;
        });
    },
});
```
Best Practices for Scraping React/Vue.js Websites
1. Wait for Content to Load Properly
The most critical aspect of scraping single-page applications is ensuring all content has loaded before extraction. Crawlee provides several strategies:
```javascript
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page }) => {
        // Strategy 1: Wait for a specific selector
        await page.waitForSelector('.data-loaded-indicator', {
            state: 'visible',
            timeout: 30000
        });

        // Strategy 2: Wait for the network to be idle
        await page.waitForLoadState('networkidle');

        // Strategy 3: Wait for an app-specific readiness flag
        // (only works if the site's own code sets such a global)
        await page.waitForFunction(() => {
            return window.__REACT_READY__ === true;
        });

        // Strategy 4: Fixed wait time (use sparingly)
        await page.waitForTimeout(3000);
    },
});
```
2. Handle Dynamic Data Loading
React and Vue.js apps often fetch data from APIs after the initial render. Monitor network requests to understand when data loading completes:
```javascript
const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, request }) => {
        // Crawlee has already navigated to request.url before this handler
        // runs, so register the response listener first, then reload to
        // trigger the API call again and capture it
        const apiDataPromise = page.waitForResponse(
            response => response.url().includes('/api/products') && response.status() === 200,
            { timeout: 30000 }
        );
        await page.reload();

        // Wait for the API call to complete
        await apiDataPromise;

        // Now extract the data
        const products = await page.$$eval('.product-card', cards =>
            cards.map(card => ({
                title: card.querySelector('.title')?.textContent,
                price: card.querySelector('.price')?.textContent
            }))
        );
    },
});
```
3. Handle Infinite Scroll and Lazy Loading
Many React and Vue.js applications use infinite scroll to load content progressively:
```javascript
const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page }) => {
        // Function to scroll and wait for new content
        async function autoScroll() {
            await page.evaluate(async () => {
                await new Promise((resolve) => {
                    let totalHeight = 0;
                    const distance = 100;
                    const timer = setInterval(() => {
                        const scrollHeight = document.body.scrollHeight;
                        window.scrollBy(0, distance);
                        totalHeight += distance;
                        if (totalHeight >= scrollHeight) {
                            clearInterval(timer);
                            resolve();
                        }
                    }, 100);
                });
            });
        }

        // Scroll to load all content
        await autoScroll();

        // Wait for final content to render
        await page.waitForTimeout(2000);

        // Extract all loaded data
        const allData = await page.$$eval('.item', items =>
            items.map(item => item.textContent)
        );
    },
});
```
4. Extract Data from React/Vue.js Component State
Sometimes it's easier to extract data directly from the framework's state rather than parsing the DOM:
```javascript
const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page }) => {
        await page.waitForLoadState('networkidle');

        // Extract data from React component props/state
        const reactData = await page.evaluate(() => {
            // Find the React root element
            const rootElement = document.querySelector('#root');
            if (!rootElement) return null;

            // Access React internal properties (React 16+)
            const reactInternalKey = Object.keys(rootElement).find(
                key => key.startsWith('__reactFiber') || key.startsWith('__reactInternalInstance')
            );

            if (reactInternalKey) {
                const fiber = rootElement[reactInternalKey];
                // Navigate the fiber tree to find component state;
                // this is implementation-specific and can break between React versions
                return fiber?.memoizedProps?.data;
            }
            return null;
        });

        // Or extract from a Vue instance
        const vueData = await page.evaluate(() => {
            const app = document.querySelector('#app');
            // Vue 3 exposes the component instance via __vueParentComponent
            // (an internal property, not a stable public API)
            return app?.__vueParentComponent?.ctx?.data;
        });
    },
});
```
5. Optimize Performance with Request Interception
Block unnecessary resources to speed up scraping:
```javascript
const crawler = new PlaywrightCrawler({
    preNavigationHooks: [
        async ({ page }) => {
            // Block images, fonts, and other non-essential resources
            await page.route('**/*', (route) => {
                const resourceType = route.request().resourceType();
                if (['image', 'font', 'media'].includes(resourceType)) {
                    route.abort();
                } else {
                    route.continue();
                }
            });
        },
    ],
    requestHandler: async ({ page, request }) => {
        // Crawlee has already navigated here with the routes active,
        // so go straight to your extraction logic
    },
});
```
Complete Example: Scraping a React E-commerce Site
Here's a comprehensive example that combines all best practices:
```javascript
import { PlaywrightCrawler, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
    // Increase timeout for slow-loading SPAs
    navigationTimeoutSecs: 60,

    // Use a headless browser
    headless: true,

    preNavigationHooks: [
        async ({ page }) => {
            // Block unnecessary resources
            await page.route('**/*', (route) => {
                const type = route.request().resourceType();
                if (['image', 'stylesheet', 'font'].includes(type)) {
                    route.abort();
                } else {
                    route.continue();
                }
            });
        },
    ],

    requestHandler: async ({ page, request, enqueueLinks }) => {
        console.log(`Processing: ${request.url}`);

        // Wait for the React app to initialize
        await page.waitForSelector('[data-react-root]', { timeout: 30000 });

        // Wait for the product grid to load
        await page.waitForSelector('.product-grid', { state: 'visible' });

        // Wait for the products API call; it may already have completed
        // during navigation, so tolerate a timeout here
        await page.waitForResponse(
            response => response.url().includes('/api/products'),
            { timeout: 30000 }
        ).catch(() => {});

        // Additional wait for rendering
        await page.waitForTimeout(1000);

        // Extract product data
        const products = await page.$$eval('.product-card', cards => {
            return cards.map(card => ({
                title: card.querySelector('.product-title')?.textContent?.trim(),
                price: card.querySelector('.product-price')?.textContent?.trim(),
                rating: card.querySelector('.product-rating')?.textContent?.trim(),
                url: card.querySelector('a')?.href,
                inStock: !card.querySelector('.out-of-stock')
            }));
        });

        // Save to the dataset
        await Dataset.pushData(products);

        // Find and enqueue pagination links
        await enqueueLinks({
            selector: '.pagination a',
            label: 'PRODUCTS',
        });

        console.log(`Extracted ${products.length} products`);
    },

    failedRequestHandler: async ({ request }) => {
        console.log(`Request ${request.url} failed multiple times`);
    },
});

// Start crawling
await crawler.run([
    'https://example-react-shop.com/products',
]);

// Export data
const dataset = await Dataset.open();
await dataset.exportToJSON('products');
```
Python Example with Crawlee for Python
If you're using Python, Crawlee for Python offers similar functionality:
```python
import asyncio

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    crawler = PlaywrightCrawler(
        max_requests_per_crawl=50,
        headless=True,
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        page = context.page

        # Wait for the React/Vue app to load
        await page.wait_for_selector('.app-content', timeout=30000)

        # Wait for the network to be idle
        await page.wait_for_load_state('networkidle')

        # Extract data
        data = await page.evaluate("""
            () => {
                return Array.from(document.querySelectorAll('.item')).map(item => ({
                    title: item.querySelector('.title')?.textContent,
                    description: item.querySelector('.desc')?.textContent
                }));
            }
        """)

        # Push data to the dataset
        await context.push_data(data)

        # Enqueue new links
        await context.enqueue_links(selector='a.next-page')

    await crawler.run(['https://example-vue-app.com'])


if __name__ == '__main__':
    asyncio.run(main())
```
Debugging Tips
When scraping React or Vue.js applications, debugging is crucial:
```javascript
const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page }) => {
        // Take screenshots at different stages
        await page.screenshot({ path: 'before-wait.png' });
        await page.waitForSelector('.content');
        await page.screenshot({ path: 'after-wait.png' });

        // Log page content for inspection
        const html = await page.content();
        console.log('Page HTML length:', html.length);

        // Check whether specific elements exist
        const elementExists = await page.$('.target-element') !== null;
        console.log('Target element exists:', elementExists);
    },

    // Run in non-headless mode to watch the browser
    headless: false,
});
```
Conclusion
Scraping React and Vue.js websites with Crawlee requires browser automation through `PlaywrightCrawler` or `PuppeteerCrawler`. The key to success is implementing proper wait strategies to ensure JavaScript has executed and all dynamic content has loaded before extraction. By combining selector waits, network monitoring, and strategic timeouts, you can reliably extract data from even the most complex single-page applications.
Remember to always respect website terms of service, implement rate limiting, and handle errors gracefully to build robust and ethical web scrapers.
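For the rate limiting mentioned above, Crawlee exposes throttling controls directly on the crawler options. A minimal sketch, with placeholder values to tune for your target site:

```javascript
// Illustrative throttling settings; the option names are Crawlee's,
// the values are placeholders to adjust per target site
const throttlingOptions = {
    maxConcurrency: 5,          // no more than 5 browser pages open at once
    maxRequestsPerMinute: 60,   // cap the overall request rate
    maxRequestRetries: 3,       // retry transient failures a few times
};

// These spread straight into any crawler constructor, for example:
// const crawler = new PlaywrightCrawler({ ...throttlingOptions, requestHandler });
console.log(throttlingOptions);
```

Keeping these limits conservative reduces load on the target server and makes your crawler less likely to be blocked.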