How Do I Wait for JavaScript to Load Before Scraping with Crawlee?
When scraping modern websites, JavaScript often loads content dynamically after the initial page load. Crawlee provides several powerful methods to wait for JavaScript content to render before extracting data, ensuring you capture complete and accurate information.
Understanding JavaScript Rendering in Crawlee
Crawlee offers different crawler types designed for various scraping scenarios:
- CheerioCrawler: Fast but doesn't execute JavaScript
- PuppeteerCrawler: Uses headless Chrome to execute JavaScript
- PlaywrightCrawler: Uses Playwright for JavaScript execution with better browser support
For JavaScript-heavy sites, you'll need to use either PuppeteerCrawler or PlaywrightCrawler to properly wait for dynamic content.
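To see the difference, compare what a CheerioCrawler observes against what a real browser renders. A minimal sketch (the .js-rendered-item selector is a hypothetical placeholder):

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    requestHandler: async ({ $, request, log }) => {
        // Cheerio only parses the raw HTML response; elements injected
        // later by client-side JavaScript will not appear here.
        const count = $('.js-rendered-item').length; // hypothetical selector
        log.info(`${request.url}: ${count} items in the static HTML`);
    },
});

await crawler.run(['https://example.com']);

If that count is zero but the items are visible in a real browser, the content is rendered client-side and you need a browser-based crawler.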
Basic Wait Strategies in Crawlee
1. Waiting for Network Idle
The most common approach is waiting for network activity to settle, indicating that JavaScript has finished loading content:
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, request, log }) => {
        // Wait for network to be idle
        await page.waitForLoadState('networkidle');
        const title = await page.title();
        log.info(`Title: ${title}`);
    },
});

await crawler.run(['https://example.com']);
waitForLoadState() accepts the following load states:
- load: waits for the load event
- domcontentloaded: waits for the DOMContentLoaded event
- networkidle: waits until there are no network connections for at least 500 ms
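If you only need the parsed DOM rather than every image and late AJAX call, domcontentloaded is usually the fastest safe choice; a minimal sketch:

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, log }) => {
        // Resolves immediately if the state has already been reached.
        await page.waitForLoadState('domcontentloaded');
        log.info('DOM parsed; dynamic content may still be loading');
    },
});

await crawler.run(['https://example.com']);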
2. Waiting for Specific Selectors
Often the best approach is to wait for specific elements that are loaded via JavaScript:
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    requestHandler: async ({ page, request, log }) => {
        // Wait for a specific element to appear
        await page.waitForSelector('.product-list', {
            timeout: 10000, // Wait up to 10 seconds
        });
        // Now extract data
        const products = await page.$$eval('.product-item', (elements) => {
            return elements.map((el) => ({
                name: el.querySelector('.product-name')?.textContent,
                price: el.querySelector('.product-price')?.textContent,
            }));
        });
        log.info(`Found ${products.length} products`);
    },
});

await crawler.run(['https://shop.example.com']);
3. Waiting for Multiple Conditions
You can combine multiple wait strategies for more robust scraping:
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, request, log }) => {
        // Wait for the DOM to load
        await page.waitForLoadState('domcontentloaded');
        // Wait for specific content
        await page.waitForSelector('#main-content', { state: 'visible' });
        // Wait for a JavaScript variable to be defined
        await page.waitForFunction(() => window.dataLayer !== undefined);
        // Additional wait for AJAX to complete
        await page.waitForLoadState('networkidle');
        const content = await page.content();
        log.info('Page fully loaded');
    },
});

await crawler.run(['https://example.com']);
Advanced Waiting Techniques
Waiting for JavaScript Functions
Sometimes you need to wait for specific JavaScript conditions to be met:
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    requestHandler: async ({ page, request, log }) => {
        // Wait for a custom condition
        await page.waitForFunction(
            () => document.querySelectorAll('.loaded-item').length >= 10,
            { timeout: 15000 },
        );
        // Or wait for an API response to be processed
        await page.waitForFunction(
            () => window.apiDataLoaded === true,
            { timeout: 10000 },
        );
        const items = await page.$$('.loaded-item');
        log.info(`Found ${items.length} loaded items`);
    },
});

await crawler.run(['https://example.com/dynamic']);
Handling Lazy Loading and Infinite Scroll
For pages that load content as you scroll, you can automate scrolling and waiting:
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, request, log }) => {
        let previousHeight = 0;
        let currentHeight = await page.evaluate(() => document.body.scrollHeight);
        // Scroll until no more content loads
        while (previousHeight !== currentHeight) {
            previousHeight = currentHeight;
            // Scroll to the bottom
            await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
            // Wait for new content to load
            await page.waitForTimeout(2000);
            // Get the new height
            currentHeight = await page.evaluate(() => document.body.scrollHeight);
        }
        log.info('Finished loading all content');
        const allItems = await page.$$eval('.item', (items) => items.length);
        log.info(`Total items: ${allItems}`);
    },
});

await crawler.run(['https://example.com/infinite-scroll']);
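Crawlee also ships a utility that implements this scroll-and-wait loop for you. A sketch using playwrightUtils.infiniteScroll (option names may vary between Crawlee versions, so check the reference for yours):

import { PlaywrightCrawler, playwrightUtils } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, log }) => {
        // Scrolls down until the page height stops growing or the time budget runs out.
        await playwrightUtils.infiniteScroll(page, { timeoutSecs: 30 });
        const total = await page.$$eval('.item', (items) => items.length);
        log.info(`Total items after scrolling: ${total}`);
    },
});

await crawler.run(['https://example.com/infinite-scroll']);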
Waiting for AJAX Requests
When dealing with AJAX requests that load dynamic content, you can monitor network activity:
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, request, log }) => {
        // Wait for a specific API call to complete
        const responsePromise = page.waitForResponse(
            (response) => response.url().includes('/api/products') && response.status() === 200,
            { timeout: 10000 },
        );
        // Trigger the action that causes the API call
        await page.click('#load-more-button');
        // Wait for the response
        await responsePromise;
        // Wait for the DOM to update with the new data
        await page.waitForSelector('.new-products', { timeout: 5000 });
        log.info('AJAX content loaded successfully');
    },
});

await crawler.run(['https://example.com/products']);
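When the API response itself contains the data you need, it can be simpler to parse the payload directly instead of waiting for the DOM to re-render. A sketch that assumes the hypothetical /api/products endpoint returns a JSON array:

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, log }) => {
        const responsePromise = page.waitForResponse(
            (response) => response.url().includes('/api/products') && response.ok(),
        );
        await page.click('#load-more-button');
        const response = await responsePromise;
        // Read the JSON payload directly instead of scraping the updated DOM.
        const data = await response.json();
        log.info(`API returned ${Array.isArray(data) ? data.length : 0} records`);
    },
});

await crawler.run(['https://example.com/products']);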
Using Pre-Navigation Hooks
Crawlee allows you to configure waiting behavior globally using pre-navigation hooks:
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    preNavigationHooks: [
        async ({ page, request }, goToOptions) => {
            // Set the default wait condition
            goToOptions.waitUntil = 'networkidle';
            goToOptions.timeout = 30000; // 30 seconds
        },
    ],
    requestHandler: async ({ page, request, log }) => {
        // The page is already loaded with networkidle
        const content = await page.content();
        log.info('Processing page content');
    },
});

await crawler.run(['https://example.com']);
TypeScript Example
Here's a complete TypeScript example demonstrating multiple waiting strategies:
import { Dataset, PlaywrightCrawler, PlaywrightCrawlingContext } from 'crawlee';

interface ProductData {
    name: string;
    price: string;
    availability: string;
}

const crawler = new PlaywrightCrawler({
    maxRequestsPerCrawl: 50,
    requestHandler: async ({ page, request, log }: PlaywrightCrawlingContext) => {
        log.info(`Processing: ${request.url}`);
        try {
            // Wait for initial content
            await page.waitForLoadState('domcontentloaded');
            // Wait for the product grid to be visible
            await page.waitForSelector('.product-grid', {
                state: 'visible',
                timeout: 10000,
            });
            // Wait for price elements to load (often loaded via JS).
            // Note: in Playwright, the options object is the third argument, after `arg`.
            await page.waitForFunction(
                () => {
                    const priceElements = document.querySelectorAll('.product-price');
                    return priceElements.length > 0 &&
                        Array.from(priceElements).every((el) => el.textContent?.trim());
                },
                undefined,
                { timeout: 8000 },
            );
            // Extract data
            const products: ProductData[] = await page.$$eval('.product-item', (elements) => {
                return elements.map((el) => ({
                    name: el.querySelector('.product-name')?.textContent?.trim() || '',
                    price: el.querySelector('.product-price')?.textContent?.trim() || '',
                    availability: el.querySelector('.availability')?.textContent?.trim() || '',
                }));
            });
            log.info(`Extracted ${products.length} products`);
            // Save data
            await Dataset.pushData({
                url: request.url,
                products,
                scrapedAt: new Date().toISOString(),
            });
        } catch (error) {
            log.error(`Error processing ${request.url}: ${error}`);
        }
    },
    failedRequestHandler: async ({ request, log }) => {
        log.error(`Request ${request.url} failed too many times`);
    },
});

await crawler.run([
    'https://example.com/products',
    'https://example.com/products?page=2',
]);
Best Practices for Waiting in Crawlee
1. Choose the Right Waiting Strategy
- Use waitForSelector() when you know specific elements will appear
- Use waitForLoadState('networkidle') for heavily dynamic pages
- Use waitForFunction() for custom conditions
- Combine multiple strategies for reliability, as in the sketch below
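One simple way to combine strategies is to try a fast, specific wait first and fall back to a broader one. A minimal sketch (the waitForContent helper and the .main-content selector are illustrative):

import { PlaywrightCrawler } from 'crawlee';

// Hypothetical helper: prefer a specific selector, fall back to network idle.
async function waitForContent(page, selector) {
    try {
        await page.waitForSelector(selector, { timeout: 5000 });
    } catch {
        // The selector never appeared in time; wait for a broader signal instead.
        await page.waitForLoadState('networkidle');
    }
}

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, log }) => {
        await waitForContent(page, '.main-content');
        log.info('Content ready (specific selector or network idle)');
    },
});

await crawler.run(['https://example.com']);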
2. Set Appropriate Timeouts
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, request, log }) => {
        try {
            // Set reasonable timeouts
            await page.waitForSelector('.content', { timeout: 15000 });
        } catch (error) {
            log.warning('Content did not load in time, using fallback');
            // Implement fallback logic
        }
    },
    navigationTimeoutSecs: 60, // Global navigation timeout
});
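Beyond per-wait timeouts, Crawlee can also cap how long the handler as a whole may run; a sketch using the requestHandlerTimeoutSecs option:

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    navigationTimeoutSecs: 60, // Cap for the navigation itself
    requestHandlerTimeoutSecs: 120, // Cap for the whole requestHandler run
    requestHandler: async ({ page, log }) => {
        await page.waitForSelector('.content', { timeout: 15000 });
        log.info('Content loaded within the time budget');
    },
});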
3. Handle Errors Gracefully
Similar to handling timeouts in Puppeteer, implement proper error handling:
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, request, log }) => {
        try {
            await page.waitForSelector('.main-content', { timeout: 10000 });
        } catch (error) {
            log.warning(`Timeout waiting for selector: ${error.message}`);
            // Continue with available content or skip
            return;
        }
        // Process the page
    },
    maxRequestRetries: 3,
});
4. Optimize Performance
Don't wait longer than necessary:
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, request, log }) => {
        // Use Promise.race for multiple possible selectors
        // (note: the losing waitForSelector keeps running until its own timeout)
        await Promise.race([
            page.waitForSelector('.content-type-a'),
            page.waitForSelector('.content-type-b'),
        ]);
        // Or wait for the first of multiple conditions.
        // In Playwright, the options object is the third argument, after `arg`.
        await page.waitForFunction(
            () => document.querySelector('.loaded') ||
                document.querySelector('.alternative-loaded'),
            undefined,
            { timeout: 8000 },
        );
    },
});
Python Example with Crawlee
If you're using Crawlee for Python, the syntax is similar:
from crawlee.playwright_crawler import PlaywrightCrawler


async def main():
    crawler = PlaywrightCrawler()

    @crawler.router.default_handler
    async def request_handler(context):
        page = context.page
        log = context.log
        # Wait for network to be idle
        await page.wait_for_load_state('networkidle')
        # Wait for a specific selector (timeout is in milliseconds)
        await page.wait_for_selector('.product-list', timeout=10000)
        # Wait for a custom function
        await page.wait_for_function(
            'document.querySelectorAll(".product").length >= 10',
            timeout=15000,
        )
        # Extract data
        products = await page.eval_on_selector_all(
            '.product',
            'elements => elements.map(el => el.textContent)',
        )
        log.info(f'Found {len(products)} products')

    await crawler.run(['https://example.com'])


if __name__ == '__main__':
    import asyncio
    asyncio.run(main())
Conclusion
Waiting for JavaScript to load properly is crucial for successful web scraping with Crawlee. By using the appropriate waiting strategies, whether it's waitForSelector(), waitForLoadState(), waitForFunction(), or a combination, you can ensure your scraper captures all dynamically loaded content reliably. When scraping single-page applications, these techniques become even more critical.
Remember to balance thoroughness with performance by setting reasonable timeouts, implementing proper error handling, and choosing the most efficient waiting strategy for your specific use case. With Crawlee's flexible API, you have full control over how your crawler waits for and processes JavaScript-rendered content.