How does Crawlee handle AJAX-loaded content?
AJAX (Asynchronous JavaScript and XML) has become the backbone of modern web applications, enabling dynamic content loading without full page refreshes. When scraping websites that rely on AJAX, traditional HTTP-based scrapers often fail because they only capture the initial HTML response, missing content that loads asynchronously. Crawlee provides robust solutions for handling AJAX-loaded content through its browser automation capabilities and intelligent waiting mechanisms.
Understanding AJAX Content Loading
AJAX-loaded content presents unique challenges for web scraping:
- Delayed rendering: Content loads after the initial page load
- Dynamic DOM updates: JavaScript modifies the page structure continuously
- Infinite scrolling: New content appears as users scroll
- Event-triggered loading: Content loads in response to user interactions
- API-based data fetching: Data arrives through XHR/Fetch requests
Crawlee addresses these challenges through multiple strategies, depending on which crawler type you use.
Browser-Based Crawlers for AJAX Content
Crawlee's browser automation crawlers (PlaywrightCrawler and PuppeteerCrawler) are specifically designed to handle JavaScript-heavy websites and AJAX content. These crawlers execute JavaScript just like a real browser, allowing AJAX requests to complete and content to render before extraction.
Using PlaywrightCrawler
PlaywrightCrawler is the recommended choice for modern web scraping with full JavaScript support:
```javascript
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, request, log, pushData }) => {
        // Wait for AJAX content to load
        await page.waitForSelector('.ajax-loaded-content', {
            state: 'visible',
            timeout: 10000,
        });

        // Extract data after AJAX completes
        const data = await page.evaluate(() => {
            const items = [];
            document.querySelectorAll('.product-item').forEach((el) => {
                items.push({
                    title: el.querySelector('.title')?.textContent,
                    price: el.querySelector('.price')?.textContent,
                });
            });
            return items;
        });

        log.info(`Scraped ${data.length} items from ${request.url}`);
        await pushData(data);
    },
});

await crawler.run(['https://example.com/products']);
```
Using PuppeteerCrawler
PuppeteerCrawler offers similar functionality with Puppeteer's API:
```javascript
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    requestHandler: async ({ page, request, pushData }) => {
        // Wait for the network to go idle (all AJAX requests complete)
        await page.waitForNetworkIdle({ idleTime: 500 });

        // Or wait for a specific element
        await page.waitForSelector('#dynamic-content');

        const content = await page.$eval('#dynamic-content', (el) => el.innerHTML);
        await pushData({ url: request.url, content });
    },
});

await crawler.run(['https://example.com']);
```
Waiting Strategies for AJAX Content
Crawlee provides multiple strategies to ensure AJAX content has loaded before scraping:
1. Wait for Selectors
The most reliable method is waiting for specific elements that appear after AJAX completes:
```javascript
const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page }) => {
        // Wait for a single element to appear
        await page.waitForSelector('.ajax-results', {
            state: 'visible',
            timeout: 30000,
        });

        // Wait for multiple elements
        await Promise.all([
            page.waitForSelector('.product-list'),
            page.waitForSelector('.pagination'),
            page.waitForSelector('.filters'),
        ]);
    },
});
```
2. Wait for Network Idle
Puppeteer users may know this as networkidle waiting; Crawlee's browser crawlers can likewise wait for network activity to cease before extracting content:
```javascript
const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page }) => {
        // Wait until there has been no network activity
        // for at least 500 ms
        await page.waitForLoadState('networkidle');

        // Now safe to extract AJAX-loaded content
        const data = await page.evaluate(() => {
            return document.querySelector('.ajax-content')?.textContent;
        });
    },
});
```
3. Custom Wait Functions
For complex scenarios, implement custom waiting logic:
```javascript
const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page }) => {
        // Wait for a custom condition: the spinner is gone and the
        // content container has children. Note that in Playwright,
        // options are the third argument of waitForFunction.
        await page.waitForFunction(
            () => {
                const loader = document.querySelector('.loading-spinner');
                const content = document.querySelector('.content-loaded');
                return !loader && content && content.children.length > 0;
            },
            undefined,
            { timeout: 15000 },
        );

        // Content is now ready for extraction
    },
});
```
4. Timed Waits
As a last resort, use fixed delays (though less reliable):
```javascript
const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page }) => {
        // Wait a fixed 2 seconds
        await page.waitForTimeout(2000);

        // Proceed with extraction
    },
});
```
Handling Infinite Scroll and Lazy Loading
Many modern websites use infinite scroll to load content dynamically. Crawlee can simulate scrolling to trigger AJAX requests:
```javascript
const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, log, infiniteScroll, pushData }) => {
        // Use the built-in infinite scroll helper...
        await infiniteScroll({
            maxScrollHeight: 10000,
            waitForSecs: 2,
            scrollDownAndUp: true,
        });

        // ...or implement custom scrolling
        let previousHeight = 0;
        let currentHeight = await page.evaluate(() => document.body.scrollHeight);

        while (previousHeight < currentHeight) {
            // Scroll to the bottom
            await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));

            // Wait for new content
            await page.waitForTimeout(1000);
            await page.waitForLoadState('networkidle');

            previousHeight = currentHeight;
            currentHeight = await page.evaluate(() => document.body.scrollHeight);
            log.info(`Scrolled to ${currentHeight}px`);
        }

        // Extract all loaded content
        const items = await page.$$eval('.item', (elements) =>
            elements.map((el) => ({
                title: el.querySelector('h2')?.textContent,
                description: el.querySelector('p')?.textContent,
            })),
        );

        await pushData(items);
    },
});
```
Monitoring Network Requests
Just as you can monitor network requests directly in Puppeteer, you can intercept and analyze AJAX requests in Crawlee to extract data straight from the API responses:
```javascript
const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, log, pushData }) => {
        const apiData = [];

        // Intercept API responses
        page.on('response', async (response) => {
            const url = response.url();
            // Check if this is the API endpoint we're interested in
            if (url.includes('/api/products')) {
                try {
                    const json = await response.json();
                    apiData.push(...json.products);
                    log.info(`Captured API data: ${json.products.length} items`);
                } catch (e) {
                    log.error(`Failed to parse JSON from ${url}`);
                }
            }
        });

        // The initial navigation happens before the request handler runs,
        // so reload the page to replay its AJAX requests with the
        // listener attached
        await page.reload();

        // Wait for all requests to complete
        await page.waitForLoadState('networkidle');

        // Save the intercepted API data
        await pushData(apiData);
    },
});
```
Handling Click-Triggered AJAX Content
Some content only loads when users interact with the page:
```javascript
const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, log, pushData }) => {
        // Click the "Load More" button to trigger AJAX
        let hasMoreContent = true;
        let clickCount = 0;
        const maxClicks = 10;

        while (hasMoreContent && clickCount < maxClicks) {
            try {
                const loadMoreButton = await page.$('.load-more-button');
                if (!loadMoreButton) {
                    hasMoreContent = false;
                    break;
                }

                await loadMoreButton.click();
                clickCount++;

                // Wait for the new content to appear
                await page.waitForTimeout(1000);
                await page.waitForLoadState('networkidle');
                log.info(`Clicked "Load More" ${clickCount} times`);
            } catch (e) {
                log.info('No more content to load');
                hasMoreContent = false;
            }
        }

        // Extract all loaded items
        const items = await page.$$eval('.item', (elements) =>
            elements.map((el) => el.textContent),
        );

        await pushData(items);
    },
});
```
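When scrolling and a "Load More" button appear together, the built-in infiniteScroll helper can also do the clicking for you via its buttonSelector option, which clicks a matching button whenever one shows up during scrolling. A minimal sketch, with '.load-more-button' as a placeholder selector:

```javascript
const crawler = new PlaywrightCrawler({
    requestHandler: async ({ infiniteScroll }) => {
        // Scroll the page, clicking the button whenever it appears;
        // '.load-more-button' is a placeholder for the site's real selector
        await infiniteScroll({
            buttonSelector: '.load-more-button',
            waitForSecs: 2,
        });
    },
});
```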
Python Implementation
Crawlee for Python offers similar capabilities:
```python
import asyncio

from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    crawler = PlaywrightCrawler()

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        page = context.page

        # Wait for AJAX content
        await page.wait_for_selector('.ajax-loaded-content', state='visible', timeout=10000)

        # Wait for network idle
        await page.wait_for_load_state('networkidle')

        # Extract data
        data = await page.evaluate('''() => {
            return Array.from(document.querySelectorAll('.product')).map((el) => ({
                title: el.querySelector('.title')?.textContent,
                price: el.querySelector('.price')?.textContent,
            }));
        }''')

        await context.push_data(data)

    await crawler.run(['https://example.com/products'])


if __name__ == '__main__':
    asyncio.run(main())
```
Best Practices for AJAX Content Scraping
- Use specific selectors: Wait for unique elements that only appear after AJAX completes
- Combine strategies: Use both selector waiting and network idle detection
- Set reasonable timeouts: Balance between reliability and performance
- Handle errors gracefully: AJAX requests can fail; implement retry logic (see the sketch after this list)
- Respect rate limits: Don't overload servers with rapid requests
- Monitor network traffic: Understanding API endpoints can simplify data extraction
- Test thoroughly: AJAX behavior can vary across different pages and conditions
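On the error-handling point: Crawlee retries failed requests out of the box, so the simplest retry logic is to let the handler throw when expected content never appears. A minimal sketch, with an illustrative selector and retry count:

```javascript
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    // Retry each failed request up to 3 times before giving up
    maxRequestRetries: 3,
    requestHandler: async ({ page }) => {
        // If the AJAX content never appears, this throws, the request
        // is marked as failed, and Crawlee schedules a retry
        await page.waitForSelector('.ajax-results', { timeout: 15000 });
    },
    // Runs once a request has exhausted all of its retries
    failedRequestHandler: async ({ request, log }) => {
        log.error(`${request.url} failed after all retries.`);
    },
});
```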
When to Use CheerioCrawler vs Browser Crawlers
CheerioCrawler is faster but cannot handle AJAX content because it doesn't execute JavaScript. Use it only when:
- The website serves all content in initial HTML
- You've identified API endpoints and can call them directly (see the sketch after this list)
- Performance is critical and JavaScript execution isn't needed
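For the direct-API case, you often don't need HTML parsing at all: a JSON endpoint can be fetched with Crawlee's HttpCrawler and parsed directly. A minimal sketch, assuming a hypothetical endpoint that returns a JSON body with a products array:

```javascript
import { HttpCrawler } from 'crawlee';

const crawler = new HttpCrawler({
    requestHandler: async ({ request, body, log, pushData }) => {
        // Hypothetical endpoint: assumed to return JSON with a `products` array
        const { products } = JSON.parse(body.toString());
        log.info(`Fetched ${products.length} products from ${request.url}`);
        await pushData(products);
    },
});

await crawler.run(['https://example.com/api/products?page=1']);
```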
For AJAX-heavy websites, always use PlaywrightCrawler or PuppeteerCrawler, as they provide the full browser environment needed for JavaScript execution and dynamic content loading.
Conclusion
Crawlee excels at handling AJAX-loaded content through its browser automation crawlers, intelligent waiting mechanisms, and flexible API. By leveraging PlaywrightCrawler or PuppeteerCrawler with proper waiting strategies, you can reliably scrape even the most dynamic, JavaScript-heavy websites. The key is understanding how content loads on your target website and choosing the appropriate waiting strategy—whether that's waiting for specific selectors, network idle states, or custom conditions.
For additional control over browser automation and handling complex single-page applications, Crawlee's integration with Playwright and Puppeteer provides all the tools you need to successfully extract data from modern web applications.