What is JSDOMCrawler in Crawlee and when should I use it?
JSDOMCrawler is one of the crawler classes in the Crawlee framework that provides a middle ground between the lightweight CheerioCrawler and the more resource-intensive browser-based crawlers like PuppeteerCrawler and PlaywrightCrawler. It uses JSDOM, a pure JavaScript implementation of web standards, to parse HTML and execute JavaScript without launching a real browser.
Understanding JSDOMCrawler
JSDOMCrawler combines the speed and efficiency of HTML parsing with the ability to execute JavaScript code on the page. Unlike CheerioCrawler, which can only parse static HTML, JSDOMCrawler can handle pages that require basic JavaScript execution for rendering content. However, it's still significantly faster and uses fewer resources than full browser automation tools.
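To make this concrete, here is a minimal sketch using the underlying jsdom library directly (outside of Crawlee) to show HTML being parsed and an inline script populating the DOM without a browser. The markup is invented for illustration and assumes the jsdom package is installed:

import { JSDOM } from 'jsdom';

// A tiny page whose visible content is produced by an inline script
const html = `
  <body>
    <div id="app"></div>
    <script>
      document.getElementById('app').textContent = 'Rendered by JavaScript';
    </script>
  </body>`;

// runScripts: 'dangerously' tells jsdom to execute the page's own scripts
const dom = new JSDOM(html, { runScripts: 'dangerously' });

console.log(dom.window.document.getElementById('app').textContent);
// -> "Rendered by JavaScript"

JSDOMCrawler wraps this kind of environment for you and plugs it into Crawlee's request queue, autoscaling, and data storage.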
Key Features
- JavaScript Execution: Runs page scripts in a simulated browser environment without launching an actual browser (script execution is opt-in via the runScripts option)
- DOM Manipulation: Supports standard DOM APIs for interacting with page elements
- Lightweight: Uses less memory and CPU compared to headless browsers
- Fast: Processes pages more quickly than browser-based solutions
- Resource Efficient: Can handle higher concurrency levels than browser crawlers
- Limited Browser Features: Doesn't support all browser features like CSS rendering, canvas, or WebGL
When to Use JSDOMCrawler
JSDOMCrawler is ideal for specific scraping scenarios where you need JavaScript execution but want to avoid the overhead of a full browser.
Perfect Use Cases
- Simple JavaScript Rendering: When the target website uses basic JavaScript to populate content after the initial page load
- DOM Manipulation: Sites that modify the DOM structure through JavaScript but don't rely on complex browser features
- API Calls in JavaScript: Pages that make simple AJAX requests to load data
- High-Volume Scraping: When you need to scrape thousands of pages and want better performance than browser automation
- Server-Side Rendered React/Vue Apps: Applications whose content arrives in the initial HTML and only needs light client-side hydration
When NOT to Use JSDOMCrawler
Avoid JSDOMCrawler in these scenarios:
- Complex Single Page Applications (SPAs): Modern frameworks like React, Angular, or Vue with heavy client-side routing
- Advanced Browser Features: Sites requiring WebGL, Canvas, or complex CSS rendering
- Bot Detection: Websites with sophisticated anti-scraping measures that detect JSDOM
- Interactive Elements: Pages requiring user interactions like clicks, scrolls, or form submissions
- WebSocket Connections: Real-time features that depend on WebSocket communication
- Service Workers: Progressive web apps that rely on service workers
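A quick way to decide between these options is to check whether the data you need is already present in the raw HTML. The sketch below is one rough heuristic, assuming Node 18+ for the global fetch API and using a placeholder URL and marker string:

const response = await fetch('https://example.com/products');
const html = await response.text();

// If the marker you care about is already in the static HTML,
// CheerioCrawler is probably enough; otherwise try JSDOMCrawler
// (or a browser crawler for heavy client-side rendering).
if (html.includes('product-item')) {
    console.log('Content is in the static HTML');
} else {
    console.log('Content is likely rendered client-side');
}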
Basic JSDOMCrawler Example
Here's how to set up and use JSDOMCrawler in your project:
import { JSDOMCrawler } from 'crawlee';

const crawler = new JSDOMCrawler({
    // Maximum number of concurrent requests
    maxConcurrency: 50,
    // Request handler for processing each page
    async requestHandler({ request, window, pushData }) {
        const { document } = window;

        // Extract data using standard DOM APIs
        const title = document.querySelector('h1')?.textContent;
        const description = document.querySelector('meta[name="description"]')?.content;

        // Get all product items
        const products = Array.from(document.querySelectorAll('.product-item')).map((item) => ({
            name: item.querySelector('.product-name')?.textContent?.trim(),
            price: item.querySelector('.product-price')?.textContent?.trim(),
            url: item.querySelector('a')?.href,
        }));

        console.log(`Scraped ${products.length} products from ${request.url}`);

        // Save data to the default dataset
        await pushData({
            url: request.url,
            title,
            description,
            products,
        });
    },
    // Error handler - note that the error is passed as a second argument
    async failedRequestHandler({ request }, error) {
        console.error(`Request ${request.url} failed: ${error.message}`);
    },
});

// Add initial URLs to the queue
await crawler.addRequests([
    'https://example.com/products',
    'https://example.com/categories',
]);

// Start the crawler
await crawler.run();
Advanced Configuration Options
JSDOMCrawler provides several configuration options to optimize your scraping workflow:
import { JSDOMCrawler } from 'crawlee';

const crawler = new JSDOMCrawler({
    // Concurrency settings
    maxConcurrency: 100,
    minConcurrency: 10,
    // Request configuration
    maxRequestRetries: 3,
    maxRequestsPerMinute: 120,
    // JSDOM-specific options
    runScripts: true, // Execute page scripts (disabled by default)
    hideInternalConsole: true, // Suppress console output from scripts running inside JSDOM
    // Timeouts
    navigationTimeoutSecs: 30,
    requestHandlerTimeoutSecs: 60,

    async requestHandler({ request, window, crawler, pushData }) {
        const { document } = window;

        // Give asynchronous scripts a moment to finish
        await new Promise((resolve) => setTimeout(resolve, 1000));

        // Extract data after JS execution
        const dynamicContent = document.querySelector('.js-rendered-content')?.textContent;

        // Enqueue additional URLs
        const links = Array.from(document.querySelectorAll('a.pagination'))
            .map((a) => a.href)
            .filter((href) => href);
        await crawler.addRequests(links);

        await pushData({
            url: request.url,
            dynamicContent,
        });
    },
});

await crawler.run(['https://example.com']);
Working with JavaScript-Heavy Pages
When dealing with pages that execute JavaScript to render content, you might need to wait for specific elements or conditions:
import { JSDOMCrawler } from 'crawlee';

const crawler = new JSDOMCrawler({
    runScripts: true, // Page scripts must run for dynamic content to appear
    async requestHandler({ request, window, pushData }) {
        const { document } = window;

        // Simple polling mechanism to wait for content
        const waitForSelector = async (selector, timeout = 5000) => {
            const startTime = Date.now();
            while (Date.now() - startTime < timeout) {
                const element = document.querySelector(selector);
                if (element) return element;
                await new Promise((resolve) => setTimeout(resolve, 100));
            }
            throw new Error(`Timeout waiting for selector: ${selector}`);
        };

        try {
            // Wait for dynamic content to load
            await waitForSelector('.dynamic-content');
            const content = document.querySelector('.dynamic-content')?.textContent;

            await pushData({
                url: request.url,
                content,
            });
        } catch (error) {
            console.error(`Failed to find content on ${request.url}`);
        }
    },
});

await crawler.run(['https://example.com']);
Comparison with Other Crawlee Crawlers
Understanding when to use JSDOMCrawler versus other crawler types is crucial for optimal performance:
JSDOMCrawler vs CheerioCrawler
- Speed: CheerioCrawler is faster (no JavaScript execution)
- JavaScript: JSDOMCrawler can execute JavaScript, CheerioCrawler cannot
- Resource Usage: CheerioCrawler uses less memory
- Use Case: Use CheerioCrawler for static HTML, JSDOMCrawler for basic JavaScript rendering
JSDOMCrawler vs PuppeteerCrawler/PlaywrightCrawler
- Performance: JSDOMCrawler is typically several times faster, since no browser is launched per page
- Resource Usage: JSDOMCrawler uses a fraction of the memory of a headless browser
- Capabilities: Browser crawlers support full browser features (canvas, WebGL, complex interactions)
- Concurrency: JSDOMCrawler can sustain considerably higher concurrency on the same hardware
- Use Case: Use browser crawlers for complex SPAs and sites with anti-bot protection
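Because all Crawlee crawlers share the same high-level API (request queue, autoscaling, pushData), switching between them mostly means changing the class and the handler context. The sketch below is illustrative rather than a drop-in recipe, and the Playwright variant assumes the playwright package is installed:

import { CheerioCrawler, JSDOMCrawler, PlaywrightCrawler } from 'crawlee';

// CheerioCrawler: static HTML only, fastest
const cheerio = new CheerioCrawler({
    async requestHandler({ request, $, pushData }) {
        await pushData({ url: request.url, title: $('title').text() });
    },
});

// JSDOMCrawler: simulated DOM, optional script execution
const jsdom = new JSDOMCrawler({
    runScripts: true,
    async requestHandler({ request, window, pushData }) {
        await pushData({ url: request.url, title: window.document.title });
    },
});

// PlaywrightCrawler: real browser, heaviest but most capable
const playwright = new PlaywrightCrawler({
    async requestHandler({ request, page, pushData }) {
        await pushData({ url: request.url, title: await page.title() });
    },
});

// Each one is started the same way, e.g. await jsdom.run(['https://example.com']);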
Performance Optimization Tips
To get the most out of JSDOMCrawler, consider these optimization strategies:
import { JSDOMCrawler } from 'crawlee';

const crawler = new JSDOMCrawler({
    // Skip script execution entirely when the data is already in the static HTML
    runScripts: false,
    // Increase concurrency for better throughput and let the
    // autoscaled pool adjust it dynamically between these bounds
    autoscaledPoolOptions: {
        minConcurrency: 10,
        maxConcurrency: 200,
        desiredConcurrency: 50,
    },
    async requestHandler({ request, window, pushData }) {
        const { document } = window;

        // Extract only what you need
        const data = {
            url: request.url,
            title: document.title,
            // Use efficient selectors and cap the amount of extracted data
            items: Array.from(document.querySelectorAll('.item'))
                .slice(0, 100)
                .map((el) => el.textContent?.trim()),
        };

        await pushData(data);
    },
});

await crawler.run(['https://example.com']);
Handling Common Challenges
Dealing with Async Content
Some pages load content asynchronously after the initial render:
import { JSDOMCrawler } from 'crawlee';

const crawler = new JSDOMCrawler({
    runScripts: true,
    async requestHandler({ request, window, pushData }) {
        const { document } = window;

        // Poll the DOM until the content appears or the timeout expires
        const waitForContent = (timeoutMs = 5000) => {
            return new Promise((resolve) => {
                const deadline = Date.now() + timeoutMs;
                const checkContent = () => {
                    const content = document.querySelector('.async-content');
                    if (content && content.children.length > 0) {
                        resolve(content);
                    } else if (Date.now() > deadline) {
                        resolve(null); // Give up after the timeout
                    } else {
                        setTimeout(checkContent, 100);
                    }
                };
                checkContent();
            });
        };

        const content = await waitForContent();
        if (content) {
            await pushData({
                url: request.url,
                text: content.textContent,
            });
        }
    },
});

await crawler.run(['https://example.com']);
Working with Forms and Input
JSDOMCrawler can interact with DOM elements programmatically, within the limits of JSDOM (DOM-level manipulation only, with no real rendering or native user input events):
import { JSDOMCrawler } from 'crawlee';

const crawler = new JSDOMCrawler({
    runScripts: true,
    async requestHandler({ request, window, pushData }) {
        const { document } = window;

        // Simulate form interaction by setting the value directly
        const searchInput = document.querySelector('input[name="search"]');
        if (searchInput) {
            searchInput.value = 'test query';
            // Trigger an input event so the page's scripts react to the change
            const event = new window.Event('input', { bubbles: true });
            searchInput.dispatchEvent(event);
        }

        // Wait for results to update
        await new Promise((resolve) => setTimeout(resolve, 500));

        // Extract search results
        const results = Array.from(document.querySelectorAll('.search-result'))
            .map((el) => el.textContent?.trim());

        await pushData({ url: request.url, results });
    },
});

await crawler.run(['https://example.com']);
TypeScript Support
JSDOMCrawler works seamlessly with TypeScript:
import { JSDOMCrawler, type JSDOMCrawlerOptions } from 'crawlee';

interface Product {
    name: string;
    price: number;
    url: string;
}

const crawlerOptions: JSDOMCrawlerOptions = {
    maxConcurrency: 50,
    async requestHandler({ window, pushData }) {
        const { document } = window;

        const products: Product[] = Array.from(
            document.querySelectorAll('.product'),
        ).map((el): Product => ({
            name: el.querySelector('.name')?.textContent || '',
            price: parseFloat(el.querySelector('.price')?.textContent || '0'),
            url: (el.querySelector('a') as HTMLAnchorElement)?.href || '',
        }));

        await pushData({ products });
    },
};

const crawler = new JSDOMCrawler(crawlerOptions);
await crawler.run(['https://example.com/products']);
Best Practices
When working with JSDOMCrawler, follow these best practices:
- Test JavaScript Requirements: Verify if your target site actually needs JavaScript execution or if CheerioCrawler would suffice
- Monitor Resource Usage: Keep an eye on memory consumption, especially with high concurrency
- Set Appropriate Timeouts: Configure timeouts to prevent hanging requests
- Handle Errors Gracefully: Implement proper error handling for failed requests
- Use Request Queues: Leverage Crawlee's built-in queue management for large-scale scraping
- Respect Robots.txt: Always check and respect the site's robots.txt file
- Implement Rate Limiting: Use maxRequestsPerMinute to avoid overwhelming target servers (see the configuration sketch after this list)
- Clean Up Resources: Ensure proper cleanup after crawling completes
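As a closing illustration, here is a minimal configuration sketch applying several of these practices together; the values are placeholders rather than recommendations:

import { JSDOMCrawler, log } from 'crawlee';

const crawler = new JSDOMCrawler({
    maxRequestsPerMinute: 60, // Rate limiting
    maxRequestRetries: 2, // Retry transient failures
    requestHandlerTimeoutSecs: 30, // Prevent hanging requests
    async requestHandler({ request, window, pushData }) {
        const title = window.document.title;
        await pushData({ url: request.url, title });
    },
    // Log failures instead of letting them pass silently
    async failedRequestHandler({ request }, error) {
        log.error(`${request.url} failed: ${error.message}`);
    },
});

await crawler.run(['https://example.com']);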
Conclusion
JSDOMCrawler is a powerful tool in Crawlee's arsenal that bridges the gap between static HTML parsing and full browser automation. It's perfect for sites that require basic JavaScript execution without the overhead of launching actual browsers. By understanding its capabilities and limitations, you can choose the right crawler type for your specific scraping needs and build efficient, scalable web scraping solutions.
For projects requiring complex browser interactions or heavy client-side rendering, consider upgrading to PuppeteerCrawler or PlaywrightCrawler. For purely static HTML sites, CheerioCrawler offers better performance. JSDOMCrawler shines in the middle ground, providing a good balance of performance and JavaScript support for many common web scraping scenarios.