What is JSDOMCrawler in Crawlee and when should I use it?
JSDOMCrawler is one of the crawler classes in the Crawlee framework that provides a middle ground between the lightweight CheerioCrawler and the more resource-intensive browser-based crawlers like PuppeteerCrawler and PlaywrightCrawler. It uses JSDOM, a pure JavaScript implementation of web standards, to parse HTML and execute JavaScript without launching a real browser.
Understanding JSDOMCrawler
JSDOMCrawler combines the speed and efficiency of HTML parsing with the ability to execute JavaScript code on the page. Unlike CheerioCrawler, which can only parse static HTML, JSDOMCrawler can handle pages that require basic JavaScript execution for rendering content. However, it's still significantly faster and uses fewer resources than full browser automation tools.
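To make this concrete, here is a minimal sketch using the underlying jsdom library directly (outside of Crawlee) to show HTML being parsed and an inline script populating the DOM without a browser. The markup is invented for illustration and assumes the jsdom package is installed:

import { JSDOM } from 'jsdom';

// A tiny page whose visible content is produced by an inline script
const html = `
  <body>
    <div id="app"></div>
    <script>
      document.getElementById('app').textContent = 'Rendered by JavaScript';
    </script>
  </body>`;

// runScripts: 'dangerously' tells jsdom to execute the page's own scripts
const dom = new JSDOM(html, { runScripts: 'dangerously' });

console.log(dom.window.document.getElementById('app').textContent);
// -> "Rendered by JavaScript"

JSDOMCrawler wraps this kind of environment for you and plugs it into Crawlee's request queue, autoscaling, and data storage.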
Key Features
- JavaScript Execution: Runs page scripts in a simulated browser environment without launching an actual browser (script execution is opt-in via the runScripts option)
- DOM Manipulation: Supports standard DOM APIs for interacting with page elements
- Lightweight: Uses less memory and CPU compared to headless browsers
- Fast: Processes pages more quickly than browser-based solutions
- Resource Efficient: Can handle higher concurrency levels than browser crawlers
- Limited Browser Features: Doesn't support all browser features like CSS rendering, canvas, or WebGL
When to Use JSDOMCrawler
JSDOMCrawler is ideal for specific scraping scenarios where you need JavaScript execution but want to avoid the overhead of a full browser.
Perfect Use Cases
- Simple JavaScript Rendering: When the target website uses basic JavaScript to populate content after the initial page load
- DOM Manipulation: Sites that modify the DOM structure through JavaScript but don't rely on complex browser features
- API Calls in JavaScript: Pages that make simple AJAX requests to load data
- High-Volume Scraping: When you need to scrape thousands of pages and want better performance than browser automation
- Server-Side Rendered React/Vue Apps: Applications whose content arrives in the initial HTML and only needs light client-side hydration
When NOT to Use JSDOMCrawler
Avoid JSDOMCrawler in these scenarios:
- Complex Single Page Applications (SPAs): Modern frameworks like React, Angular, or Vue with heavy client-side routing
- Advanced Browser Features: Sites requiring WebGL, Canvas, or complex CSS rendering
- Bot Detection: Websites with sophisticated anti-scraping measures that detect JSDOM
- Interactive Elements: Pages requiring user interactions like clicks, scrolls, or form submissions
- WebSocket Connections: Real-time features that depend on WebSocket communication
- Service Workers: Progressive web apps that rely on service workers
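A quick way to decide between these options is to check whether the data you need is already present in the raw HTML. The sketch below is one rough heuristic, assuming Node 18+ for the global fetch API and using a placeholder URL and marker string:

const response = await fetch('https://example.com/products');
const html = await response.text();

// If the marker you care about is already in the static HTML,
// CheerioCrawler is probably enough; otherwise try JSDOMCrawler
// (or a browser crawler for heavy client-side rendering).
if (html.includes('product-item')) {
    console.log('Content is in the static HTML');
} else {
    console.log('Content is likely rendered client-side');
}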
Basic JSDOMCrawler Example
Here's how to set up and use JSDOMCrawler in your project:
import { JSDOMCrawler } from 'crawlee';

const crawler = new JSDOMCrawler({
    // Maximum number of concurrent requests
    maxConcurrency: 50,
    // Request handler for processing each page
    async requestHandler({ request, window, pushData }) {
        const { document } = window;

        // Extract data using standard DOM APIs
        const title = document.querySelector('h1')?.textContent;
        const description = document.querySelector('meta[name="description"]')?.content;

        // Get all product items
        const products = Array.from(document.querySelectorAll('.product-item')).map((item) => ({
            name: item.querySelector('.product-name')?.textContent?.trim(),
            price: item.querySelector('.product-price')?.textContent?.trim(),
            url: item.querySelector('a')?.href,
        }));

        console.log(`Scraped ${products.length} products from ${request.url}`);

        // Save data to the default dataset
        await pushData({
            url: request.url,
            title,
            description,
            products,
        });
    },
    // Error handler - note that the error is passed as a second argument
    async failedRequestHandler({ request }, error) {
        console.error(`Request ${request.url} failed: ${error.message}`);
    },
});

// Add initial URLs to the queue
await crawler.addRequests([
    'https://example.com/products',
    'https://example.com/categories',
]);

// Start the crawler
await crawler.run();
Advanced Configuration Options
JSDOMCrawler provides several configuration options to optimize your scraping workflow:
import { JSDOMCrawler } from 'crawlee';

const crawler = new JSDOMCrawler({
    // Concurrency settings
    maxConcurrency: 100,
    minConcurrency: 10,
    // Request configuration
    maxRequestRetries: 3,
    maxRequestsPerMinute: 120,
    // JSDOM-specific options
    runScripts: true, // Execute page scripts (disabled by default)
    hideInternalConsole: true, // Suppress console output from scripts running inside JSDOM
    // Timeouts
    navigationTimeoutSecs: 30,
    requestHandlerTimeoutSecs: 60,

    async requestHandler({ request, window, crawler, pushData }) {
        const { document } = window;

        // Give asynchronous scripts a moment to finish
        await new Promise((resolve) => setTimeout(resolve, 1000));

        // Extract data after JS execution
        const dynamicContent = document.querySelector('.js-rendered-content')?.textContent;

        // Enqueue additional URLs
        const links = Array.from(document.querySelectorAll('a.pagination'))
            .map((a) => a.href)
            .filter((href) => href);
        await crawler.addRequests(links);

        await pushData({
            url: request.url,
            dynamicContent,
        });
    },
});

await crawler.run(['https://example.com']);
Working with JavaScript-Heavy Pages
When dealing with pages that execute JavaScript to render content, you might need to wait for specific elements or conditions:
import { JSDOMCrawler } from 'crawlee';

const crawler = new JSDOMCrawler({
    runScripts: true, // Page scripts must run for dynamic content to appear
    async requestHandler({ request, window, pushData }) {
        const { document } = window;

        // Simple polling mechanism to wait for content
        const waitForSelector = async (selector, timeout = 5000) => {
            const startTime = Date.now();
            while (Date.now() - startTime < timeout) {
                const element = document.querySelector(selector);
                if (element) return element;
                await new Promise((resolve) => setTimeout(resolve, 100));
            }
            throw new Error(`Timeout waiting for selector: ${selector}`);
        };

        try {
            // Wait for dynamic content to load
            await waitForSelector('.dynamic-content');
            const content = document.querySelector('.dynamic-content')?.textContent;

            await pushData({
                url: request.url,
                content,
            });
        } catch (error) {
            console.error(`Failed to find content on ${request.url}`);
        }
    },
});

await crawler.run(['https://example.com']);
Comparison with Other Crawlee Crawlers
Understanding when to use JSDOMCrawler versus other crawler types is crucial for optimal performance:
JSDOMCrawler vs CheerioCrawler
- Speed: CheerioCrawler is faster (no JavaScript execution)
- JavaScript: JSDOMCrawler can execute JavaScript, CheerioCrawler cannot
- Resource Usage: CheerioCrawler uses less memory
- Use Case: Use CheerioCrawler for static HTML, JSDOMCrawler for basic JavaScript rendering
JSDOMCrawler vs PuppeteerCrawler/PlaywrightCrawler
- Performance: JSDOMCrawler is typically several times faster, since no browser is launched per page
- Resource Usage: JSDOMCrawler uses a fraction of the memory of a headless browser
- Capabilities: Browser crawlers support full browser features (canvas, WebGL, complex interactions)
- Concurrency: JSDOMCrawler can sustain considerably higher concurrency on the same hardware
- Use Case: Use browser crawlers for complex SPAs and sites with anti-bot protection
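Because all Crawlee crawlers share the same high-level API (request queue, autoscaling, pushData), switching between them mostly means changing the class and the handler context. The sketch below is illustrative rather than a drop-in recipe, and the Playwright variant assumes the playwright package is installed:

import { CheerioCrawler, JSDOMCrawler, PlaywrightCrawler } from 'crawlee';

// CheerioCrawler: static HTML only, fastest
const cheerio = new CheerioCrawler({
    async requestHandler({ request, $, pushData }) {
        await pushData({ url: request.url, title: $('title').text() });
    },
});

// JSDOMCrawler: simulated DOM, optional script execution
const jsdom = new JSDOMCrawler({
    runScripts: true,
    async requestHandler({ request, window, pushData }) {
        await pushData({ url: request.url, title: window.document.title });
    },
});

// PlaywrightCrawler: real browser, heaviest but most capable
const playwright = new PlaywrightCrawler({
    async requestHandler({ request, page, pushData }) {
        await pushData({ url: request.url, title: await page.title() });
    },
});

// Each one is started the same way, e.g. await jsdom.run(['https://example.com']);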
Performance Optimization Tips
To get the most out of JSDOMCrawler, consider these optimization strategies:
import { JSDOMCrawler } from 'crawlee';

const crawler = new JSDOMCrawler({
    // Skip script execution entirely when the data is already in the static HTML
    runScripts: false,
    // Increase concurrency for better throughput and let the
    // autoscaled pool adjust it dynamically between these bounds
    autoscaledPoolOptions: {
        minConcurrency: 10,
        maxConcurrency: 200,
        desiredConcurrency: 50,
    },
    async requestHandler({ request, window, pushData }) {
        const { document } = window;

        // Extract only what you need
        const data = {
            url: request.url,
            title: document.title,
            // Use efficient selectors and cap the amount of extracted data
            items: Array.from(document.querySelectorAll('.item'))
                .slice(0, 100)
                .map((el) => el.textContent?.trim()),
        };

        await pushData(data);
    },
});

await crawler.run(['https://example.com']);
Handling Common Challenges
Dealing with Async Content
Some pages load content asynchronously after the initial render:
import { JSDOMCrawler } from 'crawlee';

const crawler = new JSDOMCrawler({
    runScripts: true,
    async requestHandler({ request, window, pushData }) {
        const { document } = window;

        // Poll the DOM until the content appears or the timeout expires
        const waitForContent = (timeoutMs = 5000) => {
            return new Promise((resolve) => {
                const deadline = Date.now() + timeoutMs;
                const checkContent = () => {
                    const content = document.querySelector('.async-content');
                    if (content && content.children.length > 0) {
                        resolve(content);
                    } else if (Date.now() > deadline) {
                        resolve(null); // Give up after the timeout
                    } else {
                        setTimeout(checkContent, 100);
                    }
                };
                checkContent();
            });
        };

        const content = await waitForContent();
        if (content) {
            await pushData({
                url: request.url,
                text: content.textContent,
            });
        }
    },
});

await crawler.run(['https://example.com']);
Working with Forms and Input
JSDOMCrawler can interact with DOM elements programmatically, within the limits of JSDOM (DOM-level manipulation only, with no real rendering or native user input events):
import { JSDOMCrawler } from 'crawlee';

const crawler = new JSDOMCrawler({
    runScripts: true,
    async requestHandler({ request, window, pushData }) {
        const { document } = window;

        // Simulate form interaction by setting the value directly
        const searchInput = document.querySelector('input[name="search"]');
        if (searchInput) {
            searchInput.value = 'test query';
            // Trigger an input event so the page's scripts react to the change
            const event = new window.Event('input', { bubbles: true });
            searchInput.dispatchEvent(event);
        }

        // Wait for results to update
        await new Promise((resolve) => setTimeout(resolve, 500));

        // Extract search results
        const results = Array.from(document.querySelectorAll('.search-result'))
            .map((el) => el.textContent?.trim());

        await pushData({ url: request.url, results });
    },
});

await crawler.run(['https://example.com']);
TypeScript Support
JSDOMCrawler works seamlessly with TypeScript:
import { JSDOMCrawler, type JSDOMCrawlerOptions } from 'crawlee';

interface Product {
    name: string;
    price: number;
    url: string;
}

const crawlerOptions: JSDOMCrawlerOptions = {
    maxConcurrency: 50,
    async requestHandler({ window, pushData }) {
        const { document } = window;

        const products: Product[] = Array.from(
            document.querySelectorAll('.product'),
        ).map((el): Product => ({
            name: el.querySelector('.name')?.textContent || '',
            price: parseFloat(el.querySelector('.price')?.textContent || '0'),
            url: (el.querySelector('a') as HTMLAnchorElement)?.href || '',
        }));

        await pushData({ products });
    },
};

const crawler = new JSDOMCrawler(crawlerOptions);
await crawler.run(['https://example.com/products']);
Best Practices
When working with JSDOMCrawler, follow these best practices:
- Test JavaScript Requirements: Verify if your target site actually needs JavaScript execution or if CheerioCrawler would suffice
- Monitor Resource Usage: Keep an eye on memory consumption, especially with high concurrency
- Set Appropriate Timeouts: Configure timeouts to prevent hanging requests
- Handle Errors Gracefully: Implement proper error handling for failed requests
- Use Request Queues: Leverage Crawlee's built-in queue management for large-scale scraping
- Respect Robots.txt: Always check and respect the site's robots.txt file
- Implement Rate Limiting: Use maxRequestsPerMinute to avoid overwhelming target servers (see the configuration sketch after this list)
- Clean Up Resources: Ensure proper cleanup after crawling completes
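As a closing illustration, here is a minimal configuration sketch applying several of these practices together; the values are placeholders rather than recommendations:

import { JSDOMCrawler, log } from 'crawlee';

const crawler = new JSDOMCrawler({
    maxRequestsPerMinute: 60, // Rate limiting
    maxRequestRetries: 2, // Retry transient failures
    requestHandlerTimeoutSecs: 30, // Prevent hanging requests
    async requestHandler({ request, window, pushData }) {
        const title = window.document.title;
        await pushData({ url: request.url, title });
    },
    // Log failures instead of letting them pass silently
    async failedRequestHandler({ request }, error) {
        log.error(`${request.url} failed: ${error.message}`);
    },
});

await crawler.run(['https://example.com']);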
Conclusion
JSDOMCrawler is a powerful tool in Crawlee's arsenal that bridges the gap between static HTML parsing and full browser automation. It's perfect for sites that require basic JavaScript execution without the overhead of launching actual browsers. By understanding its capabilities and limitations, you can choose the right crawler type for your specific scraping needs and build efficient, scalable web scraping solutions.
For projects requiring complex browser interactions or heavy client-side rendering, consider upgrading to PuppeteerCrawler or PlaywrightCrawler. For purely static HTML sites, CheerioCrawler offers better performance. JSDOMCrawler shines in the middle ground, providing a good balance of performance and JavaScript support for many common web scraping scenarios.