What is JSDOMCrawler in Crawlee and when should I use it?

JSDOMCrawler is one of the crawler classes in the Crawlee framework that provides a middle ground between the lightweight CheerioCrawler and the more resource-intensive browser-based crawlers like PuppeteerCrawler and PlaywrightCrawler. It uses JSDOM, a pure JavaScript implementation of web standards, to parse HTML and execute JavaScript without launching a real browser.

Understanding JSDOMCrawler

JSDOMCrawler combines the speed and efficiency of HTML parsing with the ability to execute JavaScript code on the page. Unlike CheerioCrawler, which can only parse static HTML, JSDOMCrawler can handle pages that require basic JavaScript execution for rendering content. However, it's still significantly faster and uses fewer resources than full browser automation tools.

Key Features

  • JavaScript Execution: Runs JavaScript code in a simulated browser environment without launching an actual browser
  • DOM Manipulation: Supports standard DOM APIs for interacting with page elements
  • Lightweight: Uses less memory and CPU compared to headless browsers
  • Fast: Processes pages more quickly than browser-based solutions
  • Resource Efficient: Can handle higher concurrency levels than browser crawlers
  • Limited Browser Features: Doesn't support all browser features like CSS rendering, canvas, or WebGL

When to Use JSDOMCrawler

JSDOMCrawler is ideal for specific scraping scenarios where you need JavaScript execution but want to avoid the overhead of a full browser.

Perfect Use Cases

  1. Simple JavaScript Rendering: When the target website uses basic JavaScript to populate content after the initial page load
  2. DOM Manipulation: Sites that modify the DOM structure through JavaScript but don't rely on complex browser features
  3. API Calls in JavaScript: Pages that make simple AJAX requests to load data
  4. High-Volume Scraping: When you need to scrape thousands of pages and want better performance than browser automation
  5. Server-Side Rendered React/Vue Apps: Pages whose content is already present in the server-rendered HTML and only need light client-side hydration to finish rendering

When NOT to Use JSDOMCrawler

Avoid JSDOMCrawler in these scenarios:

  • Complex Single Page Applications (SPAs): Modern frameworks like React, Angular, or Vue with heavy client-side routing
  • Advanced Browser Features: Sites requiring WebGL, Canvas, or complex CSS rendering
  • Bot Detection: Websites with sophisticated anti-scraping measures that detect JSDOM
  • Interactive Elements: Pages requiring user interactions like clicks, scrolls, or form submissions
  • WebSocket Connections: Real-time features that depend on WebSocket communication
  • Service Workers: Progressive web apps that rely on service workers

Basic JSDOMCrawler Example

Here's how to set up and use JSDOMCrawler in your project:

import { JSDOMCrawler } from 'crawlee';

const crawler = new JSDOMCrawler({
    // Maximum number of concurrent requests
    maxConcurrency: 50,

    // Request handler for processing each page
    async requestHandler({ request, window, body }) {
        const { document } = window;

        // Extract data using standard DOM APIs
        const title = document.querySelector('h1')?.textContent;
        const description = document.querySelector('meta[name="description"]')?.content;

        // Get all product items
        const products = Array.from(document.querySelectorAll('.product-item')).map(item => ({
            name: item.querySelector('.product-name')?.textContent?.trim(),
            price: item.querySelector('.product-price')?.textContent?.trim(),
            url: item.querySelector('a')?.href
        }));

        console.log(`Scraped ${products.length} products from ${request.url}`);

        // Save data
        await crawler.pushData({
            url: request.url,
            title,
            description,
            products
        });
    },

    // Error handler (note: Crawlee passes the error as a second argument)
    async failedRequestHandler({ request }, error) {
        console.error(`Request ${request.url} failed: ${error.message}`);
    }
});

// Add initial URLs to the queue
await crawler.addRequests([
    'https://example.com/products',
    'https://example.com/categories'
]);

// Start the crawler
await crawler.run();

Advanced Configuration Options

JSDOMCrawler provides several configuration options to optimize your scraping workflow:

import { JSDOMCrawler } from 'crawlee';

const crawler = new JSDOMCrawler({
    // Concurrency settings
    maxConcurrency: 100,
    minConcurrency: 10,

    // Request configuration
    maxRequestRetries: 3,
    maxRequestsPerMinute: 120,

    // JSDOM-specific options
    runScripts: true, // Download and execute the page's scripts (disabled by default)

    // Timeouts
    navigationTimeoutSecs: 30,
    requestHandlerTimeoutSecs: 60,

    async requestHandler({ request, window, body, crawler }) {
        const { document } = window;

        // Wait for JavaScript to execute
        await new Promise(resolve => setTimeout(resolve, 1000));

        // Extract data after JS execution
        const dynamicContent = document.querySelector('.js-rendered-content')?.textContent;

        // Enqueue additional URLs
        const links = Array.from(document.querySelectorAll('a.pagination'))
            .map(a => a.href)
            .filter(href => href);

        await crawler.addRequests(links);

        await crawler.pushData({
            url: request.url,
            dynamicContent
        });
    }
});

await crawler.run(['https://example.com']);

Working with JavaScript-Heavy Pages

When dealing with pages that execute JavaScript to render content, you might need to wait for specific elements or conditions:

import { JSDOMCrawler } from 'crawlee';

const crawler = new JSDOMCrawler({
    async requestHandler({ request, window }) {
        const { document } = window;

        // Simple polling mechanism to wait for content
        const waitForSelector = async (selector, timeout = 5000) => {
            const startTime = Date.now();

            while (Date.now() - startTime < timeout) {
                const element = document.querySelector(selector);
                if (element) return element;
                await new Promise(resolve => setTimeout(resolve, 100));
            }

            throw new Error(`Timeout waiting for selector: ${selector}`);
        };

        try {
            // Wait for dynamic content to load
            await waitForSelector('.dynamic-content');

            const content = document.querySelector('.dynamic-content')?.textContent;

            await crawler.pushData({
                url: request.url,
                content
            });
        } catch (error) {
            console.error(`Failed to find content on ${request.url}`);
        }
    }
});

await crawler.run(['https://example.com']);

Comparison with Other Crawlee Crawlers

Understanding when to use JSDOMCrawler versus other crawler types is crucial for optimal performance:

JSDOMCrawler vs CheerioCrawler

  • Speed: CheerioCrawler is faster (no JavaScript execution)
  • JavaScript: JSDOMCrawler can execute JavaScript, CheerioCrawler cannot
  • Resource Usage: CheerioCrawler uses less memory
  • Use Case: Use CheerioCrawler for static HTML, JSDOMCrawler for basic JavaScript rendering (see the sketch below)
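
To make the difference concrete, here is a minimal CheerioCrawler sketch of the same kind of extraction; the h1 selector and example.com URL are placeholders, not taken from a real site:

import { CheerioCrawler } from 'crawlee';

const cheerioCrawler = new CheerioCrawler({
    async requestHandler({ request, $, pushData }) {
        // Cheerio exposes a jQuery-like API over the static HTML only;
        // content injected later by scripts is simply not there.
        const title = $('h1').first().text().trim();

        await pushData({ url: request.url, title });
    }
});

await cheerioCrawler.run(['https://example.com']);

If a handler like this already finds the data you need, CheerioCrawler is the cheaper choice; reach for JSDOMCrawler only when the content appears in the DOM after scripts run.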

JSDOMCrawler vs PuppeteerCrawler/PlaywrightCrawler

  • Performance: JSDOMCrawler is typically several times faster, since no browser process has to be launched
  • Resource Usage: JSDOMCrawler uses far less memory per page
  • Capabilities: Browser crawlers support full browser features (canvas, WebGL, complex interactions)
  • Concurrency: JSDOMCrawler can sustain considerably higher concurrency on the same hardware
  • Use Case: Use browser crawlers for complex SPAs and sites with anti-bot protection (see the sketch below)
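
By contrast, browser crawlers provide real waiting and interaction primitives out of the box. A minimal PlaywrightCrawler sketch (selector and URL are illustrative; it assumes the playwright package is installed alongside crawlee) might look like this:

import { PlaywrightCrawler } from 'crawlee';

const playwrightCrawler = new PlaywrightCrawler({
    async requestHandler({ request, page, pushData }) {
        // A real browser renders the page, so waiting for dynamic content
        // is a built-in, reliable operation rather than manual polling.
        await page.waitForSelector('.dynamic-content');

        const content = await page.textContent('.dynamic-content');

        await pushData({ url: request.url, content });
    }
});

await playwrightCrawler.run(['https://example.com']);

Compare this with the manual polling helper shown above for JSDOMCrawler: the browser does the waiting for you, at the cost of more memory and slower throughput.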

Performance Optimization Tips

To get the most out of JSDOMCrawler, consider these optimization strategies:

import { JSDOMCrawler } from 'crawlee';

const crawler = new JSDOMCrawler({
    // Increase concurrency for better throughput
    maxConcurrency: 100,

    // Skip script execution entirely when the data is already in the static HTML
    runScripts: false,

    // Use autoscaling for dynamic adjustment
    autoscaledPoolOptions: {
        minConcurrency: 10,
        maxConcurrency: 200,
        desiredConcurrency: 50
    },

    async requestHandler({ request, window, crawler }) {
        const { document } = window;

        // Extract only what you need
        const data = {
            title: document.title,
            // Use efficient selectors
            items: Array.from(document.querySelectorAll('.item')).slice(0, 100)
                .map(el => el.textContent?.trim())
        };

        await crawler.pushData(data);
    }
});

Handling Common Challenges

Dealing with Async Content

Some pages load content asynchronously after the initial render:

import { JSDOMCrawler } from 'crawlee';

const crawler = new JSDOMCrawler({
    async requestHandler({ request, window }) {
        const { document } = window;

        // Poll until the element appears (JSDOM has no built-in waitForSelector)
        const waitForContent = () => {
            return new Promise((resolve) => {
                let settled = false;

                const checkContent = () => {
                    if (settled) return;
                    const content = document.querySelector('.async-content');
                    if (content && content.children.length > 0) {
                        settled = true;
                        resolve(content);
                    } else {
                        setTimeout(checkContent, 100);
                    }
                };
                checkContent();

                // Give up after 5 seconds and stop polling
                setTimeout(() => {
                    if (!settled) {
                        settled = true;
                        resolve(null);
                    }
                }, 5000);
            });
        };

        const content = await waitForContent();

        if (content) {
            await crawler.pushData({
                url: request.url,
                text: content.textContent
            });
        }
    }
});

Working with Forms and Input

JSDOMCrawler can interact with DOM elements programmatically:

import { JSDOMCrawler } from 'crawlee';

const crawler = new JSDOMCrawler({
    async requestHandler({ request, window }) {
        const { document } = window;

        // Simulate form interaction
        const searchInput = document.querySelector('input[name="search"]');
        if (searchInput) {
            searchInput.value = 'test query';

            // Trigger input event
            const event = new window.Event('input', { bubbles: true });
            searchInput.dispatchEvent(event);
        }

        // Wait for results to update
        await new Promise(resolve => setTimeout(resolve, 500));

        // Extract search results
        const results = Array.from(document.querySelectorAll('.search-result'))
            .map(el => el.textContent?.trim());

        await crawler.pushData({ results });
    }
});

TypeScript Support

JSDOMCrawler works seamlessly with TypeScript:

import { JSDOMCrawler, JSDOMCrawlerOptions } from 'crawlee';

interface Product {
    name: string;
    price: number;
    url: string;
}

const crawlerOptions: JSDOMCrawlerOptions = {
    maxConcurrency: 50,

    async requestHandler({ request, window, crawler }) {
        const { document } = window;

        const products: Product[] = Array.from(
            document.querySelectorAll('.product')
        ).map((el): Product => ({
            name: el.querySelector('.name')?.textContent || '',
            price: parseFloat(el.querySelector('.price')?.textContent || '0'),
            url: (el.querySelector('a') as HTMLAnchorElement)?.href || ''
        }));

        await crawler.pushData({ products });
    }
};

const crawler = new JSDOMCrawler(crawlerOptions);
await crawler.run(['https://example.com/products']);

Best Practices

When working with JSDOMCrawler, follow these best practices:

  1. Test JavaScript Requirements: Verify if your target site actually needs JavaScript execution or if CheerioCrawler would suffice
  2. Monitor Resource Usage: Keep an eye on memory consumption, especially with high concurrency
  3. Set Appropriate Timeouts: Configure timeouts to prevent hanging requests
  4. Handle Errors Gracefully: Implement proper error handling for failed requests
  5. Use Request Queues: Leverage Crawlee's built-in queue management for large-scale scraping
  6. Respect Robots.txt: Always check and respect the site's robots.txt file
  7. Implement Rate Limiting: Use maxRequestsPerMinute to avoid overwhelming target servers (see the configuration sketch after this list)
  8. Clean Up Resources: Ensure proper cleanup after crawling completes
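
As a starting point, here is a small configuration sketch that combines several of these practices (retries, rate limiting, timeouts, and an error handler); the numbers are illustrative and should be tuned for your target:

import { JSDOMCrawler, log } from 'crawlee';

const crawler = new JSDOMCrawler({
    maxRequestRetries: 2,          // retry transient failures a couple of times
    maxRequestsPerMinute: 60,      // rate limiting so the target isn't overwhelmed
    navigationTimeoutSecs: 30,     // don't wait forever for slow responses
    requestHandlerTimeoutSecs: 60, // abort handlers that hang

    async requestHandler({ request, window, pushData }) {
        await pushData({ url: request.url, title: window.document.title });
    },

    // Called once a request has exhausted all retries
    async failedRequestHandler({ request }, error) {
        log.warning(`Giving up on ${request.url}: ${error.message}`);
    }
});

await crawler.run(['https://example.com']);

Crawlee's request queue handles deduplication and persistence automatically when you pass URLs to run() or addRequests(), so large crawls can resume after interruptions.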

Conclusion

JSDOMCrawler is a powerful tool in Crawlee's arsenal that bridges the gap between static HTML parsing and full browser automation. It's perfect for sites that require basic JavaScript execution without the overhead of launching actual browsers. By understanding its capabilities and limitations, you can choose the right crawler type for your specific scraping needs and build efficient, scalable web scraping solutions.

For projects requiring complex browser interactions, heavy client-side rendering, or sites with strong bot protection, consider PuppeteerCrawler or PlaywrightCrawler. For purely static HTML sites, CheerioCrawler offers better performance. JSDOMCrawler shines in the middle ground, providing a good balance of performance and JavaScript support for many common web scraping scenarios.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
