How Do I Filter and Prioritize Requests in Crawlee?
Crawlee provides powerful mechanisms for filtering and prioritizing requests during web scraping operations. This is essential for efficient crawling, especially when dealing with large websites where you need to control which URLs are processed and in what order.
Understanding Request Management in Crawlee
Before diving into filtering and prioritizing, it's important to understand how Crawlee manages requests. Crawlee stores the URLs to be crawled in a RequestQueue (a dynamic queue you can keep adding to while the crawl runs) or a RequestList (a static list defined up front). These storages let you control the flow of requests, apply filters, and influence the order in which pages are processed.
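As a minimal sketch of the queue-based workflow (example.com is a placeholder), you open a queue and seed it with a start URL; Crawlee deduplicates requests by their uniqueKey, which is derived from the URL by default:
import { RequestQueue } from 'crawlee';

// Open (or create) the default request queue for this run
const requestQueue = await RequestQueue.open();

// Seed it with a start URL; duplicate URLs are ignored automatically
await requestQueue.addRequest({ url: 'https://example.com' });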
Filtering Requests in Crawlee
Basic URL Filtering with Globs
Crawlee provides built-in support for filtering URLs using glob patterns. This is particularly useful when you want to crawl only specific sections of a website.
import { PlaywrightCrawler, Dataset } from 'crawlee';
const crawler = new PlaywrightCrawler({
// Only process URLs matching these patterns
requestHandler: async ({ request, page, enqueueLinks }) => {
console.log(`Processing: ${request.url}`);
// Extract data here
const title = await page.title();
await Dataset.pushData({ url: request.url, title });
// Enqueue only matching links
await enqueueLinks({
globs: [
'https://example.com/products/**',
'https://example.com/categories/**',
],
});
},
});
await crawler.run(['https://example.com']);
Using Exclude Patterns
You can also exclude specific URL patterns to avoid crawling unnecessary pages:
await enqueueLinks({
globs: ['https://example.com/**'],
exclude: [
'**/*.pdf',
'**/*.jpg',
'**/*.png',
'**/admin/**',
'**/login',
],
});
Custom URL Filtering Logic
For more complex filtering requirements, you can implement custom transformation functions:
import { PlaywrightCrawler } from 'crawlee';
const crawler = new PlaywrightCrawler({
requestHandler: async ({ request, page, enqueueLinks }) => {
await enqueueLinks({
// Custom filter function
transformRequestFunction: (req) => {
// Filter out URLs with query parameters
const url = new URL(req.url);
if (url.search) {
return false;
}
// Only include URLs with specific characteristics
if (!url.pathname.includes('/product/')) {
return false;
}
return req;
},
});
},
});
Python Example for URL Filtering
If you're using Crawlee for Python, the filtering approach is similar:
import re

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext

crawler = PlaywrightCrawler()

@crawler.router.default_handler
async def request_handler(context: PlaywrightCrawlingContext) -> None:
    # Enqueue links with filtering; include/exclude take compiled regular
    # expressions (or crawlee's Glob objects) rather than plain strings
    await context.enqueue_links(
        include=[re.compile(r'https://example\.com/products/.*')],
        exclude=[re.compile(r'.*\.(jpg|png|pdf)$'), re.compile(r'.*/admin/.*')],
    )
Prioritizing Requests in Crawlee
Request prioritization ensures that important URLs are crawled first. This is crucial when dealing with large websites or when you have time constraints.
Setting Request Priority with forefront
Crawlee does not use numeric priority values. Instead, when you add a request to the queue you can pass the forefront option: requests added with forefront: true go to the front of the queue and are processed before everything that is already waiting:
import { PlaywrightCrawler, RequestQueue } from 'crawlee';

const requestQueue = await RequestQueue.open();

// High-priority request: forefront places it at the front of the queue
await requestQueue.addRequest(
    { url: 'https://example.com/important-page' },
    { forefront: true },
);

// Normal-priority requests are appended to the back of the queue
await requestQueue.addRequest({ url: 'https://example.com/regular-page' });
await requestQueue.addRequest({ url: 'https://example.com/less-important' });

const crawler = new PlaywrightCrawler({
    requestQueue,
    requestHandler: async ({ request, page }) => {
        console.log(`Processing: ${request.url}`);
        // Your scraping logic here
    },
});

await crawler.run();
Dynamic Prioritization
You can also decide dynamically which links deserve the front of the queue, based on URL characteristics or other criteria. One straightforward approach is to enqueue the important links in a separate call with forefront: true:
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ request, page, enqueueLinks }) => {
        // Product pages are the most valuable, so they jump to the front of the queue
        await enqueueLinks({
            globs: ['https://example.com/products/**'],
            forefront: true,
        });

        // Category pages and everything else go to the back of the queue
        await enqueueLinks({
            globs: ['https://example.com/**'],
            exclude: ['https://example.com/products/**'],
        });
    },
});
Depth-Based Prioritization
You can also prioritize requests by their depth in the crawl tree: links discovered close to the starting URL are pushed to the front of the queue, while deeper pages wait at the back:
const crawler = new PlaywrightCrawler({
    maxRequestsPerCrawl: 100,
    requestHandler: async ({ request, page, enqueueLinks }) => {
        const currentDepth = request.userData.depth ?? 0;

        await enqueueLinks({
            // Links found near the start URL jump to the front of the queue
            forefront: currentDepth < 2,
            transformRequestFunction: (req) => {
                // Record the depth of the newly discovered request
                req.userData = { depth: currentDepth + 1 };
                return req;
            },
        });
    },
});
Combining Filtering and Prioritization
The most effective crawling strategies combine both filtering and prioritization:
import { PlaywrightCrawler, Dataset } from 'crawlee';
const crawler = new PlaywrightCrawler({
    maxRequestsPerCrawl: 1000,
    requestHandler: async ({ request, page, enqueueLinks, log }) => {
        const depth = request.userData.depth ?? 0;
        log.info(`Crawling: ${request.url} (depth: ${depth})`);

        // Extract data
        const data = await page.evaluate(() => ({
            title: document.title,
            description: document.querySelector('meta[name="description"]')?.content,
        }));
        await Dataset.pushData({ ...data, url: request.url });

        // Don't crawl too deep
        if (depth >= 5) return;

        const exclude = ['**/*.pdf', '**/admin/**', '**/cart/**'];

        // Product pages are the priority, so they jump to the front of the queue
        await enqueueLinks({
            globs: ['https://example.com/products/**'],
            exclude,
            forefront: true,
            userData: { depth: depth + 1 },
        });

        // Blog pages are still wanted, but they wait at the back of the queue
        await enqueueLinks({
            globs: ['https://example.com/blog/**'],
            exclude,
            userData: { depth: depth + 1 },
        });
    },
});

await crawler.run(['https://example.com']);
Python Example: Combined Filtering with Depth Limits
The Python version of Crawlee supports the same combination of link filtering and depth tracking:
import asyncio
import re

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext

crawler = PlaywrightCrawler(max_requests_per_crawl=1000)

@crawler.router.default_handler
async def request_handler(context: PlaywrightCrawlingContext) -> None:
    url = context.request.url
    depth = context.request.user_data.get('depth', 0)
    context.log.info(f'Crawling: {url} (depth: {depth})')

    # Extract data
    title = await context.page.title()
    await context.push_data({'url': url, 'title': title})

    # Don't crawl too deep
    if depth >= 5:
        return

    # Enqueue links with filtering; depth is propagated via user_data
    await context.enqueue_links(
        include=[
            re.compile(r'https://example\.com/products/.*'),
            re.compile(r'https://example\.com/blog/.*'),
        ],
        exclude=[
            re.compile(r'.*\.(pdf|jpg|png)$'),
            re.compile(r'.*/admin/.*'),
        ],
        user_data={'depth': depth + 1},
    )

asyncio.run(crawler.run(['https://example.com']))
Advanced Techniques
Using Request Labels
Labels help categorize requests and apply different handling logic:
await enqueueLinks({
    transformRequestFunction: (req) => {
        const url = new URL(req.url);
        if (url.pathname.includes('/products/')) {
            req.label = 'PRODUCT';
        } else if (url.pathname.includes('/categories/')) {
            req.label = 'CATEGORY';
        }
        return req;
    },
});
// In requestHandler
if (request.label === 'PRODUCT') {
// Handle product pages differently
} else if (request.label === 'CATEGORY') {
// Handle category pages differently
}
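Instead of branching inside a single request handler, you can register one handler per label with Crawlee's router. A minimal sketch (the URLs are placeholders):
import { PlaywrightCrawler, createPlaywrightRouter } from 'crawlee';

const router = createPlaywrightRouter();

// Runs for requests labeled 'PRODUCT'
router.addHandler('PRODUCT', async ({ request, log }) => {
    log.info(`Product page: ${request.url}`);
});

// Runs for requests labeled 'CATEGORY'
router.addHandler('CATEGORY', async ({ enqueueLinks }) => {
    await enqueueLinks({ globs: ['https://example.com/products/**'], label: 'PRODUCT' });
});

// Runs for anything without a matching label, e.g. the start URL
router.addDefaultHandler(async ({ enqueueLinks }) => {
    await enqueueLinks({ globs: ['https://example.com/categories/**'], label: 'CATEGORY' });
});

const crawler = new PlaywrightCrawler({ requestHandler: router });
await crawler.run(['https://example.com']);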
Deduplication
Crawlee automatically deduplicates requests, but you can customize this behavior:
import { RequestQueue } from 'crawlee';
const requestQueue = await RequestQueue.open();
// Add request with custom uniqueKey for deduplication
await requestQueue.addRequest({
    url: 'https://example.com/product?id=123&ref=home',
    uniqueKey: 'product-123', // Custom deduplication key
});
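The same idea applies to discovered links: a short sketch that strips the query string before computing the uniqueKey, so URLs differing only in tracking parameters are treated as duplicates:
await enqueueLinks({
    transformRequestFunction: (req) => {
        const url = new URL(req.url);
        // Ignore query parameters when deciding whether two URLs are the same page
        url.search = '';
        req.uniqueKey = url.href;
        return req;
    },
});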
Best Practices
- Start with broad filters: Begin with inclusive patterns and refine based on crawl results
- Prioritize deliberately: Only send genuinely important requests to the front of the queue with forefront: true
- Monitor queue size: Track how many requests are pending to avoid memory issues (see the sketch after this list)
- Test filters first: Run small test crawls to validate your filtering logic
- Consider crawl depth: Limit depth to prevent infinite crawling loops
- Balance priority: Don't push every request to the front of the queue, as that defeats the purpose
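For queue monitoring, one option is RequestQueue.getInfo(); the sketch below assumes the counters exposed by recent Crawlee versions (pendingRequestCount, handledRequestCount):
import { RequestQueue } from 'crawlee';

const requestQueue = await RequestQueue.open();

// Check how much work is still waiting versus already done
const info = await requestQueue.getInfo();
console.log(`Pending: ${info?.pendingRequestCount}, handled: ${info?.handledRequestCount}`);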
When working with browser automation for more complex scraping scenarios, understanding how to handle browser sessions in Puppeteer can be helpful since Crawlee builds on similar concepts. Additionally, if you need to monitor network requests in Puppeteer, these techniques complement Crawlee's request management features.
Conclusion
Filtering and prioritizing requests in Crawlee is essential for building efficient, targeted web scrapers. By combining glob patterns, custom transformation functions, and priority settings, you can create sophisticated crawling strategies that focus on the most important content while avoiding unnecessary pages. Whether you're using JavaScript or Python, Crawlee provides flexible APIs to implement these patterns effectively.
Remember to always respect websites' robots.txt files and crawling policies, and implement appropriate rate limiting to avoid overwhelming target servers.
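For example, a simple way to rate-limit a Crawlee crawler is to cap its concurrency and request rate; the numbers below are placeholders to tune per target site:
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    maxConcurrency: 5,          // At most 5 pages processed in parallel
    maxRequestsPerMinute: 60,   // Roughly one request per second overall
    requestHandler: async ({ request, log }) => {
        log.info(`Processing: ${request.url}`);
    },
});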