How do I Filter and Prioritize Requests in Crawlee?

Crawlee provides powerful mechanisms for filtering and prioritizing requests during web scraping operations. This is essential for efficient crawling, especially when dealing with large websites where you need to control which URLs are processed and in what order.

Understanding Request Management in Crawlee

Before diving into filtering and prioritizing, it's important to understand how Crawlee manages requests. Crawlee uses a RequestQueue or RequestList to store and manage URLs that need to be crawled. These systems allow you to control the flow of requests, apply filters, and set priorities to optimize your crawling strategy.
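
As a quick orientation, here is a minimal sketch of working directly with a RequestQueue (the URLs are placeholders):

import { PlaywrightCrawler, RequestQueue } from 'crawlee';

// Open the default request queue (persisted under ./storage by default)
const requestQueue = await RequestQueue.open();

// Seed it with a couple of start URLs
await requestQueue.addRequests([
    { url: 'https://example.com' },
    { url: 'https://example.com/about' },
]);

// Any crawler can consume the queue
const crawler = new PlaywrightCrawler({
    requestQueue,
    requestHandler: async ({ request, log }) => {
        log.info(`Processing: ${request.url}`);
    },
});

await crawler.run();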

Filtering Requests in Crawlee

Basic URL Filtering with Globs

Crawlee provides built-in support for filtering URLs using glob patterns. This is particularly useful when you want to crawl only specific sections of a website.

import { PlaywrightCrawler, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ request, page, enqueueLinks }) => {
        console.log(`Processing: ${request.url}`);

        // Extract data here
        const title = await page.title();
        await Dataset.pushData({ url: request.url, title });

        // Enqueue only matching links
        await enqueueLinks({
            globs: [
                'https://example.com/products/**',
                'https://example.com/categories/**',
            ],
        });
    },
});

await crawler.run(['https://example.com']);

Using Exclude Patterns

You can also exclude specific URL patterns to avoid crawling unnecessary pages:

await enqueueLinks({
    globs: ['https://example.com/**'],
    exclude: [
        '**/*.pdf',
        '**/*.jpg',
        '**/*.png',
        '**/admin/**',
        '**/login',
    ],
});

Custom URL Filtering Logic

For more complex filtering requirements, you can implement custom transformation functions:

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ request, page, enqueueLinks }) => {
        await enqueueLinks({
            // Custom filter function
            transformRequestFunction: (req) => {
                // Filter out URLs with query parameters
                const url = new URL(req.url);
                if (url.search) {
                    return false;
                }

                // Only include URLs with specific characteristics
                if (!url.pathname.includes('/product/')) {
                    return false;
                }

                return req;
            },
        });
    },
});

Python Example for URL Filtering

If you're using Crawlee for Python, the filtering approach is similar:

import re

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext

crawler = PlaywrightCrawler()

@crawler.router.default_handler
async def request_handler(context: PlaywrightCrawlingContext) -> None:
    # Enqueue links with filtering; include/exclude accept compiled regexes or Glob objects
    await context.enqueue_links(
        include=[re.compile(r'https://example\.com/products/.*')],
        exclude=[re.compile(r'.*\.(jpg|png|pdf)$'), re.compile(r'.*/admin/.*')],
    )

Prioritizing Requests in Crawlee

Request prioritization ensures that important URLs are crawled first. This is crucial when dealing with large websites or when you have time constraints.

Setting Request Priority

Crawlee's request queue does not use numeric priority values. Instead, ordering is controlled with the forefront option: requests added with forefront: true are placed at the front of the queue and processed before requests waiting behind them:

import { PlaywrightCrawler, RequestQueue } from 'crawlee';

const requestQueue = await RequestQueue.open();

// High-priority request: jumps to the front of the queue
await requestQueue.addRequest(
    { url: 'https://example.com/important-page' },
    { forefront: true },
);

// Normal-priority requests: appended to the back of the queue
await requestQueue.addRequest({ url: 'https://example.com/regular-page' });
await requestQueue.addRequest({ url: 'https://example.com/less-important' });

const crawler = new PlaywrightCrawler({
    requestQueue,
    requestHandler: async ({ request, page }) => {
        console.log(`Processing: ${request.url}`);
        // Your scraping logic here
    },
});

await crawler.run();

Dynamic Priority Assignment

Because forefront applies to a whole enqueue operation rather than to a single request, a common pattern is to enqueue important groups of links separately. Recent Crawlee versions also accept forefront directly on enqueueLinks:

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ request, page, enqueueLinks }) => {
        // Product pages jump to the front of the queue
        await enqueueLinks({
            globs: ['https://example.com/products/**'],
            forefront: true,
        });

        // Category pages and everything else go to the back
        await enqueueLinks({
            globs: ['https://example.com/categories/**'],
        });
    },
});

Depth-Based Prioritization

You can also prioritize requests by their depth in the crawl tree, so that pages close to the starting URL are processed before deeply nested ones. Track the depth in userData and push shallow pages to the front of the queue:

const crawler = new PlaywrightCrawler({
    maxRequestsPerCrawl: 100,
    requestHandler: async ({ request, page, enqueueLinks }) => {
        const currentDepth = request.userData.depth ?? 0;

        await enqueueLinks({
            // Pages close to the root jump the queue
            forefront: currentDepth < 2,
            transformRequestFunction: (req) => {
                // Record the depth of the newly enqueued request
                req.userData = { depth: currentDepth + 1 };
                return req;
            },
        });
    },
});

Combining Filtering and Prioritization

The most effective crawling strategies combine both filtering and prioritization:

import { PlaywrightCrawler, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
    maxRequestsPerCrawl: 1000,
    requestHandler: async ({ request, page, enqueueLinks, log }) => {
        log.info(`Crawling: ${request.url} (Depth: ${request.userData.depth ?? 0})`);

        // Extract data
        const data = await page.evaluate(() => ({
            title: document.title,
            description: document.querySelector('meta[name="description"]')?.content,
        }));

        await Dataset.pushData({ ...data, url: request.url });

        // Shared transform: limit crawl depth and record metadata
        const depth = (request.userData.depth ?? 0) + 1;
        const transformRequestFunction = (req) => {
            // Don't crawl too deep
            if (depth > 5) {
                return false;
            }
            req.userData = { depth };
            return req;
        };

        // High-priority section: product pages go to the front of the queue
        await enqueueLinks({
            globs: ['https://example.com/products/**'],
            exclude: ['**/*.pdf', '**/admin/**', '**/cart/**'],
            label: 'PRODUCT',
            forefront: true,
            transformRequestFunction,
        });

        // Lower-priority section: blog pages are appended to the back
        await enqueueLinks({
            globs: ['https://example.com/blog/**'],
            exclude: ['**/*.pdf', '**/admin/**', '**/cart/**'],
            label: 'BLOG',
            transformRequestFunction,
        });
    },
});

await crawler.run(['https://example.com']);

Python Example: Complete Filtering and Prioritization

import asyncio
import re

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext

crawler = PlaywrightCrawler(max_requests_per_crawl=1000)

@crawler.router.default_handler
async def request_handler(context: PlaywrightCrawlingContext) -> None:
    url = context.request.url
    depth = context.request.user_data.get('depth', 0)

    context.log.info(f'Crawling: {url} (Depth: {depth})')

    # Extract data and push it to the default dataset
    title = await context.page.title()
    await context.push_data({'url': url, 'title': title})

    # Enqueue with filtering; pass crawl metadata along in user_data
    await context.enqueue_links(
        include=[
            re.compile(r'https://example\.com/products/.*'),
            re.compile(r'https://example\.com/blog/.*'),
        ],
        exclude=[
            re.compile(r'.*\.(pdf|jpg|png)$'),
            re.compile(r'.*/admin/.*'),
        ],
        user_data={'depth': depth + 1},
    )

if __name__ == '__main__':
    asyncio.run(crawler.run(['https://example.com']))
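
If you also need queue-level prioritization on the Python side, the RequestQueue storage class exposes a forefront flag on add_request, analogous to the JavaScript option shown earlier, though the exact API surface may vary between Crawlee for Python releases.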

Advanced Techniques

Using Request Labels

Labels help categorize requests and apply different handling logic:

await enqueueLinks({
    transformRequestFunction: (req) => {
        const url = new URL(req.url);

        if (url.pathname.includes('/products/')) {
            req.label = 'PRODUCT';
        } else if (url.pathname.includes('/categories/')) {
            req.label = 'CATEGORY';
        }

        return req;
    },
});

// In requestHandler
if (request.label === 'PRODUCT') {
    // Handle product pages differently
} else if (request.label === 'CATEGORY') {
    // Handle category pages differently
}
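
Rather than branching inside a single handler, Crawlee's router lets you register one handler per label. A minimal sketch (the handler bodies and globs are placeholders):

import { PlaywrightCrawler, createPlaywrightRouter } from 'crawlee';

const router = createPlaywrightRouter();

// Runs for requests labelled 'PRODUCT'
router.addHandler('PRODUCT', async ({ request, log }) => {
    log.info(`Product page: ${request.url}`);
});

// Runs for requests labelled 'CATEGORY'
router.addHandler('CATEGORY', async ({ enqueueLinks }) => {
    await enqueueLinks({ label: 'PRODUCT', globs: ['https://example.com/products/**'] });
});

// Runs for everything without a matching label
router.addDefaultHandler(async ({ enqueueLinks }) => {
    await enqueueLinks({ label: 'CATEGORY', globs: ['https://example.com/categories/**'] });
});

const crawler = new PlaywrightCrawler({ requestHandler: router });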

Deduplication

Crawlee automatically deduplicates requests by their uniqueKey, which defaults to a normalized version of the URL, but you can customize this behavior:

import { RequestQueue } from 'crawlee';

const requestQueue = await RequestQueue.open();

// Add request with custom uniqueKey for deduplication
await requestQueue.addRequest({
    url: 'https://example.com/product?id=123&ref=home',
    uniqueKey: 'product-123', // Custom deduplication key
});

Best Practices

  1. Start with broad filters: Begin with inclusive patterns and refine based on crawl results
  2. Use forefront sparingly: Reserve the front of the queue for genuinely important requests
  3. Monitor queue size: Track how many requests are pending to avoid memory issues (see the sketch after this list)
  4. Test filters first: Run small test crawls to validate your filtering logic
  5. Consider crawl depth: Limit depth to prevent infinite crawling loops
  6. Balance priorities: If every request jumps the queue, nothing is actually prioritized
When working with browser automation for more complex scraping scenarios, understanding how to handle browser sessions in Puppeteer can be helpful since Crawlee builds on similar concepts. Additionally, if you need to monitor network requests in Puppeteer, these techniques complement Crawlee's request management features.

Conclusion

Filtering and prioritizing requests in Crawlee is essential for building efficient, targeted web scrapers. By combining glob patterns, custom transformation functions, and queue-ordering options like forefront, you can create sophisticated crawling strategies that focus on the most important content while avoiding unnecessary pages. Whether you're using JavaScript or Python, Crawlee provides flexible APIs to implement these patterns effectively.

Remember to always respect websites' robots.txt files and crawling policies, and implement appropriate rate limiting to avoid overwhelming target servers.
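
Crawlee exposes concurrency and throttling options on every crawler for exactly this purpose; a minimal sketch (the numbers are illustrative, not recommendations):

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    maxConcurrency: 5, // at most 5 pages processed in parallel
    maxRequestsPerMinute: 120, // global throttle for the whole crawl
    requestHandler: async ({ request, log }) => {
        log.info(`Processing: ${request.url}`);
    },
});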

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
