How Do I Filter and Prioritize Requests in Crawlee?
Crawlee provides powerful mechanisms for filtering and prioritizing requests during web scraping operations. This is essential for efficient crawling, especially when dealing with large websites where you need to control which URLs are processed and in what order.
Understanding Request Management in Crawlee
Before diving into filtering and prioritizing, it's important to understand how Crawlee manages requests. Crawlee stores the URLs to be crawled in a RequestQueue (a dynamic queue you can keep adding to while the crawl runs) or a RequestList (a static list defined up front). These storages let you control the flow of requests, apply filters, and influence the order in which pages are processed.
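As a minimal sketch of the queue-based workflow (example.com is a placeholder), you open a queue and seed it with a start URL; Crawlee deduplicates requests by their uniqueKey, which is derived from the URL by default:
import { RequestQueue } from 'crawlee';

// Open (or create) the default request queue for this run
const requestQueue = await RequestQueue.open();

// Seed it with a start URL; duplicate URLs are ignored automatically
await requestQueue.addRequest({ url: 'https://example.com' });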
Filtering Requests in Crawlee
Basic URL Filtering with Globs
Crawlee provides built-in support for filtering URLs using glob patterns. This is particularly useful when you want to crawl only specific sections of a website.
import { PlaywrightCrawler, Dataset } from 'crawlee';
const crawler = new PlaywrightCrawler({
// Only process URLs matching these patterns
requestHandler: async ({ request, page, enqueueLinks }) => {
console.log(`Processing: ${request.url}`);
// Extract data here
const title = await page.title();
await Dataset.pushData({ url: request.url, title });
// Enqueue only matching links
await enqueueLinks({
globs: [
'https://example.com/products/**',
'https://example.com/categories/**',
],
});
},
});
await crawler.run(['https://example.com']);
Using Exclude Patterns
You can also exclude specific URL patterns to avoid crawling unnecessary pages:
await enqueueLinks({
globs: ['https://example.com/**'],
exclude: [
'**/*.pdf',
'**/*.jpg',
'**/*.png',
'**/admin/**',
'**/login',
],
});
Custom URL Filtering Logic
For more complex filtering requirements, you can implement custom transformation functions:
import { PlaywrightCrawler } from 'crawlee';
const crawler = new PlaywrightCrawler({
requestHandler: async ({ request, page, enqueueLinks }) => {
await enqueueLinks({
// Custom filter function
transformRequestFunction: (req) => {
// Filter out URLs with query parameters
const url = new URL(req.url);
if (url.search) {
return false;
}
// Only include URLs with specific characteristics
if (!url.pathname.includes('/product/')) {
return false;
}
return req;
},
});
},
});
Python Example for URL Filtering
If you're using Crawlee for Python, the filtering approach is similar:
import re

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext

crawler = PlaywrightCrawler()

@crawler.router.default_handler
async def request_handler(context: PlaywrightCrawlingContext) -> None:
    # Enqueue links with filtering; include/exclude take compiled regular
    # expressions (or crawlee's Glob objects) rather than plain strings
    await context.enqueue_links(
        include=[re.compile(r'https://example\.com/products/.*')],
        exclude=[re.compile(r'.*\.(jpg|png|pdf)$'), re.compile(r'.*/admin/.*')],
    )
Prioritizing Requests in Crawlee
Request prioritization ensures that important URLs are crawled first. This is crucial when dealing with large websites or when you have time constraints.
Setting Request Priority with forefront
Crawlee does not use numeric priority values. Instead, when you add a request to the queue you can pass the forefront option: requests added with forefront: true go to the front of the queue and are processed before everything that is already waiting:
import { PlaywrightCrawler, RequestQueue } from 'crawlee';

const requestQueue = await RequestQueue.open();

// High-priority request: forefront places it at the front of the queue
await requestQueue.addRequest(
    { url: 'https://example.com/important-page' },
    { forefront: true },
);

// Normal-priority requests are appended to the back of the queue
await requestQueue.addRequest({ url: 'https://example.com/regular-page' });
await requestQueue.addRequest({ url: 'https://example.com/less-important' });

const crawler = new PlaywrightCrawler({
    requestQueue,
    requestHandler: async ({ request, page }) => {
        console.log(`Processing: ${request.url}`);
        // Your scraping logic here
    },
});

await crawler.run();
Dynamic Prioritization
You can also decide dynamically which links deserve the front of the queue, based on URL characteristics or other criteria. One straightforward approach is to enqueue the important links in a separate call with forefront: true:
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ request, page, enqueueLinks }) => {
        // Product pages are the most valuable, so they jump to the front of the queue
        await enqueueLinks({
            globs: ['https://example.com/products/**'],
            forefront: true,
        });

        // Category pages and everything else go to the back of the queue
        await enqueueLinks({
            globs: ['https://example.com/**'],
            exclude: ['https://example.com/products/**'],
        });
    },
});
Depth-Based Prioritization
You can also prioritize requests by their depth in the crawl tree: links discovered close to the starting URL are pushed to the front of the queue, while deeper pages wait at the back:
const crawler = new PlaywrightCrawler({
    maxRequestsPerCrawl: 100,
    requestHandler: async ({ request, page, enqueueLinks }) => {
        const currentDepth = request.userData.depth ?? 0;

        await enqueueLinks({
            // Links found near the start URL jump to the front of the queue
            forefront: currentDepth < 2,
            transformRequestFunction: (req) => {
                // Record the depth of the newly discovered request
                req.userData = { depth: currentDepth + 1 };
                return req;
            },
        });
    },
});
Combining Filtering and Prioritization
The most effective crawling strategies combine both filtering and prioritization:
import { PlaywrightCrawler, Dataset } from 'crawlee';
const crawler = new PlaywrightCrawler({
    maxRequestsPerCrawl: 1000,
    requestHandler: async ({ request, page, enqueueLinks, log }) => {
        const depth = request.userData.depth ?? 0;
        log.info(`Crawling: ${request.url} (depth: ${depth})`);

        // Extract data
        const data = await page.evaluate(() => ({
            title: document.title,
            description: document.querySelector('meta[name="description"]')?.content,
        }));
        await Dataset.pushData({ ...data, url: request.url });

        // Don't crawl too deep
        if (depth >= 5) return;

        const exclude = ['**/*.pdf', '**/admin/**', '**/cart/**'];

        // Product pages are the priority, so they jump to the front of the queue
        await enqueueLinks({
            globs: ['https://example.com/products/**'],
            exclude,
            forefront: true,
            userData: { depth: depth + 1 },
        });

        // Blog pages are still wanted, but they wait at the back of the queue
        await enqueueLinks({
            globs: ['https://example.com/blog/**'],
            exclude,
            userData: { depth: depth + 1 },
        });
    },
});

await crawler.run(['https://example.com']);
Python Example: Combined Filtering with Depth Limits
The Python version of Crawlee supports the same combination of link filtering and depth tracking:
import asyncio
import re

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext

crawler = PlaywrightCrawler(max_requests_per_crawl=1000)

@crawler.router.default_handler
async def request_handler(context: PlaywrightCrawlingContext) -> None:
    url = context.request.url
    depth = context.request.user_data.get('depth', 0)
    context.log.info(f'Crawling: {url} (depth: {depth})')

    # Extract data
    title = await context.page.title()
    await context.push_data({'url': url, 'title': title})

    # Don't crawl too deep
    if depth >= 5:
        return

    # Enqueue links with filtering; depth is propagated via user_data
    await context.enqueue_links(
        include=[
            re.compile(r'https://example\.com/products/.*'),
            re.compile(r'https://example\.com/blog/.*'),
        ],
        exclude=[
            re.compile(r'.*\.(pdf|jpg|png)$'),
            re.compile(r'.*/admin/.*'),
        ],
        user_data={'depth': depth + 1},
    )

asyncio.run(crawler.run(['https://example.com']))
Advanced Techniques
Using Request Labels
Labels help categorize requests and apply different handling logic:
await enqueueLinks({
    transformRequestFunction: (req) => {
        const url = new URL(req.url);
        if (url.pathname.includes('/products/')) {
            req.label = 'PRODUCT';
        } else if (url.pathname.includes('/categories/')) {
            req.label = 'CATEGORY';
        }
        return req;
    },
});
// In requestHandler
if (request.label === 'PRODUCT') {
// Handle product pages differently
} else if (request.label === 'CATEGORY') {
// Handle category pages differently
}
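Instead of branching inside a single request handler, you can register one handler per label with Crawlee's router. A minimal sketch (the URLs are placeholders):
import { PlaywrightCrawler, createPlaywrightRouter } from 'crawlee';

const router = createPlaywrightRouter();

// Runs for requests labeled 'PRODUCT'
router.addHandler('PRODUCT', async ({ request, log }) => {
    log.info(`Product page: ${request.url}`);
});

// Runs for requests labeled 'CATEGORY'
router.addHandler('CATEGORY', async ({ enqueueLinks }) => {
    await enqueueLinks({ globs: ['https://example.com/products/**'], label: 'PRODUCT' });
});

// Runs for anything without a matching label, e.g. the start URL
router.addDefaultHandler(async ({ enqueueLinks }) => {
    await enqueueLinks({ globs: ['https://example.com/categories/**'], label: 'CATEGORY' });
});

const crawler = new PlaywrightCrawler({ requestHandler: router });
await crawler.run(['https://example.com']);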
Deduplication
Crawlee automatically deduplicates requests, but you can customize this behavior:
import { RequestQueue } from 'crawlee';
const requestQueue = await RequestQueue.open();
// Add request with custom uniqueKey for deduplication
await requestQueue.addRequest({
    url: 'https://example.com/product?id=123&ref=home',
    uniqueKey: 'product-123', // Custom deduplication key
});
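The same idea applies to discovered links: a short sketch that strips the query string before computing the uniqueKey, so URLs differing only in tracking parameters are treated as duplicates:
await enqueueLinks({
    transformRequestFunction: (req) => {
        const url = new URL(req.url);
        // Ignore query parameters when deciding whether two URLs are the same page
        url.search = '';
        req.uniqueKey = url.href;
        return req;
    },
});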
Best Practices
- Start with broad filters: Begin with inclusive patterns and refine based on crawl results
- Prioritize deliberately: Only send genuinely important requests to the front of the queue with forefront: true
- Monitor queue size: Track how many requests are pending to avoid memory issues (see the sketch after this list)
- Test filters first: Run small test crawls to validate your filtering logic
- Consider crawl depth: Limit depth to prevent infinite crawling loops
- Balance priority: Don't push every request to the front of the queue, as that defeats the purpose
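For queue monitoring, one option is RequestQueue.getInfo(); the sketch below assumes the counters exposed by recent Crawlee versions (pendingRequestCount, handledRequestCount):
import { RequestQueue } from 'crawlee';

const requestQueue = await RequestQueue.open();

// Check how much work is still waiting versus already done
const info = await requestQueue.getInfo();
console.log(`Pending: ${info?.pendingRequestCount}, handled: ${info?.handledRequestCount}`);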
When working with browser automation for more complex scraping scenarios, understanding how to handle browser sessions in Puppeteer can be helpful since Crawlee builds on similar concepts. Additionally, if you need to monitor network requests in Puppeteer, these techniques complement Crawlee's request management features.
Conclusion
Filtering and prioritizing requests in Crawlee is essential for building efficient, targeted web scrapers. By combining glob patterns, custom transformation functions, and priority settings, you can create sophisticated crawling strategies that focus on the most important content while avoiding unnecessary pages. Whether you're using JavaScript or Python, Crawlee provides flexible APIs to implement these patterns effectively.
Remember to always respect websites' robots.txt files and crawling policies, and implement appropriate rate limiting to avoid overwhelming target servers.
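For example, a simple way to rate-limit a Crawlee crawler is to cap its concurrency and request rate; the numbers below are placeholders to tune per target site:
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    maxConcurrency: 5,          // At most 5 pages processed in parallel
    maxRequestsPerMinute: 60,   // Roughly one request per second overall
    requestHandler: async ({ request, log }) => {
        log.info(`Processing: ${request.url}`);
    },
});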