How do I Enqueue Links for Crawling in Crawlee?
Enqueueing links is a fundamental operation in Crawlee that allows you to dynamically discover and add new URLs to your crawling queue. Crawlee provides several powerful methods for link enqueueing, from automatic link discovery to manual queue management. This guide covers all the essential techniques for efficiently managing your crawl queue.
Understanding Crawlee's Request Queue
Before diving into link enqueueing, it's important to understand that Crawlee uses a RequestQueue to manage URLs that need to be crawled. The queue ensures that:
- URLs are processed in order
- Duplicate URLs are automatically filtered
- Failed requests can be retried
- The crawl can be paused and resumed
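If you want to see these guarantees in action before wiring up a full crawler, here is a minimal sketch (the URL is a placeholder) that opens the default queue and adds the same URL twice; the second add is detected as a duplicate:
import { RequestQueue } from 'crawlee';

const queue = await RequestQueue.open();

// The second addRequest() call is recognised as a duplicate and ignored
await queue.addRequest({ url: 'https://example.com/page' });
await queue.addRequest({ url: 'https://example.com/page' });

const info = await queue.getInfo();
console.log(`Total requests in queue: ${info?.totalRequestCount}`); // 1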
Using enqueueLinks() - The Primary Method
The enqueueLinks() method is the most common and convenient way to add links to your crawl queue. It automatically discovers links on the current page and adds them to the queue.
Basic Usage in JavaScript/TypeScript
import { PlaywrightCrawler } from 'crawlee';
const crawler = new PlaywrightCrawler({
async requestHandler({ request, page, enqueueLinks }) {
console.log(`Processing: ${request.url}`);
// Automatically enqueue all links found on the page
await enqueueLinks();
// Your scraping logic here
const title = await page.title();
console.log(`Title: ${title}`);
},
});
await crawler.run(['https://example.com']);
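Note that by default, enqueueLinks() only follows links that stay on the same hostname as the page being processed. If you need different behavior, the strategy option controls the scope; a short sketch, used inside your request handler:
await enqueueLinks({
    // 'same-hostname' is the default; 'same-domain' also follows subdomains,
    // and 'all' follows links to any site
    strategy: 'same-domain',
});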
Basic Usage in Python
import asyncio

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext

async def main() -> None:
    crawler = PlaywrightCrawler()

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        print(f'Processing: {context.request.url}')
        # Automatically enqueue all links found on the page
        await context.enqueue_links()
        # Your scraping logic here
        title = await context.page.title()
        print(f'Title: {title}')

    await crawler.run(['https://example.com'])

asyncio.run(main())
Filtering Links with CSS Selectors
One of the most powerful features of enqueueLinks() is the ability to filter which links to enqueue using CSS selectors. This is crucial for focused crawling.
Using the selector Option
import { CheerioCrawler } from 'crawlee';
const crawler = new CheerioCrawler({
async requestHandler({ request, $, enqueueLinks }) {
console.log(`Crawling: ${request.url}`);
// Only enqueue links from article sections
await enqueueLinks({
selector: 'article a.post-link',
});
// Extract article data
const articles = [];
$('article').each((i, el) => {
articles.push({
title: $(el).find('h2').text(),
link: $(el).find('a').attr('href'),
});
});
},
});
In Python:
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
crawler = BeautifulSoupCrawler()
@crawler.router.default_handler
async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
print(f'Crawling: {context.request.url}')
# Only enqueue links from article sections
await context.enqueue_links(
selector='article a.post-link'
)
# Extract article data
for article in context.soup.select('article'):
title = article.select_one('h2').get_text()
link = article.select_one('a')['href']
Advanced Filtering with globs and RegExp
Crawlee allows you to filter URLs using glob patterns or regular expressions, giving you fine-grained control over which links to follow.
Using Glob Patterns
await enqueueLinks({
// Only enqueue product pages
globs: ['https://example.com/products/**'],
});
// Or exclude certain patterns
await enqueueLinks({
globs: ['https://example.com/**'],
exclude: [
'https://example.com/admin/**',
'https://example.com/login/**',
],
});
Using Regular Expressions
await enqueueLinks({
// Only enqueue URLs matching the pattern
regexps: [/https:\/\/example\.com\/category\/[\w-]+\/product-\d+/],
});
In Python:
import re

await context.enqueue_links(
    # Only enqueue product pages; include accepts globs or compiled regular expressions
    include=[re.compile(r'https://example\.com/category/[\w-]+/product-\d+')],
)
Manually Adding URLs to the Queue
For more control, you can manually add URLs to the request queue. This is useful when you need to construct URLs programmatically or add URLs from external sources.
Adding Single Requests
import { CheerioCrawler } from 'crawlee';
const crawler = new CheerioCrawler({
async requestHandler({ request, $, crawler }) {
console.log(`Processing: ${request.url}`);
// Manually construct and add URLs
const categoryIds = [1, 2, 3, 4, 5];
for (const id of categoryIds) {
await crawler.addRequests([
{
url: `https://example.com/category/${id}`,
label: 'CATEGORY',
userData: { categoryId: id },
}
]);
}
},
});
In Python:
from crawlee import Request
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext

crawler = BeautifulSoupCrawler()

@crawler.router.default_handler
async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
    print(f'Processing: {context.request.url}')
    # Manually construct and add URLs
    category_ids = [1, 2, 3, 4, 5]
    for category_id in category_ids:
        await context.add_requests([
            Request.from_url(
                f'https://example.com/category/{category_id}',
                label='CATEGORY',
                user_data={'category_id': category_id},
            ),
        ])
Using Different Handlers for Different Page Types
Crawlee's labeling system allows you to route different types of pages to different handlers, which is particularly useful for managing complex crawling workflows such as listing and detail pages.
Route-Based Crawling
import { PlaywrightCrawler } from 'crawlee';
const crawler = new PlaywrightCrawler();
// Handler for listing pages
crawler.router.addHandler('LIST', async ({ request, page, enqueueLinks }) => {
console.log(`Processing listing: ${request.url}`);
// Enqueue product links with a specific label
await enqueueLinks({
selector: 'a.product-link',
label: 'PRODUCT',
});
});
// Handler for product pages
crawler.router.addHandler('PRODUCT', async ({ request, page }) => {
console.log(`Processing product: ${request.url}`);
const product = await page.evaluate(() => ({
name: document.querySelector('h1.product-name')?.textContent,
price: document.querySelector('.price')?.textContent,
description: document.querySelector('.description')?.textContent,
}));
console.log(product);
});
// Start with listing pages
await crawler.run([
{ url: 'https://example.com/products', label: 'LIST' }
]);
Transforming Requests Before Enqueueing
You can modify requests before they're added to the queue using the transformRequestFunction option.
await enqueueLinks({
selector: 'a.product-link',
transformRequestFunction: (request) => {
// Add custom headers
request.headers = {
...request.headers,
'X-Custom-Header': 'value',
};
// Add user data
request.userData = {
...request.userData,
timestamp: Date.now(),
};
return request;
},
});
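The same hook can also drop links entirely: if transformRequestFunction returns a falsy value, that link is skipped instead of enqueued. A small sketch (the .pdf check is just an illustration):
await enqueueLinks({
    selector: 'a.product-link',
    transformRequestFunction: (request) => {
        // Skip links to PDF files instead of enqueueing them
        if (request.url.endsWith('.pdf')) return false;
        return request;
    },
});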
Handling Pagination
A common use case for link enqueueing is handling pagination. Here's a practical example:
import { CheerioCrawler } from 'crawlee';
const crawler = new CheerioCrawler({
async requestHandler({ request, $, enqueueLinks }) {
// Extract data from current page
const items = [];
$('.item').each((i, el) => {
items.push({
title: $(el).find('.title').text(),
price: $(el).find('.price').text(),
});
});
console.log(`Found ${items.length} items on ${request.url}`);
// Enqueue next page if it exists
await enqueueLinks({
selector: 'a.next-page',
});
},
});
await crawler.run(['https://example.com/products?page=1']);
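Some sites expose the page number only in the URL and don't render a dedicated next link. In that case you can construct the next page URL yourself; the sketch below assumes a ?page= query parameter and that crawler is also destructured from the handler context:
const currentPage = Number(new URL(request.url).searchParams.get('page') ?? '1');
if (items.length > 0) {
    // Keep paginating only while the current page still returns items
    await crawler.addRequests([`https://example.com/products?page=${currentPage + 1}`]);
}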
Working with Request Queue Directly
For advanced use cases, you can access the RequestQueue directly to take full control over request management.
import { PlaywrightCrawler, RequestQueue } from 'crawlee';
const crawler = new PlaywrightCrawler({
async requestHandler({ request, page, crawler }) {
const requestQueue = await crawler.getRequestQueue();
// Add multiple requests at once
await requestQueue.addRequests([
{ url: 'https://example.com/page1' },
{ url: 'https://example.com/page2' },
{ url: 'https://example.com/page3' },
]);
// Check whether the whole queue has been processed
const isFinished = await requestQueue.isFinished();
console.log(`Queue finished: ${isFinished}`);
},
});
Limiting Crawl Depth
To prevent your crawler from going too deep, you can implement depth limiting:
import { CheerioCrawler } from 'crawlee';
const MAX_DEPTH = 3;
const crawler = new CheerioCrawler({
async requestHandler({ request, enqueueLinks }) {
const depth = request.userData.depth || 0;
console.log(`Processing ${request.url} at depth ${depth}`);
// Only enqueue links if we haven't reached max depth
if (depth < MAX_DEPTH) {
await enqueueLinks({
transformRequestFunction: (req) => {
req.userData = {
...req.userData,
depth: depth + 1,
};
return req;
},
});
}
},
});
await crawler.run([
{ url: 'https://example.com', userData: { depth: 0 } }
]);
Best Practices for Link Enqueueing
1. Use Specific Selectors
Always use the most specific CSS selectors possible to avoid enqueueing unwanted links:
// Good - specific selector
await enqueueLinks({ selector: 'main article a.read-more' });
// Bad - too broad
await enqueueLinks({ selector: 'a' });
2. Implement URL Filtering
Use globs or regexps to ensure you only crawl relevant pages:
await enqueueLinks({
globs: ['https://example.com/blog/**'],
exclude: ['**/*?page=*', '**/tag/**'],
});
3. Add Meaningful Labels
Label your requests to make routing and debugging easier:
await enqueueLinks({
selector: 'a.category',
label: 'CATEGORY',
});
4. Monitor Queue Size
Keep track of your queue to prevent memory issues:
const requestQueue = await crawler.getRequestQueue();
const queueSize = await requestQueue.getInfo();
console.log(`Queue has ${queueSize.totalRequestCount} total requests`);
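If you'd rather enforce a hard cap up front than watch the queue grow, the maxRequestsPerCrawl option stops the crawler once a given number of requests has been handled (500 here is an arbitrary example):
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // Stop after 500 requests have been handled, no matter how many links were enqueued
    maxRequestsPerCrawl: 500,
    async requestHandler({ enqueueLinks }) {
        await enqueueLinks();
    },
});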
Handling AJAX-Loaded Content
When dealing with dynamically loaded content, you may need to wait for the relevant elements to appear before enqueueing links:
import { PlaywrightCrawler } from 'crawlee';
const crawler = new PlaywrightCrawler({
async requestHandler({ request, page, enqueueLinks }) {
// Wait for dynamic content to load
await page.waitForSelector('.dynamic-links', { timeout: 5000 });
// Now enqueue the dynamically loaded links
await enqueueLinks({
selector: '.dynamic-links a',
});
},
});
Conclusion
Enqueueing links in Crawlee is a flexible and powerful feature that forms the backbone of any web scraping project. Whether you're using the convenient enqueueLinks() method with selectors and filters, manually managing the request queue, or implementing complex routing logic, Crawlee provides all the tools you need to build efficient and maintainable crawlers.
The key is to start simple with basic link enqueueing and gradually add filtering, labeling, and depth control as your scraping needs become more sophisticated. By following the best practices outlined in this guide, you'll be able to build robust crawlers that efficiently discover and process web content.