What is a RequestList in Crawlee and how do I use it?
A RequestList in Crawlee is a data structure that manages a static list of URLs to be crawled. It's one of two primary sources for managing crawl targets in Crawlee (the other being RequestQueue). RequestList is ideal when you have a predetermined list of URLs that you want to scrape, such as product pages, search result URLs, or a sitemap.
Unlike RequestQueue, which is designed for dynamic crawling where new URLs are discovered and added during the crawl, RequestList works with a fixed set of URLs defined at initialization. This makes it perfect for scenarios where you know all the URLs upfront.
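To make the contrast concrete, here is a minimal sketch (placeholder URLs, default storage settings): the RequestList receives its full set of URLs up front, while a RequestQueue can keep accepting new URLs while the crawl runs.
import { RequestList, RequestQueue } from 'crawlee';
// Static: every URL is known before the crawl starts
const staticList = await RequestList.open('static-list', [
    { url: 'https://example.com/page1' },
    { url: 'https://example.com/page2' },
]);
// Dynamic: URLs can be added at any point, e.g. from inside a request handler
const queue = await RequestQueue.open();
await queue.addRequest({ url: 'https://example.com/discovered-later' });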
Key Features of RequestList
RequestList provides several important features:
- Persistence: Automatically saves progress to disk, allowing you to resume interrupted crawls
- Deduplication: Ensures each URL is processed only once, even if added multiple times (see the sketch after this list)
- Error Handling: Tracks failed requests and allows retry logic
- In-Memory Storage: Requests are loaded at initialization and kept in memory, which keeps access fast; for very large lists, see the memory note under Best Practices
- State Management: Tracks which URLs have been processed, are pending, or have failed
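The deduplication behaviour is easy to verify: if the same URL appears twice in the sources, only one request is kept. A minimal sketch (placeholder URLs, hypothetical list name):
import { RequestList } from 'crawlee';
const requestList = await RequestList.open('dedup-demo', [
    { url: 'https://example.com/page1' },
    { url: 'https://example.com/page1' }, // duplicate, dropped automatically
    { url: 'https://example.com/page2' },
]);
console.log(requestList.length()); // 2 unique requests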
Basic Usage
Here's a simple example of creating and using a RequestList in JavaScript/TypeScript:
import { CheerioCrawler, Dataset, RequestList } from 'crawlee';
// Create a RequestList with URLs
const requestList = await RequestList.open('my-list', [
{ url: 'https://example.com/page1' },
{ url: 'https://example.com/page2' },
{ url: 'https://example.com/page3' },
]);
// Create a crawler that uses the RequestList
const crawler = new CheerioCrawler({
requestList,
async requestHandler({ request, $, enqueueLinks }) {
const title = $('title').text();
console.log(`Title of ${request.url}: ${title}`);
// Extract data as needed
const data = {
url: request.url,
title: title,
// Add more extracted data
};
await Dataset.pushData(data);
},
});
// Run the crawler
await crawler.run();
Python Implementation
Crawlee also has a Python version, where RequestList works along the same lines. The exact import paths and constructor options vary between Crawlee for Python releases, but the overall flow looks like this:
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler
from crawlee.request_list import RequestList
async def main():
# Create a RequestList with URLs
request_list = await RequestList.open(
name='my-list',
requests=[
{'url': 'https://example.com/page1'},
{'url': 'https://example.com/page2'},
{'url': 'https://example.com/page3'},
]
)
# Create a crawler
crawler = BeautifulSoupCrawler(
request_list=request_list,
)
@crawler.router.default_handler
async def request_handler(context):
title = context.soup.title.string if context.soup.title else 'No title'
context.log.info(f'Title of {context.request.url}: {title}')
# Extract and save data
await context.push_data({
'url': str(context.request.url),
'title': title,
})
# Run the crawler
await crawler.run()
if __name__ == '__main__':
import asyncio
asyncio.run(main())
Adding Custom Data to Requests
You can attach custom metadata to each request in the RequestList. This is useful for passing additional context that you'll need during processing:
const requestList = await RequestList.open('my-list', [
{
url: 'https://example.com/products/laptop',
userData: {
category: 'electronics',
priority: 'high',
},
},
{
url: 'https://example.com/products/book',
userData: {
category: 'books',
priority: 'low',
},
},
]);
const crawler = new CheerioCrawler({
requestList,
async requestHandler({ request, $ }) {
const category = request.userData.category;
const priority = request.userData.priority;
console.log(`Processing ${category} with ${priority} priority`);
// Use the custom data in your scraping logic
},
});
Loading URLs from a File
For large lists of URLs, you can load them from a file:
import { readFileSync } from 'fs';
import { RequestList } from 'crawlee';
// Read URLs from a file (one URL per line)
const urls = readFileSync('urls.txt', 'utf-8')
.split('\n')
.filter(url => url.trim())
.map(url => ({ url: url.trim() }));
const requestList = await RequestList.open('my-list', urls);
Or from a JSON file:
import { readFileSync } from 'fs';
const urls = JSON.parse(readFileSync('urls.json', 'utf-8'));
const requestList = await RequestList.open('my-list', urls);
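This assumes urls.json already contains objects with a url property, i.e. the shape RequestList expects. If your file stores plain URL strings instead (a hypothetical format), normalize them first:
import { readFileSync } from 'fs';
import { RequestList } from 'crawlee';
// Hypothetical urls.json: ["https://example.com/page1", "https://example.com/page2"]
const raw = JSON.parse(readFileSync('urls.json', 'utf-8'));
const sources = raw.map((url) => ({ url }));
const requestList = await RequestList.open('my-list', sources);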
RequestList vs RequestQueue
Understanding when to use RequestList versus RequestQueue is crucial:
Use RequestList when:
- You have a predetermined list of URLs
- URLs are known before the crawl starts
- You're scraping a specific set of pages (e.g., from a sitemap)
- You need to crawl the same list multiple times
- You want simple, file-based persistence
Use RequestQueue when:
- URLs are discovered during crawling
- You need to follow links dynamically
- The crawl scope expands as you discover new pages
- You need distributed crawling across multiple machines
- You want cloud-based persistence
You can also use both together:
import { PlaywrightCrawler, RequestList, RequestQueue } from 'crawlee';
const requestList = await RequestList.open('start-urls', [
{ url: 'https://example.com/category1' },
{ url: 'https://example.com/category2' },
]);
const requestQueue = await RequestQueue.open('discovered-urls');
const crawler = new PlaywrightCrawler({
requestList,
requestQueue,
async requestHandler({ request, page, enqueueLinks }) {
// Scrape the current page
const title = await page.title();
console.log(`Title: ${title}`);
// Discover and add new links to the queue
await enqueueLinks({
selector: 'a.product-link',
transformRequestFunction: (req) => {
req.userData = { foundOn: request.url };
return req;
},
});
},
});
await crawler.run();
This approach is similar to how you navigate to different pages using Puppeteer, where you can combine predefined navigation with dynamic link discovery.
Handling Request Persistence
RequestList automatically persists its state to disk. This means if your crawler crashes or is interrupted, you can resume from where you left off:
// First run - will process all URLs
const requestList = await RequestList.open('my-persistent-list', [
{ url: 'https://example.com/page1' },
{ url: 'https://example.com/page2' },
{ url: 'https://example.com/page3' },
]);
// If the crawler crashes and you restart it, RequestList will skip
// already processed URLs automatically
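Crawlee persists this state for you at regular intervals while the crawler runs, but you can also force a snapshot yourself, for example before a planned shutdown. A minimal sketch using the requestList created above:
// Write the current progress to the key-value store immediately
await requestList.persistState();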
To start fresh and ignore any previous state, open the list without a name; passing null disables persistence:
const requestList = await RequestList.open(null, sources); // state will not be persisted between runs
Advanced Configuration
RequestList supports several configuration options:
const requestList = await RequestList.open('advanced-list', sources, {
persistStateKey: 'my-custom-state', // key in the key-value store under which crawl progress is saved
persistRequestsKey: 'my-custom-requests', // key under which the original request sources are saved
keepDuplicateUrls: false, // Remove duplicate URLs (default)
});
Error Handling and Retries
RequestList integrates with Crawlee's retry mechanism. Failed requests are automatically retried according to your crawler configuration:
const crawler = new CheerioCrawler({
requestList,
maxRequestRetries: 3, // Retry failed requests up to 3 times
async requestHandler({ request, $ }) {
// Your scraping logic
},
async failedRequestHandler({ request }, error) {
console.log(`Request ${request.url} failed: ${error.message}`);
// Handle permanently failed requests
},
});
This is particularly useful when dealing with browser sessions and timeouts in more complex scraping scenarios.
Working with Different Crawler Types
RequestList works seamlessly with all Crawlee crawler types:
With CheerioCrawler (for static HTML)
import { CheerioCrawler, RequestList } from 'crawlee';
const requestList = await RequestList.open('cheerio-list', urls);
const crawler = new CheerioCrawler({
requestList,
async requestHandler({ $, request }) {
const title = $('h1').text();
// Fast HTML parsing
},
});
With PlaywrightCrawler (for JavaScript-rendered pages)
import { PlaywrightCrawler, RequestList } from 'crawlee';
const requestList = await RequestList.open('playwright-list', urls);
const crawler = new PlaywrightCrawler({
requestList,
async requestHandler({ page, request }) {
await page.waitForSelector('.content');
const content = await page.$eval('.content', el => el.textContent);
// Full browser automation
},
});
With PuppeteerCrawler
import { PuppeteerCrawler, RequestList } from 'crawlee';
const requestList = await RequestList.open('puppeteer-list', urls);
const crawler = new PuppeteerCrawler({
requestList,
async requestHandler({ page, request }) {
// Similar to Playwright but using Puppeteer
const title = await page.title();
},
});
When working with browser automation, you might also need to understand how to handle AJAX requests to ensure all dynamic content is loaded before scraping.
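For example, with PlaywrightCrawler you can wait for a specific network response or element before extracting anything. In this sketch, the /api/products endpoint and the .product-list selector are placeholders for whatever your target site actually uses:
import { PlaywrightCrawler, RequestList } from 'crawlee';
const requestList = await RequestList.open('ajax-list', urls);
const crawler = new PlaywrightCrawler({
    requestList,
    async requestHandler({ page }) {
        // Wait for the (hypothetical) API call that populates the page
        await page.waitForResponse((res) => res.url().includes('/api/products'));
        await page.waitForSelector('.product-list');
        // Extract data only after the dynamic content is present
    },
});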
Best Practices
- Use Descriptive Names: Give your RequestList a meaningful name for easier debugging and state management
- Add User Data: Include relevant metadata in userData for context during processing
- Validate URLs: Ensure URLs are properly formatted before adding them to the list (see the sketch after this list)
- Monitor Progress: Use logging to track how many requests have been processed
- Handle Failures: Implement failedRequestHandler to deal with permanently failed requests
- Consider Memory: For extremely large lists (millions of URLs), consider splitting into multiple smaller RequestLists
- Clean State: Remove old state files when starting fresh crawls to avoid confusion
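As an example of the URL-validation point above, malformed entries can be filtered out with the built-in URL constructor before the list is created. A minimal sketch with placeholder URLs:
import { RequestList } from 'crawlee';
const rawUrls = ['https://example.com/page1', 'not-a-url', 'https://example.com/page2'];
const sources = rawUrls
    .filter((url) => {
        try {
            new URL(url); // throws on malformed URLs
            return true;
        } catch {
            console.warn(`Skipping invalid URL: ${url}`);
            return false;
        }
    })
    .map((url) => ({ url }));
const requestList = await RequestList.open('validated-list', sources);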
Checking RequestList Status
You can monitor the state of your RequestList:
// Get statistics about the RequestList
const total = requestList.length(); // total number of unique requests
const handled = requestList.handledCount(); // requests already processed
console.log(`Total requests: ${total}`);
console.log(`Handled requests: ${handled}`);
console.log(`Pending requests: ${total - handled}`);
// Check if all requests have been handled
const isFinished = await requestList.isFinished();
console.log(`All requests processed: ${isFinished}`);
Conclusion
RequestList is a powerful tool in Crawlee for managing static lists of URLs. It provides persistence, deduplication, and seamless integration with all Crawlee crawler types. Whether you're scraping a predefined set of pages or using it as a starting point for dynamic crawling, RequestList simplifies URL management and makes your scrapers more robust and resumable.
For most use cases involving known URLs, RequestList is the perfect choice. Combine it with RequestQueue when you need to discover and follow links dynamically, giving you the best of both worlds for comprehensive web scraping projects.