What is a RequestList in Crawlee and how do I use it?
A RequestList in Crawlee is a data structure that manages a static list of URLs to be crawled. It's one of two primary sources for managing crawl targets in Crawlee (the other being RequestQueue). RequestList is ideal when you have a predetermined list of URLs that you want to scrape, such as product pages, search result URLs, or a sitemap.
Unlike RequestQueue, which is designed for dynamic crawling where new URLs are discovered and added during the crawl, RequestList works with a fixed set of URLs defined at initialization. This makes it perfect for scenarios where you know all the URLs upfront.
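To make the contrast concrete, here is a minimal sketch (placeholder URLs, default storage settings): the RequestList receives its full set of URLs up front, while a RequestQueue can keep accepting new URLs while the crawl runs.
import { RequestList, RequestQueue } from 'crawlee';
// Static: every URL is known before the crawl starts
const staticList = await RequestList.open('static-list', [
    { url: 'https://example.com/page1' },
    { url: 'https://example.com/page2' },
]);
// Dynamic: URLs can be added at any point, e.g. from inside a request handler
const queue = await RequestQueue.open();
await queue.addRequest({ url: 'https://example.com/discovered-later' });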
Key Features of RequestList
RequestList provides several important features:
- Persistence: Automatically saves progress to disk, allowing you to resume interrupted crawls
- Deduplication: Ensures each URL is processed only once, even if added multiple times (see the sketch after this list)
- Error Handling: Tracks failed requests and allows retry logic
- In-Memory Storage: Requests are loaded at initialization and kept in memory, which keeps access fast; for very large lists, see the memory note under Best Practices
- State Management: Tracks which URLs have been processed, are pending, or have failed
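The deduplication behaviour is easy to verify: if the same URL appears twice in the sources, only one request is kept. A minimal sketch (placeholder URLs, hypothetical list name):
import { RequestList } from 'crawlee';
const requestList = await RequestList.open('dedup-demo', [
    { url: 'https://example.com/page1' },
    { url: 'https://example.com/page1' }, // duplicate, dropped automatically
    { url: 'https://example.com/page2' },
]);
console.log(requestList.length()); // 2 unique requests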
Basic Usage
Here's a simple example of creating and using a RequestList in JavaScript/TypeScript:
import { CheerioCrawler, Dataset, RequestList } from 'crawlee';
// Create a RequestList with URLs
const requestList = await RequestList.open('my-list', [
{ url: 'https://example.com/page1' },
{ url: 'https://example.com/page2' },
{ url: 'https://example.com/page3' },
]);
// Create a crawler that uses the RequestList
const crawler = new CheerioCrawler({
requestList,
async requestHandler({ request, $, enqueueLinks }) {
const title = $('title').text();
console.log(`Title of ${request.url}: ${title}`);
// Extract data as needed
const data = {
url: request.url,
title: title,
// Add more extracted data
};
await Dataset.pushData(data);
},
});
// Run the crawler
await crawler.run();
Python Implementation
Crawlee also has a Python version, where RequestList works along the same lines. The exact import paths and constructor options vary between Crawlee for Python releases, but the overall flow looks like this:
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler
from crawlee.request_list import RequestList
async def main():
# Create a RequestList with URLs
request_list = await RequestList.open(
name='my-list',
requests=[
{'url': 'https://example.com/page1'},
{'url': 'https://example.com/page2'},
{'url': 'https://example.com/page3'},
]
)
# Create a crawler
crawler = BeautifulSoupCrawler(
request_list=request_list,
)
@crawler.router.default_handler
async def request_handler(context):
title = context.soup.title.string if context.soup.title else 'No title'
context.log.info(f'Title of {context.request.url}: {title}')
# Extract and save data
await context.push_data({
'url': str(context.request.url),
'title': title,
})
# Run the crawler
await crawler.run()
if __name__ == '__main__':
import asyncio
asyncio.run(main())
Adding Custom Data to Requests
You can attach custom metadata to each request in the RequestList. This is useful for passing additional context that you'll need during processing:
const requestList = await RequestList.open('my-list', [
{
url: 'https://example.com/products/laptop',
userData: {
category: 'electronics',
priority: 'high',
},
},
{
url: 'https://example.com/products/book',
userData: {
category: 'books',
priority: 'low',
},
},
]);
const crawler = new CheerioCrawler({
requestList,
async requestHandler({ request, $ }) {
const category = request.userData.category;
const priority = request.userData.priority;
console.log(`Processing ${category} with ${priority} priority`);
// Use the custom data in your scraping logic
},
});
Loading URLs from a File
For large lists of URLs, you can load them from a file:
import { readFileSync } from 'fs';
import { RequestList } from 'crawlee';
// Read URLs from a file (one URL per line)
const urls = readFileSync('urls.txt', 'utf-8')
.split('\n')
.filter(url => url.trim())
.map(url => ({ url: url.trim() }));
const requestList = await RequestList.open('my-list', urls);
Or from a JSON file:
import { readFileSync } from 'fs';
const urls = JSON.parse(readFileSync('urls.json', 'utf-8'));
const requestList = await RequestList.open('my-list', urls);
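This assumes urls.json already contains objects with a url property, i.e. the shape RequestList expects. If your file stores plain URL strings instead (a hypothetical format), normalize them first:
import { readFileSync } from 'fs';
import { RequestList } from 'crawlee';
// Hypothetical urls.json: ["https://example.com/page1", "https://example.com/page2"]
const raw = JSON.parse(readFileSync('urls.json', 'utf-8'));
const sources = raw.map((url) => ({ url }));
const requestList = await RequestList.open('my-list', sources);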
RequestList vs RequestQueue
Understanding when to use RequestList versus RequestQueue is crucial:
Use RequestList when:
- You have a predetermined list of URLs
- URLs are known before the crawl starts
- You're scraping a specific set of pages (e.g., from a sitemap)
- You need to crawl the same list multiple times
- You want simple, file-based persistence
Use RequestQueue when:
- URLs are discovered during crawling
- You need to follow links dynamically
- The crawl scope expands as you discover new pages
- You need distributed crawling across multiple machines
- You want cloud-based persistence
You can also use both together:
import { PlaywrightCrawler, RequestList, RequestQueue } from 'crawlee';
const requestList = await RequestList.open('start-urls', [
{ url: 'https://example.com/category1' },
{ url: 'https://example.com/category2' },
]);
const requestQueue = await RequestQueue.open('discovered-urls');
const crawler = new PlaywrightCrawler({
requestList,
requestQueue,
async requestHandler({ request, page, enqueueLinks }) {
// Scrape the current page
const title = await page.title();
console.log(`Title: ${title}`);
// Discover and add new links to the queue
await enqueueLinks({
selector: 'a.product-link',
transformRequestFunction: (req) => {
req.userData = { foundOn: request.url };
return req;
},
});
},
});
await crawler.run();
This approach is similar to how you navigate to different pages using Puppeteer, where you can combine predefined navigation with dynamic link discovery.
Handling Request Persistence
RequestList automatically persists its state to disk. This means if your crawler crashes or is interrupted, you can resume from where you left off:
// First run - will process all URLs
const requestList = await RequestList.open('my-persistent-list', [
{ url: 'https://example.com/page1' },
{ url: 'https://example.com/page2' },
{ url: 'https://example.com/page3' },
]);
// If the crawler crashes and you restart it, RequestList will skip
// already processed URLs automatically
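Crawlee persists this state for you at regular intervals while the crawler runs, but you can also force a snapshot yourself, for example before a planned shutdown. A minimal sketch using the requestList created above:
// Write the current progress to the key-value store immediately
await requestList.persistState();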
To start fresh and ignore any previous state, open the list without a name; passing null disables persistence:
const requestList = await RequestList.open(null, sources); // state will not be persisted between runs
Advanced Configuration
RequestList supports several configuration options:
const requestList = await RequestList.open('advanced-list', sources, {
persistStateKey: 'my-custom-state', // key in the key-value store under which crawl progress is saved
persistRequestsKey: 'my-custom-requests', // key under which the original request sources are saved
keepDuplicateUrls: false, // Remove duplicate URLs (default)
});
Error Handling and Retries
RequestList integrates with Crawlee's retry mechanism. Failed requests are automatically retried according to your crawler configuration:
const crawler = new CheerioCrawler({
requestList,
maxRequestRetries: 3, // Retry failed requests up to 3 times
async requestHandler({ request, $ }) {
// Your scraping logic
},
async failedRequestHandler({ request }, error) {
console.log(`Request ${request.url} failed: ${error.message}`);
// Handle permanently failed requests
},
});
This is particularly useful when dealing with browser sessions and timeouts in more complex scraping scenarios.
Working with Different Crawler Types
RequestList works seamlessly with all Crawlee crawler types:
With CheerioCrawler (for static HTML)
import { CheerioCrawler, RequestList } from 'crawlee';
const requestList = await RequestList.open('cheerio-list', urls);
const crawler = new CheerioCrawler({
requestList,
async requestHandler({ $, request }) {
const title = $('h1').text();
// Fast HTML parsing
},
});
With PlaywrightCrawler (for JavaScript-rendered pages)
import { PlaywrightCrawler, RequestList } from 'crawlee';
const requestList = await RequestList.open('playwright-list', urls);
const crawler = new PlaywrightCrawler({
requestList,
async requestHandler({ page, request }) {
await page.waitForSelector('.content');
const content = await page.$eval('.content', el => el.textContent);
// Full browser automation
},
});
With PuppeteerCrawler
import { PuppeteerCrawler, RequestList } from 'crawlee';
const requestList = await RequestList.open('puppeteer-list', urls);
const crawler = new PuppeteerCrawler({
requestList,
async requestHandler({ page, request }) {
// Similar to Playwright but using Puppeteer
const title = await page.title();
},
});
When working with browser automation, you might also need to understand how to handle AJAX requests to ensure all dynamic content is loaded before scraping.
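For example, with PlaywrightCrawler you can wait for a specific network response or element before extracting anything. In this sketch, the /api/products endpoint and the .product-list selector are placeholders for whatever your target site actually uses:
import { PlaywrightCrawler, RequestList } from 'crawlee';
const requestList = await RequestList.open('ajax-list', urls);
const crawler = new PlaywrightCrawler({
    requestList,
    async requestHandler({ page }) {
        // Wait for the (hypothetical) API call that populates the page
        await page.waitForResponse((res) => res.url().includes('/api/products'));
        await page.waitForSelector('.product-list');
        // Extract data only after the dynamic content is present
    },
});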
Best Practices
- Use Descriptive Names: Give your RequestList a meaningful name for easier debugging and state management
- Add User Data: Include relevant metadata in userData for context during processing
- Validate URLs: Ensure URLs are properly formatted before adding them to the list (see the sketch after this list)
- Monitor Progress: Use logging to track how many requests have been processed
- Handle Failures: Implement failedRequestHandler to deal with permanently failed requests
- Consider Memory: For extremely large lists (millions of URLs), consider splitting into multiple smaller RequestLists
- Clean State: Remove old state files when starting fresh crawls to avoid confusion
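As an example of the URL-validation point above, malformed entries can be filtered out with the built-in URL constructor before the list is created. A minimal sketch with placeholder URLs:
import { RequestList } from 'crawlee';
const rawUrls = ['https://example.com/page1', 'not-a-url', 'https://example.com/page2'];
const sources = rawUrls
    .filter((url) => {
        try {
            new URL(url); // throws on malformed URLs
            return true;
        } catch {
            console.warn(`Skipping invalid URL: ${url}`);
            return false;
        }
    })
    .map((url) => ({ url }));
const requestList = await RequestList.open('validated-list', sources);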
Checking RequestList Status
You can monitor the state of your RequestList:
// Get statistics about the RequestList
const total = requestList.length(); // total number of unique requests
const handled = requestList.handledCount(); // requests already processed
console.log(`Total requests: ${total}`);
console.log(`Handled requests: ${handled}`);
console.log(`Pending requests: ${total - handled}`);
// Check if all requests have been handled
const isFinished = await requestList.isFinished();
console.log(`All requests processed: ${isFinished}`);
Conclusion
RequestList is a powerful tool in Crawlee for managing static lists of URLs. It provides persistence, deduplication, and seamless integration with all Crawlee crawler types. Whether you're scraping a predefined set of pages or using it as a starting point for dynamic crawling, RequestList simplifies URL management and makes your scrapers more robust and resumable.
For most use cases involving known URLs, RequestList is the perfect choice. Combine it with RequestQueue when you need to discover and follow links dynamically, giving you the best of both worlds for comprehensive web scraping projects.