How do I Enqueue Links for Crawling in Crawlee?
Enqueueing links is a fundamental operation in Crawlee that allows you to dynamically discover and add new URLs to your crawling queue. Crawlee provides several powerful methods for link enqueueing, from automatic link discovery to manual queue management. This guide covers all the essential techniques for efficiently managing your crawl queue.
Understanding Crawlee's Request Queue
Before diving into link enqueueing, it's important to understand that Crawlee uses a RequestQueue to manage URLs that need to be crawled. The queue ensures that:
- URLs are processed in order
- Duplicate URLs are automatically filtered
- Failed requests can be retried
- The crawl can be paused and resumed
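If you want to see these guarantees in action before wiring up a full crawler, here is a minimal sketch (the URL is a placeholder) that opens the default queue and adds the same URL twice; the second add is detected as a duplicate:
import { RequestQueue } from 'crawlee';

const queue = await RequestQueue.open();

// The second addRequest() call is recognised as a duplicate and ignored
await queue.addRequest({ url: 'https://example.com/page' });
await queue.addRequest({ url: 'https://example.com/page' });

const info = await queue.getInfo();
console.log(`Total requests in queue: ${info?.totalRequestCount}`); // 1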
Using enqueueLinks() - The Primary Method
The enqueueLinks() method is the most common and convenient way to add links to your crawl queue. It automatically discovers links on the current page and adds them to the queue.
Basic Usage in JavaScript/TypeScript
import { PlaywrightCrawler } from 'crawlee';
const crawler = new PlaywrightCrawler({
async requestHandler({ request, page, enqueueLinks }) {
console.log(`Processing: ${request.url}`);
// Automatically enqueue all links found on the page
await enqueueLinks();
// Your scraping logic here
const title = await page.title();
console.log(`Title: ${title}`);
},
});
await crawler.run(['https://example.com']);
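Note that by default, enqueueLinks() only follows links that stay on the same hostname as the page being processed. If you need different behavior, the strategy option controls the scope; a short sketch, used inside your request handler:
await enqueueLinks({
    // 'same-hostname' is the default; 'same-domain' also follows subdomains,
    // and 'all' follows links to any site
    strategy: 'same-domain',
});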
Basic Usage in Python
import asyncio

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext

async def main() -> None:
    crawler = PlaywrightCrawler()

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        print(f'Processing: {context.request.url}')
        # Automatically enqueue all links found on the page
        await context.enqueue_links()
        # Your scraping logic here
        title = await context.page.title()
        print(f'Title: {title}')

    await crawler.run(['https://example.com'])

asyncio.run(main())
Filtering Links with CSS Selectors
One of the most powerful features of enqueueLinks() is the ability to filter which links to enqueue using CSS selectors. This is crucial for focused crawling.
Using the selector Option
import { CheerioCrawler } from 'crawlee';
const crawler = new CheerioCrawler({
async requestHandler({ request, $, enqueueLinks }) {
console.log(`Crawling: ${request.url}`);
// Only enqueue links from article sections
await enqueueLinks({
selector: 'article a.post-link',
});
// Extract article data
const articles = [];
$('article').each((i, el) => {
articles.push({
title: $(el).find('h2').text(),
link: $(el).find('a').attr('href'),
});
});
},
});
In Python:
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
crawler = BeautifulSoupCrawler()
@crawler.router.default_handler
async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
print(f'Crawling: {context.request.url}')
# Only enqueue links from article sections
await context.enqueue_links(
selector='article a.post-link'
)
# Extract article data
for article in context.soup.select('article'):
title = article.select_one('h2').get_text()
link = article.select_one('a')['href']
Advanced Filtering with globs and RegExp
Crawlee allows you to filter URLs using glob patterns or regular expressions, giving you fine-grained control over which links to follow.
Using Glob Patterns
await enqueueLinks({
// Only enqueue product pages
globs: ['https://example.com/products/**'],
});
// Or exclude certain patterns
await enqueueLinks({
globs: ['https://example.com/**'],
exclude: [
'https://example.com/admin/**',
'https://example.com/login/**',
],
});
Using Regular Expressions
await enqueueLinks({
// Only enqueue URLs matching the pattern
regexps: [/https:\/\/example\.com\/category\/[\w-]+\/product-\d+/],
});
In Python:
import re

await context.enqueue_links(
    # Only enqueue product pages; include accepts globs or compiled regular expressions
    include=[re.compile(r'https://example\.com/category/[\w-]+/product-\d+')],
)
Manually Adding URLs to the Queue
For more control, you can manually add URLs to the request queue. This is useful when you need to construct URLs programmatically or add URLs from external sources.
Adding Single Requests
import { CheerioCrawler } from 'crawlee';
const crawler = new CheerioCrawler({
async requestHandler({ request, $, crawler }) {
console.log(`Processing: ${request.url}`);
// Manually construct and add URLs
const categoryIds = [1, 2, 3, 4, 5];
for (const id of categoryIds) {
await crawler.addRequests([
{
url: `https://example.com/category/${id}`,
label: 'CATEGORY',
userData: { categoryId: id },
}
]);
}
},
});
In Python:
from crawlee import Request
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext

crawler = BeautifulSoupCrawler()

@crawler.router.default_handler
async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
    print(f'Processing: {context.request.url}')
    # Manually construct and add URLs
    category_ids = [1, 2, 3, 4, 5]
    for category_id in category_ids:
        await context.add_requests([
            Request.from_url(
                f'https://example.com/category/{category_id}',
                label='CATEGORY',
                user_data={'category_id': category_id},
            ),
        ])
Using Different Handlers for Different Page Types
Crawlee's labeling system allows you to route different types of pages to different handlers, which is particularly useful for managing complex crawling workflows such as listing and detail pages.
Route-Based Crawling
import { PlaywrightCrawler } from 'crawlee';
const crawler = new PlaywrightCrawler();
// Handler for listing pages
crawler.router.addHandler('LIST', async ({ request, page, enqueueLinks }) => {
console.log(`Processing listing: ${request.url}`);
// Enqueue product links with a specific label
await enqueueLinks({
selector: 'a.product-link',
label: 'PRODUCT',
});
});
// Handler for product pages
crawler.router.addHandler('PRODUCT', async ({ request, page }) => {
console.log(`Processing product: ${request.url}`);
const product = await page.evaluate(() => ({
name: document.querySelector('h1.product-name')?.textContent,
price: document.querySelector('.price')?.textContent,
description: document.querySelector('.description')?.textContent,
}));
console.log(product);
});
// Start with listing pages
await crawler.run([
{ url: 'https://example.com/products', label: 'LIST' }
]);
Transforming Requests Before Enqueueing
You can modify requests before they're added to the queue using the transformRequestFunction option.
await enqueueLinks({
selector: 'a.product-link',
transformRequestFunction: (request) => {
// Add custom headers
request.headers = {
...request.headers,
'X-Custom-Header': 'value',
};
// Add user data
request.userData = {
...request.userData,
timestamp: Date.now(),
};
return request;
},
});
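The same hook can also drop links entirely: if transformRequestFunction returns a falsy value, that link is skipped instead of enqueued. A small sketch (the .pdf check is just an illustration):
await enqueueLinks({
    selector: 'a.product-link',
    transformRequestFunction: (request) => {
        // Skip links to PDF files instead of enqueueing them
        if (request.url.endsWith('.pdf')) return false;
        return request;
    },
});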
Handling Pagination
A common use case for link enqueueing is handling pagination. Here's a practical example:
import { CheerioCrawler } from 'crawlee';
const crawler = new CheerioCrawler({
async requestHandler({ request, $, enqueueLinks }) {
// Extract data from current page
const items = [];
$('.item').each((i, el) => {
items.push({
title: $(el).find('.title').text(),
price: $(el).find('.price').text(),
});
});
console.log(`Found ${items.length} items on ${request.url}`);
// Enqueue next page if it exists
await enqueueLinks({
selector: 'a.next-page',
});
},
});
await crawler.run(['https://example.com/products?page=1']);
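Some sites expose the page number only in the URL and don't render a dedicated next link. In that case you can construct the next page URL yourself; the sketch below assumes a ?page= query parameter and that crawler is also destructured from the handler context:
const currentPage = Number(new URL(request.url).searchParams.get('page') ?? '1');
if (items.length > 0) {
    // Keep paginating only while the current page still returns items
    await crawler.addRequests([`https://example.com/products?page=${currentPage + 1}`]);
}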
Working with Request Queue Directly
For advanced use cases, you can access the RequestQueue directly to take full control over request management.
import { PlaywrightCrawler, RequestQueue } from 'crawlee';
const crawler = new PlaywrightCrawler({
async requestHandler({ request, page, crawler }) {
const requestQueue = await crawler.getRequestQueue();
// Add multiple requests at once
await requestQueue.addRequests([
{ url: 'https://example.com/page1' },
{ url: 'https://example.com/page2' },
{ url: 'https://example.com/page3' },
]);
// Check whether the whole queue has been processed
const isFinished = await requestQueue.isFinished();
console.log(`Queue finished: ${isFinished}`);
},
});
Limiting Crawl Depth
To prevent your crawler from going too deep, you can implement depth limiting:
import { CheerioCrawler } from 'crawlee';
const MAX_DEPTH = 3;
const crawler = new CheerioCrawler({
async requestHandler({ request, enqueueLinks }) {
const depth = request.userData.depth || 0;
console.log(`Processing ${request.url} at depth ${depth}`);
// Only enqueue links if we haven't reached max depth
if (depth < MAX_DEPTH) {
await enqueueLinks({
transformRequestFunction: (req) => {
req.userData = {
...req.userData,
depth: depth + 1,
};
return req;
},
});
}
},
});
await crawler.run([
{ url: 'https://example.com', userData: { depth: 0 } }
]);
Best Practices for Link Enqueueing
1. Use Specific Selectors
Always use the most specific CSS selectors possible to avoid enqueueing unwanted links:
// Good - specific selector
await enqueueLinks({ selector: 'main article a.read-more' });
// Bad - too broad
await enqueueLinks({ selector: 'a' });
2. Implement URL Filtering
Use globs or regexps to ensure you only crawl relevant pages:
await enqueueLinks({
globs: ['https://example.com/blog/**'],
exclude: ['**/*?page=*', '**/tag/**'],
});
3. Add Meaningful Labels
Label your requests to make routing and debugging easier:
await enqueueLinks({
selector: 'a.category',
label: 'CATEGORY',
});
4. Monitor Queue Size
Keep track of your queue to prevent memory issues:
const requestQueue = await crawler.getRequestQueue();
const queueSize = await requestQueue.getInfo();
console.log(`Queue has ${queueSize.totalRequestCount} total requests`);
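If you'd rather enforce a hard cap up front than watch the queue grow, the maxRequestsPerCrawl option stops the crawler once a given number of requests has been handled (500 here is an arbitrary example):
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // Stop after 500 requests have been handled, no matter how many links were enqueued
    maxRequestsPerCrawl: 500,
    async requestHandler({ enqueueLinks }) {
        await enqueueLinks();
    },
});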
Handling AJAX-Loaded Content
When dealing with dynamically loaded content, you may need to wait for the relevant elements to appear before enqueueing links:
import { PlaywrightCrawler } from 'crawlee';
const crawler = new PlaywrightCrawler({
async requestHandler({ request, page, enqueueLinks }) {
// Wait for dynamic content to load
await page.waitForSelector('.dynamic-links', { timeout: 5000 });
// Now enqueue the dynamically loaded links
await enqueueLinks({
selector: '.dynamic-links a',
});
},
});
Conclusion
Enqueueing links in Crawlee is a flexible and powerful feature that forms the backbone of any web scraping project. Whether you're using the convenient enqueueLinks() method with selectors and filters, manually managing the request queue, or implementing complex routing logic, Crawlee provides all the tools you need to build efficient and maintainable crawlers.
The key is to start simple with basic link enqueueing and gradually add filtering, labeling, and depth control as your scraping needs become more sophisticated. By following the best practices outlined in this guide, you'll be able to build robust crawlers that efficiently discover and process web content.