How Do I Handle Cookies and Sessions in Crawlee?

Handling cookies and sessions is essential for web scraping scenarios that require authentication, maintaining state across multiple requests, or simulating realistic user behavior. Crawlee provides built-in session management that automatically handles cookie persistence, session rotation, and proxy integration, making it easier to scrape websites that require login or track user sessions.

Understanding Session Management in Crawlee

Crawlee's session management is designed to help you maintain persistent state across multiple requests while handling common challenges like:

  • Cookie persistence - Automatically storing and reusing cookies across requests
  • Session rotation - Distributing requests across multiple sessions to avoid rate limiting
  • Proxy integration - Combining sessions with proxy rotation for better anonymity
  • Session retirement - Automatically retiring sessions that encounter errors or blocks

The SessionPool class is at the heart of Crawlee's session management, providing automatic session creation, rotation, and lifecycle management.
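
If you want to see the pool in action outside a crawler, you can open one directly. Here is a minimal sketch using SessionPool.open() and the session lifecycle methods (the pool size is arbitrary):

import { SessionPool } from 'crawlee';

// Open a standalone pool (inside a crawler this happens automatically)
const sessionPool = await SessionPool.open({ maxPoolSize: 25 });

// Retrieve a session; the pool creates or rotates sessions as needed
const session = await sessionPool.getSession();

// Report request outcomes so the pool can score and retire sessions
session.markGood();   // lowers the error score after a successful request
// session.markBad(); // raises the error score after a failure
// session.retire();  // removes the session from rotation immediately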

Basic Cookie Handling with CheerioCrawler

For simple cookie handling with static websites, you can use CheerioCrawler with session management enabled:

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    useSessionPool: true,
    persistCookiesPerSession: true,
    maxRequestRetries: 2,
    sessionPoolOptions: {
        maxPoolSize: 10,
        sessionOptions: {
            maxUsageCount: 50,
            maxErrorScore: 3,
        },
    },
    async requestHandler({ request, $, session }) {
        console.log(`Processing ${request.url} with session ${session.id}`);

        // The session automatically handles cookies
        // Extract data as needed
        const title = $('title').text();
        console.log(`Title: ${title}`);

        // Session cookies are automatically persisted
        console.log(`Session cookies: ${JSON.stringify(session.getCookies(request.url))}`);
    },
});

await crawler.run(['https://example.com']);

Advanced Cookie Management with PlaywrightCrawler

For websites that require JavaScript execution or more complex authentication flows, use PlaywrightCrawler, which ties session management into the browser context:

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    useSessionPool: true,
    persistCookiesPerSession: true,
    sessionPoolOptions: {
        maxPoolSize: 5,
    },
    async requestHandler({ request, page, session }) {
        console.log(`Session ID: ${session.id}`);

        // Run the login flow when the request targets the login page;
        // the resulting cookies are stored on the session and reused afterwards
        if (request.url.includes('login')) {
            await page.fill('input[name="username"]', 'your-username');
            await page.fill('input[name="password"]', 'your-password');
            // Start waiting for the navigation before clicking to avoid a race
            await Promise.all([
                page.waitForNavigation(),
                page.click('button[type="submit"]'),
            ]);

            // Cookies are automatically stored in the session
            console.log('Login successful, cookies stored');
        }

        // Extract protected data
        const content = await page.content();
        console.log(`Scraped ${content.length} characters`);
    },
});

await crawler.run([
    'https://example.com/login',
    'https://example.com/protected-page-1',
    'https://example.com/protected-page-2',
]);

Manual Cookie Management

Sometimes you need fine-grained control over cookies. Here's how to manually set, get, and delete cookies:

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    useSessionPool: true,
    persistCookiesPerSession: true,
    async requestHandler({ page, session }) {
        // Set custom cookies
        const cookies = [
            {
                name: 'auth_token',
                value: 'your-token-here',
                domain: '.example.com',
                path: '/',
                expires: Math.floor(Date.now() / 1000) + 86400, // 1 day; Playwright expects Unix time in seconds
                httpOnly: true,
                secure: true,
                sameSite: 'Lax',
            },
        ];

        await page.context().addCookies(cookies);

        // Navigate to page with cookies
        await page.goto('https://example.com/dashboard');

        // Get all cookies from the browser
        const browserCookies = await page.context().cookies();
        console.log('Browser cookies:', browserCookies);

        // Store cookies in the session for reuse (one call accepts the whole array)
        session.setCookies(browserCookies, 'https://example.com');

        // Retrieve cookies from session
        const sessionCookies = session.getCookies('https://example.com');
        console.log('Session cookies:', sessionCookies);
    },
});

await crawler.run(['https://example.com']);

Session Pool Configuration

The session pool is configured through the crawler's sessionPoolOptions, which control how sessions are created, rotated, and retired:

import { CheerioCrawler, Session } from 'crawlee';

const crawler = new CheerioCrawler({
    useSessionPool: true,
    persistCookiesPerSession: true,
    sessionPoolOptions: {
        maxPoolSize: 20, // Maximum number of sessions
        sessionOptions: {
            maxAgeSecs: 3600, // Session expires after 1 hour
            maxUsageCount: 100, // Max requests per session
            maxErrorScore: 5, // Retire session after 5 errors
        },
        createSessionFunction: (sessionPool) => {
            // Custom session creation logic
            const session = new Session({ sessionPool });

            // Pre-configure cookies for each new session
            session.setCookies([{
                name: 'initial_cookie',
                value: 'initial_value',
                domain: '.example.com',
            }], 'https://example.com');

            return session;
        },
    },
    async requestHandler({ request, session }) {
        console.log(`Session ${session.id} usage: ${session.usageCount}`);
        console.log(`Session error score: ${session.errorScore}`);

        // Your scraping logic here
    },
});

await crawler.run(['https://example.com']);

Handling Authentication with Sessions

Here's a complete example of handling login authentication and maintaining sessions across multiple pages:

import { PlaywrightCrawler, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
    useSessionPool: true,
    persistCookiesPerSession: true,
    sessionPoolOptions: {
        maxPoolSize: 3,
    },
    preNavigationHooks: [
        async ({ page, request, session }) => {
            // Check if session is already authenticated
            const cookies = session.getCookies(request.url);
            const isAuthenticated = cookies.some(c => c.name === 'auth_token');

            if (!isAuthenticated && !request.url.includes('login')) {
                // Redirect to login page first
                console.log('Session not authenticated, logging in...');
                await page.goto('https://example.com/login');

                await page.fill('#username', 'user@example.com');
                await page.fill('#password', 'securePassword123');
                await page.click('button[type="submit"]');

                // Wait for authentication to complete
                await page.waitForURL('**/dashboard', { timeout: 10000 });

                // Store authentication cookies in the session in one call
                const authCookies = await page.context().cookies();
                session.setCookies(authCookies, request.url);

                console.log('Authentication successful');
            }
        },
    ],
    async requestHandler({ request, page, session }) {
        console.log(`Scraping ${request.url} with session ${session.id}`);

        // Scrape protected content
        const data = await page.evaluate(() => {
            return {
                title: document.querySelector('h1')?.textContent,
                userData: document.querySelector('.user-profile')?.textContent,
            };
        });

        await Dataset.pushData({
            url: request.url,
            sessionId: session.id,
            ...data,
        });
    },
});

await crawler.run([
    'https://example.com/dashboard',
    'https://example.com/profile',
    'https://example.com/settings',
]);

Python: Cookie and Session Management in Crawlee

Crawlee for Python also supports session management with similar functionality; the sketch below assumes a recent crawlee release, where crawlers live in crawlee.crawlers and pool settings are passed as a SessionPool instance:

import asyncio

from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext
from crawlee.sessions import SessionPool

crawler = PlaywrightCrawler(
    use_session_pool=True,
    session_pool=SessionPool(
        max_pool_size=5,
        create_session_settings={
            'max_usage_count': 50,
            'max_error_score': 3,
        },
    ),
)

@crawler.router.default_handler
async def request_handler(context: PlaywrightCrawlingContext) -> None:
    page = context.page
    session = context.session

    print(f'Processing with session {session.id}')

    # Handle authentication
    if 'login' in context.request.url:
        await page.fill('input[name="username"]', 'your-username')
        await page.fill('input[name="password"]', 'your-password')
        await page.click('button[type="submit"]')
        await page.wait_for_url('**/dashboard')
        print('Login successful')

    # Inspect the cookies stored on the session
    print(f'Session cookies: {session.cookies}')

    # Extract data
    title = await page.title()
    await context.push_data({'title': title, 'session_id': session.id})

asyncio.run(crawler.run([
    'https://example.com/login',
    'https://example.com/protected-page',
]))

Session Persistence and Storage

Crawlee can persist sessions to disk, allowing you to resume scraping across multiple runs:

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    useSessionPool: true,
    persistCookiesPerSession: true,
    // Session pool state is persisted to a key-value store
    // (by default under ./storage/key_value_stores)
    sessionPoolOptions: {
        persistStateKeyValueStoreId: 'my-session-store',
        maxPoolSize: 10,
    },
    async requestHandler({ request, page, session }) {
        console.log(`Using session ${session.id}`);
        // Your scraping logic
    },
});

await crawler.run(['https://example.com']);
// Sessions are saved and can be reused in the next run
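
Note that Crawlee purges its local storage at startup by default, which would discard the persisted pool state. A minimal sketch of keeping it between runs, using the Configuration API (setting the CRAWLEE_PURGE_ON_START=false environment variable has the same effect):

import { PlaywrightCrawler, Configuration } from 'crawlee';

// Disable purge-on-start so persisted session state survives restarts
const config = new Configuration({ purgeOnStart: false });

const crawler = new PlaywrightCrawler({
    useSessionPool: true,
    persistCookiesPerSession: true,
    async requestHandler({ session }) {
        console.log(`Using session ${session.id}`);
    },
}, config);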

Best Practices for Cookie and Session Management

  1. Enable session pooling - Always use useSessionPool: true for sites requiring authentication
  2. Configure appropriate limits - Set maxPoolSize, maxUsageCount, and maxErrorScore based on your needs
  3. Handle session retirement - Monitor session.errorScore and handle blocked sessions gracefully (see the sketch after this list)
  4. Combine with proxy rotation - Use sessions with proxies for better anonymity and rate limit avoidance
  5. Persist cookies carefully - Only enable persistCookiesPerSession when necessary to avoid stale data
  6. Test authentication flows - Verify that your authentication handling works correctly before scaling up
  7. Monitor session health - Log session metrics like usageCount and errorScore to detect issues early
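
For points 3 and 7, a little instrumentation inside the request handler goes a long way. Here is a minimal sketch; the 403 check is just one example of a blocking signal, so adjust it to whatever your target site returns when it blocks you:

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    useSessionPool: true,
    async requestHandler({ request, response, session, log }) {
        // Log session health metrics to detect issues early
        log.info(`Session ${session.id}: usageCount=${session.usageCount}`);

        if (response.statusCode === 403) {
            // Treat a 403 as a block: retire the session so it leaves rotation
            session.retire();
            throw new Error(`Session ${session.id} blocked on ${request.url}`);
        }

        // Report success so the session's error score decreases
        session.markGood();
    },
});

Because the handler throws on a block, Crawlee retries the request, and the retired session is replaced with a fresh one from the pool.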

Common Pitfalls and Solutions

Sessions Getting Blocked

If sessions are frequently getting blocked, increase the session pool size and reduce usage per session:

sessionPoolOptions: {
    maxPoolSize: 20, // More sessions
    sessionOptions: {
        maxUsageCount: 10, // Fewer requests per session
        maxErrorScore: 2, // Retire sessions quickly if blocked
    },
}

Cookies Not Persisting

Ensure both useSessionPool and persistCookiesPerSession are enabled:

const crawler = new PlaywrightCrawler({
    useSessionPool: true,
    persistCookiesPerSession: true,
    // ... rest of configuration
});

Authentication Failing

Use pre-navigation hooks to ensure authentication happens before scraping protected pages:

preNavigationHooks: [
    async ({ page, session, request }) => {
        const isAuthenticated = session.getCookies(request.url)
            .some(c => c.name === 'session_id');

        if (!isAuthenticated) {
            await performLogin(page, session);
        }
    },
]
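
The performLogin call above stands in for your own helper; it is not a Crawlee API. A minimal sketch, assuming a simple form at /login and placeholder selectors and credentials:

// Hypothetical helper: the URL, selectors, and credentials are placeholders
async function performLogin(page, session) {
    await page.goto('https://example.com/login');
    await page.fill('#username', 'user@example.com');
    await page.fill('#password', 'securePassword123');
    await page.click('button[type="submit"]');
    await page.waitForURL('**/dashboard');

    // Copy the browser's cookies into the session so later requests reuse them
    const cookies = await page.context().cookies();
    session.setCookies(cookies, 'https://example.com');
}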

Conclusion

Crawlee's session management system provides a robust foundation for handling cookies and maintaining state across web scraping sessions. By leveraging SessionPool, you can build scalable scrapers that handle authentication, cookie persistence, and session rotation automatically. Whether you're scraping static sites with CheerioCrawler or dynamic JavaScript applications with PlaywrightCrawler, Crawlee's session management adapts to your needs while maintaining reliability and performance.

For more advanced scenarios involving complex user interactions and AJAX requests, combine session management with Crawlee's powerful browser automation capabilities to create sophisticated web scraping solutions.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
