Does Crawlee Support Session Management for Authenticated Scraping?
Yes, Crawlee provides robust built-in support for session management through its SessionPool feature. This powerful capability makes authenticated web scraping significantly easier by automatically managing cookies, user agents, proxy rotation, and session state across multiple requests.
Session management is crucial when scraping websites that require login credentials, maintain user state, or implement anti-bot measures that track user behavior across multiple requests.
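If you only need the defaults, enabling session support takes just a couple of options. A minimal sketch, assuming Crawlee 3.x, where useSessionPool and persistCookiesPerSession are already enabled by default for browser crawlers:

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    useSessionPool: true,           // rotate requests across a pool of sessions
    persistCookiesPerSession: true, // keep each session's cookies between requests
    async requestHandler({ request, session, log }) {
        log.info(`Fetched ${request.url} with session ${session.id}`);
    },
});

await crawler.run(['https://example.com']);

The sections below cover tuning the pool beyond these defaults.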
Understanding Crawlee's SessionPool
The SessionPool in Crawlee is a sophisticated session management system that handles:
- Cookie persistence across requests
- Automatic session rotation when sessions become blocked or expired
- User agent management to avoid detection
- Proxy rotation integrated with session state
- Session retirement based on error rates and usage patterns
- Concurrent session limits to prevent overloading the target site
Basic Session Management Setup
Here's how to implement basic session management in Crawlee for authenticated scraping:
JavaScript/TypeScript Example
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    // The crawler opens and manages its own session pool; a standalone
    // pool is also available via SessionPool.open() if you need one
    // outside a crawler.
    sessionPoolOptions: {
        maxPoolSize: 20, // Maximum number of sessions
        sessionOptions: {
            maxAgeSecs: 3600,  // Session expires after 1 hour
            maxUsageCount: 50, // Retire a session after 50 uses
            maxErrorScore: 3,  // Retire a session after 3 errors
        },
    },
    async requestHandler({ request, page, session, log, pushData }) {
        // The session is managed automatically
        log.info(`Processing ${request.url} with session ${session.id}`);

        // Cookies are persisted across requests automatically
        const cookies = await page.context().cookies();
        log.info(`Current cookies: ${JSON.stringify(cookies)}`);

        // Extract data as needed
        const data = await page.evaluate(() => ({
            title: document.title,
            content: document.body.innerText,
        }));

        // Mark the session as good when the request succeeds,
        // then save the data (returning it from the handler does nothing)
        session.markGood();
        await pushData(data);
    },
    failedRequestHandler({ request, session, log }, error) {
        // Mark the session as bad on errors
        session?.markBad();
        log.error(`Request failed for session ${session?.id}: ${error}`);
    },
});

await crawler.run(['https://example.com']);
Python Example
# Import paths and keyword names follow the current crawlee for Python
# layout; older releases exposed these classes from crawlee.playwright_crawler.
import asyncio
from datetime import timedelta

from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext
from crawlee.sessions import SessionPool


async def main() -> None:
    # Create the crawler with a custom session pool
    crawler = PlaywrightCrawler(
        session_pool=SessionPool(
            max_pool_size=20,
            create_session_settings={
                'max_age': timedelta(hours=1),  # Session expires after 1 hour
                'max_usage_count': 50,          # Retire a session after 50 uses
                'max_error_score': 3,           # Retire a session after 3 errors
            },
        ),
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        # Access the session associated with this request
        session = context.session
        context.log.info(f'Processing {context.request.url} with session {session.id}')

        # Extract data with session persistence
        title = await context.page.title()

        # Mark the session as successful
        session.mark_good()

        # Save the data
        await context.push_data({
            'url': context.request.url,
            'title': title,
            'session_id': session.id,
        })

    await crawler.run(['https://example.com'])


if __name__ == '__main__':
    asyncio.run(main())
Implementing Login Authentication
For websites requiring login, you can combine session management with a login flow that runs in a pre-navigation hook before crawling begins:
JavaScript Login Example
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    sessionPoolOptions: {
        maxPoolSize: 10,
        sessionOptions: {
            maxAgeSecs: 7200, // Keep each session for 2 hours
        },
    },
    // Perform the login before crawling
    preNavigationHooks: [
        async ({ page, session, log }) => {
            // Check whether this session still needs to authenticate
            if (!session.userData.isLoggedIn) {
                log.info(`Logging in with session ${session.id}`);

                // Navigate to the login page
                await page.goto('https://example.com/login');

                // Fill in credentials (keep real ones in environment
                // variables, never in source code)
                await page.fill('input[name="username"]', 'your_username');
                await page.fill('input[name="password"]', 'your_password');

                // Submit the form and wait for the post-login page to load;
                // waitForLoadState avoids the race that page.waitForNavigation()
                // (now deprecated in Playwright) can introduce after a click
                await page.click('button[type="submit"]');
                await page.waitForLoadState('networkidle');

                // Mark this session as logged in
                session.userData.isLoggedIn = true;
                log.info(`Login successful for session ${session.id}`);
            }
        },
    ],
    async requestHandler({ page, request, session, log, pushData }) {
        log.info(`Scraping authenticated page: ${request.url}`);

        // Check that we are still authenticated
        const isAuthenticated = await page.evaluate(() => {
            return document.querySelector('.user-profile') !== null;
        });

        if (!isAuthenticated) {
            // The session expired; retire it to trigger rotation
            session.retire();
            throw new Error('Session expired, retrying with a new session');
        }

        // Extract data from the authenticated page
        const data = await page.evaluate(() => ({
            title: document.title,
            userData: document.querySelector('.user-profile')?.textContent,
        }));

        session.markGood();
        await pushData(data);
    },
});

await crawler.run(['https://example.com/dashboard']);
Advanced Session Management Features
Custom Session Data Storage
You can store custom data within sessions for complex authentication scenarios:
async requestHandler({ page, session, log }) {
    // Store custom data in the session
    if (!session.userData.authToken) {
        const token = await page.evaluate(() => {
            return localStorage.getItem('authToken');
        });
        session.userData.authToken = token;
        log.info(`Stored auth token in session ${session.id}`);
    }

    // Reuse the stored authentication data on later requests
    // (this runs after page load; see the init-script sketch below for
    // setting the token before the page's own scripts execute)
    await page.evaluate((token) => {
        localStorage.setItem('authToken', token);
    }, session.userData.authToken);
}
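Note that writing to localStorage after navigation is too late if the application reads the token while the page is loading. A hedged alternative using Playwright's addInitScript inside a pre-navigation hook, so the value is in place before any page script runs (the authToken key is carried over from the snippet above):

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    preNavigationHooks: [
        async ({ page, session }) => {
            const token = session.userData.authToken;
            if (token) {
                // Registered before navigation; runs before the page's own scripts
                await page.addInitScript((t) => {
                    localStorage.setItem('authToken', t);
                }, token);
            }
        },
    ],
    async requestHandler({ request, log }) {
        log.info(`Loaded ${request.url} with the auth token pre-set`);
    },
});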
Session Retirement and Rotation
Crawlee automatically retires sessions based on error rates, but you can also manually control this:
async requestHandler({ page, session, log }) {
    try {
        // Check for signs that the session has been detected
        const isBlocked = await page.evaluate(() => {
            return document.body.innerText.includes('Access Denied');
        });

        if (isBlocked) {
            // Force-retire this session
            session.retire();
            throw new Error('Session blocked, rotating to a new session');
        }

        // Process normally
        session.markGood();
    } catch (error) {
        // Increase the session's error score
        session.markBad();
        throw error;
    }
}
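If you prefer not to hand-roll blocking checks, recent Crawlee versions also expose a retryOnBlocked option that detects common anti-bot responses and retries with a fresh session. A minimal sketch, assuming Crawlee 3.x:

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    // Detect common blocking signals (e.g. challenge pages, blocked
    // status codes) and retry the request with a new session
    retryOnBlocked: true,
    sessionPoolOptions: { maxPoolSize: 20 },
    async requestHandler({ request, log }) {
        log.info(`Successfully fetched ${request.url}`);
    },
});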
Integrating with Proxy Rotation
Sessions work seamlessly with proxy rotation to maintain anonymity:
import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';

// proxyConfiguration must be a ProxyConfiguration instance,
// not a plain options object
const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        'http://proxy1.example.com:8000',
        'http://proxy2.example.com:8000',
        'http://proxy3.example.com:8000',
    ],
});

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    sessionPoolOptions: {
        maxPoolSize: 15,
        sessionOptions: {
            maxAgeSecs: 3600,
        },
    },
    async requestHandler({ page, session, proxyInfo, log, pushData }) {
        log.info(`Using proxy ${proxyInfo.url} with session ${session.id}`);

        // Each session keeps its own cookies with its assigned proxy
        const data = await page.evaluate(() => ({
            title: document.title,
            ipAddress: document.querySelector('.ip-display')?.textContent,
        }));

        session.markGood();
        await pushData(data);
    },
});

await crawler.run(['https://example.com']);
Handling Session Cookies Manually
While Crawlee handles cookies automatically, you can also manage them manually when needed:
async requestHandler({ page, session, log }) {
    // Restore cookies saved by a previous request, if any
    if (session.userData.savedCookies) {
        await page.context().addCookies(session.userData.savedCookies);
    }

    // Get the current cookies and store them under the same key
    const cookies = await page.context().cookies();
    session.userData.savedCookies = cookies;

    // Set a specific cookie manually
    await page.context().addCookies([{
        name: 'session_token',
        value: 'your_token_here',
        domain: 'example.com',
        path: '/',
        httpOnly: true,
        secure: true,
    }]);
}
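The SessionPool also snapshots its state, including each session's cookie jar, to a key-value store, so sessions can survive crawler restarts. A sketch of the relevant options, assuming Crawlee 3.x option names; the store ID and key below are illustrative, not defaults:

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    sessionPoolOptions: {
        maxPoolSize: 20,
        // Hypothetical store ID and key chosen for this example
        persistStateKeyValueStoreId: 'my-session-store',
        persistStateKey: 'SESSION_POOL_STATE',
    },
    async requestHandler({ session, log }) {
        log.info(`Using session ${session.id} from the persisted pool`);
    },
});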
Best Practices for Session Management
- Set appropriate session limits: Configure maxPoolSize based on the target website's capacity and your scraping needs.
- Use reasonable session lifetimes: Set maxAgeSecs to balance session reuse against stale sessions.
- Monitor session health: Implement proper error handling and use markGood() and markBad() to help Crawlee identify problematic sessions.
- Implement graceful degradation: Handle session expiration gracefully and allow automatic rotation.
- Store minimal session data: Keep only the necessary authentication data in session.userData to keep memory usage low.
- Test session persistence: Verify that sessions maintain state correctly across multiple requests.
- Combine with rate limiting: Use Crawlee's maxConcurrency and maxRequestsPerMinute options to avoid overwhelming the target site (a sketch follows this list).
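For the last item, here is a minimal sketch combining session limits with Crawlee's built-in throttling options; the values are illustrative:

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    sessionPoolOptions: { maxPoolSize: 10 },
    maxConcurrency: 5,        // at most 5 requests in flight at once
    maxRequestsPerMinute: 60, // cap the overall request rate
    async requestHandler({ request, session, log }) {
        log.info(`Processed ${request.url} with session ${session.id}`);
        session.markGood();
    },
});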
Monitoring Session Performance
Track session statistics to optimize your scraping configuration:
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    sessionPoolOptions: { maxPoolSize: 20 },
    async requestHandler({ session, log }) {
        // Log session statistics
        log.info(`Session ${session.id} stats:`, {
            usageCount: session.usageCount,
            errorScore: session.errorScore,
            maxErrorScore: session.maxErrorScore,
            ageMillis: Date.now() - session.createdAt.getTime(),
        });

        // Process the request
        session.markGood();
    },
});
Conclusion
Crawlee's SessionPool provides a comprehensive solution for authenticated web scraping with automatic cookie management, session rotation, and error handling. By leveraging these features, you can build robust scrapers that maintain authentication state across thousands of requests while avoiding detection and handling failures gracefully.
The session management system integrates seamlessly with Crawlee's other features like proxy rotation, request queueing, and browser automation, making it an excellent choice for complex authenticated scraping scenarios.