# How do I Handle Cookies and Sessions in Crawlee?
Handling cookies and sessions is essential for web scraping scenarios that require authentication, maintaining state across multiple requests, or simulating realistic user behavior. Crawlee provides built-in session management that automatically handles cookie persistence, session rotation, and proxy integration, making it easier to scrape websites that require login or track user state.
## Understanding Session Management in Crawlee
Crawlee's session management is designed to help you maintain persistent state across multiple requests while handling common challenges like:
- **Cookie persistence**: automatically storing and reusing cookies across requests
- **Session rotation**: distributing requests across multiple sessions to avoid rate limiting
- **Proxy integration**: combining sessions with proxy rotation for better anonymity
- **Session retirement**: automatically retiring sessions that encounter errors or blocks
The `SessionPool` class is at the heart of Crawlee's session management, providing automatic session creation, rotation, and lifecycle management.
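If you want to see the pool in isolation, here is a minimal sketch of using `SessionPool` directly (normally the pool is created for you when `useSessionPool` is enabled on a crawler):

```javascript
import { SessionPool } from 'crawlee';

// Open a pool and borrow a session from it
const sessionPool = await SessionPool.open({ maxPoolSize: 25 });
const session = await sessionPool.getSession();

// Report request outcomes so the pool can score and rotate sessions
session.markGood();    // the request succeeded
// session.markBad();  // the request failed or looked blocked
// session.retire();   // remove the session from rotation entirely
```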
## Basic Cookie Handling with CheerioCrawler

For simple cookie handling on static websites, you can use `CheerioCrawler` with session management enabled:
```javascript
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    useSessionPool: true,
    persistCookiesPerSession: true,
    maxRequestRetries: 2,
    sessionPoolOptions: {
        maxPoolSize: 10,
        sessionOptions: {
            maxUsageCount: 50,
            maxErrorScore: 3,
        },
    },
    async requestHandler({ request, $, session }) {
        console.log(`Processing ${request.url} with session ${session.id}`);

        // The session automatically handles cookies
        const title = $('title').text();
        console.log(`Title: ${title}`);

        // Cookies received for this URL are persisted on the session;
        // getCookies() expects the URL whose cookies you want
        console.log(`Session cookies: ${JSON.stringify(session.getCookies(request.url))}`);
    },
});

await crawler.run(['https://example.com']);
```
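With both options enabled, cookies returned by the server are stored on the session after each response and sent back on every subsequent request handled by that session, so state such as a login survives across requests without any extra code.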
## Advanced Cookie Management with PlaywrightCrawler

For websites that require JavaScript execution or more complex authentication flows, use `PlaywrightCrawler`, which integrates seamlessly with the browser's cookie handling:
```javascript
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    useSessionPool: true,
    persistCookiesPerSession: true,
    sessionPoolOptions: {
        maxPoolSize: 5,
    },
    async requestHandler({ request, page, session }) {
        console.log(`Session ID: ${session.id}`);

        // Log in when the crawler visits the login page
        if (request.url.includes('login')) {
            await page.fill('input[name="username"]', 'your-username');
            await page.fill('input[name="password"]', 'your-password');
            // Start waiting for the navigation before clicking, to avoid a race
            await Promise.all([
                page.waitForNavigation(),
                page.click('button[type="submit"]'),
            ]);

            // Cookies are automatically stored in the session
            console.log('Login successful, cookies stored');
        }

        // Extract protected data
        const content = await page.content();
        console.log(`Scraped ${content.length} characters`);
    },
});

await crawler.run([
    'https://example.com/login',
    'https://example.com/protected-page-1',
    'https://example.com/protected-page-2',
]);
```
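Note that this pattern assumes the login URL is processed before the protected pages; with concurrency and multiple sessions that ordering is not guaranteed, which is why the pre-navigation hook approach shown later in this article is more robust.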
## Manual Cookie Management
Sometimes you need fine-grained control over cookies. Here's how to manually set, get, and delete cookies:
```javascript
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    useSessionPool: true,
    persistCookiesPerSession: true,
    async requestHandler({ page, session }) {
        // Set custom cookies (Playwright expects `expires` in seconds, not milliseconds)
        const cookies = [
            {
                name: 'auth_token',
                value: 'your-token-here',
                domain: '.example.com',
                path: '/',
                expires: Math.floor(Date.now() / 1000) + 86400, // 1 day
                httpOnly: true,
                secure: true,
                sameSite: 'Lax',
            },
        ];
        await page.context().addCookies(cookies);

        // Navigate to a page with the cookies applied
        await page.goto('https://example.com/dashboard');

        // Get all cookies from the browser
        const browserCookies = await page.context().cookies();
        console.log('Browser cookies:', browserCookies);

        // Store the cookies on the session for reuse
        session.setCookies(browserCookies, 'https://example.com');

        // Retrieve cookies from the session
        const sessionCookies = session.getCookies('https://example.com');
        console.log('Session cookies:', sessionCookies);
    },
});

await crawler.run(['https://example.com']);
```
## Session Pool Configuration

The `SessionPool` is configured through the crawler's `sessionPoolOptions`, which control how sessions are created, rotated, and retired:
```javascript
import { CheerioCrawler, Session } from 'crawlee';

const crawler = new CheerioCrawler({
    useSessionPool: true,
    persistCookiesPerSession: true,
    sessionPoolOptions: {
        maxPoolSize: 20, // Maximum number of sessions
        sessionOptions: {
            maxAgeSecs: 3600, // Session expires after 1 hour
            maxUsageCount: 100, // Max requests per session
            maxErrorScore: 5, // Retire session after 5 errors
        },
        createSessionFunction: (sessionPool) => {
            // Custom session creation logic
            const session = new Session({ sessionPool });

            // Pre-configure cookies for each new session
            session.setCookies([{
                name: 'initial_cookie',
                value: 'initial_value',
                domain: '.example.com',
            }], 'https://example.com');

            return session;
        },
    },
    async requestHandler({ request, session }) {
        console.log(`Session ${session.id} usage: ${session.usageCount}`);
        console.log(`Session error score: ${session.errorScore}`);
        // Your scraping logic here
    },
});

await crawler.run(['https://example.com']);
```
## Handling Authentication with Sessions
Here's a complete example of handling login authentication and maintaining sessions across multiple pages:
```javascript
import { PlaywrightCrawler, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
    useSessionPool: true,
    persistCookiesPerSession: true,
    sessionPoolOptions: {
        maxPoolSize: 3,
    },
    preNavigationHooks: [
        async ({ page, request, session }) => {
            // Check whether this session is already authenticated
            const cookies = session.getCookies(request.url);
            const isAuthenticated = cookies.some((c) => c.name === 'auth_token');

            if (!isAuthenticated && !request.url.includes('login')) {
                // Visit the login page first
                console.log('Session not authenticated, logging in...');
                await page.goto('https://example.com/login');
                await page.fill('#username', 'user@example.com');
                await page.fill('#password', 'securePassword123');
                await page.click('button[type="submit"]');

                // Wait for authentication to complete
                await page.waitForURL('**/dashboard', { timeout: 10000 });

                // Store the authentication cookies on the session
                const authCookies = await page.context().cookies();
                session.setCookies(authCookies, request.url);
                console.log('Authentication successful');
            }
        },
    ],
    async requestHandler({ request, page, session }) {
        console.log(`Scraping ${request.url} with session ${session.id}`);

        // Scrape protected content
        const data = await page.evaluate(() => ({
            title: document.querySelector('h1')?.textContent,
            userData: document.querySelector('.user-profile')?.textContent,
        }));

        await Dataset.pushData({
            url: request.url,
            sessionId: session.id,
            ...data,
        });
    },
});

await crawler.run([
    'https://example.com/dashboard',
    'https://example.com/profile',
    'https://example.com/settings',
]);
```
## Python: Cookie and Session Management in Crawlee

Crawlee for Python supports session management as well. The example below is a sketch following the Python API, which differs from the JavaScript one; option names such as `create_session_settings` may vary between versions, so check the documentation for the version you use:
```python
import asyncio

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext
from crawlee.sessions import SessionPool


async def main() -> None:
    crawler = PlaywrightCrawler(
        use_session_pool=True,
        session_pool=SessionPool(
            max_pool_size=5,
            # Settings applied to each newly created session
            create_session_settings={
                'max_usage_count': 50,
                'max_error_score': 3,
            },
        ),
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        page = context.page
        session = context.session
        print(f'Processing with session {session.id}')

        # Handle authentication
        if 'login' in context.request.url:
            await page.fill('input[name="username"]', 'your-username')
            await page.fill('input[name="password"]', 'your-password')
            await page.click('button[type="submit"]')
            await page.wait_for_url('**/dashboard')
            print('Login successful')

        # Inspect the cookies stored on the session
        print(f'Session cookies: {session.cookies}')

        # Extract data
        title = await page.title()
        await context.push_data({'title': title, 'session_id': session.id})

    await crawler.run([
        'https://example.com/login',
        'https://example.com/protected-page',
    ])


asyncio.run(main())
```
## Session Persistence and Storage
Crawlee can persist sessions to disk, allowing you to resume scraping across multiple runs:
```javascript
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    useSessionPool: true,
    persistCookiesPerSession: true,
    sessionPoolOptions: {
        // Persist the session pool state to a named key-value store
        persistStateKeyValueStoreId: 'my-session-store',
        maxPoolSize: 10,
    },
    async requestHandler({ request, page, session }) {
        console.log(`Using session ${session.id}`);
        // Your scraping logic
    },
});

await crawler.run(['https://example.com']);
// Session state is saved and can be reused in the next run
```
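Crawlee purges the default storage at the start of each run, but named key-value stores such as `my-session-store` above are left intact, which is what makes the pool state (including cookies) reusable across runs.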
## Best Practices for Cookie and Session Management
- **Enable session pooling**: always use `useSessionPool: true` for sites requiring authentication
- **Configure appropriate limits**: set `maxPoolSize`, `maxUsageCount`, and `maxErrorScore` based on your needs
- **Handle session retirement**: monitor `session.errorScore` and handle blocked sessions gracefully
- **Combine with proxy rotation**: use sessions with proxies for better anonymity and rate-limit avoidance (see the sketch after this list)
- **Persist cookies carefully**: only enable `persistCookiesPerSession` when necessary, to avoid stale data
- **Test authentication flows**: verify that your authentication handling works correctly before scaling up
- **Monitor session health**: log session metrics like `usageCount` and `errorScore` to detect issues early
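As mentioned in the list above, here is a minimal sketch of combining the session pool with proxy rotation; the proxy URLs are placeholders for your own provider's:

```javascript
import { CheerioCrawler, ProxyConfiguration } from 'crawlee';

// Placeholder proxy URLs - substitute your provider's
const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        'http://proxy-1.example.com:8000',
        'http://proxy-2.example.com:8000',
    ],
});

const crawler = new CheerioCrawler({
    proxyConfiguration,
    useSessionPool: true,
    persistCookiesPerSession: true,
    async requestHandler({ request, session }) {
        // Crawlee keeps each session paired with a proxy, so the
        // IP address and cookies stay consistent within a session
        console.log(`Session ${session.id}, error score: ${session.errorScore}`);
        // ... your scraping logic
    },
});

await crawler.run(['https://example.com']);
```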
## Common Pitfalls and Solutions

### Sessions Getting Blocked

If sessions are frequently blocked, increase the session pool size and reduce the number of requests per session:
```javascript
sessionPoolOptions: {
    maxPoolSize: 20, // More sessions
    sessionOptions: {
        maxUsageCount: 10, // Fewer requests per session
        maxErrorScore: 2, // Retire sessions quickly if blocked
    },
}
```
### Cookies Not Persisting

Ensure both `useSessionPool` and `persistCookiesPerSession` are enabled:
```javascript
const crawler = new PlaywrightCrawler({
    useSessionPool: true,
    persistCookiesPerSession: true,
    // ... rest of the configuration
});
```
### Authentication Failing
Use pre-navigation hooks to ensure authentication happens before scraping protected pages:
```javascript
preNavigationHooks: [
    async ({ page, session, request }) => {
        const isAuthenticated = session.getCookies(request.url)
            .some((c) => c.name === 'session_id');
        if (!isAuthenticated) {
            // performLogin is a helper you define yourself - see the sketch below
            await performLogin(page, session);
        }
    },
]
```
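The `performLogin` call above refers to a helper you would write yourself; a minimal sketch, reusing the selectors and URLs from the earlier examples:

```javascript
// Hypothetical helper - selectors and URLs are assumptions, adapt them to your site
async function performLogin(page, session) {
    await page.goto('https://example.com/login');
    await page.fill('#username', 'user@example.com');
    await page.fill('#password', 'securePassword123');
    await page.click('button[type="submit"]');
    await page.waitForURL('**/dashboard', { timeout: 10000 });

    // Copy the browser's cookies onto the session for reuse
    const cookies = await page.context().cookies();
    session.setCookies(cookies, 'https://example.com');
}
```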
## Conclusion
Crawlee's session management system provides a robust foundation for handling cookies and maintaining state across web scraping sessions. By leveraging `SessionPool`, you can build scalable scrapers that handle authentication, cookie persistence, and session rotation automatically. Whether you're scraping static sites with `CheerioCrawler` or dynamic JavaScript applications with `PlaywrightCrawler`, Crawlee's session management adapts to your needs while maintaining reliability and performance.
For more advanced scenarios involving complex user interactions and AJAX requests, combine session management with Crawlee's powerful browser automation capabilities to create sophisticated web scraping solutions.