Does Crawlee integrate with Apify SDK?
Yes, Crawlee integrates seamlessly with the Apify SDK, providing a powerful combination for building, deploying, and scaling web scraping projects. Crawlee was developed by Apify and is designed to work natively with the Apify platform, allowing developers to leverage cloud infrastructure, distributed storage, and advanced scheduling capabilities.
The integration between Crawlee and Apify SDK enables you to run your Crawlee scrapers locally during development and deploy them to the Apify cloud platform for production use without significant code changes. This makes it ideal for developers who want the flexibility of local development with the power of cloud-based execution.
Understanding the Relationship Between Crawlee and Apify
Crawlee is a modern web scraping and browser automation library for Node.js and Python, while the Apify SDK provides additional platform-specific features for running scrapers on Apify's cloud infrastructure. When you use Crawlee with the Apify platform, you automatically gain access to:
- Distributed storage for scraped data, screenshots, and key-value stores
- Proxy management with automatic rotation and session handling
- Scheduled runs for periodic scraping tasks
- Monitoring and logging through the Apify console
- Actor input/output handling for easy configuration
- Webhooks for event-driven workflows
The integration is seamless because Crawlee automatically detects when it's running on the Apify platform and uses Apify-specific storage and configuration without requiring code changes.
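For instance, you can check for the platform yourself with the Apify SDK's Actor.isAtHome() helper, which reads the APIFY_IS_AT_HOME environment variable that the platform sets. A minimal sketch:
import { Actor } from 'apify';

await Actor.init();

// Actor.isAtHome() returns true only when the code runs on the Apify platform,
// so you can branch on environment-specific behaviour if you ever need to
if (Actor.isAtHome()) {
    console.log('Running on Apify - cloud storage and proxies are available.');
} else {
    console.log('Running locally - Crawlee falls back to file-based storage.');
}

await Actor.exit();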
Setting Up Crawlee with Apify SDK
Installation
To use Crawlee with Apify SDK capabilities, you can start with a standard Crawlee installation:
# Install Crawlee
npm install crawlee
# For Apify-specific features, install the Apify CLI
npm install -g apify-cli
# Initialize an Apify project with Crawlee
apify create my-scraper
When creating an Apify project, you'll be prompted to choose a template. Select one of the Crawlee templates (Cheerio, Playwright, or Puppeteer) based on your scraping needs.
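If you want to try the generated project before deploying it anywhere, the Apify CLI can run it locally. A quick sketch, assuming the project name my-scraper chosen above:
# Move into the generated project
cd my-scraper

# Run the Actor locally - results land in the local ./storage directory
apify run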
Basic Crawlee Script with Apify Integration
Here's a basic example showing how Crawlee automatically integrates with Apify platform features:
import { PlaywrightCrawler } from 'crawlee';

// This crawler works both locally and on Apify platform
const crawler = new PlaywrightCrawler({
    // When running on Apify, this automatically uses Apify storage
    async requestHandler({ page, request, enqueueLinks, pushData }) {
        console.log(`Processing: ${request.url}`);

        const title = await page.title();
        const heading = await page.locator('h1').textContent();

        // pushData automatically uses Dataset API on Apify
        await pushData({
            url: request.url,
            title,
            heading,
        });

        // enqueueLinks automatically uses RequestQueue on Apify
        await enqueueLinks({
            globs: ['https://example.com/**'],
        });
    },
    maxRequestsPerCrawl: 50,
});

// Add initial URLs
await crawler.addRequests(['https://example.com']);

// Run the crawler
await crawler.run();

console.log('Crawler finished.');
This same code runs locally using file-based storage and on Apify using cloud-based distributed storage.
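Locally, for example, the results typically end up in Crawlee's default file-based storage directories, roughly laid out like this:
storage/
├── datasets/default/          # items saved with pushData(), one JSON file per item
├── key_value_stores/default/  # values saved with setValue(), including INPUT.json
└── request_queues/default/    # requests added by addRequests() and enqueueLinks()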
Accessing Apify Platform Features from Crawlee
Using Apify Storage
When your Crawlee scraper runs on the Apify platform, it automatically gains access to three types of storage:
1. Dataset Storage
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ page, pushData }) {
        // This data is stored in Apify Dataset on platform
        await pushData({
            product: await page.locator('.product-name').textContent(),
            price: await page.locator('.price').textContent(),
            timestamp: new Date().toISOString(),
        });
    },
});
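If you later need to read the collected items back in code, for example to post-process them, here is a minimal sketch using the same default Dataset that pushData() writes to:
import { Dataset } from 'crawlee';

// Opens the default dataset - local files in development, Apify Dataset on the platform
const dataset = await Dataset.open();

// getData() returns the stored items along with paging metadata
const { items } = await dataset.getData();
console.log(`Collected ${items.length} items`);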
2. Key-Value Store
import { KeyValueStore } from 'crawlee';

// Store arbitrary data like screenshots or JSON files
// (the `page` object below is assumed to come from a crawler's requestHandler)
const store = await KeyValueStore.open();
await store.setValue('screenshot', await page.screenshot(), { contentType: 'image/png' });
await store.setValue('config', { lastRun: new Date(), itemsProcessed: 100 });
3. Request Queue
import { RequestQueue } from 'crawlee';
// Manage URLs to be crawled
const queue = await RequestQueue.open();
await queue.addRequest({ url: 'https://example.com/page1' });
await queue.addRequest({ url: 'https://example.com/page2' });
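You can also attach metadata to queued requests, for example to route different page types to different handling logic later. A small sketch (the DETAIL label is an arbitrary name used here for illustration):
// userData travels with the request and is available again in the requestHandler
await queue.addRequest({
    url: 'https://example.com/product/123',
    userData: { label: 'DETAIL' },
});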
Reading Apify Actor Input
When running as an Apify Actor, you can read configuration from the Actor input:
import { Actor } from 'apify';
import { PlaywrightCrawler } from 'crawlee';
await Actor.init();
// Get input from Apify platform
const input = (await Actor.getInput()) ?? {};
const { startUrls, maxPages, searchTerm } = input;

const crawler = new PlaywrightCrawler({
    maxRequestsPerCrawl: maxPages || 100,
    async requestHandler({ page, pushData }) {
        // Use input parameters in your scraping logic
        if (searchTerm) {
            await page.fill('input[type="search"]', searchTerm);
            await page.click('button[type="submit"]');
        }

        // Scrape and store data
        await pushData({
            /* scraped data */
        });
    },
});
await crawler.addRequests(startUrls);
await crawler.run();
await Actor.exit();
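If you prefer, the same lifecycle can be written with the Actor.main() wrapper, which calls Actor.init() and Actor.exit() for you. A minimal sketch with a deliberately simplified handler:
import { Actor } from 'apify';
import { PlaywrightCrawler } from 'crawlee';

// Actor.main() wraps init/exit and reports a failed run if the callback throws
await Actor.main(async () => {
    const input = (await Actor.getInput()) ?? {};

    const crawler = new PlaywrightCrawler({
        async requestHandler({ page, request, pushData }) {
            await pushData({ url: request.url, title: await page.title() });
        },
    });

    await crawler.run(input.startUrls ?? ['https://example.com']);
});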
Python Integration with Apify SDK
Crawlee for Python also integrates with Apify platform when deployed as an Actor:
from crawlee.playwright_crawler import PlaywrightCrawler
from apify import Actor


async def main():
    async with Actor:
        # Get Actor input
        actor_input = await Actor.get_input() or {}
        start_urls = actor_input.get('start_urls', ['https://example.com'])

        crawler = PlaywrightCrawler(
            max_requests_per_crawl=50,
        )

        @crawler.router.default_handler
        async def request_handler(context):
            # Extract data
            data = {
                'url': context.request.url,
                'title': await context.page.title(),
            }

            # Store in Apify Dataset
            await context.push_data(data)

            # Enqueue links
            await context.enqueue_links()

        await crawler.run(start_urls)
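To actually execute this coroutine, Apify's Python templates typically call it from a small entry point (often a separate __main__.py). A minimal sketch using asyncio:
import asyncio

if __name__ == '__main__':
    # Run the async main() defined above
    asyncio.run(main())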
Deploying Crawlee to Apify Platform
Using Apify CLI
# Login to Apify
apify login
# Create a new Actor
apify create my-crawler
# Deploy to Apify platform
apify push
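Once the Actor is deployed, you can also start it programmatically with the apify-client package. A rough sketch in which the token, the username/Actor name, and the input values are placeholders:
import { ApifyClient } from 'apify-client';

// Replace the token and Actor name with your own values
const client = new ApifyClient({ token: 'MY_APIFY_TOKEN' });

// call() starts the Actor on the platform and waits for the run to finish
const run = await client.actor('my-username/my-crawler').call({
    startUrls: [{ url: 'https://example.com' }],
});

// Fetch the items the run stored in its default dataset
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(`Run finished with ${items.length} items`);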
Configuring Actor Settings
Create an actor.json file to configure your Actor:
{
    "actorSpecification": 1,
    "name": "my-crawlee-scraper",
    "version": "1.0.0",
    "buildTag": "latest",
    "environmentVariables": {},
    "dockerfile": "./Dockerfile",
    "readme": "./README.md",
    "input": "./input_schema.json",
    "storages": {
        "dataset": {
            "actorSpecification": 1,
            "views": {
                "overview": {
                    "title": "Overview",
                    "transformation": {
                        "fields": ["url", "title", "price"]
                    }
                }
            }
        }
    }
}
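The referenced Dockerfile can stay small when it builds on one of Apify's prebuilt base images. A rough sketch for a Playwright-based Actor (the image tag is an assumption; use whatever your template ships with):
# Apify base image with Node.js, Playwright and Chromium preinstalled
FROM apify/actor-node-playwright-chrome:20

# Install production dependencies first to take advantage of Docker layer caching
COPY package*.json ./
RUN npm install --omit=dev

# Copy the source code and start the Actor
COPY . ./
CMD ["npm", "start"]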
Advanced Features: Proxy Integration
When running on the Apify platform, your Crawlee scrapers can use Apify Proxy through the Apify SDK, while locally you can supply your own proxy URLs. Crawlee's ProxyConfiguration handles the rotation in both cases:
import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';

// Rotates between the proxy URLs you provide
const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: ['http://proxy1.com:8000', 'http://proxy2.com:8000'],
});

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    async requestHandler({ page, request, pushData }) {
        // Your scraping logic with automatic proxy rotation
        await pushData({
            url: request.url,
            content: await page.content(),
        });
    },
});
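On the platform itself, Apify Proxy is typically wired in through the Apify SDK's Actor.createProxyConfiguration() rather than hard-coded URLs. A minimal sketch (the RESIDENTIAL group and country code are examples and require the corresponding proxy access on your account):
import { Actor } from 'apify';
import { PlaywrightCrawler } from 'crawlee';

await Actor.init();

// Returns a ProxyConfiguration backed by Apify Proxy;
// locally it needs your Apify account (e.g. after `apify login`)
const proxyConfiguration = await Actor.createProxyConfiguration({
    groups: ['RESIDENTIAL'],
    countryCode: 'US',
});

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    async requestHandler({ page, request, pushData }) {
        await pushData({ url: request.url, title: await page.title() });
    },
});

await crawler.run(['https://example.com']);
await Actor.exit();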
Monitoring and Debugging
When running on Apify, you get built-in monitoring features:
import { log } from 'crawlee';
// Logs are automatically sent to Apify console
log.info('Starting crawler');
log.debug('Processing URL', { url: request.url });
log.warning('Rate limit approaching');
log.error('Failed to process page', { error: error.message });
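During debugging you may also want to lower the log level so that log.debug() messages become visible. A small sketch:
import { log } from 'crawlee';

// The default level is INFO; DEBUG also prints log.debug() calls
log.setLevel(log.LEVELS.DEBUG);
log.debug('This message is now visible');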
Similar to how you can handle browser sessions in Puppeteer, Crawlee manages sessions automatically, and when running on Apify, these sessions are distributed across the cloud infrastructure for better reliability.
Benefits of Using Crawlee with Apify SDK
1. Environment Portability
Write once, run anywhere. Your Crawlee code works identically in local development and on the Apify cloud platform.
2. Automatic Scaling
Apify automatically scales your Crawlee scrapers based on workload, distributing requests across multiple instances when needed.
3. Persistent Storage
Data stored during scraping is automatically persisted in the cloud and accessible via API even after the scraper finishes.
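Once a run has finished, for example, its default dataset can be downloaded straight from the Apify API. A sketch in which the dataset ID and token are placeholders:
# Download the dataset items of a finished run as JSON
curl "https://api.apify.com/v2/datasets/<DATASET_ID>/items?format=json" \
  -H "Authorization: Bearer <YOUR_API_TOKEN>"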
4. Built-in Proxy Management
Access to Apify's proxy services with automatic rotation and residential IP options without additional configuration.
5. Scheduling and Webhooks
Schedule your Crawlee scrapers to run periodically and trigger webhooks on completion, giving you cloud-based orchestration for event-driven workflows.
Migrating Existing Crawlee Projects to Apify
If you have an existing Crawlee project, migrating to Apify is straightforward:
1. Add Apify initialization:
import { Actor } from 'apify';
await Actor.init();
// Your existing Crawlee code here
await Actor.exit();
2. Create an input schema (input_schema.json):
{
    "title": "My Crawler Input",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "startUrls": {
            "title": "Start URLs",
            "type": "array",
            "description": "URLs to start crawling from",
            "editor": "requestListSources"
        }
    },
    "required": ["startUrls"]
}
3. Deploy:
apify push
Handling Dynamic Content
When dealing with JavaScript-heavy websites, Crawlee's integration with Apify makes it easy to handle AJAX requests using Puppeteer or Playwright, with automatic resource management on the cloud platform:
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ page, pushData }) {
        // Wait for AJAX content to load
        await page.waitForSelector('.ajax-content');
        await page.waitForLoadState('networkidle');

        const data = await page.evaluate(() => {
            // Extract dynamically loaded content
            return Array.from(document.querySelectorAll('.item')).map(item => ({
                title: item.querySelector('.title')?.textContent,
                value: item.querySelector('.value')?.textContent,
            }));
        });

        await pushData(data);
    },
});
Conclusion
Crawlee's integration with Apify SDK provides a powerful, production-ready solution for web scraping projects. The seamless compatibility allows developers to build and test locally while deploying to a robust cloud infrastructure with minimal changes. Whether you're scraping small datasets or running large-scale distributed crawls, the Crawlee-Apify combination offers the tools, storage, and scaling capabilities needed for professional web scraping applications.
The automatic detection of the Apify environment, combined with unified APIs for storage and queue management, makes it possible to write portable code that works efficiently in both development and production environments. For developers serious about web scraping at scale, this integration represents one of the most developer-friendly solutions available today.