How do I use Crawlee with the Apify platform?
Crawlee is deeply integrated with the Apify platform, providing a seamless way to develop locally and deploy your web scraping projects to the cloud. The Apify platform offers cloud infrastructure, scheduling, monitoring, and data storage capabilities that complement Crawlee's powerful scraping features.
Understanding Crawlee and Apify Integration
Crawlee was originally developed by Apify as their in-house scraping library and was later open-sourced. This native integration means Crawlee is designed to work perfectly with Apify's cloud infrastructure, making deployment straightforward and efficient.
When you run Crawlee locally, it stores data in local directories. On Apify, the same code automatically uses Apify's cloud storage, request queues, and datasets without any code changes.
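For example, the data-saving call below is identical in both environments; only the destination of the records differs (a minimal sketch, with a placeholder URL):
import { Actor } from 'apify';

await Actor.init();

// Locally this writes JSON files under ./storage/datasets/default/;
// on the Apify platform the same call appends records to the run's Dataset.
await Actor.pushData({ url: 'https://example.com', scrapedAt: new Date().toISOString() });

await Actor.exit();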
Installing the Apify CLI
The Apify CLI is the primary tool for creating, testing, and deploying Crawlee projects to the Apify platform.
# Install Apify CLI globally
npm install -g apify-cli
# Or using Yarn
yarn global add apify-cli
# Verify installation
apify --version
After installation, log in to your Apify account:
apify login
This command will open your browser and prompt you to authenticate with your Apify account.
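If the browser flow is not an option (for example on a headless server or in CI), the CLI can also take an API token directly. The snippet below assumes the -t/--token option; confirm the exact flag for your CLI version with apify login --help:
# Log in non-interactively with an API token from the Apify Console
apify login -t YOUR_APIFY_TOKEN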
Creating a New Crawlee Project for Apify
The Apify CLI provides templates for quickly creating Crawlee projects:
# Create a new Crawlee project with Playwright
apify create my-crawler --template crawlee-playwright-javascript
# Or with Puppeteer
apify create my-crawler --template crawlee-puppeteer-javascript
# Or with Cheerio for static pages
apify create my-crawler --template crawlee-cheerio-javascript
# For TypeScript projects
apify create my-crawler --template crawlee-playwright-typescript
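If you omit the --template flag, the CLI shows an interactive list of the currently available templates, which is also the easiest way to check the exact template names for your CLI version.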
This creates a project structure optimized for both local development and Apify deployment:
my-crawler/
├── src/
│ ├── main.js # Main crawler logic
│ └── routes.js # Request handlers
├── storage/ # Local storage (gitignored)
├── .actor/
│ ├── actor.json # Apify Actor configuration
│ └── INPUT_SCHEMA.json # Input form definition
├── package.json
└── README.md
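The .actor/actor.json file links the project to the platform. Its exact contents depend on the template you picked, but a minimal sketch (name, title, and version here are placeholders) looks roughly like this:
{
    "actorSpecification": 1,
    "name": "my-crawler",
    "title": "My Crawler",
    "version": "0.1",
    "buildTag": "latest",
    "input": "./INPUT_SCHEMA.json"
}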
Basic Crawlee Project for Apify
Here's a simple Crawlee scraper configured for Apify deployment:
// src/main.js
import { Actor } from 'apify';
import { PlaywrightCrawler } from 'crawlee';
// Initialize the Actor
await Actor.init();
// Get input from Apify platform (or use defaults locally)
const input = await Actor.getInput();
const {
startUrls = ['https://crawlee.dev'],
maxRequestsPerCrawl = 20,
} = input || {};
// Create the crawler
const crawler = new PlaywrightCrawler({
maxRequestsPerCrawl,
async requestHandler({ request, page, enqueueLinks }) {
console.log(`Processing: ${request.url}`);
// Extract data
const title = await page.title();
const content = await page.locator('body').textContent();
// Save to Apify Dataset (or local storage when running locally)
await Actor.pushData({
url: request.url,
title,
contentLength: content.length,
timestamp: new Date().toISOString(),
});
// Enqueue links for crawling
await enqueueLinks({
strategy: 'same-domain',
});
},
async failedRequestHandler({ request }) {
console.error(`Request ${request.url} failed`);
},
});
// Run the crawler with start URLs
await crawler.run(startUrls);
// Exit the Actor
await Actor.exit();
Python Implementation with Crawlee
For Python developers, Crawlee also integrates with Apify:
# src/main.py
from apify import Actor
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext
async def main():
    async with Actor:
        # Get input from the Apify platform
        actor_input = await Actor.get_input() or {}
        start_urls = actor_input.get('startUrls', ['https://crawlee.dev'])
        max_requests = actor_input.get('maxRequestsPerCrawl', 20)

        # Create the crawler
        crawler = PlaywrightCrawler(
            max_requests_per_crawl=max_requests,
        )

        # Define the request handler
        @crawler.router.default_handler
        async def request_handler(context: PlaywrightCrawlingContext) -> None:
            Actor.log.info(f'Processing: {context.request.url}')

            # Extract data
            title = await context.page.title()
            content = await context.page.text_content('body')

            # Save to the Apify Dataset
            await context.push_data({
                'url': context.request.url,
                'title': title,
                'contentLength': len(content),
            })

            # Enqueue links
            await context.enqueue_links()

        # Run the crawler
        await crawler.run(start_urls)
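Note that the async main() coroutine is not started in this file; Apify's Python templates typically generate a small src/__main__.py entry point that runs it with asyncio.run(main()).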
Configuring Actor Input Schema
The INPUT_SCHEMA.json file defines the input form users see when running your Actor on Apify:
{
"title": "Crawlee Web Scraper",
"type": "object",
"schemaVersion": 1,
"properties": {
"startUrls": {
"title": "Start URLs",
"type": "array",
"description": "URLs to start the crawl",
"editor": "requestListSources",
"prefill": [
{ "url": "https://crawlee.dev" }
]
},
"maxRequestsPerCrawl": {
"title": "Max requests per crawl",
"type": "integer",
"description": "Maximum number of pages to crawl",
"default": 20,
"minimum": 1
},
"proxyConfiguration": {
"title": "Proxy configuration",
"type": "object",
"editor": "proxy",
"description": "Select proxies to use"
}
},
"required": ["startUrls"]
}
Running Locally vs. On Apify
Local Development
# Run locally with default input
apify run
# To run with custom input, edit the local input file before running:
#   storage/key_value_stores/default/INPUT.json
When running locally, Crawlee uses local storage in the storage/ directory.
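The local layout mirrors the cloud storages; the sketch below shows the directories Crawlee creates on the first run:
storage/
├── datasets/default/           # Actor.pushData() records as JSON files
├── key_value_stores/default/   # INPUT.json and Actor.setValue() entries
└── request_queues/default/     # Pending and handled requests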
Deploying to Apify
# Build and push to Apify platform
apify push
# Deploy as a specific version (Actor versions use major.minor format)
apify push --version 1.0
After pushing, your Actor is available in the Apify Console, where you can:
- Run it on-demand
- Schedule periodic runs
- Configure notifications
- Access scraped data
- Monitor performance
Handling Browser Automation in the Cloud
When deploying Crawlee scrapers that use Puppeteer or Playwright for browser automation, Apify handles the browser dependencies automatically: you don't need to install Chrome or configure headless browsers yourself.
For complex interactions like handling authentication, your local code works identically on Apify:
import { Actor } from 'apify';
import { PlaywrightCrawler } from 'crawlee';
await Actor.init();
const crawler = new PlaywrightCrawler({
async requestHandler({ page, request }) {
// Authentication works the same locally and on Apify
if (request.url.includes('login')) {
await page.fill('#username', 'myuser');
await page.fill('#password', 'mypassword');
// Start waiting for the navigation before clicking to avoid a race
await Promise.all([
    page.waitForNavigation(),
    page.click('button[type="submit"]'),
]);
}
// Rest of your scraping logic
const data = await page.evaluate(() => {
return {
title: document.title,
// ... extract data
};
});
await Actor.pushData(data);
},
});
await crawler.run(['https://example.com/login']);
await Actor.exit();
Using Apify Storage with Crawlee
Crawlee's storage system automatically uses Apify's cloud storage when running on the platform:
Datasets
// Save structured data
await Actor.pushData({
productName: 'Example Product',
price: 29.99,
inStock: true,
});
// Or push multiple records
await Actor.pushData([
{ id: 1, name: 'Product 1' },
{ id: 2, name: 'Product 2' },
]);
Key-Value Store
// Store files, screenshots, or arbitrary data
await Actor.setValue('screenshot', buffer, { contentType: 'image/png' });
// Store JSON data
await Actor.setValue('config', { lastProcessed: new Date() });
// Retrieve values
const config = await Actor.getValue('config');
Request Queue
Crawlee's request queue automatically uses Apify's distributed queue:
const crawler = new PlaywrightCrawler({
async requestHandler({ request, enqueueLinks }) {
// Enqueued links automatically use Apify's request queue
await enqueueLinks({
selector: 'a.product-link',
baseUrl: request.loadedUrl,
});
},
});
Proxy Configuration on Apify
Apify provides residential and datacenter proxies that integrate seamlessly with Crawlee:
import { Actor } from 'apify';
import { PlaywrightCrawler } from 'crawlee';
await Actor.init();
// Default to an empty object when no input is provided (e.g. during local runs)
const input = (await Actor.getInput()) ?? {};
const proxyConfiguration = await Actor.createProxyConfiguration({
groups: input.proxyConfiguration?.groups || ['RESIDENTIAL'],
countryCode: input.proxyConfiguration?.countryCode || 'US',
});
const crawler = new PlaywrightCrawler({
proxyConfiguration,
async requestHandler({ page }) {
// All requests automatically use configured proxies
const content = await page.content();
// Process content...
},
});
await crawler.run(['https://example.com']);
await Actor.exit();
Scheduling and Monitoring
Once deployed, you can schedule your Crawlee scraper to run automatically:
- Scheduled Runs: Configure cron-style schedules in the Apify Console
- Webhooks: Trigger runs via HTTP requests or integrate with other services (see the sketch after this list)
- Monitoring: View logs, performance metrics, and receive alerts
- Data Retention: Automatic data storage with configurable retention policies
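Triggering a run over HTTP boils down to one call against the Apify API. A minimal sketch (the Actor ID my-username~my-crawler and the input fields are placeholders; the token is read from an environment variable):
// Start an Actor run via the Apify API; the request body becomes the run's input.
// Node 18+ provides fetch globally.
const token = process.env.APIFY_TOKEN;
const response = await fetch(
    `https://api.apify.com/v2/acts/my-username~my-crawler/runs?token=${token}`,
    {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({
            startUrls: [{ url: 'https://example.com' }],
            maxRequestsPerCrawl: 10,
        }),
    },
);
const { data: run } = await response.json();
console.log(`Started run ${run.id}`);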
Best Practices for Apify Deployment
1. Use Environment Variables for Secrets
// Read secrets from environment variables; define them as secret
// environment variables in the Actor's settings in the Apify Console
await Actor.init();
const apiKey = process.env.API_KEY;
2. Implement Proper Error Handling
const crawler = new PlaywrightCrawler({
maxRequestRetries: 3,
async failedRequestHandler({ request }) {
await Actor.pushData({
url: request.url,
error: true,
errorMessages: request.errorMessages,
});
},
});
3. Use Memory Efficiently
const crawler = new PlaywrightCrawler({
    maxConcurrency: 10, // Adjust based on the memory allocated to the run
    async requestHandler({ page }) {
        // Extract only what you need; Crawlee closes the page for you after
        // the handler finishes, so there is no need to call page.close() manually.
    },
});
4. Log Important Information
import { log } from 'crawlee';

log.info('Starting crawl...');
log.debug(`Processing URL: ${url}`);
log.error(`Failed to extract data: ${error.message}`);
Migrating Existing Crawlee Projects
If you have an existing Crawlee project, migrating to Apify is straightforward (a minimal sketch follows the list):
- Wrap your code with Actor.init() and Actor.exit()
- Replace local storage calls with Actor.pushData()
- Add input handling with Actor.getInput()
- Create the .actor/actor.json and INPUT_SCHEMA.json files
- Test locally with apify run
- Deploy with apify push
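As a minimal sketch, an existing script usually only needs the Actor lifecycle and input handling wrapped around the crawler it already defines (the URL and extracted fields below are placeholders):
import { Actor } from 'apify';
import { CheerioCrawler } from 'crawlee';

await Actor.init();

// Values that used to be hard-coded can now come from the platform input.
const { startUrls = ['https://crawlee.dev'] } = (await Actor.getInput()) ?? {};

const crawler = new CheerioCrawler({
    async requestHandler({ request, $ }) {
        // Records that previously went to local files or Dataset.pushData()
        // now go through Actor.pushData() and land in the run's Dataset.
        await Actor.pushData({ url: request.url, title: $('title').text() });
    },
});

await crawler.run(startUrls);
await Actor.exit();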
Conclusion
The integration between Crawlee and the Apify platform provides a powerful combination for web scraping projects. You can develop and test locally with Crawlee's excellent developer experience, then deploy to Apify's cloud infrastructure with minimal changes. This approach gives you the best of both worlds: local development flexibility and cloud scalability for production workloads.
Whether you're scraping small websites or running large-scale data extraction operations, the Crawlee-Apify combination provides the tools and infrastructure needed for reliable, maintainable web scraping solutions.