How do I use Crawlee with the Apify platform?

Crawlee is deeply integrated with the Apify platform, providing a seamless way to develop locally and deploy your web scraping projects to the cloud. The Apify platform offers cloud infrastructure, scheduling, monitoring, and data storage capabilities that complement Crawlee's powerful scraping features.

Understanding Crawlee and Apify Integration

Crawlee was originally developed by Apify as their in-house scraping library and was later open-sourced. This native integration means Crawlee is designed to work perfectly with Apify's cloud infrastructure, making deployment straightforward and efficient.

When you run Crawlee locally, it stores data in local directories. On Apify, the same code automatically uses Apify's cloud storage, request queues, and datasets without any code changes.
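
As a minimal sketch of that transparency: the same Dataset call writes JSON files under ./storage/datasets/default during local development, and writes to the run's cloud dataset when the code runs on Apify.

import { Actor } from 'apify';
import { Dataset } from 'crawlee';

await Actor.init();

// Same call everywhere: local JSON files during development,
// the run's cloud dataset when running on the Apify platform.
await Dataset.pushData({
    url: 'https://crawlee.dev',
    scrapedAt: new Date().toISOString(),
});

await Actor.exit();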

Installing the Apify CLI

The Apify CLI is the primary tool for creating, testing, and deploying Crawlee projects to the Apify platform.

# Install Apify CLI globally
npm install -g apify-cli

# Or using Yarn
yarn global add apify-cli

# Verify installation
apify --version

After installation, log in to your Apify account:

apify login

This command will open your browser and prompt you to authenticate with your Apify account.
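
If the browser flow isn't practical (for example in CI), recent CLI versions also accept a personal API token directly; check apify login --help for the exact flags in your version:

# Log in non-interactively with an API token (the token shown is a placeholder)
apify login --token YOUR_APIFY_TOKEN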

Creating a New Crawlee Project for Apify

The Apify CLI provides templates for quickly creating Crawlee projects:

# Create a new Crawlee project with Playwright
apify create my-crawler --template crawlee-playwright-javascript

# Or with Puppeteer
apify create my-crawler --template crawlee-puppeteer-javascript

# Or with Cheerio for static pages
apify create my-crawler --template crawlee-cheerio-javascript

# For TypeScript projects
apify create my-crawler --template crawlee-playwright-typescript

This creates a project structure optimized for both local development and Apify deployment:

my-crawler/
├── src/
│   ├── main.js          # Main crawler logic
│   └── routes.js        # Request handlers
├── storage/             # Local storage (gitignored)
├── .actor/
│   ├── actor.json       # Apify Actor configuration
│   └── INPUT_SCHEMA.json # Input form definition
├── package.json
└── README.md
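
The .actor/actor.json file is what ties the project to the platform. The templates generate it for you; a minimal sketch looks roughly like this (name, title, and version are placeholders):

{
    "actorSpecification": 1,
    "name": "my-crawler",
    "title": "My Crawler",
    "version": "0.1",
    "input": "./INPUT_SCHEMA.json"
}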

Basic Crawlee Project for Apify

Here's a simple Crawlee scraper configured for Apify deployment:

// src/main.js
import { Actor } from 'apify';
import { PlaywrightCrawler } from 'crawlee';

// Initialize the Actor
await Actor.init();

// Get input from Apify platform (or use defaults locally)
const input = await Actor.getInput();
const {
    startUrls = ['https://crawlee.dev'],
    maxRequestsPerCrawl = 20,
} = input || {};

// Create the crawler
const crawler = new PlaywrightCrawler({
    maxRequestsPerCrawl,
    async requestHandler({ request, page, enqueueLinks }) {
        console.log(`Processing: ${request.url}`);

        // Extract data
        const title = await page.title();
        const content = await page.locator('body').textContent();

        // Save to Apify Dataset (or local storage when running locally)
        await Actor.pushData({
            url: request.url,
            title,
            contentLength: content.length,
            timestamp: new Date().toISOString(),
        });

        // Enqueue links for crawling
        await enqueueLinks({
            strategy: 'same-domain',
        });
    },

    async failedRequestHandler({ request }) {
        console.error(`Request ${request.url} failed`);
    },
});

// Run the crawler with start URLs
await crawler.run(startUrls);

// Exit the Actor
await Actor.exit();

Python Implementation with Crawlee

For Python developers, Crawlee also integrates with Apify:

# src/main.py
from apify import Actor
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext

async def main():
    async with Actor:
        # Get input from Apify platform
        actor_input = await Actor.get_input() or {}
        start_urls = actor_input.get('startUrls', ['https://crawlee.dev'])
        max_requests = actor_input.get('maxRequestsPerCrawl', 20)

        # Create the crawler
        crawler = PlaywrightCrawler(
            max_requests_per_crawl=max_requests,
        )

        # Define request handler
        @crawler.router.default_handler
        async def request_handler(context: PlaywrightCrawlingContext) -> None:
            Actor.log.info(f'Processing: {context.request.url}')

            # Extract data
            title = await context.page.title()
            content = await context.page.text_content('body')

            # Save to Apify Dataset
            await context.push_data({
                'url': context.request.url,
                'title': title,
                'contentLength': len(content),
            })

            # Enqueue links
            await context.enqueue_links()

        # Run the crawler
        await crawler.run(start_urls)
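
The main() coroutine still needs an entry point. The Python templates typically generate one in src/__main__.py along these lines (a minimal sketch; the generated file usually also configures logging):

# src/__main__.py
import asyncio

from .main import main

asyncio.run(main())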

Configuring Actor Input Schema

The INPUT_SCHEMA.json file defines the input form users see when running your Actor on Apify:

{
    "title": "Crawlee Web Scraper",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "startUrls": {
            "title": "Start URLs",
            "type": "array",
            "description": "URLs to start the crawl",
            "editor": "requestListSources",
            "prefill": [
                { "url": "https://crawlee.dev" }
            ]
        },
        "maxRequestsPerCrawl": {
            "title": "Max requests per crawl",
            "type": "integer",
            "description": "Maximum number of pages to crawl",
            "default": 20,
            "minimum": 1
        },
        "proxyConfiguration": {
            "title": "Proxy configuration",
            "type": "object",
            "editor": "proxy",
            "description": "Select proxies to use"
        }
    },
    "required": ["startUrls"]
}

Running Locally vs. On Apify

Local Development

# Run locally with default input
apify run

# Run with custom input: write the local input file first, then run
echo '{"startUrls": ["https://example.com"], "maxRequestsPerCrawl": 10}' > storage/key_value_stores/default/INPUT.json
apify run

When running locally, Crawlee uses local storage in the storage/ directory.
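
The storage/ directory mirrors the cloud storages, so you can inspect results directly on disk:

storage/
├── datasets/default/           # items saved with Actor.pushData()
├── key_value_stores/default/   # INPUT.json plus values from Actor.setValue()
└── request_queues/default/     # pending and handled requests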

Deploying to Apify

# Build and push to Apify platform
apify push

# Deploy with a specific version (Actor versions use MAJOR.MINOR format)
apify push --version 1.0

After pushing, your Actor is available in the Apify Console, where you can:

  • Run it on-demand
  • Schedule periodic runs
  • Configure notifications
  • Access scraped data
  • Monitor performance
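
You can also start runs and fetch their results programmatically with the apify-client package; a sketch (the Actor ID 'your-username/my-crawler' is a placeholder):

import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// Start the Actor and wait for the run to finish
const run = await client.actor('your-username/my-crawler').call({
    startUrls: [{ url: 'https://example.com' }],
    maxRequestsPerCrawl: 50,
});

// Download the scraped items from the run's default dataset
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(`Fetched ${items.length} items`);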

Handling Browser Automation in the Cloud

When deploying Crawlee scrapers that use Puppeteer or Playwright for browser automation, the Apify platform handles the browser dependencies for you. You don't need to worry about installing Chrome or configuring headless browsers.

For complex interactions like handling authentication, your local code works identically on Apify:

import { Actor } from 'apify';
import { PlaywrightCrawler } from 'crawlee';

await Actor.init();

const crawler = new PlaywrightCrawler({
    async requestHandler({ page, request }) {
        // Authentication works the same locally and on Apify
        if (request.url.includes('login')) {
            // Use secret environment variables for real credentials (see Best Practices below)
            await page.fill('#username', 'myuser');
            await page.fill('#password', 'mypassword');
            // Start waiting for the navigation before clicking to avoid a race
            await Promise.all([
                page.waitForNavigation(),
                page.click('button[type="submit"]'),
            ]);
        }

        // Rest of your scraping logic
        const data = await page.evaluate(() => {
            return {
                title: document.title,
                // ... extract data
            };
        });

        await Actor.pushData(data);
    },
});

await crawler.run(['https://example.com/login']);
await Actor.exit();

Using Apify Storage with Crawlee

Crawlee's storage system automatically uses Apify's cloud storage when running on the platform:

Datasets

// Save structured data
await Actor.pushData({
    productName: 'Example Product',
    price: 29.99,
    inStock: true,
});

// Or push multiple records
await Actor.pushData([
    { id: 1, name: 'Product 1' },
    { id: 2, name: 'Product 2' },
]);
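
You can also read a dataset back during the run, for example to post-process or summarize results; a short sketch:

// Open the run's default dataset and page through the stored items
const dataset = await Actor.openDataset();
const { items } = await dataset.getData({ limit: 100 });
console.log(`Scraped ${items.length} items so far`);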

Key-Value Store

// Store files, screenshots, or arbitrary data
await Actor.setValue('screenshot', buffer, { contentType: 'image/png' });

// Store JSON data
await Actor.setValue('config', { lastProcessed: new Date() });

// Retrieve values
const config = await Actor.getValue('config');

Request Queue

Crawlee's request queue automatically uses Apify's distributed queue:

const crawler = new PlaywrightCrawler({
    async requestHandler({ request, enqueueLinks }) {
        // Enqueued links automatically use Apify's request queue
        await enqueueLinks({
            selector: 'a.product-link',
            baseUrl: request.loadedUrl,
        });
    },
});
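
If you need more control than enqueueLinks() provides, you can open the queue explicitly and seed it yourself; a sketch:

import { Actor } from 'apify';
import { PlaywrightCrawler } from 'crawlee';

await Actor.init();

// Open the default queue explicitly: local files when run locally,
// Apify's distributed request queue when running on the platform.
const requestQueue = await Actor.openRequestQueue();
await requestQueue.addRequest({ url: 'https://example.com/start', label: 'START' });

const crawler = new PlaywrightCrawler({
    requestQueue,
    async requestHandler({ request, enqueueLinks }) {
        console.log(`Handling ${request.url} (label: ${request.label})`);
        await enqueueLinks({ selector: 'a.product-link' });
    },
});

await crawler.run();
await Actor.exit();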

Proxy Configuration on Apify

Apify provides residential and datacenter proxies that integrate seamlessly with Crawlee:

import { Actor } from 'apify';
import { PlaywrightCrawler } from 'crawlee';

await Actor.init();

// Fall back to an empty object when running locally without input
const input = (await Actor.getInput()) ?? {};

const proxyConfiguration = await Actor.createProxyConfiguration({
    groups: input.proxyConfiguration?.groups || ['RESIDENTIAL'],
    countryCode: input.proxyConfiguration?.countryCode || 'US',
});

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    async requestHandler({ page }) {
        // All requests automatically use configured proxies
        const content = await page.content();
        // Process content...
    },
});

await crawler.run(['https://example.com']);
await Actor.exit();

Scheduling and Monitoring

Once deployed, you can schedule your Crawlee scraper to run automatically:

  1. Scheduled Runs: Configure cron-style schedules in the Apify Console
  2. Webhooks: Trigger runs via HTTP requests or integrate with other services (see the sketch after this list)
  3. Monitoring: View logs, performance metrics, and receive alerts
  4. Data Retention: Automatic data storage with configurable retention policies
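
Webhooks (item 2) can also be registered from inside the Actor itself via Actor.addWebhook(); a sketch with a placeholder target URL (webhooks are a platform feature, so this only takes effect on Apify):

// Notify an external service whenever this run finishes successfully
await Actor.addWebhook({
    eventTypes: ['ACTOR.RUN.SUCCEEDED'],
    requestUrl: 'https://your-service.example.com/apify-webhook',
});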

Best Practices for Apify Deployment

1. Use Environment Variables for Secrets

// Read secrets from environment variables; define them as secret
// environment variables in your Actor's settings in the Apify Console
await Actor.init();
const apiKey = process.env.API_KEY;

2. Implement Proper Error Handling

const crawler = new PlaywrightCrawler({
    maxRequestRetries: 3,
    async failedRequestHandler({ request }) {
        await Actor.pushData({
            url: request.url,
            error: true,
            errorMessage: request.errorMessages,
        });
    },
});

3. Use Memory Efficiently

const crawler = new PlaywrightCrawler({
    maxConcurrency: 10, // Adjust based on available memory
    async requestHandler({ page }) {
        // Crawlee opens and closes pages for you, so there is no need to call page.close().
        // Keep memory low by pushing data as you go instead of accumulating it in arrays.
        await Actor.pushData({ title: await page.title() });
    },
});

4. Log Important Information

Actor.log.info('Starting crawl...');
Actor.log.debug(`Processing URL: ${url}`);
Actor.log.exception(error, 'Failed to extract data');

Migrating Existing Crawlee Projects

If you have an existing Crawlee project, migrating to Apify is straightforward:

  1. Wrap your code with Actor.init() and Actor.exit() (see the sketch after this list)
  2. Replace local storage calls with Actor.pushData() (Crawlee's own Dataset.pushData() also works once Actor.init() has run)
  3. Add input handling with Actor.getInput()
  4. Create .actor/actor.json and INPUT_SCHEMA.json files
  5. Test locally with apify run
  6. Deploy with apify push
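
A minimal sketch of steps 1-3 applied to a hypothetical existing CheerioCrawler project:

import { Actor } from 'apify';
import { CheerioCrawler } from 'crawlee';

await Actor.init(); // step 1: initialize the Actor

// step 3: read input (with local defaults)
const { startUrls = ['https://crawlee.dev'] } = (await Actor.getInput()) ?? {};

const crawler = new CheerioCrawler({
    async requestHandler({ request, $ }) {
        // step 2: save results via Actor.pushData()
        await Actor.pushData({ url: request.url, title: $('title').text() });
    },
});

await crawler.run(startUrls);
await Actor.exit(); // step 1: exit cleanly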

Conclusion

The integration between Crawlee and the Apify platform provides a powerful combination for web scraping projects. You can develop and test locally with Crawlee's excellent developer experience, then deploy to Apify's cloud infrastructure with minimal changes. This approach gives you the best of both worlds: local development flexibility and cloud scalability for production workloads.

Whether you're scraping small websites or running large-scale data extraction operations, the Crawlee-Apify combination provides the tools and infrastructure needed for reliable, maintainable web scraping solutions.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
