Does Crawlee integrate with Apify SDK?
Yes, Crawlee integrates seamlessly with the Apify SDK, providing a powerful combination for building, deploying, and scaling web scraping projects. Crawlee was developed by Apify and is designed to work natively with the Apify platform, allowing developers to leverage cloud infrastructure, distributed storage, and advanced scheduling capabilities.
The integration between Crawlee and Apify SDK enables you to run your Crawlee scrapers locally during development and deploy them to the Apify cloud platform for production use without significant code changes. This makes it ideal for developers who want the flexibility of local development with the power of cloud-based execution.
Understanding the Relationship Between Crawlee and Apify
Crawlee is a modern web scraping and browser automation library for Node.js and Python, while the Apify SDK provides additional platform-specific features for running scrapers on Apify's cloud infrastructure. When you use Crawlee with the Apify platform, you automatically gain access to:
- Distributed storage for scraped data, screenshots, and key-value stores
- Proxy management with automatic rotation and session handling
- Scheduled runs for periodic scraping tasks
- Monitoring and logging through the Apify console
- Actor input/output handling for easy configuration
- Webhooks for event-driven workflows
The integration is seamless because Crawlee automatically detects when it's running on the Apify platform and uses Apify-specific storage and configuration without requiring code changes.
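For instance, you can check for the platform yourself with the Apify SDK's Actor.isAtHome() helper, which reads the APIFY_IS_AT_HOME environment variable that the platform sets. A minimal sketch:
import { Actor } from 'apify';

await Actor.init();

// Actor.isAtHome() returns true only when the code runs on the Apify platform,
// so you can branch on environment-specific behaviour if you ever need to
if (Actor.isAtHome()) {
    console.log('Running on Apify - cloud storage and proxies are available.');
} else {
    console.log('Running locally - Crawlee falls back to file-based storage.');
}

await Actor.exit();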
Setting Up Crawlee with Apify SDK
Installation
To use Crawlee with Apify SDK capabilities, you can start with a standard Crawlee installation:
# Install Crawlee
npm install crawlee
# For Apify-specific features, install the Apify CLI
npm install -g apify-cli
# Initialize an Apify project with Crawlee
apify create my-scraper
When creating an Apify project, you'll be prompted to choose a template. Select one of the Crawlee templates (Cheerio, Playwright, or Puppeteer) based on your scraping needs.
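If you want to try the generated project before deploying it anywhere, the Apify CLI can run it locally. A quick sketch, assuming the project name my-scraper chosen above:
# Move into the generated project
cd my-scraper

# Run the Actor locally - results land in the local ./storage directory
apify run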
Basic Crawlee Script with Apify Integration
Here's a basic example showing how Crawlee automatically integrates with Apify platform features:
import { PlaywrightCrawler } from 'crawlee';

// This crawler works both locally and on Apify platform
const crawler = new PlaywrightCrawler({
    // When running on Apify, this automatically uses Apify storage
    async requestHandler({ page, request, enqueueLinks, pushData }) {
        console.log(`Processing: ${request.url}`);

        const title = await page.title();
        const heading = await page.locator('h1').textContent();

        // pushData automatically uses Dataset API on Apify
        await pushData({
            url: request.url,
            title,
            heading,
        });

        // enqueueLinks automatically uses RequestQueue on Apify
        await enqueueLinks({
            globs: ['https://example.com/**'],
        });
    },
    maxRequestsPerCrawl: 50,
});

// Add initial URLs
await crawler.addRequests(['https://example.com']);

// Run the crawler
await crawler.run();

console.log('Crawler finished.');
This same code runs locally using file-based storage and on Apify using cloud-based distributed storage.
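Locally, for example, the results typically end up in Crawlee's default file-based storage directories, roughly laid out like this:
storage/
├── datasets/default/          # items saved with pushData(), one JSON file per item
├── key_value_stores/default/  # values saved with setValue(), including INPUT.json
└── request_queues/default/    # requests added by addRequests() and enqueueLinks()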
Accessing Apify Platform Features from Crawlee
Using Apify Storage
When your Crawlee scraper runs on the Apify platform, it automatically gains access to three types of storage:
1. Dataset Storage
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ page, pushData }) {
        // This data is stored in Apify Dataset on platform
        await pushData({
            product: await page.locator('.product-name').textContent(),
            price: await page.locator('.price').textContent(),
            timestamp: new Date().toISOString(),
        });
    },
});
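If you later need to read the collected items back in code, for example to post-process them, here is a minimal sketch using the same default Dataset that pushData() writes to:
import { Dataset } from 'crawlee';

// Opens the default dataset - local files in development, Apify Dataset on the platform
const dataset = await Dataset.open();

// getData() returns the stored items along with paging metadata
const { items } = await dataset.getData();
console.log(`Collected ${items.length} items`);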
2. Key-Value Store
import { KeyValueStore } from 'crawlee';

// Store arbitrary data like screenshots or JSON files
// (the `page` object below is assumed to come from a crawler's requestHandler)
const store = await KeyValueStore.open();
await store.setValue('screenshot', await page.screenshot(), { contentType: 'image/png' });
await store.setValue('config', { lastRun: new Date(), itemsProcessed: 100 });
3. Request Queue
import { RequestQueue } from 'crawlee';
// Manage URLs to be crawled
const queue = await RequestQueue.open();
await queue.addRequest({ url: 'https://example.com/page1' });
await queue.addRequest({ url: 'https://example.com/page2' });
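You can also attach metadata to queued requests, for example to route different page types to different handling logic later. A small sketch (the DETAIL label is an arbitrary name used here for illustration):
// userData travels with the request and is available again in the requestHandler
await queue.addRequest({
    url: 'https://example.com/product/123',
    userData: { label: 'DETAIL' },
});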
Reading Apify Actor Input
When running as an Apify Actor, you can read configuration from the Actor input:
import { Actor } from 'apify';
import { PlaywrightCrawler } from 'crawlee';
await Actor.init();
// Get input from Apify platform
const input = (await Actor.getInput()) ?? {};
const { startUrls, maxPages, searchTerm } = input;

const crawler = new PlaywrightCrawler({
    maxRequestsPerCrawl: maxPages || 100,
    async requestHandler({ page, pushData }) {
        // Use input parameters in your scraping logic
        if (searchTerm) {
            await page.fill('input[type="search"]', searchTerm);
            await page.click('button[type="submit"]');
        }

        // Scrape and store data
        await pushData({
            /* scraped data */
        });
    },
});
await crawler.addRequests(startUrls);
await crawler.run();
await Actor.exit();
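If you prefer, the same lifecycle can be written with the Actor.main() wrapper, which calls Actor.init() and Actor.exit() for you. A minimal sketch with a deliberately simplified handler:
import { Actor } from 'apify';
import { PlaywrightCrawler } from 'crawlee';

// Actor.main() wraps init/exit and reports a failed run if the callback throws
await Actor.main(async () => {
    const input = (await Actor.getInput()) ?? {};

    const crawler = new PlaywrightCrawler({
        async requestHandler({ page, request, pushData }) {
            await pushData({ url: request.url, title: await page.title() });
        },
    });

    await crawler.run(input.startUrls ?? ['https://example.com']);
});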
Python Integration with Apify SDK
Crawlee for Python also integrates with Apify platform when deployed as an Actor:
from crawlee.playwright_crawler import PlaywrightCrawler
from apify import Actor


async def main():
    async with Actor:
        # Get Actor input
        actor_input = await Actor.get_input() or {}
        start_urls = actor_input.get('start_urls', ['https://example.com'])

        crawler = PlaywrightCrawler(
            max_requests_per_crawl=50,
        )

        @crawler.router.default_handler
        async def request_handler(context):
            # Extract data
            data = {
                'url': context.request.url,
                'title': await context.page.title(),
            }

            # Store in Apify Dataset
            await context.push_data(data)

            # Enqueue links
            await context.enqueue_links()

        await crawler.run(start_urls)
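To actually execute this coroutine, Apify's Python templates typically call it from a small entry point (often a separate __main__.py). A minimal sketch using asyncio:
import asyncio

if __name__ == '__main__':
    # Run the async main() defined above
    asyncio.run(main())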
Deploying Crawlee to Apify Platform
Using Apify CLI
# Login to Apify
apify login
# Create a new Actor
apify create my-crawler
# Deploy to Apify platform
apify push
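Once the Actor is deployed, you can also start it programmatically with the apify-client package. A rough sketch in which the token, the username/Actor name, and the input values are placeholders:
import { ApifyClient } from 'apify-client';

// Replace the token and Actor name with your own values
const client = new ApifyClient({ token: 'MY_APIFY_TOKEN' });

// call() starts the Actor on the platform and waits for the run to finish
const run = await client.actor('my-username/my-crawler').call({
    startUrls: [{ url: 'https://example.com' }],
});

// Fetch the items the run stored in its default dataset
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(`Run finished with ${items.length} items`);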
Configuring Actor Settings
Create an actor.json file to configure your Actor:
{
    "actorSpecification": 1,
    "name": "my-crawlee-scraper",
    "version": "1.0.0",
    "buildTag": "latest",
    "environmentVariables": {},
    "dockerfile": "./Dockerfile",
    "readme": "./README.md",
    "input": "./input_schema.json",
    "storages": {
        "dataset": {
            "actorSpecification": 1,
            "views": {
                "overview": {
                    "title": "Overview",
                    "transformation": {
                        "fields": ["url", "title", "price"]
                    }
                }
            }
        }
    }
}
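The referenced Dockerfile can stay small when it builds on one of Apify's prebuilt base images. A rough sketch for a Playwright-based Actor (the image tag is an assumption; use whatever your template ships with):
# Apify base image with Node.js, Playwright and Chromium preinstalled
FROM apify/actor-node-playwright-chrome:20

# Install production dependencies first to take advantage of Docker layer caching
COPY package*.json ./
RUN npm install --omit=dev

# Copy the source code and start the Actor
COPY . ./
CMD ["npm", "start"]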
Advanced Features: Proxy Integration
When running on the Apify platform, your Crawlee scrapers can use Apify Proxy through the Apify SDK, while locally you can supply your own proxy URLs. Crawlee's ProxyConfiguration handles the rotation in both cases:
import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';

// Rotates between the proxy URLs you provide
const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: ['http://proxy1.com:8000', 'http://proxy2.com:8000'],
});

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    async requestHandler({ page, request, pushData }) {
        // Your scraping logic with automatic proxy rotation
        await pushData({
            url: request.url,
            content: await page.content(),
        });
    },
});
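On the platform itself, Apify Proxy is typically wired in through the Apify SDK's Actor.createProxyConfiguration() rather than hard-coded URLs. A minimal sketch (the RESIDENTIAL group and country code are examples and require the corresponding proxy access on your account):
import { Actor } from 'apify';
import { PlaywrightCrawler } from 'crawlee';

await Actor.init();

// Returns a ProxyConfiguration backed by Apify Proxy;
// locally it needs your Apify account (e.g. after `apify login`)
const proxyConfiguration = await Actor.createProxyConfiguration({
    groups: ['RESIDENTIAL'],
    countryCode: 'US',
});

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    async requestHandler({ page, request, pushData }) {
        await pushData({ url: request.url, title: await page.title() });
    },
});

await crawler.run(['https://example.com']);
await Actor.exit();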
Monitoring and Debugging
When running on Apify, you get built-in monitoring features:
import { log } from 'crawlee';
// Logs are automatically sent to Apify console
log.info('Starting crawler');
log.debug('Processing URL', { url: request.url });
log.warning('Rate limit approaching');
log.error('Failed to process page', { error: error.message });
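During debugging you may also want to lower the log level so that log.debug() messages become visible. A small sketch:
import { log } from 'crawlee';

// The default level is INFO; DEBUG also prints log.debug() calls
log.setLevel(log.LEVELS.DEBUG);
log.debug('This message is now visible');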
Similar to how you can handle browser sessions in Puppeteer, Crawlee manages sessions automatically, and when running on Apify, these sessions are distributed across the cloud infrastructure for better reliability.
Benefits of Using Crawlee with Apify SDK
1. Environment Portability
Write once, run anywhere. Your Crawlee code works identically in local development and on the Apify cloud platform.
2. Automatic Scaling
Apify automatically scales your Crawlee scrapers based on workload, distributing requests across multiple instances when needed.
3. Persistent Storage
Data stored during scraping is automatically persisted in the cloud and accessible via API even after the scraper finishes.
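Once a run has finished, for example, its default dataset can be downloaded straight from the Apify API. A sketch in which the dataset ID and token are placeholders:
# Download the dataset items of a finished run as JSON
curl "https://api.apify.com/v2/datasets/<DATASET_ID>/items?format=json" \
  -H "Authorization: Bearer <YOUR_API_TOKEN>"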
4. Built-in Proxy Management
Access to Apify's proxy services with automatic rotation and residential IP options without additional configuration.
5. Scheduling and Webhooks
Schedule your Crawlee scrapers to run periodically and trigger webhooks on completion, giving you cloud-based orchestration for event-driven workflows.
Migrating Existing Crawlee Projects to Apify
If you have an existing Crawlee project, migrating to Apify is straightforward:
1. Add Apify initialization:
import { Actor } from 'apify';
await Actor.init();
// Your existing Crawlee code here
await Actor.exit();
2. Create an input schema (input_schema.json):
{
    "title": "My Crawler Input",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "startUrls": {
            "title": "Start URLs",
            "type": "array",
            "description": "URLs to start crawling from",
            "editor": "requestListSources"
        }
    },
    "required": ["startUrls"]
}
3. Deploy:
apify push
Handling Dynamic Content
When dealing with JavaScript-heavy websites, Crawlee's integration with Apify makes it easy to handle AJAX requests using Puppeteer or Playwright, with automatic resource management on the cloud platform:
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ page, pushData }) {
        // Wait for AJAX content to load
        await page.waitForSelector('.ajax-content');
        await page.waitForLoadState('networkidle');

        const data = await page.evaluate(() => {
            // Extract dynamically loaded content
            return Array.from(document.querySelectorAll('.item')).map(item => ({
                title: item.querySelector('.title')?.textContent,
                value: item.querySelector('.value')?.textContent,
            }));
        });

        await pushData(data);
    },
});
Conclusion
Crawlee's integration with Apify SDK provides a powerful, production-ready solution for web scraping projects. The seamless compatibility allows developers to build and test locally while deploying to a robust cloud infrastructure with minimal changes. Whether you're scraping small datasets or running large-scale distributed crawls, the Crawlee-Apify combination offers the tools, storage, and scaling capabilities needed for professional web scraping applications.
The automatic detection of the Apify environment, combined with unified APIs for storage and queue management, makes it possible to write portable code that works efficiently in both development and production environments. For developers serious about web scraping at scale, this integration represents one of the most developer-friendly solutions available today.