How do I Deploy Crawlee to AWS or Google Cloud?
Deploying Crawlee scrapers to cloud platforms like AWS and Google Cloud enables you to run web scraping tasks at scale with better reliability, cost efficiency, and automation. This guide covers multiple deployment strategies for both platforms, from serverless functions to containerized solutions.
Overview of Deployment Options
Crawlee can be deployed to cloud platforms in several ways:
- AWS Lambda: Serverless functions for lightweight scraping tasks
- AWS EC2: Virtual machines for long-running scrapers
- AWS ECS/Fargate: Container orchestration for scalable deployments
- Google Cloud Functions: Serverless execution for simple scrapers
- Google Cloud Run: Container-based serverless platform
- Google Kubernetes Engine (GKE): Full Kubernetes orchestration
Deploying to AWS Lambda
AWS Lambda is ideal for scheduled scraping tasks with predictable workloads. However, Lambda has limitations: a 15-minute maximum execution time and limited disk space (only /tmp is writable, 512 MB by default).
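Because of the hard 15-minute limit, long crawls should stop before Lambda terminates the invocation mid-run. One option (a sketch, using the remaining-time value Lambda passes in the handler context and an arbitrary one-minute safety margin) is to abort the crawler's autoscaled pool once the deadline approaches:

// Sketch: stop the crawl ~60 seconds before Lambda's hard timeout.
import { PlaywrightCrawler } from '@crawlee/playwright-crawler';

export const scrape = async (event, context) => {
  const deadline = Date.now() + context.getRemainingTimeInMillis() - 60_000;

  const crawler = new PlaywrightCrawler({
    async requestHandler({ crawler, request, page, log }) {
      if (Date.now() > deadline) {
        log.warning('Approaching the Lambda timeout, aborting the crawl');
        await crawler.autoscaledPool?.abort(); // lets in-flight requests finish, then stops
        return;
      }
      // ...normal request handling
    },
  });

  await crawler.run(['https://example.com']);
};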
Prerequisites
npm install --save-dev serverless serverless-plugin-typescript
npm install @crawlee/playwright-crawler @crawlee/core @sparticuz/chromium @aws-sdk/client-s3
Project Structure
Create a serverless.yml configuration:
service: crawlee-scraper

provider:
  name: aws
  runtime: nodejs18.x
  region: us-east-1
  memorySize: 3008
  timeout: 900
  environment:
    NODE_ENV: production
    CRAWLEE_STORAGE_DIR: /tmp/crawlee_storage

functions:
  scraper:
    handler: src/handler.scrape
    events:
      - schedule: rate(1 hour)
    layers:
      - arn:aws:lambda:us-east-1:764866452798:layer:chrome-aws-lambda:31

plugins:
  - serverless-plugin-typescript
Lambda Handler Implementation
// src/handler.js
import { PlaywrightCrawler } from '@crawlee/playwright-crawler';
import chromium from '@sparticuz/chromium';
import { Dataset } from '@crawlee/core';

export const scrape = async (event, context) => {
  const crawler = new PlaywrightCrawler({
    launchContext: {
      launchOptions: {
        executablePath: await chromium.executablePath(),
        args: chromium.args,
        headless: chromium.headless,
      },
    },
    maxRequestsPerCrawl: 50,
    async requestHandler({ request, page, enqueueLinks, log }) {
      log.info(`Processing ${request.url}...`);
      const title = await page.title();
      await Dataset.pushData({
        url: request.url,
        title,
        timestamp: new Date().toISOString(),
      });
      await enqueueLinks({
        globs: ['https://example.com/**'],
      });
    },
    failedRequestHandler({ request, log }) {
      log.error(`Request ${request.url} failed`);
    },
  });

  await crawler.run(['https://example.com']);

  // Collect the scraped data (the S3 upload itself is shown under Best Practices)
  const data = await Dataset.getData();

  return {
    statusCode: 200,
    body: JSON.stringify({
      message: 'Scraping completed',
      itemsScraped: data.items.length,
    }),
  };
};
Deploy to Lambda
# Install Serverless Framework globally
npm install -g serverless
# Configure AWS credentials
serverless config credentials --provider aws --key YOUR_KEY --secret YOUR_SECRET
# Deploy the function
serverless deploy
# Test the function
serverless invoke -f scraper
# View logs
serverless logs -f scraper -t
Deploying to AWS EC2
For long-running scrapers or when you need more control, EC2 instances are ideal. Similar to using Puppeteer with Docker, you can containerize your Crawlee application for EC2 deployment.
Dockerfile for Crawlee
FROM node:18-slim
# Install dependencies for Playwright
RUN apt-get update && apt-get install -y \
libnss3 \
libnspr4 \
libatk1.0-0 \
libatk-bridge2.0-0 \
libcups2 \
libdrm2 \
libxkbcommon0 \
libxcomposite1 \
libxdamage1 \
libxfixes3 \
libxrandr2 \
libgbm1 \
libasound2 \
&& rm -rf /var/lib/apt/lists/*
WORKDIR /app
# Copy package files
COPY package*.json ./
# Install dependencies
RUN npm ci --only=production
# Install Playwright browsers
RUN npx playwright install chromium
# Copy application files
COPY . .
# Set environment variables
ENV NODE_ENV=production
ENV CRAWLEE_STORAGE_DIR=/app/storage
# Run the scraper
CMD ["node", "src/main.js"]
EC2 Deployment Steps
# Build the Docker image
docker build -t crawlee-scraper .
# Push to Amazon ECR
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin YOUR_ACCOUNT.dkr.ecr.us-east-1.amazonaws.com
docker tag crawlee-scraper:latest YOUR_ACCOUNT.dkr.ecr.us-east-1.amazonaws.com/crawlee-scraper:latest
docker push YOUR_ACCOUNT.dkr.ecr.us-east-1.amazonaws.com/crawlee-scraper:latest
# Launch EC2 instance with Docker
# (Use EC2 console or AWS CLI)
# SSH into EC2 and run container
ssh -i your-key.pem ec2-user@your-instance-ip
docker pull YOUR_ACCOUNT.dkr.ecr.us-east-1.amazonaws.com/crawlee-scraper:latest
docker run -d --restart unless-stopped --name crawlee-scraper YOUR_ACCOUNT.dkr.ecr.us-east-1.amazonaws.com/crawlee-scraper:latest
Managing the Container with Docker Compose
Create a docker-compose.yml for easier management:
version: '3.8'

services:
  scraper:
    image: crawlee-scraper:latest
    restart: unless-stopped
    environment:
      - NODE_ENV=production
      - MAX_CONCURRENCY=10
    volumes:
      - ./storage:/app/storage
      - ./logs:/app/logs
    logging:
      driver: "json-file"
      options:
        max-size: "10m"
        max-file: "3"
Deploying to Google Cloud Functions
Google Cloud Functions provide a serverless option similar to AWS Lambda, but with different constraints and pricing.
Cloud Function Configuration
Create a package.json:
{
  "name": "crawlee-gcloud-function",
  "version": "1.0.0",
  "main": "index.js",
  "dependencies": {
    "@crawlee/cheerio-crawler": "^3.7.0",
    "@crawlee/core": "^3.7.0",
    "@google-cloud/functions-framework": "^3.3.0",
    "@google-cloud/storage": "^7.7.0"
  },
  "scripts": {
    "start": "functions-framework --target=scrapeWebsite"
  }
}
Function Implementation
// index.js
const { CheerioCrawler } = require('@crawlee/cheerio-crawler');
const { Dataset } = require('@crawlee/core');
const { Storage } = require('@google-cloud/storage');

exports.scrapeWebsite = async (req, res) => {
  const startUrl = req.body.url || 'https://example.com';

  const crawler = new CheerioCrawler({
    maxRequestsPerCrawl: 100,
    async requestHandler({ request, $, enqueueLinks, log }) {
      log.info(`Processing ${request.url}`);
      const title = $('title').text();
      const headings = $('h1').map((i, el) => $(el).text()).get();
      await Dataset.pushData({
        url: request.url,
        title,
        headings,
        scrapedAt: new Date().toISOString(),
      });
      await enqueueLinks({
        selector: 'a[href]',
        baseUrl: request.loadedUrl,
      });
    },
  });

  await crawler.run([startUrl]);

  // Upload results to Cloud Storage
  const data = await Dataset.getData();
  const storage = new Storage();
  const bucket = storage.bucket('your-bucket-name');
  const file = bucket.file(`scrapes/${Date.now()}.json`);
  await file.save(JSON.stringify(data.items, null, 2));

  res.status(200).json({
    success: true,
    itemsScraped: data.items.length,
    storageLocation: `gs://${bucket.name}/${file.name}`,
  });
};
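Two practical caveats with this function. Depending on the runtime generation, the deployed code directory may be read-only, so it is safer to point Crawlee's storage at /tmp; and a warm instance keeps its filesystem between invocations, so results from a previous run may still sit in the default dataset. A sketch of one way to handle both, using Crawlee's default storage client:

// At the top of index.js, before any crawler is created:
process.env.CRAWLEE_STORAGE_DIR = process.env.CRAWLEE_STORAGE_DIR || '/tmp/crawlee_storage';

// After uploading the results to Cloud Storage:
const dataset = await Dataset.open();
await dataset.drop(); // clear the default dataset so a warm instance starts fresh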
Deploy to Cloud Functions
# Install Google Cloud SDK
# Visit: https://cloud.google.com/sdk/docs/install
# Authenticate
gcloud auth login
# Set project
gcloud config set project YOUR_PROJECT_ID
# Deploy function
gcloud functions deploy scrapeWebsite \
--runtime nodejs18 \
--trigger-http \
--allow-unauthenticated \
--memory 2GB \
--timeout 540s \
--region us-central1
# Test the function
curl -X POST https://REGION-PROJECT_ID.cloudfunctions.net/scrapeWebsite \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com"}'
Deploying to Google Cloud Run
Google Cloud Run offers more flexibility than Cloud Functions and supports containerized applications with longer execution times.
Cloud Run Dockerfile
FROM node:18-slim
# Install Playwright dependencies
RUN apt-get update && apt-get install -y \
wget \
ca-certificates \
fonts-liberation \
libasound2 \
libatk-bridge2.0-0 \
libatk1.0-0 \
libcups2 \
libdbus-1-3 \
libgdk-pixbuf2.0-0 \
libnspr4 \
libnss3 \
libx11-xcb1 \
libxcomposite1 \
libxdamage1 \
libxrandr2 \
xdg-utils \
&& rm -rf /var/lib/apt/lists/*
WORKDIR /usr/src/app
COPY package*.json ./
RUN npm ci --only=production
# Install Playwright
RUN npx playwright install chromium
COPY . .
# Cloud Run requires port 8080
ENV PORT=8080
ENV NODE_ENV=production
CMD ["node", "server.js"]
Express Server Wrapper
// server.js
const express = require('express');
const { PlaywrightCrawler } = require('@crawlee/playwright-crawler');
const { Dataset } = require('@crawlee/core');

const app = express();
app.use(express.json());

app.post('/scrape', async (req, res) => {
  const { url, maxPages = 10 } = req.body;

  if (!url) {
    return res.status(400).json({ error: 'URL is required' });
  }

  try {
    const crawler = new PlaywrightCrawler({
      maxRequestsPerCrawl: maxPages,
      async requestHandler({ request, page, log }) {
        log.info(`Scraping ${request.url}`);
        const data = await page.evaluate(() => ({
          title: document.title,
          url: window.location.href,
          textContent: document.body.innerText.substring(0, 1000),
        }));
        await Dataset.pushData(data);
      },
    });

    await crawler.run([url]);
    const results = await Dataset.getData();

    res.json({
      success: true,
      items: results.items,
      count: results.items.length,
    });
  } catch (error) {
    res.status(500).json({ error: error.message });
  }
});

const PORT = process.env.PORT || 8080;
app.listen(PORT, () => {
  console.log(`Server running on port ${PORT}`);
});
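One caveat with this wrapper: Crawlee's default storage lives for the lifetime of the container, and depending on configuration it may not be purged between runs in the same process, so a second POST to /scrape can return items from earlier crawls. A sketch of one way around this, using a uniquely named dataset per request (the run ID scheme is this guide's invention) and dropping it afterwards:

// Inside the /scrape handler – isolate each run's results in its own dataset
const runId = `run-${Date.now()}`;
const dataset = await Dataset.open(runId);

const crawler = new PlaywrightCrawler({
  maxRequestsPerCrawl: maxPages,
  async requestHandler({ request, page, log }) {
    log.info(`Scraping ${request.url}`);
    const data = await page.evaluate(() => ({
      title: document.title,
      url: window.location.href,
    }));
    await dataset.pushData(data); // push to the per-run dataset, not the default one
  },
});

await crawler.run([url]);

const results = await dataset.getData();
await dataset.drop(); // clean up so the container's storage doesn't keep growing
res.json({ success: true, items: results.items, count: results.items.length });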
Deploy to Cloud Run
# Build and push to Container Registry
gcloud builds submit --tag gcr.io/YOUR_PROJECT_ID/crawlee-scraper
# Deploy to Cloud Run
gcloud run deploy crawlee-scraper \
--image gcr.io/YOUR_PROJECT_ID/crawlee-scraper \
--platform managed \
--region us-central1 \
--memory 2Gi \
--cpu 2 \
--timeout 900 \
--allow-unauthenticated \
--max-instances 10
# Test the deployment
curl -X POST https://crawlee-scraper-xxxxx.run.app/scrape \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com", "maxPages": 5}'
Best Practices for Cloud Deployment
1. Storage Configuration
Store scraped data in cloud storage instead of local disk:
import { S3Client, PutObjectCommand } from '@aws-sdk/client-s3';
import { Dataset } from '@crawlee/core';

async function saveToS3(data) {
  const s3Client = new S3Client({ region: 'us-east-1' });
  await s3Client.send(new PutObjectCommand({
    Bucket: 'your-scraping-bucket',
    Key: `scrapes/${Date.now()}.json`,
    Body: JSON.stringify(data),
    ContentType: 'application/json',
  }));
}
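For example, you might call it right after the crawl finishes, using the same default dataset as the handlers above:

// After crawler.run() completes:
const { items } = await Dataset.getData();
await saveToS3(items);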
2. Error Handling and Monitoring
Implement comprehensive logging and monitoring:
import { PlaywrightCrawler, log } from '@crawlee/playwright-crawler';

const crawler = new PlaywrightCrawler({
  maxRequestRetries: 3,
  async failedRequestHandler({ request, log }) {
    log.error(`Request failed: ${request.url}`, {
      errorMessage: request.errorMessages,
      retryCount: request.retryCount,
    });
    // Send to monitoring service
    await sendToCloudWatch({
      metric: 'FailedRequest',
      url: request.url,
      timestamp: new Date(),
    });
  },
});
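The sendToCloudWatch helper above is not part of Crawlee; a minimal sketch using the AWS SDK's CloudWatch client (the metric namespace and dimension name are this guide's assumptions) could look like this:

import { CloudWatchClient, PutMetricDataCommand } from '@aws-sdk/client-cloudwatch';

const cloudWatch = new CloudWatchClient({ region: 'us-east-1' });

async function sendToCloudWatch({ metric, url, timestamp }) {
  // Publish a simple counter metric; the namespace is arbitrary
  await cloudWatch.send(new PutMetricDataCommand({
    Namespace: 'CrawleeScraper',
    MetricData: [{
      MetricName: metric,
      Timestamp: timestamp,
      Value: 1,
      Unit: 'Count',
      Dimensions: [{ Name: 'Url', Value: url }],
    }],
  }));
}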
3. Resource Optimization
Configure Crawlee for cloud environments:
const crawler = new PlaywrightCrawler({
  maxConcurrency: 5,
  maxRequestsPerCrawl: 100,
  requestHandlerTimeoutSecs: 60,
  launchContext: {
    launchOptions: {
      headless: true,
      args: [
        '--no-sandbox',
        '--disable-setuid-sandbox',
        '--disable-dev-shm-usage',
        '--disable-gpu',
      ],
    },
  },
});
4. Scheduled Execution
Use cloud-native scheduling for regular scraping tasks:
AWS EventBridge:
aws events put-rule \
--name scraper-schedule \
--schedule-expression "rate(1 hour)"
aws events put-targets \
--rule scraper-schedule \
--targets "Id"="1","Arn"="arn:aws:lambda:REGION:ACCOUNT:function:crawlee-scraper"
Google Cloud Scheduler:
gcloud scheduler jobs create http scraper-job \
--schedule="0 */1 * * *" \
--uri="https://crawlee-scraper-xxxxx.run.app/scrape" \
--http-method=POST \
--message-body='{"url":"https://example.com"}'
Monitoring and Debugging
CloudWatch Logs (AWS)
import { CloudWatchLogsClient, PutLogEventsCommand } from '@aws-sdk/client-cloudwatch-logs';

async function logToCloudWatch(message, level = 'INFO') {
  const client = new CloudWatchLogsClient({ region: 'us-east-1' });
  await client.send(new PutLogEventsCommand({
    logGroupName: '/aws/lambda/crawlee-scraper',
    logStreamName: new Date().toISOString().split('T')[0],
    logEvents: [{
      message: JSON.stringify({ level, message, timestamp: Date.now() }),
      timestamp: Date.now(),
    }],
  }));
}
Google Cloud Logging
const { Logging } = require('@google-cloud/logging');

const logging = new Logging();
const log = logging.log('crawlee-scraper');

async function logToGCloud(message, severity = 'INFO') {
  const metadata = { severity, resource: { type: 'cloud_run_revision' } };
  const entry = log.entry(metadata, message);
  await log.write(entry);
}
Cost Optimization Tips
- Use CheerioCrawler for simple HTML scraping instead of PlaywrightCrawler to reduce memory and CPU usage
- Implement request filtering to avoid scraping unnecessary pages (see the sketch after this list)
- Cache results in Redis or cloud storage to minimize redundant scraping
- Use spot instances (AWS) or preemptible VMs (GCP) for cost savings
- Set appropriate concurrency limits based on your cloud resources
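As an illustration of the request-filtering tip, enqueueLinks accepts glob patterns, and recent Crawlee versions also support an exclude option, so irrelevant sections of a site can be skipped before they cost any compute (if exclude is unavailable in your version, a transformRequestFunction can serve the same purpose). The patterns below are placeholders:

await enqueueLinks({
  // Only follow product pages...
  globs: ['https://example.com/products/**'],
  // ...and skip sections that add cost without adding data
  exclude: ['https://example.com/**/reviews/**', 'https://example.com/cart/**'],
});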
Conclusion
Deploying Crawlee to AWS or Google Cloud provides scalability, reliability, and cost-effectiveness for web scraping operations. Choose serverless options (Lambda, Cloud Functions) for scheduled, lightweight tasks, or containerized solutions (EC2, Cloud Run, GKE) for complex, long-running scrapers. Always implement proper error handling, monitoring, and storage solutions to ensure production-ready deployments.
For more information on browser automation in cloud environments, check out handling browser sessions in Puppeteer and running multiple pages in parallel with Puppeteer.