How do I Deploy Crawlee to AWS or Google Cloud?
Deploying Crawlee scrapers to cloud platforms like AWS and Google Cloud enables you to run web scraping tasks at scale with better reliability, cost efficiency, and automation. This guide covers multiple deployment strategies for both platforms, from serverless functions to containerized solutions.
Overview of Deployment Options
Crawlee can be deployed to cloud platforms in several ways:
- AWS Lambda: Serverless functions for lightweight scraping tasks
- AWS EC2: Virtual machines for long-running scrapers
- AWS ECS/Fargate: Container orchestration for scalable deployments
- Google Cloud Functions: Serverless execution for simple scrapers
- Google Cloud Run: Container-based serverless platform
- Google Kubernetes Engine (GKE): Full Kubernetes orchestration
Deploying to AWS Lambda
AWS Lambda is ideal for scheduled scraping tasks with predictable workloads. However, Lambda has limitations: a 15-minute maximum execution time and limited disk space (only /tmp is writable, 512 MB by default).
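Because of the hard 15-minute limit, long crawls should stop before Lambda terminates the invocation mid-run. One option (a sketch, using the remaining-time value Lambda passes in the handler context and an arbitrary one-minute safety margin) is to abort the crawler's autoscaled pool once the deadline approaches:

// Sketch: stop the crawl ~60 seconds before Lambda's hard timeout.
import { PlaywrightCrawler } from '@crawlee/playwright-crawler';

export const scrape = async (event, context) => {
  const deadline = Date.now() + context.getRemainingTimeInMillis() - 60_000;

  const crawler = new PlaywrightCrawler({
    async requestHandler({ crawler, request, page, log }) {
      if (Date.now() > deadline) {
        log.warning('Approaching the Lambda timeout, aborting the crawl');
        await crawler.autoscaledPool?.abort(); // lets in-flight requests finish, then stops
        return;
      }
      // ...normal request handling
    },
  });

  await crawler.run(['https://example.com']);
};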
Prerequisites
npm install --save-dev serverless serverless-plugin-typescript
npm install @crawlee/playwright-crawler @crawlee/core @sparticuz/chromium @aws-sdk/client-s3
Project Structure
Create a serverless.yml configuration:
service: crawlee-scraper

provider:
  name: aws
  runtime: nodejs18.x
  region: us-east-1
  memorySize: 3008
  timeout: 900
  environment:
    NODE_ENV: production
    CRAWLEE_STORAGE_DIR: /tmp/crawlee_storage

functions:
  scraper:
    handler: src/handler.scrape
    events:
      - schedule: rate(1 hour)
    layers:
      - arn:aws:lambda:us-east-1:764866452798:layer:chrome-aws-lambda:31

plugins:
  - serverless-plugin-typescript
Lambda Handler Implementation
// src/handler.js
import { PlaywrightCrawler } from '@crawlee/playwright-crawler';
import chromium from '@sparticuz/chromium';
import { Dataset } from '@crawlee/core';

export const scrape = async (event, context) => {
  const crawler = new PlaywrightCrawler({
    launchContext: {
      launchOptions: {
        executablePath: await chromium.executablePath(),
        args: chromium.args,
        headless: chromium.headless,
      },
    },
    maxRequestsPerCrawl: 50,
    async requestHandler({ request, page, enqueueLinks, log }) {
      log.info(`Processing ${request.url}...`);
      const title = await page.title();
      await Dataset.pushData({
        url: request.url,
        title,
        timestamp: new Date().toISOString(),
      });
      await enqueueLinks({
        globs: ['https://example.com/**'],
      });
    },
    failedRequestHandler({ request, log }) {
      log.error(`Request ${request.url} failed`);
    },
  });

  await crawler.run(['https://example.com']);

  // Collect the scraped data (the S3 upload itself is shown under Best Practices)
  const data = await Dataset.getData();

  return {
    statusCode: 200,
    body: JSON.stringify({
      message: 'Scraping completed',
      itemsScraped: data.items.length,
    }),
  };
};
Deploy to Lambda
# Install Serverless Framework globally
npm install -g serverless
# Configure AWS credentials
serverless config credentials --provider aws --key YOUR_KEY --secret YOUR_SECRET
# Deploy the function
serverless deploy
# Test the function
serverless invoke -f scraper
# View logs
serverless logs -f scraper -t
Deploying to AWS EC2
For long-running scrapers or when you need more control, EC2 instances are ideal. Similar to using Puppeteer with Docker, you can containerize your Crawlee application for EC2 deployment.
Dockerfile for Crawlee
FROM node:18-slim
# Install dependencies for Playwright
RUN apt-get update && apt-get install -y \
libnss3 \
libnspr4 \
libatk1.0-0 \
libatk-bridge2.0-0 \
libcups2 \
libdrm2 \
libxkbcommon0 \
libxcomposite1 \
libxdamage1 \
libxfixes3 \
libxrandr2 \
libgbm1 \
libasound2 \
&& rm -rf /var/lib/apt/lists/*
WORKDIR /app
# Copy package files
COPY package*.json ./
# Install dependencies
RUN npm ci --only=production
# Install Playwright browsers
RUN npx playwright install chromium
# Copy application files
COPY . .
# Set environment variables
ENV NODE_ENV=production
ENV CRAWLEE_STORAGE_DIR=/app/storage
# Run the scraper
CMD ["node", "src/main.js"]
EC2 Deployment Steps
# Build the Docker image
docker build -t crawlee-scraper .
# Push to Amazon ECR
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin YOUR_ACCOUNT.dkr.ecr.us-east-1.amazonaws.com
docker tag crawlee-scraper:latest YOUR_ACCOUNT.dkr.ecr.us-east-1.amazonaws.com/crawlee-scraper:latest
docker push YOUR_ACCOUNT.dkr.ecr.us-east-1.amazonaws.com/crawlee-scraper:latest
# Launch EC2 instance with Docker
# (Use EC2 console or AWS CLI)
# SSH into EC2 and run container
ssh -i your-key.pem ec2-user@your-instance-ip
docker pull YOUR_ACCOUNT.dkr.ecr.us-east-1.amazonaws.com/crawlee-scraper:latest
docker run -d --restart unless-stopped --name crawlee-scraper YOUR_ACCOUNT.dkr.ecr.us-east-1.amazonaws.com/crawlee-scraper:latest
Managing the Container with Docker Compose
Create a docker-compose.yml for easier management:
version: '3.8'

services:
  scraper:
    image: crawlee-scraper:latest
    restart: unless-stopped
    environment:
      - NODE_ENV=production
      - MAX_CONCURRENCY=10
    volumes:
      - ./storage:/app/storage
      - ./logs:/app/logs
    logging:
      driver: "json-file"
      options:
        max-size: "10m"
        max-file: "3"
Deploying to Google Cloud Functions
Google Cloud Functions provide a serverless option similar to AWS Lambda, but with different constraints and pricing.
Cloud Function Configuration
Create a package.json:
{
  "name": "crawlee-gcloud-function",
  "version": "1.0.0",
  "main": "index.js",
  "dependencies": {
    "@crawlee/cheerio-crawler": "^3.7.0",
    "@crawlee/core": "^3.7.0",
    "@google-cloud/functions-framework": "^3.3.0",
    "@google-cloud/storage": "^7.7.0"
  },
  "scripts": {
    "start": "functions-framework --target=scrapeWebsite"
  }
}
Function Implementation
// index.js
const { CheerioCrawler } = require('@crawlee/cheerio-crawler');
const { Dataset } = require('@crawlee/core');
const { Storage } = require('@google-cloud/storage');

exports.scrapeWebsite = async (req, res) => {
  const startUrl = req.body.url || 'https://example.com';

  const crawler = new CheerioCrawler({
    maxRequestsPerCrawl: 100,
    async requestHandler({ request, $, enqueueLinks, log }) {
      log.info(`Processing ${request.url}`);
      const title = $('title').text();
      const headings = $('h1').map((i, el) => $(el).text()).get();
      await Dataset.pushData({
        url: request.url,
        title,
        headings,
        scrapedAt: new Date().toISOString(),
      });
      await enqueueLinks({
        selector: 'a[href]',
        baseUrl: request.loadedUrl,
      });
    },
  });

  await crawler.run([startUrl]);

  // Upload results to Cloud Storage
  const data = await Dataset.getData();
  const storage = new Storage();
  const bucket = storage.bucket('your-bucket-name');
  const file = bucket.file(`scrapes/${Date.now()}.json`);
  await file.save(JSON.stringify(data.items, null, 2));

  res.status(200).json({
    success: true,
    itemsScraped: data.items.length,
    storageLocation: `gs://${bucket.name}/${file.name}`,
  });
};
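Two practical caveats with this function. Depending on the runtime generation, the deployed code directory may be read-only, so it is safer to point Crawlee's storage at /tmp; and a warm instance keeps its filesystem between invocations, so results from a previous run may still sit in the default dataset. A sketch of one way to handle both, using Crawlee's default storage client:

// At the top of index.js, before any crawler is created:
process.env.CRAWLEE_STORAGE_DIR = process.env.CRAWLEE_STORAGE_DIR || '/tmp/crawlee_storage';

// After uploading the results to Cloud Storage:
const dataset = await Dataset.open();
await dataset.drop(); // clear the default dataset so a warm instance starts fresh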
Deploy to Cloud Functions
# Install Google Cloud SDK
# Visit: https://cloud.google.com/sdk/docs/install
# Authenticate
gcloud auth login
# Set project
gcloud config set project YOUR_PROJECT_ID
# Deploy function
gcloud functions deploy scrapeWebsite \
--runtime nodejs18 \
--trigger-http \
--allow-unauthenticated \
--memory 2GB \
--timeout 540s \
--region us-central1
# Test the function
curl -X POST https://REGION-PROJECT_ID.cloudfunctions.net/scrapeWebsite \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com"}'
Deploying to Google Cloud Run
Google Cloud Run offers more flexibility than Cloud Functions and supports containerized applications with longer execution times.
Cloud Run Dockerfile
FROM node:18-slim
# Install Playwright dependencies
RUN apt-get update && apt-get install -y \
wget \
ca-certificates \
fonts-liberation \
libasound2 \
libatk-bridge2.0-0 \
libatk1.0-0 \
libcups2 \
libdbus-1-3 \
libgdk-pixbuf2.0-0 \
libnspr4 \
libnss3 \
libx11-xcb1 \
libxcomposite1 \
libxdamage1 \
libxrandr2 \
xdg-utils \
&& rm -rf /var/lib/apt/lists/*
WORKDIR /usr/src/app
COPY package*.json ./
RUN npm ci --only=production
# Install Playwright
RUN npx playwright install chromium
COPY . .
# Cloud Run requires port 8080
ENV PORT=8080
ENV NODE_ENV=production
CMD ["node", "server.js"]
Express Server Wrapper
// server.js
const express = require('express');
const { PlaywrightCrawler } = require('@crawlee/playwright-crawler');
const { Dataset } = require('@crawlee/core');

const app = express();
app.use(express.json());

app.post('/scrape', async (req, res) => {
  const { url, maxPages = 10 } = req.body;

  if (!url) {
    return res.status(400).json({ error: 'URL is required' });
  }

  try {
    const crawler = new PlaywrightCrawler({
      maxRequestsPerCrawl: maxPages,
      async requestHandler({ request, page, log }) {
        log.info(`Scraping ${request.url}`);
        const data = await page.evaluate(() => ({
          title: document.title,
          url: window.location.href,
          textContent: document.body.innerText.substring(0, 1000),
        }));
        await Dataset.pushData(data);
      },
    });

    await crawler.run([url]);
    const results = await Dataset.getData();

    res.json({
      success: true,
      items: results.items,
      count: results.items.length,
    });
  } catch (error) {
    res.status(500).json({ error: error.message });
  }
});

const PORT = process.env.PORT || 8080;
app.listen(PORT, () => {
  console.log(`Server running on port ${PORT}`);
});
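One caveat with this wrapper: Crawlee's default storage lives for the lifetime of the container, and depending on configuration it may not be purged between runs in the same process, so a second POST to /scrape can return items from earlier crawls. A sketch of one way around this, using a uniquely named dataset per request (the run ID scheme is this guide's invention) and dropping it afterwards:

// Inside the /scrape handler – isolate each run's results in its own dataset
const runId = `run-${Date.now()}`;
const dataset = await Dataset.open(runId);

const crawler = new PlaywrightCrawler({
  maxRequestsPerCrawl: maxPages,
  async requestHandler({ request, page, log }) {
    log.info(`Scraping ${request.url}`);
    const data = await page.evaluate(() => ({
      title: document.title,
      url: window.location.href,
    }));
    await dataset.pushData(data); // push to the per-run dataset, not the default one
  },
});

await crawler.run([url]);

const results = await dataset.getData();
await dataset.drop(); // clean up so the container's storage doesn't keep growing
res.json({ success: true, items: results.items, count: results.items.length });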
Deploy to Cloud Run
# Build and push to Container Registry
gcloud builds submit --tag gcr.io/YOUR_PROJECT_ID/crawlee-scraper
# Deploy to Cloud Run
gcloud run deploy crawlee-scraper \
--image gcr.io/YOUR_PROJECT_ID/crawlee-scraper \
--platform managed \
--region us-central1 \
--memory 2Gi \
--cpu 2 \
--timeout 900 \
--allow-unauthenticated \
--max-instances 10
# Test the deployment
curl -X POST https://crawlee-scraper-xxxxx.run.app/scrape \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com", "maxPages": 5}'
Best Practices for Cloud Deployment
1. Storage Configuration
Store scraped data in cloud storage instead of local disk:
import { S3Client, PutObjectCommand } from '@aws-sdk/client-s3';
import { Dataset } from '@crawlee/core';

async function saveToS3(data) {
  const s3Client = new S3Client({ region: 'us-east-1' });
  await s3Client.send(new PutObjectCommand({
    Bucket: 'your-scraping-bucket',
    Key: `scrapes/${Date.now()}.json`,
    Body: JSON.stringify(data),
    ContentType: 'application/json',
  }));
}
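For example, you might call it right after the crawl finishes, using the same default dataset as the handlers above:

// After crawler.run() completes:
const { items } = await Dataset.getData();
await saveToS3(items);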
2. Error Handling and Monitoring
Implement comprehensive logging and monitoring:
import { PlaywrightCrawler, log } from '@crawlee/playwright-crawler';

const crawler = new PlaywrightCrawler({
  maxRequestRetries: 3,
  async failedRequestHandler({ request, log }) {
    log.error(`Request failed: ${request.url}`, {
      errorMessage: request.errorMessages,
      retryCount: request.retryCount,
    });
    // Send to monitoring service
    await sendToCloudWatch({
      metric: 'FailedRequest',
      url: request.url,
      timestamp: new Date(),
    });
  },
});
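The sendToCloudWatch helper above is not part of Crawlee; a minimal sketch using the AWS SDK's CloudWatch client (the metric namespace and dimension name are this guide's assumptions) could look like this:

import { CloudWatchClient, PutMetricDataCommand } from '@aws-sdk/client-cloudwatch';

const cloudWatch = new CloudWatchClient({ region: 'us-east-1' });

async function sendToCloudWatch({ metric, url, timestamp }) {
  // Publish a simple counter metric; the namespace is arbitrary
  await cloudWatch.send(new PutMetricDataCommand({
    Namespace: 'CrawleeScraper',
    MetricData: [{
      MetricName: metric,
      Timestamp: timestamp,
      Value: 1,
      Unit: 'Count',
      Dimensions: [{ Name: 'Url', Value: url }],
    }],
  }));
}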
3. Resource Optimization
Configure Crawlee for cloud environments:
const crawler = new PlaywrightCrawler({
  maxConcurrency: 5,
  maxRequestsPerCrawl: 100,
  requestHandlerTimeoutSecs: 60,
  launchContext: {
    launchOptions: {
      headless: true,
      args: [
        '--no-sandbox',
        '--disable-setuid-sandbox',
        '--disable-dev-shm-usage',
        '--disable-gpu',
      ],
    },
  },
});
4. Scheduled Execution
Use cloud-native scheduling for regular scraping tasks:
AWS EventBridge:
aws events put-rule \
--name scraper-schedule \
--schedule-expression "rate(1 hour)"
aws events put-targets \
--rule scraper-schedule \
--targets "Id"="1","Arn"="arn:aws:lambda:REGION:ACCOUNT:function:crawlee-scraper"
Google Cloud Scheduler:
gcloud scheduler jobs create http scraper-job \
--schedule="0 */1 * * *" \
--uri="https://crawlee-scraper-xxxxx.run.app/scrape" \
--http-method=POST \
--message-body='{"url":"https://example.com"}'
Monitoring and Debugging
CloudWatch Logs (AWS)
import { CloudWatchLogsClient, PutLogEventsCommand } from '@aws-sdk/client-cloudwatch-logs';

async function logToCloudWatch(message, level = 'INFO') {
  const client = new CloudWatchLogsClient({ region: 'us-east-1' });
  await client.send(new PutLogEventsCommand({
    logGroupName: '/aws/lambda/crawlee-scraper',
    logStreamName: new Date().toISOString().split('T')[0],
    logEvents: [{
      message: JSON.stringify({ level, message, timestamp: Date.now() }),
      timestamp: Date.now(),
    }],
  }));
}
Google Cloud Logging
const { Logging } = require('@google-cloud/logging');

const logging = new Logging();
const log = logging.log('crawlee-scraper');

async function logToGCloud(message, severity = 'INFO') {
  const metadata = { severity, resource: { type: 'cloud_run_revision' } };
  const entry = log.entry(metadata, message);
  await log.write(entry);
}
Cost Optimization Tips
- Use CheerioCrawler for simple HTML scraping instead of PlaywrightCrawler to reduce memory and CPU usage
- Implement request filtering to avoid scraping unnecessary pages (see the sketch after this list)
- Cache results in Redis or cloud storage to minimize redundant scraping
- Use spot instances (AWS) or preemptible VMs (GCP) for cost savings
- Set appropriate concurrency limits based on your cloud resources
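As an illustration of the request-filtering tip, enqueueLinks accepts glob patterns, and recent Crawlee versions also support an exclude option, so irrelevant sections of a site can be skipped before they cost any compute (if exclude is unavailable in your version, a transformRequestFunction can serve the same purpose). The patterns below are placeholders:

await enqueueLinks({
  // Only follow product pages...
  globs: ['https://example.com/products/**'],
  // ...and skip sections that add cost without adding data
  exclude: ['https://example.com/**/reviews/**', 'https://example.com/cart/**'],
});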
Conclusion
Deploying Crawlee to AWS or Google Cloud provides scalability, reliability, and cost-effectiveness for web scraping operations. Choose serverless options (Lambda, Cloud Functions) for scheduled, lightweight tasks, or containerized solutions (EC2, Cloud Run, GKE) for complex, long-running scrapers. Always implement proper error handling, monitoring, and storage solutions to ensure production-ready deployments.
For more information on browser automation in cloud environments, check out handling browser sessions in Puppeteer and running multiple pages in parallel with Puppeteer.