How Do I Schedule Crawlee Scrapers to Run Automatically?
Scheduling Crawlee scrapers to run automatically is essential for maintaining up-to-date data collections, monitoring websites, and building reliable web scraping pipelines. Whether you need to scrape data hourly, daily, or at custom intervals, there are several robust methods to automate your Crawlee scrapers.
In this comprehensive guide, we'll explore multiple approaches to scheduling Crawlee scrapers, from simple cron jobs to cloud-based solutions, along with best practices for production deployments.
Why Automate Crawlee Scrapers?
Before diving into implementation, let's understand the benefits of automated scraping:
- Consistent Data Collection: Gather fresh data at regular intervals without manual intervention
- Off-Peak Execution: Run scrapers overnight or during low-traffic windows to reduce the chance of hitting rate limits and to minimize impact on the target site
- Scalability: Handle multiple scraping tasks across different schedules
- Reliability: Automatic retries and monitoring ensure data collection continues even if individual runs fail
- Cost Optimization: Schedule scrapers strategically to minimize resource usage and API costs
Method 1: Using Node.js Cron Jobs with node-cron
The node-cron package provides a simple, Node.js-native way to schedule tasks using cron syntax. This is ideal for applications that need to run scrapers within the same process.
Installation
npm install node-cron
# or
yarn add node-cron
Basic Implementation
const cron = require('node-cron');
const { PlaywrightCrawler } = require('crawlee');
// Define your scraper
async function runScraper() {
const crawler = new PlaywrightCrawler({
requestHandler: async ({ page, request, enqueueLinks }) => {
const title = await page.title();
console.log(`Title: ${title}`);
// Your scraping logic here
await enqueueLinks({
globs: ['https://example.com/**'],
});
},
maxRequestsPerCrawl: 100,
});
await crawler.run(['https://example.com']);
}
// Schedule scraper to run every day at 2:00 AM
cron.schedule('0 2 * * *', async () => {
console.log('Starting scheduled scraper at', new Date().toISOString());
try {
await runScraper();
console.log('Scraper completed successfully');
} catch (error) {
console.error('Scraper failed:', error);
}
});
console.log('Scheduler started. Waiting for scheduled tasks...');
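One thing to watch for: node-cron fires the callback on every tick whether or not the previous run has finished, so a scraper that takes longer than its interval can end up with overlapping runs. A simple guard flag prevents that, and the optional third argument lets you pin the schedule to a timezone (shown here with a hypothetical hourly schedule):
// Prevent overlapping runs and pin the schedule to a specific timezone
let isRunning = false;

cron.schedule('0 * * * *', async () => {
  if (isRunning) {
    console.log('Previous run still in progress, skipping this tick');
    return;
  }
  isRunning = true;
  try {
    await runScraper();
  } finally {
    isRunning = false;
  }
}, { timezone: 'UTC' });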
Advanced Scheduling with Multiple Tasks
const cron = require('node-cron');
const { CheerioCrawler, Dataset } = require('crawlee');
// Product scraper
async function scrapeProducts() {
const crawler = new CheerioCrawler({
requestHandler: async ({ $, request }) => {
const products = [];
$('.product').each((i, element) => {
products.push({
name: $(element).find('.product-name').text(),
price: $(element).find('.price').text(),
scrapedAt: new Date().toISOString(),
});
});
await Dataset.pushData(products);
},
});
await crawler.run(['https://example-store.com/products']);
}
// News scraper
async function scrapeNews() {
const crawler = new CheerioCrawler({
requestHandler: async ({ $, request }) => {
const articles = [];
$('.article').each((i, element) => {
articles.push({
title: $(element).find('.title').text(),
author: $(element).find('.author').text(),
scrapedAt: new Date().toISOString(),
});
});
await Dataset.pushData(articles);
},
});
await crawler.run(['https://example-news.com']);
}
// Run products scraper every 6 hours
cron.schedule('0 */6 * * *', async () => {
console.log('Running products scraper...');
await scrapeProducts();
});
// Run news scraper every hour
cron.schedule('0 * * * *', async () => {
console.log('Running news scraper...');
await scrapeNews();
});
// Run daily report at midnight
cron.schedule('0 0 * * *', async () => {
console.log('Generating daily report...');
// Your reporting logic
});
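The daily report task above is only a stub; what it does is up to you, but a common pattern is to read back what the scrapers stored in Crawlee's default dataset. A minimal sketch, where the report format is just an example:
// report.js - summarize everything collected into the default dataset (report shape is an example)
const { Dataset } = require('crawlee');
const fs = require('fs/promises');

async function generateDailyReport() {
  const dataset = await Dataset.open();
  const { items } = await dataset.getData();

  const report = {
    generatedAt: new Date().toISOString(),
    totalItems: items.length,
  };

  await fs.writeFile('daily-report.json', JSON.stringify(report, null, 2));
  console.log(`Report written with ${items.length} items`);
}

module.exports = { generateDailyReport };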
Common Cron Patterns
// Every minute
cron.schedule('* * * * *', task);
// Every 5 minutes
cron.schedule('*/5 * * * *', task);
// Every hour at minute 30
cron.schedule('30 * * * *', task);
// Every day at 3:15 AM
cron.schedule('15 3 * * *', task);
// Every Monday at 9:00 AM
cron.schedule('0 9 * * 1', task);
// First day of every month at midnight
cron.schedule('0 0 1 * *', task);
// Every weekday at 6:00 PM
cron.schedule('0 18 * * 1-5', task);
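Each pattern has five fields, read left to right: minute (0-59), hour (0-23), day of month (1-31), month (1-12), and day of week (0-7, where both 0 and 7 mean Sunday). node-cron also accepts an optional sixth field for seconds at the start, and it can validate an expression before you schedule it:
const cron = require('node-cron');

// Check an expression before handing it to cron.schedule()
console.log(cron.validate('0 2 * * *'));             // true
console.log(cron.validate('not a cron expression')); // false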
Method 2: System Cron Jobs (Linux/macOS)
For production environments, system cron jobs offer reliability and process isolation: each run starts in a fresh process, so a crash or memory leak in one run cannot affect the next. This also makes it a good fit when deploying Crawlee scrapers to your own cloud servers.
Create a Scraper Script
First, create an executable scraper script:
#!/usr/bin/env node
// scraper.js
const { PlaywrightCrawler, Dataset } = require('crawlee');
async function main() {
console.log('Starting scraper at', new Date().toISOString());
const crawler = new PlaywrightCrawler({
requestHandler: async ({ page, request, enqueueLinks }) => {
const data = {
url: request.url,
title: await page.title(),
timestamp: new Date().toISOString(),
};
await Dataset.pushData(data);
await enqueueLinks({
globs: ['https://example.com/**'],
});
},
maxRequestsPerCrawl: 50,
});
await crawler.run(['https://example.com']);
console.log('Scraper completed at', new Date().toISOString());
}
main().catch((error) => {
console.error('Scraper failed:', error);
process.exit(1);
});
Make it executable:
chmod +x scraper.js
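Because a cron-launched scraper has no one watching its console, it helps to have the script report each run to an external monitor (a dead-man's-switch style health check). Below is a minimal sketch; the monitoring endpoint URL is a placeholder, and the built-in fetch requires Node.js 18 or newer:
// notify.js - report run status to an external monitoring endpoint (URL is a placeholder)
const MONITOR_URL = process.env.MONITOR_URL || 'https://monitoring.example.com/ping/scraper';

async function reportRun(status) {
  try {
    // Node.js 18+ ships a global fetch; older versions need a library such as node-fetch
    await fetch(`${MONITOR_URL}?status=${status}`, { method: 'POST' });
  } catch (error) {
    console.error('Failed to report run status:', error.message);
  }
}

module.exports = { reportRun };
Require it from scraper.js, call await reportRun('ok') at the end of main(), and call reportRun('failed') in the catch handler so a silent failure still shows up in your monitoring.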
Set Up Cron Job
Edit your crontab:
crontab -e
Add your scheduled task:
# Run scraper every day at 2:00 AM
0 2 * * * cd /path/to/your/project && /usr/bin/node scraper.js >> /var/log/scraper.log 2>&1
# Run every 4 hours with logging
0 */4 * * * cd /path/to/your/project && /usr/bin/node scraper.js >> /var/log/scraper.log 2>&1
# Run on weekdays at 9 AM
0 9 * * 1-5 cd /path/to/your/project && /usr/bin/node scraper.js >> /var/log/scraper.log 2>&1
Environment Variables in Cron
Since cron runs jobs with a minimal environment (it does not load your shell profile), create a wrapper script that sets up the variables your scraper needs:
#!/bin/bash
# run-scraper.sh
# Load environment variables
export NODE_ENV=production
export API_KEY=your_api_key
set -a            # export everything loaded from the .env file to child processes
source /path/to/.env
set +a
# Navigate to project directory
cd /path/to/your/project
# Run the scraper
/usr/bin/node scraper.js
Update crontab:
0 2 * * * /path/to/run-scraper.sh >> /var/log/scraper.log 2>&1
Method 3: PM2 with Cron Module
PM2 is a production process manager for Node.js that includes built-in cron functionality, making it perfect for managing scheduled scrapers.
Installation
npm install -g pm2
Create Ecosystem File
// ecosystem.config.js
module.exports = {
apps: [{
name: 'product-scraper',
script: './scrapers/products.js',
cron_restart: '0 */6 * * *', // Every 6 hours
autorestart: false, // Don't auto-restart after completion
watch: false,
env: {
NODE_ENV: 'production',
},
}, {
name: 'news-scraper',
script: './scrapers/news.js',
cron_restart: '0 * * * *', // Every hour
autorestart: false,
watch: false,
env: {
NODE_ENV: 'production',
},
}, {
name: 'daily-report',
script: './reports/daily.js',
cron_restart: '0 0 * * *', // Daily at midnight
autorestart: false,
watch: false,
env: {
NODE_ENV: 'production',
},
}],
};
Start with PM2
# Start all scrapers
pm2 start ecosystem.config.js
# View running processes
pm2 list
# Monitor logs
pm2 logs
# Monitor specific scraper
pm2 logs product-scraper
# Save PM2 configuration
pm2 save
# Set up PM2 to start on system boot
pm2 startup
PM2 Management Commands
# Stop a specific scraper
pm2 stop product-scraper
# Restart a scraper
pm2 restart product-scraper
# Delete a scraper from PM2
pm2 delete product-scraper
# View detailed information
pm2 info product-scraper
# Monitor resource usage
pm2 monit
Method 4: Cloud-Based Schedulers
For cloud deployments, use managed scheduling services to trigger Crawlee scrapers running as serverless functions or in containers.
AWS Lambda with EventBridge
// lambda/scraper.js
const { CheerioCrawler, Dataset } = require('crawlee');
exports.handler = async (event) => {
const crawler = new CheerioCrawler({
requestHandler: async ({ $, request }) => {
const data = {
title: $('title').text(),
url: request.url,
timestamp: new Date().toISOString(),
};
await Dataset.pushData(data);
},
maxRequestsPerCrawl: 20,
});
await crawler.run(['https://example.com']);
return {
statusCode: 200,
body: JSON.stringify({ message: 'Scraper completed successfully' }),
};
};
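Deploy the handler above as a Lambda function and attach an EventBridge rule with a schedule expression such as rate(6 hours) or cron(0 2 * * ? *). One caveat: Lambda's filesystem is read-only outside /tmp, so point Crawlee's storage there (for example by setting the CRAWLEE_STORAGE_DIR environment variable to a path under /tmp). The sketch below creates the rule with the AWS SDK v3; the rule name, region, account ID, and function ARN are placeholders:
// schedule-rule.js - attach an EventBridge schedule to the scraper Lambda (names/ARNs are placeholders)
const {
  EventBridgeClient,
  PutRuleCommand,
  PutTargetsCommand,
} = require('@aws-sdk/client-eventbridge');

async function createSchedule() {
  const client = new EventBridgeClient({ region: 'us-east-1' });

  // Rule that fires every day at 2:00 AM UTC
  await client.send(new PutRuleCommand({
    Name: 'daily-scraper-schedule',
    ScheduleExpression: 'cron(0 2 * * ? *)',
    State: 'ENABLED',
  }));

  // Point the rule at the scraper Lambda
  await client.send(new PutTargetsCommand({
    Rule: 'daily-scraper-schedule',
    Targets: [{
      Id: 'scraper-lambda',
      Arn: 'arn:aws:lambda:us-east-1:123456789012:function:crawlee-scraper',
    }],
  }));

  // The function also needs a resource-based permission allowing events.amazonaws.com
  // to invoke it (aws lambda add-permission), or create the schedule from the Lambda console instead.
}

createSchedule().catch(console.error);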
Google Cloud Scheduler
// index.js for Google Cloud Functions
const { PlaywrightCrawler } = require('crawlee');
exports.scheduledScraper = async (req, res) => {
try {
const crawler = new PlaywrightCrawler({
requestHandler: async ({ page, request }) => {
// Your scraping logic
const title = await page.title();
console.log(`Scraped: ${title}`);
},
maxRequestsPerCrawl: 50,
});
await crawler.run(['https://example.com']);
res.status(200).send('Scraper completed successfully');
} catch (error) {
console.error('Scraper failed:', error);
res.status(500).send('Scraper failed');
}
};
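Deploy this function with an HTTP trigger, then create a Cloud Scheduler job that calls the function's URL on your schedule (from the console, gcloud, or programmatically). A minimal sketch using the @google-cloud/scheduler client, where the project ID, region, and function URL are placeholders:
// create-job.js - create a Cloud Scheduler job targeting the function's URL (values are placeholders)
const { CloudSchedulerClient } = require('@google-cloud/scheduler');

async function createSchedulerJob() {
  const client = new CloudSchedulerClient();
  const parent = client.locationPath('my-project-id', 'us-central1');

  const [job] = await client.createJob({
    parent,
    job: {
      httpTarget: {
        uri: 'https://us-central1-my-project-id.cloudfunctions.net/scheduledScraper',
        httpMethod: 'POST',
        // For a non-public function, add an oidcToken with a service account allowed to invoke it
      },
      schedule: '0 2 * * *', // Every day at 2:00 AM
      timeZone: 'UTC',
    },
  });

  console.log(`Created scheduler job: ${job.name}`);
}

createSchedulerJob().catch(console.error);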
Azure Functions with Timer Trigger
// function.json
{
"bindings": [
{
"name": "myTimer",
"type": "timerTrigger",
"direction": "in",
"schedule": "0 0 */6 * * *"
}
]
}
// index.js
const { CheerioCrawler } = require('crawlee');
module.exports = async function (context, myTimer) {
context.log('Starting scheduled scraper');
const crawler = new CheerioCrawler({
requestHandler: async ({ $, request }) => {
// Your scraping logic
context.log(`Processing: ${request.url}`);
},
});
await crawler.run(['https://example.com']);
context.log('Scraper completed');
};
Method 5: GitHub Actions for Scheduled Scraping
GitHub Actions provides scheduled workflows at no extra cost for public repositories (private repositories use your included minutes), which makes it a good fit for smaller projects. Keep in mind that the shortest supported interval is five minutes and scheduled runs can start late during periods of high load.
# .github/workflows/scraper.yml
name: Scheduled Scraper
on:
schedule:
# Run every day at 2:00 AM UTC
- cron: '0 2 * * *'
workflow_dispatch: # Allow manual triggers
jobs:
scrape:
runs-on: ubuntu-latest
steps:
- name: Checkout repository
uses: actions/checkout@v4
- name: Setup Node.js
uses: actions/setup-node@v4
with:
node-version: '20'
- name: Install dependencies
run: npm ci
- name: Install Playwright browsers
run: npx playwright install chromium
- name: Run scraper
run: node scraper.js
env:
API_KEY: ${{ secrets.API_KEY }}
- name: Upload results
uses: actions/upload-artifact@v4
with:
name: scraping-results
path: storage/datasets/default/
Best Practices for Scheduled Scrapers
1. Error Handling and Retries
const cron = require('node-cron');
const { PlaywrightCrawler } = require('crawlee');
async function runScraperWithRetry(maxRetries = 3) {
for (let attempt = 1; attempt <= maxRetries; attempt++) {
try {
console.log(`Attempt ${attempt}/${maxRetries}`);
const crawler = new PlaywrightCrawler({
requestHandler: async ({ page, request }) => {
// Your scraping logic
},
maxRequestsPerCrawl: 100,
maxRequestRetries: 3,
requestHandlerTimeoutSecs: 60,
});
await crawler.run(['https://example.com']);
console.log('Scraper completed successfully');
return true;
} catch (error) {
console.error(`Attempt ${attempt} failed:`, error.message);
if (attempt === maxRetries) {
// Send alert notification
await sendAlertNotification(error);
throw error;
}
// Wait before retrying
await new Promise(resolve => setTimeout(resolve, 5000 * attempt));
}
}
}
cron.schedule('0 2 * * *', async () => {
await runScraperWithRetry();
});
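The retry example calls sendAlertNotification(), which is left undefined above; the implementation depends on your alerting channel. Here is a minimal sketch that posts the error to a chat or incident webhook; the ALERT_WEBHOOK_URL variable and payload shape are placeholders, and the built-in fetch requires Node.js 18 or newer:
// Post scraper failures to a webhook (ALERT_WEBHOOK_URL is a placeholder)
async function sendAlertNotification(error) {
  const webhookUrl = process.env.ALERT_WEBHOOK_URL;
  if (!webhookUrl) {
    console.warn('ALERT_WEBHOOK_URL not set, skipping alert');
    return;
  }
  await fetch(webhookUrl, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      text: `Scheduled scraper failed after all retries: ${error.message}`,
    }),
  });
}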
2. Monitoring and Logging
const cron = require('node-cron');
const { PlaywrightCrawler, log } = require('crawlee');
const winston = require('winston');
// Configure logging
const logger = winston.createLogger({
level: 'info',
format: winston.format.json(),
transports: [
new winston.transports.File({ filename: 'error.log', level: 'error' }),
new winston.transports.File({ filename: 'combined.log' }),
],
});
async function monitoredScraper() {
const startTime = Date.now();
logger.info('Scraper started', { timestamp: new Date().toISOString() });
try {
const crawler = new PlaywrightCrawler({
requestHandler: async ({ page, request }) => {
logger.info('Processing URL', { url: request.url });
// Your scraping logic
},
});
await crawler.run(['https://example.com']);
const duration = Date.now() - startTime;
logger.info('Scraper completed', {
duration: `${duration}ms`,
timestamp: new Date().toISOString(),
});
} catch (error) {
logger.error('Scraper failed', {
error: error.message,
stack: error.stack,
timestamp: new Date().toISOString(),
});
throw error;
}
}
cron.schedule('0 */6 * * *', monitoredScraper);
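In Crawlee v3, crawler.run() also resolves with the crawl's final statistics, which you can add to the same logs to see how much work each scheduled run actually did. A small extension of the snippet above:
// Capture the statistics object that crawler.run() resolves with
const stats = await crawler.run(['https://example.com']);

logger.info('Crawl statistics', {
  requestsFinished: stats.requestsFinished,
  requestsFailed: stats.requestsFailed,
});

// Treat a run with failed requests as a partial failure worth flagging
if (stats.requestsFailed > 0) {
  logger.warn('Some requests failed during the scheduled run');
}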
3. Resource Cleanup
const cron = require('node-cron');
const { PlaywrightCrawler } = require('crawlee');
const fs = require('fs').promises;
const path = require('path');
async function cleanupOldData(daysToKeep = 7) {
  const cutoffTime = Date.now() - daysToKeep * 24 * 60 * 60 * 1000;
  const datasetsDir = path.join(__dirname, 'storage', 'datasets');
  // Remove dataset directories whose last modification is older than the cutoff
  const entries = await fs.readdir(datasetsDir, { withFileTypes: true });
  for (const entry of entries) {
    if (!entry.isDirectory()) continue;
    const dirPath = path.join(datasetsDir, entry.name);
    const stats = await fs.stat(dirPath);
    if (stats.mtimeMs < cutoffTime) {
      await fs.rm(dirPath, { recursive: true, force: true });
    }
  }
}
// Run the scraper (runScraper() from Method 1) every 6 hours
cron.schedule('0 */6 * * *', async () => {
await runScraper();
});
// Clean up old data daily at 3 AM
cron.schedule('0 3 * * *', async () => {
console.log('Cleaning up old data...');
await cleanupOldData(7);
});
4. Rate Limiting and Throttling
const cron = require('node-cron');
const { PlaywrightCrawler, ProxyConfiguration } = require('crawlee');
async function throttledScraper() {
const crawler = new PlaywrightCrawler({
requestHandler: async ({ page, request }) => {
// Your scraping logic
},
maxConcurrency: 5, // Limit concurrent requests
maxRequestsPerMinute: 30, // Rate limiting
minConcurrency: 1,
navigationTimeoutSecs: 30,
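// Optionally rotate proxies as well; the proxy URLs below are placeholders
// proxyConfiguration: new ProxyConfiguration({
//   proxyUrls: ['http://proxy-1.example.com:8000', 'http://proxy-2.example.com:8000'],
// }),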
});
await crawler.run(['https://example.com']);
}
// Run at 2:00, 3:00, and 4:00 AM, when traffic is low and scraping is less likely to be noticed
cron.schedule('0 2-4 * * *', throttledScraper);
Conclusion
Scheduling Crawlee scrapers automatically is essential for building reliable, production-grade web scraping systems. Whether you choose node-cron for simple in-process scheduling, system cron for production reliability, PM2 for process management, or cloud-based solutions for scalability, each method has its advantages depending on your specific requirements.
Key takeaways:
- node-cron is perfect for applications that need scheduling within the same Node.js process
- System cron provides reliability and process isolation for production environments
- PM2 offers advanced process management with built-in monitoring and auto-restart capabilities
- Cloud schedulers (AWS, GCP, Azure) are ideal for serverless architectures and automatic scaling
- GitHub Actions works well for open-source projects and smaller scraping tasks
Remember to implement proper error handling, logging, and monitoring to ensure your scheduled scrapers run reliably and efficiently. Always respect website terms of service, implement rate limiting, and consider using proxy rotation with Crawlee when scraping at scale.
By combining these scheduling techniques with Crawlee's powerful features, you can build robust, automated data collection pipelines that run reliably without manual intervention.