How Do I Schedule Crawlee Scrapers to Run Automatically?
Scheduling Crawlee scrapers to run automatically is essential for maintaining up-to-date data collections, monitoring websites, and building reliable web scraping pipelines. Whether you need to scrape data hourly, daily, or at custom intervals, there are several robust methods to automate your Crawlee scrapers.
In this comprehensive guide, we'll explore multiple approaches to scheduling Crawlee scrapers, from simple cron jobs to cloud-based solutions, along with best practices for production deployments.
Why Automate Crawlee Scrapers?
Before diving into implementation, let's understand the benefits of automated scraping:
- Consistent Data Collection: Gather fresh data at regular intervals without manual intervention
- Off-Peak Execution: Run scrapers overnight or during low-traffic windows to reduce the chance of hitting rate limits and to minimize impact on the target site
- Scalability: Handle multiple scraping tasks across different schedules
- Reliability: Automatic retries and monitoring ensure data collection continues even if individual runs fail
- Cost Optimization: Schedule scrapers strategically to minimize resource usage and API costs
Method 1: Using Node.js Cron Jobs with node-cron
The node-cron package provides a simple, Node.js-native way to schedule tasks using cron syntax. This is ideal for applications that need to run scrapers within the same process.
Installation
npm install node-cron
# or
yarn add node-cron
Basic Implementation
const cron = require('node-cron');
const { PlaywrightCrawler } = require('crawlee');
// Define your scraper
async function runScraper() {
const crawler = new PlaywrightCrawler({
requestHandler: async ({ page, request, enqueueLinks }) => {
const title = await page.title();
console.log(`Title: ${title}`);
// Your scraping logic here
await enqueueLinks({
globs: ['https://example.com/**'],
});
},
maxRequestsPerCrawl: 100,
});
await crawler.run(['https://example.com']);
}
// Schedule scraper to run every day at 2:00 AM
cron.schedule('0 2 * * *', async () => {
console.log('Starting scheduled scraper at', new Date().toISOString());
try {
await runScraper();
console.log('Scraper completed successfully');
} catch (error) {
console.error('Scraper failed:', error);
}
});
console.log('Scheduler started. Waiting for scheduled tasks...');
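One thing to watch for: node-cron fires the callback on every tick whether or not the previous run has finished, so a scraper that takes longer than its interval can end up with overlapping runs. A simple guard flag prevents that, and the optional third argument lets you pin the schedule to a timezone (shown here with a hypothetical hourly schedule):
// Prevent overlapping runs and pin the schedule to a specific timezone
let isRunning = false;

cron.schedule('0 * * * *', async () => {
  if (isRunning) {
    console.log('Previous run still in progress, skipping this tick');
    return;
  }
  isRunning = true;
  try {
    await runScraper();
  } finally {
    isRunning = false;
  }
}, { timezone: 'UTC' });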
Advanced Scheduling with Multiple Tasks
const cron = require('node-cron');
const { CheerioCrawler, Dataset } = require('crawlee');
// Product scraper
async function scrapeProducts() {
const crawler = new CheerioCrawler({
requestHandler: async ({ $, request }) => {
const products = [];
$('.product').each((i, element) => {
products.push({
name: $(element).find('.product-name').text(),
price: $(element).find('.price').text(),
scrapedAt: new Date().toISOString(),
});
});
await Dataset.pushData(products);
},
});
await crawler.run(['https://example-store.com/products']);
}
// News scraper
async function scrapeNews() {
const crawler = new CheerioCrawler({
requestHandler: async ({ $, request }) => {
const articles = [];
$('.article').each((i, element) => {
articles.push({
title: $(element).find('.title').text(),
author: $(element).find('.author').text(),
scrapedAt: new Date().toISOString(),
});
});
await Dataset.pushData(articles);
},
});
await crawler.run(['https://example-news.com']);
}
// Run products scraper every 6 hours
cron.schedule('0 */6 * * *', async () => {
console.log('Running products scraper...');
await scrapeProducts();
});
// Run news scraper every hour
cron.schedule('0 * * * *', async () => {
console.log('Running news scraper...');
await scrapeNews();
});
// Run daily report at midnight
cron.schedule('0 0 * * *', async () => {
console.log('Generating daily report...');
// Your reporting logic
});
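The daily report task above is only a stub; what it does is up to you, but a common pattern is to read back what the scrapers stored in Crawlee's default dataset. A minimal sketch, where the report format is just an example:
// report.js - summarize everything collected into the default dataset (report shape is an example)
const { Dataset } = require('crawlee');
const fs = require('fs/promises');

async function generateDailyReport() {
  const dataset = await Dataset.open();
  const { items } = await dataset.getData();

  const report = {
    generatedAt: new Date().toISOString(),
    totalItems: items.length,
  };

  await fs.writeFile('daily-report.json', JSON.stringify(report, null, 2));
  console.log(`Report written with ${items.length} items`);
}

module.exports = { generateDailyReport };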
Common Cron Patterns
// Every minute
cron.schedule('* * * * *', task);
// Every 5 minutes
cron.schedule('*/5 * * * *', task);
// Every hour at minute 30
cron.schedule('30 * * * *', task);
// Every day at 3:15 AM
cron.schedule('15 3 * * *', task);
// Every Monday at 9:00 AM
cron.schedule('0 9 * * 1', task);
// First day of every month at midnight
cron.schedule('0 0 1 * *', task);
// Every weekday at 6:00 PM
cron.schedule('0 18 * * 1-5', task);
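Each pattern has five fields, read left to right: minute (0-59), hour (0-23), day of month (1-31), month (1-12), and day of week (0-7, where both 0 and 7 mean Sunday). node-cron also accepts an optional sixth field for seconds at the start, and it can validate an expression before you schedule it:
const cron = require('node-cron');

// Check an expression before handing it to cron.schedule()
console.log(cron.validate('0 2 * * *'));             // true
console.log(cron.validate('not a cron expression')); // false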
Method 2: System Cron Jobs (Linux/macOS)
For production environments, system cron jobs offer reliability and process isolation: each run starts in a fresh process, so a crash or memory leak in one run cannot affect the next. This also makes it a good fit when deploying Crawlee scrapers to your own cloud servers.
Create a Scraper Script
First, create an executable scraper script:
#!/usr/bin/env node
// scraper.js
const { PlaywrightCrawler, Dataset } = require('crawlee');
async function main() {
console.log('Starting scraper at', new Date().toISOString());
const crawler = new PlaywrightCrawler({
requestHandler: async ({ page, request, enqueueLinks }) => {
const data = {
url: request.url,
title: await page.title(),
timestamp: new Date().toISOString(),
};
await Dataset.pushData(data);
await enqueueLinks({
globs: ['https://example.com/**'],
});
},
maxRequestsPerCrawl: 50,
});
await crawler.run(['https://example.com']);
console.log('Scraper completed at', new Date().toISOString());
}
main().catch((error) => {
console.error('Scraper failed:', error);
process.exit(1);
});
Make it executable:
chmod +x scraper.js
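Because a cron-launched scraper has no one watching its console, it helps to have the script report each run to an external monitor (a dead-man's-switch style health check). Below is a minimal sketch; the monitoring endpoint URL is a placeholder, and the built-in fetch requires Node.js 18 or newer:
// notify.js - report run status to an external monitoring endpoint (URL is a placeholder)
const MONITOR_URL = process.env.MONITOR_URL || 'https://monitoring.example.com/ping/scraper';

async function reportRun(status) {
  try {
    // Node.js 18+ ships a global fetch; older versions need a library such as node-fetch
    await fetch(`${MONITOR_URL}?status=${status}`, { method: 'POST' });
  } catch (error) {
    console.error('Failed to report run status:', error.message);
  }
}

module.exports = { reportRun };
Require it from scraper.js, call await reportRun('ok') at the end of main(), and call reportRun('failed') in the catch handler so a silent failure still shows up in your monitoring.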
Set Up Cron Job
Edit your crontab:
crontab -e
Add your scheduled task:
# Run scraper every day at 2:00 AM
0 2 * * * cd /path/to/your/project && /usr/bin/node scraper.js >> /var/log/scraper.log 2>&1
# Run every 4 hours with logging
0 */4 * * * cd /path/to/your/project && /usr/bin/node scraper.js >> /var/log/scraper.log 2>&1
# Run on weekdays at 9 AM
0 9 * * 1-5 cd /path/to/your/project && /usr/bin/node scraper.js >> /var/log/scraper.log 2>&1
Environment Variables in Cron
Since cron runs jobs with a minimal environment (it does not load your shell profile), create a wrapper script that sets up the variables your scraper needs:
#!/bin/bash
# run-scraper.sh
# Load environment variables
export NODE_ENV=production
export API_KEY=your_api_key
set -a            # export everything loaded from the .env file to child processes
source /path/to/.env
set +a
# Navigate to project directory
cd /path/to/your/project
# Run the scraper
/usr/bin/node scraper.js
Update crontab:
0 2 * * * /path/to/run-scraper.sh >> /var/log/scraper.log 2>&1
Method 3: PM2 with Cron Module
PM2 is a production process manager for Node.js that includes built-in cron functionality, making it perfect for managing scheduled scrapers.
Installation
npm install -g pm2
Create Ecosystem File
// ecosystem.config.js
module.exports = {
apps: [{
name: 'product-scraper',
script: './scrapers/products.js',
cron_restart: '0 */6 * * *', // Every 6 hours
autorestart: false, // Don't auto-restart after completion
watch: false,
env: {
NODE_ENV: 'production',
},
}, {
name: 'news-scraper',
script: './scrapers/news.js',
cron_restart: '0 * * * *', // Every hour
autorestart: false,
watch: false,
env: {
NODE_ENV: 'production',
},
}, {
name: 'daily-report',
script: './reports/daily.js',
cron_restart: '0 0 * * *', // Daily at midnight
autorestart: false,
watch: false,
env: {
NODE_ENV: 'production',
},
}],
};
Start with PM2
# Start all scrapers
pm2 start ecosystem.config.js
# View running processes
pm2 list
# Monitor logs
pm2 logs
# Monitor specific scraper
pm2 logs product-scraper
# Save PM2 configuration
pm2 save
# Set up PM2 to start on system boot
pm2 startup
PM2 Management Commands
# Stop a specific scraper
pm2 stop product-scraper
# Restart a scraper
pm2 restart product-scraper
# Delete a scraper from PM2
pm2 delete product-scraper
# View detailed information
pm2 info product-scraper
# Monitor resource usage
pm2 monit
Method 4: Cloud-Based Schedulers
For cloud deployments, use managed scheduling services to trigger Crawlee scrapers running as serverless functions or in containers.
AWS Lambda with EventBridge
// lambda/scraper.js
const { CheerioCrawler, Dataset } = require('crawlee');
exports.handler = async (event) => {
const crawler = new CheerioCrawler({
requestHandler: async ({ $, request }) => {
const data = {
title: $('title').text(),
url: request.url,
timestamp: new Date().toISOString(),
};
await Dataset.pushData(data);
},
maxRequestsPerCrawl: 20,
});
await crawler.run(['https://example.com']);
return {
statusCode: 200,
body: JSON.stringify({ message: 'Scraper completed successfully' }),
};
};
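Deploy the handler above as a Lambda function and attach an EventBridge rule with a schedule expression such as rate(6 hours) or cron(0 2 * * ? *). One caveat: Lambda's filesystem is read-only outside /tmp, so point Crawlee's storage there (for example by setting the CRAWLEE_STORAGE_DIR environment variable to a path under /tmp). The sketch below creates the rule with the AWS SDK v3; the rule name, region, account ID, and function ARN are placeholders:
// schedule-rule.js - attach an EventBridge schedule to the scraper Lambda (names/ARNs are placeholders)
const {
  EventBridgeClient,
  PutRuleCommand,
  PutTargetsCommand,
} = require('@aws-sdk/client-eventbridge');

async function createSchedule() {
  const client = new EventBridgeClient({ region: 'us-east-1' });

  // Rule that fires every day at 2:00 AM UTC
  await client.send(new PutRuleCommand({
    Name: 'daily-scraper-schedule',
    ScheduleExpression: 'cron(0 2 * * ? *)',
    State: 'ENABLED',
  }));

  // Point the rule at the scraper Lambda
  await client.send(new PutTargetsCommand({
    Rule: 'daily-scraper-schedule',
    Targets: [{
      Id: 'scraper-lambda',
      Arn: 'arn:aws:lambda:us-east-1:123456789012:function:crawlee-scraper',
    }],
  }));

  // The function also needs a resource-based permission allowing events.amazonaws.com
  // to invoke it (aws lambda add-permission), or create the schedule from the Lambda console instead.
}

createSchedule().catch(console.error);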
Google Cloud Scheduler
// index.js for Google Cloud Functions
const { PlaywrightCrawler } = require('crawlee');
exports.scheduledScraper = async (req, res) => {
try {
const crawler = new PlaywrightCrawler({
requestHandler: async ({ page, request }) => {
// Your scraping logic
const title = await page.title();
console.log(`Scraped: ${title}`);
},
maxRequestsPerCrawl: 50,
});
await crawler.run(['https://example.com']);
res.status(200).send('Scraper completed successfully');
} catch (error) {
console.error('Scraper failed:', error);
res.status(500).send('Scraper failed');
}
};
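Deploy this function with an HTTP trigger, then create a Cloud Scheduler job that calls the function's URL on your schedule (from the console, gcloud, or programmatically). A minimal sketch using the @google-cloud/scheduler client, where the project ID, region, and function URL are placeholders:
// create-job.js - create a Cloud Scheduler job targeting the function's URL (values are placeholders)
const { CloudSchedulerClient } = require('@google-cloud/scheduler');

async function createSchedulerJob() {
  const client = new CloudSchedulerClient();
  const parent = client.locationPath('my-project-id', 'us-central1');

  const [job] = await client.createJob({
    parent,
    job: {
      httpTarget: {
        uri: 'https://us-central1-my-project-id.cloudfunctions.net/scheduledScraper',
        httpMethod: 'POST',
        // For a non-public function, add an oidcToken with a service account allowed to invoke it
      },
      schedule: '0 2 * * *', // Every day at 2:00 AM
      timeZone: 'UTC',
    },
  });

  console.log(`Created scheduler job: ${job.name}`);
}

createSchedulerJob().catch(console.error);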
Azure Functions with Timer Trigger
// function.json
{
"bindings": [
{
"name": "myTimer",
"type": "timerTrigger",
"direction": "in",
"schedule": "0 0 */6 * * *"
}
]
}
// index.js
const { CheerioCrawler } = require('crawlee');
module.exports = async function (context, myTimer) {
context.log('Starting scheduled scraper');
const crawler = new CheerioCrawler({
requestHandler: async ({ $, request }) => {
// Your scraping logic
context.log(`Processing: ${request.url}`);
},
});
await crawler.run(['https://example.com']);
context.log('Scraper completed');
};
Method 5: GitHub Actions for Scheduled Scraping
GitHub Actions provides scheduled workflows at no extra cost for public repositories (private repositories use your included minutes), which makes it a good fit for smaller projects. Keep in mind that the shortest supported interval is five minutes and scheduled runs can start late during periods of high load.
# .github/workflows/scraper.yml
name: Scheduled Scraper
on:
schedule:
# Run every day at 2:00 AM UTC
- cron: '0 2 * * *'
workflow_dispatch: # Allow manual triggers
jobs:
scrape:
runs-on: ubuntu-latest
steps:
- name: Checkout repository
uses: actions/checkout@v4
- name: Setup Node.js
uses: actions/setup-node@v4
with:
node-version: '20'
- name: Install dependencies
run: npm ci
- name: Install Playwright browsers
run: npx playwright install chromium
- name: Run scraper
run: node scraper.js
env:
API_KEY: ${{ secrets.API_KEY }}
- name: Upload results
uses: actions/upload-artifact@v4
with:
name: scraping-results
path: storage/datasets/default/
Best Practices for Scheduled Scrapers
1. Error Handling and Retries
const cron = require('node-cron');
const { PlaywrightCrawler } = require('crawlee');
async function runScraperWithRetry(maxRetries = 3) {
for (let attempt = 1; attempt <= maxRetries; attempt++) {
try {
console.log(`Attempt ${attempt}/${maxRetries}`);
const crawler = new PlaywrightCrawler({
requestHandler: async ({ page, request }) => {
// Your scraping logic
},
maxRequestsPerCrawl: 100,
maxRequestRetries: 3,
requestHandlerTimeoutSecs: 60,
});
await crawler.run(['https://example.com']);
console.log('Scraper completed successfully');
return true;
} catch (error) {
console.error(`Attempt ${attempt} failed:`, error.message);
if (attempt === maxRetries) {
// Send alert notification
await sendAlertNotification(error);
throw error;
}
// Wait before retrying
await new Promise(resolve => setTimeout(resolve, 5000 * attempt));
}
}
}
cron.schedule('0 2 * * *', async () => {
await runScraperWithRetry();
});
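The retry example calls sendAlertNotification(), which is left undefined above; the implementation depends on your alerting channel. Here is a minimal sketch that posts the error to a chat or incident webhook; the ALERT_WEBHOOK_URL variable and payload shape are placeholders, and the built-in fetch requires Node.js 18 or newer:
// Post scraper failures to a webhook (ALERT_WEBHOOK_URL is a placeholder)
async function sendAlertNotification(error) {
  const webhookUrl = process.env.ALERT_WEBHOOK_URL;
  if (!webhookUrl) {
    console.warn('ALERT_WEBHOOK_URL not set, skipping alert');
    return;
  }
  await fetch(webhookUrl, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      text: `Scheduled scraper failed after all retries: ${error.message}`,
    }),
  });
}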
2. Monitoring and Logging
const cron = require('node-cron');
const { PlaywrightCrawler, log } = require('crawlee');
const winston = require('winston');
// Configure logging
const logger = winston.createLogger({
level: 'info',
format: winston.format.json(),
transports: [
new winston.transports.File({ filename: 'error.log', level: 'error' }),
new winston.transports.File({ filename: 'combined.log' }),
],
});
async function monitoredScraper() {
const startTime = Date.now();
logger.info('Scraper started', { timestamp: new Date().toISOString() });
try {
const crawler = new PlaywrightCrawler({
requestHandler: async ({ page, request }) => {
logger.info('Processing URL', { url: request.url });
// Your scraping logic
},
});
await crawler.run(['https://example.com']);
const duration = Date.now() - startTime;
logger.info('Scraper completed', {
duration: `${duration}ms`,
timestamp: new Date().toISOString(),
});
} catch (error) {
logger.error('Scraper failed', {
error: error.message,
stack: error.stack,
timestamp: new Date().toISOString(),
});
throw error;
}
}
cron.schedule('0 */6 * * *', monitoredScraper);
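In Crawlee v3, crawler.run() also resolves with the crawl's final statistics, which you can add to the same logs to see how much work each scheduled run actually did. A small extension of the snippet above:
// Capture the statistics object that crawler.run() resolves with
const stats = await crawler.run(['https://example.com']);

logger.info('Crawl statistics', {
  requestsFinished: stats.requestsFinished,
  requestsFailed: stats.requestsFailed,
});

// Treat a run with failed requests as a partial failure worth flagging
if (stats.requestsFailed > 0) {
  logger.warn('Some requests failed during the scheduled run');
}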
3. Resource Cleanup
const cron = require('node-cron');
const { PlaywrightCrawler } = require('crawlee');
const fs = require('fs').promises;
const path = require('path');
async function cleanupOldData(daysToKeep = 7) {
  const cutoffTime = Date.now() - daysToKeep * 24 * 60 * 60 * 1000;
  const datasetsDir = path.join(__dirname, 'storage', 'datasets');
  // Remove dataset directories whose last modification is older than the cutoff
  const entries = await fs.readdir(datasetsDir, { withFileTypes: true });
  for (const entry of entries) {
    if (!entry.isDirectory()) continue;
    const dirPath = path.join(datasetsDir, entry.name);
    const stats = await fs.stat(dirPath);
    if (stats.mtimeMs < cutoffTime) {
      await fs.rm(dirPath, { recursive: true, force: true });
    }
  }
}
// Run the scraper (runScraper() from Method 1) every 6 hours
cron.schedule('0 */6 * * *', async () => {
await runScraper();
});
// Clean up old data daily at 3 AM
cron.schedule('0 3 * * *', async () => {
console.log('Cleaning up old data...');
await cleanupOldData(7);
});
4. Rate Limiting and Throttling
const cron = require('node-cron');
const { PlaywrightCrawler, ProxyConfiguration } = require('crawlee');
async function throttledScraper() {
const crawler = new PlaywrightCrawler({
requestHandler: async ({ page, request }) => {
// Your scraping logic
},
maxConcurrency: 5, // Limit concurrent requests
maxRequestsPerMinute: 30, // Rate limiting
minConcurrency: 1,
navigationTimeoutSecs: 30,
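// Optionally rotate proxies as well; the proxy URLs below are placeholders
// proxyConfiguration: new ProxyConfiguration({
//   proxyUrls: ['http://proxy-1.example.com:8000', 'http://proxy-2.example.com:8000'],
// }),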
});
await crawler.run(['https://example.com']);
}
// Run at 2:00, 3:00, and 4:00 AM, when traffic is low and scraping is less likely to be noticed
cron.schedule('0 2-4 * * *', throttledScraper);
Conclusion
Scheduling Crawlee scrapers automatically is essential for building reliable, production-grade web scraping systems. Whether you choose node-cron for simple in-process scheduling, system cron for production reliability, PM2 for process management, or cloud-based solutions for scalability, each method has its advantages depending on your specific requirements.
Key takeaways:
- node-cron is perfect for applications that need scheduling within the same Node.js process
- System cron provides reliability and process isolation for production environments
- PM2 offers advanced process management with built-in monitoring and auto-restart capabilities
- Cloud schedulers (AWS, GCP, Azure) are ideal for serverless architectures and automatic scaling
- GitHub Actions works well for open-source projects and smaller scraping tasks
Remember to implement proper error handling, logging, and monitoring to ensure your scheduled scrapers run reliably and efficiently. Always respect website terms of service, implement rate limiting, and consider using proxy rotation with Crawlee when scraping at scale.
By combining these scheduling techniques with Crawlee's powerful features, you can build robust, automated data collection pipelines that run reliably without manual intervention.