How do I use Crawlee's storage system for large datasets?
When web scraping at scale, efficiently managing large datasets is crucial for performance and reliability. Crawlee provides a sophisticated storage system designed specifically for handling massive amounts of scraped data without overwhelming your system's resources. This guide explores advanced techniques for working with large datasets, from memory-efficient data handling to distributed storage solutions.
Understanding Crawlee's Storage Architecture
Crawlee's storage system is built on three core components that work together to handle large datasets efficiently:
- Datasets: Store scraped results as individual records
- Key-Value Stores: Store arbitrary data like HTML snapshots, screenshots, or binary files
- Request Queues: Manage URLs to be crawled with persistence
For large-scale scraping operations, it is essential to understand how these components interact and how to keep their memory and disk usage under control.
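The snippet below is a minimal sketch of how each storage type is opened and used; the storage names ('products', 'snapshots') are illustrative.
import { Dataset, KeyValueStore, RequestQueue } from 'crawlee';
// Datasets hold one record per scraped item
const dataset = await Dataset.open('products');
await dataset.pushData({ name: 'Example product', price: '19.99' });
// Key-value stores hold arbitrary values such as HTML snapshots or screenshots
const kvStore = await KeyValueStore.open('snapshots');
await kvStore.setValue('homepage.html', '<html>...</html>', { contentType: 'text/html' });
// Request queues persist the URLs that are still waiting to be crawled
const queue = await RequestQueue.open();
await queue.addRequest({ url: 'https://example.com' });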
Configuring Storage for Large Datasets
Storage Directory Configuration
When dealing with large datasets, you'll want to configure where Crawlee stores data and how it manages memory:
import { PlaywrightCrawler, Configuration } from 'crawlee';
const config = new Configuration({
    persistStorage: true, // keep storage written to disk
    purgeOnStart: false, // keep data between runs
    // The storage directory and metadata writes are handled by the storage client;
    // the directory can also be set with the CRAWLEE_STORAGE_DIR environment variable
    storageClientOptions: {
        localDataDirectory: './large-crawl-storage',
        writeMetadata: false, // skip metadata files for performance
    },
});
const crawler = new PlaywrightCrawler({
maxRequestsPerCrawl: 100000, // Limit for safety
maxConcurrency: 50, // Adjust based on your resources
async requestHandler({ request, page }) {
// Your scraping logic
},
}, config);
Memory Management Settings
For large-scale operations, optimize memory usage:
import { PlaywrightCrawler, Configuration } from 'crawlee';
const crawler = new PlaywrightCrawler({
    // Configure autoscaling so concurrency adapts to available resources
    autoscaledPoolOptions: {
        minConcurrency: 5,
        desiredConcurrency: 20,
        maxConcurrency: 50,
        // Scale down before memory becomes the bottleneck
        snapshotterOptions: {
            maxUsedMemoryRatio: 0.7, // treat >70% of allowed memory as overloaded
        },
        // CPU limits can be tuned separately via the Configuration option maxUsedCpuRatio
    },
    async requestHandler({ request, page, pushData }) {
        const data = await extractData(page); // extractData is your own helper
        await pushData(data);
    },
});
Efficient Data Writing for Large Datasets
Batch Writing Strategy
Instead of writing data one record at a time, batch your writes for better performance:
import { CheerioCrawler, Dataset } from 'crawlee';
const crawler = new CheerioCrawler({
async requestHandler({ request, $, crawler }) {
const products = [];
// Extract multiple items from a single page
$('.product-item').each((index, element) => {
products.push({
name: $(element).find('.product-name').text().trim(),
price: $(element).find('.price').text().trim(),
url: $(element).find('a').attr('href'),
scrapedAt: new Date().toISOString(),
});
});
// Write all products in a single operation
if (products.length > 0) {
await crawler.pushData(products);
}
},
});
await crawler.run(['https://example-shop.com/products']);
Streaming Data to Reduce Memory Footprint
For extremely large datasets, process and write data in chunks:
import { PlaywrightCrawler, Dataset } from 'crawlee';
const BATCH_SIZE = 100;
let batch = [];
const crawler = new PlaywrightCrawler({
async requestHandler({ request, page }) {
const items = await page.$$eval('.item', elements =>
elements.map(el => ({
title: el.querySelector('.title')?.textContent,
description: el.querySelector('.desc')?.textContent,
}))
);
// Add items to batch
batch.push(...items);
// Write batch when it reaches threshold
if (batch.length >= BATCH_SIZE) {
await crawler.pushData(batch);
batch = []; // Clear batch to free memory
}
},
    async failedRequestHandler() {
        // Flush the buffered batch when a request exhausts its retries,
        // so already-extracted items are not lost if the crawl aborts
if (batch.length > 0) {
await crawler.pushData(batch);
batch = [];
}
},
});
await crawler.run(['https://example.com/items']);
// Flush any remaining items
if (batch.length > 0) {
const dataset = await Dataset.open();
await dataset.pushData(batch);
}
Reading Large Datasets Efficiently
Pagination for Large Dataset Retrieval
When reading millions of records, use pagination to avoid loading everything into memory:
import { Dataset } from 'crawlee';
async function processLargeDataset() {
const dataset = await Dataset.open('large-dataset');
const CHUNK_SIZE = 1000;
let offset = 0;
let processedCount = 0;
while (true) {
// Retrieve data in chunks
const { items } = await dataset.getData({
limit: CHUNK_SIZE,
offset: offset,
});
if (items.length === 0) break;
// Process chunk
for (const item of items) {
await processItem(item);
processedCount++;
}
console.log(`Processed ${processedCount} items...`);
offset += CHUNK_SIZE;
        // Optional: force a garbage-collection pass between chunks (requires node --expose-gc)
if (global.gc) {
global.gc();
}
}
console.log(`Total items processed: ${processedCount}`);
}
async function processItem(item) {
// Your processing logic here
console.log(item.name);
}
await processLargeDataset();
Python Implementation for Large Datasets
When using Crawlee with Python for web scraping, the storage API offers similar capabilities:
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
from crawlee.storages import Dataset
import asyncio
BATCH_SIZE = 100
batch = []
async def main():
crawler = BeautifulSoupCrawler()
@crawler.router.default_handler
async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
global batch
soup = context.soup
# Extract multiple items
items = []
for product in soup.find_all('div', class_='product'):
items.append({
'name': product.find('h2').get_text() if product.find('h2') else '',
'price': product.find('span', class_='price').get_text() if product.find('span', class_='price') else '',
})
batch.extend(items)
# Batch write when threshold reached
if len(batch) >= BATCH_SIZE:
await context.push_data(batch)
batch.clear() # Free memory
await crawler.run(['https://example-shop.com'])
# Flush remaining items
if batch:
dataset = await Dataset.open()
await dataset.push_data(batch)
if __name__ == '__main__':
asyncio.run(main())
Using Key-Value Stores for Binary Data
When scraping large amounts of binary data like images or PDFs, use key-value stores instead of datasets:
import { PlaywrightCrawler, KeyValueStore } from 'crawlee';
const crawler = new PlaywrightCrawler({
async requestHandler({ request, page }) {
const kvStore = await KeyValueStore.open('images');
// Extract image URLs
const imageUrls = await page.$$eval('img', imgs =>
imgs.map(img => img.src)
);
        // Download and store images with Playwright's API request context,
        // so the crawler never navigates away from the page being scraped
        for (let i = 0; i < imageUrls.length; i++) {
            const imageUrl = imageUrls[i];
            try {
                const response = await page.request.get(imageUrl);
                const buffer = await response.body();
                // Store with a unique key (JPEG content type assumed)
                const key = `${request.id}_image_${i}.jpg`;
                await kvStore.setValue(key, buffer, {
                    contentType: 'image/jpeg',
                });
} catch (error) {
console.error(`Failed to download ${imageUrl}:`, error);
}
}
},
});
Exporting Large Datasets
Streaming Export for Memory Efficiency
Export large datasets without loading everything into memory:
import { Dataset } from 'crawlee';
import fs from 'fs';
async function exportLargeDataset() {
const dataset = await Dataset.open('large-dataset');
const writeStream = fs.createWriteStream('output.json');
const CHUNK_SIZE = 500;
    let offset = 0;
    let exported = 0;
    let isFirst = true;
writeStream.write('[');
while (true) {
const { items } = await dataset.getData({
limit: CHUNK_SIZE,
offset: offset,
});
if (items.length === 0) break;
for (const item of items) {
if (!isFirst) {
writeStream.write(',');
}
writeStream.write(JSON.stringify(item));
isFirst = false;
}
        offset += CHUNK_SIZE;
        exported += items.length;
        console.log(`Exported ${exported} items...`);
}
writeStream.write(']');
writeStream.end();
console.log('Export complete!');
}
await exportLargeDataset();
Export to CSV for Large Datasets
CSV exports are more memory-efficient than JSON for very large datasets:
import { Dataset } from 'crawlee';
import fs from 'fs';
async function exportToCSV() {
const dataset = await Dataset.open('large-dataset');
const writeStream = fs.createWriteStream('output.csv');
    let offset = 0;
    let exported = 0;
    const CHUNK_SIZE = 1000;
    let headers = null;
while (true) {
const { items } = await dataset.getData({
limit: CHUNK_SIZE,
offset: offset,
});
if (items.length === 0) break;
        // Write the header once and remember the column order
        if (!headers) {
            headers = Object.keys(items[0]);
            writeStream.write(headers.join(',') + '\n');
        }
        // Write rows using the header's column order so columns never shift
        for (const item of items) {
            const values = headers.map((key) => {
                const value = item[key];
                return typeof value === 'string'
                    ? `"${value.replace(/"/g, '""')}"`
                    : value ?? '';
            });
            writeStream.write(values.join(',') + '\n');
        }
        offset += CHUNK_SIZE;
        exported += items.length;
        console.log(`Exported ${exported} rows...`);
}
writeStream.end();
console.log('CSV export complete!');
}
await exportToCSV();
Distributed Storage for Massive Datasets
Using Named Datasets for Partitioning
Split large datasets across multiple named datasets to improve performance:
import { CheerioCrawler, Dataset } from 'crawlee';
const crawler = new CheerioCrawler({
async requestHandler({ request, $ }) {
const products = [];
$('.product').each((index, element) => {
products.push({
name: $(element).find('.name').text(),
category: $(element).attr('data-category'),
price: $(element).find('.price').text(),
});
});
// Partition data by category into separate datasets
const categorizedData = {};
for (const product of products) {
const category = product.category || 'uncategorized';
if (!categorizedData[category]) {
categorizedData[category] = [];
}
categorizedData[category].push(product);
}
        // Write to separate named datasets, sanitizing the category so it
        // forms a valid dataset name
        for (const [category, items] of Object.entries(categorizedData)) {
            const safeName = category.toLowerCase().replace(/[^a-z0-9-]/g, '-');
            const dataset = await Dataset.open(`products-${safeName}`);
            await dataset.pushData(items);
        }
},
});
await crawler.run(['https://example-shop.com/all-products']);
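To process the partitions later, open each named dataset again and iterate it with forEach, which passes records to your callback one at a time. A brief sketch; the category names below are only examples.
import { Dataset } from 'crawlee';
// Example category names; use whatever categories your crawl produced
const categories = ['electronics', 'books', 'uncategorized'];
for (const category of categories) {
    const dataset = await Dataset.open(`products-${category}`);
    // forEach invokes the callback for every record in the dataset
    await dataset.forEach(async (item) => {
        console.log(category, item.name);
    });
}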
Monitoring Storage Usage
Track storage metrics to prevent disk space issues:
import { PlaywrightCrawler, Dataset } from 'crawlee';
import { promises as fs } from 'fs';
import path from 'path';
async function getStorageSize(dirPath) {
let totalSize = 0;
const files = await fs.readdir(dirPath, { withFileTypes: true });
for (const file of files) {
const filePath = path.join(dirPath, file.name);
if (file.isDirectory()) {
totalSize += await getStorageSize(filePath);
} else {
const stats = await fs.stat(filePath);
totalSize += stats.size;
}
}
return totalSize;
}
const crawler = new PlaywrightCrawler({
async requestHandler({ request, page, crawler }) {
// Your scraping logic
const data = await extractData(page);
await crawler.pushData(data);
// Monitor storage every 100 requests
const stats = await crawler.requestQueue.getInfo();
if (stats.handledRequestCount % 100 === 0) {
const storageSize = await getStorageSize('./large-crawl-storage');
const sizeMB = (storageSize / 1024 / 1024).toFixed(2);
console.log(`Storage size: ${sizeMB} MB`);
console.log(`Requests: ${stats.handledRequestCount} processed, ${stats.pendingRequestCount} pending`);
// Alert if storage exceeds threshold
if (storageSize > 10 * 1024 * 1024 * 1024) { // 10 GB
console.warn('Storage exceeding 10 GB!');
}
}
},
});
Best Practices for Large Dataset Management
1. Implement Data Validation Early
Validate data before storing to avoid processing invalid records later:
function validateProduct(product) {
return product.name &&
product.price &&
typeof product.price === 'string' &&
product.url &&
product.url.startsWith('http');
}
const crawler = new CheerioCrawler({
async requestHandler({ request, $, crawler }) {
const products = extractProducts($);
// Filter valid products before storing
const validProducts = products.filter(validateProduct);
if (validProducts.length > 0) {
await crawler.pushData(validProducts);
}
const invalidCount = products.length - validProducts.length;
if (invalidCount > 0) {
console.warn(`Skipped ${invalidCount} invalid products on ${request.url}`);
}
},
});
2. Use Compression for Text-Heavy Data
For datasets with large text fields, consider compressing before storage:
import { gzipSync, gunzipSync } from 'zlib';
import { KeyValueStore } from 'crawlee';
async function storeCompressedData(key, data) {
const kvStore = await KeyValueStore.open('compressed-data');
const jsonString = JSON.stringify(data);
const compressed = gzipSync(jsonString);
await kvStore.setValue(key, compressed, {
contentType: 'application/gzip',
});
}
async function retrieveCompressedData(key) {
const kvStore = await KeyValueStore.open('compressed-data');
const compressed = await kvStore.getValue(key);
if (!compressed) return null;
const decompressed = gunzipSync(compressed);
return JSON.parse(decompressed.toString());
}
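A short usage sketch for the helpers above; the handler context, the page.content() call, and the key naming are illustrative.
// Inside a Playwright request handler, for example:
const html = await page.content();
await storeCompressedData(`page-${request.id}`, { url: request.url, html });
// Later, in a post-processing step:
const snapshot = await retrieveCompressedData(`page-${request.id}`);
if (snapshot) {
    console.log(`Recovered ${snapshot.html.length} characters for ${snapshot.url}`);
}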
3. Implement Checkpoints for Long-Running Crawls
Save progress periodically to resume from failures:
import { PlaywrightCrawler, Dataset, KeyValueStore } from 'crawlee';
const crawler = new PlaywrightCrawler({
async requestHandler({ request, page, crawler }) {
const data = await extractData(page);
await crawler.pushData(data);
// Save checkpoint every 1000 requests
const stats = await crawler.requestQueue.getInfo();
if (stats.handledRequestCount % 1000 === 0) {
const checkpoint = await KeyValueStore.open('checkpoints');
await checkpoint.setValue('last-checkpoint', {
timestamp: new Date().toISOString(),
requestsProcessed: stats.handledRequestCount,
requestsPending: stats.pendingRequestCount,
});
console.log(`Checkpoint saved at ${stats.handledRequestCount} requests`);
}
},
});
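To resume after a failure, read the stored checkpoint before starting the next run. A minimal sketch, assuming the same 'checkpoints' store and 'last-checkpoint' key as above, and that purgeOnStart is disabled so the request queue survives between runs.
import { KeyValueStore } from 'crawlee';
// Read the last checkpoint, if any, before starting the crawler
const checkpointStore = await KeyValueStore.open('checkpoints');
const lastCheckpoint = await checkpointStore.getValue('last-checkpoint');
if (lastCheckpoint) {
    console.log(
        `Resuming after checkpoint from ${lastCheckpoint.timestamp} ` +
        `(${lastCheckpoint.requestsProcessed} requests already processed)`,
    );
}
// With purgeOnStart disabled, the persisted request queue still holds the
// pending requests, so the crawler picks up where the previous run stopped.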
4. Clean Up Storage After Processing
Once a crawl finishes, drop temporary datasets, request queues, and key-value stores to free disk space:
import { Dataset, RequestQueue, KeyValueStore } from 'crawlee';
async function cleanupAfterCrawl() {
// Export final data
const dataset = await Dataset.open();
await dataset.exportToJSON('final-output');
// Drop temporary datasets
const tempDataset = await Dataset.open('temp-data');
await tempDataset.drop();
// Clear request queue
const queue = await RequestQueue.open();
await queue.drop();
// Clean temporary key-value stores
const tempStore = await KeyValueStore.open('temp-storage');
await tempStore.drop();
console.log('Cleanup complete!');
}
Performance Optimization Strategies
Memory-Efficient Browser Management
When using PlaywrightCrawler in Crawlee for large-scale scraping, optimize browser resource usage:
import { PlaywrightCrawler } from 'crawlee';
const crawler = new PlaywrightCrawler({
launchContext: {
launchOptions: {
args: [
'--disable-dev-shm-usage',
'--disable-accelerated-2d-canvas',
'--no-first-run',
'--no-zygote',
'--disable-gpu',
],
},
},
// Reuse browser contexts for efficiency
useSessionPool: true,
persistCookiesPerSession: false,
// Limit concurrent browser instances
    maxConcurrency: 5,
    // Block unnecessary resources before navigation to save memory and bandwidth
    preNavigationHooks: [
        async ({ page }) => {
            await page.route('**/*', (route) => {
                const resourceType = route.request().resourceType();
                if (['image', 'stylesheet', 'font'].includes(resourceType)) {
                    route.abort();
                } else {
                    route.continue();
                }
            });
        },
    ],
    async requestHandler({ request, page, pushData }) {
        const data = await extractData(page); // your own extraction helper
        await pushData(data);
    },
});
Disk I/O Optimization
Minimize disk writes by buffering data:
import { CheerioCrawler, Dataset } from 'crawlee';
class BufferedDataset {
constructor(maxBufferSize = 500) {
this.buffer = [];
this.maxBufferSize = maxBufferSize;
this.dataset = null;
}
async init() {
this.dataset = await Dataset.open();
}
async add(data) {
this.buffer.push(data);
if (this.buffer.length >= this.maxBufferSize) {
await this.flush();
}
}
async flush() {
if (this.buffer.length > 0) {
await this.dataset.pushData(this.buffer);
console.log(`Flushed ${this.buffer.length} items to dataset`);
this.buffer = [];
}
}
}
const bufferedDataset = new BufferedDataset(500);
await bufferedDataset.init();
const crawler = new CheerioCrawler({
async requestHandler({ request, $ }) {
const items = extractItems($);
for (const item of items) {
await bufferedDataset.add(item);
}
},
});
await crawler.run(['https://example.com']);
// Don't forget to flush remaining items
await bufferedDataset.flush();
Conclusion
Crawlee's storage system provides powerful tools for handling large-scale web scraping datasets efficiently. By implementing batch writing strategies, using pagination for reads, optimizing memory usage, and following best practices for data validation and cleanup, you can successfully scrape and store millions of records without overwhelming your system.
Key takeaways for managing large datasets with Crawlee:
- Configure storage directories and memory limits appropriately
- Use batch writing to minimize disk I/O operations
- Implement pagination when reading large datasets
- Partition data across multiple named datasets for better organization
- Monitor storage usage and implement cleanup routines
- Optimize browser resource usage for memory-intensive crawls
- Stream exports for datasets that exceed available memory
- Use key-value stores for binary data like images and PDFs
With these techniques, you can build scalable web scraping solutions that efficiently handle datasets ranging from thousands to millions of records, while maintaining system stability and performance throughout the crawling process. For more information on efficiently storing scraped data using Crawlee datasets, check out our comprehensive guide on dataset management.