How do I use Crawlee's storage system for large datasets?
When web scraping at scale, efficiently managing large datasets is crucial for performance and reliability. Crawlee provides a sophisticated storage system designed specifically for handling massive amounts of scraped data without overwhelming your system's resources. This guide explores advanced techniques for working with large datasets, from memory-efficient data handling to distributed storage solutions.
Understanding Crawlee's Storage Architecture
Crawlee's storage system is built on three core components that work together to handle large datasets efficiently:
- Datasets: Store scraped results as individual records
- Key-Value Stores: Store arbitrary data like HTML snapshots, screenshots, or binary files
- Request Queues: Manage URLs to be crawled with persistence
For large-scale scraping operations, it is essential to understand how these components interact and how to keep their memory and disk usage under control.
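The snippet below is a minimal sketch of how each storage type is opened and used; the storage names ('products', 'snapshots') are illustrative.
import { Dataset, KeyValueStore, RequestQueue } from 'crawlee';
// Datasets hold one record per scraped item
const dataset = await Dataset.open('products');
await dataset.pushData({ name: 'Example product', price: '19.99' });
// Key-value stores hold arbitrary values such as HTML snapshots or screenshots
const kvStore = await KeyValueStore.open('snapshots');
await kvStore.setValue('homepage.html', '<html>...</html>', { contentType: 'text/html' });
// Request queues persist the URLs that are still waiting to be crawled
const queue = await RequestQueue.open();
await queue.addRequest({ url: 'https://example.com' });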
Configuring Storage for Large Datasets
Storage Directory Configuration
When dealing with large datasets, you'll want to configure where Crawlee stores data and how it manages memory:
import { PlaywrightCrawler, Configuration } from 'crawlee';
const config = new Configuration({
    persistStorage: true, // keep storage written to disk
    purgeOnStart: false, // keep data between runs
    // The storage directory and metadata writes are handled by the storage client;
    // the directory can also be set with the CRAWLEE_STORAGE_DIR environment variable
    storageClientOptions: {
        localDataDirectory: './large-crawl-storage',
        writeMetadata: false, // skip metadata files for performance
    },
});
const crawler = new PlaywrightCrawler({
maxRequestsPerCrawl: 100000, // Limit for safety
maxConcurrency: 50, // Adjust based on your resources
async requestHandler({ request, page }) {
// Your scraping logic
},
}, config);
Memory Management Settings
For large-scale operations, optimize memory usage:
import { PlaywrightCrawler, Configuration } from 'crawlee';
const crawler = new PlaywrightCrawler({
    // Configure autoscaling so concurrency adapts to available resources
    autoscaledPoolOptions: {
        minConcurrency: 5,
        desiredConcurrency: 20,
        maxConcurrency: 50,
        // Scale down before memory becomes the bottleneck
        snapshotterOptions: {
            maxUsedMemoryRatio: 0.7, // treat >70% of allowed memory as overloaded
        },
        // CPU limits can be tuned separately via the Configuration option maxUsedCpuRatio
    },
    async requestHandler({ request, page, pushData }) {
        const data = await extractData(page); // extractData is your own helper
        await pushData(data);
    },
});
Efficient Data Writing for Large Datasets
Batch Writing Strategy
Instead of writing data one record at a time, batch your writes for better performance:
import { CheerioCrawler, Dataset } from 'crawlee';
const crawler = new CheerioCrawler({
async requestHandler({ request, $, crawler }) {
const products = [];
// Extract multiple items from a single page
$('.product-item').each((index, element) => {
products.push({
name: $(element).find('.product-name').text().trim(),
price: $(element).find('.price').text().trim(),
url: $(element).find('a').attr('href'),
scrapedAt: new Date().toISOString(),
});
});
// Write all products in a single operation
if (products.length > 0) {
await crawler.pushData(products);
}
},
});
await crawler.run(['https://example-shop.com/products']);
Streaming Data to Reduce Memory Footprint
For extremely large datasets, process and write data in chunks:
import { PlaywrightCrawler, Dataset } from 'crawlee';
const BATCH_SIZE = 100;
let batch = [];
const crawler = new PlaywrightCrawler({
async requestHandler({ request, page }) {
const items = await page.$$eval('.item', elements =>
elements.map(el => ({
title: el.querySelector('.title')?.textContent,
description: el.querySelector('.desc')?.textContent,
}))
);
// Add items to batch
batch.push(...items);
// Write batch when it reaches threshold
if (batch.length >= BATCH_SIZE) {
await crawler.pushData(batch);
batch = []; // Clear batch to free memory
}
},
    async failedRequestHandler() {
        // Flush the buffered batch when a request exhausts its retries,
        // so already-extracted items are not lost if the crawl aborts
if (batch.length > 0) {
await crawler.pushData(batch);
batch = [];
}
},
});
await crawler.run(['https://example.com/items']);
// Flush any remaining items
if (batch.length > 0) {
const dataset = await Dataset.open();
await dataset.pushData(batch);
}
Reading Large Datasets Efficiently
Pagination for Large Dataset Retrieval
When reading millions of records, use pagination to avoid loading everything into memory:
import { Dataset } from 'crawlee';
async function processLargeDataset() {
const dataset = await Dataset.open('large-dataset');
const CHUNK_SIZE = 1000;
let offset = 0;
let processedCount = 0;
while (true) {
// Retrieve data in chunks
const { items } = await dataset.getData({
limit: CHUNK_SIZE,
offset: offset,
});
if (items.length === 0) break;
// Process chunk
for (const item of items) {
await processItem(item);
processedCount++;
}
console.log(`Processed ${processedCount} items...`);
offset += CHUNK_SIZE;
        // Optional: force a garbage-collection pass between chunks (requires node --expose-gc)
if (global.gc) {
global.gc();
}
}
console.log(`Total items processed: ${processedCount}`);
}
async function processItem(item) {
// Your processing logic here
console.log(item.name);
}
await processLargeDataset();
Python Implementation for Large Datasets
When using Crawlee with Python for web scraping, the storage API offers similar capabilities:
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
from crawlee.storages import Dataset
import asyncio
BATCH_SIZE = 100
batch = []
async def main():
crawler = BeautifulSoupCrawler()
@crawler.router.default_handler
async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
global batch
soup = context.soup
# Extract multiple items
items = []
for product in soup.find_all('div', class_='product'):
items.append({
'name': product.find('h2').get_text() if product.find('h2') else '',
'price': product.find('span', class_='price').get_text() if product.find('span', class_='price') else '',
})
batch.extend(items)
# Batch write when threshold reached
if len(batch) >= BATCH_SIZE:
await context.push_data(batch)
batch.clear() # Free memory
await crawler.run(['https://example-shop.com'])
# Flush remaining items
if batch:
dataset = await Dataset.open()
await dataset.push_data(batch)
if __name__ == '__main__':
asyncio.run(main())
Using Key-Value Stores for Binary Data
When scraping large amounts of binary data like images or PDFs, use key-value stores instead of datasets:
import { PlaywrightCrawler, KeyValueStore } from 'crawlee';
const crawler = new PlaywrightCrawler({
async requestHandler({ request, page }) {
const kvStore = await KeyValueStore.open('images');
// Extract image URLs
const imageUrls = await page.$$eval('img', imgs =>
imgs.map(img => img.src)
);
        // Download and store images with Playwright's API request context,
        // so the crawler never navigates away from the page being scraped
        for (let i = 0; i < imageUrls.length; i++) {
            const imageUrl = imageUrls[i];
            try {
                const response = await page.request.get(imageUrl);
                const buffer = await response.body();
                // Store with a unique key (JPEG content type assumed)
                const key = `${request.id}_image_${i}.jpg`;
                await kvStore.setValue(key, buffer, {
                    contentType: 'image/jpeg',
                });
} catch (error) {
console.error(`Failed to download ${imageUrl}:`, error);
}
}
},
});
Exporting Large Datasets
Streaming Export for Memory Efficiency
Export large datasets without loading everything into memory:
import { Dataset } from 'crawlee';
import fs from 'fs';
async function exportLargeDataset() {
const dataset = await Dataset.open('large-dataset');
const writeStream = fs.createWriteStream('output.json');
const CHUNK_SIZE = 500;
    let offset = 0;
    let exported = 0;
    let isFirst = true;
writeStream.write('[');
while (true) {
const { items } = await dataset.getData({
limit: CHUNK_SIZE,
offset: offset,
});
if (items.length === 0) break;
for (const item of items) {
if (!isFirst) {
writeStream.write(',');
}
writeStream.write(JSON.stringify(item));
isFirst = false;
}
        offset += CHUNK_SIZE;
        exported += items.length;
        console.log(`Exported ${exported} items...`);
}
writeStream.write(']');
writeStream.end();
console.log('Export complete!');
}
await exportLargeDataset();
Export to CSV for Large Datasets
CSV exports are more memory-efficient than JSON for very large datasets:
import { Dataset } from 'crawlee';
import fs from 'fs';
async function exportToCSV() {
const dataset = await Dataset.open('large-dataset');
const writeStream = fs.createWriteStream('output.csv');
    let offset = 0;
    let exported = 0;
    const CHUNK_SIZE = 1000;
    let headers = null;
while (true) {
const { items } = await dataset.getData({
limit: CHUNK_SIZE,
offset: offset,
});
if (items.length === 0) break;
        // Write the header once and remember the column order
        if (!headers) {
            headers = Object.keys(items[0]);
            writeStream.write(headers.join(',') + '\n');
        }
        // Write rows using the header's column order so columns never shift
        for (const item of items) {
            const values = headers.map((key) => {
                const value = item[key];
                return typeof value === 'string'
                    ? `"${value.replace(/"/g, '""')}"`
                    : value ?? '';
            });
            writeStream.write(values.join(',') + '\n');
        }
        offset += CHUNK_SIZE;
        exported += items.length;
        console.log(`Exported ${exported} rows...`);
}
writeStream.end();
console.log('CSV export complete!');
}
await exportToCSV();
Distributed Storage for Massive Datasets
Using Named Datasets for Partitioning
Split large datasets across multiple named datasets to improve performance:
import { CheerioCrawler, Dataset } from 'crawlee';
const crawler = new CheerioCrawler({
async requestHandler({ request, $ }) {
const products = [];
$('.product').each((index, element) => {
products.push({
name: $(element).find('.name').text(),
category: $(element).attr('data-category'),
price: $(element).find('.price').text(),
});
});
// Partition data by category into separate datasets
const categorizedData = {};
for (const product of products) {
const category = product.category || 'uncategorized';
if (!categorizedData[category]) {
categorizedData[category] = [];
}
categorizedData[category].push(product);
}
        // Write to separate named datasets, sanitizing the category so it
        // forms a valid dataset name
        for (const [category, items] of Object.entries(categorizedData)) {
            const safeName = category.toLowerCase().replace(/[^a-z0-9-]/g, '-');
            const dataset = await Dataset.open(`products-${safeName}`);
            await dataset.pushData(items);
        }
},
});
await crawler.run(['https://example-shop.com/all-products']);
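To process the partitions later, open each named dataset again and iterate it with forEach, which passes records to your callback one at a time. A brief sketch; the category names below are only examples.
import { Dataset } from 'crawlee';
// Example category names; use whatever categories your crawl produced
const categories = ['electronics', 'books', 'uncategorized'];
for (const category of categories) {
    const dataset = await Dataset.open(`products-${category}`);
    // forEach invokes the callback for every record in the dataset
    await dataset.forEach(async (item) => {
        console.log(category, item.name);
    });
}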
Monitoring Storage Usage
Track storage metrics to prevent disk space issues:
import { PlaywrightCrawler, Dataset } from 'crawlee';
import { promises as fs } from 'fs';
import path from 'path';
async function getStorageSize(dirPath) {
let totalSize = 0;
const files = await fs.readdir(dirPath, { withFileTypes: true });
for (const file of files) {
const filePath = path.join(dirPath, file.name);
if (file.isDirectory()) {
totalSize += await getStorageSize(filePath);
} else {
const stats = await fs.stat(filePath);
totalSize += stats.size;
}
}
return totalSize;
}
const crawler = new PlaywrightCrawler({
async requestHandler({ request, page, crawler }) {
// Your scraping logic
const data = await extractData(page);
await crawler.pushData(data);
// Monitor storage every 100 requests
const stats = await crawler.requestQueue.getInfo();
if (stats.handledRequestCount % 100 === 0) {
const storageSize = await getStorageSize('./large-crawl-storage');
const sizeMB = (storageSize / 1024 / 1024).toFixed(2);
console.log(`Storage size: ${sizeMB} MB`);
console.log(`Requests: ${stats.handledRequestCount} processed, ${stats.pendingRequestCount} pending`);
// Alert if storage exceeds threshold
if (storageSize > 10 * 1024 * 1024 * 1024) { // 10 GB
console.warn('Storage exceeding 10 GB!');
}
}
},
});
Best Practices for Large Dataset Management
1. Implement Data Validation Early
Validate data before storing to avoid processing invalid records later:
function validateProduct(product) {
return product.name &&
product.price &&
typeof product.price === 'string' &&
product.url &&
product.url.startsWith('http');
}
const crawler = new CheerioCrawler({
async requestHandler({ request, $, crawler }) {
const products = extractProducts($);
// Filter valid products before storing
const validProducts = products.filter(validateProduct);
if (validProducts.length > 0) {
await crawler.pushData(validProducts);
}
const invalidCount = products.length - validProducts.length;
if (invalidCount > 0) {
console.warn(`Skipped ${invalidCount} invalid products on ${request.url}`);
}
},
});
2. Use Compression for Text-Heavy Data
For datasets with large text fields, consider compressing before storage:
import { gzipSync, gunzipSync } from 'zlib';
import { KeyValueStore } from 'crawlee';
async function storeCompressedData(key, data) {
const kvStore = await KeyValueStore.open('compressed-data');
const jsonString = JSON.stringify(data);
const compressed = gzipSync(jsonString);
await kvStore.setValue(key, compressed, {
contentType: 'application/gzip',
});
}
async function retrieveCompressedData(key) {
const kvStore = await KeyValueStore.open('compressed-data');
const compressed = await kvStore.getValue(key);
if (!compressed) return null;
const decompressed = gunzipSync(compressed);
return JSON.parse(decompressed.toString());
}
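A short usage sketch for the helpers above; the handler context, the page.content() call, and the key naming are illustrative.
// Inside a Playwright request handler, for example:
const html = await page.content();
await storeCompressedData(`page-${request.id}`, { url: request.url, html });
// Later, in a post-processing step:
const snapshot = await retrieveCompressedData(`page-${request.id}`);
if (snapshot) {
    console.log(`Recovered ${snapshot.html.length} characters for ${snapshot.url}`);
}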
3. Implement Checkpoints for Long-Running Crawls
Save progress periodically to resume from failures:
import { PlaywrightCrawler, Dataset, KeyValueStore } from 'crawlee';
const crawler = new PlaywrightCrawler({
async requestHandler({ request, page, crawler }) {
const data = await extractData(page);
await crawler.pushData(data);
// Save checkpoint every 1000 requests
const stats = await crawler.requestQueue.getInfo();
if (stats.handledRequestCount % 1000 === 0) {
const checkpoint = await KeyValueStore.open('checkpoints');
await checkpoint.setValue('last-checkpoint', {
timestamp: new Date().toISOString(),
requestsProcessed: stats.handledRequestCount,
requestsPending: stats.pendingRequestCount,
});
console.log(`Checkpoint saved at ${stats.handledRequestCount} requests`);
}
},
});
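To resume after a failure, read the stored checkpoint before starting the next run. A minimal sketch, assuming the same 'checkpoints' store and 'last-checkpoint' key as above, and that purgeOnStart is disabled so the request queue survives between runs.
import { KeyValueStore } from 'crawlee';
// Read the last checkpoint, if any, before starting the crawler
const checkpointStore = await KeyValueStore.open('checkpoints');
const lastCheckpoint = await checkpointStore.getValue('last-checkpoint');
if (lastCheckpoint) {
    console.log(
        `Resuming after checkpoint from ${lastCheckpoint.timestamp} ` +
        `(${lastCheckpoint.requestsProcessed} requests already processed)`,
    );
}
// With purgeOnStart disabled, the persisted request queue still holds the
// pending requests, so the crawler picks up where the previous run stopped.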
4. Clean Up Storage After Processing
Once a crawl finishes, drop temporary datasets, request queues, and key-value stores to free disk space:
import { Dataset, RequestQueue, KeyValueStore } from 'crawlee';
async function cleanupAfterCrawl() {
// Export final data
const dataset = await Dataset.open();
await dataset.exportToJSON('final-output');
// Drop temporary datasets
const tempDataset = await Dataset.open('temp-data');
await tempDataset.drop();
// Clear request queue
const queue = await RequestQueue.open();
await queue.drop();
// Clean temporary key-value stores
const tempStore = await KeyValueStore.open('temp-storage');
await tempStore.drop();
console.log('Cleanup complete!');
}
Performance Optimization Strategies
Memory-Efficient Browser Management
When using PlaywrightCrawler in Crawlee for large-scale scraping, optimize browser resource usage:
import { PlaywrightCrawler } from 'crawlee';
const crawler = new PlaywrightCrawler({
launchContext: {
launchOptions: {
args: [
'--disable-dev-shm-usage',
'--disable-accelerated-2d-canvas',
'--no-first-run',
'--no-zygote',
'--disable-gpu',
],
},
},
// Reuse browser contexts for efficiency
useSessionPool: true,
persistCookiesPerSession: false,
// Limit concurrent browser instances
    maxConcurrency: 5,
    // Block unnecessary resources before navigation to save memory and bandwidth
    preNavigationHooks: [
        async ({ page }) => {
            await page.route('**/*', (route) => {
                const resourceType = route.request().resourceType();
                if (['image', 'stylesheet', 'font'].includes(resourceType)) {
                    route.abort();
                } else {
                    route.continue();
                }
            });
        },
    ],
    async requestHandler({ request, page, pushData }) {
        const data = await extractData(page); // your own extraction helper
        await pushData(data);
    },
});
Disk I/O Optimization
Minimize disk writes by buffering data:
import { CheerioCrawler, Dataset } from 'crawlee';
class BufferedDataset {
constructor(maxBufferSize = 500) {
this.buffer = [];
this.maxBufferSize = maxBufferSize;
this.dataset = null;
}
async init() {
this.dataset = await Dataset.open();
}
async add(data) {
this.buffer.push(data);
if (this.buffer.length >= this.maxBufferSize) {
await this.flush();
}
}
async flush() {
if (this.buffer.length > 0) {
await this.dataset.pushData(this.buffer);
console.log(`Flushed ${this.buffer.length} items to dataset`);
this.buffer = [];
}
}
}
const bufferedDataset = new BufferedDataset(500);
await bufferedDataset.init();
const crawler = new CheerioCrawler({
async requestHandler({ request, $ }) {
const items = extractItems($);
for (const item of items) {
await bufferedDataset.add(item);
}
},
});
await crawler.run(['https://example.com']);
// Don't forget to flush remaining items
await bufferedDataset.flush();
Conclusion
Crawlee's storage system provides powerful tools for handling large-scale web scraping datasets efficiently. By implementing batch writing strategies, using pagination for reads, optimizing memory usage, and following best practices for data validation and cleanup, you can successfully scrape and store millions of records without overwhelming your system.
Key takeaways for managing large datasets with Crawlee:
- Configure storage directories and memory limits appropriately
- Use batch writing to minimize disk I/O operations
- Implement pagination when reading large datasets
- Partition data across multiple named datasets for better organization
- Monitor storage usage and implement cleanup routines
- Optimize browser resource usage for memory-intensive crawls
- Stream exports for datasets that exceed available memory
- Use key-value stores for binary data like images and PDFs
With these techniques, you can build scalable web scraping solutions that efficiently handle datasets ranging from thousands to millions of records, while maintaining system stability and performance throughout the crawling process. For more information on efficiently storing scraped data using Crawlee datasets, check out our comprehensive guide on dataset management.