How do I Store Scraped Data Using Crawlee Datasets?
Crawlee provides a powerful and flexible dataset storage system that allows you to store scraped data efficiently and export it in various formats. Datasets are one of Crawlee's core features, offering incremental data persistence, straightforward export options, and seamless integration with the crawler workflow.
Understanding Crawlee Datasets
A dataset in Crawlee is a storage mechanism designed specifically for web scraping results. It stores data as individual records (objects) and provides methods to push data, retrieve it, and export it to different formats. Datasets are particularly useful because they:
- Automatically handle data persistence - Data is saved incrementally as you scrape (see the storage layout sketch after this list)
- Support multiple export formats - JSON and CSV locally, with additional formats such as Excel, HTML, XML, and RSS available for datasets stored on the Apify platform
- Append records without deduplication - Every record you push is stored; duplicate handling is left to your own logic (covered later in this article)
- Scale efficiently - Can handle millions of records
- Offer local and cloud storage - Works locally during development and can integrate with cloud platforms
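To make the persistence point concrete: when running locally with default settings, each record you push is written to the project's storage folder as a numbered JSON file. The layout below is a sketch of what you can expect (exact file names depend on your Crawlee version and storage backend):

./storage/
    datasets/
        default/
            000000001.json
            000000002.json
            000000003.json

Named datasets (covered later in this article) get their own folder alongside default/.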
Basic Dataset Usage
Storing Data with pushData()
The most common way to store data in Crawlee is the pushData() method, which is available both on the crawler instance and as a helper in the request handler context. Here's a basic example:
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, enqueueLinks, log }) {
        const title = $('title').text();
        const heading = $('h1').first().text();
        const description = $('meta[name="description"]').attr('content');

        // Store the scraped data
        await crawler.pushData({
            url: request.loadedUrl,
            title,
            heading,
            description,
            timestamp: new Date().toISOString(),
        });

        log.info(`Scraped: ${title}`);
    },
});

await crawler.run(['https://example.com']);
The pushData() method accepts a single object or an array of objects. Each object represents one record in your dataset.
Using the Dataset Class Directly
For more advanced use cases, you can work with the Dataset class directly:
import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, log }) {
        const products = [];

        $('.product').each((index, element) => {
            products.push({
                name: $(element).find('.product-name').text(),
                price: $(element).find('.price').text(),
                url: $(element).find('a').attr('href'),
            });
        });

        // Get the default dataset
        const dataset = await Dataset.open();

        // Push all products at once
        await dataset.pushData(products);

        log.info(`Scraped ${products.length} products`);
    },
});

await crawler.run(['https://example-shop.com/products']);
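If you'd rather not call Dataset.open() inside every request handler, you can open the dataset once before the crawl starts and reuse the reference. A minimal sketch, assuming a hypothetical 'products' dataset and the same selectors as above:

import { CheerioCrawler, Dataset } from 'crawlee';

// Open (or create) the named dataset once, before the crawl starts.
const productsDataset = await Dataset.open('products');

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, log }) {
        // Reuse the same dataset reference on every request.
        await productsDataset.pushData({
            url: request.loadedUrl,
            name: $('.product-name').text(),
            price: $('.price').text(),
        });
        log.info(`Stored product from ${request.loadedUrl}`);
    },
});

await crawler.run(['https://example-shop.com/products']);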
Python Implementation with Crawlee
If you're using Crawlee for Python, the dataset API is very similar:
import asyncio

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext

async def main():
    crawler = BeautifulSoupCrawler()

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        soup = context.soup

        # Extract data
        title_tag = soup.find('title')
        title = title_tag.get_text() if title_tag else ''
        headings = [h.get_text() for h in soup.find_all('h2')]

        # Store data using push_data
        await context.push_data({
            'url': context.request.url,
            'title': title,
            'headings': headings,
        })

    await crawler.run(['https://example.com'])

if __name__ == '__main__':
    asyncio.run(main())
Working with Named Datasets
By default, Crawlee uses a single "default" dataset. However, you can create multiple named datasets to organize different types of data:
import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, log }) {
        // Store product data in one dataset
        const productsDataset = await Dataset.open('products');
        await productsDataset.pushData({
            name: $('.product-name').text(),
            price: $('.price').text(),
        });

        // Store review data in another dataset.
        // Collect the reviews first - Cheerio's .each() doesn't await async
        // callbacks, so pushing from inside it could outlive the request handler.
        const reviews = [];
        $('.review').each((index, element) => {
            reviews.push({
                author: $(element).find('.author').text(),
                rating: $(element).find('.rating').text(),
                comment: $(element).find('.comment').text(),
            });
        });

        const reviewsDataset = await Dataset.open('reviews');
        await reviewsDataset.pushData(reviews);
    },
});

await crawler.run(['https://example-shop.com/product/123']);
Exporting Dataset Data
Crawlee makes it easy to export your scraped data in various formats. After your crawler finishes, you can export the entire dataset:
import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, log }) {
        await crawler.pushData({
            url: request.loadedUrl,
            title: $('title').text(),
        });
    },
});

await crawler.run(['https://example.com']);

// Read the scraped items back into memory
const dataset = await Dataset.open();
const data = await dataset.getData();
console.log(data.items); // Array of all scraped records

// Export to CSV - the file is saved to the default key-value store
// (locally: ./storage/key_value_stores/default/output.csv)
await dataset.exportToCSV('output');

// Export to JSON
await dataset.exportToJSON('output');

// Other formats such as Excel (XLSX), HTML, XML, and RSS are available
// when you export datasets stored on the Apify platform.
Retrieving Dataset Items Programmatically
You can retrieve and process dataset items within your code:
import { Dataset } from 'crawlee';

// Get all items
const dataset = await Dataset.open();
const { items } = await dataset.getData();

// Process the data
const processedData = items.map(item => ({
    ...item,
    processedAt: new Date().toISOString(),
    titleLength: item.title?.length ?? 0,
}));

console.log(`Total items scraped: ${items.length}`);
console.log(processedData);
Pagination and Large Datasets
For large datasets, you can retrieve data in chunks:
import { Dataset } from 'crawlee';

const dataset = await Dataset.open();
const limit = 100;
let offset = 0;

while (true) {
    const { items } = await dataset.getData({
        limit,
        offset,
    });
    if (items.length === 0) break;

    // Process this chunk
    console.log(`Processing items ${offset} to ${offset + items.length}`);
    offset += limit;
}
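If you simply want to visit every stored record without managing limit and offset yourself, the Dataset class also exposes iteration helpers such as forEach() (along with map() and reduce()). A short sketch:

import { Dataset } from 'crawlee';

const dataset = await Dataset.open();

// forEach() walks through all stored records in order, so you don't
// have to write the pagination loop yourself.
await dataset.forEach(async (item, index) => {
    console.log(`#${index}: ${item.url ?? '(no url)'}`);
});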
Data Deduplication
Crawlee doesn't automatically deduplicate dataset records, but you can implement your own deduplication logic:
import { CheerioCrawler } from 'crawlee';

const seenUrls = new Set();

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, log }) {
        const productUrl = $('.product-link').attr('href');

        // Only store if the URL exists and we haven't seen it before
        if (productUrl && !seenUrls.has(productUrl)) {
            seenUrls.add(productUrl);
            await crawler.pushData({
                url: productUrl,
                name: $('.product-name').text(),
                price: $('.price').text(),
            });
        }
    },
});

await crawler.run(['https://example-shop.com']);
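Keep in mind that the in-memory Set above is lost when the process exits. If deduplication needs to survive separate runs, one option is to persist the seen keys in a named key-value store (named storages are not purged between runs the way default ones are). A rough sketch, with 'dedup' and 'seen-product-urls' as example names:

import { CheerioCrawler, KeyValueStore } from 'crawlee';

// Named store, so the data survives the default purge-on-start behaviour.
const store = await KeyValueStore.open('dedup');

// Load URLs seen in earlier runs (falls back to an empty list on the first run).
const seenUrls = new Set((await store.getValue('seen-product-urls')) ?? []);

const crawler = new CheerioCrawler({
    async requestHandler({ $ }) {
        const productUrl = $('.product-link').attr('href');
        if (productUrl && !seenUrls.has(productUrl)) {
            seenUrls.add(productUrl);
            await crawler.pushData({
                url: productUrl,
                name: $('.product-name').text(),
            });
        }
    },
});

await crawler.run(['https://example-shop.com']);

// Persist the updated set for the next run.
await store.setValue('seen-product-urls', [...seenUrls]);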
Advanced Dataset Configuration
Custom Storage Location
By default, datasets are written to ./storage/datasets inside your project folder. You can point Crawlee at a different directory with the CRAWLEE_STORAGE_DIR environment variable:

import { CheerioCrawler } from 'crawlee';

// Run the script with the storage directory overridden, e.g.:
//   CRAWLEE_STORAGE_DIR=./my-custom-storage node crawler.js
// Dataset records will then end up under ./my-custom-storage/datasets.

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, log }) {
        await crawler.pushData({
            url: request.loadedUrl,
            data: $('body').text(),
        });
    },
});

await crawler.run(['https://example.com']);
Dropping Datasets
You can programmatically remove a dataset and all of its data:

import { Dataset } from 'crawlee';

const dataset = await Dataset.open('my-dataset');

// Remove the dataset and all of its records
await dataset.drop();
Best Practices for Dataset Storage
Structure your data consistently: Ensure all records have the same structure for easier export and analysis.
Use meaningful field names: Choose clear, descriptive names for your data fields.
Include metadata: Add timestamps, source URLs, and other metadata to track data provenance.
Handle missing data gracefully: Use null or undefined for missing values rather than omitting fields.
Validate before storing: Check that extracted data meets your requirements before pushing to the dataset.
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, log }) {
        const title = $('title').text().trim();
        const price = $('.price').text().trim();

        // Validate before storing
        if (title && price) {
            await crawler.pushData({
                url: request.loadedUrl,
                title,
                price,
                scrapedAt: new Date().toISOString(),
                source: 'example-shop',
            });
        } else {
            log.warning(`Missing data on ${request.loadedUrl}`);
        }
    },
});

await crawler.run(['https://example-shop.com']);
Integration with Browser Automation
When using PuppeteerCrawler in Crawlee, dataset storage works identically:
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    async requestHandler({ request, page, log }) {
        // Wait for dynamic content
        await page.waitForSelector('.product-list');

        // Extract data from the page
        const products = await page.$$eval('.product', elements =>
            elements.map(el => ({
                name: el.querySelector('.name')?.textContent,
                price: el.querySelector('.price')?.textContent,
                image: el.querySelector('img')?.src,
            }))
        );

        // Store all products
        await crawler.pushData(products);
        log.info(`Stored ${products.length} products`);
    },
});

await crawler.run(['https://example-shop.com']);
Conclusion
Crawlee's dataset system provides a robust solution for storing scraped data with minimal configuration. Whether you're scraping a few pages or millions of records, datasets handle the complexity of data persistence, allowing you to focus on extraction logic. By understanding the dataset API, export options, and best practices, you can build efficient and maintainable web scraping solutions.
The flexibility to work with multiple datasets, export to various formats, and integrate seamlessly with different crawler types makes Crawlee datasets an essential tool for any web scraping project.