How do I Store Scraped Data Using Crawlee Datasets?
Crawlee provides a powerful and flexible dataset storage system that allows you to store scraped data efficiently and export it in various formats. Datasets are one of Crawlee's core features, offering incremental data persistence, straightforward export options, and seamless integration with the crawler workflow.
Understanding Crawlee Datasets
A dataset in Crawlee is a storage mechanism designed specifically for web scraping results. It stores data as individual records (objects) and provides methods to push data, retrieve it, and export it to different formats. Datasets are particularly useful because they:
- Automatically handle data persistence - Data is saved incrementally as you scrape (see the storage layout sketch after this list)
- Support multiple export formats - JSON and CSV locally, with additional formats such as Excel, HTML, XML, and RSS available for datasets stored on the Apify platform
- Append records without deduplication - Every record you push is stored; duplicate handling is left to your own logic (covered later in this article)
- Scale efficiently - Can handle millions of records
- Offer local and cloud storage - Works locally during development and can integrate with cloud platforms
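To make the persistence point concrete: when running locally with default settings, each record you push is written to the project's storage folder as a numbered JSON file. The layout below is a sketch of what you can expect (exact file names depend on your Crawlee version and storage backend):

./storage/
    datasets/
        default/
            000000001.json
            000000002.json
            000000003.json

Named datasets (covered later in this article) get their own folder alongside default/.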
Basic Dataset Usage
Storing Data with pushData()
The most common way to store data in Crawlee is the pushData() method, which is available both on the crawler instance and as a helper in the request handler context. Here's a basic example:
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, enqueueLinks, log }) {
        const title = $('title').text();
        const heading = $('h1').first().text();
        const description = $('meta[name="description"]').attr('content');

        // Store the scraped data
        await crawler.pushData({
            url: request.loadedUrl,
            title,
            heading,
            description,
            timestamp: new Date().toISOString(),
        });

        log.info(`Scraped: ${title}`);
    },
});

await crawler.run(['https://example.com']);
The pushData() method accepts a single object or an array of objects. Each object represents one record in your dataset.
Using the Dataset Class Directly
For more advanced use cases, you can work with the Dataset class directly:
import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, log }) {
        const products = [];

        $('.product').each((index, element) => {
            products.push({
                name: $(element).find('.product-name').text(),
                price: $(element).find('.price').text(),
                url: $(element).find('a').attr('href'),
            });
        });

        // Get the default dataset
        const dataset = await Dataset.open();

        // Push all products at once
        await dataset.pushData(products);

        log.info(`Scraped ${products.length} products`);
    },
});

await crawler.run(['https://example-shop.com/products']);
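If you'd rather not call Dataset.open() inside every request handler, you can open the dataset once before the crawl starts and reuse the reference. A minimal sketch, assuming a hypothetical 'products' dataset and the same selectors as above:

import { CheerioCrawler, Dataset } from 'crawlee';

// Open (or create) the named dataset once, before the crawl starts.
const productsDataset = await Dataset.open('products');

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, log }) {
        // Reuse the same dataset reference on every request.
        await productsDataset.pushData({
            url: request.loadedUrl,
            name: $('.product-name').text(),
            price: $('.price').text(),
        });
        log.info(`Stored product from ${request.loadedUrl}`);
    },
});

await crawler.run(['https://example-shop.com/products']);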
Python Implementation with Crawlee
If you're using Crawlee for Python, the dataset API is very similar:
import asyncio

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext

async def main():
    crawler = BeautifulSoupCrawler()

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        soup = context.soup

        # Extract data
        title_tag = soup.find('title')
        title = title_tag.get_text() if title_tag else ''
        headings = [h.get_text() for h in soup.find_all('h2')]

        # Store data using push_data
        await context.push_data({
            'url': context.request.url,
            'title': title,
            'headings': headings,
        })

    await crawler.run(['https://example.com'])

if __name__ == '__main__':
    asyncio.run(main())
Working with Named Datasets
By default, Crawlee uses a single "default" dataset. However, you can create multiple named datasets to organize different types of data:
import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, log }) {
        // Store product data in one dataset
        const productsDataset = await Dataset.open('products');
        await productsDataset.pushData({
            name: $('.product-name').text(),
            price: $('.price').text(),
        });

        // Store review data in another dataset.
        // Collect the reviews first - Cheerio's .each() doesn't await async
        // callbacks, so pushing from inside it could outlive the request handler.
        const reviews = [];
        $('.review').each((index, element) => {
            reviews.push({
                author: $(element).find('.author').text(),
                rating: $(element).find('.rating').text(),
                comment: $(element).find('.comment').text(),
            });
        });

        const reviewsDataset = await Dataset.open('reviews');
        await reviewsDataset.pushData(reviews);
    },
});

await crawler.run(['https://example-shop.com/product/123']);
Exporting Dataset Data
Crawlee makes it easy to export your scraped data in various formats. After your crawler finishes, you can export the entire dataset:
import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, log }) {
        await crawler.pushData({
            url: request.loadedUrl,
            title: $('title').text(),
        });
    },
});

await crawler.run(['https://example.com']);

// Read the scraped items back into memory
const dataset = await Dataset.open();
const data = await dataset.getData();
console.log(data.items); // Array of all scraped records

// Export to CSV - the file is saved to the default key-value store
// (locally: ./storage/key_value_stores/default/output.csv)
await dataset.exportToCSV('output');

// Export to JSON
await dataset.exportToJSON('output');

// Other formats such as Excel (XLSX), HTML, XML, and RSS are available
// when you export datasets stored on the Apify platform.
Retrieving Dataset Items Programmatically
You can retrieve and process dataset items within your code:
import { Dataset } from 'crawlee';

// Get all items
const dataset = await Dataset.open();
const { items } = await dataset.getData();

// Process the data
const processedData = items.map(item => ({
    ...item,
    processedAt: new Date().toISOString(),
    titleLength: item.title?.length ?? 0,
}));

console.log(`Total items scraped: ${items.length}`);
console.log(processedData);
Pagination and Large Datasets
For large datasets, you can retrieve data in chunks:
import { Dataset } from 'crawlee';

const dataset = await Dataset.open();
const limit = 100;
let offset = 0;

while (true) {
    const { items } = await dataset.getData({
        limit,
        offset,
    });
    if (items.length === 0) break;

    // Process this chunk
    console.log(`Processing items ${offset} to ${offset + items.length}`);
    offset += limit;
}
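If you simply want to visit every stored record without managing limit and offset yourself, the Dataset class also exposes iteration helpers such as forEach() (along with map() and reduce()). A short sketch:

import { Dataset } from 'crawlee';

const dataset = await Dataset.open();

// forEach() walks through all stored records in order, so you don't
// have to write the pagination loop yourself.
await dataset.forEach(async (item, index) => {
    console.log(`#${index}: ${item.url ?? '(no url)'}`);
});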
Data Deduplication
Crawlee doesn't automatically deduplicate dataset records, but you can implement your own deduplication logic:
import { CheerioCrawler } from 'crawlee';

const seenUrls = new Set();

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, log }) {
        const productUrl = $('.product-link').attr('href');

        // Only store if the URL exists and we haven't seen it before
        if (productUrl && !seenUrls.has(productUrl)) {
            seenUrls.add(productUrl);
            await crawler.pushData({
                url: productUrl,
                name: $('.product-name').text(),
                price: $('.price').text(),
            });
        }
    },
});

await crawler.run(['https://example-shop.com']);
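Keep in mind that the in-memory Set above is lost when the process exits. If deduplication needs to survive separate runs, one option is to persist the seen keys in a named key-value store (named storages are not purged between runs the way default ones are). A rough sketch, with 'dedup' and 'seen-product-urls' as example names:

import { CheerioCrawler, KeyValueStore } from 'crawlee';

// Named store, so the data survives the default purge-on-start behaviour.
const store = await KeyValueStore.open('dedup');

// Load URLs seen in earlier runs (falls back to an empty list on the first run).
const seenUrls = new Set((await store.getValue('seen-product-urls')) ?? []);

const crawler = new CheerioCrawler({
    async requestHandler({ $ }) {
        const productUrl = $('.product-link').attr('href');
        if (productUrl && !seenUrls.has(productUrl)) {
            seenUrls.add(productUrl);
            await crawler.pushData({
                url: productUrl,
                name: $('.product-name').text(),
            });
        }
    },
});

await crawler.run(['https://example-shop.com']);

// Persist the updated set for the next run.
await store.setValue('seen-product-urls', [...seenUrls]);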
Advanced Dataset Configuration
Custom Storage Location
By default, datasets are written to ./storage/datasets inside your project folder. You can point Crawlee at a different directory with the CRAWLEE_STORAGE_DIR environment variable:

import { CheerioCrawler } from 'crawlee';

// Run the script with the storage directory overridden, e.g.:
//   CRAWLEE_STORAGE_DIR=./my-custom-storage node crawler.js
// Dataset records will then end up under ./my-custom-storage/datasets.

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, log }) {
        await crawler.pushData({
            url: request.loadedUrl,
            data: $('body').text(),
        });
    },
});

await crawler.run(['https://example.com']);
Dropping Datasets
You can programmatically remove a dataset and all of its data:

import { Dataset } from 'crawlee';

const dataset = await Dataset.open('my-dataset');

// Remove the dataset and all of its records
await dataset.drop();
Best Practices for Dataset Storage
Structure your data consistently: Ensure all records have the same structure for easier export and analysis.
Use meaningful field names: Choose clear, descriptive names for your data fields.
Include metadata: Add timestamps, source URLs, and other metadata to track data provenance.
Handle missing data gracefully: Use null or undefined for missing values rather than omitting fields.
Validate before storing: Check that extracted data meets your requirements before pushing to the dataset.
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, log }) {
        const title = $('title').text().trim();
        const price = $('.price').text().trim();

        // Validate before storing
        if (title && price) {
            await crawler.pushData({
                url: request.loadedUrl,
                title,
                price,
                scrapedAt: new Date().toISOString(),
                source: 'example-shop',
            });
        } else {
            log.warning(`Missing data on ${request.loadedUrl}`);
        }
    },
});

await crawler.run(['https://example-shop.com']);
Integration with Browser Automation
When using PuppeteerCrawler in Crawlee, dataset storage works identically:
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    async requestHandler({ request, page, log }) {
        // Wait for dynamic content
        await page.waitForSelector('.product-list');

        // Extract data from the page
        const products = await page.$$eval('.product', elements =>
            elements.map(el => ({
                name: el.querySelector('.name')?.textContent,
                price: el.querySelector('.price')?.textContent,
                image: el.querySelector('img')?.src,
            }))
        );

        // Store all products
        await crawler.pushData(products);
        log.info(`Stored ${products.length} products`);
    },
});

await crawler.run(['https://example-shop.com']);
Conclusion
Crawlee's dataset system provides a robust solution for storing scraped data with minimal configuration. Whether you're scraping a few pages or millions of records, datasets handle the complexity of data persistence, allowing you to focus on extraction logic. By understanding the dataset API, export options, and best practices, you can build efficient and maintainable web scraping solutions.
The flexibility to work with multiple datasets, export to various formats, and integrate seamlessly with different crawler types makes Crawlee datasets an essential tool for any web scraping project.