How do I Store Scraped Data Using Crawlee Datasets?

Crawlee provides a powerful and flexible dataset storage system that lets you store scraped data efficiently and export it in various formats. Datasets are one of Crawlee's core features, offering incremental data persistence, multiple export formats, and seamless integration with the crawler workflow.

Understanding Crawlee Datasets

A dataset in Crawlee is a storage mechanism designed specifically for web scraping results. It stores data as individual records (objects) and provides methods to push data, retrieve it, and export it to different formats. Datasets are particularly useful because they:

  • Automatically handle data persistence - Data is saved incrementally as you scrape
  • Support multiple export formats - JSON and CSV locally, with additional formats (Excel, HTML, XML, RSS) available when datasets are pushed to the Apify platform
  • Store records append-only - records are never modified once pushed; deduplication is left to your own code (see the Data Deduplication section below)
  • Scale efficiently - Can handle millions of records
  • Offer local and cloud storage - Works locally during development and can integrate with cloud platforms
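
When running locally, each pushed record is persisted as an individual JSON file inside the project's storage directory (the location can be changed, as covered later in this article). The on-disk layout looks roughly like this:

storage/
└── datasets/
    └── default/
        ├── 000000001.json
        ├── 000000002.json
        └── ...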

Basic Dataset Usage

Storing Data with pushData()

The most common way to store data in Crawlee is using the pushData() method available in the crawler's context. Here's a basic example:

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, pushData, log }) {
        const title = $('title').text();
        const heading = $('h1').first().text();
        const description = $('meta[name="description"]').attr('content');

        // Store the scraped data using the context helper
        await pushData({
            url: request.loadedUrl,
            title,
            heading,
            description,
            timestamp: new Date().toISOString(),
        });

        log.info(`Scraped: ${title}`);
    },
});

await crawler.run(['https://example.com']);

The pushData() method accepts a single object or an array of objects. Each object represents one record in your dataset.
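
For example, inside a requestHandler you can push one record at a time or batch several records into a single call; the URLs and titles below are placeholder values:

// One record per call
await pushData({ url: request.loadedUrl, title });

// Several records in a single call
await pushData([
    { url: 'https://example.com/a', title: 'Page A' },
    { url: 'https://example.com/b', title: 'Page B' },
]);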

Using the Dataset Class Directly

For more advanced use cases, you can work with the Dataset class directly:

import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, log }) {
        const products = [];

        $('.product').each((index, element) => {
            products.push({
                name: $(element).find('.product-name').text(),
                price: $(element).find('.price').text(),
                url: $(element).find('a').attr('href'),
            });
        });

        // Get the default dataset
        const dataset = await Dataset.open();

        // Push all products at once
        await dataset.pushData(products);

        log.info(`Scraped ${products.length} products`);
    },
});

await crawler.run(['https://example-shop.com/products']);

Python Implementation with Crawlee

If you're using Crawlee with Python for web scraping, the dataset API is similar:

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext

async def main():
    crawler = BeautifulSoupCrawler()

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        soup = context.soup

        # Extract data
        title = soup.find('title').get_text() if soup.find('title') else ''
        headings = [h.get_text() for h in soup.find_all('h2')]

        # Store data using push_data
        await context.push_data({
            'url': context.request.url,
            'title': title,
            'headings': headings,
        })

    await crawler.run(['https://example.com'])

if __name__ == '__main__':
    import asyncio
    asyncio.run(main())

Working with Named Datasets

By default, Crawlee uses a single "default" dataset. However, you can create multiple named datasets to organize different types of data:

import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, log }) {
        // Store product data in one dataset
        const productsDataset = await Dataset.open('products');
        await productsDataset.pushData({
            name: $('.product-name').text(),
            price: $('.price').text(),
        });

        // Store review data in another dataset.
        // Collect the reviews first, then push them in one awaited call
        // (cheerio's .each() does not await async callbacks).
        const reviewsDataset = await Dataset.open('reviews');
        const reviews = $('.review')
            .map((index, element) => ({
                author: $(element).find('.author').text(),
                rating: $(element).find('.rating').text(),
                comment: $(element).find('.comment').text(),
            }))
            .get();
        await reviewsDataset.pushData(reviews);
    },
});

await crawler.run(['https://example-shop.com/product/123']);
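
Each named dataset is kept separately on disk (locally under storage/datasets/products and storage/datasets/reviews in this example) and can be reopened later, for instance to export each one to its own file using the export helpers covered in the next section. A brief sketch:

import { Dataset } from 'crawlee';

// Reopen the named datasets after the crawl and export them separately
const products = await Dataset.open('products');
await products.exportToCSV('products');

const reviews = await Dataset.open('reviews');
await reviews.exportToCSV('reviews');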

Exporting Dataset Data

Crawlee makes it easy to export your scraped data. Locally, CSV and JSON exports are available out of the box; after your crawler finishes, you can export the entire dataset:

import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, log }) {
        await crawler.pushData({
            url: request.loadedUrl,
            title: $('title').text(),
        });
    },
});

await crawler.run(['https://example.com']);

// Open the default dataset and read all scraped records
const dataset = await Dataset.open();
const data = await dataset.getData();
console.log(data.items); // Array of all scraped records

// Export the whole dataset to a CSV file (saved in the key-value store)
await dataset.exportToCSV('output');

// Export the whole dataset to a JSON file (saved in the key-value store)
await dataset.exportToJSON('output');

Retrieving Dataset Items Programmatically

You can retrieve and process dataset items within your code:

import { Dataset } from 'crawlee';

// Get all items
const dataset = await Dataset.open();
const { items } = await dataset.getData();

// Process the data
const processedData = items.map(item => ({
    ...item,
    processedAt: new Date().toISOString(),
    titleLength: item.title ? item.title.length : 0,
}));

console.log(`Total items scraped: ${items.length}`);
console.log(processedData);

Pagination and Large Datasets

For large datasets, you can retrieve data in chunks:

import { Dataset } from 'crawlee';

const dataset = await Dataset.open();
const limit = 100;
let offset = 0;

while (true) {
    const { items } = await dataset.getData({
        limit,
        offset
    });

    if (items.length === 0) break;

    // Process this chunk
    console.log(`Processing items ${offset} to ${offset + items.length}`);

    offset += limit;
}
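
Alternatively, the Dataset class provides iteration helpers such as forEach(), map(), and reduce() that take care of paging internally. A minimal sketch using forEach(), assuming the default dataset already contains items with a url field:

import { Dataset } from 'crawlee';

const dataset = await Dataset.open();

// Visit every item without managing limit/offset manually
await dataset.forEach(async (item, index) => {
    console.log(`Item #${index}: ${item.url}`);
});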

Data Deduplication

Crawlee doesn't automatically deduplicate dataset records, but you can implement your own deduplication logic:

import { CheerioCrawler, Dataset } from 'crawlee';

const seenUrls = new Set();

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, log }) {
        const productUrl = $('.product-link').attr('href');

        // Only store if we haven't seen this URL before
        if (!seenUrls.has(productUrl)) {
            seenUrls.add(productUrl);

            await crawler.pushData({
                url: productUrl,
                name: $('.product-name').text(),
                price: $('.price').text(),
            });
        }
    },
});

await crawler.run(['https://example-shop.com']);
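
Note that the in-memory Set above is lost when the process exits, so a restarted crawl may store duplicates again. One way to make deduplication survive restarts is to keep the seen URLs in Crawlee's key-value store; the sketch below uses a hypothetical key name 'SEEN_URLS' and is only one possible approach:

import { KeyValueStore } from 'crawlee';

// Load previously seen URLs (if any) before the crawl starts
const store = await KeyValueStore.open();
const seenUrls = new Set((await store.getValue('SEEN_URLS')) ?? []);

// ... run the crawler, adding each stored URL to seenUrls as shown above ...

// Persist the set so the next run can skip already stored records
await store.setValue('SEEN_URLS', [...seenUrls]);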

Advanced Dataset Configuration

Custom Storage Location

You can change where datasets are stored on disk. The simplest documented way is the CRAWLEE_STORAGE_DIR environment variable, which must take effect before any storage is opened:

import { CheerioCrawler } from 'crawlee';

// The storage directory is usually set outside the code (in the shell or an
// .env file), e.g. CRAWLEE_STORAGE_DIR=./my-custom-storage; setting it here
// also works as long as it happens before the first storage is opened.
process.env.CRAWLEE_STORAGE_DIR = './my-custom-storage';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, log }) {
        await crawler.pushData({
            url: request.loadedUrl,
            data: $('body').text(),
        });
    },
});

await crawler.run(['https://example.com']);

Dropping Datasets

You can programmatically remove a dataset together with all of its data when you no longer need it:

import { Dataset } from 'crawlee';

const dataset = await Dataset.open('my-dataset');

// Remove the dataset and all of its records
await dataset.drop();

Best Practices for Dataset Storage

  1. Structure your data consistently: Ensure all records have the same structure for easier export and analysis.

  2. Use meaningful field names: Choose clear, descriptive names for your data fields.

  3. Include metadata: Add timestamps, source URLs, and other metadata to track data provenance.

  4. Handle missing data gracefully: Use null or undefined for missing values rather than omitting fields.

  5. Validate before storing: Check that extracted data meets your requirements before pushing to the dataset, as the following example demonstrates.

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, log }) {
        const title = $('title').text().trim();
        const price = $('.price').text().trim();

        // Validate before storing
        if (title && price) {
            await crawler.pushData({
                url: request.loadedUrl,
                title,
                price,
                scrapedAt: new Date().toISOString(),
                source: 'example-shop',
            });
        } else {
            log.warning(`Missing data on ${request.loadedUrl}`);
        }
    },
});

await crawler.run(['https://example-shop.com']);

Integration with Browser Automation

When using PuppeteerCrawler in Crawlee, dataset storage works identically:

import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    async requestHandler({ request, page, log }) {
        // Wait for dynamic content
        await page.waitForSelector('.product-list');

        // Extract data from the page
        const products = await page.$$eval('.product', elements =>
            elements.map(el => ({
                name: el.querySelector('.name')?.textContent,
                price: el.querySelector('.price')?.textContent,
                image: el.querySelector('img')?.src,
            }))
        );

        // Store all products
        await crawler.pushData(products);

        log.info(`Stored ${products.length} products`);
    },
});

await crawler.run(['https://example-shop.com']);

Conclusion

Crawlee's dataset system provides a robust solution for storing scraped data with minimal configuration. Whether you're scraping a few pages or millions of records, datasets handle the complexity of data persistence, allowing you to focus on extraction logic. By understanding the dataset API, export options, and best practices, you can build efficient and maintainable web scraping solutions.

The flexibility to work with multiple datasets, export to various formats, and integrate seamlessly with different crawler types makes Crawlee datasets an essential tool for any web scraping project.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
