How do I save scraped data to JSON with Crawlee?
Crawlee provides a built-in Dataset API that makes saving scraped data to JSON straightforward and efficient. A Dataset persists data automatically as you push items and can be exported to JSON, CSV, and other formats, so you rarely need to write file-handling code yourself.
Understanding Crawlee Datasets
A Dataset in Crawlee is a storage mechanism designed specifically for web scraping projects. It stores structured data as records (objects) and automatically saves them to disk or cloud storage. The Dataset API abstracts away the complexity of file handling, allowing you to focus on data extraction.
Key features of Crawlee Datasets (a minimal usage sketch follows this list):
- Automatic persistence: data is written to storage as soon as you push items
- Export helpers: built-in JSON and CSV export; additional formats such as JSONL, XML, and Excel are available when the dataset is stored on the Apify platform
- Scalability: handles large datasets efficiently via pagination and item-by-item iteration
- Append-only records: every push appends a new item; deduplication is left to your own code (see Troubleshooting below)
- Cloud integration: works with local storage out of the box and with the Apify platform for cloud persistence
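To make these features concrete, here is a minimal sketch of the Dataset API on its own, outside of any crawler (the URL and field values are placeholders):
import { Dataset } from 'crawlee';
// Open the default dataset (created on first use) and append one record
const dataset = await Dataset.open();
await dataset.pushData({
  url: 'https://example.com',
  title: 'Example Domain',
  scrapedAt: new Date().toISOString(),
});
// Read everything back; items are returned in insertion order
const { items } = await dataset.getData();
console.log(items);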
Saving Data to JSON in JavaScript/TypeScript
Basic Example with CheerioCrawler
Here's how to scrape data and save it to JSON using Crawlee's CheerioCrawler:
import { CheerioCrawler, Dataset } from 'crawlee';
const crawler = new CheerioCrawler({
async requestHandler({ request, $, enqueueLinks }) {
const title = $('title').text();
const heading = $('h1').first().text();
const description = $('meta[name="description"]').attr('content');
// Push data to the default dataset
await Dataset.pushData({
url: request.url,
title: title,
heading: heading,
description: description,
scrapedAt: new Date().toISOString(),
});
// Enqueue additional links for crawling
await enqueueLinks({
selector: 'a[href]',
limit: 10,
});
},
maxRequestsPerCrawl: 50,
});
// Start the crawler
await crawler.run(['https://example.com']);
// Export data to JSON after crawling is complete
const dataset = await Dataset.open();
const data = await dataset.getData();
console.log(`Scraped ${data.items.length} items`);
Using PuppeteerCrawler for Dynamic Content
For JavaScript-rendered pages, use PuppeteerCrawler with the same Dataset approach:
import { PuppeteerCrawler, Dataset } from 'crawlee';
const crawler = new PuppeteerCrawler({
async requestHandler({ request, page }) {
// Wait for dynamic content to load
await page.waitForSelector('.product-card');
// Extract data from the page
const products = await page.$$eval('.product-card', (elements) => {
return elements.map((el) => ({
name: el.querySelector('.product-name')?.textContent?.trim(),
price: el.querySelector('.product-price')?.textContent?.trim(),
rating: el.querySelector('.product-rating')?.textContent?.trim(),
image: el.querySelector('img')?.getAttribute('src'),
}));
});
// Save each product to the dataset
for (const product of products) {
await Dataset.pushData({
...product,
url: request.url,
category: 'electronics',
});
}
},
});
await crawler.run(['https://example-shop.com/products']);
Exporting Dataset to JSON File
Crawlee provides multiple ways to export your dataset to JSON:
import { Dataset } from 'crawlee';
// Method 1: Export all data to a single JSON file
const dataset = await Dataset.open();
await dataset.exportToJSON('results'); // Writes results.json to the default key-value store (./storage/key_value_stores/default/)
// Method 2: Get the data and process it yourself
const { items } = await dataset.getData();
console.log(`Total items: ${items.length}`);
// Method 3: Iterate large datasets item by item instead of loading everything at once
await dataset.forEach(async (item, index) => {
  console.log(`Processing item ${index}:`, item);
});
// Method 4: Export to CSV with the matching helper
await dataset.exportToCSV('results'); // Creates results.csv in the same key-value store
Working with Named Datasets
For complex projects, you can use named datasets to organize different types of data:
import { Dataset } from 'crawlee';
// Create or open named datasets
const productsDataset = await Dataset.open('products');
const reviewsDataset = await Dataset.open('reviews');
// Push data to specific datasets
await productsDataset.pushData({
productId: '12345',
name: 'Wireless Mouse',
price: 29.99,
});
await reviewsDataset.pushData({
productId: '12345',
author: 'John Doe',
rating: 5,
comment: 'Great product!',
});
// Export each dataset separately
await productsDataset.exportToJSON('products');
await reviewsDataset.exportToJSON('reviews');
Saving Data to JSON in Python
Crawlee for Python follows a similar pattern to the JavaScript version:
Basic Example with BeautifulSoupCrawler
import asyncio
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext

async def main() -> None:
    crawler = BeautifulSoupCrawler(
        max_requests_per_crawl=50,
    )

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        title_tag = context.soup.find('title')
        heading_tag = context.soup.find('h1')
        data = {
            'url': context.request.url,
            'title': title_tag.get_text() if title_tag else '',
            'heading': heading_tag.get_text() if heading_tag else '',
            'paragraphs': len(context.soup.find_all('p')),
        }
        # Push data to the default dataset
        await context.push_data(data)
        # Enqueue additional links
        await context.enqueue_links(selector='a[href]', limit=10)

    # Run the crawler
    await crawler.run(['https://example.com'])

    # Export the default dataset to a JSON file
    await crawler.export_data('results.json')

if __name__ == '__main__':
    asyncio.run(main())
Using PlaywrightCrawler in Python
For handling dynamic content and browser automation, use PlaywrightCrawler:
import asyncio
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext

async def main() -> None:
    crawler = PlaywrightCrawler(
        headless=True,
        max_requests_per_crawl=100,
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        page = context.page
        # Wait for dynamic content to load
        await page.wait_for_selector('.article')
        # Extract data from each article card
        articles = await page.query_selector_all('.article')
        for article in articles:
            title = await article.query_selector('.article-title')
            author = await article.query_selector('.article-author')
            date = await article.query_selector('.article-date')
            data = {
                'title': (await title.inner_text()) if title else '',
                'author': (await author.inner_text()) if author else '',
                'date': (await date.inner_text()) if date else '',
                'url': context.request.url,
            }
            await context.push_data(data)

    await crawler.run(['https://example-news.com'])

    # Export the default dataset to a JSON file
    await crawler.export_data('articles.json')

if __name__ == '__main__':
    asyncio.run(main())
Advanced Dataset Operations
Custom JSON Formatting
You can customize the JSON output format:
import { Dataset } from 'crawlee';
import { writeFileSync } from 'node:fs';

const dataset = await Dataset.open();

// Get data with pagination
const { items } = await dataset.getData({
  offset: 0,
  limit: 100,
  clean: true, // Skip empty items and hidden fields (those starting with '#')
});

// Write a custom, pretty-printed JSON file (2-space indentation)
writeFileSync('custom-output.json', JSON.stringify(items, null, 2));
Data Validation Before Saving
Implement validation to ensure data quality:
import { CheerioCrawler, Dataset } from 'crawlee';
const crawler = new CheerioCrawler({
async requestHandler({ request, $ }) {
const data = {
url: request.url,
title: $('title').text(),
price: $('.price').text(),
rating: $('.rating').text(),
};
// Validate data before saving
if (data.title && data.price) {
// Clean and format data
data.price = parseFloat(data.price.replace(/[^0-9.]/g, ''));
data.rating = parseFloat(data.rating) || 0;
await Dataset.pushData(data);
} else {
console.warn(`Invalid data for ${request.url}`);
}
},
});
Batch Processing for Large Datasets
For handling large-scale crawling projects:
import { Dataset } from 'crawlee';
import { writeFileSync } from 'node:fs';

const dataset = await Dataset.open();

// Process data in batches
const batchSize = 1000;
let offset = 0;
let hasMore = true;

while (hasMore) {
  const { items } = await dataset.getData({
    offset,
    limit: batchSize,
  });

  if (items.length === 0) {
    hasMore = false;
  } else {
    // Process the batch
    console.log(`Processing items ${offset} to ${offset + items.length}`);

    // Export the batch to its own file
    writeFileSync(`batch-${offset}.json`, JSON.stringify(items, null, 2));

    offset += batchSize;
  }
}
Accessing Saved JSON Files
By default, Crawlee saves datasets in the ./storage/datasets/default/ directory, where each pushed item is stored as its own numbered JSON file. Files produced by exportToJSON are written to the key-value store directory instead (./storage/key_value_stores/default/ by default).
Directory Structure
storage/
└── datasets/
├── default/
│ ├── 000000001.json
│ ├── 000000002.json
│ └── ...
└── products/
├── 000000001.json
└── ...
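If you want to read these files outside of Crawlee, plain Node.js file APIs are enough. A minimal sketch, assuming the default local storage layout shown above:
import { readdirSync, readFileSync } from 'node:fs';
import { join } from 'node:path';

// Each file in the dataset directory holds one pushed item
const dir = './storage/datasets/default';
const items = readdirSync(dir)
  .filter((name) => name.endsWith('.json'))
  .map((name) => JSON.parse(readFileSync(join(dir, name), 'utf-8')));

console.log(`Loaded ${items.length} items from disk`);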
Programmatic Access
import { Dataset } from 'crawlee';
// Open existing dataset
const dataset = await Dataset.open('products');
// Get all items
const { items } = await dataset.getData();
// Get specific item count
const info = await dataset.getInfo();
console.log(`Total items: ${info.itemCount}`);
// Delete the dataset and all of its data
await dataset.drop();
Integration with Cloud Storage
For production environments there are two common approaches. Note that the CRAWLEE_STORAGE_DIR environment variable only changes the local directory Crawlee writes to (for example ./my-crawlee-storage); it does not accept cloud URLs such as S3 buckets.
// Option 1: Run on the Apify platform, which persists datasets in the cloud for you
import { Actor } from 'apify';
await Actor.init();
const dataset = await Actor.openDataset();
await dataset.pushData({ /* data */ });
await Actor.exit();
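The second option is to export the dataset yourself and upload the resulting JSON to your cloud provider. A minimal sketch using the AWS SDK v3, where the region, bucket name, and object key are placeholders:
import { Dataset } from 'crawlee';
import { S3Client, PutObjectCommand } from '@aws-sdk/client-s3';

// Collect the scraped items from the default dataset
const dataset = await Dataset.open();
const { items } = await dataset.getData();

// Upload the serialized items as a single JSON object to S3
const s3 = new S3Client({ region: 'us-east-1' });
await s3.send(new PutObjectCommand({
  Bucket: 'my-bucket',
  Key: 'crawlee-exports/results.json',
  Body: JSON.stringify(items, null, 2),
  ContentType: 'application/json',
}));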
Best Practices
- Use structured data: Always save data with consistent schemas
- Add timestamps: Include scraping timestamps for data freshness tracking
- Validate before saving: Implement validation to ensure data quality
- Use named datasets: Organize different data types in separate datasets
- Handle errors gracefully: Wrap Dataset operations in try-catch blocks (see the sketch after this list)
- Export regularly: For long-running crawls, export data periodically
- Clean up: Remove unnecessary fields before saving to reduce storage
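The sketch below combines several of these practices inside a request handler; the selectors and field names are placeholders:
import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
  async requestHandler({ request, $, log }) {
    // Consistent schema, trimmed to the fields we need, with a timestamp
    const item = {
      url: request.url,
      title: $('title').text().trim(),
      scrapedAt: new Date().toISOString(),
    };

    try {
      await Dataset.pushData(item);
    } catch (error) {
      // Log and continue instead of crashing the whole crawl
      log.error(`Failed to save data for ${request.url}: ${error}`);
    }
  },
});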
Troubleshooting Common Issues
Issue: Dataset not persisting
Ensure you're awaiting the pushData call:
await Dataset.pushData(data); // Correct
Dataset.pushData(data); // Wrong - without await, errors are swallowed and the data may not be written before the process exits
Issue: Out of memory with large datasets
Use streaming or batch processing instead of loading all data at once:
await dataset.forEach(async (item) => {
// Process one item at a time
});
Issue: Duplicate data
Datasets are append-only and do not deduplicate records for you, so pushing the same item twice stores it twice. Deduplicate in your own code before pushing, for example by tracking the identifiers you have already saved:
const seenIds = new Set();
// Inside your request handler:
if (!seenIds.has(item.productId)) {
  seenIds.add(item.productId);
  await Dataset.pushData(item);
}
Conclusion
Crawlee's Dataset API provides a powerful and flexible way to save scraped data to JSON format. Whether you're building a simple scraper or a complex multi-page crawling system, the Dataset API handles data persistence efficiently while giving you full control over data structure and export formats.
The built-in features like automatic persistence, multiple export formats, and cloud integration make Crawlee an excellent choice for production web scraping projects. By following the examples and best practices outlined above, you can build robust data extraction pipelines that scale with your needs.