How do I save scraped data to JSON with Crawlee?
Crawlee provides a built-in Dataset API that makes saving scraped data to JSON straightforward and efficient. A Dataset persists data automatically as you push items and can be exported to JSON, CSV, and other formats, so you rarely need to write file-handling code yourself.
Understanding Crawlee Datasets
A Dataset in Crawlee is a storage mechanism designed specifically for web scraping projects. It stores structured data as records (objects) and automatically saves them to disk or cloud storage. The Dataset API abstracts away the complexity of file handling, allowing you to focus on data extraction.
Key features of Crawlee Datasets (a minimal usage sketch follows this list):
- Automatic persistence: data is written to storage as soon as you push items
- Export helpers: built-in JSON and CSV export; additional formats such as JSONL, XML, and Excel are available when the dataset is stored on the Apify platform
- Scalability: handles large datasets efficiently via pagination and item-by-item iteration
- Append-only records: every push appends a new item; deduplication is left to your own code (see Troubleshooting below)
- Cloud integration: works with local storage out of the box and with the Apify platform for cloud persistence
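To make these features concrete, here is a minimal sketch of the Dataset API on its own, outside of any crawler (the URL and field values are placeholders):
import { Dataset } from 'crawlee';
// Open the default dataset (created on first use) and append one record
const dataset = await Dataset.open();
await dataset.pushData({
  url: 'https://example.com',
  title: 'Example Domain',
  scrapedAt: new Date().toISOString(),
});
// Read everything back; items are returned in insertion order
const { items } = await dataset.getData();
console.log(items);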
Saving Data to JSON in JavaScript/TypeScript
Basic Example with CheerioCrawler
Here's how to scrape data and save it to JSON using Crawlee's CheerioCrawler:
import { CheerioCrawler, Dataset } from 'crawlee';
const crawler = new CheerioCrawler({
async requestHandler({ request, $, enqueueLinks }) {
const title = $('title').text();
const heading = $('h1').first().text();
const description = $('meta[name="description"]').attr('content');
// Push data to the default dataset
await Dataset.pushData({
url: request.url,
title: title,
heading: heading,
description: description,
scrapedAt: new Date().toISOString(),
});
// Enqueue additional links for crawling
await enqueueLinks({
selector: 'a[href]',
limit: 10,
});
},
maxRequestsPerCrawl: 50,
});
// Start the crawler
await crawler.run(['https://example.com']);
// Export data to JSON after crawling is complete
const dataset = await Dataset.open();
const data = await dataset.getData();
console.log(`Scraped ${data.items.length} items`);
Using PuppeteerCrawler for Dynamic Content
For JavaScript-rendered pages, use PuppeteerCrawler with the same Dataset approach:
import { PuppeteerCrawler, Dataset } from 'crawlee';
const crawler = new PuppeteerCrawler({
async requestHandler({ request, page }) {
// Wait for dynamic content to load
await page.waitForSelector('.product-card');
// Extract data from the page
const products = await page.$$eval('.product-card', (elements) => {
return elements.map((el) => ({
name: el.querySelector('.product-name')?.textContent?.trim(),
price: el.querySelector('.product-price')?.textContent?.trim(),
rating: el.querySelector('.product-rating')?.textContent?.trim(),
image: el.querySelector('img')?.getAttribute('src'),
}));
});
// Save each product to the dataset
for (const product of products) {
await Dataset.pushData({
...product,
url: request.url,
category: 'electronics',
});
}
},
});
await crawler.run(['https://example-shop.com/products']);
Exporting Dataset to JSON File
Crawlee provides multiple ways to export your dataset to JSON:
import { Dataset } from 'crawlee';
// Method 1: Export all data to a single JSON file
const dataset = await Dataset.open();
await dataset.exportToJSON('results'); // Writes results.json to the default key-value store (./storage/key_value_stores/default/)
// Method 2: Get the data and process it yourself
const { items } = await dataset.getData();
console.log(`Total items: ${items.length}`);
// Method 3: Iterate large datasets item by item instead of loading everything at once
await dataset.forEach(async (item, index) => {
  console.log(`Processing item ${index}:`, item);
});
// Method 4: Export to CSV with the matching helper
await dataset.exportToCSV('results'); // Creates results.csv in the same key-value store
Working with Named Datasets
For complex projects, you can use named datasets to organize different types of data:
import { Dataset } from 'crawlee';
// Create or open named datasets
const productsDataset = await Dataset.open('products');
const reviewsDataset = await Dataset.open('reviews');
// Push data to specific datasets
await productsDataset.pushData({
productId: '12345',
name: 'Wireless Mouse',
price: 29.99,
});
await reviewsDataset.pushData({
productId: '12345',
author: 'John Doe',
rating: 5,
comment: 'Great product!',
});
// Export each dataset separately
await productsDataset.exportToJSON('products');
await reviewsDataset.exportToJSON('reviews');
Saving Data to JSON in Python
Crawlee for Python follows a similar pattern to the JavaScript version:
Basic Example with BeautifulSoupCrawler
import asyncio
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext

async def main() -> None:
    crawler = BeautifulSoupCrawler(
        max_requests_per_crawl=50,
    )

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        title_tag = context.soup.find('title')
        heading_tag = context.soup.find('h1')
        data = {
            'url': context.request.url,
            'title': title_tag.get_text() if title_tag else '',
            'heading': heading_tag.get_text() if heading_tag else '',
            'paragraphs': len(context.soup.find_all('p')),
        }
        # Push data to the default dataset
        await context.push_data(data)
        # Enqueue additional links
        await context.enqueue_links(selector='a[href]', limit=10)

    # Run the crawler
    await crawler.run(['https://example.com'])

    # Export the default dataset to a JSON file
    await crawler.export_data('results.json')

if __name__ == '__main__':
    asyncio.run(main())
Using PlaywrightCrawler in Python
For handling dynamic content and browser automation, use PlaywrightCrawler:
import asyncio
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext

async def main() -> None:
    crawler = PlaywrightCrawler(
        headless=True,
        max_requests_per_crawl=100,
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        page = context.page
        # Wait for dynamic content to load
        await page.wait_for_selector('.article')
        # Extract data from each article card
        articles = await page.query_selector_all('.article')
        for article in articles:
            title = await article.query_selector('.article-title')
            author = await article.query_selector('.article-author')
            date = await article.query_selector('.article-date')
            data = {
                'title': (await title.inner_text()) if title else '',
                'author': (await author.inner_text()) if author else '',
                'date': (await date.inner_text()) if date else '',
                'url': context.request.url,
            }
            await context.push_data(data)

    await crawler.run(['https://example-news.com'])

    # Export the default dataset to a JSON file
    await crawler.export_data('articles.json')

if __name__ == '__main__':
    asyncio.run(main())
Advanced Dataset Operations
Custom JSON Formatting
You can customize the JSON output format:
import { Dataset } from 'crawlee';
import { writeFileSync } from 'node:fs';

const dataset = await Dataset.open();

// Get data with pagination
const { items } = await dataset.getData({
  offset: 0,
  limit: 100,
  clean: true, // Skip empty items and hidden fields (those starting with '#')
});

// Write a custom, pretty-printed JSON file (2-space indentation)
writeFileSync('custom-output.json', JSON.stringify(items, null, 2));
Data Validation Before Saving
Implement validation to ensure data quality:
import { CheerioCrawler, Dataset } from 'crawlee';
const crawler = new CheerioCrawler({
async requestHandler({ request, $ }) {
const data = {
url: request.url,
title: $('title').text(),
price: $('.price').text(),
rating: $('.rating').text(),
};
// Validate data before saving
if (data.title && data.price) {
// Clean and format data
data.price = parseFloat(data.price.replace(/[^0-9.]/g, ''));
data.rating = parseFloat(data.rating) || 0;
await Dataset.pushData(data);
} else {
console.warn(`Invalid data for ${request.url}`);
}
},
});
Batch Processing for Large Datasets
For handling large-scale crawling projects:
import { Dataset } from 'crawlee';
import { writeFileSync } from 'node:fs';

const dataset = await Dataset.open();

// Process data in batches
const batchSize = 1000;
let offset = 0;
let hasMore = true;

while (hasMore) {
  const { items } = await dataset.getData({
    offset,
    limit: batchSize,
  });

  if (items.length === 0) {
    hasMore = false;
  } else {
    // Process the batch
    console.log(`Processing items ${offset} to ${offset + items.length}`);

    // Export the batch to its own file
    writeFileSync(`batch-${offset}.json`, JSON.stringify(items, null, 2));

    offset += batchSize;
  }
}
Accessing Saved JSON Files
By default, Crawlee saves datasets in the ./storage/datasets/default/ directory, where each pushed item is stored as its own numbered JSON file. Files produced by exportToJSON are written to the key-value store directory instead (./storage/key_value_stores/default/ by default).
Directory Structure
storage/
└── datasets/
├── default/
│ ├── 000000001.json
│ ├── 000000002.json
│ └── ...
└── products/
├── 000000001.json
└── ...
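If you want to read these files outside of Crawlee, plain Node.js file APIs are enough. A minimal sketch, assuming the default local storage layout shown above:
import { readdirSync, readFileSync } from 'node:fs';
import { join } from 'node:path';

// Each file in the dataset directory holds one pushed item
const dir = './storage/datasets/default';
const items = readdirSync(dir)
  .filter((name) => name.endsWith('.json'))
  .map((name) => JSON.parse(readFileSync(join(dir, name), 'utf-8')));

console.log(`Loaded ${items.length} items from disk`);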
Programmatic Access
import { Dataset } from 'crawlee';
// Open existing dataset
const dataset = await Dataset.open('products');
// Get all items
const { items } = await dataset.getData();
// Get specific item count
const info = await dataset.getInfo();
console.log(`Total items: ${info.itemCount}`);
// Delete the dataset and all of its data
await dataset.drop();
Integration with Cloud Storage
For production environments there are two common approaches. Note that the CRAWLEE_STORAGE_DIR environment variable only changes the local directory Crawlee writes to (for example ./my-crawlee-storage); it does not accept cloud URLs such as S3 buckets.
// Option 1: Run on the Apify platform, which persists datasets in the cloud for you
import { Actor } from 'apify';
await Actor.init();
const dataset = await Actor.openDataset();
await dataset.pushData({ /* data */ });
await Actor.exit();
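The second option is to export the dataset yourself and upload the resulting JSON to your cloud provider. A minimal sketch using the AWS SDK v3, where the region, bucket name, and object key are placeholders:
import { Dataset } from 'crawlee';
import { S3Client, PutObjectCommand } from '@aws-sdk/client-s3';

// Collect the scraped items from the default dataset
const dataset = await Dataset.open();
const { items } = await dataset.getData();

// Upload the serialized items as a single JSON object to S3
const s3 = new S3Client({ region: 'us-east-1' });
await s3.send(new PutObjectCommand({
  Bucket: 'my-bucket',
  Key: 'crawlee-exports/results.json',
  Body: JSON.stringify(items, null, 2),
  ContentType: 'application/json',
}));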
Best Practices
- Use structured data: Always save data with consistent schemas
- Add timestamps: Include scraping timestamps for data freshness tracking
- Validate before saving: Implement validation to ensure data quality
- Use named datasets: Organize different data types in separate datasets
- Handle errors gracefully: Wrap Dataset operations in try-catch blocks (see the sketch after this list)
- Export regularly: For long-running crawls, export data periodically
- Clean up: Remove unnecessary fields before saving to reduce storage
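The sketch below combines several of these practices inside a request handler; the selectors and field names are placeholders:
import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
  async requestHandler({ request, $, log }) {
    // Consistent schema, trimmed to the fields we need, with a timestamp
    const item = {
      url: request.url,
      title: $('title').text().trim(),
      scrapedAt: new Date().toISOString(),
    };

    try {
      await Dataset.pushData(item);
    } catch (error) {
      // Log and continue instead of crashing the whole crawl
      log.error(`Failed to save data for ${request.url}: ${error}`);
    }
  },
});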
Troubleshooting Common Issues
Issue: Dataset not persisting
Ensure you're awaiting the pushData call:
await Dataset.pushData(data); // Correct
Dataset.pushData(data); // Wrong - without await, errors are swallowed and the data may not be written before the process exits
Issue: Out of memory with large datasets
Use streaming or batch processing instead of loading all data at once:
await dataset.forEach(async (item) => {
// Process one item at a time
});
Issue: Duplicate data
Datasets are append-only and do not deduplicate records for you, so pushing the same item twice stores it twice. Deduplicate in your own code before pushing, for example by tracking the identifiers you have already saved:
const seenIds = new Set();
// Inside your request handler:
if (!seenIds.has(item.productId)) {
  seenIds.add(item.productId);
  await Dataset.pushData(item);
}
Conclusion
Crawlee's Dataset API provides a powerful and flexible way to save scraped data to JSON format. Whether you're building a simple scraper or a complex multi-page crawling system, the Dataset API handles data persistence efficiently while giving you full control over data structure and export formats.
The built-in features like automatic persistence, multiple export formats, and cloud integration make Crawlee an excellent choice for production web scraping projects. By following the examples and best practices outlined above, you can build robust data extraction pipelines that scale with your needs.