What Data Export Formats Does Crawlee Support?
Crawlee provides comprehensive support for exporting scraped data in multiple formats, making it easy to integrate your web scraping results with various downstream tools and applications. The framework includes built-in export methods for JSON, CSV, Excel (XLSX), HTML, XML, and RSS formats, all accessible through the Dataset API.
Overview of Crawlee Export Formats
Crawlee's dataset system supports six primary export formats:
- JSON: Standard JSON format for programmatic processing
- CSV: Comma-separated values for spreadsheet applications
- XLSX: Microsoft Excel format with full spreadsheet support
- HTML: Human-readable HTML tables
- XML: Structured XML documents
- RSS: RSS feed format for syndication
Each format can be exported with a single method call, and Crawlee handles the conversion automatically based on your dataset structure.
Exporting Data to JSON
JSON is the most commonly used format for web scraping data. Crawlee provides multiple ways to work with JSON exports:
Basic JSON Export
import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, log }) {
        const data = {
            url: request.loadedUrl,
            title: $('title').text(),
            heading: $('h1').first().text(),
            timestamp: new Date().toISOString(),
        };
        await crawler.pushData(data);
        log.info(`Scraped: ${data.title}`);
    },
});

await crawler.run(['https://example.com', 'https://example.com/about']);

// Export to JSON file
const dataset = await Dataset.open();
await dataset.exportToJSON('output');
// This writes output.json into the default key-value store:
// ./storage/key_value_stores/default/output.json
Retrieving JSON Data Programmatically
You can also retrieve the data as plain JavaScript objects without writing a file:
import { Dataset } from 'crawlee';

const dataset = await Dataset.open();
const { items } = await dataset.getData();

// items is an array of JavaScript objects
console.log(`Total records: ${items.length}`);
console.log(JSON.stringify(items, null, 2));

// Process data further
const processed = items.map((item) => ({
    ...item,
    titleLength: item.title?.length || 0,
    domain: new URL(item.url).hostname,
}));
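For datasets that are too large to load into memory in one call, the Dataset API also provides an iterator-style forEach() method. The sketch below assumes the default dataset from the crawl above and simply logs each record:

import { Dataset } from 'crawlee';

const dataset = await Dataset.open();

// Visit records one at a time instead of materializing the whole array
await dataset.forEach(async (item, index) => {
    console.log(`#${index}: ${item.url}`);
});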
Python JSON Export
When using Crawlee with Python for web scraping, JSON export works similarly:
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
from crawlee.storages import Dataset


async def main():
    crawler = BeautifulSoupCrawler()

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        soup = context.soup
        data = {
            'url': context.request.url,
            'title': soup.find('title').get_text() if soup.find('title') else '',
            'heading': soup.find('h1').get_text() if soup.find('h1') else '',
        }
        await context.push_data(data)

    await crawler.run(['https://example.com'])

    # Export to JSON
    dataset = await Dataset.open()
    await dataset.export_to_json('output')


if __name__ == '__main__':
    import asyncio

    asyncio.run(main())
Exporting Data to CSV
CSV format is ideal for importing data into spreadsheet applications such as Excel or Google Sheets, or for analysis with tools like pandas:
import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, log }) {
        // Extract structured product data
        const products = [];
        $('.product').each((index, element) => {
            products.push({
                productName: $(element).find('.name').text(),
                price: $(element).find('.price').text(),
                category: $(element).find('.category').text(),
                rating: $(element).find('.rating').text(),
                inStock: $(element).find('.in-stock').length > 0,
            });
        });
        // Push all products at once so the call can be awaited
        await crawler.pushData(products);
    },
});

await crawler.run(['https://example-shop.com/products']);

// Export to CSV
const dataset = await Dataset.open();
await dataset.exportToCSV('products');
// This writes products.csv into the default key-value store:
// ./storage/key_value_stores/default/products.csv
The CSV export automatically:
- Converts object keys to column headers
- Flattens nested objects (up to a certain depth)
- Handles boolean values and arrays
- Escapes special characters
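As an illustration of the flattening behaviour, a record containing a nested object is typically spread across several columns. The exact column naming depends on your Crawlee version, so treat the comment below as a sketch and inspect the generated file:

// Inside a request handler: push a record with a nested "price" object
await Dataset.pushData({
    productName: 'Widget',
    price: { amount: 19.99, currency: 'EUR' },
});

// After exportToCSV, the nested keys may appear as separate columns, e.g.:
// productName, price.amount, price.currency
// Widget, 19.99, EUR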
CSV Export with Custom Options
Depending on your Crawlee version, you can customize CSV export behavior with an options object:
import { Dataset } from 'crawlee';

const dataset = await Dataset.open();

// Export with custom delimiter and encoding
await dataset.exportToCSV('output', {
    delimiter: ';', // Use semicolon instead of comma
    encoding: 'utf-8', // Specify encoding
});
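If these options are not available in your Crawlee version, a simple fallback is to build the CSV yourself from getData(). The sketch below assumes flat records and uses naive quoting; a dedicated CSV library is a better choice for production use:

import { Dataset } from 'crawlee';
import { writeFile } from 'node:fs/promises';

const dataset = await Dataset.open();
const { items } = await dataset.getData();

// Derive the header from the first record and quote every value
const columns = Object.keys(items[0] ?? {});
const escape = (value) => `"${String(value ?? '').replaceAll('"', '""')}"`;
const lines = [
    columns.join(';'),
    ...items.map((item) => columns.map((col) => escape(item[col])).join(';')),
];

await writeFile('output.csv', lines.join('\n'), 'utf-8');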
Exporting Data to Excel (XLSX)
Excel format is useful for sharing data with non-technical stakeholders or for advanced spreadsheet analysis:
import { PlaywrightCrawler, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ request, page, log }) {
        // Wait for dynamic content
        await page.waitForSelector('.data-table');

        // Extract tabular data
        const tableData = await page.$$eval('table.data-table tr', (rows) =>
            rows.map((row) => {
                const cells = Array.from(row.querySelectorAll('td, th'));
                return {
                    column1: cells[0]?.textContent?.trim(),
                    column2: cells[1]?.textContent?.trim(),
                    column3: cells[2]?.textContent?.trim(),
                };
            }),
        );

        await crawler.pushData(tableData);
        log.info(`Extracted ${tableData.length} rows`);
    },
});

await crawler.run(['https://example.com/data-table']);

// Export to Excel
const dataset = await Dataset.open();
await dataset.exportToXLSX('report');
// This creates: ./storage/datasets/default/report.xlsx
The XLSX export:
- Preserves data types (numbers, dates, booleans)
- Supports multiple sheets (for complex data structures)
- Includes proper column formatting
- Can be opened directly in Microsoft Excel or Google Sheets
Exporting Data to HTML
HTML export creates a formatted table view of your data, perfect for quick visual inspection or embedding in web pages:
import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, log }) {
        await crawler.pushData({
            url: request.loadedUrl,
            title: $('title').text(),
            metaDescription: $('meta[name="description"]').attr('content'),
            h1Count: $('h1').length,
            h2Count: $('h2').length,
        });
    },
});

await crawler.run(['https://example.com']);

// Export to HTML
const dataset = await Dataset.open();
await dataset.exportToHTML('report');
// This creates: ./storage/datasets/default/report.html
The generated HTML file includes:
- A formatted table with headers
- Proper HTML structure
- Basic CSS styling
- Sortable columns (in some versions)
Exporting Data to XML
XML format is useful for integrating with enterprise systems or APIs that expect XML input:
import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, log }) {
        const article = {
            url: request.loadedUrl,
            title: $('article h1').text(),
            author: $('article .author').text(),
            publishDate: $('article .date').text(),
            content: $('article .content').text(),
            tags: $('article .tag').map((i, el) => $(el).text()).get(),
        };
        await crawler.pushData(article);
    },
});

await crawler.run(['https://example-blog.com/article']);

// Export to XML
const dataset = await Dataset.open();
await dataset.exportToXML('articles');
// This creates: ./storage/datasets/default/articles.xml
The XML export:
- Converts object structure to XML elements
- Handles nested objects and arrays
- Uses proper XML escaping
- Includes XML declaration and root element
Exporting Data to RSS
RSS export is specifically designed for creating news feeds or content syndication:
import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, log }) {
        // Extract blog post data in RSS-compatible format
        const post = {
            title: $('article h1').text(),
            link: request.loadedUrl,
            description: $('article .excerpt').text(),
            pubDate: new Date($('article .date').text()).toISOString(),
            author: $('article .author').text(),
        };
        await crawler.pushData(post);
    },
});

await crawler.run(['https://example-blog.com/posts']);

// Export to RSS feed
const dataset = await Dataset.open();
await dataset.exportToRSS('feed', {
    title: 'Example Blog Feed',
    description: 'Latest posts from Example Blog',
    link: 'https://example-blog.com',
});
// This creates: ./storage/datasets/default/feed.rss
Working with Multiple Export Formats
You can export the same dataset to multiple formats for different use cases:
import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, log }) {
        await crawler.pushData({
            url: request.loadedUrl,
            title: $('title').text(),
            wordCount: $('body').text().split(/\s+/).length,
            scrapedAt: new Date().toISOString(),
        });
    },
});

await crawler.run(['https://example.com']);

// Export to multiple formats
const dataset = await Dataset.open();
await Promise.all([
    dataset.exportToJSON('data'), // For programmatic access
    dataset.exportToCSV('data'), // For spreadsheet analysis
    dataset.exportToXLSX('report'), // For stakeholder reporting
    dataset.exportToHTML('preview'), // For quick browser preview
]);

console.log('Data exported to all formats');
Custom Storage Locations
You can specify custom storage directories for your exports:
import { CheerioCrawler, Configuration, Dataset } from 'crawlee';

const config = new Configuration({
    storageDir: './my-scraping-results',
});

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, log }) {
        await crawler.pushData({
            url: request.loadedUrl,
            data: $('body').text(),
        });
    },
}, config);

await crawler.run(['https://example.com']);

const dataset = await Dataset.open(undefined, { config });
await dataset.exportToJSON('results');
// Creates: ./my-scraping-results/key_value_stores/default/results.json
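If you prefer not to pass a Configuration object around, the storage directory can usually also be set through the CRAWLEE_STORAGE_DIR environment variable. The sketch below assumes the variable is set before Crawlee reads its configuration, for example at the very top of your entry file or when launching the process:

// Equivalent to launching the process with:
//   CRAWLEE_STORAGE_DIR=./my-scraping-results node main.js
process.env.CRAWLEE_STORAGE_DIR = './my-scraping-results';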
Best Practices for Data Export
1. Structure Data Consistently
Ensure all records have the same structure for clean exports:
const crawler = new CheerioCrawler({
    async requestHandler({ request, $, log }) {
        // Always include all fields, even if empty
        const data = {
            url: request.loadedUrl || '',
            title: $('title').text() || 'N/A',
            description: $('meta[name="description"]').attr('content') || 'N/A',
            imageCount: $('img').length,
            scrapedAt: new Date().toISOString(),
        };
        await crawler.pushData(data);
    },
});
2. Choose the Right Format
Select export formats based on your use case:
- JSON: Use for APIs, databases, or programmatic processing
- CSV: Use for simple tabular data and spreadsheet analysis
- XLSX: Use for complex reports with formatting requirements
- HTML: Use for quick visual inspection or embedding in web pages
- XML: Use for enterprise integrations or systems requiring XML
- RSS: Use for content syndication or feed creation
3. Handle Large Datasets
For large datasets, consider exporting in chunks or using streaming:
import { Dataset } from 'crawlee';

const dataset = await Dataset.open();
const info = await dataset.getInfo();
const totalCount = info?.itemCount ?? 0;

if (totalCount > 100000) {
    console.log('Large dataset detected, consider processing in chunks');

    // Process in batches
    for (let offset = 0; offset < totalCount; offset += 10000) {
        const { items } = await dataset.getData({ offset, limit: 10000 });
        // Process this batch
        console.log(`Processed ${offset} to ${offset + items.length} items`);
    }
}
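To turn those batches into actual files, one option is to write each batch to its own JSON file with Node's fs module. This is a minimal sketch; the ./exports directory and the chunk naming are arbitrary choices:

import { Dataset } from 'crawlee';
import { mkdir, writeFile } from 'node:fs/promises';

const dataset = await Dataset.open();
const info = await dataset.getInfo();
const totalCount = info?.itemCount ?? 0;
const batchSize = 10000;

await mkdir('./exports', { recursive: true });

for (let offset = 0; offset < totalCount; offset += batchSize) {
    const { items } = await dataset.getData({ offset, limit: batchSize });
    // Each batch becomes its own file, e.g. ./exports/chunk-0.json
    await writeFile(`./exports/chunk-${offset / batchSize}.json`, JSON.stringify(items, null, 2));
}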
4. Validate Before Export
Validate your data structure before exporting:
import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, log }) {
        const title = $('title').text().trim();
        const price = $('.price').text().trim();

        // Validate data before storing
        if (title && price) {
            await crawler.pushData({
                url: request.loadedUrl,
                title,
                price,
                valid: true,
            });
        } else {
            log.warning(`Invalid data on ${request.loadedUrl}`);
            await crawler.pushData({
                url: request.loadedUrl,
                title: title || 'MISSING',
                price: price || 'MISSING',
                valid: false,
            });
        }
    },
});

await crawler.run(['https://example-shop.com']);

// Filter valid records before export
const dataset = await Dataset.open();
const { items } = await dataset.getData();
const validItems = items.filter((item) => item.valid);
console.log(`Valid records: ${validItems.length}/${items.length}`);
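To make sure only the valid records end up in the exported files, one approach is to push the filtered items into a separate, named dataset and export that instead. The dataset name below is arbitrary:

import { Dataset } from 'crawlee';

const dataset = await Dataset.open();
const { items } = await dataset.getData();
const validItems = items.filter((item) => item.valid);

// Copy only the valid records into a named dataset and export it
const validDataset = await Dataset.open('valid-products');
await validDataset.pushData(validItems);
await validDataset.exportToJSON('valid-products');
await validDataset.exportToCSV('valid-products');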
Integration with Browser Automation
Export functionality works seamlessly with all Crawlee crawler types, including browser-based crawlers:
import { PlaywrightCrawler, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ request, page, log }) {
        // Wait for dynamic content to load
        await page.waitForSelector('.dynamic-content');

        // Extract data after JavaScript execution
        const data = await page.evaluate(() => ({
            title: document.title,
            dynamicData: document.querySelector('.dynamic-content')?.textContent,
            timestamp: Date.now(),
        }));

        await crawler.pushData(data);
    },
});

await crawler.run(['https://example.com']);

// Export works the same regardless of crawler type
const dataset = await Dataset.open();
await dataset.exportToJSON('scraped-data');
await dataset.exportToCSV('scraped-data');
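Recent Crawlee versions also expose a convenience helper on the crawler instance itself that writes the default dataset straight to a file path, inferring the format from the extension. This is version-dependent, so check the API reference for your release before relying on it:

// Writes the default dataset directly to files; the format is inferred
// from the extension (.json or .csv)
await crawler.exportData('./scraped-data.json');
await crawler.exportData('./scraped-data.csv');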
Conclusion
Crawlee's comprehensive export functionality makes it easy to store and export scraped data in multiple formats without additional libraries or complex conversion code. Whether you need JSON for API integration, CSV for data analysis, Excel for reporting, or HTML for quick previews, Crawlee handles the conversion automatically.
The unified Dataset API works consistently across all crawler types, from lightweight HTTP crawlers like CheerioCrawler to full browser automation with PlaywrightCrawler or PuppeteerCrawler, making it simple to build flexible web scraping solutions that can output data in the format that best suits your downstream processing needs.
By following best practices like maintaining consistent data structures, validating before export, and choosing appropriate formats for your use case, you can create robust web scraping pipelines that seamlessly integrate with your data analysis and business intelligence workflows.