# Can I Export Crawlee Data to CSV or Excel?
Yes, Crawlee provides built-in support for exporting scraped data to structured formats such as CSV and JSON, and converting the results to Excel takes only a few extra lines of code. The framework makes it straightforward to save your scraped data in formats that are easy to analyze, share, and integrate with other tools.
In this guide, we'll explore various methods to export your Crawlee data to CSV and Excel formats, covering both JavaScript and Python implementations.
## Using Crawlee's Built-in Dataset Export
Crawlee's `Dataset` class provides an `exportToCSV()` method that makes exporting data incredibly simple. Here's how to use it:
### JavaScript/TypeScript Example
```javascript
import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $ }) {
        const title = $('title').text();
        const heading = $('h1').text();
        const description = $('meta[name="description"]').attr('content');

        // Push data to the default dataset
        await Dataset.pushData({
            url: request.url,
            title,
            heading,
            description,
            scrapedAt: new Date().toISOString(),
        });
    },
});

await crawler.run(['https://example.com']);

// Export to CSV
const dataset = await Dataset.open();
await dataset.exportToCSV('output');
```
This saves a file named `output.csv`, containing all the scraped data, to the default key-value store in your storage directory (`./storage/key_value_stores/default` when running locally).
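To double-check the export, you can read the file straight from disk; the path below assumes Crawlee's default local storage configuration:

```javascript
import { readFileSync } from 'fs';

// With default settings, CSV exports land in the default key-value store on disk
const csv = readFileSync('./storage/key_value_stores/default/output.csv', 'utf8');
console.log(csv.split('\n').slice(0, 3).join('\n')); // header plus the first two rows
```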
### Python Example
```python
import asyncio
from datetime import datetime

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext

async def main():
    crawler = PlaywrightCrawler()

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        page = context.page

        # Extract data
        title = await page.title()
        heading = await page.locator('h1').text_content()

        # Push to dataset
        await context.push_data({
            'url': context.request.url,
            'title': title,
            'heading': heading,
            'scraped_at': datetime.now().isoformat(),
        })

    await crawler.run(['https://example.com'])

    # Export the default dataset to CSV
    await crawler.export_data('output.csv')

if __name__ == '__main__':
    asyncio.run(main())
```
## Exporting to Excel Format
While Crawlee doesn't have built-in Excel export functionality, you can easily convert your scraped data to Excel format using popular libraries.
### JavaScript Excel Export
```javascript
import { CheerioCrawler, Dataset } from 'crawlee';
import * as XLSX from 'xlsx';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $ }) {
        await Dataset.pushData({
            url: request.url,
            title: $('title').text(),
            price: $('.price').text(),
            availability: $('.stock-status').text(),
        });
    },
});

await crawler.run(['https://example.com/products']);

// Get all data from the dataset
const dataset = await Dataset.open();
const data = await dataset.getData();

// Convert to an Excel worksheet
const worksheet = XLSX.utils.json_to_sheet(data.items);
const workbook = XLSX.utils.book_new();
XLSX.utils.book_append_sheet(workbook, worksheet, 'Scraped Data');

// Write to file
XLSX.writeFile(workbook, 'output.xlsx');
```
To run this, first install the required package:

```bash
npm install xlsx
```
### Python Excel Export with Pandas
```python
import asyncio

import pandas as pd
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
from crawlee.storages import Dataset

async def main():
    crawler = BeautifulSoupCrawler()

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        soup = context.soup
        title = soup.find('title')
        price = soup.select_one('.price')

        await context.push_data({
            'url': context.request.url,
            'title': title.text if title else '',
            'price': price.text if price else '',
        })

    await crawler.run(['https://example.com'])

    # Export to Excel
    dataset = await Dataset.open()
    data = await dataset.get_data()
    df = pd.DataFrame(data.items)
    df.to_excel('output.xlsx', index=False, engine='openpyxl')

if __name__ == '__main__':
    asyncio.run(main())
```
Install the required packages:

```bash
pip install pandas openpyxl
```
## Advanced CSV Export Options
You can customize your CSV export with additional options. The example below uses the `csv-writer` package (`npm install csv-writer`) to control the column headers:
### Custom CSV Formatting
```javascript
import { CheerioCrawler, Dataset } from 'crawlee';
import { createObjectCsvWriter } from 'csv-writer';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $ }) {
        await Dataset.pushData({
            url: request.url,
            title: $('title').text(),
            timestamp: Date.now(),
        });
    },
});

await crawler.run(['https://example.com']);

// Custom CSV writer with specific headers and formatting
const dataset = await Dataset.open();
const data = await dataset.getData();

const csvWriter = createObjectCsvWriter({
    path: 'output.csv',
    header: [
        { id: 'url', title: 'URL' },
        { id: 'title', title: 'Page Title' },
        { id: 'timestamp', title: 'Scraped At' },
    ],
    append: false,
});

await csvWriter.writeRecords(data.items);
```
## Handling Large Datasets
When working with large amounts of scraped data, it's important to export in batches so you never load the whole dataset into memory. The example below streams rows through the `csv-stringify` package (`npm install csv-stringify`):
```javascript
import { CheerioCrawler, Dataset } from 'crawlee';
import { createWriteStream } from 'fs';
import { finished } from 'stream/promises';
import { stringify } from 'csv-stringify';

const crawler = new CheerioCrawler({
    maxRequestsPerCrawl: 10000,
    async requestHandler({ request, $ }) {
        await Dataset.pushData({
            url: request.url,
            title: $('title').text(),
            content: $('article').text().substring(0, 500),
        });
    },
});

await crawler.run(['https://example.com']);

// Stream the large dataset to CSV
const dataset = await Dataset.open();
const writeStream = createWriteStream('output.csv');
const stringifier = stringify({ header: true });
stringifier.pipe(writeStream);

// Process in batches so only one page of items is in memory at a time
let offset = 0;
const limit = 1000;

while (true) {
    const batch = await dataset.getData({ offset, limit });
    if (batch.items.length === 0) break;
    batch.items.forEach((item) => stringifier.write(item));
    offset += limit;
}

stringifier.end();
// Wait until everything has been flushed to disk
await finished(writeStream);
```
## Export Formats Comparison
Crawlee's `Dataset` class ships with CSV and JSON export methods; additional formats such as JSONL, XML, and RSS are available when your dataset is stored on the Apify platform, whose dataset API can serve items in those formats:

| Format | How to Export | Best For |
|--------|---------------|----------|
| CSV | `exportToCSV()` | Tabular data, spreadsheet analysis, database imports |
| JSON | `exportToJSON()` | Hierarchical data, API integration, web applications |
| JSONL | Apify platform dataset API | Large datasets, streaming processing, log analysis |
| RSS | Apify platform dataset API | Content feeds, syndication |
| XML | Apify platform dataset API | Legacy systems, specific enterprise requirements |
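For example, the JSON export mirrors the CSV call shown earlier; the key name `output` below is just an illustrative choice:

```javascript
import { Dataset } from 'crawlee';

const dataset = await Dataset.open();
// Saves output.json to the default key-value store, next to any CSV export
await dataset.exportToJSON('output');
```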
## Exporting Named Datasets
If you're using multiple datasets in your crawler, you can export specific ones:
```javascript
import { CheerioCrawler, Dataset } from 'crawlee';

// Open the named datasets up front
const productsDataset = await Dataset.open('products');
const pagesDataset = await Dataset.open('pages');

const crawler = new CheerioCrawler({
    async requestHandler({ request, $ }) {
        // Push to different datasets based on content type
        if (request.url.includes('/products/')) {
            await productsDataset.pushData({
                type: 'product',
                name: $('.product-name').text(),
                price: $('.price').text(),
            });
        } else {
            await pagesDataset.pushData({
                type: 'page',
                title: $('title').text(),
            });
        }
    },
});

await crawler.run(['https://example.com']);

// Export each dataset separately
await productsDataset.exportToCSV('products');
await pagesDataset.exportToCSV('pages');
```
## Best Practices for Data Export

### 1. Data Validation Before Export
Always validate your data before exporting to ensure quality:
```javascript
import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $ }) {
        const title = $('title').text().trim();
        const price = $('.price').text().trim();

        // Validate data before pushing
        if (title && price) {
            await Dataset.pushData({
                url: request.url,
                title,
                price: parseFloat(price.replace(/[^0-9.]/g, '')),
                isValid: true,
            });
        } else {
            // Log invalid data
            console.log(`Invalid data at ${request.url}`);
        }
    },
});
```
### 2. Handle Special Characters
Excel in particular can misread UTF-8 CSV files unless they begin with a byte-order mark (BOM). One way to handle this is to write the CSV yourself with `csv-stringify`, which can prepend the BOM for you:

```javascript
import { writeFile } from 'fs/promises';
import { Dataset } from 'crawlee';
import { stringify } from 'csv-stringify/sync';

const dataset = await Dataset.open();
const { items } = await dataset.getData();

// bom: true prepends a UTF-8 byte-order mark so Excel detects the encoding
const csv = stringify(items, { header: true, bom: true });
await writeFile('output.csv', csv);
```
### 3. Add Timestamps
Include timestamps to track when data was scraped:
```javascript
await Dataset.pushData({
    url: request.url,
    title: $('title').text(),
    scrapedAt: new Date().toISOString(),
    unixTimestamp: Math.floor(Date.now() / 1000),
});
```
## Integrating with Data Analysis Tools
Once exported, your CSV or Excel files can be easily integrated with various data analysis tools:
- **Excel/Google Sheets**: Open CSV/Excel files directly for manual analysis
- **Python Pandas**: Load CSV files for advanced data analysis
- **SQL Databases**: Import CSV files into databases for querying (see the sketch after this list)
- **Power BI/Tableau**: Connect to exported files for visualization
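As a sketch of the SQL route, the snippet below loads an exported CSV into a local SQLite database. The `better-sqlite3` and `csv-parse` packages, the `scraped.db` file, and the `pages` table are illustrative choices, not part of Crawlee:

```javascript
import { readFileSync } from 'fs';
import { parse } from 'csv-parse/sync';
import Database from 'better-sqlite3';

// Parse the exported CSV into an array of objects keyed by header name
const records = parse(readFileSync('output.csv'), { columns: true });

const db = new Database('scraped.db');
db.exec('CREATE TABLE IF NOT EXISTS pages (url TEXT, title TEXT)');

// Insert each row, then query it back with plain SQL
const insert = db.prepare('INSERT INTO pages (url, title) VALUES (@url, @title)');
for (const record of records) {
    insert.run({ url: record.url, title: record.title });
}
console.log(db.prepare('SELECT COUNT(*) AS n FROM pages').get());
```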
### Loading Crawlee CSV in Pandas
```python
import pandas as pd

# Load the exported CSV
df = pd.read_csv('output.csv')

# Analyze the data (column names depend on what you scraped)
print(df.describe())
print(df.groupby('category').count())

# Clean and transform
df['price'] = df['price'].str.replace('$', '', regex=False).astype(float)
df['date'] = pd.to_datetime(df['scraped_at'])

# Export to Excel with formatting
with pd.ExcelWriter('analyzed_output.xlsx', engine='openpyxl') as writer:
    df.to_excel(writer, sheet_name='Data', index=False)
```
## Troubleshooting Common Export Issues

### CSV File Not Created
Ensure the crawler has completed before exporting:
```javascript
await crawler.run(['https://example.com']);

// run() resolves once all requests finish; teardown() releases remaining resources
await crawler.teardown();

// Now export
const dataset = await Dataset.open();
await dataset.exportToCSV('output');
```
### Empty Export Files
Check if data is being pushed to the dataset correctly:
```javascript
const dataset = await Dataset.open();
const info = await dataset.getInfo();
console.log(`Dataset contains ${info.itemCount} items`);

if (info.itemCount > 0) {
    await dataset.exportToCSV('output');
} else {
    console.log('No data to export');
}
```
### Memory Issues with Large Exports
For very large datasets, combine the batch reads from the Handling Large Datasets section above with streaming writes so the full dataset never sits in memory, as sketched below.
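A minimal sketch, assuming the default dataset and a batch size of 1,000; Node's `pipeline()` handles backpressure between the reader, the CSV stringifier, and the file:

```javascript
import { Dataset } from 'crawlee';
import { createWriteStream } from 'fs';
import { pipeline } from 'stream/promises';
import { stringify } from 'csv-stringify';

const dataset = await Dataset.open();

// Async generator that yields dataset items one batch at a time
async function* readItems() {
    let offset = 0;
    const limit = 1000;
    while (true) {
        const { items } = await dataset.getData({ offset, limit });
        if (items.length === 0) break;
        yield* items;
        offset += limit;
    }
}

// pipeline() propagates backpressure and errors across all three stages
await pipeline(readItems(), stringify({ header: true }), createWriteStream('output.csv'));
```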
## Conclusion
Crawlee makes it simple to export scraped data to CSV and Excel formats, whether you're working with JavaScript or Python. The built-in `exportToCSV()` method handles most use cases, while third-party libraries like `xlsx` and `pandas` provide additional flexibility for Excel exports and advanced data manipulation.
By following the best practices outlined in this guide—including data validation, proper encoding, and batch processing for large datasets—you can ensure your exported data is clean, properly formatted, and ready for analysis.
Whether you're building a web scraping pipeline with browser automation or processing static HTML content, Crawlee's flexible export options make it easy to get your data into the format you need.