How do I save scraped data to CSV files using Ruby?
When web scraping with Ruby, saving extracted data to CSV (Comma-Separated Values) files is a common requirement. CSV is ideal for structured data export, making it easy to analyze scraped information in a spreadsheet or import it into a database. Ruby's built-in CSV library provides robust functionality for writing scraped data to CSV files with minimal setup.
Using Ruby's Built-in CSV Library
Ruby's standard library includes a capable CSV class that handles file creation, quoting, and formatting automatically. Here's how to use it for saving scraped data:
Basic CSV Writing
require 'csv'

# Basic example - writing scraped data to CSV
scraped_data = [
  ['Name', 'Price', 'URL'],
  ['Product A', '$29.99', 'https://example.com/product-a'],
  ['Product B', '$39.99', 'https://example.com/product-b'],
  ['Product C', '$19.99', 'https://example.com/product-c']
]

# Write to CSV file
CSV.open('scraped_products.csv', 'w') do |csv|
  scraped_data.each do |row|
    csv << row
  end
end
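If your scraper runs repeatedly and you want to keep adding rows to the same file, you can open it in append mode instead. A minimal sketch, assuming the header row was already written on the first run so only data rows are appended:

require 'csv'

# Hypothetical batch of newly scraped rows
new_rows = [
  ['Product D', '$24.99', 'https://example.com/product-d'],
  ['Product E', '$14.99', 'https://example.com/product-e']
]

# 'a' opens the file for appending, so existing rows (and the header) are kept
CSV.open('scraped_products.csv', 'a') do |csv|
  new_rows.each { |row| csv << row }
end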
Web Scraping Example with Nokogiri and CSV Export
Here's a practical example combining web scraping with CSV export:
require 'nokogiri'
require 'open-uri'
require 'csv'

class ProductScraper
  def initialize(url)
    @url = url
    @products = []
  end

  def scrape_products
    # Parse the HTML document
    doc = Nokogiri::HTML(URI.open(@url))

    # Extract product information
    doc.css('.product-item').each do |product|
      name = product.css('.product-name').text.strip
      price = product.css('.product-price').text.strip
      description = product.css('.product-description').text.strip

      # at_css returns the first matching node (or nil), so this is nil-safe
      link_element = product.at_css('a')
      link = link_element && link_element['href']

      @products << {
        name: name,
        price: price,
        description: description,
        link: link,
        scraped_at: Time.now
      }
    end

    @products
  end

  def save_to_csv(filename = 'products.csv')
    CSV.open(filename, 'w', write_headers: true, headers: csv_headers) do |csv|
      @products.each do |product|
        csv << product.values
      end
    end
    puts "Saved #{@products.length} products to #{filename}"
  end

  private

  def csv_headers
    ['Name', 'Price', 'Description', 'Link', 'Scraped At']
  end
end

# Usage
scraper = ProductScraper.new('https://example-store.com/products')
scraper.scrape_products
scraper.save_to_csv('scraped_products.csv')
Advanced CSV Writing Techniques
Handling Different Data Types
When working with diverse scraped data, you may need to handle different data types and formats:
require 'csv'
require 'json'
require 'date' # needed so the DateTime check below is defined

class DataExporter
  def self.export_to_csv(data, filename, options = {})
    # Default CSV options
    csv_options = {
      write_headers: true,
      headers: data.first&.keys || [],
      force_quotes: false
    }.merge(options)

    CSV.open(filename, 'w', **csv_options) do |csv|
      data.each do |row|
        # Handle different data types
        formatted_row = row.map do |_key, value|
          case value
          when Hash, Array
            value.to_json # Convert complex objects to JSON strings
          when Time, DateTime
            value.strftime('%Y-%m-%d %H:%M:%S') # Format timestamps
          when String
            value.encode('UTF-8', invalid: :replace, undef: :replace) # Handle encoding
          else
            value
          end
        end
        csv << formatted_row
      end
    end
  end
end

# Example with mixed data types
scraped_data = [
  {
    title: 'Article Title',
    published_date: Time.now,
    tags: ['ruby', 'web-scraping', 'csv'],
    metadata: { author: 'John Doe', category: 'Tech' },
    content_length: 1500
  }
]

DataExporter.export_to_csv(scraped_data, 'articles.csv')
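Because the tags and metadata columns are stored as JSON strings, reading the file back only needs a JSON.parse step for those columns. A quick sketch, assuming the articles.csv file produced above:

require 'csv'
require 'json'

# Read the exported file back, treating the first row as headers
CSV.read('articles.csv', headers: true).each do |row|
  tags = JSON.parse(row['tags'])         # => ["ruby", "web-scraping", "csv"]
  metadata = JSON.parse(row['metadata']) # => {"author" => "John Doe", ...}
  puts "#{row['title']}: #{tags.join(', ')} (by #{metadata['author']})"
end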
Incremental CSV Writing
For large scraping operations, you might want to write data incrementally to avoid memory issues:
require 'csv'
require 'nokogiri'
require 'net/http'

class IncrementalScraper
  def initialize(filename)
    @filename = filename
    @csv_file = nil
    setup_csv_file
  end

  def scrape_and_save(urls)
    urls.each_with_index do |url, index|
      begin
        product_data = scrape_single_product(url)
        @csv_file << product_data.values if product_data

        # Progress indication
        puts "Processed #{index + 1}/#{urls.length}: #{url}"

        # Respect rate limits
        sleep(1)
      rescue => e
        puts "Error scraping #{url}: #{e.message}"
      end
    end
  ensure
    close_csv_file
  end

  private

  def setup_csv_file
    @csv_file = CSV.open(@filename, 'w', write_headers: true,
                         headers: ['Name', 'Price', 'Description', 'URL', 'Scraped At'])
  end

  def scrape_single_product(url)
    response = Net::HTTP.get_response(URI(url))
    return nil unless response.code == '200'

    doc = Nokogiri::HTML(response.body)
    {
      name: doc.css('h1').text.strip,
      price: doc.css('.price').text.strip,
      description: doc.css('.description').text.strip[0..200],
      url: url,
      scraped_at: Time.now.strftime('%Y-%m-%d %H:%M:%S')
    }
  end

  def close_csv_file
    @csv_file&.close
  end
end

# Usage for large-scale scraping
urls = (1..1000).map { |i| "https://example.com/product/#{i}" }
scraper = IncrementalScraper.new('large_dataset.csv')
scraper.scrape_and_save(urls)
CSV Writing with Error Handling
Robust error handling is crucial when saving scraped data:
require 'csv'

class RobustCSVWriter
  def self.write_with_validation(data, filename, required_fields = [])
    validate_data(data, required_fields)

    written = 0
    begin
      CSV.open(filename, 'w', write_headers: true, headers: data.first.keys) do |csv|
        data.each_with_index do |row, index|
          begin
            # Validate each row before writing it
            validate_row(row, required_fields)
            csv << row.values
            written += 1
          rescue => e
            puts "Warning: Skipping row #{index + 1}: #{e.message}"
          end
        end
      end
      puts "Successfully wrote #{written} of #{data.length} rows to #{filename}"
    rescue IOError => e
      puts "File I/O error: #{e.message}"
    rescue CSV::MalformedCSVError => e
      puts "CSV format error: #{e.message}"
    rescue => e
      puts "Unexpected error: #{e.message}"
    end
  end

  # `private` has no effect on class methods, so mark the helpers
  # with private_class_method instead
  private_class_method def self.validate_data(data, required_fields)
    raise ArgumentError, "Data cannot be empty" if data.empty?
    raise ArgumentError, "Data must be an array of hashes" unless data.all? { |row| row.is_a?(Hash) }

    missing_fields = required_fields - data.first.keys
    raise ArgumentError, "Missing required fields: #{missing_fields.join(', ')}" unless missing_fields.empty?
  end

  private_class_method def self.validate_row(row, required_fields)
    required_fields.each do |field|
      if row[field].nil? || row[field].to_s.strip.empty?
        raise ArgumentError, "Required field '#{field}' is missing or empty"
      end
    end
  end
end

# Usage with validation
scraped_products = [
  { name: 'Product A', price: '$29.99', category: 'Electronics' },
  { name: 'Product B', price: '$39.99', category: 'Books' }
]

RobustCSVWriter.write_with_validation(
  scraped_products,
  'validated_products.csv',
  [:name, :price] # Required fields
)
Performance Optimization
For high-performance CSV writing with large datasets:
require 'csv'

class HighPerformanceCSVWriter
  def self.write_optimized(data, filename, batch_size = 1000)
    start_time = Time.now

    CSV.open(filename, 'w', write_headers: true, headers: data.first.keys,
             col_sep: ',', quote_char: '"', force_quotes: false) do |csv|
      data.each_slice(batch_size).with_index do |batch, batch_index|
        batch.each { |row| csv << row.values }

        # Progress reporting
        processed = (batch_index + 1) * batch_size
        puts "Processed #{[processed, data.length].min}/#{data.length} rows"
      end
    end

    duration = Time.now - start_time
    puts "CSV export completed in #{duration.round(2)} seconds"
    puts "Average: #{(data.length / duration).round(0)} rows/second"
  end
end
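A quick usage sketch, assuming your scraper has already produced an array of hashes with identical keys (the dataset below is made up for illustration):

# Hypothetical dataset: 50,000 already-scraped product hashes
large_dataset = (1..50_000).map do |i|
  { name: "Product #{i}", price: format('$%.2f', i * 0.1), url: "https://example.com/product/#{i}" }
end

HighPerformanceCSVWriter.write_optimized(large_dataset, 'bulk_products.csv', 5000)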
Best Practices for CSV Export
1. Data Sanitization
Always clean your scraped data before writing to CSV:
def sanitize_for_csv(value)
  return '' if value.nil?

  # Handle encoding issues first so the regex substitutions below
  # don't raise on invalid byte sequences
  cleaned = value.to_s
                 .encode('UTF-8', invalid: :replace, undef: :replace) # Transcode to UTF-8, replacing unconvertible bytes
                 .scrub                 # Replace invalid bytes in strings already tagged as UTF-8
                 .gsub(/[\r\n\t]/, ' ') # Replace newlines and tabs with spaces
                 .gsub(/\s+/, ' ')      # Collapse multiple spaces
                 .strip                 # Remove leading/trailing whitespace

  # Truncate very long fields to at most 1,000 characters
  cleaned.length > 1000 ? cleaned[0, 997] + '...' : cleaned
end
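For example, you might run every field through this helper just before appending the row. A short sketch, assuming hash-based rows like the ones used earlier:

require 'csv'

rows = [
  { name: "Product A\n(new)", description: "  Multi-line\tdescription  ", price: '$29.99' }
]

CSV.open('clean_products.csv', 'w', write_headers: true, headers: rows.first.keys) do |csv|
  rows.each do |row|
    csv << row.values.map { |value| sanitize_for_csv(value) }
  end
end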
2. Memory-Efficient Processing
For large datasets, process data in chunks:
def process_large_dataset(data_source, output_file)
  CSV.open(output_file, 'w') do |csv|
    csv << ['Column1', 'Column2', 'Column3'] # Headers

    # find_each is ActiveRecord's batched iterator; it loads records
    # in groups of batch_size instead of all at once
    data_source.find_each(batch_size: 1000) do |record|
      csv << [record.field1, record.field2, record.field3]
    end
  end
end
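If your records come from a plain Ruby array or an Enumerator rather than an ActiveRecord relation, each_slice gives you the same chunked iteration. A minimal sketch with made-up field names:

require 'csv'

def export_in_chunks(records, output_file, chunk_size = 1000)
  CSV.open(output_file, 'w') do |csv|
    csv << ['Name', 'Price', 'URL'] # Headers

    # Process the records in fixed-size chunks
    records.each_slice(chunk_size) do |chunk|
      chunk.each { |record| csv << [record[:name], record[:price], record[:url]] }
    end
  end
end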
3. File Management
Implement proper file handling and backup strategies:
def safe_csv_write(data, filename)
  temp_file = "#{filename}.tmp"
  backup_file = "#{filename}.backup"

  begin
    # Write to temporary file first
    CSV.open(temp_file, 'w', write_headers: true, headers: data.first.keys) do |csv|
      data.each { |row| csv << row.values }
    end

    # Create backup if original exists
    File.rename(filename, backup_file) if File.exist?(filename)

    # Move temp file to final location
    File.rename(temp_file, filename)

    # Clean up backup
    File.delete(backup_file) if File.exist?(backup_file)

    puts "Successfully saved data to #{filename}"
  rescue => e
    # Restore backup if something went wrong
    File.rename(backup_file, filename) if File.exist?(backup_file)
    File.delete(temp_file) if File.exist?(temp_file)
    raise e
  end
end
Conclusion
Ruby's CSV library provides excellent support for saving scraped data to CSV files. Whether you're working with simple data structures or complex scraped information, the techniques shown above will help you export your data efficiently and reliably. Remember to always validate your data, handle errors gracefully, and consider performance implications when working with large datasets.
For more advanced web scraping scenarios, consider combining these CSV export techniques with headless browser automation tools or API-based scraping solutions to build comprehensive data extraction and export pipelines.