How do I save scraped data to CSV files using Ruby?

When web scraping with Ruby, saving your extracted data to CSV (Comma-Separated Values) files is a common requirement. CSV format is ideal for structured data export, making it easy to analyze scraped information in spreadsheet applications or import into databases. Ruby's built-in CSV library provides robust functionality for writing scraped data to CSV files with minimal setup.

Using Ruby's Built-in CSV Library

Ruby includes a powerful CSV library in its standard library that handles CSV file creation, writing, and formatting automatically. Here's how to use it for saving scraped data:

Basic CSV Writing

require 'csv'

# Basic example - writing scraped data to CSV
scraped_data = [
  ['Name', 'Price', 'URL'],
  ['Product A', '$29.99', 'https://example.com/product-a'],
  ['Product B', '$39.99', 'https://example.com/product-b'],
  ['Product C', '$19.99', 'https://example.com/product-c']
]

# Write to CSV file
CSV.open('scraped_products.csv', 'w') do |csv|
  scraped_data.each do |row|
    csv << row
  end
end
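
If the scraper runs repeatedly, you may prefer to append new rows to an existing file instead of overwriting it. A minimal sketch, assuming the header row is already in place and that new_rows (a made-up variable here) matches the existing column order:

require 'csv'

# Hypothetical rows collected by a later scraping run
new_rows = [
  ['Product D', '$24.99', 'https://example.com/product-d'],
  ['Product E', '$49.99', 'https://example.com/product-e']
]

# 'a' opens the file in append mode, so existing rows are preserved
CSV.open('scraped_products.csv', 'a') do |csv|
  new_rows.each { |row| csv << row }
end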

Web Scraping Example with Nokogiri and CSV Export

Here's a practical example combining web scraping with CSV export:

require 'nokogiri'
require 'open-uri'
require 'csv'

class ProductScraper
  def initialize(url)
    @url = url
    @products = []
  end

  def scrape_products
    # Parse the HTML document
    doc = Nokogiri::HTML(URI.open(@url))

    # Extract product information
    doc.css('.product-item').each do |product|
      name = product.css('.product-name').text.strip
      price = product.css('.product-price').text.strip
      description = product.css('.product-description').text.strip
      link = product.at_css('a')&.[]('href')  # first link's href, or nil if none

      @products << {
        name: name,
        price: price,
        description: description,
        link: link,
        scraped_at: Time.now
      }
    end

    @products
  end

  def save_to_csv(filename = 'products.csv')
    CSV.open(filename, 'w', write_headers: true, headers: csv_headers) do |csv|
      @products.each do |product|
        csv << product.values
      end
    end

    puts "Saved #{@products.length} products to #{filename}"
  end

  private

  def csv_headers
    ['Name', 'Price', 'Description', 'Link', 'Scraped At']
  end
end

# Usage
scraper = ProductScraper.new('https://example-store.com/products')
scraper.scrape_products
scraper.save_to_csv('scraped_products.csv')
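
One caveat in save_to_csv above: csv << product.values assumes every hash keeps its keys in the same order as the header row. If that is not guaranteed, a small variant (a sketch, not part of the original class) can map each hash onto the headers explicitly:

  # Sketch of an order-safe save_to_csv for the ProductScraper class above
  def save_to_csv(filename = 'products.csv')
    header_keys = [:name, :price, :description, :link, :scraped_at]

    CSV.open(filename, 'w', write_headers: true, headers: csv_headers) do |csv|
      @products.each do |product|
        # values_at returns values in header order, with nil for any missing key
        csv << product.values_at(*header_keys)
      end
    end
  end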

Advanced CSV Writing Techniques

Handling Different Data Types

When working with diverse scraped data, you may need to handle different data types and formats:

require 'csv'
require 'json'

class DataExporter
  def self.export_to_csv(data, filename, options = {})
    # Default CSV options
    csv_options = {
      write_headers: true,
      headers: data.first&.keys || [],
      force_quotes: false
    }.merge(options)

    CSV.open(filename, 'w', **csv_options) do |csv|
      data.each do |row|
        # Handle different data types
        formatted_row = row.map do |key, value|
          case value
          when Hash, Array
            value.to_json  # Convert complex objects to JSON strings
          when Time, DateTime
            value.strftime('%Y-%m-%d %H:%M:%S')  # Format timestamps
          when String
            value.encode('UTF-8', invalid: :replace, undef: :replace)  # Handle encoding
          else
            value
          end
        end

        csv << formatted_row
      end
    end
  end
end

# Example with mixed data types
scraped_data = [
  {
    title: 'Article Title',
    published_date: Time.now,
    tags: ['ruby', 'web-scraping', 'csv'],
    metadata: { author: 'John Doe', category: 'Tech' },
    content_length: 1500
  }
]

DataExporter.export_to_csv(scraped_data, 'articles.csv')
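
Because the tags and metadata columns are stored as JSON strings, reading the file back means parsing those cells again. A quick round-trip sketch, assuming the articles.csv produced above:

require 'csv'
require 'json'

# Read the export back and restore the JSON-encoded columns
CSV.foreach('articles.csv', headers: true) do |row|
  tags = JSON.parse(row['tags'])          # back to an Array
  metadata = JSON.parse(row['metadata'])  # back to a Hash
  puts "#{row['title']} by #{metadata['author']} (#{tags.join(', ')})"
end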

Incremental CSV Writing

For large scraping operations, you might want to write data incrementally to avoid memory issues:

require 'csv'
require 'nokogiri'
require 'net/http'

class IncrementalScraper
  def initialize(filename)
    @filename = filename
    @csv_file = nil
    setup_csv_file
  end

  def scrape_and_save(urls)
    urls.each_with_index do |url, index|
      begin
        product_data = scrape_single_product(url)
        @csv_file << product_data.values if product_data

        # Progress indication
        puts "Processed #{index + 1}/#{urls.length}: #{url}"

        # Respect rate limits
        sleep(1)
      rescue => e
        puts "Error scraping #{url}: #{e.message}"
      end
    end
  ensure
    close_csv_file
  end

  private

  def setup_csv_file
    @csv_file = CSV.open(@filename, 'w', write_headers: true, 
                        headers: ['Name', 'Price', 'Description', 'URL', 'Scraped At'])
  end

  def scrape_single_product(url)
    response = Net::HTTP.get_response(URI(url))
    return nil unless response.code == '200'

    doc = Nokogiri::HTML(response.body)

    {
      name: doc.css('h1').text.strip,
      price: doc.css('.price').text.strip,
      description: doc.css('.description').text.strip[0..200],
      url: url,
      scraped_at: Time.now.strftime('%Y-%m-%d %H:%M:%S')
    }
  end

  def close_csv_file
    @csv_file&.close
  end
end

# Usage for large-scale scraping
urls = (1..1000).map { |i| "https://example.com/product/#{i}" }
scraper = IncrementalScraper.new('large_dataset.csv')
scraper.scrape_and_save(urls)
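
Because the CSV handle stays open for the entire run, a crash partway through can lose rows still sitting in Ruby's IO buffer. One way to reduce that risk is to flush the handle every so often (CSV delegates flush to the underlying file object). A sketch of scrape_and_save with that change:

  def scrape_and_save(urls)
    urls.each_with_index do |url, index|
      begin
        product_data = scrape_single_product(url)
        @csv_file << product_data.values if product_data

        # Push buffered rows to disk every 25 products, so a crash
        # loses at most the last few rows
        @csv_file.flush if (index + 1) % 25 == 0

        puts "Processed #{index + 1}/#{urls.length}: #{url}"
        sleep(1)
      rescue => e
        puts "Error scraping #{url}: #{e.message}"
      end
    end
  ensure
    close_csv_file
  end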

CSV Writing with Error Handling

Robust error handling is crucial when saving scraped data:

require 'csv'

class RobustCSVWriter
  def self.write_with_validation(data, filename, required_fields = [])
    validate_data(data, required_fields)

    begin
      CSV.open(filename, 'w', write_headers: true, headers: data.first.keys) do |csv|
        data.each_with_index do |row, index|
          begin
            # Validate each row
            validate_row(row, required_fields, index)
            csv << row.values
          rescue => e
            puts "Warning: Skipping row #{index + 1}: #{e.message}"
          end
        end
      end

      puts "Successfully wrote #{data.length} rows to #{filename}"
    rescue IOError => e
      puts "File I/O error: #{e.message}"
    rescue CSV::MalformedCSVError => e
      puts "CSV format error: #{e.message}"
    rescue => e
      puts "Unexpected error: #{e.message}"
    end
  end

  def self.validate_data(data, required_fields)
    raise ArgumentError, "Data cannot be empty" if data.empty?
    raise ArgumentError, "Data must be an array of hashes" unless data.all? { |row| row.is_a?(Hash) }

    missing_fields = required_fields - data.first.keys
    raise ArgumentError, "Missing required fields: #{missing_fields.join(', ')}" unless missing_fields.empty?
  end

  def self.validate_row(row, required_fields, index)
    required_fields.each do |field|
      if row[field].nil? || row[field].to_s.strip.empty?
        raise ArgumentError, "Required field '#{field}' is missing or empty"
      end
    end
  end

  # A bare private call has no effect on class methods, so hide the helpers explicitly
  private_class_method :validate_data, :validate_row
end

# Usage with validation
scraped_products = [
  { name: 'Product A', price: '$29.99', category: 'Electronics' },
  { name: 'Product B', price: '$39.99', category: 'Books' }
]

RobustCSVWriter.write_with_validation(
  scraped_products,
  'validated_products.csv',
  [:name, :price]  # Required fields
)

Performance Optimization

For large datasets, batching the writes and reporting progress keeps long-running exports efficient and easy to monitor:

require 'csv'

class HighPerformanceCSVWriter
  def self.write_optimized(data, filename, batch_size = 1000)
    start_time = Time.now

    CSV.open(filename, 'w', write_headers: true, headers: data.first.keys, 
             col_sep: ',', quote_char: '"', force_quotes: false) do |csv|

      data.each_slice(batch_size).with_index do |batch, batch_index|
        batch.each { |row| csv << row.values }

        # Progress reporting
        processed = (batch_index + 1) * batch_size
        puts "Processed #{[processed, data.length].min}/#{data.length} rows"
      end
    end

    duration = Time.now - start_time
    puts "CSV export completed in #{duration.round(2)} seconds"
    puts "Average: #{(data.length / duration).round(0)} rows/second"
  end
end
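
A usage sketch with synthetic rows (the data here is generated purely for illustration):

# Usage with generated sample data
rows = (1..50_000).map do |i|
  { id: i, name: "Product #{i}", price: format('$%.2f', rand * 100) }
end

HighPerformanceCSVWriter.write_optimized(rows, 'bulk_products.csv', 5000)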

Best Practices for CSV Export

1. Data Sanitization

Always clean your scraped data before writing to CSV:

def sanitize_for_csv(value)
  return '' if value.nil?

  # Clean up encoding first: gsub raises on invalid byte sequences, and
  # encoding a string to its own encoding is a no-op, so String#scrub is
  # the reliable way to repair malformed input before any regex work
  cleaned = value.to_s
    .scrub(' ')             # Replace invalid byte sequences
    .gsub(/[\r\n\t]/, ' ')  # Replace newlines and tabs with spaces
    .gsub(/\s+/, ' ')       # Collapse multiple spaces
    .strip                  # Remove leading/trailing whitespace

  # Truncate very long fields to an even 1,000 characters
  cleaned.length > 1000 ? cleaned[0..996] + '...' : cleaned
end
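
In practice, each cell is run through the helper just before the row is appended. For example, with a hypothetical hash of raw scraped values:

require 'csv'

# Hypothetical raw values straight from the scraper
scraped_row = {
  name: "  Product\nA\t",
  description: "Long   description with\r\nstray   whitespace",
  price: nil
}

CSV.open('clean_products.csv', 'w') do |csv|
  csv << scraped_row.keys                                           # header row
  csv << scraped_row.values.map { |value| sanitize_for_csv(value) } # cleaned cells
end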

2. Memory-Efficient Processing

For large datasets, process data in chunks:

def process_large_dataset(data_source, output_file)
  CSV.open(output_file, 'w') do |csv|
    csv << ['Column1', 'Column2', 'Column3']  # Headers

    # find_each assumes an ActiveRecord-style relation that loads records in batches
    data_source.find_each(batch_size: 1000) do |record|
      csv << [record.field1, record.field2, record.field3]
    end
  end
end
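
If the data is a plain (or lazily generated) collection rather than a database-backed relation, the same streaming effect comes from writing rows as an enumerator yields them. A sketch with a made-up row generator:

require 'csv'

def stream_to_csv(output_file, headers, rows)
  CSV.open(output_file, 'w', write_headers: true, headers: headers) do |csv|
    # Rows are written one at a time, so the full dataset never sits in memory
    rows.each { |row| csv << row }
  end
end

# Lazy enumerator standing in for whatever produces your scraped records
lazy_rows = (1..100_000).lazy.map { |i| ["Record #{i}", i * 2, Time.now.to_s] }
stream_to_csv('streamed.csv', ['Name', 'Value', 'Generated At'], lazy_rows)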

3. File Management

Implement proper file handling and backup strategies:

def safe_csv_write(data, filename)
  temp_file = "#{filename}.tmp"
  backup_file = "#{filename}.backup"

  begin
    # Write to temporary file first
    CSV.open(temp_file, 'w', write_headers: true, headers: data.first.keys) do |csv|
      data.each { |row| csv << row.values }
    end

    # Create backup if original exists
    File.rename(filename, backup_file) if File.exist?(filename)

    # Move temp file to final location
    File.rename(temp_file, filename)

    # Clean up backup
    File.delete(backup_file) if File.exist?(backup_file)

    puts "Successfully saved data to #{filename}"
  rescue => e
    # Restore backup if something went wrong
    File.rename(backup_file, filename) if File.exist?(backup_file)
    File.delete(temp_file) if File.exist?(temp_file)
    raise e
  end
end
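
Usage follows the same pattern as the earlier examples; the records below are placeholders:

# Usage with placeholder records
records = [
  { name: 'Product A', price: '$29.99' },
  { name: 'Product B', price: '$39.99' }
]

safe_csv_write(records, 'products.csv')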

Conclusion

Ruby's CSV library provides excellent support for saving scraped data to CSV files. Whether you're working with simple data structures or complex scraped information, the techniques shown above will help you export your data efficiently and reliably. Remember to always validate your data, handle errors gracefully, and consider performance implications when working with large datasets.

For more advanced web scraping scenarios, consider combining these CSV export techniques with headless browser automation tools or API-based scraping solutions to build comprehensive data extraction and export pipelines.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
