How do I convert scraped data to JSON format using Ruby?

Converting scraped data to JSON format in Ruby is a fundamental skill for web scraping projects. JSON (JavaScript Object Notation) provides a lightweight, human-readable format that's perfect for storing and transmitting scraped data. Ruby's built-in JSON library makes this conversion straightforward, whether you're working with simple data structures or complex nested objects.

Understanding JSON Conversion in Ruby

Ruby provides excellent support for JSON through its built-in json library. The conversion process typically involves organizing your scraped data into Ruby hashes and arrays, then using the JSON.generate or to_json methods to create properly formatted JSON output.

Basic JSON Conversion

Here's a simple example of converting scraped data to JSON:

require 'json'

# Sample scraped data structure
scraped_data = {
  title: "Example Article",
  author: "John Doe",
  published_date: "2024-01-15",
  content: "This is the article content...",
  tags: ["ruby", "web-scraping", "json"]
}

# Convert to JSON
json_output = JSON.generate(scraped_data)
puts json_output

# Alternative using to_json method
json_output = scraped_data.to_json
puts json_output
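
Writing the resulting JSON to a file and reading it back is a common next step. A minimal sketch, assuming an output file named scraped_data.json:

# Persist the JSON string
File.write('scraped_data.json', JSON.pretty_generate(scraped_data))

# Read it back and parse into a Ruby hash (symbolize_names: true gives symbol keys)
loaded_data = JSON.parse(File.read('scraped_data.json'), symbolize_names: true)
puts loaded_data[:title] # => "Example Article"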

Scraping and Converting Real Web Data

Let's create a more comprehensive example that scrapes actual web content and converts it to JSON:

require 'nokogiri'
require 'net/http'
require 'json'
require 'uri'
require 'time' # needed for Time#iso8601

class WebScraper
  def initialize(url)
    @url = url
    @data = {}
  end

  def scrape_and_convert
    # Fetch the webpage
    uri = URI(@url)
    response = Net::HTTP.get_response(uri)

    if response.code == '200'
      doc = Nokogiri::HTML(response.body)

      # Extract data
      @data = {
        url: @url,
        title: extract_title(doc),
        meta_description: extract_meta_description(doc),
        headings: extract_headings(doc),
        links: extract_links(doc),
        images: extract_images(doc),
        scraped_at: Time.now.iso8601
      }

      # Convert to JSON
      JSON.pretty_generate(@data)
    else
      { error: "Failed to fetch page", status_code: response.code }.to_json
    end
  end

  private

  def extract_title(doc)
    title_element = doc.at_css('title')
    title_element ? title_element.text.strip : nil
  end

  def extract_meta_description(doc)
    meta_desc = doc.at_css('meta[name="description"]')
    meta_desc ? meta_desc['content'] : nil
  end

  def extract_headings(doc)
    headings = {}
    (1..6).each do |level|
      headings["h#{level}"] = doc.css("h#{level}").map(&:text).map(&:strip)
    end
    headings
  end

  def extract_links(doc)
    doc.css('a[href]').map do |link|
      {
        text: link.text.strip,
        href: link['href'],
        title: link['title']
      }
    end
  end

  def extract_images(doc)
    doc.css('img[src]').map do |img|
      {
        src: img['src'],
        alt: img['alt'],
        title: img['title']
      }
    end
  end
end

# Usage
scraper = WebScraper.new('https://example.com')
json_result = scraper.scrape_and_convert
puts json_result

Handling Complex Data Structures

When dealing with nested or complex data structures, you might need custom serialization methods:

class ProductScraper
  def initialize
    @products = []
  end

  def scrape_products(doc)
    doc.css('.product').each do |product_element|
      product_data = {
        id: extract_product_id(product_element),
        name: extract_product_name(product_element),
        price: extract_price(product_element),
        availability: extract_availability(product_element),
        reviews: extract_reviews(product_element),
        specifications: extract_specifications(product_element)
      }

      @products << product_data
    end
  end

  def to_json_with_metadata
    output = {
      metadata: {
        total_products: @products.length,
        scraped_at: Time.now.iso8601,
        version: "1.0"
      },
      products: @products
    }

    JSON.pretty_generate(output)
  end

  private

  def extract_reviews(product_element)
    reviews = []
    product_element.css('.review').each do |review|
      reviews << {
        rating: review.css('.rating').text.to_i,
        comment: review.css('.comment').text.strip,
        author: review.css('.author').text.strip,
        date: review.css('.date').text.strip
      }
    end
    reviews
  end

  def extract_specifications(product_element)
    specs = {}
    product_element.css('.spec-item').each do |spec|
      key = spec.css('.spec-name').text.strip.downcase.gsub(/\s+/, '_')
      value = spec.css('.spec-value').text.strip
      specs[key] = value
    end
    specs
  end
end
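
The class above omits four extractor methods (extract_product_id, extract_product_name, extract_price, extract_availability). The usage sketch below stubs them with simple CSS lookups so the example runs end to end; the selectors and the sample markup are assumptions, not part of any particular site:

require 'json'
require 'time'
require 'nokogiri'

class ProductScraper
  private

  def extract_product_id(element)
    element['data-id']
  end

  def extract_product_name(element)
    element.at_css('.name')&.text&.strip
  end

  def extract_price(element)
    element.at_css('.price')&.text&.strip
  end

  def extract_availability(element)
    element.at_css('.availability')&.text&.strip
  end
end

html = '<div class="product" data-id="42"><span class="name">Widget</span><span class="price">$9.99</span></div>'
doc = Nokogiri::HTML(html)

scraper = ProductScraper.new
scraper.scrape_products(doc)
puts scraper.to_json_with_metadata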

Custom JSON Serialization with Classes

For more control over JSON output, you can create custom classes with to_json methods:

class ScrapedArticle
  attr_accessor :title, :author, :content, :publish_date, :tags

  def initialize(title:, author:, content:, publish_date:, tags: [])
    @title = title
    @author = author
    @content = content
    @publish_date = publish_date
    @tags = tags
  end

  def to_json(*args)
    {
      article: {
        title: @title,
        author: @author,
        content: truncate_content(@content),
        publish_date: @publish_date,
        tags: @tags,
        word_count: @content.split.length,
        reading_time: calculate_reading_time
      }
    }.to_json(*args)
  end

  private

  def truncate_content(content, limit = 500)
    content.length > limit ? "#{content[0, limit]}..." : content
  end

  def calculate_reading_time
    words = @content.split.length
    (words / 200.0).ceil # Assuming 200 words per minute
  end
end

# Usage
article = ScrapedArticle.new(
  title: "Ruby Web Scraping Guide",
  author: "Jane Developer",
  content: "This is a comprehensive guide...",
  publish_date: "2024-01-15",
  tags: ["ruby", "scraping", "tutorial"]
)

puts article.to_json
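
Because to_json forwards its arguments to the generator, instances also serialize correctly when nested inside larger Ruby structures, so a collection of articles can be converted in a single call:

articles = [article] # in practice, many ScrapedArticle instances collected while scraping
puts({ articles: articles, count: articles.length }.to_json)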

Error Handling and Data Validation

Always include proper error handling when converting scraped data to JSON:

class SafeJsonConverter
  def self.convert_with_validation(data)
    begin
      # Validate data structure
      raise ArgumentError, "Data cannot be nil" if data.nil?
      raise ArgumentError, "Data must be a Hash or Array" unless data.is_a?(Hash) || data.is_a?(Array)

      # Clean data before conversion
      cleaned_data = clean_data(data)

      # Convert to JSON
      JSON.generate(cleaned_data)

    rescue JSON::GeneratorError => e
      handle_json_error(e, data)
    rescue ArgumentError => e
      { error: e.message }.to_json
    end
  end

  def self.clean_data(data)
    case data
    when Hash
      data.transform_values { |v| clean_data(v) }
    when Array
      data.map { |item| clean_data(item) }
    when String
      # Remove null bytes and ensure UTF-8 encoding
      data.encode('UTF-8', invalid: :replace, undef: :replace).delete("\u0000")
    when NilClass, TrueClass, FalseClass, Numeric
      data
    else
      data.to_s
    end
  end

  def self.handle_json_error(error, data)
    {
      error: "JSON conversion failed",
      message: error.message,
      data_preview: data.to_s[0..100]
    }.to_json
  end
  # Keep the class-method helpers internal (a bare `private` has no effect on them)
  private_class_method :clean_data, :handle_json_error
end
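
A quick usage sketch with deliberately messy input, showing how the cleaning step normalizes encodings, strips null bytes, and stringifies objects that are not JSON-native:

messy_data = {
  title: "Caf\xE9 menu".force_encoding('ISO-8859-1'), # Latin-1 text, re-encoded to UTF-8
  note: "keep\u0000this",                             # embedded null byte, removed by clean_data
  element: Object.new                                 # not JSON-native; falls back to to_s
}

puts SafeJsonConverter.convert_with_validation(messy_data)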

Working with Large Datasets

For large amounts of scraped data, consider streaming JSON output:

require 'json'

class StreamingJsonConverter
  def initialize(output_file)
    @output_file = output_file
    @file = File.open(output_file, 'w')
    @first_item = true
  end

  def start_array
    @file.write('[')
  end

  def add_item(data)
    @file.write(',') unless @first_item
    @file.write(JSON.generate(data))
    @first_item = false
  end

  def end_array
    @file.write(']')
  end

  def close
    @file.close
  end
end

# Usage for large datasets
converter = StreamingJsonConverter.new('scraped_data.json')
converter.start_array

# Process items one by one to avoid memory issues
# (scraped_items is assumed to be an Enumerable of hashes produced by your scraper)
scraped_items.each do |item|
  converter.add_item(item)
end

converter.end_array
converter.close
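
An alternative for very large scrapes is JSON Lines (one JSON object per line), which lets you append records as they are scraped and process the file line by line later. A minimal sketch using the same scraped_items collection as above:

require 'json'

# Append each record as a single line of JSON
File.open('scraped_data.jsonl', 'a') do |file|
  scraped_items.each { |item| file.puts(JSON.generate(item)) }
end

# Read the file back one record at a time
File.foreach('scraped_data.jsonl') do |line|
  record = JSON.parse(line)
  # process record...
end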

Command Line Tools for JSON Conversion

You can create command-line tools for converting scraped data:

#!/usr/bin/env ruby

require 'json'
require 'optparse'

options = {}
OptionParser.new do |opts|
  opts.banner = "Usage: scrape_to_json.rb [options]"

  opts.on("-u", "--url URL", "URL to scrape") do |url|
    options[:url] = url
  end

  opts.on("-o", "--output FILE", "Output JSON file") do |file|
    options[:output] = file
  end

  opts.on("-f", "--format FORMAT", "JSON format (compact|pretty)") do |format|
    options[:format] = format
  end
end.parse!

# Your scraping and conversion logic here
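
# A minimal sketch of that logic (assumptions: you only want the page <title>
# and <h1> headings, and nokogiri is available):
require 'nokogiri'
require 'net/http'
require 'time'

abort("Please provide a URL with --url") unless options[:url]

response = Net::HTTP.get_response(URI(options[:url]))
abort("Request failed with status #{response.code}") unless response.code == '200'

doc = Nokogiri::HTML(response.body)
data = {
  url: options[:url],
  title: doc.at_css('title')&.text&.strip,
  headings: doc.css('h1').map { |h| h.text.strip },
  scraped_at: Time.now.iso8601
}

json = options[:format] == 'pretty' ? JSON.pretty_generate(data) : JSON.generate(data)
options[:output] ? File.write(options[:output], json) : puts(json)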

Using JSON with Web Scraping APIs

When working with web scraping services, JSON output is often the preferred format. For example, when using automated web scraping tools like WebScraping.AI, you can process the structured JSON responses and integrate them with your Ruby applications.

require 'net/http'
require 'json'
require 'nokogiri'
require 'time'

class ApiScraper
  def initialize(api_key)
    @api_key = api_key
  end

  def scrape_with_api(url)
    uri = URI("https://api.webscraping.ai/html")
    uri.query = URI.encode_www_form({
      api_key: @api_key,
      url: url,
      return_page_source: true
    })

    response = Net::HTTP.get_response(uri)

    if response.code == '200'
      # Parse API response and convert to desired JSON format
      html_content = response.body
      doc = Nokogiri::HTML(html_content)

      # Process and structure data
      structured_data = {
        source_url: url,
        scraped_content: extract_content(doc),
        metadata: {
          scraped_at: Time.now.iso8601,
          api_provider: "webscraping.ai"
        }
      }

      JSON.pretty_generate(structured_data)
    else
      { error: "API request failed", status: response.code }.to_json
    end
  end

  private

  # A minimal content extractor; the selectors here are assumptions, adapt them to your pages
  def extract_content(doc)
    {
      title: doc.at_css('title')&.text&.strip,
      headings: doc.css('h1, h2').map { |h| h.text.strip },
      paragraphs: doc.css('p').map { |p| p.text.strip }
    }
  end
end
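
A usage sketch; the environment variable name WEBSCRAPING_AI_API_KEY is just an illustrative choice for where the key might be stored:

api_scraper = ApiScraper.new(ENV['WEBSCRAPING_AI_API_KEY'])
puts api_scraper.scrape_with_api('https://example.com')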

Best Practices for JSON Conversion

  1. Always validate data before conversion to prevent JSON generation errors
  2. Handle encoding issues by ensuring UTF-8 encoding for text content
  3. Use meaningful key names that follow JSON naming conventions (snake_case or camelCase); see the sketch after this list
  4. Include metadata such as scraping timestamp and data source
  5. Consider data size and use streaming for large datasets
  6. Implement proper error handling to gracefully handle conversion failures
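
A small helper illustrating points 2 and 3, normalizing scraped field names into snake_case keys and forcing values to valid UTF-8 (a sketch, not tied to any particular site):

require 'json'

def normalize_record(raw_pairs)
  raw_pairs.each_with_object({}) do |(name, value), record|
    key = name.strip.downcase.gsub(/\s+/, '_') # "Product Name " -> "product_name"
    record[key] = value.to_s.encode('UTF-8', invalid: :replace, undef: :replace).strip
  end
end

puts JSON.generate(normalize_record({ "Product Name" => "Widget", "Unit Price" => "9.99" }))
# => {"product_name":"Widget","unit_price":"9.99"}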

Advanced JSON Formatting Options

Ruby's JSON library provides several formatting options:

# Compact JSON (minimal whitespace)
compact_json = JSON.generate(data)

# Pretty-printed JSON (readable formatting)
pretty_json = JSON.pretty_generate(data)

# Custom formatting with specific indentation
custom_json = JSON.pretty_generate(data, {
  indent: '  ',
  space: ' ',
  space_before: '',
  object_nl: "\n",
  array_nl: "\n"
})

# Sort keys for consistent output (Ruby's JSON library has no sort option,
# so sort the hash itself before generating; this sorts top-level keys only)
sorted_json = JSON.pretty_generate(data.sort.to_h)

Performance Considerations

When working with large datasets, consider these performance optimizations:

# JSON.generate and to_json share the same underlying generator, so for a Hash
# the difference between them is usually negligible; the real cost is building
# the data structure itself
large_data = { items: (1..10000).map { |i| { id: i, name: "Item #{i}" } } }

json_from_generate = JSON.generate(large_data)

# to_json adds one extra method dispatch, which rarely matters; measure if in doubt
json_from_to_json = large_data.to_json

# For repeated conversions, consider caching
class CachedJsonConverter
  def initialize
    @cache = {}
  end

  def convert(data)
    # Hash#hash is value-based, so structurally equal data reuses a cached string.
    # Note: the cache grows without bound; clear it periodically in long-running jobs.
    cache_key = data.hash
    @cache[cache_key] ||= JSON.generate(data)
  end
end
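
If the conversion is a real bottleneck, measure it rather than guessing. A sketch using Ruby's standard Benchmark module and the large_data hash from above:

require 'json'
require 'benchmark'

large_data = { items: (1..10000).map { |i| { id: i, name: "Item #{i}" } } }

Benchmark.bm(15) do |x|
  x.report("JSON.generate") { 100.times { JSON.generate(large_data) } }
  x.report("to_json")       { 100.times { large_data.to_json } }
end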

Conclusion

Converting scraped data to JSON format in Ruby is straightforward with the built-in JSON library. By following the patterns and best practices outlined above, you can create robust, maintainable scraping solutions that produce clean, structured JSON output. Remember to always validate your data, handle errors gracefully, and consider performance implications when working with large datasets.

The key to successful JSON conversion lies in proper data structure organization and thorough error handling. Whether you're building simple scrapers or complex data extraction systems, these techniques will help you create reliable, production-ready solutions for your web scraping projects.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
