How do I implement caching mechanisms for Ruby web scraping projects?

Implementing effective caching mechanisms in Ruby web scraping projects is crucial for improving performance, reducing server load, and avoiding rate limits. This guide covers various caching strategies, from simple file-based caching to advanced Redis implementations.

Why Caching Matters in Web Scraping

Caching serves multiple purposes in web scraping:

  • Performance: Avoid re-fetching unchanged content
  • Rate limiting compliance: Reduce API calls and HTTP requests
  • Cost reduction: Minimize bandwidth usage and server resources
  • Reliability: Provide fallback data when websites are unavailable
  • Debugging: Store responses for analysis and testing

File-Based Caching

File-based caching is the simplest approach, storing HTTP responses or parsed data directly to disk.

Basic File Cache Implementation

require 'fileutils'
require 'digest'
require 'json'
require 'time' # Time.parse and Time#iso8601 live in the 'time' stdlib extension

class FileCache
  def initialize(cache_dir = './cache')
    @cache_dir = cache_dir
    FileUtils.mkdir_p(@cache_dir)
  end

  def get(key)
    file_path = cache_file_path(key)
    return nil unless File.exist?(file_path)

    cached_data = JSON.parse(File.read(file_path))

    # Check expiration
    if cached_data['expires_at'] && Time.now > Time.parse(cached_data['expires_at'])
      delete(key)
      return nil
    end

    cached_data['data']
  rescue JSON::ParserError
    delete(key)
    nil
  end

  def set(key, data, ttl = 3600)
    file_path = cache_file_path(key)
    expires_at = Time.now + ttl

    cache_data = {
      'data' => data,
      'expires_at' => expires_at.iso8601,
      'created_at' => Time.now.iso8601
    }

    File.write(file_path, JSON.pretty_generate(cache_data))
  end

  def delete(key)
    file_path = cache_file_path(key)
    File.delete(file_path) if File.exist?(file_path)
  end

  private

  def cache_file_path(key)
    hashed_key = Digest::SHA256.hexdigest(key.to_s)
    File.join(@cache_dir, "#{hashed_key}.json")
  end
end

Using File Cache in Web Scraping

require 'net/http'
require 'uri'

class WebScraper
  def initialize
    @cache = FileCache.new('./http_cache')
  end

  def fetch_page(url, cache_ttl = 3600)
    cache_key = "page:#{url}"

    # Try to get from cache first
    cached_content = @cache.get(cache_key)
    return cached_content if cached_content

    # Fetch from web if not cached
    puts "Fetching #{url} from web..."
    uri = URI(url)
    response = Net::HTTP.get_response(uri)

    if response.code == '200'
      content = response.body
      @cache.set(cache_key, content, cache_ttl)
      content
    else
      raise "HTTP Error: #{response.code}"
    end
  end
end

# Usage
scraper = WebScraper.new
content = scraper.fetch_page('https://example.com', 1800) # Cache for 30 minutes

Redis-Based Caching

Redis provides a more robust caching solution with features like automatic expiration, atomic operations, and distributed caching capabilities.

Setting Up Redis Cache

require 'redis'
require 'json'

class RedisCache
  def initialize(redis_url = 'redis://localhost:6379')
    @redis = Redis.new(url: redis_url)
  end

  def get(key)
    cached_data = @redis.get(key)
    return nil unless cached_data

    JSON.parse(cached_data)
  rescue JSON::ParserError
    delete(key)
    nil
  end

  def set(key, data, ttl = 3600)
    @redis.setex(key, ttl, JSON.generate(data))
  end

  def delete(key)
    @redis.del(key)
  end

  def exists?(key)
    # redis-rb's exists? already returns a boolean (exists returns an integer count)
    @redis.exists?(key)
  end

  def ttl(key)
    @redis.ttl(key)
  end

  def keys(pattern = '*')
    @redis.keys(pattern)
  end
end
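
A quick usage sketch, assuming a Redis server is reachable at the default redis://localhost:6379; the key and payload below are illustrative:

# Assumes a local Redis server; key names and values are made up for illustration
cache = RedisCache.new

cache.set('product:123', { 'name' => 'Widget', 'price' => 9.99 }, 900) # 15-minute TTL
product = cache.get('product:123')  # => { "name" => "Widget", "price" => 9.99 }
puts cache.ttl('product:123')       # seconds remaining until expiry
cache.delete('product:123')
puts cache.exists?('product:123')   # => false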

Advanced Redis Caching with Compression

require 'zlib'
require 'base64'

class CompressedRedisCache < RedisCache
  def set(key, data, ttl = 3600, compress: true)
    json_data = JSON.generate(data)

    if compress && json_data.size > 1024 # Compress if data > 1KB
      compressed_data = Zlib::Deflate.deflate(json_data)
      encoded_data = Base64.encode64(compressed_data)
      @redis.del(key) # drop any stale uncompressed copy of this key
      @redis.setex("#{key}:compressed", ttl, encoded_data)
    else
      @redis.del("#{key}:compressed") # drop any stale compressed copy of this key
      @redis.setex(key, ttl, json_data)
    end
  end

  def get(key)
    # Try compressed version first
    compressed_data = @redis.get("#{key}:compressed")
    if compressed_data
      decoded_data = Base64.decode64(compressed_data)
      json_data = Zlib::Inflate.inflate(decoded_data)
      return JSON.parse(json_data)
    end

    # Fallback to uncompressed
    super(key)
  end
end
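
A brief sketch of how the 1KB threshold plays out in practice; the keys and payloads are made up for illustration:

cache = CompressedRedisCache.new

# ~500 small hashes serialize to well over 1KB, so this payload is stored
# deflated under 'catalog:all:compressed'
large_payload = { 'rows' => Array.new(500) { |i| { 'id' => i, 'name' => "Item #{i}" } } }
cache.set('catalog:all', large_payload, 3600)

# A tiny payload stays below the threshold and is stored as plain JSON
cache.set('catalog:status', { 'status' => 'ok' }, 3600)

puts cache.get('catalog:all')['rows'].size # => 500, transparently inflated on read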

In-Memory Caching with LRU

For frequently accessed data, in-memory caching provides the fastest access times.

class LRUCache
  def initialize(max_size = 1000)
    @max_size = max_size
    @cache = {}
    @access_order = []
  end

  def get(key)
    if @cache.key?(key)
      # Move to end (most recently used)
      @access_order.delete(key)
      @access_order.push(key)

      # Check expiration
      entry = @cache[key]
      if entry[:expires_at] && Time.now > entry[:expires_at]
        delete(key)
        return nil
      end

      entry[:data]
    else
      nil
    end
  end

  def set(key, data, ttl = 3600)
    # Remove if already exists
    delete(key) if @cache.key?(key)

    # Remove oldest if at capacity
    if @cache.size >= @max_size
      oldest_key = @access_order.shift
      @cache.delete(oldest_key)
    end

    # Add new entry
    @cache[key] = {
      data: data,
      expires_at: Time.now + ttl,
      created_at: Time.now
    }
    @access_order.push(key)
  end

  def delete(key)
    @cache.delete(key)
    @access_order.delete(key)
  end

  def size
    @cache.size
  end

  def clear
    @cache.clear
    @access_order.clear
  end
end
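
A small sketch showing the eviction behaviour with a deliberately tiny capacity; the keys and values are illustrative:

cache = LRUCache.new(2) # tiny capacity to make eviction visible

cache.set('a', 'alpha', 60)
cache.set('b', 'beta', 60)
cache.get('a')               # 'a' becomes the most recently used entry
cache.set('c', 'gamma', 60)  # capacity reached, so 'b' (least recently used) is evicted

puts cache.get('b').inspect  # => nil
puts cache.size              # => 2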

Multi-Level Caching Strategy

Combine different caching layers for optimal performance:

class MultiLevelCache
  def initialize
    @memory_cache = LRUCache.new(500)  # Fast, limited size
    @redis_cache = RedisCache.new      # Medium speed, larger capacity
    @file_cache = FileCache.new        # Slow, persistent
  end

  def get(key)
    # Level 1: Memory cache
    data = @memory_cache.get(key)
    return data if data

    # Level 2: Redis cache
    data = @redis_cache.get(key)
    if data
      @memory_cache.set(key, data, 300) # Cache in memory for 5 minutes
      return data
    end

    # Level 3: File cache
    data = @file_cache.get(key)
    if data
      @memory_cache.set(key, data, 300)
      @redis_cache.set(key, data, 1800) # Cache in Redis for 30 minutes
      return data
    end

    nil
  end

  def set(key, data, ttl = 3600)
    @memory_cache.set(key, data, [ttl, 300].min)
    @redis_cache.set(key, data, ttl)
    @file_cache.set(key, data, ttl * 2) # Longer persistence
  end
end
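
Usage is the same as with any single backend; this sketch assumes a local Redis instance and uses an illustrative key:

cache = MultiLevelCache.new

cache.set('page:https://example.com', '<html>...</html>', 3600)

# Served from the in-memory layer while the process is alive; after a
# restart the file cache still holds the entry and repopulates the
# faster layers on the next read.
html = cache.get('page:https://example.com')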

HTTP Response Caching

Implement intelligent HTTP caching that respects cache headers:

require 'net/http'
require 'time'

class HTTPCache
  def initialize(cache_backend)
    @cache = cache_backend
  end

  def fetch(url, options = {})
    cache_key = "http:#{url}"

    # Check for cached response
    cached_response = @cache.get(cache_key)
    if cached_response && !expired?(cached_response)
      puts "Cache hit for #{url}"
      return cached_response['body']
    end

    # Make conditional request if we have cached data
    headers = {}
    if cached_response
      headers['If-Modified-Since'] = cached_response['last_modified'] if cached_response['last_modified']
      headers['If-None-Match'] = cached_response['etag'] if cached_response['etag']
    end

    # Fetch from web
    response = make_request(url, headers)

    case response.code
    when '304' # Not Modified
      puts "304 Not Modified for #{url}"
      refresh_cache_ttl(cache_key, cached_response)
      return cached_response['body']
    when '200'
      puts "200 OK for #{url}"
      cache_response(cache_key, response, options)
      return response.body
    else
      raise "HTTP Error: #{response.code}"
    end
  end

  private

  def make_request(url, headers = {})
    uri = URI(url)
    http = Net::HTTP.new(uri.host, uri.port)
    http.use_ssl = uri.scheme == 'https'

    request = Net::HTTP::Get.new(uri)
    headers.each { |key, value| request[key] = value }

    http.request(request)
  end

  def cache_response(cache_key, response, options)
    ttl = calculate_ttl(response, options)

    cached_data = {
      'body' => response.body,
      'headers' => response.to_hash,
      'last_modified' => response['Last-Modified'],
      'etag' => response['ETag'],
      'cached_at' => Time.now.iso8601
    }

    @cache.set(cache_key, cached_data, ttl)
  end

  def calculate_ttl(response, options)
    # Check Cache-Control header
    cache_control = response['Cache-Control']
    if cache_control && cache_control.include?('max-age=')
      max_age = cache_control.match(/max-age=(\d+)/)[1].to_i
      return max_age
    end

    # Check Expires header
    expires = response['Expires']
    if expires
      expires_time = Time.parse(expires)
      return [(expires_time - Time.now).to_i, 0].max
    end

    # Default TTL
    options[:default_ttl] || 3600
  end

  def expired?(cached_response)
    return false unless cached_response['cached_at']

    cached_time = Time.parse(cached_response['cached_at'])
    Time.now > cached_time + 3600 # Default 1 hour expiration
  end

  def refresh_cache_ttl(cache_key, cached_response)
    # Extend the cache TTL for 304 responses and reset the freshness timestamp
    refreshed = cached_response.merge('cached_at' => Time.now.iso8601)
    @cache.set(cache_key, refreshed, 3600)
  end
end
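
A minimal sketch wiring HTTPCache to the RedisCache backend defined earlier; the URL and TTL are illustrative:

http_cache = HTTPCache.new(RedisCache.new)

body = http_cache.fetch('https://example.com', default_ttl: 1800)

# A second call within the freshness window is a cache hit; once the entry
# goes stale the client revalidates with If-Modified-Since / If-None-Match
# and reuses the cached body on a 304 response.
body = http_cache.fetch('https://example.com', default_ttl: 1800)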

Caching Parsed Data

Cache processed data structures to avoid re-parsing:

require 'nokogiri'
require 'digest'
require 'net/http'
require 'uri'

class ParsedDataCache
  def initialize(cache_backend)
    @cache = cache_backend
  end

  def get_parsed_data(url, css_selector, &block)
    cache_key = "parsed:#{Digest::SHA256.hexdigest("#{url}:#{css_selector}")}"

    # Try cache first
    cached_data = @cache.get(cache_key)
    return cached_data if cached_data

    # Fetch and parse
    html = fetch_html(url)
    doc = Nokogiri::HTML(html)

    # Extract data using the provided block or selector
    data = if block_given?
             yield(doc)
           else
             doc.css(css_selector).map(&:text)
           end

    # Cache the parsed data
    @cache.set(cache_key, data, 1800) # 30 minutes
    data
  end

  private

  def fetch_html(url)
    # Use your preferred HTTP client
    # Could integrate with HTTPCache from previous example
    Net::HTTP.get(URI(url))
  end
end

# Usage
cache = ParsedDataCache.new(RedisCache.new)

# Cache extracted links
links = cache.get_parsed_data('https://example.com', 'a') do |doc|
  doc.css('a').map { |link| { text: link.text, href: link['href'] } }
end

Cache Warming and Management

Implement cache warming strategies for better performance:

class CacheWarmer
  def initialize(cache_backend, scraper)
    @cache = cache_backend
    @scraper = scraper
  end

  def warm_urls(urls, delay: 1)
    urls.each do |url|
      begin
        puts "Warming cache for #{url}"
        @scraper.fetch_page(url)
        sleep(delay) # Respect rate limits
      rescue StandardError => e
        puts "Failed to warm #{url}: #{e.message}"
      end
    end
  end

  def refresh_expired_cache
    # This assumes your cache backend supports key scanning
    if @cache.respond_to?(:keys)
      @cache.keys('page:*').each do |key|
        ttl = @cache.ttl(key)
        if ttl > 0 && ttl < 300 # Refresh if expiring in 5 minutes
          url = key.sub('page:', '')
          begin
            @scraper.fetch_page(url)
            puts "Refreshed cache for #{url}"
          rescue StandardError => e
            puts "Failed to refresh #{url}: #{e.message}"
          end
        end
      end
    end
  end
end
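
A usage sketch, assuming the scraper and the warmer share the same Redis-backed cache with entries stored under 'page:' keys (the WebScraper shown earlier uses a FileCache internally, so you would adapt its backend accordingly); the URLs are placeholders:

warmer = CacheWarmer.new(RedisCache.new, WebScraper.new)

# Pre-fetch the pages you expect to need, pausing between requests
warmer.warm_urls(
  ['https://example.com/page-1', 'https://example.com/page-2'],
  delay: 1
)

# Run periodically (e.g. from a cron job or scheduler) to re-fetch
# entries that are about to expire
warmer.refresh_expired_cache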

Best Practices and Performance Tips

1. Choose Appropriate TTL Values

CACHE_TTLS = {
  static_content: 24 * 3600,    # 24 hours
  product_data: 2 * 3600,       # 2 hours
  news_articles: 30 * 60,       # 30 minutes
  real_time_data: 60            # 1 minute
}.freeze
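
One way to put this table to work is to pick the TTL by content type at call time; cache_page and content_type below are hypothetical helpers for illustration:

# content_type is a symbol you assign per URL or per scraping job
def cache_page(cache, url, body, content_type)
  ttl = CACHE_TTLS.fetch(content_type, 3600) # default to 1 hour for unknown types
  cache.set("page:#{url}", body, ttl)
end

cache_page(RedisCache.new, 'https://example.com/products/1', '<html>...</html>', :product_data)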

2. Implement Cache Invalidation

class SmartCache
  def initialize(cache_backend)
    @cache = cache_backend
  end

  def invalidate_pattern(pattern)
    if @cache.respond_to?(:keys)
      @cache.keys(pattern).each { |key| @cache.delete(key) }
    end
  end

  def invalidate_domain(domain)
    invalidate_pattern("*#{domain}*")
  end
end
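
For example, with a Redis backend (the domain and patterns are illustrative):

cache = SmartCache.new(RedisCache.new)

# Drop every cached entry for a domain, e.g. after a site redesign
# invalidates previously scraped markup
cache.invalidate_domain('example.com')

# Or target a narrower pattern, such as product pages only
cache.invalidate_pattern('page:*example.com/products/*')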

3. Monitor Cache Performance

class CacheMonitor
  def initialize(cache)
    @cache = cache
    @stats = { hits: 0, misses: 0, sets: 0 }
  end

  def get(key)
    result = @cache.get(key)
    if result
      @stats[:hits] += 1
    else
      @stats[:misses] += 1
    end
    result
  end

  def set(key, data, ttl)
    @stats[:sets] += 1
    @cache.set(key, data, ttl)
  end

  def hit_rate
    total = @stats[:hits] + @stats[:misses]
    return 0 if total == 0
    (@stats[:hits].to_f / total * 100).round(2)
  end
end
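
Wrap any backend with the monitor and report the hit rate periodically; a short sketch with illustrative keys:

monitor = CacheMonitor.new(RedisCache.new)

monitor.set('page:https://example.com', '<html>...</html>', 600)
monitor.get('page:https://example.com')   # counted as a hit
monitor.get('page:https://missing.test')  # counted as a miss

puts "Cache hit rate: #{monitor.hit_rate}%" # => 50.0 with the calls above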

Conclusion

Implementing effective caching mechanisms in Ruby web scraping projects significantly improves performance and reliability. Start with simple file-based caching for small projects, then graduate to Redis or multi-level caching for production applications. Remember to implement proper cache invalidation strategies and monitor your cache performance to ensure optimal results.

When building complex scraping workflows, consider combining caching strategies with advanced error handling techniques and timeout management for robust, production-ready scrapers.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
