How to Handle File Downloads During Web Scraping with Ruby

File downloads are a common requirement in web scraping projects, whether you're downloading images, PDFs, documents, or data files. Ruby provides several robust approaches for handling file downloads during web scraping, from simple HTTP requests to more sophisticated browser automation tools.

Basic File Downloads with Net::HTTP

Ruby's built-in Net::HTTP library provides the foundation for downloading files. Here's a basic example:

require 'net/http'
require 'uri'
require 'fileutils'

def download_file(url, local_path)
  uri = URI(url)
  FileUtils.mkdir_p(File.dirname(local_path))

  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    request = Net::HTTP::Get.new(uri)

    http.request(request) do |response|
      if response.code == '200'
        File.open(local_path, 'wb') do |file|
          response.read_body do |chunk|
            file.write(chunk)
          end
        end
        puts "Downloaded: #{local_path}"
      else
        puts "Failed to download: HTTP #{response.code}"
      end
    end
  end
end

# Usage
download_file('https://example.com/document.pdf', 'downloads/document.pdf')

Simpler Downloads with Open-URI

For simpler use cases, Ruby's open-uri library provides a more concise approach:

require 'open-uri'
require 'fileutils'

def simple_download(url, local_path)
  FileUtils.mkdir_p(File.dirname(local_path))

  begin
    URI.open(url) do |remote_file|
      File.open(local_path, 'wb') do |local_file|
        local_file.write(remote_file.read)
      end
    end
    puts "Downloaded: #{local_path}"
  rescue OpenURI::HTTPError => e
    puts "HTTP Error: #{e.message}"
  rescue => e
    puts "Error: #{e.message}"
  end
end

# Download with custom headers
def download_with_headers(url, local_path, headers = {})
  default_headers = {
    'User-Agent' => 'Mozilla/5.0 (Ruby Web Scraper)'
  }

  FileUtils.mkdir_p(File.dirname(local_path))

  URI.open(url, default_headers.merge(headers)) do |remote_file|
    File.open(local_path, 'wb') do |local_file|
      local_file.write(remote_file.read)
    end
  end
end

Using Mechanize for Complex Downloads

Mechanize is excellent for handling downloads that require session management, form submissions, or authentication:

require 'mechanize'
require 'fileutils'

class FileDownloader
  def initialize
    @agent = Mechanize.new
    @agent.user_agent_alias = 'Windows Chrome'
    @agent.follow_meta_refresh = true
  end

  def login_and_download(login_url, username, password, file_url, local_path)
    # Navigate to login page
    page = @agent.get(login_url)

    # Fill and submit login form
    form = page.forms.first
    form.field_with(name: 'username').value = username
    form.field_with(name: 'password').value = password

    result = @agent.submit(form)

    # Download file after authentication
    download_file(file_url, local_path)
  end

  def download_file(url, local_path)
    FileUtils.mkdir_p(File.dirname(local_path))

    file = @agent.get(url)
    file.save(local_path)
    puts "Downloaded: #{local_path}"
  rescue Mechanize::ResponseCodeError => e
    puts "Failed to download: #{e.response_code}"
  end

  def download_multiple_files(urls, download_dir)
    urls.each_with_index do |url, index|
      filename = File.basename(URI(url).path)
      filename = "file_#{index}.bin" if filename.empty?

      local_path = File.join(download_dir, filename)
      download_file(url, local_path)

      # Add delay to avoid overwhelming the server
      sleep(1)
    end
  end
end

# Usage
downloader = FileDownloader.new
downloader.login_and_download(
  'https://example.com/login',
  'your_username',
  'your_password',
  'https://example.com/protected/file.pdf',
  'downloads/protected_file.pdf'
)

Handling Different File Types

Different file types may require specific handling approaches:

require 'net/http'
require 'mime/types'

class TypedFileDownloader
  def self.download_with_type_detection(url, download_dir)
    uri = URI(url)

    Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
      request = Net::HTTP::Get.new(uri)

      response = http.request(request)

      if response.code == '200'
        # Detect file type from Content-Type header
        content_type = response['content-type']
        extension = get_extension_from_content_type(content_type)

        # Generate filename
        filename = File.basename(uri.path)
        filename = "downloaded_file#{extension}" if filename.empty? || !filename.include?('.')

        local_path = File.join(download_dir, filename)

        File.open(local_path, 'wb') do |file|
          file.write(response.body)
        end

        puts "Downloaded #{content_type} file: #{local_path}"
        local_path
      end
    end
  end

  # `private` does not apply to class methods, so use private_class_method instead
  def self.get_extension_from_content_type(content_type)
    # Strip any charset parameter (e.g. "application/pdf; charset=utf-8") before the lookup
    mime_type = MIME::Types[content_type.to_s.split(';').first.to_s.strip].first
    mime_type ? ".#{mime_type.preferred_extension}" : '.bin'
  end
  private_class_method :get_extension_from_content_type
end

Progress Tracking for Large Files

For large file downloads, implementing progress tracking is essential:

require 'net/http'
require 'ruby-progressbar'

def download_with_progress(url, local_path)
  uri = URI(url)

  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    request = Net::HTTP::Get.new(uri)

    http.request(request) do |response|
      if response.code == '200'
        total_size = response['content-length'].to_i
        progress_bar = ProgressBar.create(title: 'Downloading', total: total_size) if total_size > 0

        File.open(local_path, 'wb') do |file|
          response.read_body do |chunk|
            file.write(chunk)
            progress_bar.progress = [progress_bar.progress + chunk.bytesize, total_size].min if progress_bar
          end
        end

        progress_bar.finish if progress_bar
        puts "\nDownload completed: #{local_path}"
      end
    end
  end
end

Concurrent Downloads

For downloading multiple files efficiently, use concurrent processing:

require 'concurrent'
require 'net/http'

class ConcurrentDownloader
  def initialize(max_threads: 5)
    @thread_pool = Concurrent::FixedThreadPool.new(max_threads)
  end

  def download_files(url_path_pairs)
    futures = url_path_pairs.map do |url, local_path|
      Concurrent::Future.execute(executor: @thread_pool) do
        download_single_file(url, local_path)
      end
    end

    # Wait for all downloads to complete
    results = futures.map(&:value)
    @thread_pool.shutdown
    @thread_pool.wait_for_termination

    results
  end

  private

  def download_single_file(url, local_path)
    uri = URI(url)

    begin
      Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
        request = Net::HTTP::Get.new(uri)
        response = http.request(request)

        if response.code == '200'
          File.open(local_path, 'wb') { |file| file.write(response.body) }
          { url: url, path: local_path, status: :success }
        else
          { url: url, path: local_path, status: :failed, error: "HTTP #{response.code}" }
        end
      end
    rescue => e
      { url: url, path: local_path, status: :error, error: e.message }
    end
  end
end

# Usage
downloader = ConcurrentDownloader.new(max_threads: 3)
files_to_download = [
  ['https://example.com/file1.pdf', 'downloads/file1.pdf'],
  ['https://example.com/file2.jpg', 'downloads/file2.jpg'],
  ['https://example.com/file3.doc', 'downloads/file3.doc']
]

results = downloader.download_files(files_to_download)
results.each { |result| puts "#{result[:url]}: #{result[:status]}" }

Error Handling and Retry Logic

Robust file downloading requires proper error handling and retry mechanisms:

require 'net/http'
require 'retries'

class RobustDownloader
  def download_with_retry(url, local_path, max_retries: 3)
    with_retries(max_tries: max_retries, rescue: [Net::OpenTimeout, Net::ReadTimeout, Net::HTTPError]) do
      download_file(url, local_path)
    end
  rescue => e
    puts "Failed to download after #{max_retries} attempts: #{e.message}"
    false
  end

  private

  def download_file(url, local_path)
    uri = URI(url)

    http = Net::HTTP.new(uri.host, uri.port)
    http.use_ssl = uri.scheme == 'https'
    http.read_timeout = 30
    http.open_timeout = 10

    request = Net::HTTP::Get.new(uri)
    request['User-Agent'] = 'Mozilla/5.0 (Ruby Downloader)'

    response = http.request(request)

    case response
    when Net::HTTPSuccess
      File.open(local_path, 'wb') { |file| file.write(response.body) }
      true
    when Net::HTTPRedirection
      # Follow redirects, resolving relative Location headers against the original URL
      new_url = URI.join(url, response['location']).to_s
      download_file(new_url, local_path)
    else
      raise Net::HTTPError.new("HTTP Error: #{response.code}", response)
    end
  end
end

Integration with Browser Automation

For files that require JavaScript execution or complex user interactions (similar to how to handle file downloads in Puppeteer), you can use Selenium with Ruby:

require 'selenium-webdriver'
require 'fileutils'

class BrowserDownloader
  def initialize(download_dir)
    @download_dir = File.expand_path(download_dir)
    FileUtils.mkdir_p(@download_dir)

    options = Selenium::WebDriver::Chrome::Options.new
    options.add_preference('download.default_directory', @download_dir)
    options.add_preference('download.prompt_for_download', false)

    @driver = Selenium::WebDriver.for(:chrome, options: options)
  end

  def download_file_with_js(url, download_link_selector)
    @driver.navigate.to(url)

    # Wait for page to load and click download link
    wait = Selenium::WebDriver::Wait.new(timeout: 10)
    download_link = wait.until { @driver.find_element(css: download_link_selector) }

    download_link.click

    # Wait for download to complete
    wait_for_download_completion
  end

  def close
    @driver.quit
  end

  private

  def wait_for_download_completion(timeout: 30)
    start_time = Time.now

    loop do
      # Check if any .crdownload files exist (Chrome partial downloads)
      partial_files = Dir.glob(File.join(@download_dir, '*.crdownload'))
      break if partial_files.empty?

      if Time.now - start_time > timeout
        puts "Download timeout exceeded"
        break
      end

      sleep(1)
    end
  end
end

Advanced Use Cases with WebScraping.AI API

For complex scenarios where traditional Ruby libraries might face challenges with anti-bot measures or JavaScript-heavy sites, you can leverage specialized web scraping APIs. The WebScraping.AI API provides robust file download capabilities with built-in proxy rotation and JavaScript rendering:

require 'net/http'
require 'json'
require 'uri'

class WebScrapingAIDownloader
  def initialize(api_key)
    @api_key = api_key
    @base_url = 'https://api.webscraping.ai/download'
  end

  def download_file(url, local_path, options = {})
    params = {
      url: url,
      api_key: @api_key,
      proxy: options[:proxy] || 'datacenter',
      device: options[:device] || 'desktop',
      js: options[:js] || false
    }

    uri = URI(@base_url)
    uri.query = URI.encode_www_form(params)

    response = Net::HTTP.get_response(uri)

    if response.code == '200'
      File.open(local_path, 'wb') { |file| file.write(response.body) }
      puts "Downloaded via WebScraping.AI: #{local_path}"
      true
    else
      puts "API Error: #{response.code} - #{response.body}"
      false
    end
  end
end

# Usage
api_downloader = WebScrapingAIDownloader.new('your_api_key_here')
api_downloader.download_file(
  'https://example.com/protected-file.pdf',
  'downloads/protected_file.pdf',
  { proxy: 'residential', js: true }
)

Best Practices and Considerations

Performance Optimization

  1. Use streaming for large files to avoid memory issues
  2. Implement connection pooling or keep-alive reuse for multiple downloads (see the sketch after this list)
  3. Set appropriate timeouts to handle slow connections
  4. Use concurrent downloads with reasonable thread limits
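
As a minimal sketch of points 2 and 3, the example below reuses a single keep-alive connection with explicit timeouts to stream several files from one host; the host, paths, and timeout values are illustrative assumptions.

require 'net/http'
require 'fileutils'

# Sketch: reuse one keep-alive connection for several files from the same host.
# Host, paths, and timeout values below are illustrative placeholders.
def download_many_from_host(host, paths, download_dir)
  FileUtils.mkdir_p(download_dir)

  Net::HTTP.start(host, 443, use_ssl: true, open_timeout: 10, read_timeout: 30) do |http|
    paths.each do |path|
      local_path = File.join(download_dir, File.basename(path))

      http.request(Net::HTTP::Get.new(path)) do |response|
        next unless response.is_a?(Net::HTTPSuccess)

        # Stream the body to disk instead of buffering it in memory
        File.open(local_path, 'wb') do |file|
          response.read_body { |chunk| file.write(chunk) }
        end
      end
    end
  end
end

# download_many_from_host('example.com', ['/files/a.pdf', '/files/b.pdf'], 'downloads')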

Security Considerations

  1. Validate file types before saving
  2. Sanitize filenames to prevent directory traversal attacks (a sketch covering points 1-3 follows this list)
  3. Check file sizes to prevent disk space exhaustion
  4. Use HTTPS when possible for secure downloads
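
A minimal sketch of points 1-3, assuming an illustrative extension whitelist and a 50 MB cap; adjust both values to your own requirements.

require 'uri'

# Illustrative whitelist and size cap; tune these to your project's needs.
ALLOWED_EXTENSIONS = %w[.pdf .jpg .png .csv .zip].freeze
MAX_BYTES = 50 * 1024 * 1024

# Strip directory components and unsafe characters, then enforce the whitelist.
def sanitize_filename(url)
  name = File.basename(URI(url).path).gsub(/[^0-9A-Za-z.\-_]/, '_')
  unless ALLOWED_EXTENSIONS.include?(File.extname(name).downcase)
    raise ArgumentError, "Disallowed file type: #{name}"
  end
  name
end

# Reject responses whose declared size exceeds the cap before writing anything.
def size_within_limit?(response)
  length = response['content-length']
  length.nil? || length.to_i <= MAX_BYTES
end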

Error Handling

  1. Implement retry logic for transient failures
  2. Handle different HTTP status codes appropriately
  3. Log download attempts for debugging
  4. Validate downloaded files for completeness (see the sketch after this list)
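
For point 4, one hedged approach is to compare the saved file's size against the Content-Length header whenever the server provides one:

# Sketch: verify a finished download against the server's declared length.
# Returns true when no Content-Length header was sent, since there is nothing to compare.
def download_complete?(local_path, response)
  expected = response['content-length']
  return true if expected.nil?

  File.size(local_path) == expected.to_i
end

# Example: re-queue the URL for a retry when the sizes do not match.
# retry_queue << url unless download_complete?(local_path, response)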

Common Challenges and Solutions

When downloading files through web scraping, you might encounter several challenges:

  • Authentication Requirements: Use Mechanize for session-based authentication, or handle authentication flows with browser automation
  • Dynamic Content: For JavaScript-generated download links, use Selenium WebDriver
  • Rate Limiting: Implement delays and respect robots.txt files (a simple throttling sketch follows this list)
  • Large Files: Use streaming downloads and progress tracking
  • File Type Detection: Check Content-Type headers and validate file extensions
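
As a minimal rate-limiting sketch, the class below enforces a minimum interval between consecutive requests; the interval value is an assumption, and robots.txt parsing is left to a dedicated library.

# Sketch: enforce a minimum delay between consecutive requests.
# The default 1-second interval is an illustrative assumption, not a universal rule.
class Throttle
  def initialize(min_interval = 1.0)
    @min_interval = min_interval
    @last_request_at = nil
  end

  # Sleeps just long enough to honor the minimum interval, then runs the block.
  def wait
    if @last_request_at
      elapsed = Time.now - @last_request_at
      sleep(@min_interval - elapsed) if elapsed < @min_interval
    end
    @last_request_at = Time.now
    yield if block_given?
  end
end

# throttle = Throttle.new(1.5)
# urls.each { |url| throttle.wait { download_file(url, "downloads/#{File.basename(url)}") } }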

Conclusion

Ruby provides multiple approaches for handling file downloads during web scraping, from simple HTTP requests with Net::HTTP to sophisticated browser automation with Selenium. Choose the method that best fits your specific requirements:

  • Use Net::HTTP or Open-URI for simple, direct file downloads
  • Use Mechanize for downloads requiring session management or form interactions
  • Use Selenium for downloads requiring JavaScript execution or complex user interactions
  • Use WebScraping.AI API for enterprise-grade scraping with advanced anti-bot protection
  • Implement concurrent downloads for better performance with multiple files
  • Always include proper error handling and retry logic for production applications

The key to successful file downloading in Ruby web scraping is understanding your target website's requirements and choosing the appropriate tool and technique for your specific use case. Whether you're downloading a single document or processing thousands of files, Ruby's ecosystem provides the tools you need to build robust and efficient download solutions.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
