How to Handle File Downloads During Web Scraping with Ruby

File downloads are a common requirement in web scraping projects, whether you're downloading images, PDFs, documents, or data files. Ruby provides several robust approaches for handling file downloads during web scraping, from simple HTTP requests to more sophisticated browser automation tools.

Basic File Downloads with Net::HTTP

Ruby's built-in Net::HTTP library provides the foundation for downloading files. Here's a basic example:

require 'net/http'
require 'uri'
require 'fileutils'

def download_file(url, local_path)
  uri = URI(url)
  FileUtils.mkdir_p(File.dirname(local_path))

  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    request = Net::HTTP::Get.new(uri)

    http.request(request) do |response|
      if response.code == '200'
        File.open(local_path, 'wb') do |file|
          response.read_body do |chunk|
            file.write(chunk)
          end
        end
        puts "Downloaded: #{local_path}"
      else
        puts "Failed to download: HTTP #{response.code}"
      end
    end
  end
end

# Usage
download_file('https://example.com/document.pdf', 'downloads/document.pdf')

Simpler Downloads with Open-URI

For simpler use cases, Ruby's open-uri library provides a more concise approach:

require 'open-uri'
require 'fileutils'

def simple_download(url, local_path)
  FileUtils.mkdir_p(File.dirname(local_path))

  begin
    URI.open(url) do |remote_file|
      File.open(local_path, 'wb') do |local_file|
        local_file.write(remote_file.read)
      end
    end
    puts "Downloaded: #{local_path}"
  rescue OpenURI::HTTPError => e
    puts "HTTP Error: #{e.message}"
  rescue => e
    puts "Error: #{e.message}"
  end
end

# Download with custom headers
def download_with_headers(url, local_path, headers = {})
  default_headers = {
    'User-Agent' => 'Mozilla/5.0 (Ruby Web Scraper)'
  }

  FileUtils.mkdir_p(File.dirname(local_path))

  URI.open(url, default_headers.merge(headers)) do |remote_file|
    File.open(local_path, 'wb') do |local_file|
      local_file.write(remote_file.read)
    end
  end
end

Using Mechanize for Complex Downloads

Mechanize is excellent for handling downloads that require session management, form submissions, or authentication:

require 'mechanize'
require 'fileutils'

class FileDownloader
  def initialize
    @agent = Mechanize.new
    @agent.user_agent_alias = 'Windows Chrome'
    @agent.follow_meta_refresh = true
  end

  def login_and_download(login_url, username, password, file_url, local_path)
    # Navigate to login page
    page = @agent.get(login_url)

    # Fill and submit login form
    form = page.forms.first
    form.field_with(name: 'username').value = username
    form.field_with(name: 'password').value = password

    result = @agent.submit(form)

    # Download file after authentication
    download_file(file_url, local_path)
  end

  def download_file(url, local_path)
    FileUtils.mkdir_p(File.dirname(local_path))

    file = @agent.get(url)
    file.save(local_path)
    puts "Downloaded: #{local_path}"
  rescue Mechanize::ResponseCodeError => e
    puts "Failed to download: #{e.response_code}"
  end

  def download_multiple_files(urls, download_dir)
    urls.each_with_index do |url, index|
      filename = File.basename(URI(url).path)
      filename = "file_#{index}.bin" if filename.empty?

      local_path = File.join(download_dir, filename)
      download_file(url, local_path)

      # Add delay to avoid overwhelming the server
      sleep(1)
    end
  end
end

# Usage
downloader = FileDownloader.new
downloader.login_and_download(
  'https://example.com/login',
  'your_username',
  'your_password',
  'https://example.com/protected/file.pdf',
  'downloads/protected_file.pdf'
)

Handling Different File Types

Different file types may require specific handling approaches:

require 'net/http'
require 'mime/types'

class TypedFileDownloader
  def self.download_with_type_detection(url, download_dir)
    uri = URI(url)

    Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
      request = Net::HTTP::Get.new(uri)

      response = http.request(request)

      if response.code == '200'
        # Detect file type from Content-Type header
        content_type = response['content-type']
        extension = get_extension_from_content_type(content_type)

        # Generate filename
        filename = File.basename(uri.path)
        filename = "downloaded_file#{extension}" if filename.empty? || !filename.include?('.')

        local_path = File.join(download_dir, filename)

        File.open(local_path, 'wb') do |file|
          file.write(response.body)
        end

        puts "Downloaded #{content_type} file: #{local_path}"
        local_path
      end
    end
  end

  # `private` does not apply to class methods, so use private_class_method instead
  def self.get_extension_from_content_type(content_type)
    # Strip any charset parameter (e.g. "application/pdf; charset=utf-8") before the lookup
    mime_type = MIME::Types[content_type.to_s.split(';').first.to_s.strip].first
    mime_type ? ".#{mime_type.preferred_extension}" : '.bin'
  end
  private_class_method :get_extension_from_content_type
end

Progress Tracking for Large Files

For large file downloads, implementing progress tracking is essential:

require 'net/http'
require 'ruby-progressbar'

def download_with_progress(url, local_path)
  uri = URI(url)

  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    request = Net::HTTP::Get.new(uri)

    http.request(request) do |response|
      if response.code == '200'
        total_size = response['content-length'].to_i
        progress_bar = ProgressBar.create(title: 'Downloading', total: total_size) if total_size > 0

        File.open(local_path, 'wb') do |file|
          response.read_body do |chunk|
            file.write(chunk)
            progress_bar.progress = [progress_bar.progress + chunk.bytesize, total_size].min if progress_bar
          end
        end

        progress_bar.finish if progress_bar
        puts "\nDownload completed: #{local_path}"
      end
    end
  end
end

Concurrent Downloads

For downloading multiple files efficiently, use concurrent processing:

require 'concurrent'
require 'net/http'

class ConcurrentDownloader
  def initialize(max_threads: 5)
    @thread_pool = Concurrent::FixedThreadPool.new(max_threads)
  end

  def download_files(url_path_pairs)
    futures = url_path_pairs.map do |url, local_path|
      Concurrent::Future.execute(executor: @thread_pool) do
        download_single_file(url, local_path)
      end
    end

    # Wait for all downloads to complete
    results = futures.map(&:value)
    @thread_pool.shutdown
    @thread_pool.wait_for_termination

    results
  end

  private

  def download_single_file(url, local_path)
    uri = URI(url)

    begin
      Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
        request = Net::HTTP::Get.new(uri)
        response = http.request(request)

        if response.code == '200'
          File.open(local_path, 'wb') { |file| file.write(response.body) }
          { url: url, path: local_path, status: :success }
        else
          { url: url, path: local_path, status: :failed, error: "HTTP #{response.code}" }
        end
      end
    rescue => e
      { url: url, path: local_path, status: :error, error: e.message }
    end
  end
end

# Usage
downloader = ConcurrentDownloader.new(max_threads: 3)
files_to_download = [
  ['https://example.com/file1.pdf', 'downloads/file1.pdf'],
  ['https://example.com/file2.jpg', 'downloads/file2.jpg'],
  ['https://example.com/file3.doc', 'downloads/file3.doc']
]

results = downloader.download_files(files_to_download)
results.each { |result| puts "#{result[:url]}: #{result[:status]}" }

Error Handling and Retry Logic

Robust file downloading requires proper error handling and retry mechanisms:

require 'net/http'
require 'retries'

class RobustDownloader
  def download_with_retry(url, local_path, max_retries: 3)
    with_retries(max_tries: max_retries, rescue: [Net::OpenTimeout, Net::ReadTimeout, Net::HTTPError]) do
      download_file(url, local_path)
    end
  rescue => e
    puts "Failed to download after #{max_retries} attempts: #{e.message}"
    false
  end

  private

  def download_file(url, local_path)
    uri = URI(url)

    http = Net::HTTP.new(uri.host, uri.port)
    http.use_ssl = uri.scheme == 'https'
    http.read_timeout = 30
    http.open_timeout = 10

    request = Net::HTTP::Get.new(uri)
    request['User-Agent'] = 'Mozilla/5.0 (Ruby Downloader)'

    response = http.request(request)

    case response
    when Net::HTTPSuccess
      File.open(local_path, 'wb') { |file| file.write(response.body) }
      true
    when Net::HTTPRedirection
      # Follow redirects, resolving relative Location headers against the original URL
      new_url = URI.join(url, response['location']).to_s
      download_file(new_url, local_path)
    else
      raise Net::HTTPError.new("HTTP Error: #{response.code}", response)
    end
  end
end

Integration with Browser Automation

For files that require JavaScript execution or complex user interactions (similar to how to handle file downloads in Puppeteer), you can use Selenium with Ruby:

require 'selenium-webdriver'
require 'fileutils'

class BrowserDownloader
  def initialize(download_dir)
    @download_dir = File.expand_path(download_dir)
    FileUtils.mkdir_p(@download_dir)

    options = Selenium::WebDriver::Chrome::Options.new
    options.add_preference('download.default_directory', @download_dir)
    options.add_preference('download.prompt_for_download', false)

    @driver = Selenium::WebDriver.for(:chrome, options: options)
  end

  def download_file_with_js(url, download_link_selector)
    @driver.navigate.to(url)

    # Wait for page to load and click download link
    wait = Selenium::WebDriver::Wait.new(timeout: 10)
    download_link = wait.until { @driver.find_element(css: download_link_selector) }

    download_link.click

    # Wait for download to complete
    wait_for_download_completion
  end

  def close
    @driver.quit
  end

  private

  def wait_for_download_completion(timeout: 30)
    start_time = Time.now

    loop do
      # Check if any .crdownload files exist (Chrome partial downloads)
      partial_files = Dir.glob(File.join(@download_dir, '*.crdownload'))
      break if partial_files.empty?

      if Time.now - start_time > timeout
        puts "Download timeout exceeded"
        break
      end

      sleep(1)
    end
  end
end

Advanced Use Cases with WebScraping.AI API

For complex scenarios where traditional Ruby libraries might face challenges with anti-bot measures or JavaScript-heavy sites, you can leverage specialized web scraping APIs. The WebScraping.AI API provides robust file download capabilities with built-in proxy rotation and JavaScript rendering:

require 'net/http'
require 'json'
require 'uri'

class WebScrapingAIDownloader
  def initialize(api_key)
    @api_key = api_key
    @base_url = 'https://api.webscraping.ai/download'
  end

  def download_file(url, local_path, options = {})
    params = {
      url: url,
      api_key: @api_key,
      proxy: options[:proxy] || 'datacenter',
      device: options[:device] || 'desktop',
      js: options[:js] || false
    }

    uri = URI(@base_url)
    uri.query = URI.encode_www_form(params)

    response = Net::HTTP.get_response(uri)

    if response.code == '200'
      File.open(local_path, 'wb') { |file| file.write(response.body) }
      puts "Downloaded via WebScraping.AI: #{local_path}"
      true
    else
      puts "API Error: #{response.code} - #{response.body}"
      false
    end
  end
end

# Usage
api_downloader = WebScrapingAIDownloader.new('your_api_key_here')
api_downloader.download_file(
  'https://example.com/protected-file.pdf',
  'downloads/protected_file.pdf',
  { proxy: 'residential', js: true }
)

Best Practices and Considerations

Performance Optimization

  1. Use streaming for large files to avoid memory issues
  2. Implement connection pooling or keep-alive reuse for multiple downloads (see the sketch after this list)
  3. Set appropriate timeouts to handle slow connections
  4. Use concurrent downloads with reasonable thread limits
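
As a minimal sketch of points 2 and 3, the example below reuses a single keep-alive connection with explicit timeouts to stream several files from one host; the host, paths, and timeout values are illustrative assumptions.

require 'net/http'
require 'fileutils'

# Sketch: reuse one keep-alive connection for several files from the same host.
# Host, paths, and timeout values below are illustrative placeholders.
def download_many_from_host(host, paths, download_dir)
  FileUtils.mkdir_p(download_dir)

  Net::HTTP.start(host, 443, use_ssl: true, open_timeout: 10, read_timeout: 30) do |http|
    paths.each do |path|
      local_path = File.join(download_dir, File.basename(path))

      http.request(Net::HTTP::Get.new(path)) do |response|
        next unless response.is_a?(Net::HTTPSuccess)

        # Stream the body to disk instead of buffering it in memory
        File.open(local_path, 'wb') do |file|
          response.read_body { |chunk| file.write(chunk) }
        end
      end
    end
  end
end

# download_many_from_host('example.com', ['/files/a.pdf', '/files/b.pdf'], 'downloads')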

Security Considerations

  1. Validate file types before saving
  2. Sanitize filenames to prevent directory traversal attacks (a sketch covering points 1-3 follows this list)
  3. Check file sizes to prevent disk space exhaustion
  4. Use HTTPS when possible for secure downloads
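
A minimal sketch of points 1-3, assuming an illustrative extension whitelist and a 50 MB cap; adjust both values to your own requirements.

require 'uri'

# Illustrative whitelist and size cap; tune these to your project's needs.
ALLOWED_EXTENSIONS = %w[.pdf .jpg .png .csv .zip].freeze
MAX_BYTES = 50 * 1024 * 1024

# Strip directory components and unsafe characters, then enforce the whitelist.
def sanitize_filename(url)
  name = File.basename(URI(url).path).gsub(/[^0-9A-Za-z.\-_]/, '_')
  unless ALLOWED_EXTENSIONS.include?(File.extname(name).downcase)
    raise ArgumentError, "Disallowed file type: #{name}"
  end
  name
end

# Reject responses whose declared size exceeds the cap before writing anything.
def size_within_limit?(response)
  length = response['content-length']
  length.nil? || length.to_i <= MAX_BYTES
end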

Error Handling

  1. Implement retry logic for transient failures
  2. Handle different HTTP status codes appropriately
  3. Log download attempts for debugging
  4. Validate downloaded files for completeness (see the sketch after this list)
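
For point 4, one hedged approach is to compare the saved file's size against the Content-Length header whenever the server provides one:

# Sketch: verify a finished download against the server's declared length.
# Returns true when no Content-Length header was sent, since there is nothing to compare.
def download_complete?(local_path, response)
  expected = response['content-length']
  return true if expected.nil?

  File.size(local_path) == expected.to_i
end

# Example: re-queue the URL for a retry when the sizes do not match.
# retry_queue << url unless download_complete?(local_path, response)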

Common Challenges and Solutions

When downloading files through web scraping, you might encounter several challenges:

  • Authentication Requirements: Use Mechanize for session-based authentication, or handle authentication flows with browser automation
  • Dynamic Content: For JavaScript-generated download links, use Selenium WebDriver
  • Rate Limiting: Implement delays and respect robots.txt files (a simple throttling sketch follows this list)
  • Large Files: Use streaming downloads and progress tracking
  • File Type Detection: Check Content-Type headers and validate file extensions
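
As a minimal rate-limiting sketch, the class below enforces a minimum interval between consecutive requests; the interval value is an assumption, and robots.txt parsing is left to a dedicated library.

# Sketch: enforce a minimum delay between consecutive requests.
# The default 1-second interval is an illustrative assumption, not a universal rule.
class Throttle
  def initialize(min_interval = 1.0)
    @min_interval = min_interval
    @last_request_at = nil
  end

  # Sleeps just long enough to honor the minimum interval, then runs the block.
  def wait
    if @last_request_at
      elapsed = Time.now - @last_request_at
      sleep(@min_interval - elapsed) if elapsed < @min_interval
    end
    @last_request_at = Time.now
    yield if block_given?
  end
end

# throttle = Throttle.new(1.5)
# urls.each { |url| throttle.wait { download_file(url, "downloads/#{File.basename(url)}") } }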

Conclusion

Ruby provides multiple approaches for handling file downloads during web scraping, from simple HTTP requests with Net::HTTP to sophisticated browser automation with Selenium. Choose the method that best fits your specific requirements:

  • Use Net::HTTP or Open-URI for simple, direct file downloads
  • Use Mechanize for downloads requiring session management or form interactions
  • Use Selenium for downloads requiring JavaScript execution or complex user interactions
  • Use WebScraping.AI API for enterprise-grade scraping with advanced anti-bot protection
  • Implement concurrent downloads for better performance with multiple files
  • Always include proper error handling and retry logic for production applications

The key to successful file downloading in Ruby web scraping is understanding your target website's requirements and choosing the appropriate tool and technique for your specific use case. Whether you're downloading a single document or processing thousands of files, Ruby's ecosystem provides the tools you need to build robust and efficient download solutions.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
