Can I use HTTParty to download files from websites?

Yes, HTTParty can definitely be used to download files from websites. HTTParty is a powerful Ruby gem that provides an intuitive interface for making HTTP requests, including downloading various file types such as images, PDFs, documents, and archives. This guide covers different approaches for file downloads using HTTParty, from simple downloads to handling large files efficiently.

Basic File Download with HTTParty

The simplest way to download a file with HTTParty is using a GET request and saving the response body to a file:

require 'httparty'

# Download a file and save it locally
response = HTTParty.get('https://example.com/document.pdf')

if response.success?
  File.open('downloaded_document.pdf', 'wb') do |file|
    file.write(response.body)
  end
  puts "File downloaded successfully"
else
  puts "Download failed: #{response.code}"
end

Advanced File Download Configuration

For more control over the download process, you can configure various HTTParty options:

require 'httparty'

class FileDownloader
  include HTTParty

  # Set common options for all requests
  headers 'User-Agent' => 'Mozilla/5.0 (compatible; FileDownloader/1.0)'
  default_timeout 300  # 5-minute timeout for large files
  follow_redirects true

  def self.download_file(url, local_path, options = {})
    # Merge per-request options with the class-level defaults
    request_options = {
      headers: headers.merge(options[:headers] || {}),
      timeout: options[:timeout] || 300,
      stream_body: true  # only streams when a block is passed to the request
    }

    # Stream each fragment straight to disk instead of buffering the body
    response = File.open(local_path, 'wb') do |file|
      get(url, request_options) { |fragment| file.write(fragment) }
    end

    if response.success?
      {
        success: true,
        file_size: File.size(local_path),
        content_type: response.headers['content-type']
      }
    else
      File.delete(local_path) if File.exist?(local_path)  # discard partial body
      { success: false, error: "HTTP #{response.code}: #{response.message}" }
    end
  end
end

# Usage
result = FileDownloader.download_file(
  'https://example.com/large-file.zip',
  './downloads/large-file.zip',
  { timeout: 600, headers: { 'Authorization' => 'Bearer token123' } }
)

puts result[:success] ? "Downloaded #{result[:file_size]} bytes" : result[:error]

Streaming Large Files

For downloading large files, streaming is essential to avoid memory issues:

require 'httparty'

def download_large_file(url, local_path)
  # Passing a block enables real streaming: each fragment is written to
  # disk as it arrives instead of the whole body being held in memory
  File.open(local_path, 'wb') do |file|
    response = HTTParty.get(url, stream_body: true, timeout: 600) do |fragment|
      file.write(fragment)
    end

    if response.success?
      file_size = File.size(local_path)
      puts "Successfully downloaded #{file_size} bytes to #{local_path}"
      true
    else
      puts "Download failed: #{response.code} - #{response.message}"
      false
    end
  end
rescue => e
  puts "Error during download: #{e.message}"
  false
end

# Download a large file with streaming
download_large_file(
  'https://example.com/large-dataset.csv',
  './data/large-dataset.csv'
)

Progress Tracking for File Downloads

You can implement progress tracking for better user experience:

require 'httparty'

class ProgressDownloader
  include HTTParty

  def self.download_with_progress(url, local_path)
    response = head(url)  # Get file size first
    total_size = response.headers['content-length']&.to_i

    downloaded = 0

    File.open(local_path, 'wb') do |file|
      get(url, stream_body: true) do |chunk|
        file.write(chunk)
        downloaded += chunk.size

        if total_size && total_size > 0
          progress = (downloaded.to_f / total_size * 100).round(2)
          print "\rProgress: #{progress}% (#{downloaded}/#{total_size} bytes)"
        else
          print "\rDownloaded: #{downloaded} bytes"
        end
      end
    end

    puts "\nDownload completed!"
  end
end

# Usage with progress tracking
ProgressDownloader.download_with_progress(
  'https://example.com/video.mp4',
  './downloads/video.mp4'
)

Handling Different File Types

HTTParty can handle various file types. Here's how to detect and process different formats:

require 'httparty'
require 'uri'
require 'fileutils'
require 'mime/types'  # from the mime-types gem

class SmartFileDownloader
  include HTTParty

  def self.download_and_identify(url, download_dir = './downloads')
    response = get(url, follow_redirects: true)

    return { success: false, error: "HTTP #{response.code}" } unless response.success?

    # Determine file extension from content type or URL
    content_type = response.headers['content-type']
    extension = determine_extension(url, content_type)

    # Generate filename
    filename = generate_filename(url, extension)
    local_path = File.join(download_dir, filename)

    # Ensure the download directory (and any parent directories) exists
    FileUtils.mkdir_p(download_dir)

    # Save file
    File.open(local_path, 'wb') { |file| file.write(response.body) }

    {
      success: true,
      file_path: local_path,
      file_size: File.size(local_path),
      content_type: content_type,
      extension: extension
    }
  end

  def self.determine_extension(url, content_type)
    # Try to get the extension from the URL path first
    url_extension = File.extname(URI.parse(url).path)
    return url_extension unless url_extension.empty?

    # Fall back to the MIME type, stripping parameters such as "; charset=utf-8"
    media_type = content_type.to_s.split(';').first.to_s.strip
    ext = media_type.empty? ? nil : MIME::Types[media_type].first&.preferred_extension
    ext ? ".#{ext}" : '.bin'  # preferred_extension has no leading dot
  end

  def self.generate_filename(url, extension)
    base_name = File.basename(URI.parse(url).path, '.*')
    base_name = 'download' if base_name.empty?
    timestamp = Time.now.strftime('%Y%m%d_%H%M%S')
    "#{base_name}_#{timestamp}#{extension}"
  end

  # A bare `private` has no effect on methods defined with `def self.`,
  # so mark the helpers private explicitly:
  private_class_method :determine_extension, :generate_filename
end

# Usage
result = SmartFileDownloader.download_and_identify('https://example.com/document')
if result[:success]
  puts "Downloaded: #{result[:file_path]} (#{result[:file_size]} bytes)"
else
  puts "Download failed: #{result[:error]}"
end

Error Handling and Retry Logic

Robust file downloading requires proper error handling and retry mechanisms:

require 'httparty'

class RobustDownloader
  include HTTParty

  MAX_RETRIES = 3
  RETRY_DELAY = 2  # seconds

  def self.download_with_retry(url, local_path, max_retries = MAX_RETRIES)
    retries = 0

    begin
      response = get(url, {
        timeout: 300,
        follow_redirects: true,
        headers: { 'User-Agent' => 'RobustDownloader/1.0' }
      })

      case response.code
      when 200..299
        File.open(local_path, 'wb') { |file| file.write(response.body) }
        return { success: true, retries: retries }

      when 404
        return { success: false, error: 'File not found', permanent: true }

      when 403
        return { success: false, error: 'Access forbidden', permanent: true }

      when 500..599
        raise "Server error: #{response.code}"

      else
        raise "Unexpected response: #{response.code}"
      end

    rescue Net::OpenTimeout, Net::ReadTimeout => e
      retries += 1
      if retries <= max_retries
        puts "Timeout occurred, retrying in #{RETRY_DELAY} seconds... (#{retries}/#{max_retries})"
        sleep(RETRY_DELAY)
        retry
      else
        return { success: false, error: "Timeout after #{max_retries} retries" }
      end

    rescue => e
      retries += 1
      if retries <= max_retries
        puts "Error occurred: #{e.message}, retrying... (#{retries}/#{max_retries})"
        sleep(RETRY_DELAY)
        retry
      else
        return { success: false, error: "Failed after #{max_retries} retries: #{e.message}" }
      end
    end
  end
end

# Usage with retry logic
result = RobustDownloader.download_with_retry(
  'https://example.com/unreliable-file.pdf',
  './downloads/document.pdf'
)

puts result[:success] ? "Download successful!" : "Download failed: #{result[:error]}"

Batch File Downloads

For downloading multiple files efficiently:

require 'httparty'
require 'uri'
require 'fileutils'
# Thread, Queue, and Mutex are core classes; require 'thread' is not needed

class BatchDownloader
  include HTTParty

  def self.download_multiple(urls, download_dir = './downloads', max_threads = 5)
    FileUtils.mkdir_p(download_dir)

    results = []
    mutex = Mutex.new

    # Create thread pool
    threads = []
    work_queue = Queue.new

    # Add URLs to work queue
    urls.each_with_index { |url, index| work_queue << [url, index] }

    # Create worker threads
    max_threads.times do
      threads << Thread.new do
        until work_queue.empty?
          begin
            url, index = work_queue.pop(true)  # non-blocking pop; raises ThreadError when empty

            filename = "file_#{index}_#{File.basename(URI.parse(url).path)}"
            local_path = File.join(download_dir, filename)

            response = get(url, timeout: 60)

            if response.success?
              File.open(local_path, 'wb') { |file| file.write(response.body) }

              mutex.synchronize do
                results << {
                  url: url,
                  success: true,
                  file_path: local_path,
                  file_size: File.size(local_path)
                }
              end
            else
              mutex.synchronize do
                results << {
                  url: url,
                  success: false,
                  error: "HTTP #{response.code}"
                }
              end
            end

          rescue ThreadError
            # Queue is empty, exit thread
            break
          rescue => e
            mutex.synchronize do
              results << {
                url: url,
                success: false,
                error: e.message
              }
            end
          end
        end
      end
    end

    # Wait for all threads to complete
    threads.each(&:join)

    results
  end
end

# Usage for batch downloads
urls = [
  'https://example.com/file1.pdf',
  'https://example.com/file2.jpg',
  'https://example.com/file3.zip'
]

results = BatchDownloader.download_multiple(urls, './batch_downloads')

results.each do |result|
  if result[:success]
    puts "✓ Downloaded: #{result[:file_path]}"
  else
    puts "✗ Failed: #{result[:url]} - #{result[:error]}"
  end
end

Integration with Authentication

When downloading files from protected resources:

require 'httparty'
require 'base64'

class AuthenticatedDownloader
  include HTTParty

  def initialize(api_key: nil, bearer_token: nil, basic_auth: nil)
    @auth_headers = build_auth_headers(api_key, bearer_token, basic_auth)
  end

  def download_protected_file(url, local_path)
    response = self.class.get(url, {
      headers: @auth_headers,
      timeout: 300,
      follow_redirects: true
    })

    if response.success?
      File.open(local_path, 'wb') { |file| file.write(response.body) }
      { success: true, file_size: File.size(local_path) }
    else
      { success: false, error: "HTTP #{response.code}: #{response.message}" }
    end
  end

  private

  def build_auth_headers(api_key, bearer_token, basic_auth)
    headers = { 'User-Agent' => 'AuthenticatedDownloader/1.0' }

    if api_key
      headers['X-API-Key'] = api_key
    elsif bearer_token
      headers['Authorization'] = "Bearer #{bearer_token}"
    elsif basic_auth
      headers['Authorization'] = "Basic #{Base64.strict_encode64("#{basic_auth[:username]}:#{basic_auth[:password]}")}"
    end

    headers
  end
end

# Usage with different authentication methods
downloader = AuthenticatedDownloader.new(bearer_token: 'your_access_token')
result = downloader.download_protected_file(
  'https://api.example.com/private/document.pdf',
  './secure_downloads/document.pdf'
)

Best Practices and Considerations

When using HTTParty for file downloads, consider these best practices:

  1. Memory Management: Always stream large files (stream_body: true with a block) to prevent memory exhaustion
  2. Timeout Configuration: Set appropriate timeouts based on expected file sizes and network conditions
  3. Error Handling: Implement comprehensive error handling with retry logic for network failures
  4. Progress Tracking: For user-facing applications, provide download progress feedback
  5. File Validation: Verify downloaded files using checksums or file size validation when possible (see the sketch after this list)
  6. Security: Validate URLs and file paths to prevent directory traversal attacks (also shown below)
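
For points 5 and 6, here is a minimal sketch, assuming the file has already been downloaded; the expected_hex_digest value, the ./downloads directory, and the report.pdf filename are placeholders for illustration:

require 'digest'

# Compare a downloaded file against a known SHA-256 digest.
# In practice the expected digest would come from the file's publisher
# or from your own manifest.
def verify_download(local_path, expected_sha256)
  Digest::SHA256.file(local_path).hexdigest == expected_sha256
end

# Resolve a filename inside a download directory, rejecting names
# (such as "../../etc/passwd") that would escape it.
def safe_join(download_dir, filename)
  base = File.expand_path(download_dir)
  candidate = File.expand_path(File.join(base, filename))
  unless candidate.start_with?(base + File::SEPARATOR)
    raise ArgumentError, "path escapes download directory: #{filename}"
  end
  candidate
end

path = safe_join('./downloads', 'report.pdf')
puts verify_download(path, 'expected_hex_digest') ? 'Checksum OK' : 'Checksum mismatch'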

Performance Optimization Tips

  • Reuse connections when downloading many files from the same host (HTTParty opens a new connection for each request by default)
  • Implement concurrent downloads with thread pools for batch operations
  • When streaming, write each fragment straight to disk rather than accumulating chunks in memory
  • Request compressed responses (via the Accept-Encoding header) when downloading text-based files
  • Monitor network usage and implement rate limiting to avoid overwhelming servers (a minimal sketch follows)
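
As one way to implement the last point, here is a minimal sketch that enforces a fixed minimum interval between sequential requests; the one-second interval and the example URLs are assumptions for illustration:

require 'httparty'
require 'fileutils'

# Space out sequential requests by enforcing a minimum interval between them.
# The 1-second default is an arbitrary placeholder; tune it to the target
# server's published limits.
class RateLimitedDownloader
  def initialize(min_interval: 1.0)
    @min_interval = min_interval
    @last_request_at = nil
  end

  def download(url, local_path)
    # Sleep only for the remainder of the interval since the last request
    wait = @last_request_at ? @min_interval - (Time.now - @last_request_at) : 0
    sleep(wait) if wait > 0
    @last_request_at = Time.now

    response = HTTParty.get(url, timeout: 60)
    File.open(local_path, 'wb') { |f| f.write(response.body) } if response.success?
    response.success?
  end
end

FileUtils.mkdir_p('./downloads')
limiter = RateLimitedDownloader.new(min_interval: 1.0)
%w[https://example.com/a.pdf https://example.com/b.pdf].each_with_index do |url, i|
  limiter.download(url, "./downloads/file_#{i}.pdf")
end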

HTTParty provides a robust foundation for file downloading in Ruby applications. While it excels at straightforward HTTP-based downloads, for more complex scenarios involving JavaScript-rendered content, you might need to consider how to handle file downloads in Puppeteer for browser-based automation.

Whether you're building a simple file downloader or a complex web scraping application, HTTParty's flexibility and ease of use make it an excellent choice for handling HTTP-based file downloads efficiently and reliably.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
