Can I use HTTParty to download files from websites?
Yes, HTTParty can definitely be used to download files from websites. HTTParty is a powerful Ruby gem that provides an intuitive interface for making HTTP requests, including downloading various file types such as images, PDFs, documents, and archives. This guide covers different approaches for file downloads using HTTParty, from simple downloads to handling large files efficiently.
Basic File Download with HTTParty
The simplest way to download a file with HTTParty is using a GET request and saving the response body to a file:
require 'httparty'
# Download a file and save it locally
response = HTTParty.get('https://example.com/document.pdf')
if response.success?
File.open('downloaded_document.pdf', 'wb') do |file|
file.write(response.body)
end
puts "File downloaded successfully"
else
puts "Download failed: #{response.code}"
end
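As a shorthand, plain Ruby's File.binwrite performs the same binary-mode open, write, and close in one call (this is a Ruby core method, not an HTTParty feature):
File.binwrite('downloaded_document.pdf', response.body)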
Advanced File Download Configuration
For more control over the download process, you can configure various HTTParty options:
require 'httparty'
require 'fileutils'
class FileDownloader
include HTTParty
# Set common options for all requests
headers 'User-Agent' => 'Mozilla/5.0 (compatible; FileDownloader/1.0)'
default_timeout 300 # 5-minute timeout for large files (HTTParty's class-level macro is default_timeout)
follow_redirects true
def self.download_file(url, local_path, options = {})
# Merge custom options; note that this variant buffers the whole body
# in memory (see the streaming section below for large files)
request_options = {
  headers: headers.merge(options[:headers] || {}),
  timeout: options[:timeout] || 300
}
response = get(url, request_options)
if response.success?
# Create the target directory if needed, then write in binary mode
FileUtils.mkdir_p(File.dirname(local_path))
File.open(local_path, 'wb') do |file|
  file.write(response.body)
end
{
success: true,
file_size: File.size(local_path),
content_type: response.headers['content-type']
}
else
{ success: false, error: "HTTP #{response.code}: #{response.message}" }
end
end
end
# Usage
result = FileDownloader.download_file(
'https://example.com/large-file.zip',
'./downloads/large-file.zip',
{ timeout: 600, headers: { 'Authorization' => 'Bearer token123' } }
)
puts result[:success] ? "Downloaded #{result[:file_size]} bytes" : result[:error]
Streaming Large Files
For downloading large files, streaming is essential to avoid memory issues:
require 'httparty'
def download_large_file(url, local_path)
  success = false
  File.open(local_path, 'wb') do |file|
    # Passing a block along with stream_body: true makes HTTParty yield
    # the body in fragments as it arrives, instead of buffering the
    # whole file in memory
    response = HTTParty.get(url, stream_body: true) do |fragment|
      # Fragments are HTTParty::ResponseFragment objects; write only
      # 200 bodies so redirect responses are skipped
      file.write(fragment) if fragment.code == 200
    end
    success = response.success?
    puts "Download failed: #{response.code} - #{response.message}" unless success
  end

  if success
    puts "Successfully downloaded #{File.size(local_path)} bytes to #{local_path}"
  else
    File.delete(local_path) if File.exist?(local_path) # discard partial file
  end
  success
rescue => e
  puts "Error during download: #{e.message}"
  false
end
# Download a large file with streaming
download_large_file(
'https://example.com/large-dataset.csv',
'./data/large-dataset.csv'
)
Progress Tracking for File Downloads
You can implement progress tracking for better user experience:
require 'httparty'
class ProgressDownloader
include HTTParty
def self.download_with_progress(url, local_path)
response = head(url) # Get file size first
total_size = response.headers['content-length']&.to_i
downloaded = 0
File.open(local_path, 'wb') do |file|
get(url, stream_body: true) do |fragment|
  # Skip non-200 fragments (e.g. redirect responses) before writing
  next unless fragment.code == 200

  file.write(fragment)
  downloaded += fragment.size
if total_size && total_size > 0
progress = (downloaded.to_f / total_size * 100).round(2)
print "\rProgress: #{progress}% (#{downloaded}/#{total_size} bytes)"
else
print "\rDownloaded: #{downloaded} bytes"
end
end
end
puts "\nDownload completed!"
end
end
# Usage with progress tracking
ProgressDownloader.download_with_progress(
'https://example.com/video.mp4',
'./downloads/video.mp4'
)
Handling Different File Types
HTTParty can handle various file types. Here's how to detect and process different formats:
require 'httparty'
require 'uri'
require 'fileutils'
require 'mime/types' # external gem: gem install mime-types
class SmartFileDownloader
include HTTParty
def self.download_and_identify(url, download_dir = './downloads')
response = get(url, follow_redirects: true)
return { success: false, error: "HTTP #{response.code}" } unless response.success?
# Determine file extension from content type or URL
content_type = response.headers['content-type']
extension = determine_extension(url, content_type)
# Generate filename
filename = generate_filename(url, extension)
local_path = File.join(download_dir, filename)
# Ensure download directory exists
FileUtils.mkdir_p(download_dir)
# Save file
File.open(local_path, 'wb') { |file| file.write(response.body) }
{
success: true,
file_path: local_path,
file_size: File.size(local_path),
content_type: content_type,
extension: extension
}
end
private
def self.determine_extension(url, content_type)
# Try to get extension from URL
url_extension = File.extname(URI.parse(url).path)
return url_extension unless url_extension.empty?
# Fall back to the MIME type, ignoring any "; charset=..." suffix.
# preferred_extension omits the leading dot, so add it back.
media_type = content_type.to_s.split(';').first.to_s.strip
preferred = MIME::Types[media_type].first&.preferred_extension
preferred ? ".#{preferred}" : '.bin'
end
def self.generate_filename(url, extension)
base_name = File.basename(URI.parse(url).path, '.*')
base_name = 'download' if base_name.empty?
timestamp = Time.now.strftime('%Y%m%d_%H%M%S')
"#{base_name}_#{timestamp}#{extension}"
end
end
# Usage
result = SmartFileDownloader.download_and_identify('https://example.com/document')
puts "Downloaded: #{result[:file_path]} (#{result[:file_size]} bytes)"
Error Handling and Retry Logic
Robust file downloading requires proper error handling and retry mechanisms:
require 'httparty'
class RobustDownloader
include HTTParty
MAX_RETRIES = 3
RETRY_DELAY = 2 # seconds
def self.download_with_retry(url, local_path, max_retries = MAX_RETRIES)
retries = 0
begin
response = get(url, {
timeout: 300,
follow_redirects: true,
headers: { 'User-Agent' => 'RobustDownloader/1.0' }
})
case response.code
when 200..299
File.open(local_path, 'wb') { |file| file.write(response.body) }
return { success: true, retries: retries }
when 404
return { success: false, error: 'File not found', permanent: true }
when 403
return { success: false, error: 'Access forbidden', permanent: true }
when 500..599
raise "Server error: #{response.code}"
else
raise "Unexpected response: #{response.code}"
end
rescue Net::OpenTimeout, Net::ReadTimeout, Timeout::Error
retries += 1
if retries <= max_retries
puts "Timeout occurred, retrying in #{RETRY_DELAY} seconds... (#{retries}/#{max_retries})"
sleep(RETRY_DELAY)
retry
else
return { success: false, error: "Timeout after #{max_retries} retries" }
end
rescue => e
retries += 1
if retries <= max_retries
puts "Error occurred: #{e.message}, retrying... (#{retries}/#{max_retries})"
sleep(RETRY_DELAY)
retry
else
return { success: false, error: "Failed after #{max_retries} retries: #{e.message}" }
end
end
end
end
# Usage with retry logic
result = RobustDownloader.download_with_retry(
'https://example.com/unreliable-file.pdf',
'./downloads/document.pdf'
)
puts result[:success] ? "Download successful!" : "Download failed: #{result[:error]}"
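The class above waits a fixed RETRY_DELAY between attempts; a common refinement is exponential backoff, where the wait doubles on every retry. A one-line tweak to the sleep calls would be:
# Hypothetical drop-in for the fixed sleep(RETRY_DELAY) calls above:
sleep(RETRY_DELAY * (2 ** (retries - 1))) # waits 2s, 4s, 8s, ...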
Batch File Downloads
For downloading multiple files efficiently:
require 'httparty'
require 'fileutils'
require 'uri'
class BatchDownloader
include HTTParty
def self.download_multiple(urls, download_dir = './downloads', max_threads = 5)
FileUtils.mkdir_p(download_dir)
results = []
mutex = Mutex.new
# Create thread pool
threads = []
work_queue = Queue.new
# Add URLs to work queue
urls.each_with_index { |url, index| work_queue << [url, index] }
# Create worker threads
max_threads.times do
threads << Thread.new do
loop do
begin
# Non-blocking pop raises ThreadError once the queue is empty
url, index = work_queue.pop(true)
filename = "file_#{index}_#{File.basename(URI.parse(url).path)}"
local_path = File.join(download_dir, filename)
response = get(url, timeout: 60)
if response.success?
File.open(local_path, 'wb') { |file| file.write(response.body) }
mutex.synchronize do
results << {
url: url,
success: true,
file_path: local_path,
file_size: File.size(local_path)
}
end
else
mutex.synchronize do
results << {
url: url,
success: false,
error: "HTTP #{response.code}"
}
end
end
rescue ThreadError
# Queue is empty, exit thread
break
rescue => e
mutex.synchronize do
results << {
url: url,
success: false,
error: e.message
}
end
end
end
end
end
# Wait for all threads to complete
threads.each(&:join)
results
end
end
# Usage for batch downloads
urls = [
'https://example.com/file1.pdf',
'https://example.com/file2.jpg',
'https://example.com/file3.zip'
]
results = BatchDownloader.download_multiple(urls, './batch_downloads')
results.each do |result|
if result[:success]
puts "✓ Downloaded: #{result[:file_path]}"
else
puts "✗ Failed: #{result[:url]} - #{result[:error]}"
end
end
Integration with Authentication
When downloading files from protected resources:
require 'httparty'
require 'base64'
class AuthenticatedDownloader
include HTTParty
def initialize(api_key: nil, bearer_token: nil, basic_auth: nil)
@auth_headers = build_auth_headers(api_key, bearer_token, basic_auth)
end
def download_protected_file(url, local_path)
response = self.class.get(url, {
headers: @auth_headers,
timeout: 300,
follow_redirects: true
})
if response.success?
File.open(local_path, 'wb') { |file| file.write(response.body) }
{ success: true, file_size: File.size(local_path) }
else
{ success: false, error: "HTTP #{response.code}: #{response.message}" }
end
end
private
def build_auth_headers(api_key, bearer_token, basic_auth)
headers = { 'User-Agent' => 'AuthenticatedDownloader/1.0' }
if api_key
headers['X-API-Key'] = api_key
elsif bearer_token
headers['Authorization'] = "Bearer #{bearer_token}"
elsif basic_auth
headers['Authorization'] = "Basic #{Base64.strict_encode64("#{basic_auth[:username]}:#{basic_auth[:password]}")}"
end
headers
end
end
# Usage with different authentication methods
downloader = AuthenticatedDownloader.new(bearer_token: 'your_access_token')
result = downloader.download_protected_file(
'https://api.example.com/private/document.pdf',
'./secure_downloads/document.pdf'
)
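The same class also accepts HTTP Basic credentials, and HTTParty can alternatively build the Authorization header itself through its basic_auth request option. A short sketch, with placeholder credentials:
# Using the class above with HTTP Basic credentials
downloader = AuthenticatedDownloader.new(
  basic_auth: { username: 'user', password: 'secret' }
)
result = downloader.download_protected_file(
  'https://api.example.com/private/document.pdf',
  './secure_downloads/document.pdf'
)

# Or letting HTTParty encode the header for you
response = HTTParty.get(
  'https://api.example.com/private/document.pdf',
  basic_auth: { username: 'user', password: 'secret' }
)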
Best Practices and Considerations
When using HTTParty for file downloads, consider these best practices:
- Memory Management: Always use streaming (stream_body: true) for large files to prevent memory exhaustion
- Timeout Configuration: Set appropriate timeouts based on expected file sizes and network conditions
- Error Handling: Implement comprehensive error handling with retry logic for network failures
- Progress Tracking: For user-facing applications, provide download progress feedback
- File Validation: Verify downloaded files using checksums or file size validation when possible (a checksum sketch follows this list)
- Security: Validate URLs and file paths to prevent directory traversal attacks (also shown in the sketch below)
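To make the validation and security points concrete, here is a minimal sketch, assuming the publisher supplies a SHA-256 digest for the file (expected_sha256 below is a placeholder) and that filenames may come from untrusted URLs:
require 'httparty'
require 'digest'

# Download a file and verify it against a known SHA-256 digest.
# expected_sha256 is a placeholder -- in practice it comes from the
# file's publisher (e.g. a checksums.txt alongside the download).
def verified_download(url, local_path, expected_sha256)
  response = HTTParty.get(url)
  return false unless response.success?

  File.open(local_path, 'wb') { |file| file.write(response.body) }

  if Digest::SHA256.file(local_path).hexdigest == expected_sha256
    true
  else
    File.delete(local_path) # discard corrupt or tampered downloads
    false
  end
end

# Constrain server-influenced names to the download directory:
# File.basename strips path components such as "../../etc/passwd"
def safe_download_path(download_dir, remote_name)
  File.join(download_dir, File.basename(remote_name))
end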
Performance Optimization Tips
- Use connection pooling for multiple downloads from the same domain
- Implement concurrent downloads with thread pools for batch operations
- Set appropriate buffer sizes for streaming large files
- Consider using compression when downloading text-based files
- Monitor network usage and implement rate limiting to avoid overwhelming servers (a simple approach is sketched below)
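As one illustration of the last point, here is a minimal client-side rate limiter, assuming a fixed minimum interval between sequential requests is acceptable (MIN_INTERVAL is an arbitrary placeholder value):
require 'httparty'
require 'uri'

MIN_INTERVAL = 1.0 # seconds between requests; tune per target server

def politely_download(urls, download_dir)
  last_request_at = nil

  urls.each do |url|
    # Wait out whatever remains of the minimum interval
    if last_request_at
      elapsed = Time.now - last_request_at
      sleep(MIN_INTERVAL - elapsed) if elapsed < MIN_INTERVAL
    end
    last_request_at = Time.now

    response = HTTParty.get(url)
    next unless response.success?

    path = File.join(download_dir, File.basename(URI.parse(url).path))
    File.open(path, 'wb') { |file| file.write(response.body) }
  end
end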
HTTParty provides a robust foundation for file downloading in Ruby applications. While it excels at straightforward HTTP-based downloads, for more complex scenarios involving JavaScript-rendered content, you might need to consider how to handle file downloads in Puppeteer for browser-based automation.
Whether you're building a simple file downloader or a complex web scraping application, HTTParty's flexibility and ease of use make it an excellent choice for handling HTTP-based file downloads efficiently and reliably.