# How to Handle File Downloads During Web Scraping with Ruby
File downloads are a common requirement in web scraping projects, whether you're downloading images, PDFs, documents, or data files. Ruby provides several robust approaches for handling file downloads during web scraping, from simple HTTP requests to more sophisticated browser automation tools.
## Basic File Downloads with Net::HTTP

Ruby's built-in `Net::HTTP` library provides the foundation for downloading files. Here's a basic example:
```ruby
require 'net/http'
require 'uri'
require 'fileutils'

def download_file(url, local_path)
  uri = URI(url)
  FileUtils.mkdir_p(File.dirname(local_path))

  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    request = Net::HTTP::Get.new(uri)

    http.request(request) do |response|
      if response.code == '200'
        # Stream the body to disk in chunks instead of buffering it in memory
        File.open(local_path, 'wb') do |file|
          response.read_body do |chunk|
            file.write(chunk)
          end
        end
        puts "Downloaded: #{local_path}"
      else
        puts "Failed to download: HTTP #{response.code}"
      end
    end
  end
end

# Usage
download_file('https://example.com/document.pdf', 'downloads/document.pdf')
```
## Simpler Downloads with Open-URI

For simpler use cases, Ruby's `open-uri` library provides a more concise approach:
```ruby
require 'open-uri'
require 'fileutils'

def simple_download(url, local_path)
  FileUtils.mkdir_p(File.dirname(local_path))

  begin
    URI.open(url) do |remote_file|
      File.open(local_path, 'wb') do |local_file|
        # Note: read loads the entire file into memory; fine for small files
        local_file.write(remote_file.read)
      end
    end
    puts "Downloaded: #{local_path}"
  rescue OpenURI::HTTPError => e
    puts "HTTP Error: #{e.message}"
  rescue => e
    puts "Error: #{e.message}"
  end
end

# Download with custom headers
def download_with_headers(url, local_path, headers = {})
  default_headers = {
    'User-Agent' => 'Mozilla/5.0 (Ruby Web Scraper)'
  }

  FileUtils.mkdir_p(File.dirname(local_path))

  URI.open(url, default_headers.merge(headers)) do |remote_file|
    File.open(local_path, 'wb') do |local_file|
      local_file.write(remote_file.read)
    end
  end
end
```
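For instance, some servers only serve files when a plausible `Referer` header is present. A hypothetical call (the URLs are placeholders):

```ruby
download_with_headers(
  'https://example.com/report.pdf',
  'downloads/report.pdf',
  'Referer' => 'https://example.com/reports' # hypothetical referer the server expects
)
```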
## Using Mechanize for Complex Downloads
Mechanize is excellent for handling downloads that require session management, form submissions, or authentication:
```ruby
require 'mechanize'
require 'fileutils'

class FileDownloader
  def initialize
    @agent = Mechanize.new
    @agent.user_agent_alias = 'Windows Chrome'
    @agent.follow_meta_refresh = true
  end

  def login_and_download(login_url, username, password, file_url, local_path)
    # Navigate to login page
    page = @agent.get(login_url)

    # Fill and submit login form
    form = page.forms.first
    form.field_with(name: 'username').value = username
    form.field_with(name: 'password').value = password
    @agent.submit(form) # establishes the authenticated session

    # Download file after authentication
    download_file(file_url, local_path)
  end

  def download_file(url, local_path)
    FileUtils.mkdir_p(File.dirname(local_path))

    file = @agent.get(url)
    file.save(local_path)
    puts "Downloaded: #{local_path}"
  rescue Mechanize::ResponseCodeError => e
    puts "Failed to download: #{e.response_code}"
  end

  def download_multiple_files(urls, download_dir)
    urls.each_with_index do |url, index|
      filename = File.basename(URI(url).path)
      filename = "file_#{index}.bin" if filename.empty?

      local_path = File.join(download_dir, filename)
      download_file(url, local_path)

      # Add delay to avoid overwhelming the server
      sleep(1)
    end
  end
end

# Usage
downloader = FileDownloader.new
downloader.login_and_download(
  'https://example.com/login',
  'your_username',
  'your_password',
  'https://example.com/protected/file.pdf',
  'downloads/protected_file.pdf'
)
```
## Handling Different File Types
Different file types may require specific handling approaches:
```ruby
require 'net/http'
require 'fileutils'
require 'mime/types' # gem install mime-types

class TypedFileDownloader
  def self.download_with_type_detection(url, download_dir)
    uri = URI(url)
    FileUtils.mkdir_p(download_dir)

    Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
      request = Net::HTTP::Get.new(uri)
      response = http.request(request)

      if response.code == '200'
        # Detect file type from the Content-Type header, ignoring any
        # charset parameter (e.g. "application/pdf; charset=binary")
        content_type = response['content-type'].to_s.split(';').first
        extension = get_extension_from_content_type(content_type)

        # Generate a filename, falling back to the detected extension
        filename = File.basename(uri.path)
        filename = "downloaded_file#{extension}" if filename.empty? || !filename.include?('.')

        local_path = File.join(download_dir, filename)
        File.open(local_path, 'wb') { |file| file.write(response.body) }

        puts "Downloaded #{content_type} file: #{local_path}"
        local_path
      end
    end
  end

  def self.get_extension_from_content_type(content_type)
    mime_type = MIME::Types[content_type].first
    mime_type ? ".#{mime_type.preferred_extension}" : '.bin'
  end
  # `private` has no effect on class methods, so mark it explicitly
  private_class_method :get_extension_from_content_type
end
```
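Usage is a single class-method call; the URL below is a placeholder for a download link whose extension isn't obvious:

```ruby
path = TypedFileDownloader.download_with_type_detection('https://example.com/download?id=42', 'downloads')
puts "Saved to: #{path}" if path
```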
## Progress Tracking for Large Files
For large file downloads, implementing progress tracking is essential:
```ruby
require 'net/http'
require 'ruby-progressbar' # gem install ruby-progressbar

def download_with_progress(url, local_path)
  uri = URI(url)

  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    request = Net::HTTP::Get.new(uri)

    http.request(request) do |response|
      if response.code == '200'
        total_size = response['content-length'].to_i
        progress_bar = ProgressBar.create(title: 'Downloading', total: total_size) if total_size > 0

        File.open(local_path, 'wb') do |file|
          response.read_body do |chunk|
            file.write(chunk)
            # Advance the bar, clamping so it never exceeds the declared total
            progress_bar.progress = [progress_bar.progress + chunk.size, total_size].min if progress_bar
          end
        end

        progress_bar.finish if progress_bar
        puts "\nDownload completed: #{local_path}"
      end
    end
  end
end
```
## Concurrent Downloads
For downloading multiple files efficiently, use concurrent processing:
```ruby
require 'concurrent' # gem install concurrent-ruby
require 'net/http'
require 'fileutils'

class ConcurrentDownloader
  def initialize(max_threads: 5)
    @thread_pool = Concurrent::FixedThreadPool.new(max_threads)
  end

  def download_files(url_path_pairs)
    futures = url_path_pairs.map do |url, local_path|
      Concurrent::Future.execute(executor: @thread_pool) do
        download_single_file(url, local_path)
      end
    end

    # Wait for all downloads to complete (value blocks until each finishes)
    results = futures.map(&:value)

    # Shutting down the pool makes this downloader single-use
    @thread_pool.shutdown
    @thread_pool.wait_for_termination
    results
  end

  private

  def download_single_file(url, local_path)
    uri = URI(url)
    FileUtils.mkdir_p(File.dirname(local_path))

    begin
      Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
        request = Net::HTTP::Get.new(uri)
        response = http.request(request)

        if response.code == '200'
          File.open(local_path, 'wb') { |file| file.write(response.body) }
          { url: url, path: local_path, status: :success }
        else
          { url: url, path: local_path, status: :failed, error: "HTTP #{response.code}" }
        end
      end
    rescue => e
      { url: url, path: local_path, status: :error, error: e.message }
    end
  end
end

# Usage
downloader = ConcurrentDownloader.new(max_threads: 3)
files_to_download = [
  ['https://example.com/file1.pdf', 'downloads/file1.pdf'],
  ['https://example.com/file2.jpg', 'downloads/file2.jpg'],
  ['https://example.com/file3.doc', 'downloads/file3.doc']
]

results = downloader.download_files(files_to_download)
results.each { |result| puts "#{result[:url]}: #{result[:status]}" }
```
## Error Handling and Retry Logic
Robust file downloading requires proper error handling and retry mechanisms:
```ruby
require 'net/http'
require 'fileutils'
require 'retries' # gem install retries

class RobustDownloader
  def download_with_retry(url, local_path, max_retries: 3)
    with_retries(max_tries: max_retries, rescue: [Net::OpenTimeout, Net::ReadTimeout, Net::HTTPError]) do
      download_file(url, local_path)
    end
  rescue => e
    puts "Failed to download after #{max_retries} attempts: #{e.message}"
    false
  end

  private

  def download_file(url, local_path, redirect_limit = 5)
    raise 'Too many redirects' if redirect_limit.zero?

    uri = URI(url)
    http = Net::HTTP.new(uri.host, uri.port)
    http.use_ssl = uri.scheme == 'https'
    http.read_timeout = 30
    http.open_timeout = 10

    request = Net::HTTP::Get.new(uri)
    request['User-Agent'] = 'Mozilla/5.0 (Ruby Downloader)'

    response = http.request(request)

    case response
    when Net::HTTPSuccess
      FileUtils.mkdir_p(File.dirname(local_path))
      File.open(local_path, 'wb') { |file| file.write(response.body) }
      true
    when Net::HTTPRedirection
      # Follow redirects, bounded by redirect_limit
      download_file(response['location'], local_path, redirect_limit - 1)
    else
      raise Net::HTTPError.new("HTTP Error: #{response.code}", response)
    end
  end
end
```
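A quick usage sketch (the URL is a placeholder):

```ruby
downloader = RobustDownloader.new
ok = downloader.download_with_retry('https://example.com/flaky-file.zip', 'downloads/flaky-file.zip')
puts ok ? 'Download succeeded' : 'Download failed'
```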
## Integration with Browser Automation

For files that require JavaScript execution or complex user interactions (a challenge similar to handling file downloads in Puppeteer), you can use Selenium with Ruby:
```ruby
require 'selenium-webdriver'
require 'fileutils'

class BrowserDownloader
  def initialize(download_dir)
    @download_dir = File.expand_path(download_dir)
    FileUtils.mkdir_p(@download_dir)

    options = Selenium::WebDriver::Chrome::Options.new
    options.add_preference('download.default_directory', @download_dir)
    options.add_preference('download.prompt_for_download', false)

    @driver = Selenium::WebDriver.for(:chrome, options: options)
  end

  def download_file_with_js(url, download_link_selector)
    @driver.navigate.to(url)

    # Wait for the page to load and click the download link
    wait = Selenium::WebDriver::Wait.new(timeout: 10)
    download_link = wait.until { @driver.find_element(css: download_link_selector) }
    download_link.click

    # Wait for the download to complete
    wait_for_download_completion
  end

  def close
    @driver.quit
  end

  private

  def wait_for_download_completion(timeout: 30)
    start_time = Time.now

    loop do
      # Check if any .crdownload files exist (Chrome partial downloads)
      partial_files = Dir.glob(File.join(@download_dir, '*.crdownload'))
      break if partial_files.empty?

      if Time.now - start_time > timeout
        puts "Download timeout exceeded"
        break
      end

      sleep(1)
    end
  end
end
```
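Putting the pieces together (the URL and CSS selector are placeholders for your target page):

```ruby
downloader = BrowserDownloader.new('downloads')
downloader.download_file_with_js('https://example.com/reports', 'a.download-button')
downloader.close
```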
## Advanced Use Cases with WebScraping.AI API
For complex scenarios where traditional Ruby libraries might face challenges with anti-bot measures or JavaScript-heavy sites, you can leverage specialized web scraping APIs. The WebScraping.AI API provides robust file download capabilities with built-in proxy rotation and JavaScript rendering:
```ruby
require 'net/http'
require 'uri'

class WebScrapingAIDownloader
  def initialize(api_key)
    @api_key = api_key
    @base_url = 'https://api.webscraping.ai/download'
  end

  def download_file(url, local_path, options = {})
    params = {
      url: url,
      api_key: @api_key,
      proxy: options[:proxy] || 'datacenter',
      device: options[:device] || 'desktop',
      js: options[:js] || false
    }

    uri = URI(@base_url)
    uri.query = URI.encode_www_form(params)

    response = Net::HTTP.get_response(uri)

    if response.code == '200'
      File.open(local_path, 'wb') { |file| file.write(response.body) }
      puts "Downloaded via WebScraping.AI: #{local_path}"
      true
    else
      puts "API Error: #{response.code} - #{response.body}"
      false
    end
  end
end

# Usage
api_downloader = WebScrapingAIDownloader.new('your_api_key_here')
api_downloader.download_file(
  'https://example.com/protected-file.pdf',
  'downloads/protected_file.pdf',
  { proxy: 'residential', js: true }
)
```
## Best Practices and Considerations

### Performance Optimization

- Use streaming for large files to avoid memory issues
- Implement connection pooling for multiple downloads (see the sketch after this list)
- Set appropriate timeouts to handle slow connections
- Use concurrent downloads with reasonable thread limits
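To illustrate connection pooling, here is a minimal sketch using the `net-http-persistent` gem, which keeps TCP connections to the same host open across requests. The file names and URL are placeholders, and the calls shown follow the gem's 3.x interface; verify against the gem's documentation:

```ruby
require 'net/http/persistent' # gem install net-http-persistent
require 'fileutils'

# One persistent client reused for every request; connections to the same
# host stay alive instead of re-handshaking for each download.
http = Net::HTTP::Persistent.new(name: 'downloader')
http.read_timeout = 30

FileUtils.mkdir_p('downloads')

%w[file1.pdf file2.pdf file3.pdf].each do |name|
  uri = URI("https://example.com/files/#{name}")
  response = http.request(uri) # performs a GET when no request object is given
  next unless response.is_a?(Net::HTTPSuccess)

  File.binwrite(File.join('downloads', name), response.body)
end

http.shutdown # close pooled connections when finished
```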
### Security Considerations

- Validate file types before saving
- Sanitize filenames to prevent directory traversal attacks (see the sketch after this list)
- Check file sizes to prevent disk space exhaustion
- Use HTTPS when possible for secure downloads
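To make the filename and size checks concrete, here is a minimal sketch; the character whitelist and the 1 GB cap are arbitrary assumptions to adapt:

```ruby
MAX_DOWNLOAD_BYTES = 1024 * 1024 * 1024 # arbitrary 1 GB cap

# Reduce a remote filename to a safe basename: drop directory components,
# then whitelist a conservative character set.
def sanitize_filename(name)
  base = File.basename(name.to_s)
  safe = base.gsub(/[^0-9A-Za-z.\-_]/, '_')
  safe.empty? || safe == '.' || safe == '..' ? 'download.bin' : safe
end

# Reject responses that declare a body larger than the cap.
def size_ok?(response)
  length = response['content-length']
  length.nil? || length.to_i <= MAX_DOWNLOAD_BYTES
end

puts sanitize_filename('../../etc/passwd') # => "passwd"
```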
### Error Handling

- Implement retry logic for transient failures
- Handle different HTTP status codes appropriately
- Log download attempts for debugging
- Validate downloaded files for completeness (see the sketch after this list)
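One cheap completeness check compares the saved file's size against the Content-Length header; when a site publishes checksums, verifying those is stronger. A minimal sketch:

```ruby
require 'digest'

# True when the saved file matches the size the server declared
# (servers that omit Content-Length can't be checked this way).
def complete_download?(local_path, response)
  expected = response['content-length']
  expected.nil? || File.size(local_path) == expected.to_i
end

# Optional: verify against a published checksum (hypothetical value)
def checksum_matches?(local_path, expected_sha256)
  Digest::SHA256.file(local_path).hexdigest == expected_sha256
end
```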
## Common Challenges and Solutions
When downloading files through web scraping, you might encounter several challenges:
- **Authentication Requirements**: Use Mechanize for session-based authentication or handle authentication flows with browser automation
- **Dynamic Content**: For JavaScript-generated download links, use Selenium WebDriver
- **Rate Limiting**: Implement delays and respect robots.txt files (a minimal throttle sketch follows this list)
- **Large Files**: Use streaming downloads and progress tracking
- **File Type Detection**: Check Content-Type headers and validate file extensions
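For rate limiting, a small throttle that enforces a minimum interval between requests is often enough. A sketch (the one-second interval is an arbitrary choice, and `download_file` is the helper defined at the start of this article):

```ruby
require 'uri'

# Enforces a minimum interval between consecutive requests.
class Throttle
  def initialize(min_interval = 1.0)
    @min_interval = min_interval
    @last_request_at = nil
  end

  def wait
    if @last_request_at
      elapsed = Time.now - @last_request_at
      sleep(@min_interval - elapsed) if elapsed < @min_interval
    end
    @last_request_at = Time.now
  end
end

throttle = Throttle.new(1.0)
urls = ['https://example.com/a.pdf', 'https://example.com/b.pdf'] # placeholders

urls.each do |url|
  throttle.wait
  download_file(url, File.join('downloads', File.basename(URI(url).path)))
end
```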
## Conclusion
Ruby provides multiple approaches for handling file downloads during web scraping, from simple HTTP requests with `Net::HTTP` to sophisticated browser automation with Selenium. Choose the method that best fits your specific requirements:
- Use `Net::HTTP` or `open-uri` for simple, direct file downloads
- Use Mechanize for downloads requiring session management or form interactions
- Use Selenium for downloads requiring JavaScript execution or complex user interactions
- Use the WebScraping.AI API for enterprise-grade scraping with advanced anti-bot protection
- Implement concurrent downloads for better performance with multiple files
- Always include proper error handling and retry logic for production applications
The key to successful file downloading in Ruby web scraping is understanding your target website's requirements and choosing the appropriate tool and technique for your specific use case. Whether you're downloading a single document or processing thousands of files, Ruby's ecosystem provides the tools you need to build robust and efficient download solutions.