What File Formats Can Mechanize Download and Save Automatically?
Mechanize is a powerful Ruby library that can automatically download and save virtually any file format available on the web. Unlike browser automation tools that require specific handling for downloads, Mechanize treats file downloads as standard HTTP requests, making it incredibly versatile for downloading various content types.
Universal File Format Support
Mechanize doesn't limit you to specific file formats. It can download any content that's accessible via HTTP/HTTPS, including:
Document Formats
- PDF files (.pdf)
- Microsoft Office documents (.docx, .xlsx, .pptx)
- Text files (.txt, .csv, .json, .xml)
- Rich text documents (.rtf, .odt)
Media Formats
- Images (.jpg, .png, .gif, .svg, .webp, .bmp, .tiff)
- Audio files (.mp3, .wav, .ogg, .flac)
- Video files (.mp4, .avi, .mov, .webm)
Archive Formats
- Compressed files (.zip, .rar, .tar.gz, .7z)
- Installer packages (.exe, .msi, .dmg, .deb, .rpm)
Code and Data
- Source code (.rb, .py, .js, .css, .html)
- Database exports (.sql, .db)
- Configuration files (.conf, .ini, .yaml)
Basic File Download Implementation
Here's how to download files using Mechanize:
require 'mechanize'

# Create a Mechanize agent
agent = Mechanize.new

# Download a PDF file
pdf_response = agent.get('https://example.com/document.pdf')
File.open('downloaded_document.pdf', 'wb') do |file|
  file.write(pdf_response.body)
end

# Download an image
image_response = agent.get('https://example.com/image.jpg')
File.open('downloaded_image.jpg', 'wb') do |file|
  file.write(image_response.body)
end

# Download a ZIP archive
archive_response = agent.get('https://example.com/files.zip')
File.open('downloaded_files.zip', 'wb') do |file|
  file.write(archive_response.body)
end
Advanced Download Techniques
Automatic File Extension Detection
When a URL carries no file extension, you can choose one from the response's Content-Type header by mapping common media types to extensions:
require 'mechanize'
require 'uri'

agent = Mechanize.new

def download_with_auto_extension(agent, url, base_filename)
  response = agent.get(url)

  # Get the media type from the headers, dropping parameters
  # such as "; charset=utf-8" that would break the lookup below
  content_type = response.response['content-type'].to_s.split(';').first.to_s.strip.downcase

  # Map common content types to extensions
  extension_map = {
    'application/pdf' => '.pdf',
    'image/jpeg' => '.jpg',
    'image/png' => '.png',
    'application/zip' => '.zip',
    'text/csv' => '.csv',
    'application/json' => '.json',
    'application/xml' => '.xml',
    'text/plain' => '.txt'
  }

  extension = extension_map[content_type] || ''
  filename = "#{base_filename}#{extension}"

  File.open(filename, 'wb') do |file|
    file.write(response.body)
  end

  puts "Downloaded: #{filename} (#{content_type})"
  filename
end

# Usage
download_with_auto_extension(agent, 'https://api.example.com/data', 'api_data')
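Some servers send a missing or generic Content-Type (for example application/octet-stream), in which case the lookup above yields no extension. A minimal sketch of one possible fallback, using only the standard library: take the extension from the URL path when the header is not recognized. The helper name pick_extension and the EXTENSIONS table are hypothetical and not part of Mechanize.

```ruby
require 'uri'

# Hypothetical lookup table; extend it with the types you expect
EXTENSIONS = {
  'application/pdf' => '.pdf',
  'image/jpeg' => '.jpg',
  'image/png' => '.png',
  'application/zip' => '.zip',
  'text/csv' => '.csv',
  'application/json' => '.json'
}.freeze

# Prefer the Content-Type header; fall back to the extension in the URL path.
# Returns an empty string when neither source gives an answer.
def pick_extension(content_type_header, url)
  # Drop parameters such as "; charset=utf-8" and normalize case
  media_type = content_type_header.to_s.split(';').first.to_s.strip.downcase
  EXTENSIONS[media_type] || File.extname(URI.parse(url).path)
end

puts pick_extension('text/csv; charset=utf-8', 'https://example.com/export')
puts pick_extension('application/octet-stream', 'https://example.com/files/report.pdf')
```

With a Mechanize response, the first argument would be response.response['content-type'].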
Bulk File Downloads
Download multiple files efficiently:
require 'mechanize'
require 'uri'

agent = Mechanize.new

# List of file URLs to download
file_urls = [
  'https://example.com/doc1.pdf',
  'https://example.com/image1.jpg',
  'https://example.com/data.csv',
  'https://example.com/archive.zip'
]

def bulk_download(agent, urls, download_dir = 'downloads')
  # Create download directory
  Dir.mkdir(download_dir) unless Dir.exist?(download_dir)

  urls.each_with_index do |url, index|
    begin
      puts "Downloading #{index + 1}/#{urls.length}: #{url}"
      response = agent.get(url)

      filename = File.basename(URI.parse(url).path)
      # Handle URLs without clear filenames (File.basename('/') returns '/')
      filename = "file_#{index + 1}" if filename.empty? || filename == '/'

      filepath = File.join(download_dir, filename)
      File.open(filepath, 'wb') do |file|
        file.write(response.body)
      end

      puts "✓ Saved: #{filepath}"
    rescue StandardError => e
      # Report the failure and continue with the next URL
      puts "✗ Failed to download #{url}: #{e.message}"
    end
  end
end

bulk_download(agent, file_urls)
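When many URLs are downloaded into one directory, two files can end up with the same basename, and URL paths may contain characters that are awkward in local filenames. The following is a rough sketch of how names could be sanitized and de-duplicated before writing. The helpers safe_filename and unique_name are hypothetical, standard-library-only, and not part of Mechanize.

```ruby
require 'uri'

# Derive a conservative local filename from a URL.
# Anything outside letters, digits, dot, underscore and hyphen becomes "_".
def safe_filename(url, fallback)
  name = File.basename(URI.parse(url).path)
  name = name.gsub(/[^A-Za-z0-9._-]/, '_')
  # Names made only of dots/underscores (e.g. from a path of "/") are useless
  name = fallback if name.empty? || name.match?(/\A[._]+\z/)
  name
end

# Return name unchanged the first time it is seen, then name_2, name_3, ...
# seen must be a Hash created with Hash.new(0).
def unique_name(name, seen)
  count = (seen[name] += 1)
  return name if count == 1

  ext = File.extname(name)
  "#{File.basename(name, ext)}_#{count}#{ext}"
end

seen = Hash.new(0)
urls = [
  'https://example.com/a/data.csv',
  'https://example.com/b/data.csv',
  'https://example.com/'
]
urls.each_with_index do |url, index|
  puts unique_name(safe_filename(url, "file_#{index + 1}"), seen)
end
```

Note that unique_name only tracks names chosen in the current run; it does not check for files that already exist on disk.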
Handling Large Files
For large downloads, avoid reading response.body, which holds the whole file in memory as a single string. Setting the agent's default pluggable parser to Mechanize::Download keeps the body as an IO object that can be saved straight to disk:
require 'mechanize'

agent = Mechanize.new
# Responses with no more specific parser are handled by Mechanize::Download,
# which keeps the body as an IO instead of loading it into a String
agent.pluggable_parser.default = Mechanize::Download

def download_large_file(agent, url, filename)
  puts "Starting download: #{url}"
  response = agent.get(url)

  # Content-Length may be absent, in which case to_i gives 0
  total_size = response.response['content-length'].to_i
  puts "Expected size: #{total_size} bytes" if total_size > 0

  # Write the body IO to disk
  response.save(filename)
  puts "Download complete: #{filename}"
end

# Download a large file without holding it in memory
download_large_file(agent, 'https://example.com/large-video.mp4', 'video.mp4')
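If progress reporting is needed, one option is to copy the body in fixed-size chunks and report the running byte count after each one. The sketch below assumes the response is a Mechanize::Download, whose body_io accessor returns an IO-like object; here a StringIO stands in for it so the example runs without a network connection. The helper name copy_in_chunks is hypothetical.

```ruby
require 'stringio'
require 'tmpdir'

# Copy from any IO-like source to a file in fixed-size chunks,
# yielding the total number of bytes written after each chunk.
def copy_in_chunks(source, path, chunk_size: 64 * 1024)
  written = 0
  File.open(path, 'wb') do |file|
    # read(n) returns nil at end of input, which ends the loop
    while (chunk = source.read(chunk_size))
      file.write(chunk)
      written += chunk.bytesize
      yield written if block_given?
    end
  end
  written
end

# Demo: an in-memory source stands in for a response body IO
Dir.mktmpdir do |dir|
  source = StringIO.new('x' * 150_000)
  total = source.size
  copy_in_chunks(source, File.join(dir, 'demo.bin')) do |written|
    puts "Progress: #{(written * 100.0 / total).round(1)}%"
  end
end
```

With Mechanize, the source would be response.body_io and the total would come from the Content-Length header when the server provides one.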
Error Handling and Validation
Implement robust error handling for file downloads:
require 'mechanize'

agent = Mechanize.new

def safe_download(agent, url, filename, max_retries = 3)
  retries = 0

  begin
    # Mechanize raises Mechanize::ResponseCodeError for error statuses
    # such as 404 or 500; the explicit check below is an extra safeguard
    response = agent.get(url)
    raise "HTTP Error: #{response.code}" unless response.code == '200'

    # Check if content is actually a file (not an error page)
    content_type = response.response['content-type']
    if content_type&.include?('text/html')
      puts "Warning: Received HTML instead of expected file format"
    end

    File.open(filename, 'wb') do |file|
      file.write(response.body)
    end

    # Verify file was written
    unless File.exist?(filename) && File.size(filename) > 0
      raise "File was not properly saved"
    end

    puts "✓ Successfully downloaded: #{filename}"
    true
  rescue StandardError => e
    retries += 1
    if retries <= max_retries
      puts "Retry #{retries}/#{max_retries}: #{e.message}"
      sleep(2**retries) # Exponential backoff: 2, 4, 8 seconds
      retry
    else
      puts "✗ Failed to download after #{max_retries} retries: #{e.message}"
      false
    end
  end
end

# Usage with error handling
safe_download(agent, 'https://example.com/important-file.pdf', 'important.pdf')
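The Content-Type header only reports what the server claims to be sending. As an additional check, the first bytes of the body can be compared against well-known file signatures. The sketch below covers a handful of common formats; the helper name sniff_format and the SIGNATURES table are hypothetical, and the list is far from complete.

```ruby
# Leading bytes ("magic numbers") of a few common formats.
# Note: an empty ZIP archive starts with a different marker and is not covered.
SIGNATURES = {
  pdf: '%PDF-'.b,
  png: "\x89PNG\r\n\x1A\n".b,
  jpeg: "\xFF\xD8\xFF".b,
  gif: 'GIF8'.b,
  zip: "PK\x03\x04".b
}.freeze

# Return a symbol for the detected format, or nil if none of the
# known signatures matches the start of the data.
def sniff_format(bytes)
  head = bytes.to_s.b # compare as raw bytes, ignoring string encoding
  match = SIGNATURES.find { |_name, magic| head.start_with?(magic) }
  match && match.first
end

p sniff_format("%PDF-1.7\n...")
p sniff_format('<!DOCTYPE html><html>...')
```

In a download routine, a nil result for a URL that should return a PDF is a strong hint that an error or login page was saved instead.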
Working with Authentication
Download files from protected resources:
require 'mechanize'

agent = Mechanize.new

# HTTP Basic authentication, with credentials scoped to a single site
agent.add_auth('https://example.com', 'username', 'password')

# Or handle login forms first
login_page = agent.get('https://example.com/login')
login_form = login_page.form_with(action: '/login')
# These setters assume the form fields are named "username" and "password"
login_form.username = 'your_username'
login_form.password = 'your_password'
agent.submit(login_form)

# Now download protected files; the agent reuses the session cookies
protected_file = agent.get('https://example.com/protected/document.pdf')
File.open('protected_document.pdf', 'wb') do |file|
  file.write(protected_file.body)
end
Content-Type Based Processing
Process different file types based on their content:
require 'mechanize'

agent = Mechanize.new

def process_by_content_type(agent, url)
  response = agent.get(url)
  content_type = response.response['content-type']

  case content_type
  when /^image\//
    puts "Processing image file"
    # Use the subtype as the extension, ignoring suffixes and parameters
    # (e.g. "image/svg+xml; charset=utf-8" gives "svg")
    subtype = content_type[%r{\Aimage/([a-z0-9]+)}i, 1]
    # Save with timestamp
    filename = "image_#{Time.now.to_i}.#{subtype}"
  when /^application\/pdf/
    puts "Processing PDF document"
    filename = "document_#{Time.now.to_i}.pdf"
  when /^application\/json/
    puts "Processing JSON data"
    filename = "data_#{Time.now.to_i}.json"
    # Could also parse JSON here
  when /^text\//
    puts "Processing text file"
    filename = "text_#{Time.now.to_i}.txt"
  else
    # Also reached when the Content-Type header is missing
    puts "Processing unknown file type: #{content_type}"
    filename = "file_#{Time.now.to_i}"
  end

  File.open(filename, 'wb') do |file|
    file.write(response.body)
  end

  filename
end

# Download and process various file types
process_by_content_type(agent, 'https://api.example.com/data.json')
Comparison with Browser Automation Tools
Unlike browser automation tools such as Puppeteer, which need explicit download handling, Mechanize treats all downloads as direct HTTP requests. This approach offers several advantages:
- No browser overhead: Downloads are faster and use less memory
- Universal format support: Any file accessible via HTTP can be downloaded
- Simpler implementation: No need to configure download directories or wait for files
- Better for APIs: Direct access to file content without browser intervention
However, Mechanize cannot download files that require JavaScript execution or complex browser interactions. In those cases a browser automation tool may be necessary.
Best Practices
- Always use binary mode ('wb') when writing files to avoid encoding issues
- Implement retry logic for network-related failures
- Validate downloads by checking file size and content type
- Handle large files with streaming to prevent memory issues
- Use meaningful filenames and organize downloads in directories
- Respect rate limits when downloading multiple files
- Check Content-Type headers to verify you're getting the expected file format
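As one way to respect rate limits in a bulk download loop, a small throttle can enforce a minimum interval between requests. This is a minimal sketch rather than a Mechanize feature: the Throttle class is hypothetical, and the clock and sleeper parameters exist only so the timing logic can be exercised without real delays.

```ruby
# Hypothetical helper: enforce a minimum interval between calls to wait.
class Throttle
  def initialize(min_interval,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) },
                 sleeper: ->(seconds) { sleep(seconds) })
    @min_interval = min_interval
    @clock = clock
    @sleeper = sleeper
    @last_call = nil
  end

  # Blocks until at least min_interval seconds have passed since the
  # previous call returned; the first call returns immediately.
  def wait
    if @last_call
      remaining = @min_interval - (@clock.call - @last_call)
      @sleeper.call(remaining) if remaining > 0
    end
    @last_call = @clock.call
  end
end

# Demo with a short interval; real code would call agent.get inside the loop
throttle = Throttle.new(0.05)
%w[doc1.pdf image1.jpg data.csv].each do |name|
  throttle.wait
  puts "Would download #{name} here"
end
```

A fixed interval is the simplest policy; a server that publishes its own limits may call for a different one.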
Conclusion
Mechanize's file download capabilities are extensive and flexible. It can handle virtually any file format available over HTTP/HTTPS, making it an excellent choice for automated file collection tasks. The key advantage is its simplicity – treating downloads as standard HTTP requests means you can download any content type without format-specific handling.
Whether you're downloading documents, images, archives, or any other file type, Mechanize provides the tools you need for reliable, automated file downloads in your Ruby applications.