What File Formats Can Mechanize Download and Save Automatically?

Mechanize is a powerful Ruby library that can automatically download and save virtually any file format available on the web. Unlike browser automation tools that require specific handling for downloads, Mechanize treats file downloads as standard HTTP requests, making it incredibly versatile for downloading various content types.

Universal File Format Support

Mechanize doesn't limit you to specific file formats. It can download any content that's accessible via HTTP/HTTPS, including:

Document Formats

  • PDF files (.pdf)
  • Microsoft Office documents (.docx, .xlsx, .pptx)
  • Text files (.txt, .csv, .json, .xml)
  • Rich text documents (.rtf, .odt)

Media Formats

  • Images (.jpg, .png, .gif, .svg, .webp, .bmp, .tiff)
  • Audio files (.mp3, .wav, .ogg, .flac)
  • Video files (.mp4, .avi, .mov, .webm)

Archive Formats

  • Compressed files (.zip, .rar, .tar.gz, .7z)
  • Installer packages (.exe, .msi, .dmg, .deb, .rpm)

Code and Data

  • Source code (.rb, .py, .js, .css, .html)
  • Database exports (.sql, .db)
  • Configuration files (.conf, .ini, .yaml)

Basic File Download Implementation

Here's how to download files using Mechanize:

require 'mechanize'

# Create a Mechanize agent
agent = Mechanize.new

# Download a PDF file
pdf_response = agent.get('https://example.com/document.pdf')
File.open('downloaded_document.pdf', 'wb') do |file|
  file.write(pdf_response.body)
end

# Download an image
image_response = agent.get('https://example.com/image.jpg')
File.open('downloaded_image.jpg', 'wb') do |file|
  file.write(image_response.body)
end

# Download a ZIP archive
archive_response = agent.get('https://example.com/files.zip')
File.open('downloaded_files.zip', 'wb') do |file|
  file.write(archive_response.body)
end
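Each block above repeats the same open-in-binary-mode pattern. Ruby's built-in File.binwrite collapses it to one call, shown here with a stand-in byte string rather than a real download:

```ruby
require 'tmpdir'

# Arbitrary binary payload standing in for a downloaded response body.
data = ("\x89PNG" + "\x00" * 8).b

# File.binwrite opens in binary mode ('wb') and writes in one step --
# equivalent to the File.open(..., 'wb') blocks above.
path = File.join(Dir.mktmpdir, 'sample.bin')
File.binwrite(path, data)
```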

Advanced Download Techniques

Automatic File Extension Detection

You can determine the right file extension automatically from the response's Content-Type header:

require 'mechanize'
require 'uri'

agent = Mechanize.new

def download_with_auto_extension(agent, url, base_filename)
  response = agent.get(url)

  # Get the media type from headers, dropping parameters such as
  # "; charset=utf-8" so the lookup below matches
  content_type = response.response['content-type'].to_s.split(';').first.to_s.strip

  # Map common content types to extensions
  extension_map = {
    'application/pdf' => '.pdf',
    'image/jpeg' => '.jpg',
    'image/png' => '.png',
    'application/zip' => '.zip',
    'text/csv' => '.csv',
    'application/json' => '.json',
    'application/xml' => '.xml',
    'text/plain' => '.txt'
  }

  extension = extension_map[content_type] || ''
  filename = "#{base_filename}#{extension}"

  File.open(filename, 'wb') do |file|
    file.write(response.body)
  end

  puts "Downloaded: #{filename} (#{content_type})"
  filename
end

# Usage
download_with_auto_extension(agent, 'https://api.example.com/data', 'api_data')
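When the Content-Type isn't in the map, the URL path itself often carries a usable extension, so File.extname makes a reasonable fallback. A small sketch (the extension_for helper and its map are my own, not part of Mechanize):

```ruby
require 'uri'

# Map a few common media types to extensions; extend as needed.
EXTENSION_MAP = {
  'application/pdf' => '.pdf',
  'image/jpeg'      => '.jpg',
  'text/csv'        => '.csv'
}.freeze

# Prefer a known Content-Type mapping; otherwise fall back to whatever
# extension the URL path already has (possibly none).
def extension_for(content_type, url)
  key = content_type.to_s.split(';').first.to_s.strip
  EXTENSION_MAP.fetch(key) { File.extname(URI.parse(url).path) }
end
```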

Bulk File Downloads

Download multiple files efficiently:

require 'mechanize'
require 'uri'

agent = Mechanize.new

# List of file URLs to download
file_urls = [
  'https://example.com/doc1.pdf',
  'https://example.com/image1.jpg',
  'https://example.com/data.csv',
  'https://example.com/archive.zip'
]

def bulk_download(agent, urls, download_dir = 'downloads')
  # Create download directory
  Dir.mkdir(download_dir) unless Dir.exist?(download_dir)

  urls.each_with_index do |url, index|
    begin
      puts "Downloading #{index + 1}/#{urls.length}: #{url}"

      response = agent.get(url)
      filename = File.basename(URI.parse(url).path)

      # Handle URLs without clear filenames
      filename = "file_#{index + 1}" if filename.empty?

      filepath = File.join(download_dir, filename)

      File.open(filepath, 'wb') do |file|
        file.write(response.body)
      end

      puts "✓ Saved: #{filepath}"

    rescue => e
      puts "✗ Failed to download #{url}: #{e.message}"
    end
  end
end

bulk_download(agent, file_urls)
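File.basename on a URL path is a good start, but real URLs produce percent-encoded or otherwise filesystem-unfriendly names. A sanitizing helper (a sketch; safe_filename is my own name) can stand in for the bare File.basename call in the loop above:

```ruby
require 'uri'

# Derive a filesystem-safe filename from a URL, falling back to a
# supplied default when the URL path yields no usable basename.
def safe_filename(url, fallback = 'download')
  name = File.basename(URI.parse(url).path.to_s)
  name = URI.decode_www_form_component(name)  # "%20" -> " "
  name = name.gsub(/[^\w.\- ]/, '')           # drop unsafe characters
  name.empty? || %w[. ..].include?(name) ? fallback : name
end
```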

Handling Large Files

By default, agent.get buffers the entire response body before returning it, so iterating over response.body afterwards does not reduce memory use. For large files, set the agent's default pluggable parser to Mechanize::Download: responses are then handed to you as raw downloads you can save straight to disk, and Mechanize spools large bodies to a temporary file instead of holding them in memory (you give up per-chunk progress reporting, which the high-level API doesn't expose):

require 'mechanize'

agent = Mechanize.new

# Treat every response as a raw download instead of parsing it
agent.pluggable_parser.default = Mechanize::Download

def download_large_file(agent, url, filename)
  puts "Starting download: #{url}"

  download = agent.get(url)
  download.save(filename)

  puts "Download complete: #{filename}"
end

# Download a large file without loading it all into memory
download_large_file(agent, 'https://example.com/large-video.mp4', 'video.mp4')

Error Handling and Validation

Implement robust error handling for file downloads:

require 'mechanize'

agent = Mechanize.new

def safe_download(agent, url, filename, max_retries = 3)
  retries = 0

  begin
    response = agent.get(url)

    # Validate response
    unless response.code == '200'
      raise "HTTP Error: #{response.code}"
    end

    # Check if content is actually a file (not an error page)
    content_type = response.response['content-type']
    if content_type&.include?('text/html')
      puts "Warning: Received HTML instead of expected file format"
    end

    File.open(filename, 'wb') do |file|
      file.write(response.body)
    end

    # Verify file was written
    if File.exist?(filename) && File.size(filename) > 0
      puts "✓ Successfully downloaded: #{filename}"
      return true
    else
      raise "File was not properly saved"
    end

  rescue => e
    retries += 1
    if retries <= max_retries
      puts "Retry #{retries}/#{max_retries}: #{e.message}"
      sleep(2 ** retries) # Exponential backoff
      retry
    else
      puts "✗ Failed to download after #{max_retries} attempts: #{e.message}"
      return false
    end
  end
end

# Usage with error handling
safe_download(agent, 'https://example.com/important-file.pdf', 'important.pdf')
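Checking File.size > 0 catches empty files but not truncated ones. When the server sends a Content-Length, comparing it with the bytes actually saved is a stronger check (a sketch; complete_download? is my own helper):

```ruby
# True when the saved byte count matches the size the server advertised.
# A missing or blank Content-Length can't be verified, so treat it as OK.
def complete_download?(saved_bytes, content_length_header)
  expected = content_length_header.to_s.strip
  return true if expected.empty?
  saved_bytes == Integer(expected)
end
```

Inside safe_download, you would call it as complete_download?(File.size(filename), response.response['content-length']) and raise when it returns false, so the retry logic kicks in.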

Working with Authentication

Download files from protected resources:

require 'mechanize'

agent = Mechanize.new

# HTTP Basic authentication, scoped to the site
agent.add_auth('https://example.com', 'username', 'password')

# Or handle login forms first
login_page = agent.get('https://example.com/login')
login_form = login_page.form_with(:action => '/login')
login_form.username = 'your_username'
login_form.password = 'your_password'
agent.submit(login_form)

# Now download protected files
protected_file = agent.get('https://example.com/protected/document.pdf')
File.open('protected_document.pdf', 'wb') do |file|
  file.write(protected_file.body)
end

Content-Type Based Processing

Process different file types based on their content:

require 'mechanize'

agent = Mechanize.new

def process_by_content_type(agent, url)
  response = agent.get(url)
  # Drop parameters such as "; charset=utf-8" before matching
  content_type = response.response['content-type'].to_s.split(';').first.to_s

  case content_type
  when /^image\//
    puts "Processing image file"
    # Save with timestamp
    filename = "image_#{Time.now.to_i}.#{content_type.split('/').last}"

  when /^application\/pdf/
    puts "Processing PDF document"
    filename = "document_#{Time.now.to_i}.pdf"

  when /^application\/json/
    puts "Processing JSON data"
    filename = "data_#{Time.now.to_i}.json"
    # Could also parse JSON here

  when /^text\//
    puts "Processing text file"
    filename = "text_#{Time.now.to_i}.txt"

  else
    puts "Processing unknown file type: #{content_type}"
    filename = "file_#{Time.now.to_i}"
  end

  File.open(filename, 'wb') do |file|
    file.write(response.body)
  end

  filename
end

# Download and process various file types
process_by_content_type(agent, 'https://api.example.com/data.json')

Comparison with Browser Automation Tools

Unlike browser automation tools such as Puppeteer, which need explicit download handling (configured download directories, waiting for files to land on disk), Mechanize treats all downloads as direct HTTP requests. This approach offers several advantages:

  • No browser overhead: Downloads are faster and use less memory
  • Universal format support: Any file accessible via HTTP can be downloaded
  • Simpler implementation: No need to configure download directories or wait for files
  • Better for APIs: Direct access to file content without browser intervention

However, Mechanize cannot download files that require JavaScript execution or complex browser interactions; for those cases, a headless-browser tool is the better fit.

Best Practices

  1. Always use binary mode ('wb') when writing files to avoid encoding issues
  2. Implement retry logic for network-related failures
  3. Validate downloads by checking file size and content type
  4. Handle large files with streaming to prevent memory issues
  5. Use meaningful filenames and organize downloads in directories
  6. Respect rate limits when downloading multiple files
  7. Check Content-Type headers to verify you're getting the expected file format
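Practice 6 can be implemented with a minimum-interval throttle between requests (pure Ruby; the Throttle class is my own sketch, not a Mechanize feature):

```ruby
# Enforce a minimum delay between successive downloads.
class Throttle
  def initialize(min_interval)
    @min_interval = min_interval
    @last = nil
  end

  # Sleep just long enough that consecutive calls are at least
  # min_interval seconds apart, then record the call time.
  def wait
    now = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    if @last && (elapsed = now - @last) < @min_interval
      sleep(@min_interval - elapsed)
    end
    @last = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  end
end
```

Call throttle.wait before each agent.get in a bulk-download loop to keep at least min_interval seconds between requests.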

Conclusion

Mechanize's file download capabilities are extensive and flexible. It can handle virtually any file format available over HTTP/HTTPS, making it an excellent choice for automated file collection tasks. The key advantage is its simplicity – treating downloads as standard HTTP requests means you can download any content type without format-specific handling.

Whether you're downloading documents, images, archives, or any other file type, Mechanize provides the tools you need for reliable, automated file downloads in your Ruby applications.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
