How do I scrape images and download them using Ruby?

To scrape images and download them using Ruby, you can use the nokogiri gem to parse HTML and extract image URLs, and the open-uri module to download the images. Here's a step-by-step guide:

Step 1: Install Required Gems

First, install the nokogiri gem if you haven't already (open-uri ships with Ruby's standard library, so it needs no separate installation):

gem install nokogiri

Step 2: Require Necessary Libraries

In your Ruby script, require the necessary libraries:

require 'nokogiri'
require 'open-uri'

Step 3: Fetch the HTML Content

Use open-uri to fetch the HTML content of the webpage you want to scrape:

url = 'http://example.com' # Replace with the actual URL
html_content = URI.open(url)
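
Some servers reject requests that lack a browser-like User-Agent header, and the request can fail outright with OpenURI::HTTPError. As a hedged variant (the User-Agent string below is an illustrative placeholder, not a requirement):

begin
  # open-uri accepts request headers as string-keyed options
  html_content = URI.open(url, 'User-Agent' => 'Mozilla/5.0 (compatible; MyScraper/1.0)')
rescue OpenURI::HTTPError => e
  abort "Failed to fetch #{url}: #{e.message}"
end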

Step 4: Parse HTML with Nokogiri

Parse the HTML content with Nokogiri:

doc = Nokogiri::HTML(html_content)

Step 5: Extract Image URLs

Select the image tags and extract the src attribute, which contains the image URLs. Calling compact drops any img tags without a src attribute, which would otherwise raise an error during the download step:

image_urls = doc.css('img').map { |img| img['src'] }.compact
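
Note that many sites lazy-load images and keep the real URL in an attribute such as data-src rather than src. The attribute name varies per site, so treat data-src here as an assumption to verify against the page's actual markup:

# Prefer data-src (assumed lazy-load attribute) and fall back to src
image_urls = doc.css('img').map { |img| img['data-src'] || img['src'] }.compact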

Step 6: Download the Images

Loop through the image URLs and download each one. Make sure to handle relative URLs correctly by using URI.join:

image_urls.each_with_index do |image_url, index|
  # Handle relative URLs
  image_url = URI.join(url, image_url).to_s

  # Open the image URL and read the content
  image_data = URI.open(image_url).read

  # Determine the file extension (assuming the URL path contains the file name)
  file_extension = File.extname(URI.parse(image_url).path)

  # Set a file name (index is used to avoid name conflicts)
  file_name = "image_#{index}#{file_extension}"

  # Write the content to a file
  File.open(file_name, 'wb') do |file|
    file.write(image_data)
  end

  puts "Downloaded #{file_name}"
end
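
In practice, individual downloads can fail (broken links, 404s, network errors), and rapid-fire requests can overload the server. A more defensive sketch of the same loop rescues per-image errors and pauses between requests; the one-second delay is an arbitrary choice, not a rule:

image_urls.each_with_index do |image_url, index|
  image_url = URI.join(url, image_url).to_s
  file_extension = File.extname(URI.parse(image_url).path)
  file_name = "image_#{index}#{file_extension}"

  begin
    # Download and write in one step; skip this image on network errors
    File.open(file_name, 'wb') { |file| file.write(URI.open(image_url).read) }
    puts "Downloaded #{file_name}"
  rescue OpenURI::HTTPError, SocketError => e
    warn "Skipping #{image_url}: #{e.message}"
  end

  sleep 1 # Be polite: pause between requests
end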

Full Example

Here's the full example as a single Ruby script:

require 'nokogiri'
require 'open-uri'

# URL of the webpage you want to scrape
url = 'http://example.com' # Replace with the actual URL

# Fetch and parse HTML document
html_content = URI.open(url)
doc = Nokogiri::HTML(html_content)

# Extract image URLs
image_urls = doc.css('img').map { |img| img['src'] }.compact

# Download each image
image_urls.each_with_index do |image_url, index|
  # Handle relative URLs
  image_url = URI.join(url, image_url).to_s

  # Open the image URL and read the content
  image_data = URI.open(image_url).read

  # Determine the file extension
  file_extension = File.extname(URI.parse(image_url).path)

  # Set a file name
  file_name = "image_#{index}#{file_extension}"

  # Write the content to a file
  File.open(file_name, 'wb') do |file|
    file.write(image_data)
  end

  puts "Downloaded #{file_name}"
end
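
Save the script under any name you like (scrape_images.rb below is arbitrary) and run it from the directory where you want the images saved:

ruby scrape_images.rb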

Remember to respect the website's robots.txt and terms of service before scraping; not all websites allow their content to be scraped. It's also good practice not to overload the server by making many requests in a short period of time.
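
As a minimal sketch of that first check, you can fetch and print the site's robots.txt before scraping. This only displays the rules; actually honoring them requires parsing the Allow/Disallow directives, ideally with a dedicated robots.txt library:

robots_url = URI.join(url, '/robots.txt').to_s
begin
  puts URI.open(robots_url).read
rescue OpenURI::HTTPError
  puts "No robots.txt found at #{robots_url}"
end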
