To scrape images and download them using Ruby, you can use the nokogiri gem to parse HTML and extract image URLs, and the open-uri module to download the images. Here's a step-by-step guide to doing so:
Step 1: Install Required Gems
First, you need to install the nokogiri gem if you haven't already:
gem install nokogiri
Step 2: Require Necessary Libraries
In your Ruby script, require the necessary libraries:
require 'nokogiri'
require 'open-uri'
Step 3: Fetch the HTML Content
Use open-uri to fetch the HTML content of the webpage you want to scrape:
url = 'http://example.com' # Replace with the actual URL
html_content = URI.open(url)
Step 4: Parse HTML with Nokogiri
Parse the HTML content with Nokogiri:
doc = Nokogiri::HTML(html_content)
Step 5: Extract Image URLs
Select the img tags and extract the src attribute, which contains the image URLs:
image_urls = doc.css('img').map { |img| img['src'] }.compact
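Note that img tags without a src attribute yield nil, and inline data: URIs can't be downloaded over HTTP, so it's worth filtering the list before the download step. A minimal sketch (the downloadable_srcs helper is a hypothetical name, not part of any library):

```ruby
# Hypothetical helper: keep only src values that can be fetched over HTTP(S).
# Drops nil (img tags with no src attribute), blank strings, and data: URIs.
def downloadable_srcs(srcs)
  srcs.compact.reject { |s| s.strip.empty? || s.start_with?('data:') }
end

srcs = ['img/cat.png', nil, 'data:image/png;base64,AAAA', '/logo.png']
downloadable_srcs(srcs)
# => ["img/cat.png", "/logo.png"]
```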
Step 6: Download the Images
Loop through the image URLs and download each one. Make sure to handle relative URLs correctly by resolving them against the page URL with URI.join:
image_urls.each_with_index do |image_url, index|
  # Resolve relative URLs against the page URL
  image_url = URI.join(url, image_url).to_s
  # Open the image URL and read the content
  image_data = URI.open(image_url).read
  # Determine the file extension (assuming the URL path contains the file name)
  file_extension = File.extname(URI.parse(image_url).path)
  # Set a file name (the index avoids name conflicts)
  file_name = "image_#{index}#{file_extension}"
  # Write the content to a file in binary mode
  File.open(file_name, 'wb') do |file|
    file.write(image_data)
  end
  puts "Downloaded #{file_name}"
end
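URI.join resolves a relative src against the page URL following standard URL-resolution rules. A few illustrative cases (the example.com URLs are placeholders):

```ruby
require 'uri'

base = 'http://example.com/gallery/index.html'

# Relative path: resolved against the base URL's directory
URI.join(base, 'img/cat.png').to_s
# => "http://example.com/gallery/img/cat.png"

# Root-relative path: replaces the whole path
URI.join(base, '/logo.png').to_s
# => "http://example.com/logo.png"

# Protocol-relative URL: keeps the base URL's scheme
URI.join(base, '//cdn.example.com/a.png').to_s
# => "http://cdn.example.com/a.png"
```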
Full Example
Here's the full example as a single Ruby script:
require 'nokogiri'
require 'open-uri'
# URL of the webpage you want to scrape
url = 'http://example.com' # Replace with the actual URL
# Fetch and parse HTML document
html_content = URI.open(url)
doc = Nokogiri::HTML(html_content)
# Extract image URLs
image_urls = doc.css('img').map { |img| img['src'] }.compact
# Download each image
image_urls.each_with_index do |image_url, index|
  # Resolve relative URLs against the page URL
  image_url = URI.join(url, image_url).to_s
  # Open the image URL and read the content
  image_data = URI.open(image_url).read
  # Determine the file extension
  file_extension = File.extname(URI.parse(image_url).path)
  # Set a file name
  file_name = "image_#{index}#{file_extension}"
  # Write the content to a file in binary mode
  File.open(file_name, 'wb') do |file|
    file.write(image_data)
  end
  puts "Downloaded #{file_name}"
end
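One caveat with the extension logic: File.extname returns an empty string when the URL path has no file extension, so you may want a fallback (the '.jpg' default below is an arbitrary choice for illustration):

```ruby
require 'uri'

# Query strings are not part of the path, so they don't pollute the extension.
File.extname(URI.parse('http://example.com/photos/cat.jpg?size=large').path)
# => ".jpg"

# No extension in the path: extname returns "".
ext = File.extname(URI.parse('http://example.com/photos/cat').path)
ext = '.jpg' if ext.empty?  # arbitrary fallback extension
```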
Remember to respect the website's robots.txt and terms of service before scraping, as not all websites allow scraping of their content. It's also good practice not to overload the server by making too many requests in a short period of time.
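That throttling advice can be sketched as a small helper that pauses between requests (polite_each is a hypothetical name, not a library method):

```ruby
# Hypothetical helper: iterate over items with a fixed pause between them,
# so the server is not hit with a burst of back-to-back requests.
def polite_each(items, delay: 1.0)
  items.each_with_index do |item, i|
    sleep(delay) if i > 0  # no pause before the first request
    yield item
  end
end
```

Usage in the download loop would look like polite_each(image_urls, delay: 1.0) { |u| ... } in place of image_urls.each.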