How can I extract images and their attributes from HTML using Nokogiri?
Extracting images and their attributes from HTML documents is a common web scraping task, and Nokogiri provides powerful tools to accomplish this efficiently. Whether you need to download images, analyze image metadata, or build an image gallery, Nokogiri's CSS selectors and XPath expressions make it straightforward to extract comprehensive image information.
Basic Image Extraction
Let's start with the fundamentals of extracting images using Nokogiri:
require 'nokogiri'
require 'open-uri'

# Sample HTML with various image elements
html = <<~HTML
  <html>
    <body>
      <img src="https://example.com/image1.jpg" alt="Sample Image" width="300" height="200">
      <img src="/relative/path/image2.png" alt="Another Image" class="thumbnail">
      <img src="https://example.com/image3.gif" title="Animated GIF" data-lazy="true">
      <picture>
        <source srcset="image4-large.webp" media="(min-width: 800px)">
        <img src="image4-small.jpg" alt="Responsive Image">
      </picture>
    </body>
  </html>
HTML

# Parse the HTML document
doc = Nokogiri::HTML(html)

# Extract all image elements
images = doc.css('img')

# Iterate through images and extract basic attributes
images.each_with_index do |img, index|
  puts "Image #{index + 1}:"
  puts " Source: #{img['src']}"
  puts " Alt text: #{img['alt']}"
  puts " Width: #{img['width']}" if img['width']
  puts " Height: #{img['height']}" if img['height']
  puts "---"
end
Extracting Comprehensive Image Attributes
For more detailed image analysis, you'll want to extract all available attributes:
def extract_image_data(img_element)
  {
    src: img_element['src'],
    alt: img_element['alt'],
    title: img_element['title'],
    width: img_element['width']&.to_i,
    height: img_element['height']&.to_i,
    class: img_element['class'],
    id: img_element['id'],
    loading: img_element['loading'],           # lazy, eager
    decoding: img_element['decoding'],         # sync, async, auto
    crossorigin: img_element['crossorigin'],
    referrerpolicy: img_element['referrerpolicy'],
    sizes: img_element['sizes'],
    srcset: img_element['srcset'],
    usemap: img_element['usemap']
  }.reject { |_, v| v.nil? || v == '' }
end

# Extract comprehensive data for all images
doc.css('img').each_with_index do |img, index|
  image_data = extract_image_data(img)
  puts "Image #{index + 1}: #{image_data}"
end
Handling Different Image Scenarios
Working with Data Attributes
Many modern websites use data attributes for lazy loading and other functionality:
# Extract custom data attributes
def extract_data_attributes(img_element)
  data_attrs = {}
  img_element.attributes.each do |name, attr|
    if name.start_with?('data-')
      data_attrs[name] = attr.value
    end
  end
  data_attrs
end

# Example usage
doc.css('img').each do |img|
  data_attrs = extract_data_attributes(img)
  unless data_attrs.empty?
    puts "Data attributes: #{data_attrs}"
  end
end
Extracting Images from Picture Elements
Modern responsive images often use the <picture> element:
def extract_picture_data(picture_element)
  sources = picture_element.css('source').map do |source|
    {
      srcset: source['srcset'],
      media: source['media'],
      type: source['type'],
      sizes: source['sizes']
    }.reject { |_, v| v.nil? || v == '' }
  end

  img = picture_element.at_css('img')
  fallback_img = img ? extract_image_data(img) : nil

  {
    sources: sources,
    fallback: fallback_img
  }
end

# Extract data from picture elements
doc.css('picture').each_with_index do |picture, index|
  picture_data = extract_picture_data(picture)
  puts "Picture #{index + 1}: #{picture_data}"
end
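The `srcset` values captured above are raw strings. If you need the individual candidates, a small parser helps; a minimal sketch, assuming each comma-separated candidate is a URL optionally followed by a width (`480w`) or density (`2x`) descriptor:

```ruby
# Split a srcset string into its URL/descriptor candidates
def parse_srcset(srcset)
  return [] if srcset.nil? || srcset.strip.empty?
  srcset.split(',').map do |candidate|
    url, descriptor = candidate.strip.split(/\s+/, 2)
    { url: url, descriptor: descriptor } # descriptor is nil for a bare URL
  end
end

parse_srcset('small.jpg 480w, large.jpg 1024w')
# => [{:url=>"small.jpg", :descriptor=>"480w"}, {:url=>"large.jpg", :descriptor=>"1024w"}]
```

Note that this naive split will break on the rare URLs that themselves contain commas; for those you would need a stricter parser.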
Advanced Filtering and Selection
Filtering by Image Type
# Filter images by file extension
def filter_by_extension(images, extensions)
  images.select do |img|
    src = img['src']
    next false unless src
    # Drop any query string or fragment before checking the extension,
    # so "photo.jpg?v=2" is still recognized as a JPEG
    path = src.split(/[?#]/).first
    ext = File.extname(path).downcase.delete('.')
    extensions.include?(ext)
  end
end

# Get only JPEG and PNG images
jpeg_png_images = filter_by_extension(doc.css('img'), ['jpg', 'jpeg', 'png'])
puts "Found #{jpeg_png_images.length} JPEG/PNG images"
Filtering by Size Attributes
# Find images with specific dimensions
def find_large_images(images, min_width: 500, min_height: 300)
  images.select do |img|
    width = img['width']&.to_i || 0
    height = img['height']&.to_i || 0
    width >= min_width && height >= min_height
  end
end

large_images = find_large_images(doc.css('img'))
puts "Found #{large_images.length} large images"
Using XPath for Complex Queries
# Find images with specific attributes using XPath
images_with_alt = doc.xpath('//img[@alt and @alt != ""]')
lazy_images = doc.xpath('//img[@loading="lazy" or @data-lazy]')
responsive_images = doc.xpath('//img[@srcset or parent::picture]')
puts "Images with alt text: #{images_with_alt.length}"
puts "Lazy-loaded images: #{lazy_images.length}"
puts "Responsive images: #{responsive_images.length}"
Building an Image Scraper Class
Here's a comprehensive Ruby class for image extraction:
class ImageExtractor
  def initialize(html_content)
    @doc = Nokogiri::HTML(html_content)
  end

  def extract_all_images
    {
      standard_images: extract_standard_images,
      picture_elements: extract_picture_elements,
      background_images: extract_background_images
    }
  end

  private

  def extract_standard_images
    @doc.css('img').map do |img|
      extract_image_data(img).merge(
        data_attributes: extract_data_attributes(img),
        parent_element: img.parent.name
      )
    end
  end

  def extract_picture_elements
    @doc.css('picture').map { |picture| extract_picture_data(picture) }
  end

  def extract_background_images
    elements_with_bg = @doc.css('*[style*="background-image"]')
    elements_with_bg.map do |element|
      style = element['style']
      bg_match = style.match(/background-image:\s*url\(['"]?([^'"]*?)['"]?\)/)
      if bg_match
        {
          url: bg_match[1],
          element: element.name,
          class: element['class'],
          id: element['id']
        }
      end
    end.compact
  end

  # ... (include helper methods from previous examples)
end

# Usage, with the `html` string parsed earlier
extractor = ImageExtractor.new(html)
all_images = extractor.extract_all_images
puts "Total images found: #{all_images[:standard_images].length}"
Handling URLs and Path Resolution
When scraping images, you often need to resolve relative URLs:
require 'uri'

def resolve_image_url(img_src, base_url)
  return img_src if img_src.match?(/^https?:\/\//)
  base_uri = URI.parse(base_url)
  URI.join(base_uri, img_src).to_s
rescue URI::InvalidURIError
  nil
end

# Example usage
base_url = 'https://example.com/page'
doc.css('img').each do |img|
  src = img['src']
  resolved_url = resolve_image_url(src, base_url)
  puts "Original: #{src}"
  puts "Resolved: #{resolved_url}"
  puts "---"
end
Error Handling and Validation
Robust image extraction requires proper error handling:
def safe_extract_images(html_content)
  doc = Nokogiri::HTML(html_content)
  images = []
  doc.css('img').each do |img|
    begin
      src = img['src']
      next if src.nil? || src.empty?
      image_data = {
        src: src,
        alt: img['alt'] || '',
        width: parse_dimension(img['width']),
        height: parse_dimension(img['height']),
        valid: validate_image_url(src)
      }
      images << image_data
    rescue StandardError => e
      puts "Error processing image: #{e.message}"
      next
    end
  end
  images
rescue Nokogiri::XML::SyntaxError => e
  puts "HTML parsing error: #{e.message}"
  []
end

def parse_dimension(value)
  return nil if value.nil?
  value.to_i if value.match?(/^\d+$/)
end

def validate_image_url(url)
  uri = URI.parse(url)
  %w[http https].include?(uri.scheme) || url.start_with?('/')
rescue URI::InvalidURIError
  false
end
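URL validation only inspects the string; once you actually fetch the bytes, you can verify the format from the file signature instead of trusting the extension. A sketch covering a few common signatures (this list is deliberately not exhaustive):

```ruby
# Identify an image format from its leading bytes (magic numbers)
def detect_image_format(data)
  return nil if data.nil? || data.empty?
  bytes = data.byteslice(0, 12).b
  case
  when bytes.start_with?("\xFF\xD8\xFF".b)            then :jpeg
  when bytes.start_with?("\x89PNG\r\n\x1A\n".b)       then :png
  when bytes.start_with?('GIF87a', 'GIF89a')          then :gif
  when bytes[0, 4] == 'RIFF' && bytes[8, 4] == 'WEBP' then :webp
  else nil
  end
end

detect_image_format("\x89PNG\r\n\x1A\n".b) # => :png
```

Checking the signature after download is a cheap way to catch error pages served with a 200 status where an image was expected.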
Performance Optimization Tips
For large documents or high-volume scraping:
# Use more specific selectors to reduce processing
specific_images = doc.css('article img, .gallery img, .content img')
# Cache parsed documents when processing multiple queries
class CachedImageExtractor
def initialize(html_content)
@doc = Nokogiri::HTML(html_content)
@images_cache = nil
end
def images
@images_cache ||= @doc.css('img')
end
def count
images.length
end
def with_alt_text
images.select { |img| img['alt'] && !img['alt'].empty? }
end
end
Integration with Web Scraping Workflows
When building larger scraping applications, consider integrating image extraction with other scraping tools. For complex JavaScript-heavy sites where images load dynamically, you may need to render the page first with a browser automation tool like Puppeteer (or a Ruby-native equivalent such as Ferrum) and then hand the resulting HTML to Nokogiri for parsing.
For sites requiring authentication before accessing images, you can combine Nokogiri's parsing capabilities with session management techniques to extract images from protected content.
Best Practices
- Always check for null values before accessing attributes
- Validate URLs before attempting to download images
- Handle relative URLs by resolving them against the base URL
- Respect robots.txt and rate limits when downloading images
- Cache parsed documents when performing multiple queries
- Use specific CSS selectors to improve performance
- Implement proper error handling for malformed HTML
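Several of these practices (rate limiting in particular) are easiest to enforce in one place. A minimal sketch of a fixed-interval throttle you could wrap around each download call; the one-second default is just an illustrative choice:

```ruby
# Simple fixed-interval rate limiter for polite scraping
class RateLimiter
  def initialize(min_interval: 1.0)
    @min_interval = min_interval
    @last_request_at = nil
  end

  # Sleeps just long enough to keep calls @min_interval seconds apart
  def throttle
    if @last_request_at
      elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - @last_request_at
      sleep(@min_interval - elapsed) if elapsed < @min_interval
    end
    @last_request_at = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    yield if block_given?
  end
end

limiter = RateLimiter.new(min_interval: 1.0)
# image_urls.each { |url| limiter.throttle { download_image(url) } }
```

Here `download_image` is a placeholder for whatever fetch logic you use; the commented usage line shows where the throttle sits in the loop.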
Conclusion
Nokogiri provides powerful and flexible tools for extracting images and their attributes from HTML documents. Whether you're building an image gallery, analyzing website content, or downloading resources, the techniques covered in this guide will help you efficiently extract comprehensive image data. Remember to handle edge cases, validate data, and implement proper error handling for robust image extraction workflows.
The combination of CSS selectors, XPath expressions, and Ruby's string manipulation capabilities makes Nokogiri an excellent choice for image extraction tasks in web scraping projects.