How can I extract images and their attributes from HTML using Nokogiri?

Extracting images and their attributes from HTML documents is a common web scraping task, and Nokogiri provides powerful tools to accomplish this efficiently. Whether you need to download images, analyze image metadata, or build an image gallery, Nokogiri's CSS selectors and XPath expressions make it straightforward to extract comprehensive image information.

Basic Image Extraction

Let's start with the fundamentals of extracting images using Nokogiri:

require 'nokogiri'
require 'open-uri'

# Sample HTML with various image elements
html = <<~HTML
  <html>
    <body>
      <img src="https://example.com/image1.jpg" alt="Sample Image" width="300" height="200">
      <img src="/relative/path/image2.png" alt="Another Image" class="thumbnail">
      <img src="https://example.com/image3.gif" title="Animated GIF" data-lazy="true">
      <picture>
        <source srcset="image4-large.webp" media="(min-width: 800px)">
        <img src="image4-small.jpg" alt="Responsive Image">
      </picture>
    </body>
  </html>
HTML

# Parse the HTML document
doc = Nokogiri::HTML(html)

# Extract all image elements
images = doc.css('img')

# Iterate through images and extract basic attributes
images.each_with_index do |img, index|
  puts "Image #{index + 1}:"
  puts "  Source: #{img['src']}"
  puts "  Alt text: #{img['alt']}"
  puts "  Width: #{img['width']}" if img['width']
  puts "  Height: #{img['height']}" if img['height']
  puts "---"
end

Extracting Comprehensive Image Attributes

For more detailed image analysis, you'll want to extract all available attributes:

def extract_image_data(img_element)
  {
    src: img_element['src'],
    alt: img_element['alt'],
    title: img_element['title'],
    width: img_element['width']&.to_i,
    height: img_element['height']&.to_i,
    class: img_element['class'],
    id: img_element['id'],
    loading: img_element['loading'], # lazy, eager
    decoding: img_element['decoding'], # sync, async, auto
    crossorigin: img_element['crossorigin'],
    referrerpolicy: img_element['referrerpolicy'],
    sizes: img_element['sizes'],
    srcset: img_element['srcset'],
    usemap: img_element['usemap']
  }.reject { |_, v| v.nil? || v == '' }
end

# Extract comprehensive data for all images
doc.css('img').each_with_index do |img, index|
  image_data = extract_image_data(img)
  puts "Image #{index + 1}: #{image_data}"
end

Handling Different Image Scenarios

Working with Data Attributes

Many modern websites use data attributes for lazy loading and other functionality:

# Extract custom data attributes
def extract_data_attributes(img_element)
  data_attrs = {}

  img_element.attributes.each do |name, attr|
    if name.start_with?('data-')
      data_attrs[name] = attr.value
    end
  end

  data_attrs
end

# Example usage
doc.css('img').each do |img|
  data_attrs = extract_data_attributes(img)
  unless data_attrs.empty?
    puts "Data attributes: #{data_attrs}"
  end
end
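
A common follow-up is recovering the real image URL when a site lazy-loads images: the src attribute often holds a placeholder while the actual URL sits in a data attribute. The attribute names below (data-src, data-lazy-src, data-original) are common conventions rather than a standard, so inspect your target site's markup and adjust the list:

# Prefer common lazy-loading attributes over the (often placeholder) src.
# The attribute names here are typical conventions, not a standard.
def effective_src(img_element)
  %w[data-src data-lazy-src data-original src]
    .map { |attr| img_element[attr] }
    .find { |value| value && !value.empty? }
end

doc.css('img').each do |img|
  puts "Effective source: #{effective_src(img)}"
end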

Extracting Images from Picture Elements

Modern responsive images often use the <picture> element:

def extract_picture_data(picture_element)
  sources = picture_element.css('source').map do |source|
    {
      srcset: source['srcset'],
      media: source['media'],
      type: source['type'],
      sizes: source['sizes']
    }.reject { |_, v| v.nil? || v == '' }
  end

  img = picture_element.at_css('img')
  fallback_img = img ? extract_image_data(img) : nil

  {
    sources: sources,
    fallback: fallback_img
  }
end

# Extract data from picture elements
doc.css('picture').each_with_index do |picture, index|
  picture_data = extract_picture_data(picture)
  puts "Picture #{index + 1}: #{picture_data}"
end
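
The srcset values extracted above are raw strings. If you need the individual candidate URLs and their width or density descriptors, a small parser helps. This is a minimal sketch that assumes well-formed, comma-separated srcset values whose URLs contain no unescaped commas:

# Parse a srcset string like "img-400.jpg 400w, img-800.jpg 2x"
# into URL/descriptor pairs.
def parse_srcset(srcset)
  return [] if srcset.nil? || srcset.empty?

  srcset.split(',').map do |candidate|
    url, descriptor = candidate.strip.split(/\s+/, 2)
    { url: url, descriptor: descriptor } # descriptor is nil when omitted
  end
end

parse_srcset('image4-large.webp 800w, image4-small.jpg 400w')
# => [{url: "image4-large.webp", descriptor: "800w"},
#     {url: "image4-small.jpg", descriptor: "400w"}]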

Advanced Filtering and Selection

Filtering by Image Type

# Filter images by file extension
def filter_by_extension(images, extensions)
  images.select do |img|
    src = img['src']
    next false unless src

    # Strip any query string or fragment before reading the extension,
    # so URLs like "image.jpg?v=2" are matched correctly
    path = src.split(/[?#]/).first
    ext = File.extname(path).downcase.delete('.')
    extensions.include?(ext)
  end
end

# Get only JPEG and PNG images
jpeg_png_images = filter_by_extension(doc.css('img'), ['jpg', 'jpeg', 'png'])
puts "Found #{jpeg_png_images.length} JPEG/PNG images"

Filtering by Size Attributes

# Find images with specific dimensions
def find_large_images(images, min_width: 500, min_height: 300)
  images.select do |img|
    width = img['width']&.to_i || 0
    height = img['height']&.to_i || 0
    width >= min_width && height >= min_height
  end
end

large_images = find_large_images(doc.css('img'))
puts "Found #{large_images.length} large images"

Using XPath for Complex Queries

# Find images with specific attributes using XPath
images_with_alt = doc.xpath('//img[@alt and @alt != ""]')
lazy_images = doc.xpath('//img[@loading="lazy" or @data-lazy]')
responsive_images = doc.xpath('//img[@srcset or parent::picture]')

puts "Images with alt text: #{images_with_alt.length}"
puts "Lazy-loaded images: #{lazy_images.length}"
puts "Responsive images: #{responsive_images.length}"

Building an Image Scraper Class

Here's a comprehensive Ruby class for image extraction:

class ImageExtractor
  def initialize(html_content)
    @doc = Nokogiri::HTML(html_content)
  end

  def extract_all_images
    {
      standard_images: extract_standard_images,
      picture_elements: extract_picture_elements,
      background_images: extract_background_images
    }
  end

  private

  def extract_standard_images
    @doc.css('img').map do |img|
      extract_image_data(img).merge(
        data_attributes: extract_data_attributes(img),
        parent_element: img.parent.name
      )
    end
  end

  def extract_picture_elements
    @doc.css('picture').map { |picture| extract_picture_data(picture) }
  end

  def extract_background_images
    elements_with_bg = @doc.css('*[style*="background-image"]')

    elements_with_bg.map do |element|
      style = element['style']
      bg_match = style.match(/background-image:\s*url\(['"]?([^'"]*?)['"]?\)/)

      if bg_match
        {
          url: bg_match[1],
          element: element.name,
          class: element['class'],
          id: element['id']
        }
      end
    end.compact
  end

  # ... (include helper methods from previous examples)
end

# Usage (with the sample document parsed earlier)
extractor = ImageExtractor.new(html)
all_images = extractor.extract_all_images
puts "Total images found: #{all_images[:standard_images].length}"

Handling URLs and Path Resolution

When scraping images, you often need to resolve relative URLs:

require 'uri'

def resolve_image_url(img_src, base_url)
  return nil if img_src.nil? || img_src.empty?
  return img_src if img_src.match?(/^https?:\/\//)

  base_uri = URI.parse(base_url)
  URI.join(base_uri, img_src).to_s
rescue URI::InvalidURIError
  nil
end

# Example usage
base_url = 'https://example.com/page'
doc.css('img').each do |img|
  src = img['src']
  resolved_url = resolve_image_url(src, base_url)
  puts "Original: #{src}"
  puts "Resolved: #{resolved_url}"
  puts "---"
end

Error Handling and Validation

Robust image extraction requires proper error handling:

def safe_extract_images(html_content)
  begin
    doc = Nokogiri::HTML(html_content)
    images = []

    doc.css('img').each do |img|
      begin
        src = img['src']
        next if src.nil? || src.empty?

        image_data = {
          src: src,
          alt: img['alt'] || '',
          width: parse_dimension(img['width']),
          height: parse_dimension(img['height']),
          valid: validate_image_url(src)
        }

        images << image_data
      rescue StandardError => e
        puts "Error processing image: #{e.message}"
        next
      end
    end

    images
  rescue Nokogiri::XML::SyntaxError => e
    puts "HTML parsing error: #{e.message}"
    []
  end
end

def parse_dimension(value)
  return nil if value.nil?
  value.to_i if value.match?(/^\d+$/)
end

def validate_image_url(url)
  uri = URI.parse(url)
  %w[http https].include?(uri.scheme) || url.start_with?('/')
rescue URI::InvalidURIError
  false
end
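
Putting these helpers together on the sample document parsed earlier:

images = safe_extract_images(html)
images.each do |image|
  status = image[:valid] ? 'looks valid' : 'suspect URL'
  puts "#{image[:src]} (#{status})"
end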

Performance Optimization Tips

For large documents or high-volume scraping:

# Use more specific selectors to reduce processing
specific_images = doc.css('article img, .gallery img, .content img')

# Cache parsed documents when processing multiple queries
class CachedImageExtractor
  def initialize(html_content)
    @doc = Nokogiri::HTML(html_content)
    @images_cache = nil
  end

  def images
    @images_cache ||= @doc.css('img')
  end

  def count
    images.length
  end

  def with_alt_text
    images.select { |img| img['alt'] && !img['alt'].empty? }
  end
end
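
Usage follows the same pattern as before; repeated calls reuse the cached NodeSet instead of re-querying the document:

extractor = CachedImageExtractor.new(html)
puts "Total images: #{extractor.count}"
puts "Images with alt text: #{extractor.with_alt_text.length}"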

Integration with Web Scraping Workflows

When building larger scraping applications, consider integrating image extraction with other scraping tools. For complex JavaScript-heavy sites where images load dynamically, you might need to combine Nokogiri with browser automation tools like Puppeteer for handling dynamic content.
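
Puppeteer lives in the Node.js ecosystem; on the Ruby side, one option is to drive a real browser with the selenium-webdriver gem and hand the rendered HTML to Nokogiri. A minimal sketch, assuming Chrome and a matching chromedriver are installed (the URL is hypothetical):

require 'selenium-webdriver'
require 'nokogiri'

# Render the page in headless Chrome, then parse the resulting HTML.
options = Selenium::WebDriver::Chrome::Options.new
options.add_argument('--headless=new')

driver = Selenium::WebDriver.for(:chrome, options: options)
begin
  driver.get('https://example.com/gallery') # hypothetical URL
  doc = Nokogiri::HTML(driver.page_source)
  puts "Images after JS rendering: #{doc.css('img').length}"
ensure
  driver.quit
end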

For sites requiring authentication before accessing images, you can combine Nokogiri's parsing capabilities with session management techniques to extract images from protected content.
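
As a minimal sketch of that idea, fetch a protected page with a session cookie obtained from a prior login, then parse as usual. The URL and cookie name here are hypothetical:

require 'net/http'
require 'nokogiri'

# Hypothetical protected page; supply a session cookie from a prior login.
uri = URI('https://example.com/members/photos')
request = Net::HTTP::Get.new(uri)
request['Cookie'] = 'session_id=YOUR_SESSION_COOKIE'

response = Net::HTTP.start(uri.host, uri.port, use_ssl: true) do |http|
  http.request(request)
end

doc = Nokogiri::HTML(response.body)
puts "Images behind login: #{doc.css('img').length}"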

Best Practices

  1. Always check for null values before accessing attributes
  2. Validate URLs before attempting to download images
  3. Handle relative URLs by resolving them against the base URL
  4. Respect robots.txt and rate limits when downloading images (see the download sketch after this list)
  5. Cache parsed documents when performing multiple queries
  6. Use specific CSS selectors to improve performance
  7. Implement proper error handling for malformed HTML
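
Here is a minimal downloader tying several of these practices together: URL validation, relative-URL resolution, and a crude fixed-delay rate limit. It reuses resolve_image_url and validate_image_url from earlier; the one-second delay is an arbitrary assumption, and robots.txt checking is left to a dedicated library:

require 'open-uri'
require 'uri'

def download_images(doc, base_url, dest_dir: 'images', delay: 1)
  Dir.mkdir(dest_dir) unless Dir.exist?(dest_dir)

  doc.css('img').each do |img|
    begin
      src = img['src']
      next if src.nil? || src.empty?

      url = resolve_image_url(src, base_url) # helper defined earlier
      next unless url && validate_image_url(url)

      filename = File.join(dest_dir, File.basename(URI.parse(url).path))
      URI.open(url) { |remote| File.binwrite(filename, remote.read) }
      sleep delay # crude rate limiting; tune to the target site's policies
    rescue OpenURI::HTTPError, URI::InvalidURIError => e
      puts "Skipping image: #{e.message}"
    end
  end
end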

Conclusion

Nokogiri provides powerful and flexible tools for extracting images and their attributes from HTML documents. Whether you're building an image gallery, analyzing website content, or downloading resources, the techniques covered in this guide will help you efficiently extract comprehensive image data. Remember to handle edge cases, validate data, and implement proper error handling for robust image extraction workflows.

The combination of CSS selectors, XPath expressions, and Ruby's string manipulation capabilities makes Nokogiri an excellent choice for image extraction tasks in web scraping projects.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
