How do I extract attributes from HTML elements using Nokogiri?

Nokogiri is a powerful Ruby library for parsing HTML and XML documents. Extracting attributes from HTML elements is a common task in web scraping, and Nokogiri provides several methods to accomplish this efficiently.

Installation

First, install Nokogiri by adding it to your Gemfile or installing it directly:

# Gemfile
gem 'nokogiri'
# Direct installation
gem install nokogiri

Basic Attribute Extraction

Method 1: Hash-style Access ([])

The most common way to extract attributes is using hash-style access:

require 'nokogiri'

html_content = <<-HTML
<html>
  <body>
    <div id="main" class="content">
      <a href="https://example.com" title="Example Website" data-id="123">
        Click here
      </a>
      <img src="/images/logo.png" alt="Company Logo" width="200" height="100">
    </div>
  </body>
</html>
HTML

doc = Nokogiri::HTML(html_content)

# Extract link attributes
link = doc.css('a').first
puts link['href']     # => "https://example.com"
puts link['title']    # => "Example Website"
puts link['data-id']  # => "123"

# Extract image attributes
img = doc.css('img').first
puts img['src']    # => "/images/logo.png"
puts img['alt']    # => "Company Logo"
puts img['width']  # => "200"

Method 2: Using attr() Method

The attr() method behaves identically to hash-style access, returning the attribute value as a string (or nil when the attribute is absent), with a more explicit syntax:

link = doc.css('a').first
puts link.attr('href')     # => "https://example.com"
puts link.attr('title')    # => "Example Website"
puts link.attr('data-id')  # => "123"

Advanced Attribute Extraction

Extracting All Attributes

Get all attributes from an element as a hash:

link = doc.css('a').first
attributes = link.attributes

attributes.each do |name, attr|
  puts "#{name}: #{attr.value}"
end
# Output:
# href: https://example.com
# title: Example Website
# data-id: 123

Extracting Attributes from Multiple Elements

Process multiple elements and their attributes:

html_content = <<-HTML
<div class="products">
  <div class="product" data-id="1" data-price="29.99">
    <h3>Product A</h3>
    <a href="/product/1">View Details</a>
  </div>
  <div class="product" data-id="2" data-price="39.99">
    <h3>Product B</h3>
    <a href="/product/2">View Details</a>
  </div>
</div>
HTML

doc = Nokogiri::HTML(html_content)

# Extract data from all products
products = doc.css('.product')
products.each do |product|
  id = product['data-id']
  price = product['data-price']
  link = product.css('a').first['href']

  puts "Product ID: #{id}, Price: $#{price}, Link: #{link}"
end

Using XPath for Complex Selections

For more complex attribute extraction, use XPath:

# Extract all href attributes from links within a specific div
hrefs = doc.xpath('//div[@class="content"]//a/@href')
hrefs.each { |href| puts href.value }

# Extract attributes with conditions
expensive_products = doc.xpath('//div[@data-price > 30]/@data-id')
expensive_products.each { |id| puts "Expensive product ID: #{id.value}" }

Web Scraping Example

Here's a complete example of scraping attributes from a live website:

require 'nokogiri'
require 'net/http'
require 'uri'

def scrape_page_attributes(url)
  uri = URI(url)
  response = Net::HTTP.get_response(uri)

  if response.code == '200'
    doc = Nokogiri::HTML(response.body)

    # Extract all image sources and alt texts
    images = doc.css('img')
    puts "Found #{images.count} images:"

    images.each_with_index do |img, index|
      src = img['src']
      alt = img['alt'] || 'No alt text'
      puts "#{index + 1}. #{src} (#{alt})"
    end

    # Extract all external links
    external_links = doc.css('a[href^="http"]')
    puts "\nFound #{external_links.count} external links:"

    external_links.each do |link|
      href = link['href']
      title = link['title'] || link.text.strip
      puts "- #{href} (#{title})"
    end
  else
    puts "Failed to fetch page: #{response.code}"
  end
end

# Usage
scrape_page_attributes('https://example.com')

Error Handling and Best Practices

Always handle cases where attributes might not exist:

link = doc.css('a').first

# Safe attribute extraction
href = link&.[]('href') || 'No href found'
title = link&.attr('title') || 'No title'

# Check if element exists before accessing attributes
if link
  puts "Link found: #{link['href']}"
else
  puts "No link found"
end

# Handle missing attributes gracefully
def safe_attr(element, attribute)
  element&.[](attribute) || "#{attribute} not found"
end

puts safe_attr(link, 'href')
puts safe_attr(link, 'nonexistent')

Common Use Cases

Form Data Extraction

# Extract form input values and attributes
form_inputs = doc.css('input')
form_inputs.each do |input|
  name = input['name']
  value = input['value']
  type = input['type']
  puts "#{name}: #{value} (#{type})"
end

Meta Tag Information

# Extract meta tags
meta_tags = doc.css('meta')
meta_tags.each do |meta|
  name = meta['name'] || meta['property']
  content = meta['content']
  puts "#{name}: #{content}" if name && content
end

Data Attributes for JavaScript

# Extract all data-* attributes
elements_with_data = doc.css('[data-id]')
elements_with_data.each do |element|
  data_attrs = element.attributes.select { |name, _| name.start_with?('data-') }
  data_attrs.each { |name, attr| puts "#{name}: #{attr.value}" }
end

Remember to respect robots.txt files and website terms of service when scraping. Consider adding delays between requests and handling rate limiting appropriately.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
