
How can I extract inline styles from HTML elements using Nokogiri?

Extracting inline styles from HTML elements is a common task when scraping websites or analyzing web content. Nokogiri, the HTML/XML parser for Ruby, exposes an element's style attribute as a plain string; it does not parse the CSS inside that string, so turning it into individual properties takes a little Ruby of your own. This guide shows how to extract, parse, and work with inline styles.

Basic Style Attribute Extraction

The simplest way to extract inline styles is by accessing the style attribute of an element:

require 'nokogiri'

html = <<-HTML
<div style="color: red; font-size: 16px; margin: 10px;">
  Hello World
</div>
<p style="background-color: blue; padding: 5px;">
  Paragraph text
</p>
HTML

doc = Nokogiri::HTML(html)

# Extract style attribute from a specific element
div_element = doc.at_css('div')
style_string = div_element['style']
puts style_string
# Output: color: red; font-size: 16px; margin: 10px;

# Extract styles from all elements with style attributes
doc.css('[style]').each do |element|
  puts "#{element.name}: #{element['style']}"
end

Parsing Individual CSS Properties

To work with individual CSS properties, you'll need to parse the style string. Here's a helper method to convert the style string into a hash:

def parse_inline_styles(style_string)
  return {} if style_string.nil? || style_string.empty?

  styles = {}
  style_string.split(';').each do |declaration|
    next if declaration.strip.empty?

    property, value = declaration.split(':', 2)
    next unless property && value

    styles[property.strip] = value.strip
  end

  styles
end

# Usage example
html = '<div style="color: red; font-size: 16px; margin: 10px 5px;">Content</div>'
doc = Nokogiri::HTML(html)
element = doc.at_css('div')

styles = parse_inline_styles(element['style'])
puts styles
# Output: {"color"=>"red", "font-size"=>"16px", "margin"=>"10px 5px"}

# Access specific properties
puts "Color: #{styles['color']}"
puts "Font size: #{styles['font-size']}"
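
One limitation of the helper above is that it splits on every semicolon, so a value such as a data URI inside url(...) or a quoted string containing ";" would be cut in half. The following is a minimal sketch of a splitter that ignores semicolons inside parentheses or quotes. The name split_declarations is hypothetical (it is not part of Nokogiri), and the sketch does not handle backslash-escaped quotes.

```ruby
# Hypothetical helper (not part of Nokogiri): split a style string into
# declarations, ignoring semicolons inside parentheses or quotes.
# Assumption: backslash-escaped quotes inside strings are not handled.
def split_declarations(style_string)
  declarations = []
  current = +''
  depth = 0    # nesting level of parentheses, e.g. inside url(...)
  quote = nil  # the quote character we are currently inside, if any

  style_string.each_char do |char|
    if quote
      quote = nil if char == quote
    elsif char == '"' || char == "'"
      quote = char
    elsif char == '('
      depth += 1
    elsif char == ')'
      depth -= 1 if depth > 0
    elsif char == ';' && depth.zero?
      # A top-level semicolon ends the current declaration
      declarations << current.strip unless current.strip.empty?
      current = +''
      next
    end
    current << char
  end

  declarations << current.strip unless current.strip.empty?
  declarations
end

style = 'background: url(data:image/png;base64,AAAA); color: red'
p split_declarations(style)
# => ["background: url(data:image/png;base64,AAAA)", "color: red"]
```

Each returned declaration can then be split on the first colon exactly as parse_inline_styles does.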

Advanced Style Extraction with Error Handling

Here's a more robust approach that handles edge cases and malformed CSS:

class StyleExtractor
  def self.extract_styles(element)
    return {} unless element.respond_to?(:[])

    style_attr = element['style']
    return {} if style_attr.nil? || style_attr.strip.empty?

    parse_style_string(style_attr)
  end

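  # A bare `private` does not apply to methods defined with `def self.`,
  # so the helper is hidden with private_class_method instead.
  private_class_method def self.parse_style_string(style_string)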
    styles = {}

    # Handle various separators and clean up the string
    cleaned_string = style_string.gsub(/\s+/, ' ').strip

    declarations = cleaned_string.split(';')
    declarations.each do |declaration|
      next if declaration.strip.empty?

      parts = declaration.split(':', 2)
      next unless parts.length == 2

      property = parts[0].strip.downcase
      value = parts[1].strip

      # Skip empty properties or values
      next if property.empty? || value.empty?

      # Remove quotes only when a single matching pair wraps the entire value
      value = value[1..-2] if value.match?(/\A(["'])[^"']*\1\z/)

      styles[property] = value
    end

    styles
  end
end

# Usage example with complex HTML
html = <<-HTML
<div style="color: red; font-size: 16px; background-image: url('image.jpg'); border: 1px solid #ccc;">
  <span style="font-weight: bold; text-decoration: underline;">Bold text</span>
  <p style="margin: 0; padding: 10px 15px;">Paragraph</p>
</div>
HTML

doc = Nokogiri::HTML(html)

# Extract styles from all elements
doc.css('[style]').each do |element|
  styles = StyleExtractor.extract_styles(element)
  puts "#{element.name.upcase} styles:"
  styles.each { |prop, value| puts "  #{prop}: #{value}" }
  puts
end
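
Inline declarations sometimes carry an !important flag, which the extractor above leaves inside the value string (for example "red !important"). If you need the flag separately, one possible approach is sketched below. The helper name split_priority and the shape of its return value are assumptions made for illustration.

```ruby
# Hypothetical helper: separate a trailing !important flag from a CSS value.
# Returns a hash with the bare value and a boolean flag.
def split_priority(value)
  # CSS allows whitespace between "!" and "important" and ignores case
  match = value.match(/\A(.*?)\s*!\s*important\s*\z/im)
  return { value: value.strip, important: false } unless match

  { value: match[1].strip, important: true }
end

p split_priority('red !important')
p split_priority('10px 5px')
```

The first call reports the value "red" with the flag set; the second returns the value unchanged with the flag cleared.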

Filtering and Searching for Specific Styles

You can search for elements based on their inline styles using custom methods:

def find_elements_with_style_property(doc, property, value = nil)
  elements = []

  doc.css('[style]').each do |element|
    styles = StyleExtractor.extract_styles(element)

    if value.nil?
      # Just check if property exists
      elements << element if styles.key?(property)
    else
      # Check if property has specific value
      elements << element if styles[property] == value
    end
  end

  elements
end

# Find all elements whose color is exactly 'red'
red_elements = find_elements_with_style_property(doc, 'color', 'red')
puts "Found #{red_elements.length} red elements"

# Find all elements with any font-size property
font_sized_elements = find_elements_with_style_property(doc, 'font-size')
puts "Found #{font_sized_elements.length} elements with font-size"
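
Because the search above compares values with ==, it will not match "Red" against "red" or "#FF0000" against "#ff0000". A small sketch of a more flexible comparison follows. The name style_matches? is hypothetical, and it works on the plain hash returned by the extractor rather than on Nokogiri nodes.

```ruby
# Hypothetical helper: `matcher` may be a String (exact match) or a Regexp.
# Relies on Ruby's case-equality operator (===) to support both.
def style_matches?(styles, property, matcher)
  value = styles[property]
  return false if value.nil?

  matcher === value
end

styles = { 'color' => '#FF0000', 'font-size' => '16px' }
puts style_matches?(styles, 'color', /\A#ff0000\z/i)  # true
puts style_matches?(styles, 'font-size', /px\z/)      # true
puts style_matches?(styles, 'color', 'red')           # false
```

Inside the element loop, this could replace the styles[property] == value check when pattern matching is needed.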

Working with CSS Units and Values

When extracting styles, you might want to parse and work with CSS units:

def parse_css_value(value)
  # Match number and unit
  match = value.match(/^(-?\d*\.?\d+)([a-zA-Z%]*)$/)
  return { number: nil, unit: nil } unless match

  {
    number: match[1].include?('.') ? match[1].to_f : match[1].to_i,
    unit: match[2].empty? ? nil : match[2]
  }
end

# Example usage
html = '<div style="width: 100px; height: 50%; margin: 1.5em;">Content</div>'
doc = Nokogiri::HTML(html)
element = doc.at_css('div')

styles = StyleExtractor.extract_styles(element)
styles.each do |property, value|
  parsed = parse_css_value(value)
  if parsed[:number]
    puts "#{property}: #{parsed[:number]} #{parsed[:unit] || 'unitless'}"
  else
    puts "#{property}: #{value} (not a numeric value)"
  end
end
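
Note that parse_css_value only recognizes a single number with an optional unit, so a shorthand such as "margin: 10px 5px" is reported as not numeric. As a rough sketch, a margin or padding shorthand can first be expanded into its four sides using the standard CSS one-to-four value rule. The helper name expand_box_shorthand is hypothetical, and it assumes the individual values contain no spaces (so calc(...) expressions are not supported).

```ruby
# Hypothetical helper: expand a margin/padding shorthand into four sides,
# following the CSS 1-to-4 value rule. Splits on whitespace, so values
# containing spaces (such as calc(...)) are NOT handled by this sketch.
def expand_box_shorthand(value)
  parts = value.split
  sides =
    case parts.length
    when 1 then [parts[0]] * 4                            # all four sides
    when 2 then [parts[0], parts[1], parts[0], parts[1]]  # vertical, horizontal
    when 3 then [parts[0], parts[1], parts[2], parts[1]]  # top, horizontal, bottom
    when 4 then parts                                     # top, right, bottom, left
    end
  return nil unless sides # not a 1-4 value shorthand

  %i[top right bottom left].zip(sides).to_h
end

expand_box_shorthand('10px 5px').each do |side, length|
  puts "#{side}: #{length}"
end
```

Each expanded side is a single token, so it can then be passed to parse_css_value.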

Extracting Styles from Complex Documents

For larger documents, you might want to extract styles more systematically:

require 'set' # needed on Ruby versions before 3.2

def extract_all_styles(doc)
  style_data = {
    elements: [],
    unique_properties: Set.new,
    property_usage: Hash.new(0)
  }

  doc.css('[style]').each_with_index do |element, index|
    styles = StyleExtractor.extract_styles(element)

    element_data = {
      index: index,
      tag: element.name,
      xpath: element.path,
      styles: styles,
      style_count: styles.length
    }

    styles.each_key do |property|
      style_data[:unique_properties] << property
      style_data[:property_usage][property] += 1
    end

    style_data[:elements] << element_data
  end

  style_data
end

# Analyze styles in a document
style_analysis = extract_all_styles(doc)

puts "Total elements with styles: #{style_analysis[:elements].length}"
puts "Unique CSS properties used: #{style_analysis[:unique_properties].to_a.sort.join(', ')}"
puts "\nMost common properties:"
style_analysis[:property_usage].sort_by { |_, count| -count }.first(5).each do |prop, count|
  puts "  #{prop}: #{count} times"
end

Integration with Web Scraping Workflows

When scraping websites, you might need to extract styles as part of your data collection process. Here's how you can integrate style extraction into a typical scraping workflow:

require 'nokogiri'
require 'open-uri'

def scrape_page_with_styles(url)
  doc = Nokogiri::HTML(URI.open(url))

  data = {
    title: doc.title,
    elements_with_styles: []
  }

  doc.css('[style]').each do |element|
    styles = StyleExtractor.extract_styles(element)

    element_data = {
      tag: element.name,
      text: element.text.strip[0, 100], # First 100 chars
      classes: element['class']&.split(' ') || [],
      styles: styles
    }

    data[:elements_with_styles] << element_data
  end

  data
end

# Usage (replace with actual URL)
# scraped_data = scrape_page_with_styles('https://example.com')

Performance Considerations

When working with large documents, narrow the selector so that Nokogiri only returns the elements you actually need:

# Efficient style extraction for large documents
def extract_styles_efficiently(doc, selector_filter = nil)
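  # Use more specific selectors when possible. [style] is appended to every
  # selector in a comma-separated list, so 'div, span' becomes
  # 'div[style], span[style]' rather than 'div, span[style]'.
  base_selector =
    if selector_filter
      selector_filter.split(',').map { |sel| "#{sel.strip}[style]" }.join(', ')
    else
      '[style]'
    end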

  doc.css(base_selector).map do |element|
    {
      element: element,
      styles: StyleExtractor.extract_styles(element)
    }
  end
end

# Example: Only extract styles from div and span elements
filtered_styles = extract_styles_efficiently(doc, 'div, span')

Common Use Cases

Extracting Color Information

require 'set'

def extract_color_palette(doc)
  colors = Set.new

  doc.css('[style]').each do |element|
    styles = StyleExtractor.extract_styles(element)

    # Extract various color properties
    %w[color background-color border-color].each do |prop|
      if styles[prop]
        colors << styles[prop]
      end
    end
  end

  colors.to_a
end
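
The palette above treats "#CCC", "#cccccc", and "Red" versus "red" as different colors because the strings differ. A minimal normalization sketch is shown below. The helper name normalize_color is hypothetical, and it only lowercases values and expands three-digit hex shorthand; it does not convert between named colors, rgb(), and hex.

```ruby
# Hypothetical helper: lowercase a color value and expand 3-digit hex
# shorthand (#abc -> #aabbcc). Named colors and rgb()/hsl() values are
# only lowercased, not converted.
def normalize_color(value)
  color = value.strip.downcase
  if (match = color.match(/\A#([0-9a-f])([0-9a-f])([0-9a-f])\z/))
    color = '#' + match.captures.map { |digit| digit * 2 }.join
  end
  color
end

palette = ['#CCC', '#cccccc', 'Red', 'red'].map { |c| normalize_color(c) }.uniq
p palette
# => ["#cccccc", "red"]
```

Applying a step like this before adding values to the set keeps the resulting palette free of trivial duplicates.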

Converting Inline Styles to CSS Classes

def generate_css_classes_from_inline_styles(doc)
  style_groups = {}
  class_counter = 1

  doc.css('[style]').each do |element|
    styles = StyleExtractor.extract_styles(element)
    style_key = styles.sort.to_h.to_s

    unless style_groups[style_key]
      style_groups[style_key] = {
        class_name: "generated-class-#{class_counter}",
        styles: styles,
        elements: []
      }
      class_counter += 1
    end

    style_groups[style_key][:elements] << element
  end

  style_groups
end
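
The grouping step returns a hash, not stylesheet text. As a sketch of the remaining step, the groups can be rendered into CSS rules as follows. The helper name render_css is hypothetical; it assumes the input has the same shape as the return value of generate_css_classes_from_inline_styles, and the sample hash stands in for real output.

```ruby
# Hypothetical helper: render groups shaped like the return value of
# generate_css_classes_from_inline_styles into stylesheet text.
def render_css(style_groups)
  style_groups.values.map do |group|
    declarations = group[:styles].map { |prop, value| "  #{prop}: #{value};" }
    ".#{group[:class_name]} {\n#{declarations.join("\n")}\n}"
  end.join("\n\n")
end

# Sample data standing in for real output of the grouping method
groups = {
  'group-1' => {
    class_name: 'generated-class-1',
    styles: { 'color' => 'red', 'margin' => '0' },
    elements: []
  }
}

puts render_css(groups)
# Prints:
# .generated-class-1 {
#   color: red;
#   margin: 0;
# }
```

The elements array in each group identifies which nodes would need the generated class name in place of their style attribute.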

Best Practices and Tips

  1. Always handle nil values: Style attributes might not exist or be empty
  2. Normalize property names: Convert to lowercase for consistent matching
  3. Handle malformed CSS: Real-world HTML often contains invalid CSS syntax
  4. Use specific selectors: Target only the elements you need for better performance
  5. Consider external stylesheets: The style attribute holds only inline declarations; rules from <style> blocks and external stylesheets are not reflected in it

Troubleshooting Common Issues

Missing or Empty Styles

# Check if element has any styling
def has_styling?(element)
  style_attr = element['style']
  return false if style_attr.nil? || style_attr.strip.empty?

  styles = StyleExtractor.extract_styles(element)
  !styles.empty?
end

Handling CSS Comments

def clean_css_comments(style_string)
  # Remove CSS comments /* ... */, including ones that span multiple lines
  style_string.gsub(%r{/\*.*?\*/}m, '')
end

Conclusion

Extracting inline styles from HTML elements using Nokogiri is straightforward once you understand the basic techniques. The key steps are:

  1. Access the style attribute using Nokogiri's attribute accessor
  2. Parse the CSS string by splitting on semicolons and colons
  3. Handle edge cases like malformed CSS or empty values
  4. Structure the data in a way that's useful for your application

Whether you're analyzing web design patterns, migrating styles to external CSS files, or extracting styling information for data analysis, these techniques will help you efficiently work with inline styles in your Ruby applications. For more complex web scraping scenarios involving dynamic content, you might also want to explore headless browser solutions for JavaScript-heavy websites.

Remember to handle errors gracefully and consider the performance implications when working with large documents. The techniques shown here provide a solid foundation for any style extraction needs in your web scraping or HTML processing projects.
