How can I extract inline styles from HTML elements using Nokogiri?
Extracting inline styles from HTML elements is a common task when scraping websites or analyzing web content. Nokogiri, the HTML/XML parser for Ruby, exposes an element's style attribute as a plain string; it does not parse the CSS inside it, so splitting that string into properties and values is up to you. This guide shows how to extract, parse, and work with inline styles.
Basic Style Attribute Extraction
The simplest way to extract inline styles is by accessing the style attribute of an element:
require 'nokogiri'

html = <<~HTML
  <div style="color: red; font-size: 16px; margin: 10px;">
    Hello World
  </div>
  <p style="background-color: blue; padding: 5px;">
    Paragraph text
  </p>
HTML

doc = Nokogiri::HTML(html)

# Extract style attribute from a specific element
div_element = doc.at_css('div')
style_string = div_element['style']
puts style_string
# Output: color: red; font-size: 16px; margin: 10px;

# Extract styles from all elements with style attributes
doc.css('[style]').each do |element|
  puts "#{element.name}: #{element['style']}"
end
Parsing Individual CSS Properties
To work with individual CSS properties, you'll need to parse the style string. Here's a helper method to convert the style string into a hash:
def parse_inline_styles(style_string)
  return {} if style_string.nil? || style_string.empty?

  styles = {}
  style_string.split(';').each do |declaration|
    next if declaration.strip.empty?

    property, value = declaration.split(':', 2)
    next unless property && value

    styles[property.strip] = value.strip
  end
  styles
end

# Usage example
html = '<div style="color: red; font-size: 16px; margin: 10px 5px;">Content</div>'
doc = Nokogiri::HTML(html)
element = doc.at_css('div')
styles = parse_inline_styles(element['style'])
puts styles
# Output: {"color"=>"red", "font-size"=>"16px", "margin"=>"10px 5px"}

# Access specific properties
puts "Color: #{styles['color']}"
puts "Font size: #{styles['font-size']}"
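The helper above lets a later declaration overwrite an earlier one, which matches how browsers resolve duplicates unless an earlier declaration is marked `!important`. If that distinction matters for your data, one possible variant is sketched below. The name `parse_styles_with_priority` and the `{ value:, important: }` shape are assumptions made for this illustration.

```ruby
# Hypothetical helper: like parse_inline_styles, but keeps the !important flag
# separate from the value and respects it when a property is repeated
def parse_styles_with_priority(style_string)
  styles = {}
  style_string.to_s.split(';').each do |declaration|
    property, value = declaration.split(':', 2)
    next unless property && value

    property = property.strip.downcase
    value = value.strip
    next if property.empty? || value.empty?

    important = value.match?(/!\s*important\z/i)
    value = value.sub(/\s*!\s*important\z/i, '') if important

    existing = styles[property]
    # An earlier !important declaration is not overridden by a later normal one
    next if existing && existing[:important] && !important

    styles[property] = { value: value, important: important }
  end
  styles
end

result = parse_styles_with_priority('color: red !important; color: blue; margin: 0')
puts result.inspect
```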
Advanced Style Extraction with Error Handling
A more robust version normalizes whitespace, lowercases property names, and skips declarations that lack a property or a value:
class StyleExtractor
  def self.extract_styles(element)
    return {} unless element.respond_to?(:[])

    style_attr = element['style']
    return {} if style_attr.nil? || style_attr.strip.empty?

    parse_style_string(style_attr)
  end

  # A bare `private` does not apply to `def self.` methods,
  # so mark the helper private explicitly
  private_class_method def self.parse_style_string(style_string)
    styles = {}
    # Collapse runs of whitespace (including newlines) and trim the string
    cleaned_string = style_string.gsub(/\s+/, ' ').strip
    declarations = cleaned_string.split(';')

    declarations.each do |declaration|
      next if declaration.strip.empty?

      parts = declaration.split(':', 2)
      next unless parts.length == 2

      property = parts[0].strip.downcase
      value = parts[1].strip
      # Skip empty properties or values
      next if property.empty? || value.empty?

      # Remove quotes only when they wrap the entire value
      value = value[1..-2] if value.match?(/\A(["'])[^"']*\1\z/)
      styles[property] = value
    end
    styles
  end
end

# Usage example with complex HTML
html = <<~HTML
  <div style="color: red; font-size: 16px; background-image: url('image.jpg'); border: 1px solid #ccc;">
    <span style="font-weight: bold; text-decoration: underline;">Bold text</span>
    <p style="margin: 0; padding: 10px 15px;">Paragraph</p>
  </div>
HTML

doc = Nokogiri::HTML(html)

# Extract styles from all elements
doc.css('[style]').each do |element|
  styles = StyleExtractor.extract_styles(element)
  puts "#{element.name.upcase} styles:"
  styles.each { |prop, value| puts "  #{prop}: #{value}" }
  puts
end
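One edge case that a plain split on semicolons does not cover is a semicolon inside a value, for example in a `url(data:...;base64,...)` reference or a quoted string. A minimal sketch of a splitter that tracks parentheses and quotes is shown below; `split_declarations` is a hypothetical helper name, and the sketch does not handle backslash-escaped quotes.

```ruby
# Hypothetical helper: split a style string on semicolons that are not inside
# parentheses or quotes, so values such as url(data:...;base64,...) stay intact
def split_declarations(style_string)
  declarations = []
  current = +''
  depth = 0
  quote = nil

  style_string.each_char do |char|
    if quote
      # Inside a quoted string: only the matching quote character ends it
      quote = nil if char == quote
      current << char
    elsif char == '"' || char == "'"
      quote = char
      current << char
    elsif char == '('
      depth += 1
      current << char
    elsif char == ')'
      depth -= 1 if depth.positive?
      current << char
    elsif char == ';' && depth.zero?
      declarations << current.strip unless current.strip.empty?
      current = +''
    else
      current << char
    end
  end

  declarations << current.strip unless current.strip.empty?
  declarations
end

style = 'background: url(data:image/png;base64,AAAA); color: red'
puts split_declarations(style).inspect
```

Each returned declaration can then be split on the first colon exactly as in the class above.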
Filtering and Searching for Specific Styles
Once styles are parsed into a hash, you can filter elements by property name or by an exact property value:
def find_elements_with_style_property(doc, property, value = nil)
  elements = []
  doc.css('[style]').each do |element|
    styles = StyleExtractor.extract_styles(element)
    if value.nil?
      # Just check if property exists
      elements << element if styles.key?(property)
    else
      # Check if property has specific value
      elements << element if styles[property] == value
    end
  end
  elements
end

# Find all elements whose color is exactly 'red'
red_elements = find_elements_with_style_property(doc, 'color', 'red')
puts "Found #{red_elements.length} red elements"

# Find all elements with any font-size property
font_sized_elements = find_elements_with_style_property(doc, 'font-size')
puts "Found #{font_sized_elements.length} elements with font-size"
Working with CSS Units and Values
When extracting styles, you might want to parse and work with CSS units:
def parse_css_value(value)
  # Match number and unit
  match = value.match(/^(-?\d*\.?\d+)([a-zA-Z%]*)$/)
  return { number: nil, unit: nil } unless match

  {
    number: match[1].include?('.') ? match[1].to_f : match[1].to_i,
    unit: match[2].empty? ? nil : match[2]
  }
end

# Example usage
html = '<div style="width: 100px; height: 50%; margin: 1.5em;">Content</div>'
doc = Nokogiri::HTML(html)
element = doc.at_css('div')
styles = StyleExtractor.extract_styles(element)

styles.each do |property, value|
  parsed = parse_css_value(value)
  if parsed[:number]
    puts "#{property}: #{parsed[:number]} #{parsed[:unit] || 'unitless'}"
  else
    puts "#{property}: #{value} (not a numeric value)"
  end
end
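The pattern above matches a single number with an optional unit, so shorthand values such as `10px 5px` fall into the non-numeric branch. One way to handle them, sketched here under the assumption that tokens are separated by whitespace, is to parse each token separately. `parse_css_tokens` is a hypothetical helper, and function values such as `calc(...)` or `rgb(...)` are out of scope for it.

```ruby
# Hypothetical helper: parse each whitespace-separated token of a value such as
# "10px 5px"; non-numeric tokens (e.g. "auto") are returned with number nil
def parse_css_tokens(value)
  value.strip.split(/\s+/).map do |token|
    match = token.match(/\A(-?\d*\.?\d+)([a-zA-Z%]*)\z/)
    if match
      number = match[1].include?('.') ? match[1].to_f : match[1].to_i
      { number: number, unit: match[2].empty? ? nil : match[2] }
    else
      { number: nil, unit: nil, raw: token }
    end
  end
end

puts parse_css_tokens('10px 5px').inspect
puts parse_css_tokens('0 auto').inspect
```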
Extracting Styles from Complex Documents
For larger documents, you might want to extract styles more systematically:
require 'set' # Set is not loaded automatically before Ruby 3.2

def extract_all_styles(doc)
  style_data = {
    elements: [],
    unique_properties: Set.new,
    property_usage: Hash.new(0)
  }

  doc.css('[style]').each_with_index do |element, index|
    styles = StyleExtractor.extract_styles(element)
    element_data = {
      index: index,
      tag: element.name,
      xpath: element.path,
      styles: styles,
      style_count: styles.length
    }

    styles.each_key do |property|
      style_data[:unique_properties] << property
      style_data[:property_usage][property] += 1
    end
    style_data[:elements] << element_data
  end
  style_data
end

# Analyze styles in a document
style_analysis = extract_all_styles(doc)
puts "Total elements with styles: #{style_analysis[:elements].length}"
puts "Unique CSS properties used: #{style_analysis[:unique_properties].to_a.sort.join(', ')}"
puts "\nMost common properties:"
style_analysis[:property_usage].sort_by { |_, count| -count }.first(5).each do |prop, count|
  puts "  #{prop}: #{count} times"
end
Integration with Web Scraping Workflows
When scraping websites, you might need to extract styles as part of your data collection process. Here's how you can integrate style extraction into a typical scraping workflow:
require 'nokogiri'
require 'open-uri'

def scrape_page_with_styles(url)
  doc = Nokogiri::HTML(URI.open(url))
  data = {
    title: doc.title,
    elements_with_styles: []
  }

  doc.css('[style]').each do |element|
    styles = StyleExtractor.extract_styles(element)
    element_data = {
      tag: element.name,
      text: element.text.strip[0...100], # First 100 chars
      classes: element['class']&.split(' ') || [],
      styles: styles
    }
    data[:elements_with_styles] << element_data
  end
  data
end

# Usage (replace with actual URL)
# scraped_data = scrape_page_with_styles('https://example.com')
Performance Considerations
When working with large documents, the simplest optimization is to narrow the selector so that fewer elements are visited and parsed:
# Efficient style extraction for large documents
def extract_styles_efficiently(doc, selector_filter = nil)
  # Use more specific selectors when possible. [style] is appended to every
  # comma-separated selector; otherwise 'div, span' would match all divs.
  base_selector =
    if selector_filter
      selector_filter.split(',').map { |selector| "#{selector.strip}[style]" }.join(', ')
    else
      '[style]'
    end

  doc.css(base_selector).map do |element|
    {
      element: element,
      styles: StyleExtractor.extract_styles(element)
    }
  end
end

# Example: Only extract styles from div and span elements
filtered_styles = extract_styles_efficiently(doc, 'div, span')
Common Use Cases
Extracting Color Information
require 'set' # Set is not loaded automatically before Ruby 3.2

def extract_color_palette(doc)
  colors = Set.new
  doc.css('[style]').each do |element|
    styles = StyleExtractor.extract_styles(element)
    # Extract various color properties
    %w[color background-color border-color].each do |prop|
      colors << styles[prop] if styles[prop]
    end
  end
  colors.to_a
end
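The palette above treats `#FFF`, `#fff`, and `#ffffff` as three different colors because the values are compared as raw strings. A small normalization step, sketched below, can reduce such duplicates. `normalize_color` is a hypothetical helper; it only lowercases values and expands three-digit hex codes, and it does not convert named colors or `rgb()` notation.

```ruby
# Hypothetical helper: normalize hex colors so "#FFF", "#fff" and "#ffffff"
# count as the same palette entry; other notations are only lowercased
def normalize_color(value)
  color = value.strip.downcase
  if color.match?(/\A#[0-9a-f]{3}\z/)
    # Expand shorthand: #abc -> #aabbcc
    '#' + color[1..-1].chars.map { |c| c * 2 }.join
  else
    color
  end
end

puts %w[#FFF #fff #ffffff Red].map { |c| normalize_color(c) }.uniq.inspect
```

Applying it to each value before adding it to the set keeps the palette free of case and shorthand duplicates.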
Converting Inline Styles to CSS Classes
def generate_css_classes_from_inline_styles(doc)
  style_groups = {}
  class_counter = 1

  doc.css('[style]').each do |element|
    styles = StyleExtractor.extract_styles(element)
    style_key = styles.sort.to_h.to_s

    unless style_groups[style_key]
      style_groups[style_key] = {
        class_name: "generated-class-#{class_counter}",
        styles: styles,
        elements: []
      }
      class_counter += 1
    end
    style_groups[style_key][:elements] << element
  end
  style_groups
end
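The grouped hash can then be turned into stylesheet text. The following sketch assumes the hash shape returned by `generate_css_classes_from_inline_styles` (a `:class_name` and a `:styles` hash per group); `render_css` is a hypothetical helper, and the sample input is made up for illustration.

```ruby
# Hypothetical helper: render grouped styles as CSS rules.
# Assumes each group has :class_name (String) and :styles (Hash).
def render_css(style_groups)
  rules = style_groups.values.map do |group|
    declarations = group[:styles].map { |prop, value| "  #{prop}: #{value};" }
    ".#{group[:class_name]} {\n#{declarations.join("\n")}\n}"
  end
  rules.join("\n\n")
end

# Made-up sample input in the same shape as the method above returns
sample_groups = {
  'key-1' => {
    class_name: 'generated-class-1',
    styles: { 'color' => 'red', 'margin' => '0' },
    elements: []
  }
}
puts render_css(sample_groups)
```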
Best Practices and Tips
- Always handle nil values: Style attributes might not exist or be empty
- Normalize property names: Convert to lowercase for consistent matching
- Handle malformed CSS: Real-world HTML often contains invalid CSS syntax
- Use specific selectors: Target only the elements you need for better performance
- Consider external stylesheets: The style attribute holds only inline declarations, not rules from style blocks or external CSS files
Troubleshooting Common Issues
Missing or Empty Styles
# Check if element has any styling
def has_styling?(element)
  style_attr = element['style']
  return false if style_attr.nil? || style_attr.strip.empty?

  styles = StyleExtractor.extract_styles(element)
  !styles.empty?
end
Handling CSS Comments
def clean_css_comments(style_string)
  # Remove CSS comments /* ... */, including ones that span several lines
  style_string.gsub(%r{/\*.*?\*/}m, '')
end
Conclusion
Extracting inline styles from HTML elements using Nokogiri is straightforward once you understand the basic techniques. The key steps are:
- Access the style attribute using Nokogiri's attribute accessor
- Parse the CSS string by splitting on semicolons and colons
- Handle edge cases like malformed CSS or empty values
- Structure the data in a way that's useful for your application
Whether you're analyzing web design patterns, migrating styles to external CSS files, or extracting styling information for data analysis, these techniques will help you efficiently work with inline styles in your Ruby applications. For more complex web scraping scenarios involving dynamic content, you might also want to explore headless browser solutions for JavaScript-heavy websites.
Remember to handle errors gracefully and consider the performance implications when working with large documents. The techniques shown here provide a solid foundation for any style extraction needs in your web scraping or HTML processing projects.