How can I modify existing HTML content using Nokogiri?

Nokogiri is a Ruby gem for parsing HTML and XML. Beyond extracting data, it lets you modify a parsed document: update text, add or remove elements, change attributes, or restructure entire sections.

Understanding Nokogiri's Modification Capabilities

Nokogiri represents a parsed HTML document as a mutable tree of nodes, so you can modify it in place. Once you've parsed a document, you can:

Add, remove, or replace elements
Modify element attributes
Change text content
Restructure the document hierarchy
Insert new content at specific positions

Setting Up Nokogiri

First, ensure you have Nokogiri installed in your Ruby environment:

gem install nokogiri

Or add it to your Gemfile:

gem 'nokogiri'

Basic HTML Modification Operations

Loading and Parsing HTML

Before modifying HTML content, you need to parse it into a Nokogiri document:

require 'nokogiri'

# Parse HTML from a string
html_string = '<html><body><h1>Original Title</h1><p>Original content</p></body></html>'
doc = Nokogiri::HTML(html_string)

# Parse HTML from a file (the block form closes the file handle afterwards)
doc = File.open('example.html') { |file| Nokogiri::HTML(file) }

# Parse HTML from a URL (requires open-uri)
require 'open-uri'
doc = Nokogiri::HTML(URI.open('https://example.com'))

Modifying Text Content

The simplest modification is changing the text content of existing elements:

# Change text content of an element
title = doc.at_css('h1')
title.content = 'Updated Title'

# Modify multiple elements
doc.css('p').each do |paragraph|
  paragraph.content = paragraph.content.upcase
end

# inner_text and text are read-only aliases of content; assign through content=
doc.at_css('h1').content = 'New Title'

Modifying Attributes

Nokogiri provides several methods to work with element attributes:

# Set a single attribute
link = doc.at_css('a')
link['href'] = 'https://newurl.com'
link['target'] = '_blank'

# set_attribute is equivalent to []= and sets one attribute per call
link.set_attribute('class', 'external-link')
link.set_attribute('rel', 'noopener')

# Remove attributes
link.remove_attribute('title')

# Check if attribute exists
if link.has_attribute?('data-id')
  puts "Element has data-id attribute"
end

Adding New Elements

You can create and insert new elements at various positions:

# Create a new element
new_paragraph = Nokogiri::XML::Node.new('p', doc)
new_paragraph.content = 'This is a new paragraph'

# Add element as the last child
body = doc.at_css('body')
body.add_child(new_paragraph)

# Create element with attributes
new_div = Nokogiri::XML::Node.new('div', doc)
new_div['class'] = 'highlight'
new_div['id'] = 'important-section'
new_div.content = 'Important content'

# Insert at specific positions
first_p = doc.at_css('p')
first_p.add_previous_sibling(new_div)  # Before the first paragraph
# A node has only one parent, so this moves new_paragraph from the end of body
first_p.add_next_sibling(new_paragraph)  # After the first paragraph

Creating Complex HTML Structures

For more complex HTML structures, you can use Nokogiri's builder pattern:

# Using Nokogiri::HTML::Builder
# (the block parameter is the builder, so give it a name that does not shadow doc)
new_section = Nokogiri::HTML::Builder.new do |html|
  html.div(class: 'card') {
    html.h2 'Card Title'
    html.p 'Card description goes here'
    html.a 'Read More', href: '#', class: 'btn btn-primary'
  }
end

# Insert the built HTML
body = doc.at_css('body')
body.add_child(new_section.doc.root)

Advanced Modification Techniques

Replacing Elements

Sometimes you need to completely replace an existing element:

# Replace an element with new content
old_element = doc.at_css('.old-content')
new_element = Nokogiri::XML::Node.new('div', doc)
new_element['class'] = 'new-content'
new_element.content = 'Updated content'

old_element.replace(new_element)

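# Replace with an HTML string
# (old_element is detached after the replace above, so target a node that is still in the document)
doc.at_css('.new-content').replace('<div class="updated">New HTML content</div>')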

Removing Elements

Remove unwanted elements from the document:

# Remove a single element (&. skips the call when nothing matches)
doc.at_css('.advertisement')&.remove

# Remove multiple elements
doc.css('.spam, .unwanted').remove

# Remove all script tags
doc.css('script').remove

# Remove empty paragraphs
doc.css('p').each do |paragraph|
  paragraph.remove if paragraph.content.strip.empty?
end

Modifying HTML Structure

Reorganize the document structure by moving elements:

# Move an element to a different parent
sidebar = doc.at_css('#sidebar')
main_content = doc.at_css('#main')
article = doc.at_css('article')

# Move article from main to sidebar
article.parent = sidebar

# Clone an element before moving
cloned_article = article.dup
main_content.add_child(cloned_article)

Working with Forms

Modify form elements and their values:

# Update form action and method
form = doc.at_css('form')
form['action'] = '/new-endpoint'
form['method'] = 'POST'

# Modify input values
doc.css('input[type="text"]').each do |input|
  input['value'] = 'Default value'
end

# Add hidden inputs
hidden_input = Nokogiri::XML::Node.new('input', doc)
hidden_input['type'] = 'hidden'
hidden_input['name'] = 'csrf_token'
hidden_input['value'] = 'abc123'
form.add_child(hidden_input)

# Update select options
select = doc.at_css('select[name="country"]')
select.css('option').remove  # Clear existing options

['US', 'UK', 'CA'].each do |country|
  option = Nokogiri::XML::Node.new('option', doc)
  option['value'] = country
  option.content = country
  select.add_child(option)
end

Practical Examples

Example 1: Adding Social Media Meta Tags

require 'nokogiri'

html = '<html><head><title>Original Title</title></head><body></body></html>'
doc = Nokogiri::HTML(html)

# Add Open Graph meta tags
head = doc.at_css('head')

meta_tags = [
  { property: 'og:title', content: 'Amazing Article Title' },
  { property: 'og:description', content: 'This article covers amazing topics' },
  { property: 'og:image', content: 'https://example.com/image.jpg' },
  { property: 'og:url', content: 'https://example.com/article' }
]

meta_tags.each do |tag|
  meta = Nokogiri::XML::Node.new('meta', doc)
  meta['property'] = tag[:property]
  meta['content'] = tag[:content]
  head.add_child(meta)
end

puts doc.to_html

Example 2: Creating a Navigation Menu

# Create a dynamic navigation menu
nav_items = [
  { title: 'Home', url: '/' },
  { title: 'About', url: '/about' },
  { title: 'Contact', url: '/contact' }
]

nav = Nokogiri::XML::Node.new('nav', doc)
nav['class'] = 'main-navigation'

ul = Nokogiri::XML::Node.new('ul', doc)
nav.add_child(ul)

nav_items.each do |item|
  li = Nokogiri::XML::Node.new('li', doc)
  a = Nokogiri::XML::Node.new('a', doc)
  a['href'] = item[:url]
  a.content = item[:title]
  li.add_child(a)
  ul.add_child(li)
end

# Insert navigation into the document
body = doc.at_css('body')
body.prepend_child(nav)  # Works even when body has no children yet

Example 3: Content Sanitization

# Remove potentially dangerous elements and attributes
doc.css('script, object, embed, iframe').remove

# Remove inline event-handler attributes (onclick, onload, onerror, ...)
doc.css('*').each do |element|
  element.keys.each do |name|
    element.remove_attribute(name) if name.downcase.start_with?('on')
  end
end

# Sanitize links
doc.css('a').each do |link|
  href = link['href']
  if href && !href.start_with?('http://', 'https://', '/')
    link.remove_attribute('href')
  end
end

Working with JavaScript-Rendered Content

When dealing with modern web applications that rely heavily on JavaScript, you might need to combine Nokogiri with browser automation tools. Nokogiri only sees the markup it is given and does not execute scripts, so for content that loads after the initial page load you can first capture the rendered HTML with a tool such as Selenium or Puppeteer, then modify it with Nokogiri.

# Example: Processing content after JavaScript rendering
require 'nokogiri'
require 'selenium-webdriver'

# Get HTML after JavaScript execution
driver = Selenium::WebDriver.for :chrome
driver.get('https://example.com')
rendered_html = driver.page_source
driver.quit

# Now modify with Nokogiri
doc = Nokogiri::HTML(rendered_html)
# Perform modifications as shown above

Performance Considerations and Best Practices

Memory Management

When working with large documents, be mindful of memory usage:

# Parsing builds the whole DOM in memory, so avoid keeping several large documents alive at once
large_doc = Nokogiri::HTML(large_html_content)

# Batch operations when possible
elements_to_modify = large_doc.css('.target-class')
elements_to_modify.each do |element|
  # Perform modifications
  element['class'] = 'modified'
end

# Drop the reference when done so the document can be garbage collected
large_doc = nil
GC.start  # Rarely necessary; Ruby normally collects unreferenced objects on its own

Validation and Error Handling

Always validate your modifications and handle potential errors:

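begin
  element = doc.at_css('#target-element')

  if element
    element.content = 'Replacement text'
  else
    puts "Target element not found"
  end
rescue Nokogiri::CSS::SyntaxError => e
  # Raised when the CSS selector itself is malformed
  puts "Invalid CSS selector: #{e.message}"
rescue => e
  puts "Unexpected error: #{e.message}"
end

# The HTML parser recovers from malformed markup instead of raising;
# the problems it found are collected in doc.errors
doc.errors.each { |error| puts "Parse warning: #{error.message}" }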

Chaining Operations

Many Nokogiri methods return an object you can keep calling methods on, which makes for more concise code:

# Method chaining for element creation
new_element = Nokogiri::XML::Node.new('div', doc).tap do |div|
  div['class'] = 'highlight'
  div.content = 'Important note'
end

# Chain multiple modifications (NodeSet#each returns the NodeSet itself)
doc.css('.article')
   .each { |article| article['data-processed'] = 'true' }
   .first&.add_child(new_element)

Outputting Modified HTML

After making your modifications, you can output the updated HTML in various formats:

# Output complete HTML document
puts doc.to_html

# Output specific elements
puts doc.at_css('body').to_html

# Pretty print with indentation (to_xhtml tends to honor :indent more reliably than to_html)
puts doc.to_xhtml(indent: 2)

# Output as XML (if working with XML documents)
puts doc.to_xml

# Save to file
File.write('modified.html', doc.to_html)

# Output only the body's inner markup (no doctype, <html>, or <body> tags)
puts doc.at_css('body').inner_html

Integration with Authentication Workflows

For protected content that requires login, you might need to handle authentication before accessing and modifying the HTML. Once authenticated and the content is retrieved, Nokogiri can process and modify it as needed.

# Example: Modifying content after authentication
require 'mechanize'
require 'nokogiri'

agent = Mechanize.new
# Perform login
page = agent.post('https://example.com/login', username: 'user', password: 'pass')

# Get protected content
content_page = agent.get('https://example.com/protected-content')

# Parse and modify with Nokogiri
doc = Nokogiri::HTML(content_page.body)
# Perform modifications...

Conclusion

Nokogiri provides a comprehensive API for modifying HTML in Ruby, from simple text changes to structural rewrites. That makes it a solid foundation for web scrapers, content management systems, and other HTML processing tools.

Key takeaways for effective HTML modification with Nokogiri:

Always parse HTML before attempting modifications
Use appropriate methods for different types of changes (content, attributes, structure)
Handle errors gracefully and validate modifications
Consider performance implications when working with large documents
Combine with other tools when dealing with JavaScript-heavy applications

With these techniques and best practices, you'll be able to effectively modify HTML content using Nokogiri in your Ruby applications, creating robust solutions for web scraping, content processing, and document transformation tasks.
