Nokogiri provides several methods to remove nodes from HTML and XML documents. The primary methods are remove(), its alias unlink(), and replace(), which swaps a node for new content instead of only deleting it.
Basic Node Removal Methods
The remove() Method
The most common approach is the remove() method, which detaches the node from the document and returns the removed node:
require 'nokogiri'
html = <<-HTML
<!DOCTYPE html>
<html>
<body>
<h1>Main Heading</h1>
<p class="delete-me">This will be removed</p>
<p>This will remain</p>
<div id="sidebar">Sidebar content</div>
</body>
</html>
HTML
doc = Nokogiri::HTML(html)
# Remove a single node by class
doc.at_css('.delete-me').remove
# Remove a single node by ID
doc.at_css('#sidebar')&.remove # Safe navigation operator
puts doc.to_html
The unlink() Method
unlink() and remove() are aliases for the same operation and behave identically:
# These are equivalent
node.remove
node.unlink
Removing Multiple Nodes
When removing multiple nodes, iterate through the NodeSet:
doc = Nokogiri::HTML(html)
# Remove all paragraphs
doc.css('p').each(&:remove)
# Remove all elements with a specific class
doc.css('.unwanted').each(&:remove)
# Remove elements with no text content (this also matches void elements such as <img> and <br>)
doc.css('*').each { |node| node.remove if node.content.strip.empty? }
Advanced Removal Techniques
Conditional Node Removal
Remove nodes based on their content or attributes:
html = <<-HTML
<div>
<p>Keep this paragraph</p>
<p>Remove this paragraph</p>
<span data-temp="true">Temporary element</span>
<img src="placeholder.jpg" alt="Remove me">
</div>
HTML
doc = Nokogiri::HTML::DocumentFragment.parse(html)
# Remove nodes containing specific text
doc.css('p').each do |p|
  p.remove if p.text.include?('Remove')
end
# Remove nodes with specific attributes
doc.css('[data-temp]').each(&:remove)
# Remove images with specific src patterns
doc.css('img').each do |img|
  img.remove if img['src']&.include?('placeholder')
end
Using XPath for Complex Removal
XPath provides more powerful selection capabilities:
doc = Nokogiri::HTML(html)
# Remove all empty paragraphs
doc.xpath('//p[not(normalize-space())]').each(&:remove)
# Remove all divs that contain only whitespace
doc.xpath('//div[not(normalize-space()) and not(*)]').each(&:remove)
# Remove all links whose href starts with "http" (absolute URLs, which are usually external)
doc.xpath('//a[starts-with(@href, "http")]').each(&:remove)
Node Replacement vs. Removal
Using the replace() Method
Sometimes you want to replace a node rather than just remove it:
doc = Nokogiri::HTML(html)
# Replace with new content
old_node = doc.at_css('h1')
old_node.replace('<h2>New Heading</h2>') if old_node
# Replace with empty string (effectively removes)
doc.at_css('.delete-me')&.replace('')
# Replace with text node
doc.at_css('p').replace(Nokogiri::XML::Text.new('Plain text', doc))
Removing Content but Keeping Structure
To remove only the content while preserving the element:
# Clear content but keep the element
doc.at_css('div').content = ''
# Remove all child nodes but keep the parent
doc.at_css('div').children.remove
Working with XML Documents
The same methods work with XML documents:
xml = <<-XML
<?xml version="1.0"?>
<root>
<item id="1">Keep</item>
<item id="2" delete="true">Remove</item>
<metadata>
<created>2023-01-01</created>
<temp>Remove this</temp>
</metadata>
</root>
XML
doc = Nokogiri::XML(xml)
# Remove nodes with specific attributes
doc.xpath('//item[@delete="true"]').each(&:remove)
# Remove temporary metadata
doc.at_xpath('//temp')&.remove
puts doc.to_xml
Error Handling and Best Practices
Safe Removal with Error Handling
at_css and at_xpath return nil when nothing matches, so check that a node exists before calling remove on it:
# Method 1: Using conditional
node = doc.at_css('.maybe-exists')
node.remove if node
# Method 2: Using safe navigation (Ruby 2.3+)
doc.at_css('.maybe-exists')&.remove
# Method 3: Using rescue (least precise: it also swallows unrelated NoMethodErrors)
begin
  doc.at_css('.target').remove
rescue NoMethodError
  puts "Node not found"
end
Preserving Original Document
Node removal is destructive. To preserve the original:
# Method 1: Work on a duplicate
original_doc = Nokogiri::HTML(html)
working_doc = original_doc.dup
working_doc.css('.remove-me').each(&:remove)
# Method 2: Re-parse when needed
def remove_nodes_safely(html_string, selector)
  doc = Nokogiri::HTML(html_string)
  doc.css(selector).each(&:remove)
  doc.to_html
end
Performance Considerations
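The NodeSet returned by css or xpath is a snapshot of the matches, not a live view of the document, so removing nodes while iterating over it is safe. Collecting the nodes into an array first is optional, but it keeps the select and remove steps separate:
# Collect first, then remove
nodes_to_remove = doc.css('p.unwanted').to_a
nodes_to_remove.each(&:remove)
# Reverse iteration also works
doc.css('p.unwanted').reverse_each(&:remove)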
Common Use Cases
Cleaning HTML for Display
# Remove script and style elements (this alone does not fully sanitize untrusted HTML)
doc.css('script, style').each(&:remove)
# Remove comments
doc.xpath('//comment()').each(&:remove)
# Remove paragraphs and divs with no text content
doc.css('p, div').each do |element|
  element.remove if element.content.strip.empty?
end
Extracting Specific Content
# Remove navigation and sidebar, keep main content
doc.css('nav, .sidebar, footer').each(&:remove)
# Remove all attributes except specific ones
allowed_attrs = %w[href src alt title]
doc.css('*').each do |element|
  element.attributes.each do |name, attr|
    attr.remove unless allowed_attrs.include?(name)
  end
end
Remember that node removal operations are permanent for that document object. Always test your selectors carefully and consider backing up important data before performing bulk removal operations.