How do I remove nodes from a document with Nokogiri?

Nokogiri provides several methods to remove nodes from HTML and XML documents. The primary ones are remove() (with its alias unlink()), which detaches a node, and replace(), which swaps a node for other content.

Basic Node Removal Methods

The remove() Method

The most common approach is using the remove() method, which removes the node from the document and returns the removed node:

require 'nokogiri'

html = <<-HTML
<!DOCTYPE html>
<html>
<body>
    <h1>Main Heading</h1>
    <p class="delete-me">This will be removed</p>
    <p>This will remain</p>
    <div id="sidebar">Sidebar content</div>
</body>
</html>
HTML

doc = Nokogiri::HTML(html)

# Remove a single node by class
doc.at_css('.delete-me').remove

# Remove a single node by ID
doc.at_css('#sidebar')&.remove  # Safe navigation operator

puts doc.to_html

The unlink() Method

unlink() and remove() are two names for the same operation and behave identically:

# These are equivalent
node.remove
node.unlink

Removing Multiple Nodes

When removing multiple nodes, iterate through the NodeSet:

doc = Nokogiri::HTML(html)

# Remove all paragraphs
doc.css('p').each(&:remove)

# Remove all elements with a specific class
doc.css('.unwanted').each(&:remove)

# Remove all elements with no text content
# (caution: this also matches void elements such as <img> and <br>)
doc.css('*').each { |node| node.remove if node.content.strip.empty? }

Advanced Removal Techniques

Conditional Node Removal

Remove nodes based on their content or attributes:

html = <<-HTML
<div>
    <p>Keep this paragraph</p>
    <p>Remove this paragraph</p>
    <span data-temp="true">Temporary element</span>
    <img src="placeholder.jpg" alt="Remove me">
</div>
HTML

doc = Nokogiri::HTML::DocumentFragment.parse(html)

# Remove nodes containing specific text
doc.css('p').each do |p|
  p.remove if p.text.include?('Remove')
end

# Remove nodes with specific attributes
doc.css('[data-temp]').each(&:remove)

# Remove images with specific src patterns
doc.css('img').each do |img|
  img.remove if img['src']&.include?('placeholder')
end

Using XPath for Complex Removal

XPath provides more powerful selection capabilities:

doc = Nokogiri::HTML(html)

# Remove all empty paragraphs
doc.xpath('//p[not(normalize-space())]').each(&:remove)

# Remove all divs that contain only whitespace
doc.xpath('//div[not(normalize-space()) and not(*)]').each(&:remove)

# Remove all links with absolute http(s) URLs
# (this also matches absolute links that point back to your own site)
doc.xpath('//a[starts-with(@href, "http")]').each(&:remove)

Node Replacement vs. Removal

Using replace() Method

Sometimes you want to replace a node rather than just remove it:

doc = Nokogiri::HTML(html)

# Replace with new content
old_node = doc.at_css('h1')
old_node.replace('<h2>New Heading</h2>') if old_node

# Replace with empty string (effectively removes)
doc.at_css('.delete-me')&.replace('')

# Replace with text node (guarded in case no <p> is left)
doc.at_css('p')&.replace(Nokogiri::XML::Text.new('Plain text', doc))

Removing Content but Keeping Structure

To remove only the content while preserving the element:

# Clear content but keep the element
doc.at_css('div').content = ''

# Remove all child nodes but keep the parent
doc.at_css('div').children.remove

Working with XML Documents

The same methods work with XML documents:

xml = <<-XML
<?xml version="1.0"?>
<root>
    <item id="1">Keep</item>
    <item id="2" delete="true">Remove</item>
    <metadata>
        <created>2023-01-01</created>
        <temp>Remove this</temp>
    </metadata>
</root>
XML

doc = Nokogiri::XML(xml)

# Remove nodes with specific attributes
doc.xpath('//item[@delete="true"]').each(&:remove)

# Remove temporary metadata
doc.at_xpath('//temp')&.remove

puts doc.to_xml

Error Handling and Best Practices

Safe Removal with Error Handling

at_css and at_xpath return nil when nothing matches, so check that a node exists before calling remove on it:

# Method 1: Using conditional
node = doc.at_css('.maybe-exists')
node.remove if node

# Method 2: Using safe navigation (Ruby 2.3+)
doc.at_css('.maybe-exists')&.remove

# Method 3: Using rescue (least preferred: it also hides unrelated NoMethodErrors)
begin
  doc.at_css('.target').remove
rescue NoMethodError
  puts "Node not found"
end

Preserving Original Document

Node removal is destructive. To preserve the original:

# Method 1: Work on a duplicate
original_doc = Nokogiri::HTML(html)
working_doc = original_doc.dup
working_doc.css('.remove-me').each(&:remove)

# Method 2: Re-parse when needed
def remove_nodes_safely(html_string, selector)
  doc = Nokogiri::HTML(html_string)
  doc.css(selector).each(&:remove)
  doc.to_html
end

Performance Considerations

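A NodeSet returned by css or xpath is a snapshot of the matches, so removing nodes while iterating over it is safe. Converting it to an array or iterating in reverse is optional, although some codebases prefer to make the snapshot explicit:

# Make the snapshot explicit before removing
nodes_to_remove = doc.css('p.unwanted').to_a
nodes_to_remove.each(&:remove)

# Or use reverse iteration
doc.css('p.unwanted').reverse_each(&:remove)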

Common Use Cases

Cleaning HTML for Display

# Remove script and style tags
# (not a complete XSS defense on its own: inline event handlers remain)
doc.css('script, style').each(&:remove)

# Remove comments
doc.xpath('//comment()').each(&:remove)

# Remove empty paragraphs and divs
doc.css('p, div').each do |element|
  element.remove if element.content.strip.empty?
end

Extracting Specific Content

# Remove navigation and sidebar, keep main content
doc.css('nav, .sidebar, footer').each(&:remove)

# Remove all attributes except specific ones
allowed_attrs = %w[href src alt title]
doc.css('*').each do |element|
  element.attributes.each do |name, attr|
    attr.remove unless allowed_attrs.include?(name)
  end
end

Remember that node removal mutates the document object in place and cannot be undone on that object. Test your selectors carefully, and work on a duplicate or keep the original markup string if you may need the unmodified document after a bulk removal.
