How do I install and set up Nokogiri for web scraping in Ruby?
Nokogiri is one of the most popular and powerful HTML/XML parsing libraries for Ruby, making it an essential tool for web scraping projects. This comprehensive guide will walk you through the installation process, configuration options, and basic setup to get you started with web scraping using Nokogiri.
What is Nokogiri?
Nokogiri is a Ruby gem that provides an HTML/XML/SAX/Reader parser with XPath and CSS selector support. Built on top of libxml2 and libxslt, it offers excellent performance and reliability for parsing web content. The name "Nokogiri" means "saw" in Japanese, reflecting its ability to "cut through" HTML and XML documents.
Installation Methods
Basic Installation with RubyGems
The simplest way to install Nokogiri is using the gem command:
gem install nokogiri
Installation with Bundler (Recommended)
For most Ruby projects, it's recommended to use Bundler to manage dependencies. Add Nokogiri to your Gemfile:
# Gemfile
source 'https://rubygems.org'
gem 'nokogiri', '~> 1.15'
Then run:
bundle install
Installation with Specific Versions
If you need a specific version of Nokogiri, you can specify it during installation:
gem install nokogiri -v 1.15.4
Platform-Specific Installation
macOS Installation
On macOS, you might encounter issues with native extensions. Here are the recommended approaches:
# Install using Homebrew dependencies
brew install libxml2 libxslt
gem install nokogiri -- --use-system-libraries
# Or force compilation from source instead of the precompiled native gem
gem install nokogiri --platform=ruby
Ubuntu/Debian Installation
On Ubuntu or Debian systems, install the required development libraries first:
sudo apt-get update
sudo apt-get install build-essential patch ruby-dev zlib1g-dev liblzma-dev libxml2-dev libxslt1-dev
gem install nokogiri
CentOS/RHEL Installation
For CentOS or RHEL systems:
sudo yum install -y gcc ruby-devel libxml2-devel libxslt-devel
gem install nokogiri
Windows Installation
On Windows, Nokogiri typically installs without issues using the pre-compiled binary:
gem install nokogiri
Basic Setup and Configuration
Requiring Nokogiri
Once installed, you can start using Nokogiri in your Ruby scripts:
require 'nokogiri'
require 'open-uri'
# Basic usage example
html_content = URI.open('https://example.com').read
doc = Nokogiri::HTML(html_content)
Setting Up a Web Scraping Environment
Here's a complete setup for a basic web scraping project:
require 'nokogiri'
require 'net/http'
require 'uri'
class WebScraper
  def initialize
    @user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
  end

  def fetch_page(url)
    uri = URI(url)
    http = Net::HTTP.new(uri.host, uri.port)
    http.use_ssl = true if uri.scheme == 'https'
    request = Net::HTTP::Get.new(uri)
    request['User-Agent'] = @user_agent
    response = http.request(request)
    if response.code == '200'
      Nokogiri::HTML(response.body)
    else
      raise "Failed to fetch page: #{response.code}"
    end
  end

  def parse_content(doc)
    # Your parsing logic here
    title = doc.css('title').text
    puts "Page title: #{title}"
  end
end
# Usage
scraper = WebScraper.new
doc = scraper.fetch_page('https://example.com')
scraper.parse_content(doc)
Advanced Configuration Options
Parser Options
Nokogiri provides various parser options for different scenarios:
# Nokogiri::HTML is lenient by default and recovers from malformed markup
doc = Nokogiri::HTML(html_content)
# The same recovery behavior can be requested explicitly in a configuration block
doc = Nokogiri::HTML(html_content) do |config|
  config.recover
end
# Custom parser configuration
doc = Nokogiri::HTML(html_content) do |config|
  config.noblanks # Remove blank text nodes
  config.noent    # Substitute entities
  config.recover  # Recover from errors
end
Encoding Handling
Proper encoding handling is crucial for international content:
# Specify encoding explicitly
doc = Nokogiri::HTML(html_content, nil, 'UTF-8')
# Auto-detect encoding
doc = Nokogiri::HTML(html_content)
puts "Document encoding: #{doc.encoding}"
# Handle different encodings
def parse_with_encoding(html_content, encoding = nil)
  if encoding
    Nokogiri::HTML(html_content, nil, encoding)
  else
    Nokogiri::HTML(html_content)
  end
end
Common Parsing Techniques
CSS Selectors
Nokogiri supports CSS selectors for easy element selection:
doc = Nokogiri::HTML(html_content)
# Select elements by CSS
titles = doc.css('h1, h2, h3')
links = doc.css('a[href]')
images = doc.css('img[src]')
# Extract text and attributes
titles.each do |title|
  puts title.text.strip
end
links.each do |link|
  puts "#{link.text} -> #{link['href']}"
end
XPath Queries
For more complex selections, use XPath:
# XPath examples
doc.xpath('//div[@class="content"]//p').each do |paragraph|
  puts paragraph.text
end
# Select elements with specific text
doc.xpath('//a[contains(text(), "Download")]').each do |link|
  puts link['href']
end
# Complex XPath queries
price_elements = doc.xpath('//span[@class="price" or @class="cost"]')
Troubleshooting Common Issues
Installation Problems
If you encounter compilation errors:
# Clear gem cache and reinstall
gem uninstall nokogiri
gem install nokogiri -- --use-system-libraries
# Or try a prerelease build
gem install nokogiri --pre
Memory Management
For large documents or long-running scrapers:
# Explicitly remove document references
def process_large_document(url)
  doc = fetch_page(url)
  result = extract_data(doc)
  doc = nil # Help garbage collection
  GC.start  # Force garbage collection
  result
end
Character Encoding Issues
# Handle encoding problems
def safe_parse(html_content)
  Nokogiri::HTML(html_content)
rescue Encoding::UndefinedConversionError
  # Retry after replacing invalid and undefined byte sequences
  html_content = html_content.encode('UTF-8', invalid: :replace, undef: :replace)
  Nokogiri::HTML(html_content)
end
Integration with HTTP Libraries
Using with HTTParty
require 'nokogiri'
require 'httparty'
class ScraperWithHTTParty
  include HTTParty

  def fetch_and_parse(url)
    response = self.class.get(url)
    Nokogiri::HTML(response.body)
  end
end
Using with Faraday
require 'nokogiri'
require 'faraday'
connection = Faraday.new do |conn|
  conn.adapter Faraday.default_adapter
end
response = connection.get('https://example.com')
doc = Nokogiri::HTML(response.body)
Performance Optimization
Efficient Parsing Strategies
# Use fragment parsing for small HTML snippets
fragment = Nokogiri::HTML::DocumentFragment.parse('<div>Content</div>')
# Stream parsing for very large documents
class SAXHandler < Nokogiri::XML::SAX::Document
  def start_element(name, attributes = [])
    @current_product = {} if name == 'product'
  end

  def characters(string)
    # Process character data
  end

  def end_element(name)
    process_product(@current_product) if name == 'product'
  end
end
parser = Nokogiri::XML::SAX::Parser.new(SAXHandler.new)
parser.parse(large_xml_file)
Best Practices
- Always handle exceptions when fetching and parsing content
- Respect robots.txt and implement proper delays between requests
- Use CSS selectors for simple selections and XPath for complex queries
- Cache parsed documents when processing multiple queries on the same content
- Monitor memory usage in long-running scraping applications
While Nokogiri excels at parsing static HTML content, some modern websites rely heavily on JavaScript for content generation. In such cases, you might need to consider browser automation tools such as Puppeteer for more complex scraping scenarios.
Conclusion
Nokogiri is an excellent choice for Ruby web scraping projects, offering powerful parsing capabilities with good performance. By following this installation and setup guide, you'll have a solid foundation for building robust web scraping applications. Remember to always scrape responsibly, respect website terms of service, and implement appropriate error handling and rate limiting in your applications.
For more advanced scraping scenarios involving dynamic content, consider exploring browser automation solutions alongside Nokogiri to create comprehensive scraping systems that can handle both static and dynamic web content effectively.