How to Install Nokogiri on macOS with Homebrew

Nokogiri is one of the most popular Ruby gems for parsing HTML and XML documents, making it an essential tool for web scraping projects. However, installing Nokogiri on macOS can sometimes be challenging due to its native C extensions and dependencies. This comprehensive guide will walk you through the proper installation process using Homebrew and help you troubleshoot common issues.

Prerequisites

Before installing Nokogiri, ensure you have the following prerequisites installed on your macOS system:

Install Homebrew

If you don't have Homebrew installed, install it first:

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

Install Ruby

Ensure you have Ruby installed. The Ruby that ships with macOS is outdated and requires sudo for installing gems, so we recommend a version manager such as rbenv:

# Install rbenv
brew install rbenv ruby-build

# Install a recent Ruby version
rbenv install 3.2.0
rbenv global 3.2.0

# Add rbenv to your shell profile
echo 'export PATH="$HOME/.rbenv/bin:$PATH"' >> ~/.zshrc
echo 'eval "$(rbenv init -)"' >> ~/.zshrc
source ~/.zshrc

Installing Required Dependencies

Nokogiri requires several system libraries to compile successfully. Install these dependencies using Homebrew:

# Install the XML and XSLT libraries plus pkg-config
brew install libxml2 libxslt pkg-config

# Install additional dependencies that may be needed
brew install libiconv zlib

These libraries provide:

  • libxml2: XML parsing library
  • libxslt: XSLT processing library
  • pkg-config: package configuration utility
  • libiconv: character encoding conversion library
  • zlib: compression library

Installing Nokogiri

Method 1: Standard Gem Installation

With the dependencies installed, you can now install Nokogiri using the standard gem command:

gem install nokogiri
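
To confirm that the gem compiled and loads correctly, a quick check along these lines can help (the exact keys in Nokogiri::VERSION_INFO vary between Nokogiri versions):

# check_nokogiri_version.rb
require 'nokogiri'

puts Nokogiri::VERSION
# VERSION_INFO reports build details, e.g. which libxml2 the gem is linked against
puts Nokogiri::VERSION_INFO['libxml'] if Nokogiri::VERSION_INFO.key?('libxml')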

Method 2: Bundle Installation

If you're working with a Gemfile, add Nokogiri to your Gemfile:

# Gemfile
gem 'nokogiri', '~> 1.15'

Then run:

bundle install

Method 3: Installation with Specific Configuration

If you encounter issues with the standard installation, you can specify the library paths explicitly:

gem install nokogiri -- \
  --use-system-libraries \
  --with-xml2-include=$(brew --prefix libxml2)/include/libxml2 \
  --with-xml2-lib=$(brew --prefix libxml2)/lib \
  --with-xslt-include=$(brew --prefix libxslt)/include \
  --with-xslt-lib=$(brew --prefix libxslt)/lib \
  --with-iconv-include=$(brew --prefix libiconv)/include \
  --with-iconv-lib=$(brew --prefix libiconv)/lib \
  --with-zlib-include=$(brew --prefix zlib)/include \
  --with-zlib-lib=$(brew --prefix zlib)/lib

Verifying the Installation

After installation, verify that Nokogiri is working correctly:

# test_nokogiri.rb
require 'nokogiri'

# Test basic HTML parsing
html = <<-HTML
<!DOCTYPE html>
<html>
<head><title>Test</title></head>
<body>
  <div class="content">
    <h1>Hello World</h1>
    <p>This is a test.</p>
  </div>
</body>
</html>
HTML

doc = Nokogiri::HTML(html)
puts "Title: #{doc.css('title').text}"
puts "Header: #{doc.css('h1').text}"
puts "Paragraph: #{doc.css('p').text}"

# Test XML parsing
xml = <<-XML
<?xml version="1.0"?>
<catalog>
  <book id="1">
    <title>Ruby Programming</title>
    <author>Matz</author>
  </book>
</catalog>
XML

xml_doc = Nokogiri::XML(xml)
puts "Book title: #{xml_doc.css('title').text}"
puts "Author: #{xml_doc.css('author').text}"

Run the test:

ruby test_nokogiri.rb

Expected output:

Title: Test
Header: Hello World
Paragraph: This is a test.
Book title: Ruby Programming
Author: Matz

Common Installation Issues and Solutions

Issue 1: Missing Development Tools

Error message: ERROR: Failed to build gem native extension. xcrun: error: invalid active developer path

Solution: Install Xcode command line tools:

xcode-select --install

Issue 2: Library Not Found Errors

Error message: ERROR: Failed to build gem native extension. libxml2 is missing

Solution: Ensure all dependencies are properly linked:

# Reinstall dependencies
brew reinstall libxml2 libxslt pkg-config

# Make the libraries discoverable to the build, then retry the install
export PKG_CONFIG_PATH="$(brew --prefix libxml2)/lib/pkgconfig:$(brew --prefix libxslt)/lib/pkgconfig:$PKG_CONFIG_PATH"
gem install nokogiri

Issue 3: Apple Silicon (M1/M2) Compatibility Issues

For Apple Silicon (M1/M2) Macs, you might need additional configuration:

# Set architecture-specific paths
export LDFLAGS="-L$(brew --prefix libxml2)/lib -L$(brew --prefix libxslt)/lib"
export CPPFLAGS="-I$(brew --prefix libxml2)/include -I$(brew --prefix libxslt)/include"

# Install with explicit architecture
arch -arm64 gem install nokogiri

Issue 4: Version Conflicts

If you have multiple Ruby versions or conflicting gems:

# Clean up existing installations
gem uninstall nokogiri

# Clear gem cache
gem cleanup

# Reinstall with verbose output
gem install nokogiri -V

Bundler Configuration

For consistent installation across different environments, configure Bundler to use system libraries:

# Set bundler configuration
bundle config build.nokogiri --use-system-libraries

# Or add to your .bundle/config file
echo "BUNDLE_BUILD__NOKOGIRI: --use-system-libraries" >> .bundle/config

Performance Optimization

After successful installation, you can optimize Nokogiri's performance:

# Skip blank-only text nodes to keep the parsed tree smaller
Nokogiri::XML::Document.parse(xml_string) do |config|
  config.noblanks
end

# Scope CSS selectors narrowly instead of searching the whole document repeatedly
doc.css('div.content p')

# Stream very large documents with the SAX parser instead of building a full tree in memory
Nokogiri::XML::SAX::Parser.new(handler).parse(large_xml_file)
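
The SAX line above assumes a handler object. A minimal sketch of one (the class, element name, and file name here are illustrative) might look like this:

require 'nokogiri'

# Collects the text of every <title> element while streaming the document
class TitleCollector < Nokogiri::XML::SAX::Document
  attr_reader :titles

  def initialize
    @titles = []
    @in_title = false
  end

  def start_element(name, attrs = [])
    @in_title = true if name == 'title'
  end

  def characters(string)
    @titles << string.strip if @in_title
  end

  def end_element(name)
    @in_title = false if name == 'title'
  end
end

handler = TitleCollector.new
Nokogiri::XML::SAX::Parser.new(handler).parse(File.open('large.xml'))
puts handler.titles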

Integration with Web Scraping Projects

Once Nokogiri is installed, you can integrate it with other web scraping tools. For JavaScript-heavy websites that require browser automation, you might want to combine Nokogiri with tools like Puppeteer for handling dynamic content or use headless browser solutions for complex interactions.

Here's an example of combining Nokogiri with HTTP requests for basic web scraping:

require 'nokogiri'
require 'net/http'
require 'uri'

def scrape_website(url)
  uri = URI(url)
  response = Net::HTTP.get_response(uri)

  if response.code == '200'
    doc = Nokogiri::HTML(response.body)

    # Extract specific data
    title = doc.css('title').text
    headings = doc.css('h1, h2, h3').map(&:text)
    links = doc.css('a').map { |link| link['href'] }

    {
      title: title,
      headings: headings,
      links: links
    }
  else
    puts "Failed to retrieve page: #{response.code}"
    nil
  end
end

# Usage
data = scrape_website('https://example.com')
puts data[:title] if data

Best Practices

  1. Version Pinning: Always specify Nokogiri versions in your Gemfile:

   gem 'nokogiri', '~> 1.15.0'

  2. Environment Consistency: Use the same installation method across development, staging, and production environments.

  3. Documentation: Keep track of your installation configuration for team members:

   # Create installation notes
   echo "Nokogiri installed with system libraries on $(date)" >> INSTALL_NOTES.md

  4. Regular Updates: Keep Nokogiri updated for security patches:

   bundle update nokogiri

  5. Memory Management: For large-scale scraping, release document references so they can be garbage-collected (see the sketch after this list):

   # Clear document references when done
   doc = nil
   GC.start
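
As a rough illustration of the memory management point, here is a minimal sketch (the URLs and selector are hypothetical) that parses pages one at a time so each document can be garbage-collected before the next request:

require 'nokogiri'
require 'net/http'
require 'uri'

# Hypothetical list of pages to process
urls = ['https://example.com/page1', 'https://example.com/page2']

urls.each do |url|
  body = Net::HTTP.get(URI(url))
  doc = Nokogiri::HTML(body)

  # Extract only what you need, then drop the reference
  puts doc.css('title').text
  doc = nil
end

# Optionally prompt a collection after the batch
GC.start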

Alternative Installation Methods

Using Docker

For consistent environments across different systems:

# Dockerfile
FROM ruby:3.2-alpine

RUN apk add --no-cache \
  build-base \
  libxml2-dev \
  libxslt-dev \
  nodejs \
  npm

WORKDIR /app

COPY Gemfile* ./
RUN bundle install

COPY . .

Using System Package Managers

Alternative to Homebrew for specific use cases:

# Using MacPorts (if you prefer it over Homebrew)
sudo port install libxml2 +universal
sudo port install libxslt +universal
gem install nokogiri

Advanced Configuration

Custom Parser Options

Configure Nokogiri's parsing behavior for specific needs:

# Strict parsing
doc = Nokogiri::XML(xml_string) do |config|
  config.strict.nonet.noblanks
end

# Recover from errors
doc = Nokogiri::HTML(html_string) do |config|
  config.recover.noerror.nowarning
end

# Substitute entities and load/validate DTDs
# (avoid these options on untrusted input; they enable XXE-style attacks)
doc = Nokogiri::XML(xml_string) do |config|
  config.noent.dtdload.dtdvalid
end
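
To see what these options change in practice, a small sketch with deliberately malformed XML shows how the default recovering parser compares with strict parsing:

require 'nokogiri'

broken = '<root><item>unclosed</root>'

# Default XML parsing recovers and records problems in doc.errors
doc = Nokogiri::XML(broken)
puts doc.errors.map(&:message)

# Strict parsing raises Nokogiri::XML::SyntaxError instead of recovering
begin
  Nokogiri::XML(broken) { |config| config.strict }
rescue Nokogiri::XML::SyntaxError => e
  puts "Strict parse failed: #{e.message}"
end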

Working with Encodings

Handle different character encodings properly:

# Specify encoding explicitly
doc = Nokogiri::HTML(html_string, nil, 'UTF-8')

# Handle encoding detection
require 'charlock_holmes'
detection = CharlockHolmes::EncodingDetector.detect(content)
doc = Nokogiri::HTML(content, nil, detection[:encoding])

Troubleshooting Environment Variables

If you continue to experience issues, set these environment variables:

# Add to your shell profile (.zshrc or .bash_profile)
export NOKOGIRI_USE_SYSTEM_LIBRARIES=1

# Homebrew's pkg-config directory on Apple Silicon Macs
export PKG_CONFIG_PATH="/opt/homebrew/lib/pkgconfig"

# On Intel Macs, Homebrew lives under /usr/local instead
export PKG_CONFIG_PATH="/usr/local/lib/pkgconfig"

# Reload your shell
source ~/.zshrc

Conclusion

Installing Nokogiri on macOS with Homebrew is straightforward when you follow the proper steps and have the required dependencies. The key is ensuring that libxml2, libxslt, and other native libraries are properly installed and accessible to the gem compilation process.

Remember to test your installation thoroughly and keep your dependencies updated. If you encounter persistent issues, consider using containerized environments or consulting the official Nokogiri installation documentation for the most current troubleshooting information.

With Nokogiri properly installed, you'll have a powerful tool for parsing HTML and XML documents in your Ruby web scraping projects, enabling efficient data extraction and document manipulation. Whether you're building simple scrapers or complex data processing pipelines, Nokogiri provides the robust foundation you need for reliable HTML and XML parsing.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
