How Do I Parse HTML Tables with Nokogiri?
Parsing HTML tables is one of the most common tasks in web scraping, and Nokogiri provides powerful tools to extract structured data from table elements efficiently. Whether you're dealing with simple data tables or complex nested structures, Nokogiri's CSS selectors and XPath expressions make table parsing straightforward and reliable.
What is Nokogiri?
Nokogiri is a Ruby gem that provides a simple and powerful interface for parsing HTML and XML documents. It's built on top of libxml2 and libxslt, making it fast and memory-efficient for processing large documents. Nokogiri supports both CSS selectors and XPath expressions, giving developers flexibility in how they target specific elements.
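As a quick illustration of that flexibility, the short sketch below (with a made-up HTML snippet) selects the same cell once with a CSS selector and once with the equivalent XPath expression:
require 'nokogiri'

doc = Nokogiri::HTML('<table><tr><td class="name">Ada</td></tr></table>')

# The same cell, targeted two ways
puts doc.css('td.name').first.text               # => Ada
puts doc.xpath('//td[@class="name"]').first.text # => Ada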
Basic Table Structure Understanding
Before diving into parsing techniques, it's important to understand HTML table structure:
<table>
<thead>
<tr>
<th>Name</th>
<th>Age</th>
<th>City</th>
</tr>
</thead>
<tbody>
<tr>
<td>John Doe</td>
<td>30</td>
<td>New York</td>
</tr>
<tr>
<td>Jane Smith</td>
<td>25</td>
<td>Los Angeles</td>
</tr>
</tbody>
</table>
Installation and Setup
First, install Nokogiri if you haven't already:
gem install nokogiri
Or add it to your Gemfile:
gem 'nokogiri'
Then require it in your Ruby script (open-uri, part of Ruby's standard library, is handy for fetching pages later):
require 'nokogiri'
require 'open-uri'
Basic Table Parsing Methods
Method 1: Using CSS Selectors
CSS selectors provide an intuitive way to target table elements:
require 'nokogiri'

# Sample HTML content (inline here, so open-uri is not needed)
html = <<-HTML
<table id="employees">
<thead>
<tr>
<th>Name</th>
<th>Department</th>
<th>Salary</th>
</tr>
</thead>
<tbody>
<tr>
<td>Alice Johnson</td>
<td>Engineering</td>
<td>$95,000</td>
</tr>
<tr>
<td>Bob Wilson</td>
<td>Marketing</td>
<td>$75,000</td>
</tr>
</tbody>
</table>
HTML
# Parse the HTML
doc = Nokogiri::HTML(html)
# Extract table headers
headers = doc.css('table#employees thead th').map(&:text)
puts "Headers: #{headers}"
# Extract table rows
rows = []
doc.css('table#employees tbody tr').each do |row|
cells = row.css('td').map(&:text)
rows << cells
end
# Display results
puts "\nTable Data:"
rows.each_with_index do |row, index|
puts "Row #{index + 1}: #{row}"
end
Method 2: Using XPath Expressions
XPath provides more powerful selection capabilities:
require 'nokogiri'
html = <<-HTML
<table class="data-table">
<tr>
<th>Product</th>
<th>Price</th>
<th>Stock</th>
</tr>
<tr>
<td>Laptop</td>
<td>$999</td>
<td>15</td>
</tr>
<tr>
<td>Mouse</td>
<td>$25</td>
<td>50</td>
</tr>
</table>
HTML
doc = Nokogiri::HTML(html)
# Extract headers using XPath
headers = doc.xpath('//table[@class="data-table"]//th').map(&:text)
puts "Headers: #{headers}"
# Extract data rows, skipping the header row
# (position() > 1 works here because every tr shares the same parent)
data_rows = doc.xpath('//table[@class="data-table"]//tr[position()>1]')
products = []
data_rows.each do |row|
cells = row.xpath('.//td').map(&:text)
products << {
name: cells[0],
price: cells[1],
stock: cells[2].to_i
}
end
products.each do |product|
puts "#{product[:name]}: #{product[:price]} (#{product[:stock]} in stock)"
end
Advanced Table Parsing Techniques
Handling Tables with Colspan and Rowspan
Complex tables often use colspan and rowspan attributes. The first step is to read those attributes off each cell, as below; a grid-expansion sketch follows after this example:
require 'nokogiri'
html = <<-HTML
<table>
<tr>
<th rowspan="2">Name</th>
<th colspan="2">Contact</th>
</tr>
<tr>
<th>Email</th>
<th>Phone</th>
</tr>
<tr>
<td>John Doe</td>
<td>john@example.com</td>
<td>555-1234</td>
</tr>
</table>
HTML
doc = Nokogiri::HTML(html)
# Parse complex table structure
table = doc.css('table').first
rows = table.css('tr')
# Process each row considering span attributes
rows.each_with_index do |row, row_index|
cells = row.css('td, th')
puts "Row #{row_index + 1}:"
cells.each_with_index do |cell, cell_index|
colspan = cell['colspan']&.to_i || 1
rowspan = cell['rowspan']&.to_i || 1
puts " Cell #{cell_index + 1}: '#{cell.text.strip}' (colspan: #{colspan}, rowspan: #{rowspan})"
end
end
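Printing the span attributes is only the first step; to get a rectangular grid you also need to copy each spanned cell into every row and column it covers. The following is one rough sketch of that expansion (an illustrative approach, not the only one; expand_table_grid is a name invented for this example):
# Expand colspan/rowspan into a rectangular grid of cell texts
def expand_table_grid(table)
  grid = []
  table.css('tr').each_with_index do |row, r|
    grid[r] ||= []
    col = 0
    row.css('td, th').each do |cell|
      col += 1 while grid[r][col] # skip slots already filled by a rowspan above
      colspan = (cell['colspan'] || 1).to_i
      rowspan = (cell['rowspan'] || 1).to_i
      rowspan.times do |dr|
        grid[r + dr] ||= []
        colspan.times { |dc| grid[r + dr][col + dc] = cell.text.strip }
      end
      col += colspan
    end
  end
  grid
end

expand_table_grid(table).each { |row| p row }
# => ["Name", "Contact", "Contact"]
#    ["Name", "Email", "Phone"]
#    ["John Doe", "john@example.com", "555-1234"]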
Creating a Reusable Table Parser Class
For production applications, create a reusable table parser:
require 'nokogiri'
require 'open-uri' # the usage example below fetches a page

class TableParser
def initialize(html)
@doc = Nokogiri::HTML(html)
end
def parse_table(selector)
table = @doc.css(selector).first
return nil unless table
headers = extract_headers(table)
rows = extract_rows(table)
{
headers: headers,
data: rows,
rows_count: rows.length,
columns_count: headers.length
}
end
private
def extract_headers(table)
# Try to find headers in thead first, then first tr
headers = table.css('thead th, thead td')
headers = table.css('tr:first-child th, tr:first-child td') if headers.empty?
headers.map { |header| header.text.strip }
end
def extract_rows(table)
# Skip the header row: when a thead exists, data rows live in tbody
row_selector = table.css('thead').any? ? 'tbody tr' : 'tr:not(:first-child)'
rows = table.css(row_selector)
rows.map do |row|
row.css('td, th').map { |cell| cell.text.strip }
end
end
end
# Usage example
html = URI.open('https://example.com/data-table').read
parser = TableParser.new(html)
result = parser.parse_table('table.main-data')
if result
puts "Found table with #{result[:rows_count]} rows and #{result[:columns_count]} columns"
puts "Headers: #{result[:headers]}"
result[:data].each_with_index do |row, index|
puts "Row #{index + 1}: #{row}"
end
end
Error Handling and Edge Cases
Always implement proper error handling when parsing tables:
def safe_table_parse(html, table_selector)
  doc = Nokogiri::HTML(html)
  table = doc.css(table_selector).first

  unless table
    puts "Warning: No table found with selector '#{table_selector}'"
    return []
  end

  rows = table.css('tr')
  if rows.empty?
    puts "Warning: Table found but contains no rows"
    return []
  end

  # Collect cell texts, skipping rows with no cells
  data = []
  rows.each do |row|
    cells = row.css('td, th').map { |cell| cell.text.strip }
    data << cells unless cells.empty?
  end
  data
rescue StandardError => e
  puts "Error parsing table: #{e.message}"
  []
end
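A quick usage sketch (inputs invented for illustration):
p safe_table_parse('<table><tr><td>ok</td></tr></table>', 'table')
# => [["ok"]]

p safe_table_parse('<p>no tables here</p>', 'table#stats')
# prints a warning and returns []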
Performance Optimization Tips
1. Use Specific Selectors
# Good - specific selector
doc.css('table#data-table tbody tr td:first-child')
# Avoid - too general
doc.css('td')
2. Process Large Tables in Chunks
def process_large_table(doc, chunk_size = 100)
rows = doc.css('table tr')
rows.each_slice(chunk_size) do |chunk|
process_chunk(chunk)
# Optional: garbage collection for very large datasets
GC.start if rows.length > 10000
end
end
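process_chunk above is a placeholder for whatever your application does with each batch of rows. A hypothetical implementation might look like this (swap the body for writes to your own datastore):
# Hypothetical chunk handler: print each row's cell texts
def process_chunk(rows)
  rows.each do |row|
    puts row.css('td, th').map { |cell| cell.text.strip }.join(' | ')
  end
end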
3. Cache Parsed Documents
require 'nokogiri'
require 'open-uri'

class CachedTableParser
def initialize
@cache = {}
end
def parse(url, table_selector)
@cache[url] ||= Nokogiri::HTML(URI.open(url))
extract_table_data(@cache[url], table_selector)
end
end
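extract_table_data is left undefined above; one plausible implementation (an assumption, reusing the extraction pattern shown earlier) would look like this:
# Assumed helper for CachedTableParser: define this inside the class.
# It pulls row arrays out of an already-parsed document.
def extract_table_data(doc, table_selector)
  table = doc.css(table_selector).first
  return [] unless table

  table.css('tr').map do |row|
    row.css('td, th').map { |cell| cell.text.strip }
  end
end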
Real-World Example: Scraping Financial Data
Here's a practical example of scraping a financial data table:
require 'nokogiri'
require 'open-uri'
require 'csv'
class FinancialDataScraper
def initialize(url)
@url = url
@doc = Nokogiri::HTML(URI.open(url))
end
def scrape_stock_table
stocks = []
# Target the specific table containing stock data
@doc.css('table.stock-data tbody tr').each do |row|
cells = row.css('td')
next if cells.length < 6
stock = {
symbol: cells[0].text.strip,
company: cells[1].text.strip,
price: parse_price(cells[2].text),
change: parse_price(cells[3].text),
change_percent: cells[4].text.strip,
volume: parse_volume(cells[5].text)
}
stocks << stock
end
stocks
end
def export_to_csv(stocks, filename)
return if stocks.empty? # guard: nothing to write
CSV.open(filename, 'w', write_headers: true, headers: stocks.first.keys) do |csv|
stocks.each { |stock| csv << stock.values }
end
end
private
def parse_price(price_text)
price_text.gsub(/[$,]/, '').to_f
end
def parse_volume(volume_text)
volume_text.gsub(/[,]/, '').to_i
end
end
# Usage
scraper = FinancialDataScraper.new('https://example-finance.com/stocks')
stocks = scraper.scrape_stock_table
scraper.export_to_csv(stocks, 'stock_data.csv')
puts "Scraped #{stocks.length} stocks successfully!"
Integration with Web Scraping APIs
While Nokogiri is excellent for parsing static HTML tables, it cannot execute JavaScript. For tables that are rendered client-side, fetch the page with a headless browser or a web scraping API first, then hand the rendered HTML to Nokogiri for parsing.
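As a rough sketch, assuming the selenium-webdriver gem and a local Chrome install (neither ships with Nokogiri, and the URL is a placeholder), that hand-off could look like this:
require 'selenium-webdriver'
require 'nokogiri'

options = Selenium::WebDriver::Chrome::Options.new
options.add_argument('--headless=new') # render without a visible window

driver = Selenium::WebDriver.for(:chrome, options: options)
driver.get('https://example.com/js-rendered-table')

# page_source holds the DOM *after* JavaScript has run
doc = Nokogiri::HTML(driver.page_source)
puts "Rendered rows: #{doc.css('table tr').length}"

driver.quit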
Best Practices and Common Pitfalls
1. Always Validate Table Structure
def validate_table_structure(table, expected_columns)
first_row = table.css('tr').first
return false unless first_row
actual_columns = first_row.css('td, th').length
actual_columns == expected_columns
end
2. Handle Missing Data Gracefully
def safe_cell_text(cell, default = '')
cell ? cell.text.strip : default
end
3. Normalize Data Types
def normalize_table_data(raw_data)
raw_data.map do |row|
row.map do |cell|
# Remove extra whitespace
normalized = cell.strip
# Convert numeric strings (\A and \z avoid matching multi-line cells,
# which ^ and $ would in Ruby)
if normalized.match?(/\A\d+(\.\d+)?\z/)
normalized.include?('.') ? normalized.to_f : normalized.to_i
else
normalized
end
end
end
end
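A quick check of the behavior (values invented for the example):
p normalize_table_data([['  Laptop ', '999', '15.5']])
# => [["Laptop", 999, 15.5]]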
Testing Your Table Parser
Always write tests for your table parsing logic:
require 'minitest/autorun'
class TableParserTest < Minitest::Test
def setup
@html = <<-HTML
<table>
<tr><th>Name</th><th>Age</th></tr>
<tr><td>John</td><td>30</td></tr>
<tr><td>Jane</td><td>25</td></tr>
</table>
HTML
end
def test_extracts_headers_correctly
doc = Nokogiri::HTML(@html)
headers = doc.css('th').map(&:text)
assert_equal ['Name', 'Age'], headers
end
def test_extracts_data_rows
doc = Nokogiri::HTML(@html)
rows = doc.css('tr:not(:first-child)').map do |row|
row.css('td').map(&:text)
end
expected = [['John', '30'], ['Jane', '25']]
assert_equal expected, rows
end
end
Conclusion
Parsing HTML tables with Nokogiri is a fundamental skill for web scraping in Ruby. By combining CSS selectors and XPath expressions with proper error handling and optimization techniques, you can build robust table parsers that handle various table structures efficiently. Remember to always validate your data, handle edge cases gracefully, and consider the performance implications when working with large datasets.
Whether you're extracting financial data, product catalogs, or any other tabular information, Nokogiri provides the tools you need to parse HTML tables effectively. For more complex scenarios involving authentication workflows or JavaScript-rendered content, consider combining Nokogiri with other web scraping tools for comprehensive data extraction solutions.