How do I parse HTML tables using Nokogiri in Ruby?

Parsing HTML tables is one of the most common web scraping tasks, and Nokogiri provides powerful tools to extract structured data from table elements. This guide covers everything from basic table parsing to handling complex table structures with headers, merged cells, and nested data.

Getting Started with Nokogiri

First, ensure you have Nokogiri installed in your Ruby environment:

gem install nokogiri

Or add it to your Gemfile:

gem 'nokogiri'

Basic Table Parsing

Let's start with a simple HTML table structure:

require 'nokogiri'

html = <<-HTML
<table>
  <thead>
    <tr>
      <th>Name</th>
      <th>Age</th>
      <th>City</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>John Doe</td>
      <td>30</td>
      <td>New York</td>
    </tr>
    <tr>
      <td>Jane Smith</td>
      <td>25</td>
      <td>Los Angeles</td>
    </tr>
  </tbody>
</table>
HTML

doc = Nokogiri::HTML(html)

# Extract all table rows from tbody
rows = doc.css('tbody tr')

rows.each do |row|
  cells = row.css('td')
  name = cells[0].text.strip
  age = cells[1].text.strip
  city = cells[2].text.strip

  puts "Name: #{name}, Age: #{age}, City: #{city}"
end

Extracting Table Headers

When working with tables, it's often useful to extract headers first to understand the data structure:

# Extract headers
headers = doc.css('thead th').map { |th| th.text.strip }
puts "Headers: #{headers.join(', ')}"

# Create a hash for each row using headers as keys
data = []
doc.css('tbody tr').each do |row|
  row_data = {}
  row.css('td').each_with_index do |cell, index|
    row_data[headers[index]] = cell.text.strip
  end
  data << row_data
end

puts data.inspect
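
For a more concise version of the same loop, Ruby's zip can pair the headers with each row's cell values directly; this is purely a stylistic alternative, equivalent to the code above:

# Build the same array of hashes by zipping headers with cell values
data = doc.css('tbody tr').map do |row|
  headers.zip(row.css('td').map { |td| td.text.strip }).to_h
end

puts data.inspect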

Handling Tables Without Explicit Headers

Some tables don't have <thead> elements. You can treat the first row as headers:

html_simple = <<-HTML
<table>
  <tr>
    <td>Product</td>
    <td>Price</td>
    <td>Stock</td>
  </tr>
  <tr>
    <td>Laptop</td>
    <td>$999</td>
    <td>15</td>
  </tr>
  <tr>
    <td>Mouse</td>
    <td>$25</td>
    <td>100</td>
  </tr>
</table>
HTML

doc = Nokogiri::HTML(html_simple)
all_rows = doc.css('table tr')

# Use first row as headers
headers = all_rows.first.css('td').map { |td| td.text.strip }
data_rows = all_rows[1..-1]  # Skip the first row

products = []
data_rows.each do |row|
  product_data = {}
  row.css('td').each_with_index do |cell, index|
    product_data[headers[index]] = cell.text.strip
  end
  products << product_data
end

puts products.inspect

Advanced Table Parsing Techniques

Working with Multiple Tables

When a page contains multiple tables, you can target specific ones using CSS selectors or XPath:

# Select table by class
specific_table = doc.css('table.data-table')

# Select table by ID
specific_table = doc.css('#results-table')

# Select the second table on the page
second_table = doc.css('table')[1]

# Using XPath for more complex selection
table_with_specific_header = doc.xpath('//table[.//th[contains(text(), "Results")]]')

Handling Colspan and Rowspan

Tables with merged cells require special attention:

def parse_complex_table(table)
  rows = table.css('tr')
  parsed_data = []

  rows.each do |row|
    cells = row.css('td, th')
    cell_data = []

    cells.each do |cell|
      content = cell.text.strip
      colspan = cell['colspan'] ? cell['colspan'].to_i : 1
      rowspan = cell['rowspan'] ? cell['rowspan'].to_i : 1

      # Store cell data with span information
      cell_data << {
        content: content,
        colspan: colspan,
        rowspan: rowspan
      }
    end

    parsed_data << cell_data
  end

  parsed_data
end
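
The method above records span metadata but doesn't apply it. If you need a rectangular grid in which each merged cell's content is repeated into every position it covers, a sketch along these lines builds on that output (expand_spans is a hypothetical helper, not part of Nokogiri):

# Expand colspan/rowspan metadata from parse_complex_table into a
# normalized grid where every (row, column) position holds a value
def expand_spans(parsed_data)
  grid = []

  parsed_data.each_with_index do |cell_data, row_index|
    grid[row_index] ||= []
    col = 0

    cell_data.each do |cell|
      # Skip columns already filled by a rowspan from a previous row
      col += 1 while grid[row_index][col]

      cell[:rowspan].times do |r|
        cell[:colspan].times do |c|
          grid[row_index + r] ||= []
          grid[row_index + r][col + c] = cell[:content]
        end
      end

      col += cell[:colspan]
    end
  end

  grid
end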

Extracting Table Data with CSS Classes

Many modern websites use CSS classes to identify different types of data:

# Extract specific columns by class
prices = doc.css('table td.price').map { |cell| cell.text.strip }
dates = doc.css('table td.date').map { |cell| cell.text.strip }

# Extract rows with specific classes
featured_rows = doc.css('table tr.featured')

Real-World Example: Scraping Stock Data

Here's a practical example that combines multiple techniques:

require 'nokogiri'
require 'open-uri'

def scrape_stock_table(html_content)
  doc = Nokogiri::HTML(html_content)

  # Find the stock data table
  stock_table = doc.css('table#stock-data').first
  return [] unless stock_table

  # Extract headers
  headers = stock_table.css('thead th').map { |th| th.text.strip.downcase.gsub(/\s+/, '_') }

  # Extract data rows
  stocks = []
  stock_table.css('tbody tr').each do |row|
    stock_data = {}

    row.css('td').each_with_index do |cell, index|
      next if index >= headers.length

      # Clean and process cell content
      value = cell.text.strip

      # Handle different data types
      case headers[index]
      when 'price', 'change'
        value = value.gsub(/[$,]/, '').to_f
      when 'volume'
        value = value.gsub(/[,]/, '').to_i
      when 'symbol'
        # Extract link if present
        link = cell.css('a').first
        value = {
          symbol: value,
          url: link ? link['href'] : nil
        }
      end

      stock_data[headers[index]] = value
    end

    stocks << stock_data unless stock_data.empty?
  end

  stocks
end
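
Calling the function is then a matter of fetching the page; open-uri (already required above) works for static pages. The URL below is a placeholder:

# Fetch a page and parse its stock table (example URL is hypothetical)
html_content = URI.open('https://example.com/stocks').read
stocks = scrape_stock_table(html_content)
stocks.each { |stock| puts stock.inspect }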

Handling Dynamic Content and JavaScript Tables

When tables are rendered dynamically with JavaScript, Nokogiri alone won't be sufficient since it only parses static HTML. For such cases, you'll need to combine Ruby with browser automation tools, or use a scraping service that executes JavaScript and returns the rendered HTML.

# For static HTML parsing with Nokogiri
def parse_static_table(html_content)
  doc = Nokogiri::HTML(html_content)
  # ... parsing logic
end

# For JavaScript-rendered content, you'd need tools like:
# - Watir
# - Capybara with a JavaScript driver
# - External APIs that handle JavaScript rendering
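
As a minimal sketch of the browser-automation route, here's how Watir and Nokogiri might be combined, assuming the watir gem and a matching Chrome driver are installed (headless configuration varies between Watir versions, and the URL is a placeholder):

require 'watir'
require 'nokogiri'

# Let a headless browser render the page, then parse the resulting HTML
browser = Watir::Browser.new(:chrome, headless: true)
browser.goto('https://example.com/dynamic-table')

doc = Nokogiri::HTML(browser.html)
doc.css('table tbody tr').each do |row|
  puts row.css('td').map { |td| td.text.strip }.join(' | ')
end

browser.close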

Error Handling and Best Practices

When parsing tables, always implement proper error handling:

def safe_table_parse(html_content)
  begin
    doc = Nokogiri::HTML(html_content)

    # Check if table exists
    table = doc.css('table').first
    unless table
      puts "No table found in the HTML content"
      return []
    end

    # Verify table has rows
    rows = table.css('tr')
    if rows.empty?
      puts "Table found but contains no rows"
      return []
    end

    # Parse table data
    data = []
    rows.each_with_index do |row, index|
      cells = row.css('td, th')

      if cells.empty?
        puts "Warning: Row #{index} contains no cells"
        next
      end

      row_data = cells.map { |cell| cell.text.strip }
      data << row_data
    end

    data

  rescue Nokogiri::XML::SyntaxError => e
    puts "HTML parsing error: #{e.message}"
    []
  rescue => e
    puts "Unexpected error: #{e.message}"
    []
  end
end

Performance Optimization

For large tables or multiple tables, consider these optimization techniques:

# Use CSS selectors instead of XPath when possible (faster)
rows = doc.css('tbody tr')  # Preferred
# rows = doc.xpath('//tbody//tr')  # Slower

# Reuse selector strings for consistency across parsing code
row_selector = 'tbody tr'
cell_selector = 'td'

doc.css(row_selector).each do |row|
  cells = row.css(cell_selector)
  # Process cells...
end

# Read direct text nodes when you don't need descendant text
def extract_text_efficiently(element)
  # Joins only the element's immediate text-node children; unlike
  # element.text it skips nested elements, which can be faster
  element.children.select(&:text?).map(&:content).join.strip
end

Working with Nested Tables

Some websites use nested tables within table cells. Handle these carefully:

def parse_nested_tables(doc)
  main_tables = doc.css('body > table')  # Only top-level tables

  main_tables.each do |table|
    rows = table.css('> tbody > tr, > tr')  # Direct children only

    rows.each do |row|
      cells = row.css('> td, > th')  # Direct children only

      cells.each do |cell|
        # Check for nested tables
        nested_tables = cell.css('table')
        if nested_tables.any?
          puts "Found nested table in cell"
          # Handle nested table separately
        else
          puts "Cell content: #{cell.text.strip}"
        end
      end
    end
  end
end
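
Where the comment above says to handle the nested table separately, simple recursion is one option. This sketch (table_to_grid is a hypothetical helper) turns a table, and any tables nested in its cells, into nested arrays of cell text:

# Recursively flatten a table and any tables nested inside its cells
def table_to_grid(table)
  table.css('> tbody > tr, > tr').map do |row|
    row.css('> td, > th').map do |cell|
      nested = cell.css('table').first
      # Recurse into the first nested table, otherwise keep the text
      nested ? table_to_grid(nested) : cell.text.strip
    end
  end
end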

Common Pitfalls and Solutions

  1. Empty cells: Always check for empty cells and handle them appropriately
  2. Nested tables: Be specific with your selectors to avoid selecting nested table elements
  3. Malformed HTML: Use Nokogiri's error recovery features to handle broken HTML (see the errors example below)
  4. Memory usage: For very large tables, consider processing rows in batches

# Handle empty cells
cell_value = cell.text.strip
cell_value = 'N/A' if cell_value.empty?

# Avoid nested table confusion
main_table_rows = doc.css('body > table > tbody > tr')  # Top-level tables only

# Process large tables in batches
def process_large_table(table, batch_size = 100)
  rows = table.css('tr')

  rows.each_slice(batch_size) do |batch|
    batch.each do |row|
      # Process row
    end

    # Optionally trigger garbage collection between batches; this only
    # helps once references to processed rows have been released
    GC.start
  end
end
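
For the malformed-HTML pitfall, note that Nokogiri repairs most broken markup automatically; the document's errors collection records what it had to fix:

# Nokogiri recovers from broken HTML by default; doc.errors lists the repairs
broken_html = '<table><tr><td>Unclosed cell<tr><td>Another row</table>'
doc = Nokogiri::HTML(broken_html)

doc.errors.each { |error| puts "Recovered from: #{error.message}" }
puts doc.css('td').map(&:text).inspect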

Extracting Links and Other Attributes

Tables often contain links and other HTML attributes that provide additional context:

def extract_table_with_links(table)
  data = []

  table.css('tbody tr').each do |row|
    row_data = {}

    row.css('td').each_with_index do |cell, index|
      # Extract text content
      text_content = cell.text.strip

      # Extract links if present
      links = cell.css('a').map do |link|
        {
          text: link.text.strip,
          href: link['href'],
          title: link['title']
        }
      end

      # Extract images if present
      images = cell.css('img').map do |img|
        {
          src: img['src'],
          alt: img['alt'],
          title: img['title']
        }
      end

      row_data["column_#{index}"] = {
        text: text_content,
        links: links,
        images: images,
        raw_html: cell.inner_html
      }
    end

    data << row_data
  end

  data
end

Conclusion

Nokogiri provides a robust foundation for HTML table parsing in Ruby applications. By combining proper CSS selectors, error handling, and optimization techniques, you can efficiently extract structured data from even complex table layouts.

For more advanced scenarios involving JavaScript-heavy pages, consider integrating Nokogiri with browser automation tools or using web scraping APIs that handle authentication and session management automatically.

Remember to always validate your parsing logic against different table structures, handle edge cases gracefully, and respect website terms of service when scraping data.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
