How Do I Parse HTML Tables with Nokogiri?
Parsing HTML tables is one of the most common tasks in web scraping, and Nokogiri provides powerful tools to extract structured data from table elements efficiently. Whether you're dealing with simple data tables or complex nested structures, Nokogiri's CSS selectors and XPath expressions make table parsing straightforward and reliable.
What is Nokogiri?
Nokogiri is a Ruby gem that provides a simple and powerful interface for parsing HTML and XML documents. It's built on top of libxml2 and libxslt, making it fast and memory-efficient for processing large documents. Nokogiri supports both CSS selectors and XPath expressions, giving developers flexibility in how they target specific elements.
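As a quick illustration of that flexibility, the short sketch below (with a made-up HTML snippet) selects the same cell once with a CSS selector and once with the equivalent XPath expression:
require 'nokogiri'

doc = Nokogiri::HTML('<table><tr><td class="name">Ada</td></tr></table>')

# The same cell, targeted two ways
puts doc.css('td.name').first.text               # => Ada
puts doc.xpath('//td[@class="name"]').first.text # => Ada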
Basic Table Structure Understanding
Before diving into parsing techniques, it's important to understand HTML table structure:
<table>
<thead>
<tr>
<th>Name</th>
<th>Age</th>
<th>City</th>
</tr>
</thead>
<tbody>
<tr>
<td>John Doe</td>
<td>30</td>
<td>New York</td>
</tr>
<tr>
<td>Jane Smith</td>
<td>25</td>
<td>Los Angeles</td>
</tr>
</tbody>
</table>
Installation and Setup
First, install Nokogiri if you haven't already:
gem install nokogiri
Or add it to your Gemfile:
gem 'nokogiri'
Then require it in your Ruby script (open-uri, part of Ruby's standard library, is handy for fetching pages later):
require 'nokogiri'
require 'open-uri'
Basic Table Parsing Methods
Method 1: Using CSS Selectors
CSS selectors provide an intuitive way to target table elements:
require 'nokogiri'

# Sample HTML content (inline here, so open-uri is not needed)
html = <<-HTML
<table id="employees">
<thead>
<tr>
<th>Name</th>
<th>Department</th>
<th>Salary</th>
</tr>
</thead>
<tbody>
<tr>
<td>Alice Johnson</td>
<td>Engineering</td>
<td>$95,000</td>
</tr>
<tr>
<td>Bob Wilson</td>
<td>Marketing</td>
<td>$75,000</td>
</tr>
</tbody>
</table>
HTML
# Parse the HTML
doc = Nokogiri::HTML(html)
# Extract table headers
headers = doc.css('table#employees thead th').map(&:text)
puts "Headers: #{headers}"
# Extract table rows
rows = []
doc.css('table#employees tbody tr').each do |row|
cells = row.css('td').map(&:text)
rows << cells
end
# Display results
puts "\nTable Data:"
rows.each_with_index do |row, index|
puts "Row #{index + 1}: #{row}"
end
Method 2: Using XPath Expressions
XPath provides more powerful selection capabilities:
require 'nokogiri'
html = <<-HTML
<table class="data-table">
<tr>
<th>Product</th>
<th>Price</th>
<th>Stock</th>
</tr>
<tr>
<td>Laptop</td>
<td>$999</td>
<td>15</td>
</tr>
<tr>
<td>Mouse</td>
<td>$25</td>
<td>50</td>
</tr>
</table>
HTML
doc = Nokogiri::HTML(html)
# Extract headers using XPath
headers = doc.xpath('//table[@class="data-table"]//th').map(&:text)
puts "Headers: #{headers}"
# Extract data rows, skipping the header row
# (position() > 1 works here because every tr shares the same parent)
data_rows = doc.xpath('//table[@class="data-table"]//tr[position()>1]')
products = []
data_rows.each do |row|
cells = row.xpath('.//td').map(&:text)
products << {
name: cells[0],
price: cells[1],
stock: cells[2].to_i
}
end
products.each do |product|
puts "#{product[:name]}: #{product[:price]} (#{product[:stock]} in stock)"
end
Advanced Table Parsing Techniques
Handling Tables with Colspan and Rowspan
Complex tables often use colspan and rowspan attributes. The first step is to read those attributes off each cell, as below; a grid-expansion sketch follows after this example:
require 'nokogiri'
html = <<-HTML
<table>
<tr>
<th rowspan="2">Name</th>
<th colspan="2">Contact</th>
</tr>
<tr>
<th>Email</th>
<th>Phone</th>
</tr>
<tr>
<td>John Doe</td>
<td>john@example.com</td>
<td>555-1234</td>
</tr>
</table>
HTML
doc = Nokogiri::HTML(html)
# Parse complex table structure
table = doc.css('table').first
rows = table.css('tr')
# Process each row considering span attributes
rows.each_with_index do |row, row_index|
cells = row.css('td, th')
puts "Row #{row_index + 1}:"
cells.each_with_index do |cell, cell_index|
colspan = cell['colspan']&.to_i || 1
rowspan = cell['rowspan']&.to_i || 1
puts " Cell #{cell_index + 1}: '#{cell.text.strip}' (colspan: #{colspan}, rowspan: #{rowspan})"
end
end
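Printing the span attributes is only the first step; to get a rectangular grid you also need to copy each spanned cell into every row and column it covers. The following is one rough sketch of that expansion (an illustrative approach, not the only one; expand_table_grid is a name invented for this example):
# Expand colspan/rowspan into a rectangular grid of cell texts
def expand_table_grid(table)
  grid = []
  table.css('tr').each_with_index do |row, r|
    grid[r] ||= []
    col = 0
    row.css('td, th').each do |cell|
      col += 1 while grid[r][col] # skip slots already filled by a rowspan above
      colspan = (cell['colspan'] || 1).to_i
      rowspan = (cell['rowspan'] || 1).to_i
      rowspan.times do |dr|
        grid[r + dr] ||= []
        colspan.times { |dc| grid[r + dr][col + dc] = cell.text.strip }
      end
      col += colspan
    end
  end
  grid
end

expand_table_grid(table).each { |row| p row }
# => ["Name", "Contact", "Contact"]
#    ["Name", "Email", "Phone"]
#    ["John Doe", "john@example.com", "555-1234"]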
Creating a Reusable Table Parser Class
For production applications, create a reusable table parser:
require 'nokogiri'
require 'open-uri' # the usage example below fetches a page

class TableParser
def initialize(html)
@doc = Nokogiri::HTML(html)
end
def parse_table(selector)
table = @doc.css(selector).first
return nil unless table
headers = extract_headers(table)
rows = extract_rows(table)
{
headers: headers,
data: rows,
rows_count: rows.length,
columns_count: headers.length
}
end
private
def extract_headers(table)
# Try to find headers in thead first, then first tr
headers = table.css('thead th, thead td')
headers = table.css('tr:first-child th, tr:first-child td') if headers.empty?
headers.map { |header| header.text.strip }
end
def extract_rows(table)
# Skip the header row: when a thead exists, data rows live in tbody
row_selector = table.css('thead').any? ? 'tbody tr' : 'tr:not(:first-child)'
rows = table.css(row_selector)
rows.map do |row|
row.css('td, th').map { |cell| cell.text.strip }
end
end
end
# Usage example
html = URI.open('https://example.com/data-table').read
parser = TableParser.new(html)
result = parser.parse_table('table.main-data')
if result
puts "Found table with #{result[:rows_count]} rows and #{result[:columns_count]} columns"
puts "Headers: #{result[:headers]}"
result[:data].each_with_index do |row, index|
puts "Row #{index + 1}: #{row}"
end
end
Error Handling and Edge Cases
Always implement proper error handling when parsing tables:
def safe_table_parse(html, table_selector)
  doc = Nokogiri::HTML(html)
  table = doc.css(table_selector).first

  unless table
    puts "Warning: No table found with selector '#{table_selector}'"
    return []
  end

  rows = table.css('tr')
  if rows.empty?
    puts "Warning: Table found but contains no rows"
    return []
  end

  # Collect cell texts, skipping rows with no cells
  data = []
  rows.each do |row|
    cells = row.css('td, th').map { |cell| cell.text.strip }
    data << cells unless cells.empty?
  end
  data
rescue StandardError => e
  puts "Error parsing table: #{e.message}"
  []
end
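A quick usage sketch (inputs invented for illustration):
p safe_table_parse('<table><tr><td>ok</td></tr></table>', 'table')
# => [["ok"]]

p safe_table_parse('<p>no tables here</p>', 'table#stats')
# prints a warning and returns []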
Performance Optimization Tips
1. Use Specific Selectors
# Good - specific selector
doc.css('table#data-table tbody tr td:first-child')
# Avoid - too general
doc.css('td')
2. Process Large Tables in Chunks
def process_large_table(doc, chunk_size = 100)
rows = doc.css('table tr')
rows.each_slice(chunk_size) do |chunk|
process_chunk(chunk)
# Optional: garbage collection for very large datasets
GC.start if rows.length > 10000
end
end
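process_chunk above is a placeholder for whatever your application does with each batch of rows. A hypothetical implementation might look like this (swap the body for writes to your own datastore):
# Hypothetical chunk handler: print each row's cell texts
def process_chunk(rows)
  rows.each do |row|
    puts row.css('td, th').map { |cell| cell.text.strip }.join(' | ')
  end
end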
3. Cache Parsed Documents
require 'nokogiri'
require 'open-uri'

class CachedTableParser
def initialize
@cache = {}
end
def parse(url, table_selector)
@cache[url] ||= Nokogiri::HTML(URI.open(url))
extract_table_data(@cache[url], table_selector)
end
end
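extract_table_data is left undefined above; one plausible implementation (an assumption, reusing the extraction pattern shown earlier) would look like this:
# Assumed helper for CachedTableParser: define this inside the class.
# It pulls row arrays out of an already-parsed document.
def extract_table_data(doc, table_selector)
  table = doc.css(table_selector).first
  return [] unless table

  table.css('tr').map do |row|
    row.css('td, th').map { |cell| cell.text.strip }
  end
end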
Real-World Example: Scraping Financial Data
Here's a practical example of scraping a financial data table:
require 'nokogiri'
require 'open-uri'
require 'csv'
class FinancialDataScraper
def initialize(url)
@url = url
@doc = Nokogiri::HTML(URI.open(url))
end
def scrape_stock_table
stocks = []
# Target the specific table containing stock data
@doc.css('table.stock-data tbody tr').each do |row|
cells = row.css('td')
next if cells.length < 6
stock = {
symbol: cells[0].text.strip,
company: cells[1].text.strip,
price: parse_price(cells[2].text),
change: parse_price(cells[3].text),
change_percent: cells[4].text.strip,
volume: parse_volume(cells[5].text)
}
stocks << stock
end
stocks
end
def export_to_csv(stocks, filename)
return if stocks.empty? # guard: nothing to write
CSV.open(filename, 'w', write_headers: true, headers: stocks.first.keys) do |csv|
stocks.each { |stock| csv << stock.values }
end
end
private
def parse_price(price_text)
price_text.gsub(/[$,]/, '').to_f
end
def parse_volume(volume_text)
volume_text.gsub(/[,]/, '').to_i
end
end
# Usage
scraper = FinancialDataScraper.new('https://example-finance.com/stocks')
stocks = scraper.scrape_stock_table
scraper.export_to_csv(stocks, 'stock_data.csv')
puts "Scraped #{stocks.length} stocks successfully!"
Integration with Web Scraping APIs
While Nokogiri is excellent for parsing static HTML tables, it cannot execute JavaScript. For tables that are rendered client-side, fetch the page with a headless browser or a web scraping API first, then hand the rendered HTML to Nokogiri for parsing.
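As a rough sketch, assuming the selenium-webdriver gem and a local Chrome install (neither ships with Nokogiri, and the URL is a placeholder), that hand-off could look like this:
require 'selenium-webdriver'
require 'nokogiri'

options = Selenium::WebDriver::Chrome::Options.new
options.add_argument('--headless=new') # render without a visible window

driver = Selenium::WebDriver.for(:chrome, options: options)
driver.get('https://example.com/js-rendered-table')

# page_source holds the DOM *after* JavaScript has run
doc = Nokogiri::HTML(driver.page_source)
puts "Rendered rows: #{doc.css('table tr').length}"

driver.quit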
Best Practices and Common Pitfalls
1. Always Validate Table Structure
def validate_table_structure(table, expected_columns)
first_row = table.css('tr').first
return false unless first_row
actual_columns = first_row.css('td, th').length
actual_columns == expected_columns
end
2. Handle Missing Data Gracefully
def safe_cell_text(cell, default = '')
cell ? cell.text.strip : default
end
3. Normalize Data Types
def normalize_table_data(raw_data)
raw_data.map do |row|
row.map do |cell|
# Remove extra whitespace
normalized = cell.strip
# Convert numeric strings (\A and \z avoid matching multi-line cells,
# which ^ and $ would in Ruby)
if normalized.match?(/\A\d+(\.\d+)?\z/)
normalized.include?('.') ? normalized.to_f : normalized.to_i
else
normalized
end
end
end
end
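A quick check of the behavior (values invented for the example):
p normalize_table_data([['  Laptop ', '999', '15.5']])
# => [["Laptop", 999, 15.5]]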
Testing Your Table Parser
Always write tests for your table parsing logic:
require 'minitest/autorun'
class TableParserTest < Minitest::Test
def setup
@html = <<-HTML
<table>
<tr><th>Name</th><th>Age</th></tr>
<tr><td>John</td><td>30</td></tr>
<tr><td>Jane</td><td>25</td></tr>
</table>
HTML
end
def test_extracts_headers_correctly
doc = Nokogiri::HTML(@html)
headers = doc.css('th').map(&:text)
assert_equal ['Name', 'Age'], headers
end
def test_extracts_data_rows
doc = Nokogiri::HTML(@html)
rows = doc.css('tr:not(:first-child)').map do |row|
row.css('td').map(&:text)
end
expected = [['John', '30'], ['Jane', '25']]
assert_equal expected, rows
end
end
Conclusion
Parsing HTML tables with Nokogiri is a fundamental skill for web scraping in Ruby. By combining CSS selectors and XPath expressions with proper error handling and optimization techniques, you can build robust table parsers that handle various table structures efficiently. Remember to always validate your data, handle edge cases gracefully, and consider the performance implications when working with large datasets.
Whether you're extracting financial data, product catalogs, or any other tabular information, Nokogiri provides the tools you need to parse HTML tables effectively. For more complex scenarios involving authentication workflows or JavaScript-rendered content, consider combining Nokogiri with other web scraping tools for comprehensive data extraction solutions.