How do I parse HTML tables using Nokogiri in Ruby?
Parsing HTML tables is one of the most common web scraping tasks, and Nokogiri provides powerful tools to extract structured data from table elements. This guide covers everything from basic table parsing to handling complex table structures with headers, merged cells, and nested data.
Getting Started with Nokogiri
First, ensure you have Nokogiri installed in your Ruby environment:
gem install nokogiri
Or add it to your Gemfile:
gem 'nokogiri'
Basic Table Parsing
Let's start with a simple HTML table structure:
require 'nokogiri'
html = <<-HTML
<table>
<thead>
<tr>
<th>Name</th>
<th>Age</th>
<th>City</th>
</tr>
</thead>
<tbody>
<tr>
<td>John Doe</td>
<td>30</td>
<td>New York</td>
</tr>
<tr>
<td>Jane Smith</td>
<td>25</td>
<td>Los Angeles</td>
</tr>
</tbody>
</table>
HTML
doc = Nokogiri::HTML(html)
# Extract all table rows from tbody
rows = doc.css('tbody tr')
rows.each do |row|
cells = row.css('td')
name = cells[0].text.strip
age = cells[1].text.strip
city = cells[2].text.strip
puts "Name: #{name}, Age: #{age}, City: #{city}"
end
Extracting Table Headers
When working with tables, it's often useful to extract headers first to understand the data structure:
# Extract headers
headers = doc.css('thead th').map { |th| th.text.strip }
puts "Headers: #{headers.join(', ')}"
# Create a hash for each row using headers as keys
data = []
doc.css('tbody tr').each do |row|
row_data = {}
row.css('td').each_with_index do |cell, index|
row_data[headers[index]] = cell.text.strip
end
data << row_data
end
puts data.inspect
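With each row reduced to a header-keyed hash, exporting the result is straightforward; here is a minimal sketch using Ruby's standard csv library (the headers and data values below are sample shapes matching what's built above):

```ruby
require 'csv'

# Sample shapes matching the headers and data built above
headers = ['Name', 'Age', 'City']
data = [
  { 'Name' => 'John Doe',   'Age' => '30', 'City' => 'New York' },
  { 'Name' => 'Jane Smith', 'Age' => '25', 'City' => 'Los Angeles' }
]

# Serialize the header-keyed hashes to CSV text
csv_text = CSV.generate do |csv|
  csv << headers
  data.each { |row| csv << row.values_at(*headers) }
end

puts csv_text
```

CSV.generate returns a string; use CSV.open instead if you want to stream rows straight to a file.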
Handling Tables Without Explicit Headers
Some tables don't have <thead> elements. You can treat the first row as headers:
html_simple = <<-HTML
<table>
<tr>
<td>Product</td>
<td>Price</td>
<td>Stock</td>
</tr>
<tr>
<td>Laptop</td>
<td>$999</td>
<td>15</td>
</tr>
<tr>
<td>Mouse</td>
<td>$25</td>
<td>100</td>
</tr>
</table>
HTML
doc = Nokogiri::HTML(html_simple)
all_rows = doc.css('table tr')
# Use first row as headers
headers = all_rows.first.css('td').map { |td| td.text.strip }
data_rows = all_rows[1..-1] # Skip the first row
products = []
data_rows.each do |row|
product_data = {}
row.css('td').each_with_index do |cell, index|
product_data[headers[index]] = cell.text.strip
end
products << product_data
end
puts products.inspect
Advanced Table Parsing Techniques
Working with Multiple Tables
When a page contains multiple tables, you can target specific ones using CSS selectors or XPath:
# Select table by class
specific_table = doc.css('table.data-table')
# Select table by ID
specific_table = doc.css('#results-table')
# Select the second table on the page
second_table = doc.css('table')[1]
# Using XPath for more complex selection
table_with_specific_header = doc.xpath('//table[.//th[contains(text(), "Results")]]')
Handling Colspan and Rowspan
Tables with merged cells require special attention:
def parse_complex_table(table)
rows = table.css('tr')
parsed_data = []
rows.each_with_index do |row, row_index|
cells = row.css('td, th')
cell_data = []
cells.each do |cell|
content = cell.text.strip
colspan = cell['colspan'] ? cell['colspan'].to_i : 1
rowspan = cell['rowspan'] ? cell['rowspan'].to_i : 1
# Store cell data with span information
cell_data << {
content: content,
colspan: colspan,
rowspan: rowspan
}
end
parsed_data << cell_data
end
parsed_data
end
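parse_complex_table records each cell's span sizes but doesn't place cells into the positions they cover. One way to finish the job is to expand that output into a rectangular grid; the expand_spans helper below is a hypothetical sketch that assumes the {content:, colspan:, rowspan:} row shape produced above:

```ruby
# Expand rows of {content:, colspan:, rowspan:} hashes (the shape produced
# by parse_complex_table) into a rectangular grid, duplicating spanned
# content into every position it covers. Hypothetical helper.
def expand_spans(parsed_rows)
  grid = []
  parsed_rows.each_with_index do |cells, row_index|
    grid[row_index] ||= []
    col = 0
    cells.each do |cell|
      # Skip slots already filled by a rowspan from an earlier row
      col += 1 while grid[row_index][col]
      cell[:rowspan].times do |dr|
        grid[row_index + dr] ||= []
        cell[:colspan].times do |dc|
          grid[row_index + dr][col + dc] = cell[:content]
        end
      end
      col += cell[:colspan]
    end
  end
  grid
end

rows = [
  [{ content: 'A', colspan: 2, rowspan: 1 }, { content: 'B', colspan: 1, rowspan: 2 }],
  [{ content: 'C', colspan: 1, rowspan: 1 }, { content: 'D', colspan: 1, rowspan: 1 }]
]
p expand_spans(rows)
# => [["A", "A", "B"], ["C", "D", "B"]]
```

Repeating spanned content into every covered slot keeps each row the same length, which simplifies any downstream column-based processing.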
Extracting Table Data with CSS Classes
Many modern websites use CSS classes to identify different types of data:
# Extract specific columns by class
prices = doc.css('table td.price').map { |cell| cell.text.strip }
dates = doc.css('table td.date').map { |cell| cell.text.strip }
# Extract rows with specific classes
featured_rows = doc.css('table tr.featured')
Real-World Example: Scraping Stock Data
Here's a practical example that combines multiple techniques:
require 'nokogiri'
require 'open-uri' # only needed if you fetch the page yourself, e.g. with URI.open
def scrape_stock_table(html_content)
doc = Nokogiri::HTML(html_content)
# Find the stock data table
stock_table = doc.css('table#stock-data').first
return [] unless stock_table
# Extract headers
headers = stock_table.css('thead th').map { |th| th.text.strip.downcase.gsub(/\s+/, '_') }
# Extract data rows
stocks = []
stock_table.css('tbody tr').each do |row|
stock_data = {}
row.css('td').each_with_index do |cell, index|
next if index >= headers.length
# Clean and process cell content
value = cell.text.strip
# Handle different data types
case headers[index]
when 'price', 'change'
value = value.gsub(/[$,]/, '').to_f
when 'volume'
value = value.gsub(/[,]/, '').to_i
when 'symbol'
# Extract link if present
link = cell.css('a').first
value = {
symbol: value,
url: link ? link['href'] : nil
}
end
stock_data[headers[index]] = value
end
stocks << stock_data unless stock_data.empty?
end
stocks
end
Handling Dynamic Content and JavaScript Tables
When tables are rendered dynamically with JavaScript, Nokogiri alone isn't sufficient, since it only parses the static HTML it is given. For such cases, combine Ruby with a browser automation tool, or use a scraping service that renders the JavaScript for you and returns the final HTML.
# For static HTML parsing with Nokogiri
def parse_static_table(html_content)
doc = Nokogiri::HTML(html_content)
# ... parsing logic
end
# For JavaScript-rendered content, you'd need tools like:
# - Watir
# - Capybara with a JavaScript driver
# - External APIs that handle JavaScript rendering
Error Handling and Best Practices
When parsing tables, always implement proper error handling:
def safe_table_parse(html_content)
begin
doc = Nokogiri::HTML(html_content)
# Check if table exists
table = doc.css('table').first
unless table
puts "No table found in the HTML content"
return []
end
# Verify table has rows
rows = table.css('tr')
if rows.empty?
puts "Table found but contains no rows"
return []
end
# Parse table data
data = []
rows.each_with_index do |row, index|
cells = row.css('td, th')
if cells.empty?
puts "Warning: Row #{index} contains no cells"
next
end
row_data = cells.map { |cell| cell.text.strip }
data << row_data
end
data
rescue Nokogiri::XML::SyntaxError => e
# Raised only under strict parsing; the default lenient HTML parser records problems in doc.errors instead of raising
puts "HTML parsing error: #{e.message}"
[]
rescue => e
puts "Unexpected error: #{e.message}"
[]
end
end
Performance Optimization
For large tables or multiple tables, consider these optimization techniques:
# Nokogiri translates CSS selectors to XPath internally, so performance is comparable; prefer whichever reads better
rows = doc.css('tbody tr')
# rows = doc.xpath('//tbody//tr') # Equivalent XPath
# Scope repeated queries to a parent node and reuse NodeSets instead of re-querying the whole document
table = doc.at_css('table')
table.css('tbody tr').each do |row|
cells = row.css('td')
# Process cells...
end
# Collecting direct text nodes can be faster than element.text on large elements,
# but note it ignores text inside nested child elements
def extract_text_efficiently(element)
element.children.select(&:text?).map(&:content).join.strip
end
Working with Nested Tables
Some websites use nested tables within table cells. Handle these carefully:
def parse_nested_tables(doc)
main_tables = doc.css('body > table') # Only top-level tables
main_tables.each do |table|
rows = table.css('> tbody > tr, > tr') # Direct children only
rows.each do |row|
cells = row.css('> td, > th') # Direct children only
cells.each do |cell|
# Check for nested tables
nested_tables = cell.css('table')
if nested_tables.any?
puts "Found nested table in cell"
# Handle nested table separately
else
puts "Cell content: #{cell.text.strip}"
end
end
end
end
end
Common Pitfalls and Solutions
- Empty cells: Always check for empty cells and handle them appropriately
- Nested tables: Be specific with your selectors to avoid selecting nested table elements
- Malformed HTML: Use Nokogiri's error recovery features to handle broken HTML
- Memory usage: For very large tables, consider processing rows in batches
# Handle empty cells
cell_value = cell.text.strip
cell_value = 'N/A' if cell_value.empty?
# Avoid nested table confusion: anchor the selector and use child combinators
main_table_rows = doc.css('body > table > tbody > tr') # Direct children only
# Process large tables in batches
def process_large_table(table, batch_size = 100)
rows = table.css('tr')
rows.each_slice(batch_size) do |batch|
batch.each do |row|
# Process row
end
# Rows already processed become eligible for garbage collection once nothing references them; forcing GC.start is rarely worthwhile
end
end
Extracting Links and Other Attributes
Tables often contain links and other HTML attributes that provide additional context:
def extract_table_with_links(table)
data = []
table.css('tbody tr').each do |row|
row_data = {}
row.css('td').each_with_index do |cell, index|
# Extract text content
text_content = cell.text.strip
# Extract links if present
links = cell.css('a').map do |link|
{
text: link.text.strip,
href: link['href'],
title: link['title']
}
end
# Extract images if present
images = cell.css('img').map do |img|
{
src: img['src'],
alt: img['alt'],
title: img['title']
}
end
row_data["column_#{index}"] = {
text: text_content,
links: links,
images: images,
raw_html: cell.inner_html
}
end
data << row_data
end
data
end
Conclusion
Nokogiri provides a robust foundation for HTML table parsing in Ruby applications. By combining proper CSS selectors, error handling, and optimization techniques, you can efficiently extract structured data from even complex table layouts.
For more advanced scenarios involving JavaScript-heavy pages, consider integrating Nokogiri with browser automation tools or using web scraping APIs that handle authentication and session management automatically.
Remember to always validate your parsing logic against different table structures, handle edge cases gracefully, and respect website terms of service when scraping data.