How to Extract Data from HTML Tables Using Mechanize
Extracting data from HTML tables is one of the most common web scraping tasks, and Mechanize gives Ruby developers a solid toolkit for handling it. Whether you're scraping financial data, product listings, or statistical information, knowing how to work with tables in Mechanize is essential for any web scraping project in Ruby.
Understanding HTML Table Structure
Before diving into Mechanize-specific techniques, it's important to understand the basic structure of HTML tables:
<table>
  <thead>
    <tr>
      <th>Header 1</th>
      <th>Header 2</th>
      <th>Header 3</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Data 1</td>
      <td>Data 2</td>
      <td>Data 3</td>
    </tr>
    <tr>
      <td>Data 4</td>
      <td>Data 5</td>
      <td>Data 6</td>
    </tr>
  </tbody>
</table>
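One practical caveat: thead and tbody are optional in the HTML that servers actually send. Browsers insert a tbody implicitly when rendering, but Nokogiri (which Mechanize uses for parsing) works on the markup as served, so a 'tbody tr' selector matches nothing if the source omits tbody, even though your browser's devtools show one. The more defensive snippets later in this guide fall back to plain 'tr' for exactly this reason.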
Basic Table Extraction with Mechanize
Setting Up Mechanize
First, make sure Mechanize is installed (gem install mechanize, or add gem 'mechanize' to your Gemfile), then set up a basic agent:
require 'mechanize'

agent = Mechanize.new
agent.user_agent_alias = 'Windows Chrome'
page = agent.get('https://example.com/table-page')
Finding Tables on a Page
Mechanize uses Nokogiri under the hood, so you can use CSS selectors or XPath to locate tables:
# Find all tables on the page
tables = page.search('table')

# Find a specific table by class
table = page.at('table.data-table')

# Find a table by ID
table = page.at('#financial-data')

# Find tables containing specific text
table = page.search('table').find { |t| t.text.include?('Revenue') }
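Since search and at also accept XPath, you can express the same lookups that way; XPath is especially handy when matching on text content. The selectors below are illustrative:

# XPath equivalents of the CSS lookups above
table = page.at('//table[@id="financial-data"]')
table = page.at('//table[contains(@class, "data-table")]')

# All tables whose text mentions "Revenue"
tables = page.search('//table[contains(., "Revenue")]')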
Extracting Table Data: Basic Approach
Method 1: Row-by-Row Extraction
require 'mechanize'

agent = Mechanize.new
page = agent.get('https://example.com/data-table')
table = page.at('table')

# Extract headers
headers = table.search('thead tr th').map(&:text).map(&:strip)
puts "Headers: #{headers}"

# Extract data rows
data = []
table.search('tbody tr').each do |row|
  row_data = row.search('td').map(&:text).map(&:strip)
  data << row_data unless row_data.empty?
end

# Display the extracted data
data.each_with_index do |row, index|
  puts "Row #{index + 1}: #{row}"
end
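Run against the sample table from the structure section above, this prints:

Headers: ["Header 1", "Header 2", "Header 3"]
Row 1: ["Data 1", "Data 2", "Data 3"]
Row 2: ["Data 4", "Data 5", "Data 6"]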
Method 2: Column-Specific Extraction
# Extract specific columns by index
def extract_column(table, column_index)
  table.search('tbody tr').map do |row|
    cells = row.search('td')
    cells[column_index]&.text&.strip if cells.length > column_index
  end.compact
end

# Usage
table = page.at('table.price-list')
product_names = extract_column(table, 0)
prices = extract_column(table, 2)

product_names.zip(prices).each do |name, price|
  puts "#{name}: #{price}"
end
Advanced Table Extraction Techniques
Handling Complex Tables with Merged Cells
def extract_complex_table(table)
  rows = []
  table.search('tr').each do |tr|
    row = []
    tr.search('td, th').each do |cell|
      # Record colspan and rowspan so merged cells can be expanded later
      colspan = (cell['colspan'] || 1).to_i
      rowspan = (cell['rowspan'] || 1).to_i
      cell_data = {
        text: cell.text.strip,
        colspan: colspan,
        rowspan: rowspan,
        classes: cell['class']
      }
      row << cell_data
    end
    rows << row unless row.empty?
  end
  rows
end
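This helper records the span metadata but leaves the cells in document order. To turn its output into a rectangular grid where each merged cell is repeated across its span, you can expand it afterwards. Here is a minimal sketch that consumes extract_complex_table's output:

# Expand colspan/rowspan metadata into a rectangular grid by
# repeating each cell's text across the rows and columns it spans.
def expand_spans(rows)
  grid = []
  rows.each_with_index do |row, r|
    grid[r] ||= []
    col = 0
    row.each do |cell|
      col += 1 while grid[r][col] # skip slots already filled by a rowspan above
      cell[:rowspan].times do |dr|
        grid[r + dr] ||= []
        cell[:colspan].times { |dc| grid[r + dr][col + dc] = cell[:text] }
      end
      col += cell[:colspan]
    end
  end
  grid
end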
Extracting Tables with Nested Elements
def extract_rich_table_data(table)
  data = []
  table.search('tbody tr').each do |row|
    row_data = {}
    row.search('td').each_with_index do |cell, index|
      # Extract text content
      row_data["column_#{index}"] = cell.text.strip
      # Extract links if present
      links = cell.search('a').map { |link| link['href'] }
      row_data["column_#{index}_links"] = links unless links.empty?
      # Extract images if present
      images = cell.search('img').map { |img| img['src'] }
      row_data["column_#{index}_images"] = images unless images.empty?
    end
    data << row_data
  end
  data
end
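Used on a table whose cells contain links (the selector and markup here are hypothetical), it might produce something like:

table = page.at('table.listings')
extract_rich_table_data(table).each { |row| puts row.inspect }
# e.g. {"column_0"=>"Widget", "column_0_links"=>["/widgets/1"], "column_1"=>"$9.99"}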
Working with Dynamic Tables
Handling Paginated Tables
def scrape_paginated_table(base_url, pages: 10)
  agent = Mechanize.new
  all_data = []
  (1..pages).each do |page_num|
    begin
      url = "#{base_url}?page=#{page_num}"
      page = agent.get(url)
      table = page.at('table.results')
      break unless table # No more pages
      page_data = extract_table_data(table) # helper defined below
      all_data.concat(page_data)
      puts "Extracted #{page_data.length} rows from page #{page_num}"
      sleep 1 # Be respectful to the server
    rescue Mechanize::ResponseCodeError => e
      puts "Error on page #{page_num}: #{e.message}"
      break
    end
  end
  all_data
end
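The extract_table_data helper called above isn't defined in the snippet; a minimal version that reuses the Method 1 approach could look like this:

# Minimal helper assumed by scrape_paginated_table:
# returns one array of cell strings per body row.
def extract_table_data(table)
  table.search('tbody tr').map do |row|
    row.search('td').map { |td| td.text.strip }
  end.reject(&:empty?)
end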
Handling AJAX-Loaded Tables
While Mechanize cannot execute JavaScript, the data behind a dynamically rendered table is often served from a JSON endpoint that you can fetch directly:

require 'json'
require 'uri'

def find_ajax_table_data(agent, base_url)
  # Look for AJAX endpoints in page source
  page = agent.get(base_url)
  # Search for API endpoints in script tags
  scripts = page.search('script').map(&:text).join(' ')
  # Extract potential API URLs
  api_urls = scripts.scan(/['"]([^'"]*\/api\/[^'"]*table[^'"]*)['"]/).flatten
  api_urls.each do |api_url|
    begin
      # Try to fetch the JSON data directly
      full_url = URI.join(base_url, api_url).to_s
      result = agent.get(full_url)
      if result.response['content-type'].to_s.include?('json')
        data = JSON.parse(result.body)
        return process_json_table_data(data) # shape-specific parsing, sketched below
      end
    rescue => e
      puts "Failed to fetch #{api_url}: #{e.message}"
    end
  end
  nil # no usable endpoint found
end
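process_json_table_data is left undefined because every endpoint returns a different shape. Assuming the common case of an array of row objects, possibly wrapped in a hash, a sketch might be:

# Sketch only: adapt the key names to the JSON the endpoint actually returns.
def process_json_table_data(data)
  rows = data.is_a?(Hash) ? (data['rows'] || data['data'] || []) : data
  rows.map { |row| row.is_a?(Hash) ? row.values : Array(row) }
end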
Data Processing and Cleanup
Cleaning Extracted Data
def clean_table_data(raw_data)
  raw_data.map do |row|
    row.map do |cell|
      # Remove extra whitespace
      cleaned = cell.gsub(/\s+/, ' ').strip
      # Remove currency symbols and convert to numbers where appropriate
      if cleaned.match?(/^\$?[\d,]+\.?\d*$/)
        cleaned.gsub(/[$,]/, '').to_f
      else
        cleaned
      end
    end
  end
end
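For example:

clean_table_data([['  Total  ', '$1,234.50'], ['Items', '42']])
# => [["Total", 1234.5], ["Items", 42.0]]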
Converting to Structured Data
def table_to_hash(headers, data)
  data.map do |row|
    hash = {}
    headers.each_with_index do |header, index|
      hash[header.downcase.gsub(/\s+/, '_')] = row[index]
    end
    hash
  end
end

# Usage
headers = ['Product Name', 'Price', 'Stock']
raw_data = [['iPhone 13', '$699', '50'], ['Samsung Galaxy', '$599', '30']]
structured_data = table_to_hash(headers, raw_data)
puts structured_data
# Output: [{"product_name"=>"iPhone 13", "price"=>"$699", "stock"=>"50"}, ...]
# Run the rows through clean_table_data first if you want numeric prices.
Error Handling and Best Practices
Robust Table Extraction
def safe_table_extraction(node, selector = nil)
  # Accepts either a page plus a selector, or a table node directly
  table = selector ? node.at(selector) : node
  unless table
    puts "Warning: Table not found with selector '#{selector}'"
    return { headers: [], data: [] }
  end
  begin
    headers = table.search('thead tr th').map { |th| th.text.strip }
    rows = table.search('tbody tr')
    rows = table.search('tr') if rows.empty?
    if headers.empty? && rows.first
      # No <thead>: treat the first row as the header row
      headers = rows.first.search('th, td').map { |cell| cell.text.strip }
      rows = rows.drop(1)
    end
    data = []
    rows.each do |row|
      row_data = row.search('td').map { |td| td.text.strip }
      next if row_data.empty? || row_data.all?(&:empty?)
      data << row_data
    end
    { headers: headers, data: data }
  rescue => e
    puts "Error extracting table data: #{e.message}"
    { headers: [], data: [] }
  end
end
Rate Limiting and Politeness
class PoliteTableScraper
  def initialize(delay: 1)
    @agent = Mechanize.new
    @agent.user_agent_alias = 'Windows Chrome'
    @delay = delay
  end

  def scrape_tables(urls)
    results = []
    urls.each_with_index do |url, index|
      begin
        puts "Scraping #{url} (#{index + 1}/#{urls.length})"
        page = @agent.get(url)
        tables_data = extract_all_tables(page)
        results << { url: url, tables: tables_data }
        sleep @delay unless index == urls.length - 1
      rescue => e
        puts "Error scraping #{url}: #{e.message}"
        results << { url: url, error: e.message }
      end
    end
    results
  end

  private

  def extract_all_tables(page)
    # CSS nth-of-type counts within a single parent element, so it cannot
    # reliably address "the Nth table on the page". Pass each table node
    # to safe_table_extraction directly instead.
    page.search('table').map { |table| safe_table_extraction(table) }
  end
end
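Usage might look like this (the URLs are illustrative):

scraper = PoliteTableScraper.new(delay: 2)
results = scraper.scrape_tables([
  'https://example.com/stats/2023',
  'https://example.com/stats/2024'
])
results.each do |result|
  puts result[:error] ? "#{result[:url]} failed" : "#{result[:url]}: #{result[:tables].length} tables"
end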
Exporting Extracted Data
Saving to CSV
require 'csv'

def export_to_csv(headers, data, filename)
  CSV.open(filename, 'w') do |csv|
    csv << headers
    data.each { |row| csv << row }
  end
  puts "Data exported to #{filename}"
end
Saving to JSON
require 'json'

def export_to_json(structured_data, filename)
  File.write(filename, JSON.pretty_generate(structured_data))
  puts "Data exported to #{filename}"
end
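These helpers compose naturally with the extraction code from earlier sections, for example:

# Illustrative wiring of the earlier helpers
result = safe_table_extraction(page, 'table')
export_to_csv(result[:headers], result[:data], 'table.csv')
export_to_json(table_to_hash(result[:headers], result[:data]), 'table.json')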
Integration with Other Tools
While Mechanize is excellent for server-side rendered tables, it does not execute JavaScript. For JavaScript-heavy applications, consider pairing it with a headless browser driver such as Selenium or Watir, or with a specialized web scraping API that renders dynamic content before you parse it.
Complete Example: E-commerce Product Table
require 'mechanize'
require 'csv'

class ProductTableScraper
  def initialize
    @agent = Mechanize.new
    @agent.user_agent_alias = 'Windows Chrome'
  end

  def scrape_products(url)
    page = @agent.get(url)
    table = page.at('table.product-table')
    return [] unless table

    products = []
    table.search('tbody tr').each do |row|
      cells = row.search('td')
      next if cells.length < 4

      product = {
        name: cells[0].text.strip,
        price: extract_price(cells[1].text),
        stock: cells[2].text.strip.to_i,
        rating: extract_rating(cells[3]),
        url: extract_product_url(cells[0])
      }
      products << product
    end
    products
  end

  private

  def extract_price(price_text)
    price_text.gsub(/[^\d.]/, '').to_f
  end

  def extract_rating(cell)
    stars = cell.search('.star.filled').length
    stars.to_f
  end

  def extract_product_url(cell)
    link = cell.at('a')
    link ? link['href'] : nil
  end
end

# Usage
scraper = ProductTableScraper.new
products = scraper.scrape_products('https://example-store.com/products')

# Export to CSV
CSV.open('products.csv', 'w') do |csv|
  csv << ['Name', 'Price', 'Stock', 'Rating', 'URL']
  products.each do |product|
    csv << [product[:name], product[:price], product[:stock], product[:rating], product[:url]]
  end
end

puts "Scraped #{products.length} products"
Conclusion
Extracting data from HTML tables using Mechanize is a powerful technique for web scraping in Ruby. By understanding table structure, implementing robust extraction methods, and following best practices for error handling and rate limiting, you can efficiently scrape tabular data from websites. Remember to always respect robots.txt files and implement appropriate delays to avoid overwhelming target servers.
For more complex scenarios involving dynamic content handling, consider combining Mechanize with other tools or using specialized web scraping APIs that can handle JavaScript-rendered content more effectively.