How to Extract Data from HTML Tables Using Mechanize
Extracting data from HTML tables is one of the most common web scraping tasks, and Mechanize gives Ruby developers a solid toolkit for handling it. Whether you're scraping financial data, product listings, or statistical information, knowing how to work with tables in Mechanize is essential for any web scraping project in Ruby.
Understanding HTML Table Structure
Before diving into Mechanize-specific techniques, it's important to understand the basic structure of HTML tables:
<table>
  <thead>
    <tr>
      <th>Header 1</th>
      <th>Header 2</th>
      <th>Header 3</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Data 1</td>
      <td>Data 2</td>
      <td>Data 3</td>
    </tr>
    <tr>
      <td>Data 4</td>
      <td>Data 5</td>
      <td>Data 6</td>
    </tr>
  </tbody>
</table>
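One practical caveat: thead and tbody are optional in the HTML that servers actually send. Browsers insert a tbody implicitly when rendering, but Nokogiri (which Mechanize uses for parsing) works on the markup as served, so a 'tbody tr' selector matches nothing if the source omits tbody, even though your browser's devtools show one. The more defensive snippets later in this guide fall back to plain 'tr' for exactly this reason.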
Basic Table Extraction with Mechanize
Setting Up Mechanize
First, make sure Mechanize is installed (gem install mechanize, or add gem 'mechanize' to your Gemfile), then set up a basic agent:
require 'mechanize'

agent = Mechanize.new
agent.user_agent_alias = 'Windows Chrome'
page = agent.get('https://example.com/table-page')
Finding Tables on a Page
Mechanize uses Nokogiri under the hood, so you can use CSS selectors or XPath to locate tables:
# Find all tables on the page
tables = page.search('table')

# Find a specific table by class
table = page.at('table.data-table')

# Find a table by ID
table = page.at('#financial-data')

# Find tables containing specific text
table = page.search('table').find { |t| t.text.include?('Revenue') }
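Since search and at also accept XPath, you can express the same lookups that way; XPath is especially handy when matching on text content. The selectors below are illustrative:

# XPath equivalents of the CSS lookups above
table = page.at('//table[@id="financial-data"]')
table = page.at('//table[contains(@class, "data-table")]')

# All tables whose text mentions "Revenue"
tables = page.search('//table[contains(., "Revenue")]')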
Extracting Table Data: Basic Approach
Method 1: Row-by-Row Extraction
require 'mechanize'

agent = Mechanize.new
page = agent.get('https://example.com/data-table')
table = page.at('table')

# Extract headers
headers = table.search('thead tr th').map(&:text).map(&:strip)
puts "Headers: #{headers}"

# Extract data rows
data = []
table.search('tbody tr').each do |row|
  row_data = row.search('td').map(&:text).map(&:strip)
  data << row_data unless row_data.empty?
end

# Display the extracted data
data.each_with_index do |row, index|
  puts "Row #{index + 1}: #{row}"
end
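Run against the sample table from the structure section above, this prints:

Headers: ["Header 1", "Header 2", "Header 3"]
Row 1: ["Data 1", "Data 2", "Data 3"]
Row 2: ["Data 4", "Data 5", "Data 6"]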
Method 2: Column-Specific Extraction
# Extract specific columns by index
def extract_column(table, column_index)
  table.search('tbody tr').map do |row|
    cells = row.search('td')
    cells[column_index]&.text&.strip if cells.length > column_index
  end.compact
end

# Usage
table = page.at('table.price-list')
product_names = extract_column(table, 0)
prices = extract_column(table, 2)

product_names.zip(prices).each do |name, price|
  puts "#{name}: #{price}"
end
Advanced Table Extraction Techniques
Handling Complex Tables with Merged Cells
def extract_complex_table(table)
  rows = []
  table.search('tr').each do |tr|
    row = []
    tr.search('td, th').each do |cell|
      # Record colspan and rowspan so merged cells can be expanded later
      colspan = (cell['colspan'] || 1).to_i
      rowspan = (cell['rowspan'] || 1).to_i
      cell_data = {
        text: cell.text.strip,
        colspan: colspan,
        rowspan: rowspan,
        classes: cell['class']
      }
      row << cell_data
    end
    rows << row unless row.empty?
  end
  rows
end
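This helper records the span metadata but leaves the cells in document order. To turn its output into a rectangular grid where each merged cell is repeated across its span, you can expand it afterwards. Here is a minimal sketch that consumes extract_complex_table's output:

# Expand colspan/rowspan metadata into a rectangular grid by
# repeating each cell's text across the rows and columns it spans.
def expand_spans(rows)
  grid = []
  rows.each_with_index do |row, r|
    grid[r] ||= []
    col = 0
    row.each do |cell|
      col += 1 while grid[r][col] # skip slots already filled by a rowspan above
      cell[:rowspan].times do |dr|
        grid[r + dr] ||= []
        cell[:colspan].times { |dc| grid[r + dr][col + dc] = cell[:text] }
      end
      col += cell[:colspan]
    end
  end
  grid
end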
Extracting Tables with Nested Elements
def extract_rich_table_data(table)
  data = []
  table.search('tbody tr').each do |row|
    row_data = {}
    row.search('td').each_with_index do |cell, index|
      # Extract text content
      row_data["column_#{index}"] = cell.text.strip
      # Extract links if present
      links = cell.search('a').map { |link| link['href'] }
      row_data["column_#{index}_links"] = links unless links.empty?
      # Extract images if present
      images = cell.search('img').map { |img| img['src'] }
      row_data["column_#{index}_images"] = images unless images.empty?
    end
    data << row_data
  end
  data
end
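Used on a table whose cells contain links (the selector and markup here are hypothetical), it might produce something like:

table = page.at('table.listings')
extract_rich_table_data(table).each { |row| puts row.inspect }
# e.g. {"column_0"=>"Widget", "column_0_links"=>["/widgets/1"], "column_1"=>"$9.99"}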
Working with Dynamic Tables
Handling Paginated Tables
def scrape_paginated_table(base_url, pages: 10)
  agent = Mechanize.new
  all_data = []
  (1..pages).each do |page_num|
    begin
      url = "#{base_url}?page=#{page_num}"
      page = agent.get(url)
      table = page.at('table.results')
      break unless table # No more pages
      page_data = extract_table_data(table) # helper defined below
      all_data.concat(page_data)
      puts "Extracted #{page_data.length} rows from page #{page_num}"
      sleep 1 # Be respectful to the server
    rescue Mechanize::ResponseCodeError => e
      puts "Error on page #{page_num}: #{e.message}"
      break
    end
  end
  all_data
end
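The extract_table_data helper called above isn't defined in the snippet; a minimal version that reuses the Method 1 approach could look like this:

# Minimal helper assumed by scrape_paginated_table:
# returns one array of cell strings per body row.
def extract_table_data(table)
  table.search('tbody tr').map do |row|
    row.search('td').map { |td| td.text.strip }
  end.reject(&:empty?)
end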
Handling AJAX-Loaded Tables
While Mechanize cannot execute JavaScript, the data behind a dynamically rendered table is often served from a JSON endpoint that you can fetch directly:

require 'json'
require 'uri'

def find_ajax_table_data(agent, base_url)
  # Look for AJAX endpoints in page source
  page = agent.get(base_url)
  # Search for API endpoints in script tags
  scripts = page.search('script').map(&:text).join(' ')
  # Extract potential API URLs
  api_urls = scripts.scan(/['"]([^'"]*\/api\/[^'"]*table[^'"]*)['"]/).flatten
  api_urls.each do |api_url|
    begin
      # Try to fetch the JSON data directly
      full_url = URI.join(base_url, api_url).to_s
      result = agent.get(full_url)
      if result.response['content-type'].to_s.include?('json')
        data = JSON.parse(result.body)
        return process_json_table_data(data) # shape-specific parsing, sketched below
      end
    rescue => e
      puts "Failed to fetch #{api_url}: #{e.message}"
    end
  end
  nil # no usable endpoint found
end
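process_json_table_data is left undefined because every endpoint returns a different shape. Assuming the common case of an array of row objects, possibly wrapped in a hash, a sketch might be:

# Sketch only: adapt the key names to the JSON the endpoint actually returns.
def process_json_table_data(data)
  rows = data.is_a?(Hash) ? (data['rows'] || data['data'] || []) : data
  rows.map { |row| row.is_a?(Hash) ? row.values : Array(row) }
end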
Data Processing and Cleanup
Cleaning Extracted Data
def clean_table_data(raw_data)
  raw_data.map do |row|
    row.map do |cell|
      # Remove extra whitespace
      cleaned = cell.gsub(/\s+/, ' ').strip
      # Remove currency symbols and convert to numbers where appropriate
      if cleaned.match?(/^\$?[\d,]+\.?\d*$/)
        cleaned.gsub(/[$,]/, '').to_f
      else
        cleaned
      end
    end
  end
end
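For example:

clean_table_data([['  Total  ', '$1,234.50'], ['Items', '42']])
# => [["Total", 1234.5], ["Items", 42.0]]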
Converting to Structured Data
def table_to_hash(headers, data)
  data.map do |row|
    hash = {}
    headers.each_with_index do |header, index|
      hash[header.downcase.gsub(/\s+/, '_')] = row[index]
    end
    hash
  end
end

# Usage
headers = ['Product Name', 'Price', 'Stock']
raw_data = [['iPhone 13', '$699', '50'], ['Samsung Galaxy', '$599', '30']]
structured_data = table_to_hash(headers, raw_data)
puts structured_data
# Output: [{"product_name"=>"iPhone 13", "price"=>"$699", "stock"=>"50"}, ...]
# Run the rows through clean_table_data first if you want numeric prices.
Error Handling and Best Practices
Robust Table Extraction
def safe_table_extraction(node, selector = nil)
  # Accepts either a page plus a selector, or a table node directly
  table = selector ? node.at(selector) : node
  unless table
    puts "Warning: Table not found with selector '#{selector}'"
    return { headers: [], data: [] }
  end
  begin
    headers = table.search('thead tr th').map { |th| th.text.strip }
    rows = table.search('tbody tr')
    rows = table.search('tr') if rows.empty?
    if headers.empty? && rows.first
      # No <thead>: treat the first row as the header row
      headers = rows.first.search('th, td').map { |cell| cell.text.strip }
      rows = rows.drop(1)
    end
    data = []
    rows.each do |row|
      row_data = row.search('td').map { |td| td.text.strip }
      next if row_data.empty? || row_data.all?(&:empty?)
      data << row_data
    end
    { headers: headers, data: data }
  rescue => e
    puts "Error extracting table data: #{e.message}"
    { headers: [], data: [] }
  end
end
Rate Limiting and Politeness
class PoliteTableScraper
  def initialize(delay: 1)
    @agent = Mechanize.new
    @agent.user_agent_alias = 'Windows Chrome'
    @delay = delay
  end

  def scrape_tables(urls)
    results = []
    urls.each_with_index do |url, index|
      begin
        puts "Scraping #{url} (#{index + 1}/#{urls.length})"
        page = @agent.get(url)
        tables_data = extract_all_tables(page)
        results << { url: url, tables: tables_data }
        sleep @delay unless index == urls.length - 1
      rescue => e
        puts "Error scraping #{url}: #{e.message}"
        results << { url: url, error: e.message }
      end
    end
    results
  end

  private

  def extract_all_tables(page)
    # CSS nth-of-type counts within a single parent element, so it cannot
    # reliably address "the Nth table on the page". Pass each table node
    # to safe_table_extraction directly instead.
    page.search('table').map { |table| safe_table_extraction(table) }
  end
end
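Usage might look like this (the URLs are illustrative):

scraper = PoliteTableScraper.new(delay: 2)
results = scraper.scrape_tables([
  'https://example.com/stats/2023',
  'https://example.com/stats/2024'
])
results.each do |result|
  puts result[:error] ? "#{result[:url]} failed" : "#{result[:url]}: #{result[:tables].length} tables"
end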
Exporting Extracted Data
Saving to CSV
require 'csv'

def export_to_csv(headers, data, filename)
  CSV.open(filename, 'w') do |csv|
    csv << headers
    data.each { |row| csv << row }
  end
  puts "Data exported to #{filename}"
end
Saving to JSON
require 'json'

def export_to_json(structured_data, filename)
  File.write(filename, JSON.pretty_generate(structured_data))
  puts "Data exported to #{filename}"
end
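These helpers compose naturally with the extraction code from earlier sections, for example:

# Illustrative wiring of the earlier helpers
result = safe_table_extraction(page, 'table')
export_to_csv(result[:headers], result[:data], 'table.csv')
export_to_json(table_to_hash(result[:headers], result[:data]), 'table.json')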
Integration with Other Tools
While Mechanize is excellent for server-side rendered tables, it does not execute JavaScript. For JavaScript-heavy applications, consider pairing it with a headless browser driver such as Selenium or Watir, or with a specialized web scraping API that renders dynamic content before you parse it.
Complete Example: E-commerce Product Table
require 'mechanize'
require 'csv'

class ProductTableScraper
  def initialize
    @agent = Mechanize.new
    @agent.user_agent_alias = 'Windows Chrome'
  end

  def scrape_products(url)
    page = @agent.get(url)
    table = page.at('table.product-table')
    return [] unless table

    products = []
    table.search('tbody tr').each do |row|
      cells = row.search('td')
      next if cells.length < 4

      product = {
        name: cells[0].text.strip,
        price: extract_price(cells[1].text),
        stock: cells[2].text.strip.to_i,
        rating: extract_rating(cells[3]),
        url: extract_product_url(cells[0])
      }
      products << product
    end
    products
  end

  private

  def extract_price(price_text)
    price_text.gsub(/[^\d.]/, '').to_f
  end

  def extract_rating(cell)
    stars = cell.search('.star.filled').length
    stars.to_f
  end

  def extract_product_url(cell)
    link = cell.at('a')
    link ? link['href'] : nil
  end
end

# Usage
scraper = ProductTableScraper.new
products = scraper.scrape_products('https://example-store.com/products')

# Export to CSV
CSV.open('products.csv', 'w') do |csv|
  csv << ['Name', 'Price', 'Stock', 'Rating', 'URL']
  products.each do |product|
    csv << [product[:name], product[:price], product[:stock], product[:rating], product[:url]]
  end
end

puts "Scraped #{products.length} products"
Conclusion
Extracting data from HTML tables using Mechanize is a powerful technique for web scraping in Ruby. By understanding table structure, implementing robust extraction methods, and following best practices for error handling and rate limiting, you can efficiently scrape tabular data from websites. Remember to always respect robots.txt files and implement appropriate delays to avoid overwhelming target servers.
For more complex scenarios involving dynamic content handling, consider combining Mechanize with other tools or using specialized web scraping APIs that can handle JavaScript-rendered content more effectively.