Can Nokogiri extract data from tables in a structured format?

Yes. Nokogiri, a Ruby library for parsing HTML and XML, can extract table data in a structured format. It provides methods to navigate and search the document tree of a parsed document, so you can locate specific elements, such as table rows and cells, and extract their contents.

Here's a basic example of how to use Nokogiri to extract data from an HTML table:

require 'nokogiri'

# Load the HTML content from a file, URL, or a string
html_content = <<-HTML
<table>
  <tr>
    <th>Name</th>
    <th>Age</th>
    <th>City</th>
  </tr>
  <tr>
    <td>Alice</td>
    <td>30</td>
    <td>New York</td>
  </tr>
  <tr>
    <td>Bob</td>
    <td>32</td>
    <td>Los Angeles</td>
  </tr>
</table>
HTML

# Parse the HTML content
doc = Nokogiri::HTML(html_content)

# Initialize an empty array to store the extracted data
table_data = []

# Locate the table in the document
table = doc.at('table')

# Iterate over each row in the table
table.search('tr').each_with_index do |row, index|
  # Skip the table header row
  next if index == 0

  # Extract the text from each cell in the row and add it to the array
  row_data = row.search('td').map { |td| td.text.strip }
  table_data << row_data
end

# Output the extracted data
puts table_data.inspect
# => [["Alice", "30", "New York"], ["Bob", "32", "Los Angeles"]]

This example uses a heredoc to stand in for the HTML you might obtain from a file or a website. In a real-world scenario, you would assign the actual HTML you want to parse to the html_content variable.

The script first parses the HTML content into a Nokogiri document. It then locates the table and iterates over each row, skipping the header row. For each row, it extracts the text from the table cells (td elements) and adds an array of this text to the table_data array. Finally, it outputs the extracted data.

Note that this example assumes a simple table structure, with no nested tables and no cells that use rowspan or colspan attributes. For more complex tables, you will need additional logic to parse and extract the data correctly.
