How do I use GoQuery to scrape HTML tables into structured data?

GoQuery is a Go package that provides a jQuery-like API for parsing, querying, and manipulating HTML documents. To scrape HTML tables into structured data using GoQuery, you'll need to perform the following steps:

  1. Install GoQuery: If you haven't already installed GoQuery, you can add it to your project with the go tool:

    go get github.com/PuerkitoBio/goquery
    
  2. Fetch the HTML content: Use Go's standard net/http package to fetch the HTML content you want to scrape.

  3. Parse the HTML content with GoQuery: Once you've got the HTML content, you can load it into GoQuery and start querying the document.

  4. Iterate over the table rows: Find the table and iterate over its rows, extracting the columns (cells) into your structured data format.

Here's a complete example of how to use GoQuery to scrape data from an HTML table:

package main

import (
    "fmt"
    "log"
    "net/http"

    "github.com/PuerkitoBio/goquery"
)

func scrapeTable(url string) ([]map[string]string, error) {
    // Fetch the HTML page.
    res, err := http.Get(url)
    if err != nil {
        return nil, err
    }
    defer res.Body.Close()

    if res.StatusCode != 200 {
        return nil, fmt.Errorf("status code error: %d %s", res.StatusCode, res.Status)
    }

    // Load the HTML document into GoQuery.
    doc, err := goquery.NewDocumentFromReader(res.Body)
    if err != nil {
        return nil, err
    }

    var data []map[string]string

    // Find the table and iterate over its rows.
    doc.Find("table.my-table-selector tbody tr").Each(func(i int, row *goquery.Selection) {
        // For each row, find the cells.
        // Initialize the map; assigning to a nil map would panic.
        rowData := make(map[string]string)
        row.Find("td").Each(func(j int, cell *goquery.Selection) {
            // Use the cell's index or a known header to create structured data.
            // For example, if you know the headers of the table:
            switch j {
            case 0:
                rowData["Header1"] = cell.Text()
            case 1:
                rowData["Header2"] = cell.Text()
            // Add more cases as needed.
            }
        })
        data = append(data, rowData)
    })

    return data, nil
}

func main() {
    url := "http://example.com/table.html"
    tableData, err := scrapeTable(url)
    if err != nil {
        log.Fatal(err)
    }

    // Do something with the scraped table data.
    fmt.Println(tableData)
}

In this example, scrapeTable is a function that takes a URL, fetches the HTML content, parses it with GoQuery, and then iterates over the rows of the table with the selector table.my-table-selector. For each row, it creates a map of strings representing the structured data for that row. You'll need to adjust the selector table.my-table-selector to match the actual selector of the table you want to scrape.

Finally, the main function calls scrapeTable with the URL of the page containing the table and prints out the structured data. You would replace "http://example.com/table.html" with the actual URL of the table you want to scrape.

Remember to handle the structure of the table properly. If the table has headers, you may want to use them to create the keys for your structured data maps. If not, you can rely on the column index, as shown in the switch statement in the example.
