GoQuery is a package for Go that provides a set of features to scrape and manipulate HTML documents similarly to jQuery. To scrape HTML tables into structured data using GoQuery, you'll need to perform the following steps:
Install GoQuery: If you haven't already installed GoQuery, you can do so using Go's package manager:
go get github.com/PuerkitoBio/goquery
Fetch the HTML content: Use Go's standard net/http package to fetch the HTML content you want to scrape.
Parse the HTML content with GoQuery: Once you've got the HTML content, you can load it into GoQuery and start querying the document.
Iterate over the table rows: Find the table and iterate over its rows, extracting the columns (cells) into your structured data format.
Here's a complete example of how to use GoQuery to scrape data from an HTML table:
package main

import (
	"fmt"
	"log"
	"net/http"
	"strings"

	"github.com/PuerkitoBio/goquery"
)

func scrapeTable(url string) ([]map[string]string, error) {
	// Fetch the HTML page.
	res, err := http.Get(url)
	if err != nil {
		return nil, err
	}
	defer res.Body.Close()
	if res.StatusCode != http.StatusOK {
		// res.Status already includes the numeric code, e.g. "404 Not Found".
		return nil, fmt.Errorf("status code error: %s", res.Status)
	}

	// Load the HTML document into GoQuery.
	doc, err := goquery.NewDocumentFromReader(res.Body)
	if err != nil {
		return nil, err
	}

	var data []map[string]string
	// Find the table and iterate over its rows.
	doc.Find("table.my-table-selector tbody tr").Each(func(i int, row *goquery.Selection) {
		// The map must be initialized with make; writing to a nil map panics.
		rowData := make(map[string]string)
		// For each row, find the cells.
		row.Find("td").Each(func(j int, cell *goquery.Selection) {
			// Use the cell's index or a known header to create structured data.
			// For example, if you know the headers of the table:
			text := strings.TrimSpace(cell.Text())
			switch j {
			case 0:
				rowData["Header1"] = text
			case 1:
				rowData["Header2"] = text
				// Add more cases as needed.
			}
		})
		data = append(data, rowData)
	})
	return data, nil
}

func main() {
	url := "http://example.com/table.html"
	tableData, err := scrapeTable(url)
	if err != nil {
		log.Fatal(err)
	}
	// Do something with the scraped table data.
	fmt.Println(tableData)
}
In this example, scrapeTable is a function that takes a URL, fetches the HTML content, parses it with GoQuery, and then iterates over the rows of the table matched by the selector table.my-table-selector. For each row, it creates a map of strings representing the structured data for that row. You'll need to adjust the selector table.my-table-selector to match the table you want to scrape.
Finally, the main function calls scrapeTable with the URL of the page containing the table and prints out the structured data. You would replace "http://example.com/table.html" with the actual URL of the page you want to scrape.
Match the extraction logic to the structure of the table. If the table has a header row, you can use the header text as the keys of each row's map. If not, you can rely on the column index, as shown in the switch statement in the example.