In Go, you can use selectors to extract information from HTML documents with libraries such as goquery, which is inspired by jQuery and provides a set of functions to navigate and manipulate HTML documents.
Below is a step-by-step guide to using selectors with goquery to extract information:
Step 1: Install goquery
First, you need to install the goquery package. You can do this by running the following command in your terminal, from inside a Go module (create one with go mod init if you do not have one yet):
go get github.com/PuerkitoBio/goquery
Step 2: Import goquery in Your Go Code
In your Go code, import the goquery package like this:
import (
	"log"
	"net/http"

	"github.com/PuerkitoBio/goquery"
)
Step 3: Fetch the HTML Document
Before you can use selectors, you need to fetch the HTML document from the web or read it from a local file. Here's how to get it from a web page using the net/http package:
resp, err := http.Get("https://example.com")
if err != nil {
	log.Fatal(err)
}
defer resp.Body.Close()
if resp.StatusCode != http.StatusOK {
	log.Fatalf("Error fetching: %s", resp.Status)
}
Step 4: Load the Document with goquery
Once you have the HTML content, you can load it into a goquery document:
doc, err := goquery.NewDocumentFromReader(resp.Body)
if err != nil {
	log.Fatal(err)
}
Step 5: Use Selectors to Extract Information
Now you can use CSS selectors to extract information from the document. goquery allows you to select elements, manipulate them, and extract their content or attributes:
// Find all links and print their HREFs
doc.Find("a").Each(func(index int, item *goquery.Selection) {
	href, exists := item.Attr("href")
	if exists {
		log.Println(href)
	}
})

// Extract the text content of a specific element
title := doc.Find("title").Text()
log.Println("Page title is:", title)

// Extract information from a table
doc.Find("table tr").Each(func(index int, row *goquery.Selection) {
	row.Find("td").Each(func(indexTd int, cell *goquery.Selection) {
		log.Println(cell.Text())
	})
})
In the examples above, Find is used to select HTML elements, Each to iterate over them, Attr to get an attribute value, and Text to get the text content.
Remember to handle errors appropriately in your production code, and respect the terms of service or robots.txt of the websites you scrape.
Step 6: Compile and Run Your Go Program
Save your Go code in a .go file. The file must start with package main, and the statements from Steps 3 to 5 must sit inside func main. Then compile and run it with:
go run yourfile.go
Note: Web scraping can be subject to legal and ethical considerations. Always ensure that you have the right to scrape the website and that you comply with its terms of service. Use proper rate limiting and user-agent headers to avoid causing harm to the website's service.