To parse HTML in Go, a good starting point is the html package, which lives at golang.org/x/net/html inside the golang.org/x/net module. It is maintained by the Go team but is not part of the standard library. The package provides functions for parsing HTML documents and manipulating the parse tree.
Here is an example of how to parse HTML using the golang.org/x/net/html package:
package main

import (
	"fmt"
	"log"
	"strings"

	"golang.org/x/net/html"
)

func main() {
	// Example HTML data
	rawHTML := `
<!DOCTYPE html>
<html>
<head>
<title>Test Page</title>
</head>
<body>
<h1>Hello, World!</h1>
<p>This is a sample paragraph.</p>
<!-- This is a comment -->
<a href="http://example.com">Visit Example.com</a>
</body>
</html>
`
	// Parse the HTML
	doc, err := html.Parse(strings.NewReader(rawHTML))
	if err != nil {
		log.Fatal(err)
	}

	// Function to recursively traverse the HTML node tree
	var traverse func(*html.Node)
	traverse = func(n *html.Node) {
		if n.Type == html.ElementNode {
			fmt.Println(n.Data) // Print the name of the HTML element
		}
		for c := n.FirstChild; c != nil; c = c.NextSibling {
			traverse(c)
		}
	}

	// Traverse the HTML document
	traverse(doc)
}
Here's how the html package works in the above code:
- The HTML content is parsed by html.Parse, which takes an io.Reader as input. In this case, we're using strings.NewReader to convert a string of raw HTML into a reader.
- The traverse function is a simple recursive function that prints out the name of each HTML element.
- The recursion starts with traverse(doc), where doc is the root of the document tree.
If you're working with HTML from the internet, you might want to fetch the HTML using the net/http package and then parse it:
resp, err := http.Get("http://example.com")
if err != nil {
	log.Fatal(err)
}
defer resp.Body.Close()

if resp.StatusCode != http.StatusOK {
	log.Fatalf("Error: status code %d", resp.StatusCode)
}

doc, err := html.Parse(resp.Body)
if err != nil {
	log.Fatal(err)
}
// Traverse and manipulate `doc` as needed
When working with the parsed HTML, you can use the various fields and types provided by the html package to inspect and manipulate the document. For example, html.Node has fields like Type, Data, Attr, FirstChild, and NextSibling, which can be used to navigate and process the HTML tree.
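As you traverse the tree, remember to check each node's Type field and handle the different node types, such as ElementNode, TextNode, and CommentNode, appropriately. The meaning of the Data field depends on the type: it holds the tag name for an element, and the text content for a text or comment node.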
To install the golang.org/x/net/html package, you can use the following command:
go get -u golang.org/x/net/html
This will fetch the package and its dependencies, allowing you to import it into your Go project.