What is the best way to parse HTML in Go?

To parse HTML in Go, a good starting point is the html package, which is part of the golang.org/x/net module (import path golang.org/x/net/html). It implements an HTML5-compliant tokenizer and parser, and provides functions for parsing HTML documents and working with the resulting node tree.

Here is an example of how to parse HTML using the golang.org/x/net/html package:

package main

import (
    "fmt"
    "log"
    "strings"

    "golang.org/x/net/html"
)

func main() {
    // Example HTML data
    rawHTML := `
<!DOCTYPE html>
<html>
<head>
    <title>Test Page</title>
</head>
<body>
    <h1>Hello, World!</h1>
    <p>This is a sample paragraph.</p>
    <!-- This is a comment -->
    <a href="http://example.com">Visit Example.com</a>
</body>
</html>
`
    // Parse the HTML
    doc, err := html.Parse(strings.NewReader(rawHTML))
    if err != nil {
        log.Fatal(err)
    }

    // Function to recursively traverse the HTML node tree
    var traverse func(*html.Node)
    traverse = func(n *html.Node) {
        if n.Type == html.ElementNode {
            fmt.Println(n.Data) // Print the name of the HTML element
        }
        for c := n.FirstChild; c != nil; c = c.NextSibling {
            traverse(c)
        }
    }

    // Traverse the HTML document
    traverse(doc)
}

Here's how the html package works in the above code:

  1. The HTML content is parsed by html.Parse, which takes an io.Reader as input and returns the root of the parse tree. In this case, strings.NewReader wraps the raw HTML string in a reader.
  2. The traverse function is a simple recursive function that prints the tag name of each element node and then visits that node's children.
  3. The recursion starts with traverse(doc), where doc is the root of the document tree (a node of type html.DocumentNode).

If you're working with HTML from the internet, you might want to fetch the HTML using the net/http package and then parse it:

resp, err := http.Get("http://example.com")
if err != nil {
    log.Fatal(err)
}
defer resp.Body.Close()
if resp.StatusCode != http.StatusOK {
    log.Fatalf("Error: status code %d", resp.StatusCode)
}
doc, err := html.Parse(resp.Body)
if err != nil {
    log.Fatal(err)
}

// Traverse and manipulate `doc` as needed

When working with the parsed HTML, you can use the various fields and types provided by the html package to inspect and manipulate the document. For example, html.Node has fields like Type, Data, Attr, FirstChild, and NextSibling which can be used to navigate and process the HTML tree.

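As you traverse the tree, check each node's Type to tell the node kinds apart (ElementNode, TextNode, CommentNode, DoctypeNode, and so on), because the meaning of Data depends on it: for an element node it is the tag name, while for a text or comment node it is the content.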

To install the golang.org/x/net/html package, you can use the following command:

go get -u golang.org/x/net/html

Run inside a Go module, this downloads the golang.org/x/net module and records it in your go.mod file, so you can import the package in your project.
