How can I handle errors when parsing HTML with GoQuery?

When parsing HTML with GoQuery in Go, you should handle failures at several points to keep your code robust. GoQuery is a library that lets you scrape and manipulate HTML documents in a jQuery-like fashion. Errors can occur when fetching and loading the HTML. Most query and selection methods, such as Find and Text, do not return an error, so at those steps you check for empty selections and missing values instead.

Here's a step-by-step guide on how to handle errors with GoQuery:

1. Import the necessary packages:

package main

import (
    "fmt"
    "log"
    "net/http"

    "github.com/PuerkitoBio/goquery"
)

2. Load the HTML Document:

You can load an HTML document from an HTTP request, from a string, from a file, etc. You will need to handle network errors, status code errors, and errors while creating a GoQuery document.

Here's an example of loading an HTML document from a URL using GoQuery and handling the errors:

func fetchDocument(url string) (*goquery.Document, error) {
    // Make HTTP GET request
    response, err := http.Get(url)
    if err != nil {
        return nil, fmt.Errorf("error fetching URL: %w", err)
    }
    defer response.Body.Close()

    // Check for non-200 status code
    if response.StatusCode != http.StatusOK {
        // response.Status already contains the code, e.g. "404 Not Found"
        return nil, fmt.Errorf("status code error: %s", response.Status)
    }

    // Load the HTML document
    doc, err := goquery.NewDocumentFromReader(response.Body)
    if err != nil {
        return nil, fmt.Errorf("error loading document: %w", err)
    }

    return doc, nil
}

3. Querying the Document:

Selector methods such as Find do not return an error. A selector that matches nothing simply yields an empty selection, so you should check the selection's Length() and the extracted values before relying on them.

Here's an example of querying the document and handling an empty selection:

func parseDocument(doc *goquery.Document) {
    // Find the review items
    reviewSelection := doc.Find(".review-item")
    if reviewSelection.Length() == 0 {
        log.Println("No review items found")
        return
    }

    // Iterate over each review
    reviewSelection.Each(func(index int, item *goquery.Selection) {
        // Extract the review text
        reviewText := item.Find(".review-text").Text()
        if reviewText == "" {
            log.Printf("Review text not found for item %d", index)
            return
        }
        fmt.Println("Review:", reviewText)
    })
}

4. Error Handling in Main Function:

In your main function, you will call the functions defined above and handle any returned errors.

func main() {
    url := "http://example.com"
    doc, err := fetchDocument(url)
    if err != nil {
        log.Fatalf("Failed to fetch document: %v", err)
    }

    parseDocument(doc)
}

Summary:

In this structure, each failure point is checked in turn. You first make the HTTP request and handle network-related errors. Next, you check the response status code and reject non-200 responses. Then, you load the HTML into a GoQuery document and handle any error returned while reading or parsing it. Finally, when querying the document, you check for empty selections, so a missing element is logged and skipped instead of silently producing empty output.

Remember to always consider all possible failure points and handle errors in a way that keeps your application robust and provides meaningful feedback for debugging.
