GoQuery is a package for the Go programming language that lets you parse and traverse HTML documents in a manner similar to jQuery. It is primarily used for extracting data from HTML, which makes it a popular tool for web scraping tasks. However, GoQuery does not make HTTP requests itself; it only deals with parsing and manipulating HTML. For HTTP requests, including redirects, you would typically use Go's standard net/http package.
When you use the net/http package with its default settings, the HTTP client follows redirects automatically and stops with an error after 10 consecutive requests. If you need to handle redirects differently, you can set the CheckRedirect function on an http.Client.
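As a minimal sketch of what a custom CheckRedirect can do, the client below refuses to follow any redirect and reports the first response's status code and Location header instead. The local test server and the helper names newRedirectServer and firstHop are assumptions made for illustration, so the example runs without network access.

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"net/http/httptest"
)

// newRedirectServer starts a local test server (hypothetical, for
// illustration only) where /old redirects to /new.
func newRedirectServer() *httptest.Server {
	mux := http.NewServeMux()
	mux.HandleFunc("/old", func(w http.ResponseWriter, r *http.Request) {
		http.Redirect(w, r, "/new", http.StatusFound)
	})
	mux.HandleFunc("/new", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "<html><body><a href=\"/a\">A</a></body></html>")
	})
	return httptest.NewServer(mux)
}

// firstHop requests url without following redirects and returns the
// status code and Location header of the first response.
func firstHop(url string) (int, string, error) {
	client := &http.Client{
		CheckRedirect: func(req *http.Request, via []*http.Request) error {
			// ErrUseLastResponse tells the client to return the most
			// recent response as-is instead of following the redirect.
			return http.ErrUseLastResponse
		},
	}
	resp, err := client.Get(url)
	if err != nil {
		return 0, "", err
	}
	defer resp.Body.Close()
	return resp.StatusCode, resp.Header.Get("Location"), nil
}

func main() {
	srv := newRedirectServer()
	defer srv.Close()

	status, location, err := firstHop(srv.URL + "/old")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("status=%d location=%s\n", status, location)
}
```

This can be handy when a scraper needs to record where a URL points before deciding whether to fetch the target.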
Here's an example of how to handle redirects when scraping with GoQuery:
package main

import (
	"fmt"
	"log"
	"net/http"

	"github.com/PuerkitoBio/goquery"
)

func main() {
	// Create a custom HTTP client with a CheckRedirect function
	client := &http.Client{
		CheckRedirect: func(req *http.Request, via []*http.Request) error {
			// This function gets called before a redirect is followed.
			fmt.Printf("Redirecting from %s to %s\n", via[len(via)-1].URL, req.URL)
			// Return nil to allow the redirect, or an error to stop it.
			return nil
		},
	}

	// Use the custom client to perform an HTTP GET request
	resp, err := client.Get("http://example.com")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// Check the status code to ensure we got a proper response
	if resp.StatusCode != http.StatusOK {
		log.Fatalf("Status error: %v", resp.StatusCode)
	}

	// Load the HTML document from the response body
	doc, err := goquery.NewDocumentFromReader(resp.Body)
	if err != nil {
		log.Fatal(err)
	}

	// Use GoQuery to find elements and extract data as needed
	doc.Find("a").Each(func(index int, item *goquery.Selection) {
		href, exists := item.Attr("href")
		if exists {
			fmt.Printf("Link #%d: %s\n", index, href)
		}
	})
}
In this example:
- We create an http.Client with a custom CheckRedirect function that prints the URLs involved in the redirect and allows the redirect by returning nil.
- We then use this client to send a GET request to the specified URL.
- After making sure we received a successful status code, we parse the HTML body with goquery.NewDocumentFromReader.
- Finally, we traverse the document with GoQuery's jQuery-like methods.
If you need to handle redirects in a more specific way, such as logging them, counting them, or stopping after a certain number of redirects, you can customize the CheckRedirect function accordingly. For instance, you could keep track of the number of redirects and stop once it reaches a certain threshold. Here's a simple modification to the above example:
CheckRedirect: func(req *http.Request, via []*http.Request) error {
	if len(via) >= 10 {
		return http.ErrUseLastResponse
	}
	fmt.Printf("Redirecting from %s to %s\n", via[len(via)-1].URL, req.URL)
	return nil
},
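In this modified function, we return http.ErrUseLastResponse once via already holds 10 requests. This makes the client stop following redirects and return the most recent response, with its body still open and a nil error. That response will normally carry a 3xx status, so the http.StatusOK check in the full example would report it as a status error.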