Colly is a popular web scraping framework for Go. Handling redirects in Colly is straightforward, since it builds on Go's `net/http` client and exposes hooks for customizing redirect behavior.

By default, Colly follows up to 10 redirects before stopping (the standard `net/http` client limit). You can customize this behavior according to your scraping needs. Here's how to handle redirects in Colly:
1. Set the Maximum Number of Redirects

Colly does not expose a `MaxRedirects` option. Instead, you register a redirect handler on the `Collector` with `SetRedirectHandler` (available in colly v2). The handler is consulted before each redirect and receives the upcoming request plus the chain of requests made so far (`via`), so you can return an error once the chain gets too long:

```go
package main

import (
    "errors"
    "fmt"
    "net/http"

    "github.com/gocolly/colly/v2"
)

func main() {
    c := colly.NewCollector()

    // Follow at most 5 redirects per request.
    c.SetRedirectHandler(func(req *http.Request, via []*http.Request) error {
        if len(via) >= 5 {
            return errors.New("too many redirects")
        }
        return nil
    })

    c.OnRequest(func(r *colly.Request) {
        fmt.Println("Visiting", r.URL)
    })

    c.OnResponse(func(r *colly.Response) {
        fmt.Println("Visited", r.Request.URL)
    })

    c.OnError(func(r *colly.Response, err error) {
        fmt.Println("Request URL:", r.Request.URL, "failed with error:", err)
    })

    // Start scraping
    c.Visit("http://httpbin.org/redirect/1")
}
```
2. Disable Redirect Handling
If you do not want Colly to follow redirects at all, you can disable redirect handling by setting the CheckRedirect
function on the HTTP Client used by Colly's Collector:
package main
import (
"fmt"
"github.com/gocolly/colly"
"net/http"
)
func main() {
c := colly.NewCollector()
// Set the CheckRedirect function to the http.Client
c.WithTransport(&http.Transport{
Proxy: http.ProxyFromEnvironment,
})
c.CheckRedirect = func(req *http.Request, via []*http.Request) error {
// Returning an error prevents the redirect
return http.ErrUseLastResponse
}
c.OnRequest(func(r *colly.Request) {
fmt.Println("Visiting", r.URL)
})
c.OnResponse(func(r *colly.Response) {
fmt.Println("Visited", r.Request.URL)
})
// Start scraping
c.Visit("http://httpbin.org/redirect/1")
}
In the code above, the CheckRedirect
function is set to always return http.ErrUseLastResponse
, which effectively stops the redirect handling. This means Colly will make the initial request and then stop, regardless of whether the server sends a redirect response.
3. Handling Redirects Manually

You can also handle redirects yourself. First disable automatic following (as in the previous section) so that 3xx responses actually reach your callbacks; then check the status code and, if it's in the 3xx range, issue a new request to the URL in the `Location` header:

```go
package main

import (
    "fmt"
    "net/http"

    "github.com/gocolly/colly/v2"
)

func main() {
    c := colly.NewCollector()

    // Stop automatic following so OnResponse sees 3xx responses.
    c.SetRedirectHandler(func(req *http.Request, via []*http.Request) error {
        return http.ErrUseLastResponse
    })

    c.OnResponse(func(r *colly.Response) {
        if r.StatusCode >= 300 && r.StatusCode < 400 {
            // Get the redirect target from the headers.
            location := r.Headers.Get("Location")
            fmt.Printf("Redirecting to: %s\n", location)

            // Resolve the (possibly relative) Location and visit it.
            if err := c.Visit(r.Request.AbsoluteURL(location)); err != nil {
                fmt.Printf("Redirect failed: %s\n", err)
            }
        } else {
            fmt.Println("Visited", r.Request.URL)
        }
    })

    // Start scraping
    c.Visit("http://httpbin.org/redirect/1")
}
```
Remember that when you manually handle redirects, you are responsible for preventing infinite redirect loops. You should keep track of the URLs visited and the number of redirects followed to avoid such issues.
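One way to sketch that bookkeeping (the `redirectTracker` type below is an illustrative helper, not part of Colly) is a small guard that refuses a hop once a URL repeats or a hop limit is reached:

```go
package main

import "fmt"

// redirectTracker guards a manual redirect chain against loops:
// it refuses a hop if the URL was already seen or the limit is reached.
type redirectTracker struct {
    seen  map[string]bool
    hops  int
    limit int
}

func newRedirectTracker(limit int) *redirectTracker {
    return &redirectTracker{seen: map[string]bool{}, limit: limit}
}

// allow reports whether it is safe to follow a redirect to url,
// and records the hop when it is.
func (t *redirectTracker) allow(url string) bool {
    if t.hops >= t.limit || t.seen[url] {
        return false
    }
    t.seen[url] = true
    t.hops++
    return true
}

func main() {
    t := newRedirectTracker(5)
    fmt.Println(t.allow("http://example.com/a")) // true: first visit
    fmt.Println(t.allow("http://example.com/b")) // true: new URL, under limit
    fmt.Println(t.allow("http://example.com/a")) // false: loop detected
}
```

Before calling `c.Visit` on a `Location` URL in your `OnResponse` callback, you would consult a tracker like this and bail out when `allow` returns false.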
By adjusting Colly's redirect handling settings, you can ensure that your web scraper behaves exactly as needed when encountering redirects during the scraping process.