How do I handle redirects in Colly?

Colly is a popular web scraping framework for Go. Handling redirects in Colly is straightforward, since Colly builds on Go's net/http client and inherits its built-in redirect handling.

By default, Colly follows up to 10 redirects before stopping (the default limit of Go's http.Client). However, you can customize this behavior according to your scraping needs. Here's how you can handle redirects in Colly:

1. Set the Maximum Number of Redirects

You can limit the number of redirects Colly will follow by assigning the Collector's RedirectHandler, a callback with the same signature as http.Client's CheckRedirect (colly v2 assigns it via c.SetRedirectHandler). Returning an error from the callback aborts the redirect chain:

package main

import (
    "fmt"
    "net/http"

    "github.com/gocolly/colly"
)

func main() {
    c := colly.NewCollector()

    // Stop following redirects once the chain reaches 5 hops.
    // via holds the requests already made in the current chain.
    c.RedirectHandler = func(req *http.Request, via []*http.Request) error {
        if len(via) >= 5 {
            return fmt.Errorf("too many redirects (%d)", len(via))
        }
        return nil
    }

    c.OnRequest(func(r *colly.Request) {
        fmt.Println("Visiting", r.URL)
    })

    c.OnResponse(func(r *colly.Response) {
        fmt.Println("Visited", r.Request.URL)
    })

    c.OnError(func(r *colly.Response, err error) {
        fmt.Println("Request URL:", r.Request.URL, "failed with response:", r, "\nError:", err)
    })

    // Start scraping
    c.Visit("http://httpbin.org/redirect/1")
}

2. Disable Redirect Handling

If you do not want Colly to follow redirects at all, you can disable redirect following by assigning a redirect handler that returns http.ErrUseLastResponse:

package main

import (
    "fmt"
    "net/http"

    "github.com/gocolly/colly"
)

func main() {
    c := colly.NewCollector(
        // Hand non-2xx responses (including the 3xx we now keep) to
        // OnResponse instead of treating them as errors.
        colly.ParseHTTPErrorResponse(),
    )

    // Returning http.ErrUseLastResponse tells the underlying HTTP
    // client to return the redirect response itself rather than
    // follow it.
    c.RedirectHandler = func(req *http.Request, via []*http.Request) error {
        return http.ErrUseLastResponse
    }

    c.OnRequest(func(r *colly.Request) {
        fmt.Println("Visiting", r.URL)
    })

    c.OnResponse(func(r *colly.Response) {
        fmt.Println("Visited", r.Request.URL)
    })

    // Start scraping
    c.Visit("http://httpbin.org/redirect/1")
}

In the code above, the redirect callback always returns http.ErrUseLastResponse, which tells the underlying HTTP client to stop and hand back the most recent response, i.e. the 3xx itself, with no error. Colly therefore makes the initial request and then stops, regardless of whether the server sends a redirect response. Because Colly normally routes non-2xx status codes to OnError, create the collector with colly.ParseHTTPErrorResponse() if you want the redirect response delivered to OnResponse.

3. Handling Redirects Manually

You can also handle redirects manually. Disable automatic following, then check the response status code in OnResponse: if it is in the 3xx range, resolve the Location header and issue a new request yourself:

package main

import (
    "fmt"
    "net/http"

    "github.com/gocolly/colly"
)

func main() {
    // Redirect responses only reach OnResponse if automatic following
    // is disabled and non-2xx responses are parsed rather than treated
    // as errors.
    c := colly.NewCollector(
        colly.ParseHTTPErrorResponse(),
    )
    c.RedirectHandler = func(req *http.Request, via []*http.Request) error {
        return http.ErrUseLastResponse
    }

    c.OnResponse(func(r *colly.Response) {
        if r.StatusCode >= 300 && r.StatusCode < 400 {
            // Get the redirect location from the headers
            location := r.Headers.Get("Location")
            fmt.Printf("Redirecting to: %s\n", location)
            // Visit the location
            err := c.Visit(r.Request.AbsoluteURL(location))
            if err != nil {
                fmt.Printf("Redirect failed: %s\n", err.Error())
            }
        } else {
            fmt.Println("Visited", r.Request.URL)
        }
    })

    // Start scraping
    c.Visit("http://httpbin.org/redirect/1")
}

Remember that when you manually handle redirects, you are responsible for preventing infinite redirect loops. You should keep track of the URLs visited and the number of redirects followed to avoid such issues.

By adjusting Colly's redirect handling settings, you can ensure that your web scraper behaves exactly as needed when encountering redirects during the scraping process.
