What is Colly's OnRequest function and how is it used?

Colly is a popular Go package used for web scraping and crawling. It provides a simple and efficient way to scrape web content by making HTTP requests and parsing HTML documents. The OnRequest function in Colly is an event hook that is triggered before a request is sent to the server. You can use this function to modify the request before it's actually made—for instance, by setting headers, cookies, or changing the request URL.

Here's how you might use the OnRequest function:

package main

import (
    "fmt"
    "github.com/gocolly/colly"
)

func main() {
    // Create a new collector
    c := colly.NewCollector()

    // Attach an OnRequest callback function to the collector
    // This callback will be executed before each request is made
    c.OnRequest(func(r *colly.Request) {
        fmt.Println("Visiting", r.URL.String())
        // Here you can set headers, cookies or any other request options
        r.Headers.Set("User-Agent", "my-custom-user-agent")
    })

    // Define what to do when a page is visited
    c.OnHTML("title", func(e *colly.HTMLElement) {
        fmt.Println("Page Title:", e.Text)
    })

    // Start the scraping process
    err := c.Visit("http://httpbin.org/")
    if err != nil {
        fmt.Println("Error during the visit:", err)
    }
}

In the example above, we first create a new collector using colly.NewCollector(). Next, we attach an OnRequest callback to the collector with c.OnRequest(...); it runs before each request is sent. Inside the callback, we print the URL that is about to be visited and set a custom User-Agent header on the request.

Once the OnRequest callback is set up, we define another callback with c.OnHTML(...), which runs for each matching element once a page's HTML has been parsed. Finally, we start the scraping process by calling c.Visit("http://httpbin.org/").

The OnRequest function is a powerful feature of Colly as it allows you to perform operations like:

  • Logging request information for debugging purposes.
  • Setting custom headers such as User-Agent, Referer, Authorization, etc.
  • Adding cookies to the request.
  • Changing the request method (e.g., to POST or PUT).
  • Performing any other modifications to the request before it's sent out.
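Several of these operations can be combined in a single callback. The sketch below extends the earlier example; the Referer, Authorization, and Cookie values are placeholders for illustration, and httpbin.org/headers is just a convenient endpoint that echoes request headers:

```go
package main

import (
	"fmt"

	"github.com/gocolly/colly"
)

func main() {
	c := colly.NewCollector()

	c.OnRequest(func(r *colly.Request) {
		// Log the outgoing request for debugging.
		fmt.Println("Requesting:", r.URL.String())

		// Set custom headers; these values are placeholders.
		r.Headers.Set("Referer", "https://example.com/")
		r.Headers.Set("Authorization", "Bearer placeholder-token")

		// Cookies can be attached via the Cookie header.
		r.Headers.Set("Cookie", "session=placeholder-value")
	})

	if err := c.Visit("http://httpbin.org/headers"); err != nil {
		fmt.Println("Error during the visit:", err)
	}
}
```

Because every header is set inside OnRequest, the same values are applied to each request the collector makes, not just the initial visit.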

Remember that when using OnRequest, you're modifying the request just before it goes out, so any changes you make will affect the response you receive. This can be particularly useful when dealing with sites that require certain headers or cookies to access content.
