Colly is a popular Go package for web scraping and crawling. It provides a simple and efficient way to scrape web content by making HTTP requests and parsing HTML documents. The `OnRequest` method registers a callback (an event hook) that Colly runs before each request is sent to the server. You can use this callback to modify the request before it's actually made, for instance by setting headers or cookies, or by changing the request URL.
Here's how you might use `OnRequest`:
```go
package main

import (
	"fmt"

	"github.com/gocolly/colly"
)

func main() {
	// Create a new collector
	c := colly.NewCollector()

	// Attach an OnRequest callback function to the collector
	// This callback will be executed before each request is made
	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Visiting", r.URL.String())
		// Here you can set headers, cookies or any other request options
		r.Headers.Set("User-Agent", "my-custom-user-agent")
	})

	// Define what to do when a page is visited
	c.OnHTML("title", func(e *colly.HTMLElement) {
		fmt.Println("Page Title:", e.Text)
	})

	// Start the scraping process
	err := c.Visit("http://httpbin.org/")
	if err != nil {
		fmt.Println("Error during the visit:", err)
	}
}
```
In the example above, we first create a new collector using `colly.NewCollector()`. Next, we attach an `OnRequest` callback to the collector using `c.OnRequest(...)`. This callback is executed before each request is made. Inside the callback, we simply print the URL that is about to be visited and set a custom `User-Agent` header for the request.
Once the `OnRequest` callback is set up, we register a second callback with `c.OnHTML(...)`, which runs for every `title` element found in the HTML of a fetched page. Finally, we start the scraping process by calling `c.Visit("http://httpbin.org/")`.
`OnRequest` is a powerful feature of Colly because it lets you perform operations such as:
- Logging request information for debugging purposes.
- Setting custom headers such as `User-Agent`, `Referer`, `Authorization`, etc.
- Adding cookies to the request.
- Inspecting the request method (e.g., `GET`, `POST` or `PUT`). The method itself is normally determined by how the request is started (for example, `c.Visit` issues a `GET` and `c.Post` issues a `POST`) rather than changed inside the callback.
- Performing any other modifications to the request before it's sent out.
Remember that `OnRequest` callbacks run just before the request goes out, so the changes you make there determine what the server receives and can therefore change the response you get back. This is particularly useful when dealing with sites that require certain headers or cookies before they serve content.