Colly is a popular scraping framework for Go, designed to provide an easy interface for writing scrapers. When using Colly, understanding and effectively using callback functions is key to controlling the scraping process and extracting the data you need.
Colly provides a variety of callbacks that you can set to handle different events that occur during the scraping process. Here are some of the most commonly used callback functions:
- OnHTML: Triggered when a specified HTML element is found.
- OnRequest: Triggered before a request is made.
- OnResponse: Triggered after a request has been made and a response is received.
- OnError: Triggered when an error occurs during the request.
- OnScraped: Triggered after all OnHTML callbacks have been executed for a response.
Using Callback Functions
Here's how to use these callback functions effectively:
OnHTML
OnHTML is used to extract data from HTML elements that match a given selector. Inside the callback, the *colly.HTMLElement object lets you query and manipulate the matched element.
c := colly.NewCollector()
c.OnHTML("a[href]", func(e *colly.HTMLElement) {
    link := e.Attr("href")
    fmt.Printf("Link found: %q -> %s\n", e.Text, link)
    e.Request.Visit(link)
})
c.Visit("http://example.com")
OnRequest
OnRequest allows you to modify requests before they are sent. For instance, you can set headers or cookies, or even change the request URL.
c := colly.NewCollector()
c.OnRequest(func(r *colly.Request) {
    r.Headers.Set("User-Agent", "my-custom-user-agent")
})
c.Visit("http://example.com")
OnResponse
OnResponse is useful for handling raw responses. You can use it to save binary data like images or to perform operations on the raw response body.
c := colly.NewCollector()
c.OnResponse(func(r *colly.Response) {
    fmt.Println("Visited", r.Request.URL)
    if r.Headers.Get("Content-Type") == "image/jpeg" {
        // Save the image; log the error instead of calling log.Fatal,
        // so a single bad response doesn't kill the whole scraper
        if err := r.Save(fmt.Sprintf("%s.jpg", r.FileName())); err != nil {
            log.Println(err)
        }
    }
})
c.Visit("http://example.com")
OnError
OnError is used to handle errors that occur during the scraping process. It is particularly useful for logging and debugging.
c := colly.NewCollector()
c.OnError(func(r *colly.Response, err error) {
    fmt.Println("Request URL:", r.Request.URL, "failed with response:", r, "\nError:", err)
})
c.Visit("http://example.com")
OnScraped
OnScraped is called after all OnHTML callbacks have been executed for a response. This is a good place to do post-processing or cleanup.
c := colly.NewCollector()
c.OnScraped(func(r *colly.Response) {
    fmt.Println("Finished scraping", r.Request.URL)
})
c.Visit("http://example.com")
Best Practices
- Reusability: Write modular callback functions that can be reused across different collectors or different runs.
- Error Handling: Always handle possible errors in your callbacks to prevent your scraper from crashing unexpectedly.
- Rate Limiting: Use Colly's built-in rate limiting features to avoid overwhelming the server you're scraping from.
- Concurrency: Utilize Colly's concurrency features to scrape more efficiently, but do so responsibly to avoid getting blocked.
- Selectors: When using OnHTML, make sure to use efficient and precise selectors to avoid unnecessary processing.
By effectively using Colly's callbacks, you can create powerful and efficient web scrapers that are tailored to your specific scraping tasks.