How do I use Colly's callback functions effectively?

Colly is a popular web scraping framework for Go that provides a clean, callback-driven interface for writing scrapers. Understanding how to use its callbacks effectively is key to controlling the scraping process and extracting the data you need.

Colly provides a variety of callbacks that you can set to handle different events that occur during the scraping process. Here are some of the most commonly used callback functions:

  1. OnHTML: Triggered when a specified HTML element is found.
  2. OnRequest: Triggered before a request is made.
  3. OnResponse: Triggered after a request has been made and a response is received.
  4. OnError: Triggered when an error occurs during the request.
  5. OnScraped: Triggered after OnHTML callbacks are executed for a response.
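
For a single request, these callbacks fire in a fixed order: OnRequest before the fetch, then OnResponse (or OnError if the request fails), then OnHTML for each matching element, and finally OnScraped. A minimal sketch that registers all of them makes the sequence visible (package declaration and imports omitted):

c := colly.NewCollector()

c.OnRequest(func(r *colly.Request) { fmt.Println("1. request:", r.URL) })
c.OnError(func(r *colly.Response, err error) { fmt.Println("request failed:", err) })
c.OnResponse(func(r *colly.Response) { fmt.Println("2. response:", r.StatusCode) })
c.OnHTML("title", func(e *colly.HTMLElement) { fmt.Println("3. title:", e.Text) })
c.OnScraped(func(r *colly.Response) { fmt.Println("4. scraped:", r.Request.URL) })

c.Visit("http://example.com")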

Using Callback Functions

Here's how to use these callback functions effectively:

OnHTML

OnHTML is used to extract data from HTML elements that match a given selector. Inside the callback you work with a *colly.HTMLElement, which lets you query the element's attributes, text, and children.

package main

import (
    "fmt"
    "github.com/gocolly/colly/v2"
)

func main() {
    c := colly.NewCollector()

    c.OnHTML("a[href]", func(e *colly.HTMLElement) {
        link := e.Attr("href")
        fmt.Printf("Link found: %q -> %s\n", e.Text, link)
        e.Request.Visit(link) // follow the link; relative URLs are resolved
    })

    c.Visit("http://example.com")
}

(The import path assumes Colly v2; the remaining snippets omit the package declaration and imports for brevity.)
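
Beyond Attr and e.Text, *colly.HTMLElement offers helpers such as ChildText, ChildAttr, and ForEach for pulling structured data out of a matched element. A hedged sketch, assuming a page that lists products as div.product blocks (the selectors are made up for illustration):

c.OnHTML("div.product", func(e *colly.HTMLElement) {
    name := e.ChildText("h2")          // text of the first matching child
    price := e.ChildText(".price")
    image := e.ChildAttr("img", "src") // attribute of the first matching child
    fmt.Printf("%s costs %s (image: %s)\n", name, price, image)
})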

OnRequest

OnRequest allows you to modify requests before they are sent. For instance, you can set headers or cookies before the request goes out, or cancel it entirely, as shown in the sketches below.

c := colly.NewCollector()

c.OnRequest(func(r *colly.Request) {
    r.Headers.Set("User-Agent", "my-custom-user-agent")
})

c.Visit("http://example.com")
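
Cancelling happens via r.Abort(); an aborted request fires no response callbacks at all. A short sketch that skips any URL containing "logout" (the filter rule is just an example, and the strings package is assumed imported):

c.OnRequest(func(r *colly.Request) {
    if strings.Contains(r.URL.String(), "logout") {
        r.Abort() // cancel the request entirely
    }
})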

OnResponse

OnResponse is useful for handling raw responses. You can use it to save binary data like images or to perform operations on the raw response body.

c := colly.NewCollector()

c.OnResponse(func(r *colly.Response) {
    fmt.Println("Visited", r.Request.URL)
    // HasPrefix tolerates parameters like "; charset=..." in the header.
    if strings.HasPrefix(r.Headers.Get("Content-Type"), "image/jpeg") {
        // FileName derives a name from the Content-Disposition header or the URL.
        if err := r.Save(r.FileName()); err != nil {
            log.Println("could not save image:", err) // log it; don't kill the whole scrape
        }
    }
})

c.Visit("http://example.com")
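
OnHTML callbacks only fire for HTML responses, so OnResponse is also the natural place to decode JSON APIs. A minimal sketch, assuming the endpoint returns an object with a "title" field (a hypothetical payload shape; encoding/json, strings, and log are assumed imported):

c.OnResponse(func(r *colly.Response) {
    if strings.HasPrefix(r.Headers.Get("Content-Type"), "application/json") {
        var data struct {
            Title string `json:"title"`
        }
        if err := json.Unmarshal(r.Body, &data); err != nil {
            log.Println("JSON decode failed:", err)
            return
        }
        fmt.Println("Title:", data.Title)
    }
})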

OnError

OnError is used to handle errors that occur during the scraping process. It is particularly useful for logging and debugging.

c := colly.NewCollector()

c.OnError(func(r *colly.Response, err error) {
    log.Printf("request to %s failed with status %d: %v", r.Request.URL, r.StatusCode, err)
})

c.Visit("http://example.com")
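
A common pattern in OnError is retrying transient failures with r.Request.Retry(), which re-issues the same request and keeps its context. A hedged sketch; the three-attempt cap and the "retries" context key are illustrative:

c.OnError(func(r *colly.Response, err error) {
    retries, _ := r.Ctx.GetAny("retries").(int)
    if retries < 3 {
        r.Ctx.Put("retries", retries+1)
        r.Request.Retry() // re-queue the same request
        return
    }
    log.Println("giving up on", r.Request.URL, "after 3 attempts:", err)
})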

OnScraped

OnScraped is called after all OnHTML callbacks have been executed. This is a good place to do post-processing or cleanup.

c := colly.NewCollector()

c.OnScraped(func(r *colly.Response) {
    fmt.Println("Finished scraping", r.Request.URL)
})

c.Visit("http://example.com")

Best Practices

  • Reusability: Write modular callback functions that can be reused across different collectors or runs.
  • Error Handling: Always handle possible errors in your callbacks so a single bad page doesn't crash the scraper.
  • Rate Limiting: Use Colly's built-in limit rules to avoid overwhelming the server you're scraping (see the sketch after this list).
  • Concurrency: Use Colly's async mode to scrape more efficiently, but do so responsibly to avoid getting blocked (also shown below).
  • Selectors: In OnHTML, use precise, efficient selectors so callbacks fire only on the elements you actually need.
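
The rate-limiting and concurrency advice maps to a few concrete calls: colly.Async enables parallel fetching, c.Limit installs per-domain delay and parallelism rules, and c.Wait blocks until all queued requests finish. A brief sketch (the delay and parallelism values are arbitrary; the time package is assumed imported):

c := colly.NewCollector(colly.Async(true))

// At most two parallel requests per domain, roughly one second apart.
c.Limit(&colly.LimitRule{
    DomainGlob:  "*",
    Parallelism: 2,
    Delay:       1 * time.Second,
})

c.Visit("http://example.com")
c.Wait() // block until all asynchronous requests are done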

By effectively using Colly's callbacks, you can create powerful and efficient web scrapers that are tailored to your specific scraping tasks.
