What are the different types of callbacks available in Colly?
Colly, the popular Go web scraping framework, provides a comprehensive set of callbacks that allow developers to handle different stages of the web scraping process. These callbacks enable you to process HTML elements, handle HTTP requests and responses, manage errors, and control the scraping workflow. Understanding these callbacks is essential for building efficient and robust web scrapers.
Core HTML and XML Processing Callbacks
OnHTML Callback
The OnHTML callback is the most commonly used callback in Colly. It's triggered for each HTML element that matches a specified CSS selector.
package main
import (
"fmt"
"github.com/gocolly/colly/v2"
)
func main() {
c := colly.NewCollector()
// Extract all links from a page
c.OnHTML("a[href]", func(e *colly.HTMLElement) {
link := e.Attr("href")
text := e.Text
fmt.Printf("Found link: %s - %s\n", link, text)
// Visit the found link
e.Request.Visit(link)
})
// Extract article titles
c.OnHTML("h1.title", func(e *colly.HTMLElement) {
fmt.Printf("Article title: %s\n", e.Text)
})
c.Visit("https://example.com")
}
OnXML Callback
The OnXML callback is specifically designed for processing XML documents, such as RSS feeds, using XPath selectors.
c.OnXML("//item", func(e *colly.XMLElement) {
title := e.ChildText("title")
link := e.ChildText("link")
description := e.ChildText("description")
fmt.Printf("RSS Item: %s\n", title)
fmt.Printf("Link: %s\n", link)
fmt.Printf("Description: %s\n", description)
})
c.OnXML("//channel/title", func(e *colly.XMLElement) {
fmt.Printf("Feed title: %s\n", e.Text)
})
HTTP Lifecycle Callbacks
OnRequest Callback
The OnRequest callback is executed before making an HTTP request. This is useful for modifying headers, logging requests, or implementing authentication.
c.OnRequest(func(r *colly.Request) {
// Add custom headers
r.Headers.Set("User-Agent", "MyBot 1.0")
r.Headers.Set("Authorization", "Bearer token123")
// Log the request
fmt.Printf("Visiting: %s\n", r.URL.String())
// Modify request based on conditions
if r.URL.Host == "api.example.com" {
r.Headers.Set("Content-Type", "application/json")
}
})
OnResponse Callback
The OnResponse callback is triggered after receiving an HTTP response but before the HTML/XML content is processed.
c.OnResponse(func(r *colly.Response) {
fmt.Printf("Response status: %d\n", r.StatusCode)
fmt.Printf("Response size: %d bytes\n", len(r.Body))
fmt.Printf("Content-Type: %s\n", r.Headers.Get("Content-Type"))
// Process non-HTML content
if r.Headers.Get("Content-Type") == "application/json" {
// Handle JSON response
fmt.Printf("JSON Response: %s\n", string(r.Body))
}
})
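When the response is JSON, you can decode it inside OnResponse with the standard library. Here is a minimal sketch, assuming only that the body is a JSON object; it requires the encoding/json and strings imports:
c.OnResponse(func(r *colly.Response) {
    if strings.Contains(r.Headers.Get("Content-Type"), "application/json") {
        // Decode into a generic map since the payload shape is unknown
        var data map[string]interface{}
        if err := json.Unmarshal(r.Body, &data); err != nil {
            fmt.Printf("Failed to decode JSON from %s: %v\n", r.Request.URL, err)
            return
        }
        fmt.Printf("Decoded JSON object with %d top-level keys\n", len(data))
    }
})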
OnResponseHeaders Callback
This callback is executed when response headers are received, before the response body is downloaded.
c.OnResponseHeaders(func(r *colly.Response) {
    // Requires the "strings" and "strconv" imports
    // Check the content type before downloading the body
    contentType := r.Headers.Get("Content-Type")
    if !strings.Contains(contentType, "text/html") {
        r.Request.Abort()
        return
    }
    // Skip responses larger than ~10 MB based on Content-Length
    contentLength := r.Headers.Get("Content-Length")
    if contentLength != "" {
        if length, err := strconv.Atoi(contentLength); err == nil && length > 10000000 {
            fmt.Printf("Skipping large file: %d bytes\n", length)
            r.Request.Abort()
        }
    }
})
Error Handling Callbacks
OnError Callback
The OnError callback handles errors that occur during scraping, including network errors, HTTP error status codes, and parsing errors.
c.OnError(func(r *colly.Response, err error) {
    fmt.Printf("Error occurred: %s\n", err.Error())
    fmt.Printf("Failed URL: %s\n", r.Request.URL.String())
    fmt.Printf("Status Code: %d\n", r.StatusCode)
    // Retry rate-limited and server-error responses (requires the "time" import);
    // in production, track a retry count (e.g. in r.Request.Ctx) to avoid endless retry loops
    if r.StatusCode == 429 || r.StatusCode >= 500 {
        time.Sleep(5 * time.Second)
        r.Request.Retry()
    }
    // logError is a placeholder for your own logging helper
    logError(r.Request.URL.String(), err)
})
Scraping Lifecycle Callbacks
OnScraped Callback
The OnScraped callback is executed after all OnHTML and OnXML callbacks have finished processing a response.
c.OnScraped(func(r *colly.Response) {
    fmt.Printf("Finished scraping: %s\n", r.Request.URL.String())
    // Typical uses: clean up resources, update a database with
    // completion status, or trigger the next stage of processing
    // markPageAsProcessed is a placeholder for your own bookkeeping helper
    markPageAsProcessed(r.Request.URL.String())
})
Advanced Callback Patterns
Chaining Multiple Callbacks
You can register multiple callbacks of the same type, and they will be executed in the order they were registered.
// First OnHTML callback
c.OnHTML("article", func(e *colly.HTMLElement) {
fmt.Printf("Processing article: %s\n", e.ChildText("h1"))
})
// Second OnHTML callback for the same selector
c.OnHTML("article", func(e *colly.HTMLElement) {
// Extract additional data
author := e.ChildText(".author")
date := e.ChildText(".date")
fmt.Printf("Author: %s, Date: %s\n", author, date)
})
Conditional Callback Execution
You can implement conditional logic within callbacks to handle different scenarios.
c.OnHTML("a[href]", func(e *colly.HTMLElement) {
link := e.Attr("href")
// Only follow internal links
if strings.HasPrefix(link, "/") || strings.Contains(link, e.Request.URL.Host) {
fmt.Printf("Following internal link: %s\n", link)
e.Request.Visit(link)
} else {
fmt.Printf("Skipping external link: %s\n", link)
}
})
Context-Aware Callbacks
Use the request context to pass data between callbacks and maintain state across requests.
c.OnRequest(func(r *colly.Request) {
// Set context data
r.Ctx.Put("start_time", time.Now())
r.Ctx.Put("depth", "1")
})
c.OnHTML("a[href]", func(e *colly.HTMLElement) {
// Access context data
depth := e.Request.Ctx.Get("depth")
if depth == "1" {
// Only follow links on first level
e.Request.Visit(e.Attr("href"))
}
})
c.OnScraped(func(r *colly.Response) {
    // Use the comma-ok form so a missing context key doesn't cause a panic
    if startTime, ok := r.Request.Ctx.GetAny("start_time").(time.Time); ok {
        fmt.Printf("Page processed in: %v\n", time.Since(startTime))
    }
})
Practical Implementation Example
Here's a complete example demonstrating multiple callback types working together:
package main
import (
"fmt"
"log"
"time"
"github.com/gocolly/colly/v2"
"github.com/gocolly/colly/v2/debug"
)
func main() {
c := colly.NewCollector(
    colly.Async(true), // async mode, so Parallelism applies and c.Wait() is meaningful
    colly.Debugger(&debug.LogDebugger{}),
)
// Rate limiting
c.Limit(&colly.LimitRule{
DomainGlob: "*",
Parallelism: 2,
Delay: 1 * time.Second,
})
// Request preprocessing
c.OnRequest(func(r *colly.Request) {
r.Headers.Set("User-Agent", "WebScraper 1.0")
fmt.Printf("Requesting: %s\n", r.URL.String())
})
// Response analysis
c.OnResponse(func(r *colly.Response) {
fmt.Printf("Received %d bytes from %s\n", len(r.Body), r.Request.URL)
})
// Extract and follow links
c.OnHTML("a[href]", func(e *colly.HTMLElement) {
link := e.Attr("href")
if link != "" {
e.Request.Visit(link)
}
})
// Extract data
c.OnHTML("title", func(e *colly.HTMLElement) {
fmt.Printf("Page title: %s\n", e.Text)
})
// Error handling
c.OnError(func(r *colly.Response, err error) {
log.Printf("Error: %s - %s\n", r.Request.URL, err)
})
// Completion tracking
c.OnScraped(func(r *colly.Response) {
fmt.Printf("Finished: %s\n", r.Request.URL)
})
c.Visit("https://example.com")
c.Wait()
}
Best Practices for Callback Usage
Performance Optimization
- Keep callback functions lightweight and avoid heavy computations
- Use goroutines for time-consuming operations within callbacks (a sketch follows this list)
- Implement proper error handling to prevent callback failures from stopping the scraper
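As a minimal sketch of the goroutine hand-off: copy the data you need out of the element before spawning the goroutine, then do the slow work in the background. saveRecord is a hypothetical stand-in for any expensive operation, and the sync import is required:
var wg sync.WaitGroup
c.OnHTML(".product", func(e *colly.HTMLElement) {
    name := e.ChildText(".product-name") // copy data out before the callback returns
    wg.Add(1)
    go func() {
        defer wg.Done()
        saveRecord(name) // hypothetical slow operation, e.g. a database write
    }()
})
c.Visit("https://example.com")
wg.Wait() // wait for background work after the crawl completes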
Memory Management
- Be cautious with storing large amounts of data in callback closures (see the streaming sketch after this list)
- Clean up resources in OnScraped callbacks
- Use context wisely to avoid memory leaks
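For example, rather than accumulating every scraped record in a slice, you can stream results to disk as they arrive. A minimal sketch using the standard library's encoding/csv and os packages:
file, _ := os.Create("results.csv") // error handling omitted for brevity
defer file.Close()
writer := csv.NewWriter(file)
defer writer.Flush()
c.OnHTML(".product", func(e *colly.HTMLElement) {
    // Write each record immediately instead of holding it in memory
    writer.Write([]string{e.ChildText(".product-name"), e.ChildText(".price")})
})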
Error Recovery
- Always implement OnError callbacks for robust scraping
- Use retry mechanisms for transient failures
- Log errors appropriately for debugging and monitoring
Callback Execution Order
Understanding the order in which callbacks are executed is crucial for effective scraping:
- OnRequest - Fired before the HTTP request is made
- OnResponseHeaders - Fired when response headers are received
- OnResponse - Fired when the complete response is received
- OnHTML/OnXML - Fired for each matching element in the response
- OnScraped - Fired after all HTML/XML callbacks have completed
- OnError - Fired when an error occurs at any stage; for a failed request, the remaining callbacks are skipped
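To see this order in practice, you can register one callback of each type that simply announces itself; for a page that returns HTML, the output follows the sequence above:
c := colly.NewCollector()
c.OnRequest(func(r *colly.Request) { fmt.Println("1. OnRequest") })
c.OnResponseHeaders(func(r *colly.Response) { fmt.Println("2. OnResponseHeaders") })
c.OnResponse(func(r *colly.Response) { fmt.Println("3. OnResponse") })
c.OnHTML("html", func(e *colly.HTMLElement) { fmt.Println("4. OnHTML") })
c.OnScraped(func(r *colly.Response) { fmt.Println("5. OnScraped") })
c.OnError(func(r *colly.Response, err error) { fmt.Println("OnError:", err) })
c.Visit("https://example.com")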
Working with Multiple Collectors
You can use different callback sets for different collectors to handle various types of content:
// Create both collectors first so each callback can reference the other collector
mainCollector := colly.NewCollector()
productCollector := colly.NewCollector()
// Collector for main content: follow links to product pages
mainCollector.OnHTML("a.product-link", func(e *colly.HTMLElement) {
    productURL := e.Request.AbsoluteURL(e.Attr("href")) // resolve relative URLs
    productCollector.Visit(productURL)
})
// Specialized collector for product pages
productCollector.OnHTML(".product-details", func(e *colly.HTMLElement) {
    name := e.ChildText(".product-name")
    price := e.ChildText(".price")
    fmt.Printf("Product: %s - %s\n", name, price)
})
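If the product collector needs information from the listing page, one option is to pass a context through Collector.Request instead of Visit. A minimal sketch, assuming the listing URL is the data worth carrying over:
mainCollector.OnHTML("a.product-link", func(e *colly.HTMLElement) {
    ctx := colly.NewContext()
    ctx.Put("listing_url", e.Request.URL.String()) // remember where the link was found
    productCollector.Request("GET", e.Request.AbsoluteURL(e.Attr("href")), nil, ctx, nil)
})
productCollector.OnHTML(".product-details", func(e *colly.HTMLElement) {
    fmt.Printf("Found on listing page: %s\n", e.Request.Ctx.Get("listing_url"))
})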
Conclusion
Colly's callback system provides a powerful and flexible way to handle different aspects of web scraping. By understanding and properly utilizing these callbacks - OnHTML, OnXML, OnRequest, OnResponse, OnResponseHeaders, OnError, and OnScraped - you can build sophisticated scrapers that handle complex scenarios gracefully. The key to effective scraping lies in combining these callbacks strategically while maintaining clean, maintainable code that handles errors appropriately.
When working with complex web applications that require JavaScript execution, you might also want to explore how to handle browser events in Puppeteer or learn about handling authentication in browser-based scraping for scenarios where Colly's static HTML processing isn't sufficient.