What are the different types of callbacks available in Colly?
Colly, the popular Go web scraping framework, provides a comprehensive set of callbacks that allow developers to handle different stages of the web scraping process. These callbacks enable you to process HTML elements, handle HTTP requests and responses, manage errors, and control the scraping workflow. Understanding these callbacks is essential for building efficient and robust web scrapers.
Core HTML and XML Processing Callbacks
OnHTML Callback
The OnHTML callback is the most commonly used callback in Colly. It's triggered for each HTML element that matches a specified CSS selector.
package main
import (
"fmt"
"github.com/gocolly/colly/v2"
)
func main() {
c := colly.NewCollector()
// Extract all links from a page
c.OnHTML("a[href]", func(e *colly.HTMLElement) {
link := e.Attr("href")
text := e.Text
fmt.Printf("Found link: %s - %s\n", link, text)
// Visit the found link
e.Request.Visit(link)
})
// Extract article titles
c.OnHTML("h1.title", func(e *colly.HTMLElement) {
fmt.Printf("Article title: %s\n", e.Text)
})
c.Visit("https://example.com")
}
OnXML Callback
The OnXML callback is specifically designed for processing XML documents, such as RSS feeds, using XPath selectors.
c.OnXML("//item", func(e *colly.XMLElement) {
title := e.ChildText("title")
link := e.ChildText("link")
description := e.ChildText("description")
fmt.Printf("RSS Item: %s\n", title)
fmt.Printf("Link: %s\n", link)
fmt.Printf("Description: %s\n", description)
})
c.OnXML("//channel/title", func(e *colly.XMLElement) {
fmt.Printf("Feed title: %s\n", e.Text)
})
HTTP Lifecycle Callbacks
OnRequest Callback
The OnRequest callback is executed before making an HTTP request. This is useful for modifying headers, logging requests, or implementing authentication.
c.OnRequest(func(r *colly.Request) {
// Add custom headers
r.Headers.Set("User-Agent", "MyBot 1.0")
r.Headers.Set("Authorization", "Bearer token123")
// Log the request
fmt.Printf("Visiting: %s\n", r.URL.String())
// Modify request based on conditions
if r.URL.Host == "api.example.com" {
r.Headers.Set("Content-Type", "application/json")
}
})
OnResponse Callback
The OnResponse callback is triggered after receiving an HTTP response but before the HTML/XML content is processed.
c.OnResponse(func(r *colly.Response) {
fmt.Printf("Response status: %d\n", r.StatusCode)
fmt.Printf("Response size: %d bytes\n", len(r.Body))
fmt.Printf("Content-Type: %s\n", r.Headers.Get("Content-Type"))
// Process non-HTML content
if r.Headers.Get("Content-Type") == "application/json" {
// Handle JSON response
fmt.Printf("JSON Response: %s\n", string(r.Body))
}
})
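When the response is JSON, you can decode it inside OnResponse with the standard library. Here is a minimal sketch, assuming only that the body is a JSON object; it requires the encoding/json and strings imports:
c.OnResponse(func(r *colly.Response) {
    if strings.Contains(r.Headers.Get("Content-Type"), "application/json") {
        // Decode into a generic map since the payload shape is unknown
        var data map[string]interface{}
        if err := json.Unmarshal(r.Body, &data); err != nil {
            fmt.Printf("Failed to decode JSON from %s: %v\n", r.Request.URL, err)
            return
        }
        fmt.Printf("Decoded JSON object with %d top-level keys\n", len(data))
    }
})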
OnResponseHeaders Callback
This callback is executed when response headers are received, before the response body is downloaded.
c.OnResponseHeaders(func(r *colly.Response) {
    // Requires the "strings" and "strconv" imports
    // Check the content type before downloading the body
    contentType := r.Headers.Get("Content-Type")
    if !strings.Contains(contentType, "text/html") {
        r.Request.Abort()
        return
    }
    // Skip responses larger than ~10 MB based on Content-Length
    contentLength := r.Headers.Get("Content-Length")
    if contentLength != "" {
        if length, err := strconv.Atoi(contentLength); err == nil && length > 10000000 {
            fmt.Printf("Skipping large file: %d bytes\n", length)
            r.Request.Abort()
        }
    }
})
Error Handling Callbacks
OnError Callback
The OnError callback handles errors that occur during scraping, including network errors, HTTP error status codes, and parsing errors.
c.OnError(func(r *colly.Response, err error) {
    fmt.Printf("Error occurred: %s\n", err.Error())
    fmt.Printf("Failed URL: %s\n", r.Request.URL.String())
    fmt.Printf("Status Code: %d\n", r.StatusCode)
    // Retry rate-limited and server-error responses (requires the "time" import);
    // in production, track a retry count (e.g. in r.Request.Ctx) to avoid endless retry loops
    if r.StatusCode == 429 || r.StatusCode >= 500 {
        time.Sleep(5 * time.Second)
        r.Request.Retry()
    }
    // logError is a placeholder for your own logging helper
    logError(r.Request.URL.String(), err)
})
Scraping Lifecycle Callbacks
OnScraped Callback
The OnScraped callback is executed after all OnHTML and OnXML callbacks have finished processing a response.
c.OnScraped(func(r *colly.Response) {
    fmt.Printf("Finished scraping: %s\n", r.Request.URL.String())
    // Typical uses: clean up resources, update a database with
    // completion status, or trigger the next stage of processing
    // markPageAsProcessed is a placeholder for your own bookkeeping helper
    markPageAsProcessed(r.Request.URL.String())
})
Advanced Callback Patterns
Chaining Multiple Callbacks
You can register multiple callbacks of the same type, and they will be executed in the order they were registered.
// First OnHTML callback
c.OnHTML("article", func(e *colly.HTMLElement) {
fmt.Printf("Processing article: %s\n", e.ChildText("h1"))
})
// Second OnHTML callback for the same selector
c.OnHTML("article", func(e *colly.HTMLElement) {
// Extract additional data
author := e.ChildText(".author")
date := e.ChildText(".date")
fmt.Printf("Author: %s, Date: %s\n", author, date)
})
Conditional Callback Execution
You can implement conditional logic within callbacks to handle different scenarios.
c.OnHTML("a[href]", func(e *colly.HTMLElement) {
link := e.Attr("href")
// Only follow internal links
if strings.HasPrefix(link, "/") || strings.Contains(link, e.Request.URL.Host) {
fmt.Printf("Following internal link: %s\n", link)
e.Request.Visit(link)
} else {
fmt.Printf("Skipping external link: %s\n", link)
}
})
Context-Aware Callbacks
Use the request context to pass data between callbacks and maintain state across requests.
c.OnRequest(func(r *colly.Request) {
// Set context data
r.Ctx.Put("start_time", time.Now())
r.Ctx.Put("depth", "1")
})
c.OnHTML("a[href]", func(e *colly.HTMLElement) {
// Access context data
depth := e.Request.Ctx.Get("depth")
if depth == "1" {
// Only follow links on first level
e.Request.Visit(e.Attr("href"))
}
})
c.OnScraped(func(r *colly.Response) {
    // Use the comma-ok form so a missing context key doesn't cause a panic
    if startTime, ok := r.Request.Ctx.GetAny("start_time").(time.Time); ok {
        fmt.Printf("Page processed in: %v\n", time.Since(startTime))
    }
})
Practical Implementation Example
Here's a complete example demonstrating multiple callback types working together:
package main
import (
"fmt"
"log"
"time"
"github.com/gocolly/colly/v2"
"github.com/gocolly/colly/v2/debug"
)
func main() {
c := colly.NewCollector(
    colly.Async(true), // async mode, so Parallelism applies and c.Wait() is meaningful
    colly.Debugger(&debug.LogDebugger{}),
)
// Rate limiting
c.Limit(&colly.LimitRule{
DomainGlob: "*",
Parallelism: 2,
Delay: 1 * time.Second,
})
// Request preprocessing
c.OnRequest(func(r *colly.Request) {
r.Headers.Set("User-Agent", "WebScraper 1.0")
fmt.Printf("Requesting: %s\n", r.URL.String())
})
// Response analysis
c.OnResponse(func(r *colly.Response) {
fmt.Printf("Received %d bytes from %s\n", len(r.Body), r.Request.URL)
})
// Extract and follow links
c.OnHTML("a[href]", func(e *colly.HTMLElement) {
link := e.Attr("href")
if link != "" {
e.Request.Visit(link)
}
})
// Extract data
c.OnHTML("title", func(e *colly.HTMLElement) {
fmt.Printf("Page title: %s\n", e.Text)
})
// Error handling
c.OnError(func(r *colly.Response, err error) {
log.Printf("Error: %s - %s\n", r.Request.URL, err)
})
// Completion tracking
c.OnScraped(func(r *colly.Response) {
fmt.Printf("Finished: %s\n", r.Request.URL)
})
c.Visit("https://example.com")
c.Wait()
}
Best Practices for Callback Usage
Performance Optimization
- Keep callback functions lightweight and avoid heavy computations
- Use goroutines for time-consuming operations within callbacks (a sketch follows this list)
- Implement proper error handling to prevent callback failures from stopping the scraper
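As a minimal sketch of the goroutine hand-off: copy the data you need out of the element before spawning the goroutine, then do the slow work in the background. saveRecord is a hypothetical stand-in for any expensive operation, and the sync import is required:
var wg sync.WaitGroup
c.OnHTML(".product", func(e *colly.HTMLElement) {
    name := e.ChildText(".product-name") // copy data out before the callback returns
    wg.Add(1)
    go func() {
        defer wg.Done()
        saveRecord(name) // hypothetical slow operation, e.g. a database write
    }()
})
c.Visit("https://example.com")
wg.Wait() // wait for background work after the crawl completes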
Memory Management
- Be cautious with storing large amounts of data in callback closures (see the streaming sketch after this list)
- Clean up resources in OnScraped callbacks
- Use context wisely to avoid memory leaks
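For example, rather than accumulating every scraped record in a slice, you can stream results to disk as they arrive. A minimal sketch using the standard library's encoding/csv and os packages:
file, _ := os.Create("results.csv") // error handling omitted for brevity
defer file.Close()
writer := csv.NewWriter(file)
defer writer.Flush()
c.OnHTML(".product", func(e *colly.HTMLElement) {
    // Write each record immediately instead of holding it in memory
    writer.Write([]string{e.ChildText(".product-name"), e.ChildText(".price")})
})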
Error Recovery
- Always implement OnError callbacks for robust scraping
- Use retry mechanisms for transient failures
- Log errors appropriately for debugging and monitoring
Callback Execution Order
Understanding the order in which callbacks are executed is crucial for effective scraping:
- OnRequest - Fired before the HTTP request is made
- OnResponseHeaders - Fired when response headers are received
- OnResponse - Fired when the complete response is received
- OnHTML/OnXML - Fired for each matching element in the response
- OnScraped - Fired after all HTML/XML callbacks have completed
- OnError - Fired when an error occurs at any stage; for a failed request, the remaining callbacks are skipped
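To see this order in practice, you can register one callback of each type that simply announces itself; for a page that returns HTML, the output follows the sequence above:
c := colly.NewCollector()
c.OnRequest(func(r *colly.Request) { fmt.Println("1. OnRequest") })
c.OnResponseHeaders(func(r *colly.Response) { fmt.Println("2. OnResponseHeaders") })
c.OnResponse(func(r *colly.Response) { fmt.Println("3. OnResponse") })
c.OnHTML("html", func(e *colly.HTMLElement) { fmt.Println("4. OnHTML") })
c.OnScraped(func(r *colly.Response) { fmt.Println("5. OnScraped") })
c.OnError(func(r *colly.Response, err error) { fmt.Println("OnError:", err) })
c.Visit("https://example.com")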
Working with Multiple Collectors
You can use different callback sets for different collectors to handle various types of content:
// Create both collectors first so each callback can reference the other collector
mainCollector := colly.NewCollector()
productCollector := colly.NewCollector()
// Collector for main content: follow links to product pages
mainCollector.OnHTML("a.product-link", func(e *colly.HTMLElement) {
    productURL := e.Request.AbsoluteURL(e.Attr("href")) // resolve relative URLs
    productCollector.Visit(productURL)
})
// Specialized collector for product pages
productCollector.OnHTML(".product-details", func(e *colly.HTMLElement) {
    name := e.ChildText(".product-name")
    price := e.ChildText(".price")
    fmt.Printf("Product: %s - %s\n", name, price)
})
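If the product collector needs information from the listing page, one option is to pass a context through Collector.Request instead of Visit. A minimal sketch, assuming the listing URL is the data worth carrying over:
mainCollector.OnHTML("a.product-link", func(e *colly.HTMLElement) {
    ctx := colly.NewContext()
    ctx.Put("listing_url", e.Request.URL.String()) // remember where the link was found
    productCollector.Request("GET", e.Request.AbsoluteURL(e.Attr("href")), nil, ctx, nil)
})
productCollector.OnHTML(".product-details", func(e *colly.HTMLElement) {
    fmt.Printf("Found on listing page: %s\n", e.Request.Ctx.Get("listing_url"))
})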
Conclusion
Colly's callback system provides a powerful and flexible way to handle different aspects of web scraping. By understanding and properly utilizing these callbacks - OnHTML, OnXML, OnRequest, OnResponse, OnResponseHeaders, OnError, and OnScraped - you can build sophisticated scrapers that handle complex scenarios gracefully. The key to effective scraping lies in combining these callbacks strategically while maintaining clean, maintainable code that handles errors appropriately.
When working with complex web applications that require JavaScript execution, you might also want to explore how to handle browser events in Puppeteer or learn about handling authentication in browser-based scraping for scenarios where Colly's static HTML processing isn't sufficient.