# Using Colly's Callback Functions Effectively
Colly is a powerful web scraping framework for Go that uses an event-driven architecture. Mastering callback functions is essential for building efficient and robust scrapers. This guide covers the core callbacks and advanced patterns for effective web scraping.
## Core Callback Functions
Colly provides several callback functions that handle different stages of the scraping process:
| Callback | Trigger | Use Case |
|----------|---------|----------|
| `OnRequest` | Before a request is sent | Modify headers, add authentication |
| `OnResponse` | After a response is received | Handle raw data, save files |
| `OnHTML` | When an HTML element matches a selector | Extract structured data |
| `OnError` | When a request fails | Handle errors, implement retries |
| `OnScraped` | After all `OnHTML` callbacks finish | Post-processing, cleanup |
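
To make the lifecycle concrete, here is a minimal sketch that registers all five callbacks and prints the order in which they fire for a successful request (`example.com` stands in for a real target):

```go
package main

import (
	"fmt"

	"github.com/gocolly/colly/v2"
)

func main() {
	c := colly.NewCollector()

	// For each request, callbacks fire in lifecycle order:
	// OnRequest -> OnResponse -> OnHTML -> OnScraped.
	// OnError fires instead of the later stages when the request fails.
	c.OnRequest(func(r *colly.Request) { fmt.Println("1. OnRequest:", r.URL) })
	c.OnResponse(func(r *colly.Response) { fmt.Println("2. OnResponse:", r.StatusCode) })
	c.OnHTML("html", func(e *colly.HTMLElement) { fmt.Println("3. OnHTML: matched <html>") })
	c.OnError(func(r *colly.Response, err error) { fmt.Println("OnError:", err) })
	c.OnScraped(func(r *colly.Response) { fmt.Println("4. OnScraped:", r.Request.URL) })

	c.Visit("https://example.com")
}
```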
## OnHTML - Data Extraction

`OnHTML` is the workhorse for extracting structured data from web pages. It accepts CSS selectors and provides access to matched elements.
### Basic Usage

```go
package main

import (
	"fmt"

	"github.com/gocolly/colly/v2"
)

func main() {
	c := colly.NewCollector()

	// Extract all links
	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		link := e.Attr("href")
		text := e.Text
		fmt.Printf("Found link: %s -> %s\n", text, link)
	})

	c.Visit("https://example.com")
}
```
### Advanced Data Extraction

```go
c.OnHTML("article.post", func(e *colly.HTMLElement) {
	post := struct {
		Title   string
		Author  string
		Date    string
		Content string
		Tags    []string
	}{
		Title:   e.ChildText("h1.title"),
		Author:  e.ChildText(".author"),
		Date:    e.ChildAttr("time", "datetime"),
		Content: e.ChildText(".content"),
	}

	// Extract multiple tags
	e.ForEach(".tag", func(i int, tag *colly.HTMLElement) {
		post.Tags = append(post.Tags, tag.Text)
	})

	fmt.Printf("Post: %+v\n", post)
})
```
### Following Links Recursively

```go
c.OnHTML("a[href]", func(e *colly.HTMLElement) {
	link := e.Request.AbsoluteURL(e.Attr("href"))

	// Only follow internal links
	if strings.Contains(link, "example.com") {
		e.Request.Visit(link)
	}
})
```
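
The substring check above is easy to fool (a link to `not-example.com.evil.com` would pass it). A more robust sketch restricts the collector itself, so stray links to other hosts are never visited; the domain names and depth here are illustrative:

```go
c := colly.NewCollector(
	colly.AllowedDomains("example.com", "www.example.com"),
	colly.MaxDepth(2), // stop recursion after two link hops
)

c.OnHTML("a[href]", func(e *colly.HTMLElement) {
	// Visit returns an error for URLs outside AllowedDomains; ignored here.
	e.Request.Visit(e.Request.AbsoluteURL(e.Attr("href")))
})
```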
## OnRequest - Request Modification

Use `OnRequest` to modify requests before they're sent. This is essential for authentication, headers, and request customization.
```go
c.OnRequest(func(r *colly.Request) {
	// Set custom headers
	r.Headers.Set("User-Agent", "MyBot/1.0")
	r.Headers.Set("Accept", "text/html,application/xhtml+xml")

	// Add authentication
	r.Headers.Set("Authorization", "Bearer "+apiToken)

	// Log requests
	fmt.Printf("Visiting: %s\n", r.URL.String())
})
```
### Dynamic Request Modification

```go
c.OnRequest(func(r *colly.Request) {
	// Add a timestamp to avoid caching
	q := r.URL.Query()
	q.Add("t", fmt.Sprintf("%d", time.Now().Unix()))
	r.URL.RawQuery = q.Encode()

	// Set a referer for pages that require it
	if strings.Contains(r.URL.Path, "/protected/") {
		r.Headers.Set("Referer", "https://example.com/login")
	}
})
```
## OnResponse - Response Processing

`OnResponse` handles raw response data, perfect for downloading files or processing non-HTML content.
```go
c.OnResponse(func(r *colly.Response) {
	contentType := r.Headers.Get("Content-Type")

	switch {
	case strings.Contains(contentType, "image/"):
		// Save images (the images/ directory must already exist)
		filename := fmt.Sprintf("images/%s", r.FileName())
		if err := r.Save(filename); err != nil {
			log.Printf("failed to save %s: %v", filename, err)
		}
	case strings.Contains(contentType, "application/json"):
		// Process JSON responses
		var data map[string]interface{}
		if err := json.Unmarshal(r.Body, &data); err == nil {
			fmt.Printf("JSON data: %+v\n", data)
		}
	default:
		fmt.Printf("Received %s (%d bytes)\n", r.Request.URL, len(r.Body))
	}
})
```
## OnError - Error Handling
Robust error handling prevents crashes and enables retry logic.
```go
c.OnError(func(r *colly.Response, err error) {
	fmt.Printf("Error on %s: %v\n", r.Request.URL, err)

	// Implement retry logic: track attempts in the request context
	retryCount := r.Request.Ctx.GetAny("retryCount")
	if retryCount == nil {
		retryCount = 0
	}
	if retryCount.(int) < 3 {
		r.Request.Ctx.Put("retryCount", retryCount.(int)+1)
		time.Sleep(2 * time.Second) // wait before retrying
		r.Request.Retry()
	}
})
```
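
For flaky targets, a fixed two-second pause can be too rigid. Here is a variant of the same pattern with exponential backoff; the `retryCount` context key is the same ad-hoc convention used above:

```go
c.OnError(func(r *colly.Response, err error) {
	retries := 0
	if v := r.Request.Ctx.GetAny("retryCount"); v != nil {
		retries = v.(int)
	}
	if retries >= 3 {
		log.Printf("giving up on %s after %d retries: %v", r.Request.URL, retries, err)
		return
	}
	r.Request.Ctx.Put("retryCount", retries+1)
	time.Sleep(time.Duration(1<<retries) * time.Second) // 1s, 2s, 4s
	r.Request.Retry()
})
```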
## OnScraped - Post-Processing

`OnScraped` executes after all `OnHTML` callbacks complete, ideal for cleanup and aggregation.
```go
var pageData []string

c.OnHTML("h1", func(e *colly.HTMLElement) {
	pageData = append(pageData, e.Text)
})

c.OnScraped(func(r *colly.Response) {
	fmt.Printf("Scraped %s: found %d headings\n",
		r.Request.URL, len(pageData))

	// Save collected data (saveToDatabase is your own persistence function)
	saveToDatabase(pageData)

	// Reset for the next page
	pageData = nil
})
```
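
One caveat: a package-level slice like `pageData` is shared across pages, which breaks once the collector runs asynchronously. A sketch that keeps the state per request instead, via the request context (the `headings` key is our own naming):

```go
c.OnRequest(func(r *colly.Request) {
	r.Ctx.Put("headings", []string{})
})

c.OnHTML("h1", func(e *colly.HTMLElement) {
	hs := e.Request.Ctx.GetAny("headings").([]string)
	e.Request.Ctx.Put("headings", append(hs, e.Text))
})

c.OnScraped(func(r *colly.Response) {
	hs := r.Ctx.GetAny("headings").([]string)
	fmt.Printf("Scraped %s: found %d headings\n", r.Request.URL, len(hs))
})
```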
## Complete Example: E-commerce Scraper
```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"strconv"
	"strings"

	"github.com/gocolly/colly/v2"
	"github.com/gocolly/colly/v2/debug"
)

type Product struct {
	Name  string  `json:"name"`
	Price float64 `json:"price"`
	URL   string  `json:"url"`
}

func main() {
	c := colly.NewCollector(
		colly.Debugger(&debug.LogDebugger{}),
	)

	var products []Product

	// Set up request middleware
	c.OnRequest(func(r *colly.Request) {
		r.Headers.Set("User-Agent", "Product-Scraper/1.0")
		fmt.Printf("Visiting: %s\n", r.URL)
	})

	// Extract product data
	c.OnHTML(".product", func(e *colly.HTMLElement) {
		name := e.ChildText(".product-name")
		priceStr := strings.TrimPrefix(e.ChildText(".price"), "$")
		price, err := strconv.ParseFloat(priceStr, 64)
		if err != nil {
			log.Printf("skipping %q: unparseable price %q", name, priceStr)
			return
		}

		products = append(products, Product{
			Name:  name,
			Price: price,
			URL:   e.Request.AbsoluteURL(e.ChildAttr("a", "href")),
		})
	})

	// Follow pagination
	c.OnHTML(".pagination a.next", func(e *colly.HTMLElement) {
		e.Request.Visit(e.Request.AbsoluteURL(e.Attr("href")))
	})

	// Handle errors
	c.OnError(func(r *colly.Response, err error) {
		log.Printf("Error scraping %s: %v", r.Request.URL, err)
	})

	// Process results
	c.OnScraped(func(r *colly.Response) {
		fmt.Printf("Finished scraping %s: %d products found\n",
			r.Request.URL, len(products))
	})

	// Start scraping
	if err := c.Visit("https://example-store.com/products"); err != nil {
		log.Fatal(err)
	}

	// Save results
	if data, err := json.MarshalIndent(products, "", "  "); err == nil {
		fmt.Println(string(data))
	}
}
```
## Best Practices

### 1. Callback Order and Dependencies
```go
// Callbacks of the same type execute in registration order
c.OnRequest(func(r *colly.Request) { /* runs first */ })
c.OnRequest(func(r *colly.Request) { /* runs second */ })
```
### 2. Error Recovery
```go
c.OnError(func(r *colly.Response, err error) {
	if r.StatusCode == 429 { // rate limited
		time.Sleep(time.Minute)
		r.Request.Retry()
	}
})
```
### 3. Resource Management

```go
c.OnScraped(func(r *colly.Response) {
	// Clean up resources (closeConnections and scrapeCount are
	// placeholders for your own bookkeeping)
	closeConnections()

	// Occasionally trigger garbage collection in long-running scrapers
	if scrapeCount%100 == 0 {
		runtime.GC()
	}
})
```
### 4. Selective Processing

```go
// Use domain-specific callbacks
c.OnHTML("title", func(e *colly.HTMLElement) {
	if strings.Contains(e.Request.URL.Host, "target-site.com") {
		// Process only the target site
	}
})
```
### 5. Context and State Management

```go
c.OnRequest(func(r *colly.Request) {
	r.Ctx.Put("startTime", time.Now())
})

c.OnScraped(func(r *colly.Response) {
	startTime := r.Ctx.GetAny("startTime").(time.Time)
	fmt.Printf("Page scraped in %v\n", time.Since(startTime))
})
```
By mastering these callback patterns, you can build sophisticated web scrapers that handle complex scenarios while maintaining clean, maintainable code. Remember to always respect robots.txt and implement appropriate delays to avoid overwhelming target servers.
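
As a concrete starting point for those delays, Colly's `LimitRule` can throttle a collector per domain; the glob and timings below are illustrative, not recommendations for any particular site:

```go
c := colly.NewCollector()

// Throttle matching requests: at most two in parallel, with a fixed
// one-second delay plus up to one second of random jitter between them.
if err := c.Limit(&colly.LimitRule{
	DomainGlob:  "*example.com*",
	Parallelism: 2,
	Delay:       1 * time.Second,
	RandomDelay: 1 * time.Second,
}); err != nil {
	log.Fatal(err)
}
```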