How do I use Colly's callback functions effectively?

Using Colly's Callback Functions Effectively

Colly is a powerful web scraping framework for Go that uses an event-driven architecture. Mastering callback functions is essential for building efficient and robust scrapers. This guide covers the core callbacks and advanced patterns for effective web scraping.

Core Callback Functions

Colly provides several callback functions that handle different stages of the scraping process:

| Callback | Trigger | Use Case |
|----------|---------|----------|
| OnRequest | Before sending a request | Modify headers, add authentication |
| OnResponse | After receiving a response | Handle raw data, save files |
| OnHTML | When an HTML element matches a selector | Extract structured data |
| OnError | When a request fails | Handle errors, implement retries |
| OnScraped | After all OnHTML callbacks finish | Post-processing, cleanup |
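
Before looking at each callback individually, it helps to see the firing order on a single page load. The sketch below registers one callback of each type against a placeholder URL (https://example.com stands in for a real target); on success they fire in the order OnRequest, OnResponse, OnHTML, OnScraped, while OnError replaces the last three when the request fails.

package main

import (
    "fmt"

    "github.com/gocolly/colly/v2"
)

func main() {
    c := colly.NewCollector()

    c.OnRequest(func(r *colly.Request) { fmt.Println("1. OnRequest:", r.URL) })
    c.OnError(func(r *colly.Response, err error) { fmt.Println("!  OnError:", err) })
    c.OnResponse(func(r *colly.Response) { fmt.Println("2. OnResponse:", r.StatusCode) })
    c.OnHTML("title", func(e *colly.HTMLElement) { fmt.Println("3. OnHTML:", e.Text) })
    c.OnScraped(func(r *colly.Response) { fmt.Println("4. OnScraped:", r.Request.URL) })

    c.Visit("https://example.com")
}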

OnHTML - Data Extraction

OnHTML is the workhorse for extracting structured data from web pages. It accepts CSS selectors and provides access to matched elements.

Basic Usage

package main

import (
    "fmt"
    "github.com/gocolly/colly/v2"
)

func main() {
    c := colly.NewCollector()

    // Extract all links
    c.OnHTML("a[href]", func(e *colly.HTMLElement) {
        link := e.Attr("href")
        text := e.Text
        fmt.Printf("Found link: %s -> %s\n", text, link)
    })

    c.Visit("https://example.com")
}

Advanced Data Extraction

c.OnHTML("article.post", func(e *colly.HTMLElement) {
    post := struct {
        Title   string
        Author  string
        Date    string
        Content string
        Tags    []string
    }{
        Title:   e.ChildText("h1.title"),
        Author:  e.ChildText(".author"),
        Date:    e.ChildAttr("time", "datetime"),
        Content: e.ChildText(".content"),
    }

    // Extract multiple tags
    e.ForEach(".tag", func(i int, tag *colly.HTMLElement) {
        post.Tags = append(post.Tags, tag.Text)
    })

    fmt.Printf("Post: %+v\n", post)
})
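
If the target markup maps cleanly onto a struct, colly's HTMLElement.Unmarshal can replace most of the manual ChildText/ChildAttr calls by reading selector struct tags. This is a sketch against the same hypothetical article.post markup as above:

type post struct {
    Title  string `selector:"h1.title"`
    Author string `selector:".author"`
    Date   string `selector:"time" attr:"datetime"`
}

c.OnHTML("article.post", func(e *colly.HTMLElement) {
    var p post
    // Unmarshal fills each field from the element matching its selector tag.
    if err := e.Unmarshal(&p); err != nil {
        fmt.Println("unmarshal failed:", err)
        return
    }
    fmt.Printf("Post: %+v\n", p)
})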

Following Links Recursively

c.OnHTML("a[href]", func(e *colly.HTMLElement) {
    link := e.Request.AbsoluteURL(e.Attr("href"))

    // Only follow links that look internal (a naive substring check;
    // see the stricter alternative below)
    if strings.Contains(link, "example.com") {
        e.Request.Visit(link)
    }
})
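
The substring check above will also match third-party URLs that merely contain "example.com" somewhere in them. A stricter option is to let the collector do the filtering itself; this is a sketch using colly's built-in domain whitelist, with example.com standing in for the real site:

c := colly.NewCollector(
    // Requests to any other host are dropped before OnRequest fires.
    colly.AllowedDomains("example.com", "www.example.com"),
)

c.OnHTML("a[href]", func(e *colly.HTMLElement) {
    // Visit resolves relative URLs and silently skips disallowed or
    // already-visited pages.
    e.Request.Visit(e.Attr("href"))
})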

OnRequest - Request Modification

Use OnRequest to modify requests before they're sent. This is essential for authentication, headers, and request customization.

c.OnRequest(func(r *colly.Request) {
    // Set custom headers
    r.Headers.Set("User-Agent", "MyBot/1.0")
    r.Headers.Set("Accept", "text/html,application/xhtml+xml")

    // Add authentication
    r.Headers.Set("Authorization", "Bearer " + apiToken)

    // Log requests
    fmt.Printf("Visiting: %s\n", r.URL.String())
})

Dynamic Request Modification

c.OnRequest(func(r *colly.Request) {
    // Add timestamp to avoid caching
    q := r.URL.Query()
    q.Add("t", fmt.Sprintf("%d", time.Now().Unix()))
    r.URL.RawQuery = q.Encode()

    // Set referer for pages that require it
    if strings.Contains(r.URL.Path, "/protected/") {
        r.Headers.Set("Referer", "https://example.com/login")
    }
})

OnResponse - Response Processing

OnResponse handles raw response data, perfect for downloading files or processing non-HTML content.

c.OnResponse(func(r *colly.Response) {
    contentType := r.Headers.Get("Content-Type")

    switch {
    case strings.Contains(contentType, "image/"):
        // Save images
        filename := fmt.Sprintf("images/%s", r.FileName())
        r.Save(filename)

    case strings.Contains(contentType, "application/json"):
        // Process JSON responses
        var data map[string]interface{}
        if err := json.Unmarshal(r.Body, &data); err == nil {
            fmt.Printf("JSON data: %+v\n", data)
        }

    default:
        fmt.Printf("Received %s (%d bytes)\n", r.Request.URL, len(r.Body))
    }
})
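
When responses can be large, colly v2 also exposes OnResponseHeaders, which fires after the headers arrive but before the body is downloaded; combined with Request.Abort it can skip unwanted transfers. A sketch (the PDF check is just an illustrative condition):

c.OnResponseHeaders(func(r *colly.Response) {
    // The response body has not been read yet at this point.
    if strings.Contains(r.Headers.Get("Content-Type"), "application/pdf") {
        r.Request.Abort() // skip downloading the body
    }
})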

OnError - Error Handling

Robust error handling prevents crashes and enables retry logic.

c.OnError(func(r *colly.Response, err error) {
    fmt.Printf("Error on %s: %v\n", r.Request.URL, err)

    // Implement retry logic
    retryCount := r.Request.Ctx.GetAny("retryCount")
    if retryCount == nil {
        retryCount = 0
    }

    if retryCount.(int) < 3 {
        r.Request.Ctx.Put("retryCount", retryCount.(int)+1)
        time.Sleep(time.Second * 2) // Wait before retry
        r.Request.Retry()
    }
})

OnScraped - Post-Processing

OnScraped executes after all OnHTML callbacks complete, ideal for cleanup and aggregation.

var pageData []string

c.OnHTML("h1", func(e *colly.HTMLElement) {
    pageData = append(pageData, e.Text)
})

c.OnScraped(func(r *colly.Response) {
    fmt.Printf("Scraped %s: found %d headings\n", 
        r.Request.URL, len(pageData))

    // Save collected data
    saveToDatabase(pageData)

    // Reset for next page
    pageData = nil
})
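
Note that pageData above is a package-level slice shared by every page the collector visits, which only works cleanly in the default synchronous mode. A sketch of keeping per-page state in the request Context instead (the "headings" key is an arbitrary name):

c.OnRequest(func(r *colly.Request) {
    r.Ctx.Put("headings", []string{})
})

c.OnHTML("h1", func(e *colly.HTMLElement) {
    headings := e.Response.Ctx.GetAny("headings").([]string)
    e.Response.Ctx.Put("headings", append(headings, e.Text))
})

c.OnScraped(func(r *colly.Response) {
    headings := r.Ctx.GetAny("headings").([]string)
    fmt.Printf("Scraped %s: found %d headings\n", r.Request.URL, len(headings))
})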

Complete Example: E-commerce Scraper

package main

import (
    "encoding/json"
    "fmt"
    "log"
    "strconv"
    "strings"

    "github.com/gocolly/colly/v2"
    "github.com/gocolly/colly/v2/debug"
)

type Product struct {
    Name  string  `json:"name"`
    Price float64 `json:"price"`
    URL   string  `json:"url"`
}

func main() {
    c := colly.NewCollector(
        colly.Debugger(&debug.LogDebugger{}),
    )

    var products []Product

    // Set up request middleware
    c.OnRequest(func(r *colly.Request) {
        r.Headers.Set("User-Agent", "Product-Scraper/1.0")
        fmt.Printf("Visiting: %s\n", r.URL)
    })

    // Extract product data
    c.OnHTML(".product", func(e *colly.HTMLElement) {
        name := e.ChildText(".product-name")
        priceStr := strings.TrimPrefix(e.ChildText(".price"), "$")
        price, _ := strconv.ParseFloat(priceStr, 64)
        url := e.Request.AbsoluteURL(e.ChildAttr("a", "href"))

        product := Product{
            Name:  name,
            Price: price,
            URL:   url,
        }

        products = append(products, product)
    })

    // Follow pagination
    c.OnHTML(".pagination a.next", func(e *colly.HTMLElement) {
        nextPage := e.Request.AbsoluteURL(e.Attr("href"))
        e.Request.Visit(nextPage)
    })

    // Handle errors
    c.OnError(func(r *colly.Response, err error) {
        log.Printf("Error scraping %s: %v", r.Request.URL, err)
    })

    // Process results
    c.OnScraped(func(r *colly.Response) {
        fmt.Printf("Finished scraping %s: %d products found\n", 
            r.Request.URL, len(products))
    })

    // Start scraping
    c.Visit("https://example-store.com/products")

    // Output the results as JSON
    if data, err := json.MarshalIndent(products, "", "  "); err == nil {
        fmt.Println(string(data))
    }
}
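
For larger crawls the same structure can run asynchronously. This is a sketch of the extra pieces (reusing the products slice and Product type from the example above); the mutex is needed because callbacks now run from multiple goroutines:

c := colly.NewCollector(
    colly.Async(true), // requests are executed in parallel goroutines
)
c.Limit(&colly.LimitRule{DomainGlob: "*", Parallelism: 4})

var mu sync.Mutex

c.OnHTML(".product", func(e *colly.HTMLElement) {
    mu.Lock()
    defer mu.Unlock()
    products = append(products, Product{Name: e.ChildText(".product-name")})
})

c.Visit("https://example-store.com/products")
c.Wait() // block until every queued request has finished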

Best Practices

1. Callback Order and Dependencies

// Callbacks execute in registration order
c.OnRequest(func(r *colly.Request) { /* First */ })
c.OnRequest(func(r *colly.Request) { /* Second */ })

2. Error Recovery

c.OnError(func(r *colly.Response, err error) {
    if r.StatusCode == 429 { // Rate limited
        time.Sleep(time.Minute)
        r.Request.Retry()
    }
})

3. Resource Management

var scrapeCount int // pages processed so far

c.OnScraped(func(r *colly.Response) {
    scrapeCount++

    // Clean up per-page resources (closeConnections is a placeholder for
    // your own cleanup logic)
    closeConnections()

    // Optionally nudge the garbage collector in long-running scrapers
    if scrapeCount%100 == 0 {
        runtime.GC()
    }
})

4. Selective Processing

// Use domain-specific callbacks
c.OnHTML("title", func(e *colly.HTMLElement) {
    if strings.Contains(e.Request.URL.Host, "target-site.com") {
        // Process only target site
    }
})

5. Context and State Management

c.OnRequest(func(r *colly.Request) {
    r.Ctx.Put("startTime", time.Now())
})

c.OnScraped(func(r *colly.Response) {
    // Guard the type assertion in case OnRequest did not set the key.
    if startTime, ok := r.Ctx.GetAny("startTime").(time.Time); ok {
        fmt.Printf("Page scraped in %v\n", time.Since(startTime))
    }
})

By mastering these callback patterns, you can build sophisticated web scrapers that handle complex scenarios while maintaining clean, maintainable code. Remember to always respect robots.txt and implement appropriate delays to avoid overwhelming target servers.
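
Those delays are usually configured through colly's LimitRule rather than manual sleeps; a short sketch with illustrative values:

c := colly.NewCollector()

// Wait roughly 1-3 seconds between requests to any domain.
c.Limit(&colly.LimitRule{
    DomainGlob:  "*",
    Delay:       1 * time.Second,
    RandomDelay: 2 * time.Second,
})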

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
