How do I implement caching mechanisms in Colly?

Caching is essential for building efficient web scrapers that minimize redundant requests and improve performance. Colly, the fast and elegant Go web scraping framework, provides several built-in caching mechanisms and allows for custom cache implementations. This guide covers how to implement various caching strategies in Colly to optimize your web scraping projects.

Understanding Colly's Caching Options

Colly ships one caching mechanism of its own, and its request/response hooks make several other strategies easy to layer on top:

  1. HTTP Cache - Colly's built-in disk cache of GET responses (CacheDir)
  2. In-Memory Cache - Fast access for frequently requested pages
  3. File-based Cache - Persistent caching across application restarts
  4. Custom Cache - Implementing your own caching logic

HTTP Cache Implementation

The simplest way to enable caching in Colly is the built-in CacheDir option, which transparently stores successful GET responses on disk and replays them on later visits. Note that it caches by URL and does not perform conditional revalidation with response headers such as ETag or Last-Modified.

package main

import (
    "fmt"
    "log"

    "github.com/gocolly/colly/v2"
    "github.com/gocolly/colly/v2/debug"
)

func main() {
    c := colly.NewCollector(
        colly.Debugger(&debug.LogDebugger{}),
    )

    // Enable Colly's built-in HTTP cache; responses to GET requests
    // are stored on disk under this directory
    c.CacheDir = "./cache"

    // Request-level cache headers: Colly's disk cache only reacts to
    // "Cache-Control: no-cache" (which bypasses the cache); other
    // directives such as max-age are simply forwarded to the server
    c.OnRequest(func(r *colly.Request) {
        r.Headers.Set("Cache-Control", "max-age=3600")
    })

    c.OnHTML("title", func(e *colly.HTMLElement) {
        fmt.Printf("Title: %s\n", e.Text)
    })

    c.OnError(func(r *colly.Response, err error) {
        log.Printf("Error: %s\n", err.Error())
    })

    // First request - will be cached
    c.Visit("https://example.com")

    // Second request - will use cache if available
    c.Visit("https://example.com")
}

In-Memory Cache

Colly's storage interface, whose default backend is storage.InMemoryStorage, keeps visited-URL state and cookies in memory. It is not a response cache: its caching effect is deduplication, because already-visited URLs are not fetched again.

package main

import (
    "fmt"
    "log"
    "time"

    "github.com/gocolly/colly/v2"
    "github.com/gocolly/colly/v2/storage"
)

func main() {
    c := colly.NewCollector()

    // InMemoryStorage is Colly's default backend; it tracks visited URLs
    // and cookies in memory. SetStorage is shown explicitly because the
    // same call accepts persistent backends as well.
    memStorage := &storage.InMemoryStorage{}
    if err := c.SetStorage(memStorage); err != nil {
        log.Fatal(err)
    }

    c.OnHTML("h1", func(e *colly.HTMLElement) {
        fmt.Printf("Heading: %s\n", e.Text)
    })

    urls := []string{
        "https://example.com",
        "https://example.com/page1",
        "https://example.com/page2",
    }

    for _, url := range urls {
        if err := c.Visit(url); err != nil {
            log.Printf("visit %s: %v", url, err)
        }
        time.Sleep(100 * time.Millisecond)
    }

    // Revisiting the same URLs returns ErrAlreadyVisited: the storage
    // deduplicates visits rather than serving cached response bodies
    for _, url := range urls {
        if err := c.Visit(url); err != nil {
            log.Printf("visit %s: %v", url, err)
        }
    }
}

File-Based Cache

For persistent caching that survives application restarts, point Colly's built-in cache at a directory on disk:

package main

import (
    "fmt"
    "log"
    "os"

    "github.com/gocolly/colly/v2"
)

func main() {
    // Create cache directory
    cacheDir := "./colly_cache"
    if err := os.MkdirAll(cacheDir, 0o755); err != nil {
        log.Fatal(err)
    }

    // CacheDir enables Colly's persistent cache: successful GET
    // responses are written to disk and reused across runs
    c := colly.NewCollector(
        colly.CacheDir(cacheDir),
    )

    c.OnRequest(func(r *colly.Request) {
        fmt.Printf("Requesting: %s\n", r.URL)
    })

    c.OnResponse(func(r *colly.Response) {
        fmt.Printf("Response received: %d bytes\n", len(r.Body))
    })

    c.OnHTML("title", func(e *colly.HTMLElement) {
        fmt.Printf("Title: %s\n", e.Text)
    })

    // Visit pages - responses are cached to disk; rerunning the program
    // serves them from the cache without hitting the network
    c.Visit("https://httpbin.org/html")
    c.Visit("https://httpbin.org/json")
}

Custom Cache Implementation

For advanced caching requirements, implement a custom cache with TTL (Time To Live) and size limits:

package main

import (
    "crypto/md5"
    "fmt"
    "sync"
    "time"

    "github.com/gocolly/colly/v2"
)

type CacheItem struct {
    Data      []byte
    ExpiresAt time.Time
}

type CustomCache struct {
    items   map[string]*CacheItem
    mutex   sync.RWMutex
    maxSize int
    ttl     time.Duration
}

func NewCustomCache(maxSize int, ttl time.Duration) *CustomCache {
    cache := &CustomCache{
        items:   make(map[string]*CacheItem),
        maxSize: maxSize,
        ttl:     ttl,
    }

    // Start cleanup goroutine
    go cache.cleanup()
    return cache
}

func (c *CustomCache) Get(url string) ([]byte, bool) {
    c.mutex.RLock()
    defer c.mutex.RUnlock()

    key := c.generateKey(url)
    item, exists := c.items[key]

    if !exists || time.Now().After(item.ExpiresAt) {
        return nil, false
    }

    return item.Data, true
}

func (c *CustomCache) Set(url string, data []byte) {
    c.mutex.Lock()
    defer c.mutex.Unlock()

    // Check size limit
    if len(c.items) >= c.maxSize {
        c.evictOldest()
    }

    key := c.generateKey(url)
    c.items[key] = &CacheItem{
        Data:      data,
        ExpiresAt: time.Now().Add(c.ttl),
    }
}

func (c *CustomCache) generateKey(url string) string {
    hash := md5.Sum([]byte(url))
    return fmt.Sprintf("%x", hash)
}

func (c *CustomCache) evictOldest() {
    var oldestKey string
    var oldestTime time.Time

    for key, item := range c.items {
        if oldestKey == "" || item.ExpiresAt.Before(oldestTime) {
            oldestKey = key
            oldestTime = item.ExpiresAt
        }
    }

    if oldestKey != "" {
        delete(c.items, oldestKey)
    }
}

func (c *CustomCache) cleanup() {
    ticker := time.NewTicker(5 * time.Minute)
    defer ticker.Stop()

    for range ticker.C {
        c.mutex.Lock()
        now := time.Now()
        for key, item := range c.items {
            if now.After(item.ExpiresAt) {
                delete(c.items, key)
            }
        }
        c.mutex.Unlock()
    }
}

func main() {
    c := colly.NewCollector()
    cache := NewCustomCache(1000, 30*time.Minute)

    // Implement cache middleware
    c.OnRequest(func(r *colly.Request) {
        if cachedData, found := cache.Get(r.URL.String()); found {
            fmt.Printf("Cache hit (%d bytes) for: %s\n", len(cachedData), r.URL)
            // Aborting skips the network request; the cached data must be
            // processed here, since OnHTML callbacks won't fire for it
            r.Abort()
            return
        }
        fmt.Printf("Cache miss for: %s\n", r.URL)
    })

    c.OnResponse(func(r *colly.Response) {
        cache.Set(r.Request.URL.String(), r.Body)
        fmt.Printf("Cached response for: %s\n", r.Request.URL)
    })

    c.OnHTML("title", func(e *colly.HTMLElement) {
        fmt.Printf("Title: %s\n", e.Text)
    })

    // Test caching
    urls := []string{
        "https://httpbin.org/html",
        "https://httpbin.org/json",
        "https://httpbin.org/xml",
    }

    // First round - cache misses
    for _, url := range urls {
        c.Visit(url)
    }

    time.Sleep(1 * time.Second)

    // Second round - cache hits
    for _, url := range urls {
        c.Visit(url)
    }
}
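One caveat in the middleware above: r.Abort() prevents the request, so OnHTML callbacks never fire on a cache hit and the cached body has to be processed separately. A minimal stdlib sketch of that processing step (titleFromCached is an illustrative helper; a real implementation would reuse the same parsing logic as the OnHTML handlers):

```go
package main

import (
    "fmt"
    "regexp"
    "strings"
)

var titleRe = regexp.MustCompile(`(?is)<title[^>]*>(.*?)</title>`)

// titleFromCached extracts the <title> text from a cached HTML body,
// returning "" when no title tag is present.
func titleFromCached(body []byte) string {
    m := titleRe.FindSubmatch(body)
    if m == nil {
        return ""
    }
    return strings.TrimSpace(string(m[1]))
}

func main() {
    cached := []byte("<html><head><title> Example Domain </title></head></html>")
    fmt.Println(titleFromCached(cached)) // Example Domain
}
```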

Redis-Based Distributed Cache

For distributed scraping systems, implement a Redis-based cache:

package main

import (
    "context"
    "fmt"
    "time"

    "github.com/go-redis/redis/v8"
    "github.com/gocolly/colly/v2"
)

type RedisCache struct {
    client *redis.Client
    ctx    context.Context
    ttl    time.Duration
}

func NewRedisCache(addr, password string, db int, ttl time.Duration) *RedisCache {
    rdb := redis.NewClient(&redis.Options{
        Addr:     addr,
        Password: password,
        DB:       db,
    })

    return &RedisCache{
        client: rdb,
        ctx:    context.Background(),
        ttl:    ttl,
    }
}

func (r *RedisCache) Get(url string) ([]byte, error) {
    val, err := r.client.Get(r.ctx, url).Result()
    if err != nil {
        return nil, err
    }
    return []byte(val), nil
}

func (r *RedisCache) Set(url string, data []byte) error {
    return r.client.Set(r.ctx, url, data, r.ttl).Err()
}

func main() {
    c := colly.NewCollector()
    cache := NewRedisCache("localhost:6379", "", 0, 1*time.Hour)

    c.OnRequest(func(r *colly.Request) {
        // A missing key returns redis.Nil, so err == nil means a hit
        if cachedData, err := cache.Get(r.URL.String()); err == nil {
            fmt.Printf("Redis cache hit (%d bytes) for: %s\n", len(cachedData), r.URL)
            r.Abort()
            return
        }
        fmt.Printf("Redis cache miss for: %s\n", r.URL)
    })

    c.OnResponse(func(r *colly.Response) {
        if err := cache.Set(r.Request.URL.String(), r.Body); err != nil {
            fmt.Printf("Failed to cache: %v\n", err)
        } else {
            fmt.Printf("Cached in Redis: %s\n", r.Request.URL)
        }
    })

    c.OnHTML("title", func(e *colly.HTMLElement) {
        fmt.Printf("Title: %s\n", e.Text)
    })

    c.Visit("https://httpbin.org/html")
}
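Since Redis keys are shared by everything using the same database, it helps to namespace and normalize cache keys rather than storing raw URLs; the colly:cache: prefix below is just an illustrative convention:

```go
package main

import (
    "crypto/sha1"
    "encoding/hex"
    "fmt"
)

// cacheKey namespaces a URL into a fixed-length Redis key so arbitrary
// URLs can't collide with other keys in the same database.
func cacheKey(url string) string {
    sum := sha1.Sum([]byte(url))
    return "colly:cache:" + hex.EncodeToString(sum[:])
}

func main() {
    fmt.Println(cacheKey("https://example.com"))
}
```

The fixed prefix also makes bulk invalidation easy with a SCAN over colly:cache:* keys.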

Cache Configuration Best Practices

Setting Appropriate TTL Values

func configureCacheTTL(c *colly.Collector) {
    c.OnRequest(func(r *colly.Request) {
        // Different TTL for different content types
        if isStaticResource(r.URL.String()) {
            r.Ctx.Put("cache_ttl", "24h")
        } else if isAPIEndpoint(r.URL.String()) {
            r.Ctx.Put("cache_ttl", "5m")
        } else {
            r.Ctx.Put("cache_ttl", "1h")
        }
    })
}

func isStaticResource(url string) bool {
    // Treat common static file extensions as long-lived content
    // (requires "strings")
    for _, ext := range []string{".css", ".js", ".png", ".jpg", ".svg", ".woff2"} {
        if strings.HasSuffix(url, ext) {
            return true
        }
    }
    return false
}

func isAPIEndpoint(url string) bool {
    // Treat anything under an /api/ path as fast-changing content
    return strings.Contains(url, "/api/")
}

Cache Invalidation

type InvalidationCache struct {
    cache       map[string]*CacheItem
    invalidated map[string]bool
    mutex       sync.RWMutex
}

func (ic *InvalidationCache) Invalidate(pattern string) {
    ic.mutex.Lock()
    defer ic.mutex.Unlock()

    for key := range ic.cache {
        if matchesPattern(key, pattern) {
            ic.invalidated[key] = true
        }
    }
}

func matchesPattern(key, pattern string) bool {
    // Simple wildcard matching: a trailing "*" matches any suffix,
    // e.g. "https://example.com/products/*" (requires "strings")
    if strings.HasSuffix(pattern, "*") {
        return strings.HasPrefix(key, strings.TrimSuffix(pattern, "*"))
    }
    return key == pattern
}

Monitoring and Debugging Cache Performance

type CacheStats struct {
    Hits   int64
    Misses int64
    mutex  sync.RWMutex
}

func (cs *CacheStats) RecordHit() {
    cs.mutex.Lock()
    defer cs.mutex.Unlock()
    cs.Hits++
}

func (cs *CacheStats) RecordMiss() {
    cs.mutex.Lock()
    defer cs.mutex.Unlock()
    cs.Misses++
}

func (cs *CacheStats) HitRatio() float64 {
    cs.mutex.RLock()
    defer cs.mutex.RUnlock()

    total := cs.Hits + cs.Misses
    if total == 0 {
        return 0
    }
    return float64(cs.Hits) / float64(total)
}
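Checking the ratio math is quick to do standalone (the type is repeated here so the snippet compiles on its own):

```go
package main

import (
    "fmt"
    "sync"
)

// CacheStats as defined in the article.
type CacheStats struct {
    Hits   int64
    Misses int64
    mutex  sync.RWMutex
}

func (cs *CacheStats) RecordHit()  { cs.mutex.Lock(); cs.Hits++; cs.mutex.Unlock() }
func (cs *CacheStats) RecordMiss() { cs.mutex.Lock(); cs.Misses++; cs.mutex.Unlock() }

func (cs *CacheStats) HitRatio() float64 {
    cs.mutex.RLock()
    defer cs.mutex.RUnlock()
    total := cs.Hits + cs.Misses
    if total == 0 {
        return 0
    }
    return float64(cs.Hits) / float64(total)
}

func main() {
    stats := &CacheStats{}
    for i := 0; i < 9; i++ {
        stats.RecordHit()
    }
    stats.RecordMiss()
    fmt.Printf("hit ratio: %.0f%%\n", stats.HitRatio()*100) // hit ratio: 90%
}
```

Calling RecordHit or RecordMiss from the OnRequest middleware shown earlier is enough to feed these counters in a real scraper.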

Integration with Other Caching Libraries

For production applications, consider integrating with established caching libraries like Ristretto or BigCache:

go get github.com/dgraph-io/ristretto
go get github.com/allegro/bigcache/v3

import (
    "github.com/dgraph-io/ristretto"
    "github.com/allegro/bigcache/v3"
)

func setupRistrettoCache() *ristretto.Cache {
    cache, err := ristretto.NewCache(&ristretto.Config{
        NumCounters: 1e7,     // track access frequency for ~10M keys
        MaxCost:     1 << 30, // maximum total cost of cache (~1GB)
        BufferItems: 64,      // number of keys per Get buffer
    })
    if err != nil {
        panic(err)
    }
    return cache
}

Conclusion

Implementing effective caching mechanisms in Colly significantly improves scraping performance and reduces server load. Choose the caching strategy that best fits your use case:

  • Use HTTP cache for simple, standards-compliant caching
  • Implement in-memory cache for fast, temporary caching
  • Use file-based cache for persistent caching across restarts
  • Build custom cache for advanced requirements like TTL and size limits
  • Consider Redis for distributed systems

When scraping involves stateful workflows such as authentication, proper caching becomes even more important: it helps maintain session state and avoids the overhead of repeated login procedures.

Remember to monitor cache performance and adjust TTL values to match how often the target content actually changes. A well-tuned cache can cut response times dramatically and significantly decrease the load on target servers.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
