What are the Performance Considerations When Using Colly?
Colly is a powerful and fast web scraping framework for Go, but achieving optimal performance requires understanding and implementing several key considerations. Whether you're scraping a few pages or millions of URLs, these performance optimizations can significantly impact your scraper's efficiency and resource usage.
Parallelism and Concurrency Control
Async Mode Configuration
One of Colly's most powerful features is its ability to handle concurrent requests. By default, Colly operates synchronously, but enabling async mode can dramatically improve performance:
package main

import (
    "fmt"
    "time"

    "github.com/gocolly/colly/v2"
    "github.com/gocolly/colly/v2/debug"
)

func main() {
    c := colly.NewCollector(
        colly.Debugger(&debug.LogDebugger{}),
    )

    // Enable async mode so requests run concurrently
    c.Async = true

    // Limit concurrent requests to avoid overwhelming servers
    if err := c.Limit(&colly.LimitRule{
        DomainGlob:  "*",
        Parallelism: 2,               // Number of concurrent requests per matching domain
        Delay:       1 * time.Second, // Delay between requests
    }); err != nil {
        fmt.Println(err)
    }

    c.OnHTML("a[href]", func(e *colly.HTMLElement) {
        link := e.Attr("href")
        e.Request.Visit(link)
    })

    c.Visit("https://example.com")

    // Wait for all queued requests to complete
    c.Wait()
}
Optimal Parallelism Settings
Finding the right parallelism level is crucial for performance:
// Conservative approach for respectful scraping
c.Limit(&colly.LimitRule{
    DomainGlob:  "*httpbin.org*",
    Parallelism: 2,
    Delay:       500 * time.Millisecond,
})

// Aggressive approach for internal APIs or scraping with permission
c.Limit(&colly.LimitRule{
    DomainGlob:  "*internal-api.com*",
    Parallelism: 10,
    Delay:       100 * time.Millisecond,
})

// Different rules for different domains
c.Limit(&colly.LimitRule{
    DomainGlob:  "*social-media.com*",
    Parallelism: 1, // Strict rate limiting
    Delay:       2 * time.Second,
})
Memory Management Optimization
Request Queue Management
Colly maintains an internal queue of requests that can consume significant memory for large scraping operations:
c := colly.NewCollector()

// Cap concurrency to keep the request queue from growing unbounded
c.Limit(&colly.LimitRule{
    DomainGlob:  "*",
    Parallelism: 4, // Balance between speed and memory
})

// Use callbacks to process data immediately rather than storing it
c.OnHTML(".product", func(e *colly.HTMLElement) {
    // Process and save data immediately
    // (Product and saveProduct are application code, not Colly APIs)
    product := Product{
        Name:  e.ChildText(".name"),
        Price: e.ChildText(".price"),
    }
    // Save to database or file right away
    saveProduct(product)
    // Nothing is retained in memory for later processing
})
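For crawls with a very large URL frontier, Colly's companion queue package (github.com/gocolly/colly/v2/queue) gives you explicit control over that queue: a fixed number of consumer threads plus a bounded, pluggable store. A minimal sketch using the in-memory backend; the 10,000-URL cap is an arbitrary example:

package main

import (
    "github.com/gocolly/colly/v2"
    "github.com/gocolly/colly/v2/queue"
)

func main() {
    c := colly.NewCollector()

    // Two consumer threads; the frontier is capped at 10,000 pending URLs
    q, err := queue.New(2, &queue.InMemoryQueueStorage{MaxSize: 10000})
    if err != nil {
        panic(err)
    }

    q.AddURL("https://example.com")

    // Run blocks until the queue is drained
    q.Run(c)
}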
Efficient Data Processing
Process scraped data immediately to avoid memory accumulation:
// Bad: Storing all data in memory
var products []Product
c.OnHTML(".product", func(e *colly.HTMLElement) {
    products = append(products, Product{
        Name: e.ChildText(".name"),
    })
})

// Good: Process data immediately
c.OnHTML(".product", func(e *colly.HTMLElement) {
    product := Product{
        Name: e.ChildText(".name"),
    }
    // Stream to file or database
    writeToCSV(product)
    // or
    insertToDatabase(product)
})
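One caveat: in async mode these callbacks can run concurrently, so whatever writeToCSV does must itself be safe for concurrent use. A common pattern is a single writer goroutine fed by a channel, so only one goroutine ever touches the output file. A sketch under that assumption, where Product and the filename are illustrative:

package main

import (
    "encoding/csv"
    "os"

    "github.com/gocolly/colly/v2"
)

// Product is a hypothetical record type for this example.
type Product struct {
    Name  string
    Price string
}

func main() {
    c := colly.NewCollector()
    c.Async = true

    products := make(chan Product, 100) // buffer decouples scraping from disk I/O
    done := make(chan struct{})

    // Single writer goroutine: the only code that touches the file
    go func() {
        defer close(done)
        f, err := os.Create("products.csv")
        if err != nil {
            panic(err)
        }
        defer f.Close()
        w := csv.NewWriter(f)
        defer w.Flush()
        for p := range products {
            w.Write([]string{p.Name, p.Price})
        }
    }()

    c.OnHTML(".product", func(e *colly.HTMLElement) {
        // Callbacks may run concurrently; channel sends are safe
        products <- Product{
            Name:  e.ChildText(".name"),
            Price: e.ChildText(".price"),
        }
    })

    c.Visit("https://example.com/products")
    c.Wait()

    close(products) // let the writer drain and exit
    <-done
}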
HTTP Connection Optimization
Connection Pooling and Keep-Alive
Configure HTTP transport settings for better connection reuse:
import (
    "net/http"
    "time"
)

c := colly.NewCollector()

// Build the transport once and attach it to the collector.
// (A transport created inside OnRequest would never be used by
// Colly and would defeat connection reuse.)
c.WithTransport(&http.Transport{
    MaxIdleConns:        100,              // Total idle connections
    MaxIdleConnsPerHost: 10,               // Idle connections per host
    IdleConnTimeout:     90 * time.Second, // How long to keep idle connections
    DisableKeepAlives:   false,            // Keep-alive stays enabled (the default)
})
Timeout Configuration
Set appropriate timeouts to prevent hanging requests:
c := colly.NewCollector()

// Set an overall request timeout
c.SetRequestTimeout(30 * time.Second)

// Configure connection-level timeouts on the transport.
// (http.Transport has no DialTimeout field; dial timeouts come
// from a net.Dialer, so this also needs the "net" import.)
c.WithTransport(&http.Transport{
    DialContext: (&net.Dialer{
        Timeout: 10 * time.Second, // Connection establishment timeout
    }).DialContext,
    TLSHandshakeTimeout: 10 * time.Second,
})
Rate Limiting and Respectful Scraping
Intelligent Rate Limiting
Implement smart rate limiting that adapts to server responses:
c := colly.NewCollector()

// Basic rate limiting
c.Limit(&colly.LimitRule{
    DomainGlob:  "*",
    Parallelism: 2,
    Delay:       1 * time.Second,
})

// Non-2xx responses are routed to OnError by default,
// so handle 429 (Too Many Requests) there
c.OnError(func(r *colly.Response, err error) {
    if r.StatusCode == 429 {
        // Back off before retrying
        time.Sleep(5 * time.Second)
        r.Request.Retry()
    }
})

// Record each request's start time so latency can be measured
c.OnRequest(func(r *colly.Request) {
    r.Ctx.Put("start", time.Now())
})

// Monitor response times and slow down when the server struggles
c.OnResponse(func(r *colly.Response) {
    start, ok := r.Ctx.GetAny("start").(time.Time)
    if !ok {
        return
    }
    if time.Since(start) > 3*time.Second {
        // Server is slow: register a stricter rule for this host
        // (in production, add such a rule only once per host)
        c.Limit(&colly.LimitRule{
            DomainGlob: r.Request.URL.Host,
            Delay:      2 * time.Second,
        })
    }
})
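Many servers that return 429 also send a Retry-After header saying how long to wait, and honoring it is friendlier than a fixed sleep. A sketch assuming the header carries an integer number of seconds (it can also be an HTTP date), with "strconv" added to the imports:

c.OnError(func(r *colly.Response, err error) {
    if r.StatusCode != 429 {
        return
    }
    // Honor Retry-After when present; fall back to a fixed delay
    wait := 5 * time.Second
    if s := r.Headers.Get("Retry-After"); s != "" {
        if secs, convErr := strconv.Atoi(s); convErr == nil {
            wait = time.Duration(secs) * time.Second
        }
    }
    time.Sleep(wait)
    r.Request.Retry()
})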
Caching and Storage Optimization
Implementing Response Caching
Cache responses to avoid redundant requests:
import (
    "github.com/gocolly/redisstorage"
)

c := colly.NewCollector()

// Cache responses on disk so repeated runs skip unchanged pages
c.CacheDir = "./cache"

// For distributed scrapers, the separate github.com/gocolly/redisstorage
// package can share state across instances. Note that Colly's storage
// interface tracks visited URLs and cookies, not response bodies.
storage := &redisstorage.Storage{
    Address:  "127.0.0.1:6379",
    Password: "",
    DB:       0,
    Prefix:   "colly_cache",
}
if err := c.SetStorage(storage); err != nil {
    panic(err)
}
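Caching helps, but the cheapest request is the one you never send. Colly already skips URLs it has visited (unless AllowURLRevisit is set), and you can bound the crawl up front; the domains, depth, and URL pattern below are illustrative:

import "regexp"

c := colly.NewCollector(
    colly.AllowedDomains("example.com", "www.example.com"), // stay on target hosts
    colly.MaxDepth(2), // stop following links after two hops
    colly.URLFilters(
        regexp.MustCompile(`^https://example\.com/products/`), // only matching URLs are visited
    ),
)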
Selective Content Processing
Only process the content you need to improve performance:
import "strings"
c := colly.NewCollector()
// Only download HTML content, skip images/CSS/JS
c.OnRequest(func(r *colly.Request) {
contentType := r.Headers.Get("Content-Type")
if strings.Contains(contentType, "image/") ||
strings.Contains(contentType, "text/css") ||
strings.Contains(contentType, "application/javascript") {
r.Abort()
}
})
// Use specific selectors to minimize DOM parsing
c.OnHTML("div.content a[href]", func(e *colly.HTMLElement) {
// More specific selector = faster parsing
link := e.Attr("href")
e.Request.Visit(link)
})
Error Handling and Retry Logic
Efficient Error Recovery
Implement smart retry mechanisms to handle temporary failures:
c := colly.NewCollector()

// Track retry attempts per URL; the mutex matters in async mode,
// where OnError can fire from several goroutines at once
var mu sync.Mutex
retryCount := make(map[string]int)

c.OnError(func(r *colly.Response, err error) {
    url := r.Request.URL.String()

    mu.Lock()
    retryCount[url]++
    attempts := retryCount[url]
    mu.Unlock()

    // Retry with exponential backoff: 1s, 2s, ...
    if attempts < 3 {
        time.Sleep(time.Duration(1<<(attempts-1)) * time.Second)
        r.Request.Retry()
    } else {
        fmt.Printf("Failed after 3 retries: %s\n", url)
    }
})
Monitoring and Profiling
Performance Metrics Collection
Monitor your scraper's performance in real-time:
import (
    "fmt"
    "sync/atomic"
    "time"
)

var (
    requestCount  int64
    responseCount int64
    errorCount    int64
    startTime     = time.Now()
)

c := colly.NewCollector()

c.OnRequest(func(r *colly.Request) {
    atomic.AddInt64(&requestCount, 1)
})

c.OnResponse(func(r *colly.Response) {
    atomic.AddInt64(&responseCount, 1)
})

c.OnError(func(r *colly.Response, err error) {
    atomic.AddInt64(&errorCount, 1)
})

// Print stats periodically
go func() {
    for {
        time.Sleep(10 * time.Second)
        elapsed := time.Since(startTime)
        requests := atomic.LoadInt64(&requestCount)
        responses := atomic.LoadInt64(&responseCount)
        errors := atomic.LoadInt64(&errorCount)
        fmt.Printf("Stats: %d requests, %d responses, %d errors in %v\n",
            requests, responses, errors, elapsed)
        fmt.Printf("Rate: %.2f requests/second\n",
            float64(requests)/elapsed.Seconds())
    }
}()
Memory Profiling with Go Tools
Use Go's built-in profiling tools to identify performance bottlenecks:
# Inspect heap allocations
go tool pprof http://localhost:6060/debug/pprof/heap

# Capture a 30-second CPU profile
go tool pprof http://localhost:6060/debug/pprof/profile

# Check goroutine usage
go tool pprof http://localhost:6060/debug/pprof/goroutine
Add profiling endpoints to your scraper:
import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers /debug/pprof handlers on the default mux
)

func main() {
    // Start the profiling server on a side port
    go func() {
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()

    // Your scraping code here
}
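If you only need coarse numbers without attaching pprof, you can also sample the runtime's own counters on an interval; the fields and the interval shown here are just one reasonable choice:

import (
    "fmt"
    "runtime"
    "time"
)

// logRuntimeStats prints heap usage and goroutine counts at a fixed interval.
func logRuntimeStats(interval time.Duration) {
    var m runtime.MemStats
    for {
        time.Sleep(interval)
        runtime.ReadMemStats(&m)
        fmt.Printf("heap=%d MiB, goroutines=%d, GC cycles=%d\n",
            m.HeapAlloc/1024/1024, runtime.NumGoroutine(), m.NumGC)
    }
}

// Usage: go logRuntimeStats(10 * time.Second)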
Comparison with Other Tools
When weighing performance alternatives, it is worth knowing what else exists. For JavaScript-heavy sites that require browser automation, tools like Puppeteer take a different approach, with their own mechanisms for handling timeouts and running multiple pages in parallel, and may be a better fit for those use cases.
Resource Management Best Practices
CPU Optimization
import "runtime"
func optimizeForCPU() {
// Set GOMAXPROCS to match available CPU cores
numCPU := runtime.NumCPU()
runtime.GOMAXPROCS(numCPU)
// For I/O heavy workloads, you might want more goroutines
c.Limit(&colly.LimitRule{
DomainGlob: "*",
Parallelism: numCPU * 2, // 2x CPU cores for I/O bound tasks
})
}
Network Optimization
// Configure network settings and attach them to the collector.
// Colly manages its own HTTP client internally, so hand it the
// transport rather than building a separate http.Client
c.WithTransport(&http.Transport{
    MaxIdleConns:        100,
    MaxIdleConnsPerHost: 30,
    IdleConnTimeout:     90 * time.Second,
    DisableCompression:  false, // Compression stays enabled
    DisableKeepAlives:   false, // Keep-alive stays enabled
})
c.SetRequestTimeout(30 * time.Second)
Best Practices Summary
- Start Conservative: Begin with low parallelism (1-2) and gradually increase based on server response
- Monitor Resource Usage: Track memory, CPU, and network usage during scraping
- Implement Proper Error Handling: Use exponential backoff and reasonable retry limits
- Cache Intelligently: Store responses when appropriate to avoid redundant requests
- Process Data Immediately: Don't accumulate large amounts of data in memory
- Respect Server Limits: Implement delays and respect robots.txt files (see the sketch after this list)
- Use Appropriate Timeouts: Set reasonable connection and request timeouts
- Profile Your Code: Use Go's built-in profiling tools to identify bottlenecks
- Optimize Selectors: Use specific CSS selectors to minimize DOM parsing overhead
- Handle Rate Limits: Implement adaptive rate limiting based on server responses
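Several of these practices compose naturally into one collector setup. A minimal sketch with an illustrative target domain; note that Colly ignores robots.txt by default, so compliance has to be opted into via IgnoreRobotsTxt:

package main

import (
    "time"

    "github.com/gocolly/colly/v2"
)

func main() {
    c := colly.NewCollector(
        colly.UserAgent("my-scraper/1.0 (+https://example.com/contact)"), // identify yourself
        colly.MaxDepth(2), // bound the crawl
    )
    c.Async = true
    c.IgnoreRobotsTxt = false             // opt in to robots.txt compliance
    c.SetRequestTimeout(30 * time.Second) // reasonable request timeout

    // Start conservative; raise Parallelism only after observing server behavior
    c.Limit(&colly.LimitRule{
        DomainGlob:  "*",
        Parallelism: 2,
        Delay:       1 * time.Second,
    })

    c.Visit("https://example.com")
    c.Wait()
}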
Common Performance Pitfalls
Memory Leaks
// Avoid: Storing references to large objects
var responses []*colly.Response
c.OnResponse(func(r *colly.Response) {
    responses = append(responses, r) // Memory leak!
})

// Better: Process and discard
c.OnResponse(func(r *colly.Response) {
    processResponse(r)
    // The response can be garbage collected after the callback returns
})
Inefficient Selectors
// Avoid: Broad selectors that match many elements
c.OnHTML("*", func(e *colly.HTMLElement) {
// This is very slow!
})
// Better: Specific selectors
c.OnHTML("article.post h2.title", func(e *colly.HTMLElement) {
// Much faster and more targeted
})
By following these performance considerations, you can build efficient, scalable, and respectful web scrapers with Colly that handle large-scale data extraction while maintaining optimal resource usage and server relationships.