What are the Performance Considerations When Using Colly?

Colly is a powerful and fast web scraping framework for Go, but achieving optimal performance requires understanding and implementing several key considerations. Whether you're scraping a few pages or millions of URLs, these performance optimizations can significantly impact your scraper's efficiency and resource usage.

Parallelism and Concurrency Control

Async Mode Configuration

One of Colly's most powerful features is its ability to handle concurrent requests. By default, Colly operates synchronously, but enabling async mode can dramatically improve performance:

package main

import (
    "fmt"
    "time"

    "github.com/gocolly/colly/v2"
    "github.com/gocolly/colly/v2/debug"
)

func main() {
    c := colly.NewCollector(
        colly.Debugger(&debug.LogDebugger{}),
    )

    // Enable async mode so requests are processed concurrently
    c.Async = true

    // Limit concurrent requests to avoid overwhelming servers
    err := c.Limit(&colly.LimitRule{
        DomainGlob:  "*",
        Parallelism: 2,               // Number of concurrent requests
        Delay:       1 * time.Second, // Delay between requests
    })
    if err != nil {
        fmt.Println("failed to set limit rule:", err)
    }

    c.OnHTML("a[href]", func(e *colly.HTMLElement) {
        link := e.Attr("href")
        fmt.Println("found link:", link)
        e.Request.Visit(link)
    })

    c.Visit("https://example.com")

    // Wait for all async requests to complete
    c.Wait()
}
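Colly's functional options offer an equivalent way to configure the collector at construction time. The sketch below shows the same async setup built with options; the depth limit and domain restriction are illustrative additions, not part of the example above:

// Equivalent configuration using collector options
c := colly.NewCollector(
    colly.Async(true),                   // enable async mode at construction time
    colly.MaxDepth(2),                   // illustrative: stop following links after two hops
    colly.AllowedDomains("example.com"), // illustrative: keep the crawl on one domain
)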

Optimal Parallelism Settings

Finding the right parallelism level is crucial for performance:

// Conservative approach for respectful scraping
c.Limit(&colly.LimitRule{
    DomainGlob:  "*httpbin.org*",
    Parallelism: 2,
    Delay:       500 * time.Millisecond,
})

// Aggressive approach for internal APIs or with permission
c.Limit(&colly.LimitRule{
    DomainGlob:  "*internal-api.com*",
    Parallelism: 10,
    Delay:       100 * time.Millisecond,
})

// Different rules for different domains
c.Limit(&colly.LimitRule{
    DomainGlob:  "*social-media.com*",
    Parallelism: 1, // Strict rate limiting
    Delay:       2 * time.Second,
})

Memory Management Optimization

Request Queue Management

Colly maintains an internal queue of requests that can consume significant memory for large scraping operations:

c := colly.NewCollector()

// Cap concurrent requests to keep memory usage predictable
c.Limit(&colly.LimitRule{
    DomainGlob:  "*",
    Parallelism: 4, // Balance between speed and memory
})

// Use callbacks to process data immediately rather than storing
c.OnHTML(".product", func(e *colly.HTMLElement) {
    // Process and save data immediately
    product := Product{
        Name:  e.ChildText(".name"),
        Price: e.ChildText(".price"),
    }

    // Save to database or file immediately
    saveProduct(product)

    // Don't store in memory for later processing
})
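For very large crawls, Colly's separate queue package is another way to keep the backlog of pending requests under control: it lets you bound the in-memory queue and feed URLs to a fixed number of consumers. A minimal sketch, with illustrative URLs and sizes:

import (
    "github.com/gocolly/colly/v2"
    "github.com/gocolly/colly/v2/queue"
)

func main() {
    c := colly.NewCollector()

    // 4 consumer goroutines and a bounded in-memory backlog of 10000 requests
    q, err := queue.New(4, &queue.InMemoryQueueStorage{MaxSize: 10000})
    if err != nil {
        panic(err)
    }

    // Enqueue seed URLs instead of calling c.Visit directly
    q.AddURL("https://example.com/page-1")
    q.AddURL("https://example.com/page-2")

    // Run blocks until the queue has been drained
    q.Run(c)
}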

Efficient Data Processing

Process scraped data immediately to avoid memory accumulation:

// Bad: Storing all data in memory
var products []Product

c.OnHTML(".product", func(e *colly.HTMLElement) {
    products = append(products, Product{
        Name: e.ChildText(".name"),
    })
})

// Good: Process data immediately
c.OnHTML(".product", func(e *colly.HTMLElement) {
    product := Product{
        Name: e.ChildText(".name"),
    }

    // Stream to file or database
    writeToCSV(product)
    // or
    insertToDatabase(product)
})
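As a concrete version of the streaming approach, the sketch below writes each product straight to a CSV file instead of keeping it in memory. The Product type and the output file name are assumptions for this example, and the mutex guards the writer because callbacks can run concurrently in async mode:

import (
    "encoding/csv"
    "os"
    "sync"

    "github.com/gocolly/colly/v2"
)

// Product is a hypothetical record type used only for this example
type Product struct {
    Name  string
    Price string
}

func main() {
    f, err := os.Create("products.csv")
    if err != nil {
        panic(err)
    }
    defer f.Close()

    w := csv.NewWriter(f)
    defer w.Flush()

    var mu sync.Mutex // serialize writes from concurrent callbacks

    c := colly.NewCollector(colly.Async(true))

    c.OnHTML(".product", func(e *colly.HTMLElement) {
        p := Product{
            Name:  e.ChildText(".name"),
            Price: e.ChildText(".price"),
        }

        // Stream each row immediately; nothing accumulates in memory
        mu.Lock()
        w.Write([]string{p.Name, p.Price})
        mu.Unlock()
    })

    c.Visit("https://example.com/products")
    c.Wait()
}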

HTTP Connection Optimization

Connection Pooling and Keep-Alive

Configure HTTP transport settings for better connection reuse:

import (
    "net/http"
    "time"
)

c := colly.NewCollector()

// Configure a custom HTTP transport and attach it to the collector.
// Keep-alive is on by default as long as DisableKeepAlives is false.
c.WithTransport(&http.Transport{
    MaxIdleConns:        100,              // Total idle connections
    MaxIdleConnsPerHost: 10,               // Idle connections per host
    IdleConnTimeout:     90 * time.Second, // How long to keep idle connections open
    DisableKeepAlives:   false,            // Keep connections alive for reuse
})

Timeout Configuration

Set appropriate timeouts to prevent hanging requests:

c := colly.NewCollector()

// Set request timeout
c.SetRequestTimeout(30 * time.Second)

// Configure transport-level timeouts; the dial timeout belongs on net.Dialer
// (requires the "net" and "net/http" imports)
c.WithTransport(&http.Transport{
    DialContext: (&net.Dialer{
        Timeout: 10 * time.Second, // time allowed to establish the TCP connection
    }).DialContext,
    TLSHandshakeTimeout: 10 * time.Second,
})

Rate Limiting and Respectful Scraping

Intelligent Rate Limiting

Implement smart rate limiting that adapts to server responses:

c := colly.NewCollector()

// Basic rate limiting
c.Limit(&colly.LimitRule{
    DomainGlob:  "*",
    Parallelism: 2,
    Delay:       1 * time.Second,
})

// Back off when the server signals rate limiting. Colly routes non-2xx
// responses to OnError by default, so handle 429 there.
c.OnError(func(r *colly.Response, err error) {
    if r.StatusCode == 429 { // Too Many Requests
        // Pause before retrying; note this blocks one worker
        time.Sleep(5 * time.Second)
        r.Request.Retry()
    }
})

// Monitor response times by stamping each request with a start time
c.OnRequest(func(r *colly.Request) {
    r.Ctx.Put("start", time.Now())
})

c.OnResponse(func(r *colly.Response) {
    if start, ok := r.Ctx.GetAny("start").(time.Time); ok {
        if elapsed := time.Since(start); elapsed > 3*time.Second {
            // Server is slow; consider using a longer Delay in your LimitRule
            fmt.Printf("slow response (%v) from %s\n", elapsed, r.Request.URL.Host)
        }
    }
})
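Some servers send a Retry-After header with 429 responses. A minimal sketch that honors it, falling back to a fixed delay when the header is missing or not a plain number (this would replace the fixed back-off handler above and needs the strconv import):

c.OnError(func(r *colly.Response, err error) {
    if r.StatusCode != 429 {
        return
    }

    // Prefer the server's Retry-After hint when it is present
    wait := 5 * time.Second
    if retryAfter := r.Headers.Get("Retry-After"); retryAfter != "" {
        if seconds, convErr := strconv.Atoi(retryAfter); convErr == nil {
            wait = time.Duration(seconds) * time.Second
        }
    }

    time.Sleep(wait)
    r.Request.Retry()
})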

Caching and Storage Optimization

Implementing Response Caching

Cache responses to avoid redundant requests:

import (
    "github.com/gocolly/colly/v2"
    "github.com/gocolly/redisstorage"
)

c := colly.NewCollector()

// Cache successful GET responses on disk so repeated runs skip the network
c.CacheDir = "./cache"

// For distributed scrapers, a Redis-backed storage shares visited-URL
// tracking and cookies across instances (it does not cache response bodies)
redisStore := &redisstorage.Storage{
    Address:  "127.0.0.1:6379",
    Password: "",
    DB:       0,
    Prefix:   "colly_cache",
}

if err := c.SetStorage(redisStore); err != nil {
    panic(err)
}

Selective Content Processing

Only process the content you need to improve performance:

import "strings"

c := colly.NewCollector()

// Only download HTML content, skip images/CSS/JS
c.OnRequest(func(r *colly.Request) {
    contentType := r.Headers.Get("Content-Type")
    if strings.Contains(contentType, "image/") ||
       strings.Contains(contentType, "text/css") ||
       strings.Contains(contentType, "application/javascript") {
        r.Abort()
    }
})

// Use specific selectors to minimize DOM parsing
c.OnHTML("div.content a[href]", func(e *colly.HTMLElement) {
    // More specific selector = faster parsing
    link := e.Attr("href")
    e.Request.Visit(link)
})
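URL filtering can also happen at the collector level, so unwanted requests are never scheduled at all. A short sketch using the DisallowedURLFilters option; the patterns are illustrative:

import (
    "regexp"

    "github.com/gocolly/colly/v2"
)

// Never visit URLs that look like static assets
c := colly.NewCollector(
    colly.DisallowedURLFilters(
        regexp.MustCompile(`\.(png|jpe?g|gif|svg|css|js)$`),
    ),
)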

Error Handling and Retry Logic

Efficient Error Recovery

Implement smart retry mechanisms to handle temporary failures:

c := colly.NewCollector()

// Track retry attempts; the mutex matters because async mode can run callbacks concurrently
var mu sync.Mutex // requires the "sync" import
retryCount := make(map[string]int)

c.OnError(func(r *colly.Response, err error) {
    url := r.Request.URL.String()

    mu.Lock()
    retryCount[url]++
    attempts := retryCount[url]
    mu.Unlock()

    // Retry with exponential backoff (1s, 2s, then give up)
    if attempts < 3 {
        time.Sleep(time.Duration(1<<(attempts-1)) * time.Second)
        r.Request.Retry()
    } else {
        fmt.Printf("Failed after 3 retries: %s\n", url)
    }
})

Monitoring and Profiling

Performance Metrics Collection

Monitor your scraper's performance in real-time:

import (
    "sync/atomic"
    "time"
)

var (
    requestCount  int64
    responseCount int64
    errorCount    int64
    startTime     = time.Now()
)

c := colly.NewCollector()

c.OnRequest(func(r *colly.Request) {
    atomic.AddInt64(&requestCount, 1)
})

c.OnResponse(func(r *colly.Response) {
    atomic.AddInt64(&responseCount, 1)
})

c.OnError(func(r *colly.Response, err error) {
    atomic.AddInt64(&errorCount, 1)
})

// Print stats periodically
go func() {
    for {
        time.Sleep(10 * time.Second)
        elapsed := time.Since(startTime)
        requests := atomic.LoadInt64(&requestCount)
        responses := atomic.LoadInt64(&responseCount)
        errors := atomic.LoadInt64(&errorCount)

        fmt.Printf("Stats: %d requests, %d responses, %d errors in %v\n",
            requests, responses, errors, elapsed)
        fmt.Printf("Rate: %.2f requests/second\n",
            float64(requests)/elapsed.Seconds())
    }
}()

Memory Profiling with Go Tools

Use Go's built-in profiling tools to identify performance bottlenecks:

# Inspect heap allocations (requires the pprof endpoint shown below)
go tool pprof http://localhost:6060/debug/pprof/heap

# Monitor CPU usage
go tool pprof http://localhost:6060/debug/pprof/profile

# Check goroutine usage
go tool pprof http://localhost:6060/debug/pprof/goroutine

Add profiling endpoints to your scraper:

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers the /debug/pprof handlers on the default mux
)

func main() {
    // Start the profiling server alongside the scraper
    go func() {
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()

    // Your scraping code here
}

Comparison with Other Tools

When considering performance alternatives, you might want to explore other scraping solutions. For JavaScript-heavy sites that require browser automation, tools like Puppeteer take a different approach, with their own ways of handling timeouts and running multiple pages in parallel, which may be more suitable for certain use cases.

Resource Management Best Practices

CPU Optimization

import "runtime"

func optimizeForCPU() {
    // Set GOMAXPROCS to match available CPU cores
    numCPU := runtime.NumCPU()
    runtime.GOMAXPROCS(numCPU)

    // For I/O heavy workloads, you might want more goroutines
    c.Limit(&colly.LimitRule{
        DomainGlob:  "*",
        Parallelism: numCPU * 2, // 2x CPU cores for I/O bound tasks
    })
}

Network Optimization

// Configure network settings and attach them to the collector
transport := &http.Transport{
    MaxIdleConns:        100,
    MaxIdleConnsPerHost: 30,
    IdleConnTimeout:     90 * time.Second,
    DisableCompression:  false, // Keep gzip compression enabled
    DisableKeepAlives:   false, // Keep connections alive for reuse
}

c.WithTransport(transport)
c.SetRequestTimeout(30 * time.Second)

Best Practices Summary

  1. Start Conservative: Begin with low parallelism (1-2) and gradually increase based on server response
  2. Monitor Resource Usage: Track memory, CPU, and network usage during scraping
  3. Implement Proper Error Handling: Use exponential backoff and reasonable retry limits
  4. Cache Intelligently: Store responses when appropriate to avoid redundant requests
  5. Process Data Immediately: Don't accumulate large amounts of data in memory
  6. Respect Server Limits: Implement delays and respect robots.txt files
  7. Use Appropriate Timeouts: Set reasonable connection and request timeouts
  8. Profile Your Code: Use Go's built-in profiling tools to identify bottlenecks
  9. Optimize Selectors: Use specific CSS selectors to minimize DOM parsing overhead
  10. Handle Rate Limits: Implement adaptive rate limiting based on server responses
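As a starting point, the sketch below combines several of these practices in one collector setup. The domain, limits, and timeout values are illustrative defaults to tune against your own targets:

c := colly.NewCollector(
    colly.AllowedDomains("example.com"), // illustrative: scope the crawl
)

// Conservative limits to start with; raise them once the target proves healthy
c.Limit(&colly.LimitRule{
    DomainGlob:  "*example.com*",
    Parallelism: 2,
    Delay:       1 * time.Second,
    RandomDelay: 500 * time.Millisecond, // jitter so requests are not perfectly periodic
})

// A reasonable request timeout keeps stalled connections from piling up
c.SetRequestTimeout(30 * time.Second)

// Log failures so retry logic and rate limits can be tuned from real data
c.OnError(func(r *colly.Response, err error) {
    fmt.Printf("request to %s failed: %v\n", r.Request.URL, err)
})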

Common Performance Pitfalls

Memory Leaks

// Avoid: Storing references to large objects
var responses []*colly.Response

c.OnResponse(func(r *colly.Response) {
    responses = append(responses, r) // Memory leak!
})

// Better: Process and discard
c.OnResponse(func(r *colly.Response) {
    processResponse(r)
    // Response is garbage collected after callback
})

Inefficient Selectors

// Avoid: Broad selectors that match many elements
c.OnHTML("*", func(e *colly.HTMLElement) {
    // This is very slow!
})

// Better: Specific selectors
c.OnHTML("article.post h2.title", func(e *colly.HTMLElement) {
    // Much faster and more targeted
})

By following these performance considerations, you can build efficient, scalable, and respectful web scrapers with Colly that handle large-scale data extraction while maintaining optimal resource usage and server relationships.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
