What are the Memory Usage Patterns of Colly for Large-Scale Scraping?

Understanding Colly's memory usage patterns is crucial for building efficient large-scale web scraping applications in Go. This comprehensive guide explores how Colly manages memory, common memory bottlenecks, and optimization strategies for production deployments.

Colly's Core Memory Architecture

Colly is designed with memory efficiency in mind, but its memory footprint depends heavily on how it is configured and used. Several key components affect memory consumption:

Request Queue Management

Colly maintains an internal request queue that can grow significantly during large-scale operations:

package main

import (
    "fmt"
    "runtime"
    "time"

    "github.com/gocolly/colly/v2"
    "github.com/gocolly/colly/v2/debug"
)

func main() {
    c := colly.NewCollector(
        colly.Async(true), // run requests concurrently so the Parallelism limit below takes effect
        colly.Debugger(&debug.LogDebugger{}),
    )

    // Monitor memory usage
    go monitorMemory()

    // Set reasonable limits
    c.Limit(&colly.LimitRule{
        DomainGlob:  "*",
        Parallelism: 2,
        Delay:       100 * time.Millisecond,
    })

    c.OnHTML("a[href]", func(e *colly.HTMLElement) {
        link := e.Attr("href")
        e.Request.Visit(link)
    })

    c.OnRequest(func(r *colly.Request) {
        fmt.Printf("Visiting: %s\n", r.URL.String())
    })

    c.Visit("https://example.com")
    c.Wait()
}

func monitorMemory() {
    ticker := time.NewTicker(5 * time.Second)
    defer ticker.Stop()

    for range ticker.C {
        var m runtime.MemStats
        runtime.ReadMemStats(&m)
        fmt.Printf("Memory: Alloc=%d KB, Sys=%d KB, NumGC=%d\n",
            m.Alloc/1024, m.Sys/1024, m.NumGC)
    }
}

DOM Storage and Processing

Each scraped page creates DOM objects that consume memory proportional to page size:

// Memory-efficient HTML processing
c.OnHTML("div.content", func(e *colly.HTMLElement) {
    // Extract only necessary data immediately
    title := e.ChildText("h1")
    description := e.ChildText("p.description")

    // Process and store data immediately
    processData(title, description)

    // Don't store large DOM elements in memory
})

// Avoid storing entire elements
var elements []*colly.HTMLElement // This can cause memory leaks

// Instead, extract data immediately
type ScrapedData struct {
    Title       string
    Description string
    URL         string
}

var results []ScrapedData

Memory Usage Patterns by Scale

Small-Scale Scraping (< 1000 pages)

For small-scale operations, Colly's default configuration typically uses 10-50 MB of memory:

func smallScaleScraper() {
    c := colly.NewCollector()

    var pageCount int
    c.OnResponse(func(r *colly.Response) {
        pageCount++
        if pageCount%100 == 0 {
            runtime.GC() // Optional: force a collection; rarely needed, as Go's GC usually keeps up on its own
        }
    })

    // Memory usage remains stable
    urls := []string{
        "https://example1.com",
        "https://example2.com",
        // ... up to 1000 URLs
    }

    for _, url := range urls {
        c.Visit(url)
    }
}

Medium-Scale Scraping (1,000 - 100,000 pages)

Medium-scale operations require careful memory management:

func mediumScaleScraper() {
    c := colly.NewCollector()

    // Implement request limiting
    c.Limit(&colly.LimitRule{
        DomainGlob:  "*",
        Parallelism: 5,
        Delay:       200 * time.Millisecond,
    })

    // Set reasonable timeouts
    c.SetRequestTimeout(30 * time.Second)

    // Cache responses on disk to avoid re-downloading
    // (note: Colly does not cap the size of this directory)
    c.CacheDir = "./colly_cache"

    // Process data in batches
    var batch []ScrapedData
    const batchSize = 1000

    c.OnHTML("div.item", func(e *colly.HTMLElement) {
        data := ScrapedData{
            Title: e.ChildText("h2"),
            URL:   e.Request.URL.String(),
        }

        batch = append(batch, data)

        if len(batch) >= batchSize {
            processBatch(batch)
            batch = batch[:0] // Reset slice but keep capacity
        }
    })
}
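
One detail to watch: when the crawl ends, the last batch is usually smaller than batchSize and would otherwise never be written. A minimal sketch of flushing it once all visits have completed, using the same batch slice and processBatch function as above:

// After the final Visit (or c.Wait()) returns, flush the remaining partial batch
if len(batch) > 0 {
    processBatch(batch)
    batch = nil
}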

Large-Scale Scraping (> 100,000 pages)

Large-scale operations demand sophisticated memory management strategies:

func largeScaleScraper() {
    c := colly.NewCollector()

    // Aggressive memory optimization
    c.Limit(&colly.LimitRule{
        DomainGlob:  "*",
        Parallelism: 10,
        Delay:       100 * time.Millisecond,
    })

    // Keep URL deduplication enabled (false is the default); at very large
    // scales the visited-URL set itself grows, and can be offloaded to
    // external storage (see the sketch after this function)
    c.AllowURLRevisit = false

    // Use streaming data processing
    dataChan := make(chan ScrapedData, 1000)

    // Background data processor
    go func() {
        for data := range dataChan {
            // Process immediately, don't accumulate in memory
            saveToDatabase(data)
        }
    }()

    c.OnHTML("div.content", func(e *colly.HTMLElement) {
        data := ScrapedData{
            Title:       e.ChildText("h1"),
            Description: e.ChildText("p"),
            URL:         e.Request.URL.String(),
        }

        select {
        case dataChan <- data:
        default:
            // Channel full, apply backpressure
            time.Sleep(10 * time.Millisecond)
            dataChan <- data
        }
    })

    // Implement periodic garbage collection
    go func() {
        ticker := time.NewTicker(30 * time.Second)
        defer ticker.Stop()
        for range ticker.C {
            runtime.GC()
            debug.FreeOSMemory() // runtime/debug, not Colly's debug package
        }
    }()
}
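
To keep the visited-URL set itself from growing in process memory, deduplication state can be moved to an external store. Below is a minimal sketch using the gocolly/redisstorage adapter; the Redis address, DB, and key prefix are placeholder assumptions:

import (
    "log"

    "github.com/gocolly/colly/v2"
    "github.com/gocolly/redisstorage"
)

func useRedisStorage(c *colly.Collector) {
    // Placeholder connection details; adjust for your environment
    storage := &redisstorage.Storage{
        Address: "127.0.0.1:6379",
        DB:      0,
        Prefix:  "colly_visited",
    }

    // Replace Colly's in-memory visited-URL and cookie storage with Redis
    if err := c.SetStorage(storage); err != nil {
        log.Fatal(err)
    }
}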

Common Memory Bottlenecks

Request Queue Overflow

The most common memory issue occurs when the request queue grows faster than requests are processed:

// Problem: Unbounded request generation
c.OnHTML("a[href]", func(e *colly.HTMLElement) {
    // This can generate thousands of requests instantly
    e.Request.Visit(e.Attr("href"))
})

// Solution: Implement queue size limits
type LimitedCollector struct {
    *colly.Collector
    queueSize int32 // manipulated with sync/atomic
    maxQueue  int32
}

func NewLimitedCollector(maxQueue int32) *LimitedCollector {
    lc := &LimitedCollector{Collector: colly.NewCollector(), maxQueue: maxQueue}

    // Decrement the counter once a request finishes, successfully or not
    lc.OnScraped(func(_ *colly.Response) { atomic.AddInt32(&lc.queueSize, -1) })
    lc.OnError(func(_ *colly.Response, _ error) { atomic.AddInt32(&lc.queueSize, -1) })

    return lc
}

func (lc *LimitedCollector) LimitedVisit(url string) error {
    if atomic.LoadInt32(&lc.queueSize) >= lc.maxQueue {
        return fmt.Errorf("queue full, skipping %s", url)
    }

    atomic.AddInt32(&lc.queueSize, 1)
    if err := lc.Visit(url); err != nil {
        atomic.AddInt32(&lc.queueSize, -1) // the request was never enqueued
        return err
    }
    return nil
}
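
Usage then mirrors a normal collector, except that link discovery goes through LimitedVisit so excess links are dropped instead of queued (the limit of 10,000 is an arbitrary example value):

lc := NewLimitedCollector(10000)

lc.OnHTML("a[href]", func(e *colly.HTMLElement) {
    // When the queue is full, the error is ignored and the link is simply skipped
    _ = lc.LimitedVisit(e.Request.AbsoluteURL(e.Attr("href")))
})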

DOM Element Accumulation

Storing references to DOM elements causes memory leaks:

// Memory leak: storing DOM elements
var allElements []*colly.HTMLElement

c.OnHTML("div", func(e *colly.HTMLElement) {
    allElements = append(allElements, e) // Don't do this!
})

// Correct approach: extract data immediately
var extractedData []string

c.OnHTML("div", func(e *colly.HTMLElement) {
    data := e.Text
    extractedData = append(extractedData, data) // Extract data only
})

Response Body Caching

Colly's built-in CacheDir cache writes responses to disk rather than holding them in memory, so the risk there is unbounded disk growth; a hand-rolled in-memory cache, on the other hand, can consume significant memory if it has no eviction policy:

// Cache responses on disk (note: Colly does not enforce a size limit on this directory)
c.CacheDir = "./cache"

// Or disable caching for large-scale operations
c.OnRequest(func(r *colly.Request) {
    r.Headers.Set("Cache-Control", "no-cache")
})

// Implement custom caching with LRU eviction
import "github.com/hashicorp/golang-lru"

cache, _ := lru.New(1000) // Limit to 1000 cached responses

c.OnResponse(func(r *colly.Response) {
    if len(r.Body) < 1024*1024 { // Only cache responses < 1MB
        cache.Add(r.Request.URL.String(), r.Body)
    }
})

Memory Optimization Strategies

Streaming Data Processing

Process data as it arrives rather than accumulating in memory:

func streamingProcessor() {
    c := colly.NewCollector()

    // Create buffered channel for streaming
    type StreamData struct {
        URL   string
        Title string
        Body  []byte
    }

    stream := make(chan StreamData, 100)

    // Background processor: drain the stream and persist each item immediately.
    // Backpressure comes from the buffered channel itself: once it is full,
    // the send in OnResponse blocks until this goroutine catches up.
    go func() {
        for data := range stream {
            processAndSave(data)
        }
    }()

    c.OnResponse(func(r *colly.Response) {
        // Each queued item holds a full response body, so peak memory is
        // roughly the channel buffer size times the average body size
        stream <- StreamData{
            URL:  r.Request.URL.String(),
            Body: r.Body,
        }
    })
}
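
One loose end in this sketch: the background goroutine exits only when the channel is closed, so close it at the end of streamingProcessor once the crawl has finished (the seed URL below is a placeholder):

c.Visit("https://example.com") // placeholder seed URL
c.Wait()                       // no-op for a synchronous collector; required if colly.Async(true) is set
close(stream)                  // lets the background processor drain remaining items and exit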

Memory Pool Usage

Implement object pooling for frequently allocated objects:

import "sync"

var dataPool = sync.Pool{
    New: func() interface{} {
        return &ScrapedData{}
    },
}

c.OnHTML("div.item", func(e *colly.HTMLElement) {
    data := dataPool.Get().(*ScrapedData)
    defer dataPool.Put(data)

    // Reset data
    *data = ScrapedData{}

    // Populate data
    data.Title = e.ChildText("h1")
    data.URL = e.Request.URL.String()

    // Process synchronously; processData must not retain the pointer,
    // because the object returns to the pool when this handler exits
    processData(data)
})

Garbage Collection Tuning

Optimize Go's garbage collector for scraping workloads:

import "runtime/debug"

func optimizeGC() {
    // Let the heap grow 200% over the live set before each collection
    // (fewer GC cycles at the cost of higher peak memory)
    debug.SetGCPercent(200)

    // Set memory limit (Go 1.19+)
    debug.SetMemoryLimit(2 << 30) // 2GB limit

    // Periodic forced collection for long-running scrapers
    go func() {
        ticker := time.NewTicker(5 * time.Minute)
        defer ticker.Stop()
        for range ticker.C {
            runtime.GC()
            debug.FreeOSMemory()
        }
    }()
}

Production Deployment Considerations

Container Memory Limits

When deploying in containers, configure appropriate memory limits:

# Dockerfile
FROM golang:1.21-alpine AS builder
WORKDIR /app
COPY . .
RUN go build -o scraper

FROM alpine:latest
RUN apk --no-cache add ca-certificates
WORKDIR /root/
COPY --from=builder /app/scraper .

# Tune the Go garbage collector and set a soft memory limit
ENV GOGC=200
ENV GOMEMLIMIT=1GiB

CMD ["./scraper"]
# Kubernetes deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: colly-scraper
spec:
  selector:
    matchLabels:
      app: colly-scraper
  template:
    metadata:
      labels:
        app: colly-scraper
    spec:
      containers:
      - name: scraper
        image: colly-scraper:latest
        resources:
          requests:
            memory: "512Mi"
          limits:
            memory: "2Gi"
        env:
        - name: GOGC
          value: "200"
        - name: GOMEMLIMIT
          value: "1800MiB"

Monitoring and Alerting

Implement comprehensive memory monitoring:

import (
    "net/http"
    "runtime"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
    memoryUsage = prometheus.NewGaugeVec(
        prometheus.GaugeOpts{
            Name: "colly_memory_usage_bytes",
            Help: "Current memory usage in bytes",
        },
        []string{"type"},
    )

    pagesProcessed = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "colly_pages_processed_total",
            Help: "Total number of pages processed",
        },
        []string{"status"},
    )
)

func startMetricsServer() {
    prometheus.MustRegister(memoryUsage, pagesProcessed)

    go func() {
        ticker := time.NewTicker(10 * time.Second)
        defer ticker.Stop()

        for range ticker.C {
            var m runtime.MemStats
            runtime.ReadMemStats(&m)

            memoryUsage.WithLabelValues("alloc").Set(float64(m.Alloc))
            memoryUsage.WithLabelValues("sys").Set(float64(m.Sys))
            memoryUsage.WithLabelValues("heap").Set(float64(m.HeapAlloc))
        }
    }()

    http.Handle("/metrics", promhttp.Handler())
    http.ListenAndServe(":8080", nil)
}
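
The pagesProcessed counter above is registered but never incremented; wiring it into the collector's callbacks is straightforward (assuming the collector variable is named c, as in the earlier examples):

c.OnScraped(func(r *colly.Response) {
    pagesProcessed.WithLabelValues("success").Inc()
})

c.OnError(func(r *colly.Response, err error) {
    pagesProcessed.WithLabelValues("error").Inc()
})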

Console Commands for Memory Profiling

Monitor and analyze memory usage during development:

# Generate a heap profile; -memprofile works with the test harness
# (go run does not support it), e.g. via the benchmark shown later
go test -bench=. -memprofile=mem.prof

# Analyze memory profile
go tool pprof mem.prof

# View top memory consumers
(pprof) top

# View memory allocations by function
(pprof) list main.main

# Generate memory usage graph
go tool pprof -png mem.prof > memory_usage.png
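
For profiling a long-running scraper outside the test harness, another common option is exposing Go's built-in pprof endpoint and pulling profiles over HTTP (a minimal sketch; the port is an arbitrary choice):

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers /debug/pprof handlers on the default mux
)

func startPprofServer() {
    go func() {
        // Capture a live heap profile with:
        //   go tool pprof http://localhost:6060/debug/pprof/heap
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()
}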

Monitor runtime memory statistics:

# Use runtime monitoring tools
watch -n 1 'ps aux | grep your_scraper | grep -v grep'

# Monitor with htop for detailed process information
htop -p $(pgrep your_scraper)

# Use Docker stats for containerized scrapers
docker stats colly-scraper

Testing Memory Performance

Create benchmarks to measure memory usage patterns:

func BenchmarkCollyMemory(b *testing.B) {
    b.ReportAllocs() // report per-iteration allocations alongside timing

    for i := 0; i < b.N; i++ {
        c := colly.NewCollector()

        var data []ScrapedData
        c.OnHTML("div", func(e *colly.HTMLElement) {
            data = append(data, ScrapedData{
                Title: e.Text,
                URL:   e.Request.URL.String(),
            })
        })

        c.Visit("https://httpbin.org/html")
        c.Wait()
    }
}

// Run benchmark with memory profiling
// go test -bench=BenchmarkCollyMemory -memprofile=bench_mem.prof

Conclusion

Colly's memory usage patterns are highly dependent on your scraping strategy and configuration. For large-scale operations, focus on streaming data processing, proper resource limits, and continuous monitoring. Key strategies include implementing request queue limits, avoiding DOM element accumulation, using object pools, and optimizing garbage collection settings.

By understanding these patterns and implementing the optimization techniques discussed, you can build efficient, scalable web scraping applications that maintain stable memory usage even when processing millions of pages. Remember to always test your scraper's memory behavior under production-like conditions and implement proper monitoring to catch memory issues early.

For complex, JavaScript-heavy sites, consider browser automation tools as an alternative once the limits of Colly's static HTML parsing become apparent.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
