What are the Performance Considerations When Using Colly?
Colly is a powerful and fast web scraping framework for Go, but achieving optimal performance requires understanding and implementing several key considerations. Whether you're scraping a few pages or millions of URLs, these performance optimizations can significantly impact your scraper's efficiency and resource usage.
Parallelism and Concurrency Control
Async Mode Configuration
One of Colly's most powerful features is its ability to handle concurrent requests. By default, Colly operates synchronously, but enabling async mode can dramatically improve performance:
package main

import (
    "fmt"
    "time"

    "github.com/gocolly/colly/v2"
    "github.com/gocolly/colly/v2/debug"
)

func main() {
    c := colly.NewCollector(
        colly.Debugger(&debug.LogDebugger{}),
    )

    // Enable async mode so requests run concurrently
    c.Async = true

    // Limit concurrent requests to avoid overwhelming servers
    if err := c.Limit(&colly.LimitRule{
        DomainGlob:  "*",
        Parallelism: 2,               // Number of concurrent requests per matching domain
        Delay:       1 * time.Second, // Delay between requests
    }); err != nil {
        fmt.Println(err)
    }

    c.OnHTML("a[href]", func(e *colly.HTMLElement) {
        link := e.Attr("href")
        e.Request.Visit(link)
    })

    c.Visit("https://example.com")

    // Wait for all queued requests to complete
    c.Wait()
}
Optimal Parallelism Settings
Finding the right parallelism level is crucial for performance:
// Conservative approach for respectful scraping
c.Limit(&colly.LimitRule{
    DomainGlob:  "*httpbin.org*",
    Parallelism: 2,
    Delay:       500 * time.Millisecond,
})

// Aggressive approach for internal APIs or scraping with permission
c.Limit(&colly.LimitRule{
    DomainGlob:  "*internal-api.com*",
    Parallelism: 10,
    Delay:       100 * time.Millisecond,
})

// Different rules for different domains
c.Limit(&colly.LimitRule{
    DomainGlob:  "*social-media.com*",
    Parallelism: 1, // Strict rate limiting
    Delay:       2 * time.Second,
})
Memory Management Optimization
Request Queue Management
Colly maintains an internal queue of requests that can consume significant memory for large scraping operations:
c := colly.NewCollector()

// Cap concurrency to keep the request queue from growing unbounded
c.Limit(&colly.LimitRule{
    DomainGlob:  "*",
    Parallelism: 4, // Balance between speed and memory
})

// Use callbacks to process data immediately rather than storing it
c.OnHTML(".product", func(e *colly.HTMLElement) {
    // Process and save data immediately
    // (Product and saveProduct are application code, not Colly APIs)
    product := Product{
        Name:  e.ChildText(".name"),
        Price: e.ChildText(".price"),
    }
    // Save to database or file right away
    saveProduct(product)
    // Nothing is retained in memory for later processing
})
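For crawls with a very large URL frontier, Colly's companion queue package (github.com/gocolly/colly/v2/queue) gives you explicit control over that queue: a fixed number of consumer threads plus a bounded, pluggable store. A minimal sketch using the in-memory backend; the 10,000-URL cap is an arbitrary example:

package main

import (
    "github.com/gocolly/colly/v2"
    "github.com/gocolly/colly/v2/queue"
)

func main() {
    c := colly.NewCollector()

    // Two consumer threads; the frontier is capped at 10,000 pending URLs
    q, err := queue.New(2, &queue.InMemoryQueueStorage{MaxSize: 10000})
    if err != nil {
        panic(err)
    }

    q.AddURL("https://example.com")

    // Run blocks until the queue is drained
    q.Run(c)
}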
Efficient Data Processing
Process scraped data immediately to avoid memory accumulation:
// Bad: Storing all data in memory
var products []Product
c.OnHTML(".product", func(e *colly.HTMLElement) {
    products = append(products, Product{
        Name: e.ChildText(".name"),
    })
})

// Good: Process data immediately
c.OnHTML(".product", func(e *colly.HTMLElement) {
    product := Product{
        Name: e.ChildText(".name"),
    }
    // Stream to file or database
    writeToCSV(product)
    // or
    insertToDatabase(product)
})
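One caveat: in async mode these callbacks can run concurrently, so whatever writeToCSV does must itself be safe for concurrent use. A common pattern is a single writer goroutine fed by a channel, so only one goroutine ever touches the output file. A sketch under that assumption, where Product and the filename are illustrative:

package main

import (
    "encoding/csv"
    "os"

    "github.com/gocolly/colly/v2"
)

// Product is a hypothetical record type for this example.
type Product struct {
    Name  string
    Price string
}

func main() {
    c := colly.NewCollector()
    c.Async = true

    products := make(chan Product, 100) // buffer decouples scraping from disk I/O
    done := make(chan struct{})

    // Single writer goroutine: the only code that touches the file
    go func() {
        defer close(done)
        f, err := os.Create("products.csv")
        if err != nil {
            panic(err)
        }
        defer f.Close()
        w := csv.NewWriter(f)
        defer w.Flush()
        for p := range products {
            w.Write([]string{p.Name, p.Price})
        }
    }()

    c.OnHTML(".product", func(e *colly.HTMLElement) {
        // Callbacks may run concurrently; channel sends are safe
        products <- Product{
            Name:  e.ChildText(".name"),
            Price: e.ChildText(".price"),
        }
    })

    c.Visit("https://example.com/products")
    c.Wait()

    close(products) // let the writer drain and exit
    <-done
}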
HTTP Connection Optimization
Connection Pooling and Keep-Alive
Configure HTTP transport settings for better connection reuse:
import (
    "net/http"
    "time"
)

c := colly.NewCollector()

// Build the transport once and attach it to the collector.
// (A transport created inside OnRequest would never be used by
// Colly and would defeat connection reuse.)
c.WithTransport(&http.Transport{
    MaxIdleConns:        100,              // Total idle connections
    MaxIdleConnsPerHost: 10,               // Idle connections per host
    IdleConnTimeout:     90 * time.Second, // How long to keep idle connections
    DisableKeepAlives:   false,            // Keep-alive stays enabled (the default)
})
Timeout Configuration
Set appropriate timeouts to prevent hanging requests:
c := colly.NewCollector()

// Set an overall request timeout
c.SetRequestTimeout(30 * time.Second)

// Configure connection-level timeouts on the transport.
// (http.Transport has no DialTimeout field; dial timeouts come
// from a net.Dialer, so this also needs the "net" import.)
c.WithTransport(&http.Transport{
    DialContext: (&net.Dialer{
        Timeout: 10 * time.Second, // Connection establishment timeout
    }).DialContext,
    TLSHandshakeTimeout: 10 * time.Second,
})
Rate Limiting and Respectful Scraping
Intelligent Rate Limiting
Implement smart rate limiting that adapts to server responses:
c := colly.NewCollector()

// Basic rate limiting
c.Limit(&colly.LimitRule{
    DomainGlob:  "*",
    Parallelism: 2,
    Delay:       1 * time.Second,
})

// Non-2xx responses are routed to OnError by default,
// so handle 429 (Too Many Requests) there
c.OnError(func(r *colly.Response, err error) {
    if r.StatusCode == 429 {
        // Back off before retrying
        time.Sleep(5 * time.Second)
        r.Request.Retry()
    }
})

// Record each request's start time so latency can be measured
c.OnRequest(func(r *colly.Request) {
    r.Ctx.Put("start", time.Now())
})

// Monitor response times and slow down when the server struggles
c.OnResponse(func(r *colly.Response) {
    start, ok := r.Ctx.GetAny("start").(time.Time)
    if !ok {
        return
    }
    if time.Since(start) > 3*time.Second {
        // Server is slow: register a stricter rule for this host
        // (in production, add such a rule only once per host)
        c.Limit(&colly.LimitRule{
            DomainGlob: r.Request.URL.Host,
            Delay:      2 * time.Second,
        })
    }
})
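Many servers that return 429 also send a Retry-After header saying how long to wait, and honoring it is friendlier than a fixed sleep. A sketch assuming the header carries an integer number of seconds (it can also be an HTTP date), with "strconv" added to the imports:

c.OnError(func(r *colly.Response, err error) {
    if r.StatusCode != 429 {
        return
    }
    // Honor Retry-After when present; fall back to a fixed delay
    wait := 5 * time.Second
    if s := r.Headers.Get("Retry-After"); s != "" {
        if secs, convErr := strconv.Atoi(s); convErr == nil {
            wait = time.Duration(secs) * time.Second
        }
    }
    time.Sleep(wait)
    r.Request.Retry()
})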
Caching and Storage Optimization
Implementing Response Caching
Cache responses to avoid redundant requests:
import (
    "github.com/gocolly/redisstorage"
)

c := colly.NewCollector()

// Cache responses on disk so repeated runs skip unchanged pages
c.CacheDir = "./cache"

// For distributed scrapers, the separate github.com/gocolly/redisstorage
// package can share state across instances. Note that Colly's storage
// interface tracks visited URLs and cookies, not response bodies.
storage := &redisstorage.Storage{
    Address:  "127.0.0.1:6379",
    Password: "",
    DB:       0,
    Prefix:   "colly_cache",
}
if err := c.SetStorage(storage); err != nil {
    panic(err)
}
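Caching helps, but the cheapest request is the one you never send. Colly already skips URLs it has visited (unless AllowURLRevisit is set), and you can bound the crawl up front; the domains, depth, and URL pattern below are illustrative:

import "regexp"

c := colly.NewCollector(
    colly.AllowedDomains("example.com", "www.example.com"), // stay on target hosts
    colly.MaxDepth(2), // stop following links after two hops
    colly.URLFilters(
        regexp.MustCompile(`^https://example\.com/products/`), // only matching URLs are visited
    ),
)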
Selective Content Processing
Only process the content you need to improve performance:
import "strings"
c := colly.NewCollector()
// Only download HTML content, skip images/CSS/JS
c.OnRequest(func(r *colly.Request) {
contentType := r.Headers.Get("Content-Type")
if strings.Contains(contentType, "image/") ||
strings.Contains(contentType, "text/css") ||
strings.Contains(contentType, "application/javascript") {
r.Abort()
}
})
// Use specific selectors to minimize DOM parsing
c.OnHTML("div.content a[href]", func(e *colly.HTMLElement) {
// More specific selector = faster parsing
link := e.Attr("href")
e.Request.Visit(link)
})
Error Handling and Retry Logic
Efficient Error Recovery
Implement smart retry mechanisms to handle temporary failures:
c := colly.NewCollector()

// Track retry attempts per URL; the mutex matters in async mode,
// where OnError can fire from several goroutines at once
var mu sync.Mutex
retryCount := make(map[string]int)

c.OnError(func(r *colly.Response, err error) {
    url := r.Request.URL.String()

    mu.Lock()
    retryCount[url]++
    attempts := retryCount[url]
    mu.Unlock()

    // Retry with exponential backoff: 1s, 2s, ...
    if attempts < 3 {
        time.Sleep(time.Duration(1<<(attempts-1)) * time.Second)
        r.Request.Retry()
    } else {
        fmt.Printf("Failed after 3 retries: %s\n", url)
    }
})
Monitoring and Profiling
Performance Metrics Collection
Monitor your scraper's performance in real-time:
import (
    "fmt"
    "sync/atomic"
    "time"
)

var (
    requestCount  int64
    responseCount int64
    errorCount    int64
    startTime     = time.Now()
)

c := colly.NewCollector()

c.OnRequest(func(r *colly.Request) {
    atomic.AddInt64(&requestCount, 1)
})

c.OnResponse(func(r *colly.Response) {
    atomic.AddInt64(&responseCount, 1)
})

c.OnError(func(r *colly.Response, err error) {
    atomic.AddInt64(&errorCount, 1)
})

// Print stats periodically
go func() {
    for {
        time.Sleep(10 * time.Second)
        elapsed := time.Since(startTime)
        requests := atomic.LoadInt64(&requestCount)
        responses := atomic.LoadInt64(&responseCount)
        errors := atomic.LoadInt64(&errorCount)
        fmt.Printf("Stats: %d requests, %d responses, %d errors in %v\n",
            requests, responses, errors, elapsed)
        fmt.Printf("Rate: %.2f requests/second\n",
            float64(requests)/elapsed.Seconds())
    }
}()
Memory Profiling with Go Tools
Use Go's built-in profiling tools to identify performance bottlenecks:
# Inspect heap allocations
go tool pprof http://localhost:6060/debug/pprof/heap

# Capture a 30-second CPU profile
go tool pprof http://localhost:6060/debug/pprof/profile

# Check goroutine usage
go tool pprof http://localhost:6060/debug/pprof/goroutine
Add profiling endpoints to your scraper:
import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers /debug/pprof handlers on the default mux
)

func main() {
    // Start the profiling server on a side port
    go func() {
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()

    // Your scraping code here
}
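If you only need coarse numbers without attaching pprof, you can also sample the runtime's own counters on an interval; the fields and the interval shown here are just one reasonable choice:

import (
    "fmt"
    "runtime"
    "time"
)

// logRuntimeStats prints heap usage and goroutine counts at a fixed interval.
func logRuntimeStats(interval time.Duration) {
    var m runtime.MemStats
    for {
        time.Sleep(interval)
        runtime.ReadMemStats(&m)
        fmt.Printf("heap=%d MiB, goroutines=%d, GC cycles=%d\n",
            m.HeapAlloc/1024/1024, runtime.NumGoroutine(), m.NumGC)
    }
}

// Usage: go logRuntimeStats(10 * time.Second)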
Comparison with Other Tools
When weighing performance alternatives, it is worth knowing what else exists. For JavaScript-heavy sites that require browser automation, tools like Puppeteer take a different approach, with their own mechanisms for handling timeouts and running multiple pages in parallel, and may be a better fit for those use cases.
Resource Management Best Practices
CPU Optimization
import "runtime"
func optimizeForCPU() {
// Set GOMAXPROCS to match available CPU cores
numCPU := runtime.NumCPU()
runtime.GOMAXPROCS(numCPU)
// For I/O heavy workloads, you might want more goroutines
c.Limit(&colly.LimitRule{
DomainGlob: "*",
Parallelism: numCPU * 2, // 2x CPU cores for I/O bound tasks
})
}
Network Optimization
// Configure network settings and attach them to the collector.
// Colly manages its own HTTP client internally, so hand it the
// transport rather than building a separate http.Client
c.WithTransport(&http.Transport{
    MaxIdleConns:        100,
    MaxIdleConnsPerHost: 30,
    IdleConnTimeout:     90 * time.Second,
    DisableCompression:  false, // Compression stays enabled
    DisableKeepAlives:   false, // Keep-alive stays enabled
})
c.SetRequestTimeout(30 * time.Second)
Best Practices Summary
- Start Conservative: Begin with low parallelism (1-2) and gradually increase based on server response
- Monitor Resource Usage: Track memory, CPU, and network usage during scraping
- Implement Proper Error Handling: Use exponential backoff and reasonable retry limits
- Cache Intelligently: Store responses when appropriate to avoid redundant requests
- Process Data Immediately: Don't accumulate large amounts of data in memory
- Respect Server Limits: Implement delays and respect robots.txt files (see the sketch after this list)
- Use Appropriate Timeouts: Set reasonable connection and request timeouts
- Profile Your Code: Use Go's built-in profiling tools to identify bottlenecks
- Optimize Selectors: Use specific CSS selectors to minimize DOM parsing overhead
- Handle Rate Limits: Implement adaptive rate limiting based on server responses
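Several of these practices compose naturally into one collector setup. A minimal sketch with an illustrative target domain; note that Colly ignores robots.txt by default, so compliance has to be opted into via IgnoreRobotsTxt:

package main

import (
    "time"

    "github.com/gocolly/colly/v2"
)

func main() {
    c := colly.NewCollector(
        colly.UserAgent("my-scraper/1.0 (+https://example.com/contact)"), // identify yourself
        colly.MaxDepth(2), // bound the crawl
    )
    c.Async = true
    c.IgnoreRobotsTxt = false             // opt in to robots.txt compliance
    c.SetRequestTimeout(30 * time.Second) // reasonable request timeout

    // Start conservative; raise Parallelism only after observing server behavior
    c.Limit(&colly.LimitRule{
        DomainGlob:  "*",
        Parallelism: 2,
        Delay:       1 * time.Second,
    })

    c.Visit("https://example.com")
    c.Wait()
}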
Common Performance Pitfalls
Memory Leaks
// Avoid: Storing references to large objects
var responses []*colly.Response
c.OnResponse(func(r *colly.Response) {
    responses = append(responses, r) // Memory leak!
})

// Better: Process and discard
c.OnResponse(func(r *colly.Response) {
    processResponse(r)
    // The response can be garbage collected after the callback returns
})
Inefficient Selectors
// Avoid: Broad selectors that match many elements
c.OnHTML("*", func(e *colly.HTMLElement) {
// This is very slow!
})
// Better: Specific selectors
c.OnHTML("article.post h2.title", func(e *colly.HTMLElement) {
// Much faster and more targeted
})
By following these performance considerations, you can build efficient, scalable, and respectful web scrapers with Colly that handle large-scale data extraction while maintaining optimal resource usage and server relationships.