How do I implement retry logic for failed requests in Colly?
Implementing retry logic is crucial for building robust web scrapers with Colly. Network failures, temporary server errors, and rate limiting can cause requests to fail, but with proper retry mechanisms, your scraper can automatically recover from these issues and continue operating reliably.
Understanding Retry Logic in Colly
Colly provides built-in support for retry logic through its callback system: the OnError callback fires whenever a request fails, and the response's Request.Retry() method re-issues that request. Together they give you fine-grained control over when and how failed requests should be retried.
Basic Retry Implementation
Here's a simple implementation of retry logic in Colly:
package main

import (
    "fmt"
    "time"

    "github.com/gocolly/colly/v2"
    "github.com/gocolly/colly/v2/debug"
)

func main() {
    c := colly.NewCollector(
        colly.Debugger(&debug.LogDebugger{}),
    )

    // Set request timeout
    c.SetRequestTimeout(30 * time.Second)

    // Implement retry logic: retry every failed request.
    // Note: without a retry limit this loops on permanently failing URLs;
    // the next example adds a cap and backoff.
    c.OnError(func(r *colly.Response, err error) {
        fmt.Printf("Request failed: %s, retrying...\n", err.Error())
        r.Request.Retry()
    })

    c.OnHTML("title", func(e *colly.HTMLElement) {
        fmt.Println("Title:", e.Text)
    })

    c.OnRequest(func(r *colly.Request) {
        fmt.Printf("Visiting: %s\n", r.URL.String())
    })

    c.Visit("https://httpbin.org/status/500") // This will fail and retry
}
Advanced Retry Logic with Conditions
For production applications, you'll want more sophisticated retry logic that considers the type of error and implements exponential backoff:
package main

import (
    "fmt"
    "net/http"
    "time"

    "github.com/gocolly/colly/v2"
)

const (
    maxRetries = 3
    baseDelay  = 1 * time.Second
)

func main() {
    c := colly.NewCollector()

    // Track retry attempts per URL
    retryCount := make(map[string]int)

    c.OnError(func(r *colly.Response, err error) {
        url := r.Request.URL.String()

        // Check if we should retry based on the error type
        if shouldRetry(r, err) {
            count := retryCount[url]
            if count < maxRetries {
                retryCount[url] = count + 1

                // Exponential backoff: 1s, 2s, 4s
                delay := baseDelay * time.Duration(1<<count)
                fmt.Printf("Retrying %s (attempt %d/%d) after %v\n",
                    url, count+1, maxRetries, delay)

                time.Sleep(delay)
                r.Request.Retry()
            } else {
                fmt.Printf("Max retries exceeded for %s\n", url)
                delete(retryCount, url)
            }
        } else {
            fmt.Printf("Non-retryable error for %s: %s\n", url, err.Error())
        }
    })

    c.OnHTML("title", func(e *colly.HTMLElement) {
        fmt.Println("Title:", e.Text)
    })

    c.Visit("https://httpbin.org/status/502")
}

func shouldRetry(r *colly.Response, err error) bool {
    // Retry on transport-level failures (no HTTP status available)
    if r == nil || r.StatusCode == 0 {
        return true
    }

    // Retry on specific HTTP status codes
    switch r.StatusCode {
    case http.StatusTooManyRequests, // 429
        http.StatusInternalServerError, // 500
        http.StatusBadGateway, // 502
        http.StatusServiceUnavailable, // 503
        http.StatusGatewayTimeout: // 504
        return true
    }
    return false
}
Retry Logic with Custom Delay Strategies
You can implement different delay strategies for retries:
package main

import (
    "fmt"
    "math"
    "math/rand"
    "time"

    "github.com/gocolly/colly/v2"
)

type RetryConfig struct {
    MaxRetries int
    BaseDelay  time.Duration
    MaxDelay   time.Duration
    Strategy   string // "exponential", "linear", "fixed"
}

func main() {
    c := colly.NewCollector()

    config := RetryConfig{
        MaxRetries: 5,
        BaseDelay:  time.Second,
        MaxDelay:   30 * time.Second,
        Strategy:   "exponential",
    }

    retryTracker := make(map[string]int)

    c.OnError(func(r *colly.Response, err error) {
        url := r.Request.URL.String()

        if shouldRetry(r, err) {
            count := retryTracker[url]
            if count < config.MaxRetries {
                retryTracker[url] = count + 1

                delay := calculateDelay(config, count)
                fmt.Printf("Retrying %s (attempt %d/%d) after %v\n",
                    url, count+1, config.MaxRetries, delay)

                time.Sleep(delay)
                r.Request.Retry()
            } else {
                fmt.Printf("Max retries exceeded for %s\n", url)
                delete(retryTracker, url)
            }
        }
    })

    c.OnHTML("title", func(e *colly.HTMLElement) {
        fmt.Println("Successfully scraped:", e.Text)
        // Clean up successful requests from the retry tracker
        delete(retryTracker, e.Request.URL.String())
    })

    c.Visit("https://httpbin.org/status/503")
}

func calculateDelay(config RetryConfig, attempt int) time.Duration {
    var delay time.Duration

    switch config.Strategy {
    case "exponential":
        // Exponential backoff with jitter
        delay = time.Duration(math.Pow(2, float64(attempt))) * config.BaseDelay
        jitter := time.Duration(rand.Intn(1000)) * time.Millisecond
        delay += jitter
    case "linear":
        // Linear backoff
        delay = time.Duration(attempt+1) * config.BaseDelay
    case "fixed":
        // Fixed delay
        delay = config.BaseDelay
    default:
        delay = config.BaseDelay
    }

    // Cap the delay at the configured maximum
    if delay > config.MaxDelay {
        delay = config.MaxDelay
    }
    return delay
}

func shouldRetry(r *colly.Response, err error) bool {
    // Retry on transport-level failures (no HTTP status available)
    if r == nil || r.StatusCode == 0 {
        return true
    }

    retryableStatusCodes := map[int]bool{
        429: true, // Too Many Requests
        500: true, // Internal Server Error
        502: true, // Bad Gateway
        503: true, // Service Unavailable
        504: true, // Gateway Timeout
    }
    return retryableStatusCodes[r.StatusCode]
}
Implementing Circuit Breaker Pattern
For high-volume scraping, you might want to implement a circuit breaker pattern to temporarily stop retrying a failing endpoint:
package main

import (
    "fmt"
    "sync"
    "time"

    "github.com/gocolly/colly/v2"
)

type CircuitBreaker struct {
    mutex        sync.Mutex
    failureCount int
    lastFailure  time.Time
    threshold    int
    timeout      time.Duration
    state        string // "closed", "open", "half-open"
}

func NewCircuitBreaker(threshold int, timeout time.Duration) *CircuitBreaker {
    return &CircuitBreaker{
        threshold: threshold,
        timeout:   timeout,
        state:     "closed",
    }
}

func (cb *CircuitBreaker) CanExecute() bool {
    cb.mutex.Lock()
    defer cb.mutex.Unlock()

    switch cb.state {
    case "open":
        // After the timeout has elapsed, allow a trial request
        if time.Since(cb.lastFailure) > cb.timeout {
            cb.state = "half-open"
            return true
        }
        return false
    case "half-open", "closed":
        return true
    default:
        return false
    }
}

func (cb *CircuitBreaker) RecordSuccess() {
    cb.mutex.Lock()
    defer cb.mutex.Unlock()
    cb.failureCount = 0
    cb.state = "closed"
}

func (cb *CircuitBreaker) RecordFailure() {
    cb.mutex.Lock()
    defer cb.mutex.Unlock()
    cb.failureCount++
    cb.lastFailure = time.Now()
    if cb.failureCount >= cb.threshold {
        cb.state = "open"
    }
}

func shouldRetry(r *colly.Response, err error) bool {
    // Retry on transport-level failures and common transient status codes
    if r == nil || r.StatusCode == 0 {
        return true
    }
    switch r.StatusCode {
    case 429, 500, 502, 503, 504:
        return true
    }
    return false
}

func main() {
    c := colly.NewCollector()

    // Create circuit breakers per domain
    circuitBreakers := make(map[string]*CircuitBreaker)
    retryTracker := make(map[string]int)

    c.OnError(func(r *colly.Response, err error) {
        url := r.Request.URL.String()
        domain := r.Request.URL.Host

        // Initialize the circuit breaker for this domain if it doesn't exist
        if _, exists := circuitBreakers[domain]; !exists {
            circuitBreakers[domain] = NewCircuitBreaker(5, 30*time.Second)
        }
        cb := circuitBreakers[domain]
        cb.RecordFailure()

        if shouldRetry(r, err) && cb.CanExecute() {
            count := retryTracker[url]
            if count < 3 {
                retryTracker[url] = count + 1
                fmt.Printf("Retrying %s (attempt %d/3)\n", url, count+1)
                time.Sleep(time.Duration(count+1) * time.Second)
                r.Request.Retry()
            } else {
                fmt.Printf("Max retries exceeded for %s\n", url)
                delete(retryTracker, url)
            }
        } else {
            fmt.Printf("Circuit breaker open for %s or non-retryable error\n", domain)
        }
    })

    c.OnHTML("title", func(e *colly.HTMLElement) {
        domain := e.Request.URL.Host
        if cb, exists := circuitBreakers[domain]; exists {
            cb.RecordSuccess()
        }
        delete(retryTracker, e.Request.URL.String())
        fmt.Println("Success:", e.Text)
    })

    c.Visit("https://httpbin.org/status/502")
}
Retry Logic with Rate Limiting
Combine retry logic with rate limiting to avoid overwhelming servers:
package main

import (
    "context"
    "fmt"
    "time"

    "github.com/gocolly/colly/v2"
    "golang.org/x/time/rate"
)

func shouldRetry(r *colly.Response, err error) bool {
    // Retry on transport-level failures and common transient status codes
    if r == nil || r.StatusCode == 0 {
        return true
    }
    switch r.StatusCode {
    case 429, 500, 502, 503, 504:
        return true
    }
    return false
}

func main() {
    c := colly.NewCollector()

    // Create a rate limiter (1 request per second, burst of 1)
    limiter := rate.NewLimiter(1, 1)
    retryTracker := make(map[string]int)

    c.OnRequest(func(r *colly.Request) {
        // Wait for the rate limiter before every request, including retries
        if err := limiter.Wait(context.Background()); err != nil {
            fmt.Println("Rate limiter error:", err)
        }
        fmt.Printf("Requesting: %s\n", r.URL.String())
    })

    c.OnError(func(r *colly.Response, err error) {
        url := r.Request.URL.String()

        if shouldRetry(r, err) {
            count := retryTracker[url]
            if count < 3 {
                retryTracker[url] = count + 1

                // Linear backoff on top of the rate limit
                delay := time.Duration(count+1) * 2 * time.Second
                fmt.Printf("Retrying %s after %v\n", url, delay)
                time.Sleep(delay)
                r.Request.Retry()
            }
        }
    })

    c.OnHTML("title", func(e *colly.HTMLElement) {
        delete(retryTracker, e.Request.URL.String())
        fmt.Println("Title:", e.Text)
    })

    c.Visit("https://httpbin.org/delay/2")
}
Best Practices for Retry Logic
1. Implement Proper Error Classification
Not all errors should trigger retries. Network timeouts and server errors (5xx) are good candidates for retries, while client errors (4xx) typically are not (429 Too Many Requests is the main exception and is usually worth retrying after a delay).
2. Use Exponential Backoff with Jitter
This prevents the "thundering herd" problem where multiple clients retry at the same time.
3. Set Maximum Retry Limits
Always limit the number of retry attempts to prevent infinite loops.
4. Monitor and Log Retry Attempts
Keep track of retry counts and outcomes to identify problematic endpoints or network issues; a small logging sketch follows this list.
5. Respect Rate Limits
Combine retry logic with rate limiting to avoid overwhelming target servers, similar to how you handle timeouts in Puppeteer for browser-based scraping.
6. Implement Circuit Breakers
For high-volume applications, use circuit breakers to temporarily stop retrying consistently failing endpoints.
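Practice 4 is the only one not demonstrated in the examples above, so here is a minimal sketch of one way to count and log retry attempts per URL behind a mutex (which also keeps the counter safe if you later switch the collector to colly.Async(true)). The retryStats type, its record method, and the three-attempt cap are illustrative choices, not part of Colly's API:

package main

import (
    "log"
    "sync"

    "github.com/gocolly/colly/v2"
)

// retryStats tracks retry counts per URL behind a mutex so the map stays
// safe even when callbacks run concurrently.
type retryStats struct {
    mu       sync.Mutex
    attempts map[string]int
}

func newRetryStats() *retryStats {
    return &retryStats{attempts: make(map[string]int)}
}

// record increments the retry counter for a URL, logs the attempt, and
// returns the new count so the caller can enforce a limit.
func (s *retryStats) record(url string, statusCode int) int {
    s.mu.Lock()
    defer s.mu.Unlock()
    s.attempts[url]++
    log.Printf("retry #%d for %s (status %d)", s.attempts[url], url, statusCode)
    return s.attempts[url]
}

func main() {
    c := colly.NewCollector()
    stats := newRetryStats()

    c.OnError(func(r *colly.Response, err error) {
        // Log the attempt, then retry up to three times per URL
        if stats.record(r.Request.URL.String(), r.StatusCode) <= 3 {
            r.Request.Retry()
        }
    })

    c.Visit("https://httpbin.org/status/500")
}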
Common Pitfalls to Avoid
- Retrying non-idempotent operations: Be careful when retrying POST requests that might create duplicate data (see the sketch after this list)
- Ignoring response bodies: Sometimes servers return useful information in error responses
- Not implementing proper cleanup: Clean up retry tracking data for successful requests
- Overly aggressive retries: This can lead to being blocked by target websites
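The first two pitfalls can be handled directly in the OnError callback. The sketch below assumes that only GET and HEAD requests are safe to repeat and that a 200-character body snippet is enough for logging; both limits are arbitrary illustrations rather than Colly defaults:

package main

import (
    "fmt"

    "github.com/gocolly/colly/v2"
)

func main() {
    c := colly.NewCollector()
    retries := make(map[string]int)

    c.OnError(func(r *colly.Response, err error) {
        // A retried POST may create duplicate records, so only retry
        // methods that are safe to repeat.
        switch r.Request.Method {
        case "GET", "HEAD":
            // idempotent, safe to retry
        default:
            fmt.Printf("Not retrying %s request to %s\n", r.Request.Method, r.Request.URL)
            return
        }

        // Error responses often explain what went wrong (rate-limit messages,
        // ban notices), so log a snippet of the body before retrying.
        body := string(r.Body)
        if len(body) > 200 {
            body = body[:200]
        }
        fmt.Printf("Error %d from %s: %s\n", r.StatusCode, r.Request.URL, body)

        url := r.Request.URL.String()
        if retries[url] < 3 {
            retries[url]++
            r.Request.Retry()
        }
    })

    c.Visit("https://httpbin.org/status/503")
}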
Conclusion
Implementing robust retry logic in Colly is essential for building reliable web scrapers. By combining proper error handling, exponential backoff, rate limiting, and circuit breaker patterns, you can create scrapers that gracefully handle temporary failures while respecting server resources. Remember to always monitor your retry patterns and adjust your strategy based on the specific requirements of your target websites.
For handling errors in other web scraping contexts, you might also want to learn about handling errors in Puppeteer for browser-based scraping scenarios.