How do I implement retry logic for failed requests in Colly?
Implementing retry logic is crucial for building robust web scrapers with Colly. Network failures, temporary server errors, and rate limiting can cause requests to fail, but with proper retry mechanisms, your scraper can automatically recover from these issues and continue operating reliably.
Understanding Retry Logic in Colly
Colly provides built-in support for retry logic through its callback system: the OnError callback fires whenever a request fails, and the response's Request.Retry() method re-issues that request. Together they give you fine-grained control over when and how failed requests should be retried.
Basic Retry Implementation
Here's a simple implementation of retry logic in Colly:
package main

import (
    "fmt"
    "time"

    "github.com/gocolly/colly/v2"
    "github.com/gocolly/colly/v2/debug"
)

func main() {
    c := colly.NewCollector(
        colly.Debugger(&debug.LogDebugger{}),
    )

    // Set request timeout
    c.SetRequestTimeout(30 * time.Second)

    // Implement retry logic: retry every failed request.
    // Note: without a retry limit this loops on permanently failing URLs;
    // the next example adds a cap and backoff.
    c.OnError(func(r *colly.Response, err error) {
        fmt.Printf("Request failed: %s, retrying...\n", err.Error())
        r.Request.Retry()
    })

    c.OnHTML("title", func(e *colly.HTMLElement) {
        fmt.Println("Title:", e.Text)
    })

    c.OnRequest(func(r *colly.Request) {
        fmt.Printf("Visiting: %s\n", r.URL.String())
    })

    c.Visit("https://httpbin.org/status/500") // This will fail and retry
}
Advanced Retry Logic with Conditions
For production applications, you'll want more sophisticated retry logic that considers the type of error and implements exponential backoff:
package main

import (
    "fmt"
    "net/http"
    "time"

    "github.com/gocolly/colly/v2"
)

const (
    maxRetries = 3
    baseDelay  = 1 * time.Second
)

func main() {
    c := colly.NewCollector()

    // Track retry attempts per URL
    retryCount := make(map[string]int)

    c.OnError(func(r *colly.Response, err error) {
        url := r.Request.URL.String()

        // Check if we should retry based on the error type
        if shouldRetry(r, err) {
            count := retryCount[url]
            if count < maxRetries {
                retryCount[url] = count + 1

                // Exponential backoff: 1s, 2s, 4s
                delay := baseDelay * time.Duration(1<<count)
                fmt.Printf("Retrying %s (attempt %d/%d) after %v\n",
                    url, count+1, maxRetries, delay)

                time.Sleep(delay)
                r.Request.Retry()
            } else {
                fmt.Printf("Max retries exceeded for %s\n", url)
                delete(retryCount, url)
            }
        } else {
            fmt.Printf("Non-retryable error for %s: %s\n", url, err.Error())
        }
    })

    c.OnHTML("title", func(e *colly.HTMLElement) {
        fmt.Println("Title:", e.Text)
    })

    c.Visit("https://httpbin.org/status/502")
}

func shouldRetry(r *colly.Response, err error) bool {
    // Retry on transport-level failures (no HTTP status available)
    if r == nil || r.StatusCode == 0 {
        return true
    }

    // Retry on specific HTTP status codes
    switch r.StatusCode {
    case http.StatusTooManyRequests, // 429
        http.StatusInternalServerError, // 500
        http.StatusBadGateway, // 502
        http.StatusServiceUnavailable, // 503
        http.StatusGatewayTimeout: // 504
        return true
    }
    return false
}
Retry Logic with Custom Delay Strategies
You can implement different delay strategies for retries:
package main

import (
    "fmt"
    "math"
    "math/rand"
    "time"

    "github.com/gocolly/colly/v2"
)

type RetryConfig struct {
    MaxRetries int
    BaseDelay  time.Duration
    MaxDelay   time.Duration
    Strategy   string // "exponential", "linear", "fixed"
}

func main() {
    c := colly.NewCollector()

    config := RetryConfig{
        MaxRetries: 5,
        BaseDelay:  time.Second,
        MaxDelay:   30 * time.Second,
        Strategy:   "exponential",
    }

    retryTracker := make(map[string]int)

    c.OnError(func(r *colly.Response, err error) {
        url := r.Request.URL.String()

        if shouldRetry(r, err) {
            count := retryTracker[url]
            if count < config.MaxRetries {
                retryTracker[url] = count + 1

                delay := calculateDelay(config, count)
                fmt.Printf("Retrying %s (attempt %d/%d) after %v\n",
                    url, count+1, config.MaxRetries, delay)

                time.Sleep(delay)
                r.Request.Retry()
            } else {
                fmt.Printf("Max retries exceeded for %s\n", url)
                delete(retryTracker, url)
            }
        }
    })

    c.OnHTML("title", func(e *colly.HTMLElement) {
        fmt.Println("Successfully scraped:", e.Text)
        // Clean up successful requests from the retry tracker
        delete(retryTracker, e.Request.URL.String())
    })

    c.Visit("https://httpbin.org/status/503")
}

func calculateDelay(config RetryConfig, attempt int) time.Duration {
    var delay time.Duration

    switch config.Strategy {
    case "exponential":
        // Exponential backoff with jitter
        delay = time.Duration(math.Pow(2, float64(attempt))) * config.BaseDelay
        jitter := time.Duration(rand.Intn(1000)) * time.Millisecond
        delay += jitter
    case "linear":
        // Linear backoff
        delay = time.Duration(attempt+1) * config.BaseDelay
    case "fixed":
        // Fixed delay
        delay = config.BaseDelay
    default:
        delay = config.BaseDelay
    }

    // Cap the delay at the configured maximum
    if delay > config.MaxDelay {
        delay = config.MaxDelay
    }
    return delay
}

func shouldRetry(r *colly.Response, err error) bool {
    // Retry on transport-level failures (no HTTP status available)
    if r == nil || r.StatusCode == 0 {
        return true
    }

    retryableStatusCodes := map[int]bool{
        429: true, // Too Many Requests
        500: true, // Internal Server Error
        502: true, // Bad Gateway
        503: true, // Service Unavailable
        504: true, // Gateway Timeout
    }
    return retryableStatusCodes[r.StatusCode]
}
Implementing Circuit Breaker Pattern
For high-volume scraping, you might want to implement a circuit breaker pattern to temporarily stop retrying a failing endpoint:
package main

import (
    "fmt"
    "sync"
    "time"

    "github.com/gocolly/colly/v2"
)

type CircuitBreaker struct {
    mutex        sync.Mutex
    failureCount int
    lastFailure  time.Time
    threshold    int
    timeout      time.Duration
    state        string // "closed", "open", "half-open"
}

func NewCircuitBreaker(threshold int, timeout time.Duration) *CircuitBreaker {
    return &CircuitBreaker{
        threshold: threshold,
        timeout:   timeout,
        state:     "closed",
    }
}

func (cb *CircuitBreaker) CanExecute() bool {
    cb.mutex.Lock()
    defer cb.mutex.Unlock()

    switch cb.state {
    case "open":
        // After the timeout has elapsed, allow a trial request
        if time.Since(cb.lastFailure) > cb.timeout {
            cb.state = "half-open"
            return true
        }
        return false
    case "half-open", "closed":
        return true
    default:
        return false
    }
}

func (cb *CircuitBreaker) RecordSuccess() {
    cb.mutex.Lock()
    defer cb.mutex.Unlock()
    cb.failureCount = 0
    cb.state = "closed"
}

func (cb *CircuitBreaker) RecordFailure() {
    cb.mutex.Lock()
    defer cb.mutex.Unlock()
    cb.failureCount++
    cb.lastFailure = time.Now()
    if cb.failureCount >= cb.threshold {
        cb.state = "open"
    }
}

func shouldRetry(r *colly.Response, err error) bool {
    // Retry on transport-level failures and common transient status codes
    if r == nil || r.StatusCode == 0 {
        return true
    }
    switch r.StatusCode {
    case 429, 500, 502, 503, 504:
        return true
    }
    return false
}

func main() {
    c := colly.NewCollector()

    // Create circuit breakers per domain
    circuitBreakers := make(map[string]*CircuitBreaker)
    retryTracker := make(map[string]int)

    c.OnError(func(r *colly.Response, err error) {
        url := r.Request.URL.String()
        domain := r.Request.URL.Host

        // Initialize the circuit breaker for this domain if it doesn't exist
        if _, exists := circuitBreakers[domain]; !exists {
            circuitBreakers[domain] = NewCircuitBreaker(5, 30*time.Second)
        }
        cb := circuitBreakers[domain]
        cb.RecordFailure()

        if shouldRetry(r, err) && cb.CanExecute() {
            count := retryTracker[url]
            if count < 3 {
                retryTracker[url] = count + 1
                fmt.Printf("Retrying %s (attempt %d/3)\n", url, count+1)
                time.Sleep(time.Duration(count+1) * time.Second)
                r.Request.Retry()
            } else {
                fmt.Printf("Max retries exceeded for %s\n", url)
                delete(retryTracker, url)
            }
        } else {
            fmt.Printf("Circuit breaker open for %s or non-retryable error\n", domain)
        }
    })

    c.OnHTML("title", func(e *colly.HTMLElement) {
        domain := e.Request.URL.Host
        if cb, exists := circuitBreakers[domain]; exists {
            cb.RecordSuccess()
        }
        delete(retryTracker, e.Request.URL.String())
        fmt.Println("Success:", e.Text)
    })

    c.Visit("https://httpbin.org/status/502")
}
Retry Logic with Rate Limiting
Combine retry logic with rate limiting to avoid overwhelming servers:
package main

import (
    "context"
    "fmt"
    "time"

    "github.com/gocolly/colly/v2"
    "golang.org/x/time/rate"
)

func shouldRetry(r *colly.Response, err error) bool {
    // Retry on transport-level failures and common transient status codes
    if r == nil || r.StatusCode == 0 {
        return true
    }
    switch r.StatusCode {
    case 429, 500, 502, 503, 504:
        return true
    }
    return false
}

func main() {
    c := colly.NewCollector()

    // Create a rate limiter (1 request per second, burst of 1)
    limiter := rate.NewLimiter(1, 1)
    retryTracker := make(map[string]int)

    c.OnRequest(func(r *colly.Request) {
        // Wait for the rate limiter before every request, including retries
        if err := limiter.Wait(context.Background()); err != nil {
            fmt.Println("Rate limiter error:", err)
        }
        fmt.Printf("Requesting: %s\n", r.URL.String())
    })

    c.OnError(func(r *colly.Response, err error) {
        url := r.Request.URL.String()

        if shouldRetry(r, err) {
            count := retryTracker[url]
            if count < 3 {
                retryTracker[url] = count + 1

                // Linear backoff on top of the rate limit
                delay := time.Duration(count+1) * 2 * time.Second
                fmt.Printf("Retrying %s after %v\n", url, delay)
                time.Sleep(delay)
                r.Request.Retry()
            }
        }
    })

    c.OnHTML("title", func(e *colly.HTMLElement) {
        delete(retryTracker, e.Request.URL.String())
        fmt.Println("Title:", e.Text)
    })

    c.Visit("https://httpbin.org/delay/2")
}
Best Practices for Retry Logic
1. Implement Proper Error Classification
Not all errors should trigger retries. Network timeouts and server errors (5xx) are good candidates for retries, while client errors (4xx) typically are not (429 Too Many Requests is the main exception and is usually worth retrying after a delay).
2. Use Exponential Backoff with Jitter
This prevents the "thundering herd" problem where multiple clients retry at the same time.
3. Set Maximum Retry Limits
Always limit the number of retry attempts to prevent infinite loops.
4. Monitor and Log Retry Attempts
Keep track of retry counts and outcomes to identify problematic endpoints or network issues; a small logging sketch follows this list.
5. Respect Rate Limits
Combine retry logic with rate limiting to avoid overwhelming target servers, similar to how you handle timeouts in Puppeteer for browser-based scraping.
6. Implement Circuit Breakers
For high-volume applications, use circuit breakers to temporarily stop retrying consistently failing endpoints.
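Practice 4 is the only one not demonstrated in the examples above, so here is a minimal sketch of one way to count and log retry attempts per URL behind a mutex (which also keeps the counter safe if you later switch the collector to colly.Async(true)). The retryStats type, its record method, and the three-attempt cap are illustrative choices, not part of Colly's API:

package main

import (
    "log"
    "sync"

    "github.com/gocolly/colly/v2"
)

// retryStats tracks retry counts per URL behind a mutex so the map stays
// safe even when callbacks run concurrently.
type retryStats struct {
    mu       sync.Mutex
    attempts map[string]int
}

func newRetryStats() *retryStats {
    return &retryStats{attempts: make(map[string]int)}
}

// record increments the retry counter for a URL, logs the attempt, and
// returns the new count so the caller can enforce a limit.
func (s *retryStats) record(url string, statusCode int) int {
    s.mu.Lock()
    defer s.mu.Unlock()
    s.attempts[url]++
    log.Printf("retry #%d for %s (status %d)", s.attempts[url], url, statusCode)
    return s.attempts[url]
}

func main() {
    c := colly.NewCollector()
    stats := newRetryStats()

    c.OnError(func(r *colly.Response, err error) {
        // Log the attempt, then retry up to three times per URL
        if stats.record(r.Request.URL.String(), r.StatusCode) <= 3 {
            r.Request.Retry()
        }
    })

    c.Visit("https://httpbin.org/status/500")
}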
Common Pitfalls to Avoid
- Retrying non-idempotent operations: Be careful when retrying POST requests that might create duplicate data (see the sketch after this list)
- Ignoring response bodies: Sometimes servers return useful information in error responses
- Not implementing proper cleanup: Clean up retry tracking data for successful requests
- Overly aggressive retries: This can lead to being blocked by target websites
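The first two pitfalls can be handled directly in the OnError callback. The sketch below assumes that only GET and HEAD requests are safe to repeat and that a 200-character body snippet is enough for logging; both limits are arbitrary illustrations rather than Colly defaults:

package main

import (
    "fmt"

    "github.com/gocolly/colly/v2"
)

func main() {
    c := colly.NewCollector()
    retries := make(map[string]int)

    c.OnError(func(r *colly.Response, err error) {
        // A retried POST may create duplicate records, so only retry
        // methods that are safe to repeat.
        switch r.Request.Method {
        case "GET", "HEAD":
            // idempotent, safe to retry
        default:
            fmt.Printf("Not retrying %s request to %s\n", r.Request.Method, r.Request.URL)
            return
        }

        // Error responses often explain what went wrong (rate-limit messages,
        // ban notices), so log a snippet of the body before retrying.
        body := string(r.Body)
        if len(body) > 200 {
            body = body[:200]
        }
        fmt.Printf("Error %d from %s: %s\n", r.StatusCode, r.Request.URL, body)

        url := r.Request.URL.String()
        if retries[url] < 3 {
            retries[url]++
            r.Request.Retry()
        }
    })

    c.Visit("https://httpbin.org/status/503")
}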
Conclusion
Implementing robust retry logic in Colly is essential for building reliable web scrapers. By combining proper error handling, exponential backoff, rate limiting, and circuit breaker patterns, you can create scrapers that gracefully handle temporary failures while respecting server resources. Remember to always monitor your retry patterns and adjust your strategy based on the specific requirements of your target websites.
For handling errors in other web scraping contexts, you might also want to learn about handling errors in Puppeteer for browser-based scraping scenarios.