How do I implement retry logic for failed requests in Colly?

Implementing retry logic is crucial for building robust web scrapers with Colly. Network failures, temporary server errors, and rate limiting can cause requests to fail, but with proper retry mechanisms, your scraper can automatically recover from these issues and continue operating reliably.

Understanding Retry Logic in Colly

Colly does not retry failed requests automatically, but its callback system makes retries straightforward to implement: the OnError callback fires whenever a request fails, and Request.Retry() re-issues that request. Together they give you fine-grained control over when and how failed requests should be retried.

Basic Retry Implementation

Here's a simple implementation of retry logic in Colly:

package main

import (
    "fmt"
    "time"

    "github.com/gocolly/colly/v2"
    "github.com/gocolly/colly/v2/debug"
)

func main() {
    c := colly.NewCollector(
        colly.Debugger(&debug.LogDebugger{}),
    )

    // Set request timeout
    c.SetRequestTimeout(30 * time.Second)

    // Retry every failed request. Without an attempt counter this retries
    // indefinitely on a permanently failing URL; see the bounded variant below.
    c.OnError(func(r *colly.Response, err error) {
        fmt.Printf("Request failed: %s, retrying...\n", err.Error())
        r.Request.Retry()
    })

    c.OnHTML("title", func(e *colly.HTMLElement) {
        fmt.Println("Title:", e.Text)
    })

    c.OnRequest(func(r *colly.Request) {
        fmt.Printf("Visiting: %s\n", r.URL.String())
    })

    c.Visit("https://httpbin.org/status/500") // This will fail and retry
}
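
Note that the collector above will retry a permanently failing URL forever. Below is a minimal bounded variant, assuming that Colly carries the request's Ctx (its key/value store) across Retry() calls, so the attempt count survives each retry:

const maxAttempts = 3

c.OnError(func(r *colly.Response, err error) {
    // The request context is reused when Retry() re-issues the request,
    // so the attempt counter persists without any global state.
    attempt, _ := r.Request.Ctx.GetAny("attempt").(int)
    if attempt < maxAttempts {
        r.Request.Ctx.Put("attempt", attempt+1)
        r.Request.Retry()
        return
    }
    fmt.Printf("Giving up on %s after %d attempts\n", r.Request.URL, maxAttempts)
})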

Advanced Retry Logic with Conditions

For production applications, you'll want more sophisticated retry logic that considers the type of error and implements exponential backoff:

package main

import (
    "fmt"
    "net/http"
    "time"

    "github.com/gocolly/colly/v2"
)

const (
    maxRetries = 3
    baseDelay  = 1 * time.Second
)

func main() {
    c := colly.NewCollector()

    // Track retry attempts per URL. A plain map is safe while the collector
    // runs synchronously; guard it with a mutex if you enable Async.
    retryCount := make(map[string]int)

    c.OnError(func(r *colly.Response, err error) {
        url := r.Request.URL.String()

        // Check if we should retry based on error type
        if shouldRetry(r, err) {
            count := retryCount[url]
            if count < maxRetries {
                retryCount[url] = count + 1

                // Exponential backoff: 1s, 2s, 4s, ...
                delay := baseDelay * time.Duration(1<<count)
                fmt.Printf("Retrying %s (attempt %d/%d) after %v\n", 
                    url, count+1, maxRetries, delay)

                time.Sleep(delay)
                r.Request.Retry()
            } else {
                fmt.Printf("Max retries exceeded for %s\n", url)
                delete(retryCount, url)
            }
        } else {
            fmt.Printf("Non-retryable error for %s: %s\n", url, err.Error())
        }
    })

    c.OnHTML("title", func(e *colly.HTMLElement) {
        fmt.Println("Title:", e.Text)
    })

    c.Visit("https://httpbin.org/status/502")
}

func shouldRetry(r *colly.Response, err error) bool {
    // Retry on network errors; Colly reports transport-level failures
    // with a status code of 0.
    if r == nil || r.StatusCode == 0 {
        return true
    }

    // Retry on specific HTTP status codes
    switch r.StatusCode {
    case http.StatusTooManyRequests, // 429
         http.StatusInternalServerError, // 500
         http.StatusBadGateway,         // 502
         http.StatusServiceUnavailable, // 503
         http.StatusGatewayTimeout:     // 504
        return true
    }

    return false
}

Retry Logic with Custom Delay Strategies

You can implement different delay strategies for retries:

package main

import (
    "fmt"
    "math"
    "math/rand"
    "time"

    "github.com/gocolly/colly/v2"
)

type RetryConfig struct {
    MaxRetries int
    BaseDelay  time.Duration
    MaxDelay   time.Duration
    Strategy   string // "exponential", "linear", "fixed"
}

func main() {
    c := colly.NewCollector()

    config := RetryConfig{
        MaxRetries: 5,
        BaseDelay:  time.Second,
        MaxDelay:   30 * time.Second,
        Strategy:   "exponential",
    }

    retryTracker := make(map[string]int)

    c.OnError(func(r *colly.Response, err error) {
        url := r.Request.URL.String()

        if shouldRetry(r, err) {
            count := retryTracker[url]
            if count < config.MaxRetries {
                retryTracker[url] = count + 1

                delay := calculateDelay(config, count)
                fmt.Printf("Retrying %s (attempt %d/%d) after %v\n", 
                    url, count+1, config.MaxRetries, delay)

                time.Sleep(delay)
                r.Request.Retry()
            } else {
                fmt.Printf("Max retries exceeded for %s\n", url)
                delete(retryTracker, url)
            }
        }
    })

    c.OnHTML("title", func(e *colly.HTMLElement) {
        fmt.Println("Successfully scraped:", e.Text)
        // Clean up successful requests from retry tracker
        delete(retryTracker, e.Request.URL.String())
    })

    c.Visit("https://httpbin.org/status/503")
}

func calculateDelay(config RetryConfig, attempt int) time.Duration {
    var delay time.Duration

    switch config.Strategy {
    case "exponential":
        // Exponential backoff with jitter
        delay = time.Duration(math.Pow(2, float64(attempt))) * config.BaseDelay
        jitter := time.Duration(rand.Intn(1000)) * time.Millisecond
        delay += jitter
    case "linear":
        // Linear backoff
        delay = time.Duration(attempt+1) * config.BaseDelay
    case "fixed":
        // Fixed delay
        delay = config.BaseDelay
    default:
        delay = config.BaseDelay
    }

    // Cap the delay at maximum
    if delay > config.MaxDelay {
        delay = config.MaxDelay
    }

    return delay
}

func shouldRetry(r *colly.Response, err error) bool {
    // A status code of 0 indicates a transport-level (network) error.
    if r == nil || r.StatusCode == 0 {
        return true
    }

    retryableStatusCodes := map[int]bool{
        429: true, // Too Many Requests
        500: true, // Internal Server Error
        502: true, // Bad Gateway
        503: true, // Service Unavailable
        504: true, // Gateway Timeout
    }

    return retryableStatusCodes[r.StatusCode]
}

Implementing Circuit Breaker Pattern

For high-volume scraping, you might want to implement a circuit breaker pattern to temporarily stop retrying a failing endpoint:

package main

import (
    "fmt"
    "sync"
    "time"

    "github.com/gocolly/colly/v2"
)

type CircuitBreaker struct {
    mutex        sync.RWMutex
    failureCount int
    lastFailure  time.Time
    threshold    int
    timeout      time.Duration
    state        string // "closed", "open", "half-open"
}

func NewCircuitBreaker(threshold int, timeout time.Duration) *CircuitBreaker {
    return &CircuitBreaker{
        threshold: threshold,
        timeout:   timeout,
        state:     "closed",
    }
}

func (cb *CircuitBreaker) CanExecute() bool {
    cb.mutex.Lock()
    defer cb.mutex.Unlock()

    switch cb.state {
    case "open":
        // Transition to half-open once the cool-down period has elapsed.
        if time.Since(cb.lastFailure) > cb.timeout {
            cb.state = "half-open"
            return true
        }
        return false
    case "half-open", "closed":
        return true
    default:
        return false
    }
}

func (cb *CircuitBreaker) RecordSuccess() {
    cb.mutex.Lock()
    defer cb.mutex.Unlock()
    cb.failureCount = 0
    cb.state = "closed"
}

func (cb *CircuitBreaker) RecordFailure() {
    cb.mutex.Lock()
    defer cb.mutex.Unlock()
    cb.failureCount++
    cb.lastFailure = time.Now()

    if cb.failureCount >= cb.threshold {
        cb.state = "open"
    }
}

func main() {
    c := colly.NewCollector()

    // Create circuit breakers for different domains
    circuitBreakers := make(map[string]*CircuitBreaker)
    retryTracker := make(map[string]int)

    c.OnError(func(r *colly.Response, err error) {
        url := r.Request.URL.String()
        domain := r.Request.URL.Host

        // Initialize circuit breaker if not exists
        if _, exists := circuitBreakers[domain]; !exists {
            circuitBreakers[domain] = NewCircuitBreaker(5, 30*time.Second)
        }

        cb := circuitBreakers[domain]
        cb.RecordFailure()

        if shouldRetry(r, err) && cb.CanExecute() {
            count := retryTracker[url]
            if count < 3 {
                retryTracker[url] = count + 1
                fmt.Printf("Retrying %s (attempt %d/3)\n", url, count+1)
                time.Sleep(time.Duration(count+1) * time.Second)
                r.Request.Retry()
            } else {
                fmt.Printf("Max retries exceeded for %s\n", url)
                delete(retryTracker, url)
            }
        } else {
            fmt.Printf("Circuit breaker open or non-retryable error for %s\n", domain)
        }
    })

    c.OnHTML("title", func(e *colly.HTMLElement) {
        domain := e.Request.URL.Host
        if cb, exists := circuitBreakers[domain]; exists {
            cb.RecordSuccess()
        }
        delete(retryTracker, e.Request.URL.String())
        fmt.Println("Success:", e.Text)
    })

    c.Visit("https://httpbin.org/status/502")
}

Retry Logic with Rate Limiting

Combine retry logic with rate limiting to avoid overwhelming servers:

package main

import (
    "context"
    "fmt"
    "time"

    "github.com/gocolly/colly/v2"
    "golang.org/x/time/rate"
)

func main() {
    c := colly.NewCollector()

    // Create rate limiter (1 request per second)
    limiter := rate.NewLimiter(1, 1)

    retryTracker := make(map[string]int)

    c.OnRequest(func(r *colly.Request) {
        // colly's r.Ctx is a key/value store, not a context.Context,
        // so pass a standard context to the limiter.
        if err := limiter.Wait(context.Background()); err != nil {
            return
        }
        fmt.Printf("Requesting: %s\n", r.URL.String())
    })

    c.OnError(func(r *colly.Response, err error) {
        url := r.Request.URL.String()

        if shouldRetry(r, err) {
            count := retryTracker[url]
            if count < 3 {
                retryTracker[url] = count + 1

                // Linear backoff on top of the rate limit
                delay := time.Duration(count+1) * 2 * time.Second
                fmt.Printf("Retrying %s after %v\n", url, delay)

                time.Sleep(delay)
                r.Request.Retry()
            }
        }
    })

    c.OnHTML("title", func(e *colly.HTMLElement) {
        delete(retryTracker, e.Request.URL.String())
        fmt.Println("Title:", e.Text)
    })

    c.Visit("https://httpbin.org/delay/2")
}

Best Practices for Retry Logic

1. Implement Proper Error Classification

Not all errors should trigger retries. Network timeouts and server errors (5xx) are good candidates, while client errors (4xx) typically are not; 429 Too Many Requests is the notable exception and usually deserves a retry after a delay.
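
Beyond status codes, the standard library lets you recognize transient network problems from the error itself. A minimal sketch using the net.Error interface (isTransient is an illustrative helper, not part of Colly):

import (
    "errors"
    "net"
)

// isTransient reports whether err looks like a temporary network problem
// worth retrying, such as a timeout.
func isTransient(err error) bool {
    var netErr net.Error
    return errors.As(err, &netErr) && netErr.Timeout()
}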

2. Use Exponential Backoff with Jitter

This prevents the "thundering herd" problem where multiple clients retry at the same time.
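
A popular variant is "full jitter", where the sleep is drawn uniformly between zero and the exponential cap. A minimal sketch (fullJitter is an illustrative helper; it needs math/rand and time):

// fullJitter returns a random delay in [0, base*2^attempt), never above max.
func fullJitter(base, max time.Duration, attempt int) time.Duration {
    ceiling := base * time.Duration(1<<attempt)
    if ceiling > max {
        ceiling = max
    }
    return time.Duration(rand.Int63n(int64(ceiling)))
}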

3. Set Maximum Retry Limits

Always limit the number of retry attempts to prevent infinite loops.

4. Monitor and Log Retry Attempts

Keep track of retry patterns to identify problematic endpoints or network issues.
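
Even a simple per-host counter, logged once the crawl finishes, is enough to spot endpoints that fail disproportionately often. A minimal sketch (the mutex keeps it safe for async collectors):

var (
    statsMu    sync.Mutex
    retryStats = make(map[string]int)
)

c.OnError(func(r *colly.Response, err error) {
    // Count failures per host so problem endpoints stand out in the logs.
    statsMu.Lock()
    retryStats[r.Request.URL.Host]++
    statsMu.Unlock()
    // ... retry decision as in the earlier examples ...
})

// Once the crawl is done:
for host, n := range retryStats {
    log.Printf("host %s: %d retries", host, n)
}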

5. Respect Rate Limits

Combine retry logic with rate limiting to avoid overwhelming target servers, similar to how you handle timeouts in Puppeteer for browser-based scraping.
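
In addition to an external limiter, Colly has a built-in per-domain limiter that combines naturally with retries; for example:

// At most 2 parallel requests per domain, with a 1s delay plus up to
// 1s of random extra delay between requests.
c.Limit(&colly.LimitRule{
    DomainGlob:  "*",
    Parallelism: 2,
    Delay:       1 * time.Second,
    RandomDelay: 1 * time.Second,
})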

6. Implement Circuit Breakers

For high-volume applications, use circuit breakers to temporarily stop retrying consistently failing endpoints.

Common Pitfalls to Avoid

  • Retrying non-idempotent operations: Be careful when retrying POST requests that might create duplicate data (see the sketch after this list)
  • Ignoring response bodies: Sometimes servers return useful information in error responses
  • Not implementing proper cleanup: Clean up retry tracking data for successful requests
  • Overly aggressive retries: This can lead to being blocked by target websites
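
One simple guard against the first pitfall is to restrict automatic retries to idempotent methods; a minimal sketch:

c.OnError(func(r *colly.Response, err error) {
    // Only re-issue safe, idempotent requests automatically; anything else
    // (POST, PUT, ...) deserves a deliberate, application-level decision.
    if r.Request.Method != "GET" && r.Request.Method != "HEAD" {
        fmt.Printf("Not retrying %s %s\n", r.Request.Method, r.Request.URL)
        return
    }
    r.Request.Retry()
})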

Conclusion

Implementing robust retry logic in Colly is essential for building reliable web scrapers. By combining proper error handling, exponential backoff, rate limiting, and circuit breaker patterns, you can create scrapers that gracefully handle temporary failures while respecting server resources. Remember to always monitor your retry patterns and adjust your strategy based on the specific requirements of your target websites.

For handling errors in other web scraping contexts, you might also want to learn about handling errors in Puppeteer for browser-based scraping scenarios.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
