How do I implement rate limiting in Colly to avoid being blocked?

Rate limiting is crucial when using Colly for web scraping to avoid overwhelming target servers and prevent your scraper from being blocked. Colly provides several built-in mechanisms to control request frequency and implement effective rate limiting strategies.

Understanding Rate Limiting in Web Scraping

Rate limiting controls how frequently your scraper sends requests to a target website. Without proper rate limiting, you risk:

  • Getting your IP address blocked
  • Triggering anti-bot measures
  • Overloading the target server
  • Violating website terms of service

Basic Rate Limiting with Delays

Using Limit() for Simple Rate Limiting

The most straightforward approach is using Colly's Limit() method to set a delay between requests:

package main

import (
    "fmt"
    "time"

    "github.com/gocolly/colly/v2"
    "github.com/gocolly/colly/v2/debug"
)

func main() {
    c := colly.NewCollector(
        colly.Async(true), // async mode makes Parallelism and c.Wait() effective
        colly.Debugger(&debug.LogDebugger{}),
    )

    // Set a 2-second delay between requests
    c.Limit(&colly.LimitRule{
        DomainGlob:  "*",
        Parallelism: 1,
        Delay:       2 * time.Second,
    })

    c.OnHTML("a[href]", func(e *colly.HTMLElement) {
        link := e.Attr("href")
        fmt.Printf("Found link: %s\n", link)
        e.Request.Visit(link)
    })

    c.OnRequest(func(r *colly.Request) {
        fmt.Printf("Visiting: %s\n", r.URL.String())
    })

    c.Visit("https://example.com")
    c.Wait()
}

Domain-Specific Rate Limiting

You can set different rate limits for different domains:

func setupDomainSpecificLimits(c *colly.Collector) {
    // Slower rate for main domain
    c.Limit(&colly.LimitRule{
        DomainGlob:  "example.com",
        Parallelism: 1,
        Delay:       3 * time.Second,
    })

    // Faster rate for API endpoints
    c.Limit(&colly.LimitRule{
        DomainGlob:  "api.example.com",
        Parallelism: 2,
        Delay:       1 * time.Second,
    })

    // Very conservative for sensitive domains
    c.Limit(&colly.LimitRule{
        DomainGlob:  "sensitive-site.com",
        Parallelism: 1,
        Delay:       5 * time.Second,
    })
}
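
If you prefer to register all rules in one call, Colly also provides a Limits() helper that accepts a slice of LimitRule values. A minimal sketch, assuming the same example domains as above:

func setupLimitsAtOnce(c *colly.Collector) error {
    // Registers several rules at once; each request is matched against
    // the rules' DomainGlob patterns to pick its delay and parallelism
    return c.Limits([]*colly.LimitRule{
        {DomainGlob: "api.example.com", Parallelism: 2, Delay: 1 * time.Second},
        {DomainGlob: "*", Parallelism: 1, Delay: 3 * time.Second},
    })
}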

Advanced Rate Limiting Strategies

Random Delays to Mimic Human Behavior

Adding randomness to your delays makes your scraper appear more human-like. Colly supports this directly through the RandomDelay field of LimitRule, which adds a random duration of up to the configured value on top of Delay:

package main

import (
    "fmt"
    "time"

    "github.com/gocolly/colly/v2"
)

func main() {
    c := colly.NewCollector(colly.Async(true))

    // Each request waits Delay plus a random duration up to RandomDelay,
    // so the effective pause varies between 2 and 5 seconds
    c.Limit(&colly.LimitRule{
        DomainGlob:  "*",
        Parallelism: 1,
        Delay:       2 * time.Second,
        RandomDelay: 3 * time.Second,
    })

    c.OnRequest(func(r *colly.Request) {
        fmt.Println("Visiting:", r.URL)
    })

    // Your scraping logic here
    c.Visit("https://example.com")
    c.Wait()
}

Adaptive Rate Limiting Based on Response

Implement adaptive rate limiting that adjusts based on server responses:

package main

import (
    "fmt"
    "time"

    "github.com/gocolly/colly/v2"
)

// AdaptiveRateLimiter slows down after errors and speeds back up on success.
type AdaptiveRateLimiter struct {
    currentDelay time.Duration // delay applied before each request
    minDelay     time.Duration // lower bound reached after repeated successes
    maxDelay     time.Duration // upper bound reached after repeated 429s
    backoffRate  float64       // multiplier applied when rate limited
}

func NewAdaptiveRateLimiter() *AdaptiveRateLimiter {
    return &AdaptiveRateLimiter{
        currentDelay: 1 * time.Second,
        minDelay:     500 * time.Millisecond,
        maxDelay:     10 * time.Second,
        backoffRate:  1.5,
    }
}

func (arl *AdaptiveRateLimiter) adjustDelay(statusCode int) {
    switch {
    case statusCode == 429: // Too Many Requests
        arl.currentDelay = time.Duration(float64(arl.currentDelay) * arl.backoffRate)
        if arl.currentDelay > arl.maxDelay {
            arl.currentDelay = arl.maxDelay
        }
    case statusCode >= 500: // Server errors
        arl.currentDelay = time.Duration(float64(arl.currentDelay) * 1.2)
    case statusCode == 200: // Success - can reduce delay
        arl.currentDelay = time.Duration(float64(arl.currentDelay) * 0.9)
        if arl.currentDelay < arl.minDelay {
            arl.currentDelay = arl.minDelay
        }
    }
}

func main() {
    c := colly.NewCollector()
    rateLimiter := NewAdaptiveRateLimiter()

    // Successful (2xx) responses arrive here
    c.OnResponse(func(r *colly.Response) {
        rateLimiter.adjustDelay(r.StatusCode)
        fmt.Printf("Status: %d, next delay: %v\n", r.StatusCode, rateLimiter.currentDelay)
    })

    // Colly delivers non-2xx statuses (429, 5xx, ...) via OnError, not OnResponse
    c.OnError(func(r *colly.Response, err error) {
        rateLimiter.adjustDelay(r.StatusCode)
        fmt.Printf("Error status: %d, next delay: %v\n", r.StatusCode, rateLimiter.currentDelay)
    })

    c.OnRequest(func(r *colly.Request) {
        time.Sleep(rateLimiter.currentDelay)
    })

    // Your scraping logic
    c.Visit("https://example.com")
}

Implementing Request Queues with Rate Limiting

For more complex scenarios, implement a request queue with sophisticated rate limiting:

package main

import (
    "context"
    "net/http"
    "sync"
    "time"

    "github.com/gocolly/colly/v2"
)

type RateLimitedQueue struct {
    requests chan *colly.Request
    done     chan bool
    delay    time.Duration
    mu       sync.RWMutex
}

func NewRateLimitedQueue(delay time.Duration, bufferSize int) *RateLimitedQueue {
    return &RateLimitedQueue{
        requests: make(chan *colly.Request, bufferSize),
        done:     make(chan bool),
        delay:    delay,
    }
}

func (rlq *RateLimitedQueue) Start(ctx context.Context, c *colly.Collector) {
    for {
        // Read the delay under lock so SetDelay takes effect between requests
        rlq.mu.RLock()
        delay := rlq.delay
        rlq.mu.RUnlock()

        select {
        case <-ctx.Done():
            return
        case <-rlq.done:
            return
        case <-time.After(delay):
            select {
            case req := <-rlq.requests:
                var hdr http.Header
                if req.Headers != nil {
                    hdr = *req.Headers // colly stores headers as *http.Header
                }
                c.Request(req.Method, req.URL.String(), req.Body, req.Ctx, hdr)
            default:
                // No requests in queue this tick
            }
        }
    }
}

func (rlq *RateLimitedQueue) AddRequest(req *colly.Request) {
    select {
    case rlq.requests <- req:
    default:
        // Queue is full, handle accordingly
    }
}

func (rlq *RateLimitedQueue) SetDelay(delay time.Duration) {
    rlq.mu.Lock()
    defer rlq.mu.Unlock()
    rlq.delay = delay
}

func (rlq *RateLimitedQueue) Stop() {
    close(rlq.done)
}
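
Before hand-rolling a queue like this, it's worth knowing that Colly ships a ready-made queue package that cooperates with LimitRule. A minimal sketch using github.com/gocolly/colly/v2/queue (the URLs are placeholders):

package main

import (
    "fmt"
    "time"

    "github.com/gocolly/colly/v2"
    "github.com/gocolly/colly/v2/queue"
)

func main() {
    c := colly.NewCollector()
    c.Limit(&colly.LimitRule{
        DomainGlob:  "*",
        Parallelism: 2,
        Delay:       1 * time.Second,
    })

    // Two consumer threads backed by an in-memory store of up to 10000 URLs
    q, err := queue.New(2, &queue.InMemoryQueueStorage{MaxSize: 10000})
    if err != nil {
        panic(err)
    }

    for i := 1; i <= 5; i++ {
        q.AddURL(fmt.Sprintf("https://example.com/page/%d", i))
    }

    c.OnRequest(func(r *colly.Request) {
        fmt.Println("Visiting:", r.URL)
    })

    // Run blocks until the queue is drained
    q.Run(c)
}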

Rate Limiting Best Practices

1. Respect robots.txt

Always check and respect the target website's robots.txt file. Colly can enforce it for you; make sure the check is enabled and identify your bot with a descriptive User-Agent:

func newPoliteCollector() *colly.Collector {
    c := colly.NewCollector(
        colly.UserAgent("YourBot/1.0"),
    )

    // Ensure robots.txt is fetched and enforced; requests to
    // disallowed URLs fail with an error instead of being sent
    c.IgnoreRobotsTxt = false

    return c
}
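
If you need to consult robots.txt yourself, for example to filter links before queueing them, here is a minimal sketch using github.com/temoto/robotstxt, the parser Colly relies on internally; the host, path, and agent values are placeholders:

package main

import (
    "fmt"
    "io"
    "net/http"

    "github.com/temoto/robotstxt"
)

// checkAllowed fetches robots.txt for a host and tests one path against it
func checkAllowed(host, path, agent string) (bool, error) {
    resp, err := http.Get("https://" + host + "/robots.txt")
    if err != nil {
        return false, err
    }
    defer resp.Body.Close()

    data, err := io.ReadAll(resp.Body)
    if err != nil {
        return false, err
    }

    robots, err := robotstxt.FromBytes(data)
    if err != nil {
        return false, err
    }
    return robots.TestAgent(path, agent), nil
}

func main() {
    allowed, err := checkAllowed("example.com", "/some/page", "YourBot")
    if err != nil {
        panic(err)
    }
    fmt.Println("Allowed by robots.txt:", allowed)
}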

2. Monitor Response Codes

Implement monitoring to detect when you're being rate limited:

func monitorResponseCodes(c *colly.Collector) {
    var (
        successCount int
        errorCount   int
        rateLimited  int
    )

    // Log statistics periodically
    logStats := func() {
        if (successCount+errorCount+rateLimited)%100 == 0 {
            fmt.Printf("Stats - Success: %d, Errors: %d, Rate Limited: %d\n",
                successCount, errorCount, rateLimited)
        }
    }

    // Successful responses
    c.OnResponse(func(r *colly.Response) {
        successCount++
        logStats()
    })

    // Colly reports non-2xx statuses through OnError, so 429s land here
    c.OnError(func(r *colly.Response, err error) {
        switch r.StatusCode {
        case 429:
            rateLimited++
            fmt.Printf("Rate limited! Total: %d\n", rateLimited)
            // Increase your delay here
        case 403, 503:
            errorCount++
            fmt.Printf("Possible blocking detected: %d\n", r.StatusCode)
        default:
            errorCount++
        }
        logStats()
    })
}
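
A common follow-up to detecting a 429 is to pause and re-queue the request. The sketch below uses Request.Retry() and honors a numeric Retry-After header when the server sends one; the 10-second fallback is an arbitrary choice:

import (
    "strconv"
    "time"

    "github.com/gocolly/colly/v2"
)

func retryOnRateLimit(c *colly.Collector) {
    c.OnError(func(r *colly.Response, err error) {
        if r.StatusCode != 429 {
            return
        }

        // Prefer the server's own backoff hint when present
        wait := 10 * time.Second
        if ra := r.Headers.Get("Retry-After"); ra != "" {
            if secs, convErr := strconv.Atoi(ra); convErr == nil {
                wait = time.Duration(secs) * time.Second
            }
        }

        time.Sleep(wait)
        r.Request.Retry() // re-queues the same request
    })
}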

3. Use Proxy Rotation

For additional protection, combine rate limiting with proxy rotation:

import (
    "time"

    "github.com/gocolly/colly/v2"
    "github.com/gocolly/colly/v2/proxy"
)

func setupProxyRotation(c *colly.Collector) error {
    rp, err := proxy.RoundRobinProxySwitcher(
        "http://proxy1:8080",
        "http://proxy2:8080",
        "http://proxy3:8080",
    )
    if err != nil {
        return err
    }

    c.SetProxyFunc(rp)

    // Combine with rate limiting
    c.Limit(&colly.LimitRule{
        DomainGlob:  "*",
        Parallelism: 1,
        Delay:       2 * time.Second,
    })

    return nil
}

Testing Your Rate Limiting

Create a simple test to verify your rate limiting works:

# Monitor network traffic while running your scraper
sudo tcpdump -i any -n host example.com

# Use time command to measure execution
time go run your_scraper.go

You can also verify the timing in Go:

// Test function to verify rate limiting
func testRateLimit() {
    start := time.Now()
    requestCount := 10
    expectedDuration := time.Duration(requestCount-1) * 2 * time.Second

    c := colly.NewCollector(colly.Async(true)) // async so c.Wait() below blocks until done
    c.Limit(&colly.LimitRule{
        DomainGlob:  "*",
        Parallelism: 1,
        Delay:       2 * time.Second,
    })

    var actualRequests int
    c.OnRequest(func(r *colly.Request) {
        actualRequests++
        fmt.Printf("Request %d at %v\n", actualRequests, time.Since(start))
    })

    // Make test requests
    for i := 0; i < requestCount; i++ {
        c.Visit(fmt.Sprintf("https://httpbin.org/delay/0?req=%d", i))
    }

    c.Wait()
    actualDuration := time.Since(start)

    fmt.Printf("Expected duration: %v, Actual: %v\n", expectedDuration, actualDuration)
}

Integration with Modern Scraping Solutions

While Colly provides excellent rate limiting capabilities, modern web scraping often requires more sophisticated approaches. For JavaScript-heavy sites, consider complementing Colly with a browser automation tool such as Puppeteer, which can render dynamic content that Colly's plain HTTP requests cannot reach.

Conclusion

Implementing effective rate limiting in Colly is essential for successful web scraping. Start with simple delays using Limit(), then progressively implement more sophisticated strategies like adaptive delays, request queues, and response monitoring. Always respect the target website's resources and terms of service.

Key takeaways:

  • Use colly.LimitRule for basic rate limiting
  • Implement random delays to appear more human-like
  • Monitor response codes to detect blocking
  • Combine rate limiting with proxy rotation for better protection
  • Test your implementation to ensure it works as expected

Remember that effective rate limiting is not just about avoiding blocks—it's about being a responsible web scraper that respects server resources and maintains access to the data you need.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
