What are the common anti-scraping measures that affect Colly?

Web scraping with Colly, the powerful Go-based framework, can encounter various anti-scraping measures that websites implement to protect their content and resources. Understanding these measures and how to handle them is crucial for building robust scraping applications. This guide covers the most common anti-scraping techniques and provides practical solutions for overcoming them.

1. Rate Limiting and Request Throttling

Rate limiting is one of the most common anti-scraping measures that restricts the number of requests from a single IP address within a specific time frame.

How it affects Colly:

  • HTTP 429 (Too Many Requests) responses
  • Connection timeouts
  • Temporary IP bans

Solution:

Implement proper delays between requests using Colly's rate limiting features:

package main

import (
    "fmt"
    "time"
    "github.com/gocolly/colly/v2"
    "github.com/gocolly/colly/v2/debug"
)

func main() {
    c := colly.NewCollector(
        colly.Async(true), // async mode so Parallelism and Wait() take effect
        colly.Debugger(&debug.LogDebugger{}),
    )

    // Set rate limiting: 1 request per second plus up to 500ms of random delay
    err := c.Limit(&colly.LimitRule{
        DomainGlob:  "*",
        Parallelism: 1,
        Delay:       1 * time.Second,
        RandomDelay: 500 * time.Millisecond,
    })
    if err != nil {
        fmt.Println("failed to set limit rule:", err)
    }

    c.OnHTML("a[href]", func(e *colly.HTMLElement) {
        e.Request.Visit(e.Attr("href"))
    })

    c.OnRequest(func(r *colly.Request) {
        fmt.Printf("Visiting: %s\n", r.URL)
    })

    c.Visit("https://example.com")
    c.Wait()
}

2. IP Address Blocking

Websites may block specific IP addresses or IP ranges that exhibit suspicious behavior.

Detection methods:

  • Monitoring request frequency
  • Analyzing request patterns
  • Tracking user behavior anomalies

Solutions:

Using a Proxy:

package main

import (
    "log"

    "github.com/gocolly/colly/v2"
)

func main() {
    c := colly.NewCollector()

    // Route all requests from this collector through a proxy
    if err := c.SetProxy("http://proxy-server:8080"); err != nil {
        log.Fatal(err)
    }

    // Send realistic headers on every request
    c.OnRequest(func(r *colly.Request) {
        r.Headers.Set("User-Agent", "Mozilla/5.0...")
    })

    // Your scraping logic here
    c.Visit("https://target-website.com")
}

Using Multiple Collectors with Different Configurations:

func createCollectorWithProxy(proxyURL string) *colly.Collector {
    c := colly.NewCollector()

    if proxyURL != "" {
        // Route every request from this collector through the given proxy
        if err := c.SetProxy(proxyURL); err != nil {
            log.Println("invalid proxy:", err)
        }
    }

    return c
}

// Use different proxies for different requests
proxies := []string{
    "http://proxy1:8080",
    "http://proxy2:8080",
    "http://proxy3:8080",
}

for i, targetURL := range urls {
    c := createCollectorWithProxy(proxies[i%len(proxies)])
    c.Visit(targetURL)
}
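
Colly also ships a round-robin proxy switcher in its proxy sub-package, which lets a single collector rotate through several proxies on a per-request basis instead of creating one collector per proxy. A minimal sketch (the proxy URLs are placeholders):

package main

import (
    "log"

    "github.com/gocolly/colly/v2"
    "github.com/gocolly/colly/v2/proxy"
)

func main() {
    c := colly.NewCollector()

    // Rotate through the listed proxies in round-robin order
    switcher, err := proxy.RoundRobinProxySwitcher(
        "http://proxy1:8080",
        "http://proxy2:8080",
        "http://proxy3:8080",
    )
    if err != nil {
        log.Fatal(err)
    }
    c.SetProxyFunc(switcher)

    c.Visit("https://target-website.com")
}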

3. User Agent Detection

Many websites block requests from known scraping tools by checking the User-Agent header.

Common blocked User-Agents:

  • Default Go HTTP client User-Agent
  • Obvious bot identifiers
  • Missing or malformed User-Agent strings

Solution:

Rotate realistic User-Agent strings:

package main

import (
    "math/rand"
    "time"
    "github.com/gocolly/colly/v2"
)

var userAgents = []string{
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/605.1.15",
}

func main() {
    // Seed the PRNG once rather than on every request
    // (Go 1.20+ seeds math/rand automatically)
    rand.Seed(time.Now().UnixNano())

    c := colly.NewCollector()

    c.OnRequest(func(r *colly.Request) {
        // Randomly select a User-Agent
        userAgent := userAgents[rand.Intn(len(userAgents))]
        r.Headers.Set("User-Agent", userAgent)

        // Add other realistic headers
        r.Headers.Set("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8")
        r.Headers.Set("Accept-Language", "en-US,en;q=0.5")
        r.Headers.Set("Accept-Encoding", "gzip, deflate")
        r.Headers.Set("Connection", "keep-alive")
    })

    c.Visit("https://example.com")
}
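
If you do not want to maintain your own list, Colly's extensions sub-package can assign a random User-Agent from a built-in pool on every request. A minimal sketch:

package main

import (
    "github.com/gocolly/colly/v2"
    "github.com/gocolly/colly/v2/extensions"
)

func main() {
    c := colly.NewCollector()

    // Sets a random User-Agent header on each outgoing request
    extensions.RandomUserAgent(c)

    c.Visit("https://example.com")
}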

4. JavaScript-Based Protection

Modern websites often rely on JavaScript to render content or implement anti-bot measures.

Common JavaScript protection methods:

  • Dynamic content loading
  • Browser fingerprinting
  • Challenge-response mechanisms
  • Client-side rendering

Limitations of Colly:

Colly cannot execute JavaScript, which limits its ability to scrape JavaScript-heavy websites. For such cases, consider using a headless browser such as chromedp or Puppeteer, or combining Colly with browser automation tools.

Alternative approach:

// For JavaScript-heavy sites, you might need to:
// 1. Use a headless browser (e.g. chromedp) to render the page
// 2. Extract the rendered HTML
// 3. Parse it with goquery (github.com/PuerkitoBio/goquery), the same
//    HTML library Colly uses internally

func scrapeWithPreRendering(url string) {
    // renderPageWithBrowser is a placeholder for a chromedp call or an
    // external service that returns the fully rendered HTML
    renderedHTML := renderPageWithBrowser(url)

    // Colly's OnHTML callbacks only fire for pages the collector fetches
    // itself, so parse the pre-rendered markup with goquery directly
    doc, err := goquery.NewDocumentFromReader(strings.NewReader(renderedHTML))
    if err != nil {
        log.Fatal(err)
    }

    doc.Find("div.content").Each(func(_ int, s *goquery.Selection) {
        // Process the pre-rendered content
        fmt.Println(s.Text())
    })
}
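
For reference, one way to implement the renderPageWithBrowser placeholder is with the chromedp package, which drives headless Chrome from Go. This is a minimal sketch, not a complete integration:

import (
    "context"
    "log"

    "github.com/chromedp/chromedp"
)

// renderPageWithBrowser navigates to the URL in headless Chrome and returns
// the fully rendered HTML of the page
func renderPageWithBrowser(url string) string {
    ctx, cancel := chromedp.NewContext(context.Background())
    defer cancel()

    var html string
    err := chromedp.Run(ctx,
        chromedp.Navigate(url),
        chromedp.OuterHTML("html", &html),
    )
    if err != nil {
        log.Fatal(err)
    }
    return html
}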

5. CAPTCHA Challenges

CAPTCHA systems are designed to differentiate between human users and automated bots.

Types of CAPTCHA:

  • Image-based puzzles
  • reCAPTCHA v2/v3
  • hCaptcha
  • Audio challenges

Handling CAPTCHA with Colly:

c.OnResponse(func(r *colly.Response) {
    // Detect CAPTCHA presence; a case-insensitive check for "captcha"
    // also matches "recaptcha" and "hcaptcha"
    if strings.Contains(strings.ToLower(string(r.Body)), "captcha") {

        fmt.Printf("CAPTCHA detected on %s\n", r.Request.URL)

        // Options:
        // 1. Use CAPTCHA solving services
        // 2. Implement manual intervention
        // 3. Skip the request and try later
        // 4. Use alternative data sources

        handleCaptchaResponse(r)
    }
})

func handleCaptchaResponse(r *colly.Response) {
    // Implement your CAPTCHA handling strategy
    // This might involve:
    // - Pausing execution
    // - Switching to a different IP/proxy
    // - Using CAPTCHA solving services
    // - Manual intervention
}

6. Session and Cookie Management

Websites may track user sessions and detect bot-like behavior through cookie analysis.

Implementation:

import (
    "net/http/cookiejar"

    "github.com/gocolly/colly/v2"
)

func main() {
    c := colly.NewCollector()

    // Colly keeps session cookies automatically; supplying your own jar
    // gives you direct access to the stored cookies if you need it
    jar, _ := cookiejar.New(nil)
    c.SetCookieJar(jar)

    c.OnRequest(func(r *colly.Request) {
        r.Headers.Set("User-Agent", "Mozilla/5.0...")
    })

    // Visit pages that set session cookies first
    c.Visit("https://example.com/login")
    c.Visit("https://example.com/protected-content")
}
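
If you already have valid session cookies (for example, exported from a logged-in browser session), you can seed the collector with them before visiting protected pages. The sketch below reuses the collector c from the example above; the cookie name and value are placeholders, and the *http.Cookie type comes from net/http:

// Pre-load an existing session cookie into the collector's jar
cookies := []*http.Cookie{
    {
        Name:   "session_id",
        Value:  "your-session-value",
        Domain: "example.com",
        Path:   "/",
    },
}
if err := c.SetCookies("https://example.com", cookies); err != nil {
    fmt.Println("failed to set cookies:", err)
}

c.Visit("https://example.com/protected-content")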

7. Behavioral Analysis and Fingerprinting

Advanced anti-scraping systems analyze browsing patterns to detect bots.

Detected behaviors:

  • Perfect timing between requests
  • Lack of mouse movements
  • Missing browser events
  • Unrealistic browsing patterns

Mitigation strategies:

func humanizeBehavior(c *colly.Collector) {
    // Add random delays
    c.Limit(&colly.LimitRule{
        DomainGlob:  "*",
        Parallelism: 1,
        Delay:       2 * time.Second,
        RandomDelay: 3 * time.Second, // Random delay up to 3 seconds
    })

    // Simulate realistic browsing patterns
    c.OnRequest(func(r *colly.Request) {
        // Add a plausible Referer header for pages reached via links
        // (the Host request header is normally empty at this point)
        if r.Depth > 0 {
            r.Headers.Set("Referer", "https://"+r.URL.Host+"/")
        }

        // Simulate realistic request timing
        if r.Depth > 1 {
            time.Sleep(time.Duration(rand.Intn(2000)) * time.Millisecond)
        }
    })
}
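
Colly's extensions sub-package can also set the Referer header automatically to the URL of the page where each link was found, which looks more natural than a hard-coded value; note that it only applies to requests issued via e.Request.Visit from inside callbacks. A minimal sketch, applied to the same collector c:

import "github.com/gocolly/colly/v2/extensions"

// Sets Referer on each request to the page where the link was discovered
// (only for requests made with e.Request.Visit inside callbacks)
extensions.Referer(c)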

8. SSL/TLS Certificate Validation

Some websites inspect the client's TLS handshake to detect non-browser clients, and misconfigured TLS settings on the scraper side can also cause handshake failures. Keeping the client's TLS configuration sane and up to date avoids the most common problems.

Solution:

import (
    "crypto/tls"
    "net/http"
)

func main() {
    c := colly.NewCollector()

    // Configure TLS settings
    transport := &http.Transport{
        TLSClientConfig: &tls.Config{
            InsecureSkipVerify: false, // Set to true only for testing
            MinVersion:         tls.VersionTLS12,
        },
    }

    c.WithTransport(transport)
    c.Visit("https://secure-website.com")
}

Best Practices for Avoiding Detection

1. Respect robots.txt

c := colly.NewCollector(
    colly.Async(true),
)

// Respect robots.txt (Colly ignores it by default)
c.IgnoreRobotsTxt = false

2. Implement Comprehensive Error Handling

c.OnError(func(r *colly.Response, err error) {
    fmt.Printf("Error %d: %s\n", r.StatusCode, err.Error())

    switch r.StatusCode {
    case 429: // Too Many Requests
        // Exponential backoff (^ is XOR in Go, so use a bit shift for powers of two)
        time.Sleep(time.Duration(1<<retryCount) * time.Second)
        r.Request.Retry()
    case 403, 401: // Forbidden/Unauthorized
        // Switch proxy or user agent
        switchProxy()
        r.Request.Retry()
    case 503: // Service Unavailable
        // Wait and retry
        time.Sleep(10 * time.Second)
        r.Request.Retry()
    }
})
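
The snippet above assumes retryCount and switchProxy helpers that are not shown here. One way to track retry attempts per request is Colly's request context; the helper below is an illustrative sketch with a hypothetical maxRetries cap, not part of Colly's API:

// Illustrative helper: track per-request retry attempts in the request
// context and back off exponentially before re-queueing the request
func retryWithBackoff(r *colly.Response, maxRetries int) {
    attempts := 0
    if v := r.Request.Ctx.GetAny("retries"); v != nil {
        attempts = v.(int)
    }
    if attempts >= maxRetries {
        return // give up on this URL
    }
    r.Request.Ctx.Put("retries", attempts+1)

    // Exponential backoff: 1s, 2s, 4s, ...
    time.Sleep(time.Duration(1<<attempts) * time.Second)
    r.Request.Retry()
}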

3. Monitor and Adapt

func monitorRequests(c *colly.Collector) {
    requestCount := 0
    errorCount := 0

    c.OnRequest(func(r *colly.Request) {
        requestCount++
        if requestCount%100 == 0 {
            fmt.Printf("Sent %d requests, %d errors\n", requestCount, errorCount)
        }
    })

    c.OnError(func(r *colly.Response, err error) {
        errorCount++

        // If error rate is too high, adjust strategy
        if float64(errorCount)/float64(requestCount) > 0.1 {
            fmt.Println("High error rate detected, adjusting strategy...")
            adjustScrapingStrategy()
        }
    })
}

Conclusion

Successfully scraping websites with Colly requires understanding and adapting to various anti-scraping measures. The key is to make your scraper behave as much like a human user as possible while respecting website resources and terms of service. For JavaScript-heavy websites that Colly cannot handle effectively, consider complementing it with browser automation tools such as chromedp or Puppeteer for more complex scenarios.

Remember to always check a website's robots.txt file and terms of service before scraping, and implement proper rate limiting to avoid overwhelming the target servers. When anti-scraping measures become too sophisticated, consider using professional web scraping APIs that handle these challenges automatically.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
