How do I set up custom request middleware in Colly?

Custom request middleware in Colly lets you intercept and modify HTTP requests before they are sent to the target server. You can use it to add authentication, set custom headers, retry failed requests, log traffic, and more. Knowing how to write and register these callbacks is essential for building robust web scraping applications with Colly.

Understanding Colly Middleware

Colly has no separate middleware type. What this article calls middleware is a set of callback functions that run at specific points in the request lifecycle. The most common callbacks are:

  • OnRequest: Executed before sending the request
  • OnResponse: Executed after receiving the response
  • OnHTML: Executed for each element in an HTML response that matches the given selector
  • OnError: Executed when an error occurs

Basic Request Middleware Setup

Here's how to set up basic request middleware in Colly:

package main

import (
    "fmt"

    "github.com/gocolly/colly/v2"
)

func main() {
    c := colly.NewCollector()

    // Basic request middleware
    c.OnRequest(func(r *colly.Request) {
        fmt.Printf("Visiting: %s\n", r.URL.String())

        // Add custom headers
        r.Headers.Set("User-Agent", "MyBot/1.0")
        r.Headers.Set("Accept", "text/html,application/xhtml+xml")
    })

    // Response middleware
    c.OnResponse(func(r *colly.Response) {
        fmt.Printf("Response received: %d bytes from %s\n", 
            len(r.Body), r.Request.URL)
    })

    // Error handling middleware
    c.OnError(func(r *colly.Response, err error) {
        fmt.Printf("Error occurred: %s\n", err.Error())
    })

    c.Visit("https://example.com")
}

Advanced Authentication Middleware

For websites requiring authentication, you can create middleware to handle various authentication methods:

package main

import (
    "encoding/base64"

    "github.com/gocolly/colly/v2"
)

// Basic Authentication Middleware
func basicAuthMiddleware(username, password string) func(*colly.Request) {
    return func(r *colly.Request) {
        auth := username + ":" + password
        encoded := base64.StdEncoding.EncodeToString([]byte(auth))
        r.Headers.Set("Authorization", "Basic "+encoded)
    }
}

// Bearer Token Middleware
func bearerTokenMiddleware(token string) func(*colly.Request) {
    return func(r *colly.Request) {
        r.Headers.Set("Authorization", "Bearer "+token)
    }
}

// API Key Middleware
func apiKeyMiddleware(keyName, keyValue string) func(*colly.Request) {
    return func(r *colly.Request) {
        r.Headers.Set(keyName, keyValue)
    }
}

func main() {
    c := colly.NewCollector()

    // Apply authentication middleware
    c.OnRequest(basicAuthMiddleware("myuser", "mypassword"))

    // Or use bearer token
    // c.OnRequest(bearerTokenMiddleware("your-jwt-token"))

    // Or use API key
    // c.OnRequest(apiKeyMiddleware("X-API-Key", "your-api-key"))

    c.Visit("https://protected-site.com")
}
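
One way to sanity-check the hand-built Basic header above is to compare it with the value that net/http's SetBasicAuth produces. The sketch below uses only the standard library; basicAuthValue and stdlibBasicAuthValue are illustrative helper names, not part of Colly.

```go
package main

import (
    "encoding/base64"
    "fmt"
    "net/http"
)

// basicAuthValue builds the value of a Basic Authorization header by hand,
// the same way basicAuthMiddleware does.
func basicAuthValue(username, password string) string {
    creds := username + ":" + password
    return "Basic " + base64.StdEncoding.EncodeToString([]byte(creds))
}

// stdlibBasicAuthValue lets net/http build the header for comparison.
func stdlibBasicAuthValue(username, password string) (string, error) {
    req, err := http.NewRequest(http.MethodGet, "https://example.com", nil)
    if err != nil {
        return "", err
    }
    req.SetBasicAuth(username, password)
    return req.Header.Get("Authorization"), nil
}

func main() {
    manual := basicAuthValue("myuser", "mypassword")
    std, err := stdlibBasicAuthValue("myuser", "mypassword")
    if err != nil {
        fmt.Println("building request failed:", err)
        return
    }
    // Both values should be identical.
    fmt.Println(manual == std)
}
```

If the two values ever differ, the manual encoding is the likelier culprit, for example when a URL-safe base64 alphabet is used by mistake.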

Request Logging and Monitoring Middleware

Comprehensive logging middleware helps with debugging and monitoring your scraping operations:

package main

import (
    "log"
    "time"

    "github.com/gocolly/colly/v2"
)

// Logging middleware with timing
func loggingMiddleware() func(*colly.Request) {
    return func(r *colly.Request) {
        start := time.Now()

        // Store start time in request context
        r.Ctx.Put("start_time", start)

        log.Printf("[REQUEST] %s %s", r.Method, r.URL.String())
        log.Printf("[HEADERS] %v", r.Headers)
    }
}

// Response timing middleware
func responseTimingMiddleware() func(*colly.Response) {
    return func(r *colly.Response) {
        // Ctx.Get returns a string; GetAny returns the stored value as-is
        if startTime := r.Ctx.GetAny("start_time"); startTime != nil {
            if start, ok := startTime.(time.Time); ok {
                duration := time.Since(start)
                log.Printf("[RESPONSE] %s completed in %v (Status: %d, Size: %d bytes)",
                    r.Request.URL.String(), duration, r.StatusCode, len(r.Body))
            }
        }
    }
}

func main() {
    c := colly.NewCollector()

    c.OnRequest(loggingMiddleware())
    c.OnResponse(responseTimingMiddleware())

    c.Visit("https://example.com")
}

Rate Limiting and Retry Middleware

Implement custom rate limiting and retry logic through middleware:

package main

import (
    "fmt"
    "time"

    "github.com/gocolly/colly/v2"
)

// Rate limiting middleware
func rateLimitMiddleware(delay time.Duration) func(*colly.Request) {
    var lastRequest time.Time

    return func(r *colly.Request) {
        if time.Since(lastRequest) < delay {
            time.Sleep(delay - time.Since(lastRequest))
        }
        lastRequest = time.Now()
    }
}

// Retry middleware for failed requests
func retryMiddleware(maxRetries int) func(*colly.Response, error) {
    return func(r *colly.Response, err error) {
        // Ctx.Get returns a string, so read the counter with GetAny
        retryCount := 0
        if val, ok := r.Ctx.GetAny("retry_count").(int); ok {
            retryCount = val
        }

        if retryCount < maxRetries {
            fmt.Printf("Retrying request %s (attempt %d/%d)\n", 
                r.Request.URL.String(), retryCount+1, maxRetries)

            r.Request.Ctx.Put("retry_count", retryCount+1)
            r.Request.Retry()
        } else {
            fmt.Printf("Max retries exceeded for %s\n", r.Request.URL.String())
        }
    }
}

func main() {
    c := colly.NewCollector()

    // Apply rate limiting (1 second between requests)
    c.OnRequest(rateLimitMiddleware(1 * time.Second))

    // Apply retry logic (max 3 retries)
    c.OnError(retryMiddleware(3))

    c.Visit("https://example.com")
}

Custom Header Management Middleware

Create sophisticated header management for different scenarios:

package main

import (
    "math/rand"

    "github.com/gocolly/colly/v2"
)

// User-Agent rotation middleware
// (math/rand seeds itself since Go 1.20; on older toolchains, call rand.Seed once at startup)
func userAgentRotationMiddleware(userAgents []string) func(*colly.Request) {
    return func(r *colly.Request) {
        if len(userAgents) > 0 {
            ua := userAgents[rand.Intn(len(userAgents))]
            r.Headers.Set("User-Agent", ua)
        }
    }
}

// Referrer middleware
func referrerMiddleware(baseURL string) func(*colly.Request) {
    return func(r *colly.Request) {
        if r.URL.String() != baseURL {
            r.Headers.Set("Referer", baseURL)
        }
    }
}

// Browser-like headers middleware
func browserHeadersMiddleware() func(*colly.Request) {
    return func(r *colly.Request) {
        r.Headers.Set("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8")
        r.Headers.Set("Accept-Language", "en-US,en;q=0.5")
        r.Headers.Set("Accept-Encoding", "gzip, deflate")
        r.Headers.Set("Connection", "keep-alive")
        r.Headers.Set("Upgrade-Insecure-Requests", "1")
    }
}

func main() {
    c := colly.NewCollector()

    userAgents := []string{
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
    }

    c.OnRequest(userAgentRotationMiddleware(userAgents))
    c.OnRequest(browserHeadersMiddleware())
    c.OnRequest(referrerMiddleware("https://google.com"))

    c.Visit("https://example.com")
}

Session and Cookie Management Middleware

Handle sessions and cookies across multiple requests:

package main

import (
    "fmt"
    "net/http"

    "github.com/gocolly/colly/v2"
)

// Session cookie middleware
func sessionCookieMiddleware(cookieJar http.CookieJar) func(*colly.Request) {
    return func(r *colly.Request) {
        if cookieJar != nil {
            cookies := cookieJar.Cookies(r.URL)
            for _, cookie := range cookies {
                r.Headers.Add("Cookie", fmt.Sprintf("%s=%s", cookie.Name, cookie.Value))
            }
        }
    }
}

// Custom cookie setter middleware
func customCookieMiddleware(cookies map[string]string) func(*colly.Request) {
    return func(r *colly.Request) {
        for name, value := range cookies {
            r.Headers.Add("Cookie", fmt.Sprintf("%s=%s", name, value))
        }
    }
}

func main() {
    c := colly.NewCollector()

    // Custom cookies
    customCookies := map[string]string{
        "session_id": "abc123",
        "preferences": "dark_mode=true",
    }

    c.OnRequest(customCookieMiddleware(customCookies))

    // Handle response cookies
    c.OnResponse(func(r *colly.Response) {
        fmt.Printf("Response cookies: %v\n", r.Headers.Get("Set-Cookie"))
    })

    c.Visit("https://example.com")
}
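
The cookie middleware above calls Headers.Add once per cookie, which yields several Cookie header lines. RFC 6265 expects a client to send a single Cookie header with pairs separated by "; ", so some servers may read only the first line. A minimal sketch of building that single value follows; cookieHeader is a hypothetical helper name.

```go
package main

import (
    "fmt"
    "sort"
    "strings"
)

// cookieHeader joins name/value pairs into one Cookie header value.
// Names are sorted so the output does not depend on map iteration order.
func cookieHeader(cookies map[string]string) string {
    names := make([]string, 0, len(cookies))
    for name := range cookies {
        names = append(names, name)
    }
    sort.Strings(names)

    pairs := make([]string, 0, len(names))
    for _, name := range names {
        pairs = append(pairs, name+"="+cookies[name])
    }
    return strings.Join(pairs, "; ")
}

func main() {
    fmt.Println(cookieHeader(map[string]string{
        "session_id":  "abc123",
        "preferences": "dark_mode=true",
    }))
    // prints: preferences=dark_mode=true; session_id=abc123
}
```

Inside the middleware, the result would be applied with r.Headers.Set("Cookie", cookieHeader(cookies)) instead of the Add loop.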

Conditional Middleware Application

Apply middleware conditionally based on URL patterns or other criteria:

package main

import (
    "strings"

    "github.com/gocolly/colly/v2"
)

// Conditional middleware wrapper
func conditionalMiddleware(condition func(*colly.Request) bool, middleware func(*colly.Request)) func(*colly.Request) {
    return func(r *colly.Request) {
        if condition(r) {
            middleware(r)
        }
    }
}

// URL pattern matcher
func urlContains(pattern string) func(*colly.Request) bool {
    return func(r *colly.Request) bool {
        return strings.Contains(r.URL.String(), pattern)
    }
}

func main() {
    c := colly.NewCollector()

    // Apply special headers only for API endpoints
    apiHeadersMiddleware := func(r *colly.Request) {
        r.Headers.Set("Content-Type", "application/json")
        r.Headers.Set("X-API-Version", "v2")
    }

    // Apply middleware conditionally
    c.OnRequest(conditionalMiddleware(
        urlContains("/api/"),
        apiHeadersMiddleware,
    ))

    // Regular headers for non-API requests
    c.OnRequest(conditionalMiddleware(
        func(r *colly.Request) bool { return !strings.Contains(r.URL.String(), "/api/") },
        func(r *colly.Request) {
            r.Headers.Set("User-Agent", "WebBrowser/1.0")
        },
    ))

    c.Visit("https://example.com/api/data")
    c.Visit("https://example.com/page")
}

Middleware Chaining and Order

The order of middleware registration matters. Middleware functions are executed in the order they were registered:

func main() {
    c := colly.NewCollector()

    // First middleware - executed first
    c.OnRequest(func(r *colly.Request) {
        fmt.Println("Middleware 1: Before request")
    })

    // Second middleware - executed second
    c.OnRequest(func(r *colly.Request) {
        fmt.Println("Middleware 2: Adding headers")
        r.Headers.Set("Custom-Header", "value")
    })

    // Third middleware - executed last
    c.OnRequest(func(r *colly.Request) {
        fmt.Println("Middleware 3: Final preparations")
    })

    c.Visit("https://example.com")
}
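
When several callbacks always belong together, the ordering can be made explicit by composing them into one function before registration. The following is a sketch, assuming Go 1.18+ generics. The chain helper is hypothetical and not part of Colly, and a string slice stands in for the request so the example runs without a network.

```go
package main

import "fmt"

// chain combines callbacks into one that runs them in the given order.
func chain[T any](fns ...func(T)) func(T) {
    return func(v T) {
        for _, fn := range fns {
            fn(v)
        }
    }
}

func main() {
    var trace []string

    // step returns a callback that records its name when it runs.
    step := func(name string) func(*[]string) {
        return func(t *[]string) { *t = append(*t, name) }
    }

    run := chain(step("log"), step("headers"), step("auth"))
    run(&trace)
    fmt.Println(trace) // prints: [log headers auth]
}
```

With Colly the same helper could be instantiated as chain[*colly.Request](a, b, c) and passed to a single c.OnRequest call.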

Error Handling in Middleware

Proper error handling within middleware is crucial for robust scraping:

func errorHandlingMiddleware() func(*colly.Request) {
    return func(r *colly.Request) {
        defer func() {
            if err := recover(); err != nil {
                fmt.Printf("Middleware panic recovered: %v\n", err)
            }
        }()

        // Your middleware logic here
        r.Headers.Set("Custom-Header", "value")
    }
}
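
Rather than repeating the defer/recover block in every callback, the pattern can be factored into a wrapper. The sketch below assumes Go 1.18+ generics. withRecover is a hypothetical helper name, and a plain map stands in for the request headers.

```go
package main

import "fmt"

// withRecover wraps fn so that a panic inside it is passed to onPanic
// instead of propagating to the caller.
func withRecover[T any](fn func(T), onPanic func(recovered any)) func(T) {
    return func(v T) {
        defer func() {
            if rec := recover(); rec != nil {
                onPanic(rec)
            }
        }()
        fn(v)
    }
}

func main() {
    risky := func(headers map[string]string) {
        headers["Custom-Header"] = "value" // panics when headers is a nil map
    }

    safe := withRecover(risky, func(rec any) {
        fmt.Println("middleware panic recovered:", rec)
    })

    safe(nil) // the panic is caught and reported
    fmt.Println("still running")
}
```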

Best Practices for Custom Middleware

  1. Keep middleware focused: Each middleware should have a single responsibility
  2. Handle errors gracefully: Use defer and recover for panic handling
  3. Consider performance: Minimize processing time in middleware
  4. Use context for data sharing: Pass data between middleware using r.Ctx
  5. Test middleware independently: Create unit tests for complex middleware logic

For more advanced web scraping scenarios that require JavaScript rendering, consider exploring browser automation tools that can handle dynamic content loading or manage complex authentication flows.

Conclusion

Custom request middleware in Colly provides powerful capabilities for modifying requests, implementing authentication, logging, rate limiting, and much more. By understanding how to create and chain middleware functions, you can build sophisticated web scraping applications that handle complex scenarios while maintaining clean, modular code. Remember to consider the order of middleware execution and implement proper error handling for production-ready applications.

