How do I set up custom request middleware in Colly?
Custom request middleware in Colly lets you intercept and modify HTTP requests before they are sent to the target server. You can use it to handle authentication, add custom headers, retry failed requests, log traffic, and more. Understanding how to create and use middleware is essential for building robust web scraping applications with Colly.
Understanding Colly Middleware
Middleware in Colly works through callback functions that are executed at specific points during the request lifecycle. The most common middleware types are:
- OnRequest: Executed before sending the request
- OnResponse: Executed after receiving the response
- OnHTML: Executed for each element in an HTML response that matches the given selector
- OnError: Executed when an error occurs
Basic Request Middleware Setup
Here's how to set up basic request middleware in Colly:
package main
import (
"fmt"
"github.com/gocolly/colly/v2"
)
func main() {
c := colly.NewCollector()
// Basic request middleware
c.OnRequest(func(r *colly.Request) {
fmt.Printf("Visiting: %s\n", r.URL.String())
// Add custom headers
r.Headers.Set("User-Agent", "MyBot/1.0")
r.Headers.Set("Accept", "text/html,application/xhtml+xml")
})
// Response middleware
c.OnResponse(func(r *colly.Response) {
fmt.Printf("Response received: %d bytes from %s\n",
len(r.Body), r.Request.URL)
})
// Error handling middleware
c.OnError(func(r *colly.Response, err error) {
fmt.Printf("Error occurred: %s\n", err.Error())
})
c.Visit("https://example.com")
}
Advanced Authentication Middleware
For websites requiring authentication, you can create middleware to handle various authentication methods:
package main
import (
"encoding/base64"
"github.com/gocolly/colly/v2"
)
// Basic Authentication Middleware
func basicAuthMiddleware(username, password string) func(*colly.Request) {
return func(r *colly.Request) {
auth := username + ":" + password
encoded := base64.StdEncoding.EncodeToString([]byte(auth))
r.Headers.Set("Authorization", "Basic "+encoded)
}
}
// Bearer Token Middleware
func bearerTokenMiddleware(token string) func(*colly.Request) {
return func(r *colly.Request) {
r.Headers.Set("Authorization", "Bearer "+token)
}
}
// API Key Middleware
func apiKeyMiddleware(keyName, keyValue string) func(*colly.Request) {
return func(r *colly.Request) {
r.Headers.Set(keyName, keyValue)
}
}
func main() {
c := colly.NewCollector()
// Apply authentication middleware
c.OnRequest(basicAuthMiddleware("myuser", "mypassword"))
// Or use bearer token
// c.OnRequest(bearerTokenMiddleware("your-jwt-token"))
// Or use API key
// c.OnRequest(apiKeyMiddleware("X-API-Key", "your-api-key"))
c.Visit("https://protected-site.com")
}
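As a sanity check on the Basic scheme used above, the following stdlib-only sketch builds the header value by hand and compares it with the value that net/http's Request.SetBasicAuth produces. The helper names basicAuthValue and stdlibBasicAuthValue are illustrative and not part of Colly, and the credentials are placeholders.

```go
package main

import (
	"encoding/base64"
	"fmt"
	"net/http"
)

// basicAuthValue builds the Authorization header value by hand,
// the same way basicAuthMiddleware does.
func basicAuthValue(username, password string) string {
	raw := username + ":" + password
	return "Basic " + base64.StdEncoding.EncodeToString([]byte(raw))
}

// stdlibBasicAuthValue lets net/http build the value, for comparison.
func stdlibBasicAuthValue(username, password string) (string, error) {
	req, err := http.NewRequest(http.MethodGet, "https://example.com", nil)
	if err != nil {
		return "", err
	}
	req.SetBasicAuth(username, password)
	return req.Header.Get("Authorization"), nil
}

func main() {
	manual := basicAuthValue("myuser", "mypassword")
	std, err := stdlibBasicAuthValue("myuser", "mypassword")
	if err != nil {
		fmt.Println("building request failed:", err)
		return
	}
	fmt.Println("values match:", manual == std)
}
```

If the two values match, the hand-built header is what a standard Go HTTP client would send for the same credentials.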
Request Logging and Monitoring Middleware
Comprehensive logging middleware helps with debugging and monitoring your scraping operations:
package main
import (
"log"
"time"
"github.com/gocolly/colly/v2"
)
// Logging middleware with timing
func loggingMiddleware() func(*colly.Request) {
return func(r *colly.Request) {
start := time.Now()
// Store start time in request context
r.Ctx.Put("start_time", start)
log.Printf("[REQUEST] %s %s", r.Method, r.URL.String())
// r.Headers is a pointer; dereference it so the map itself is printed
log.Printf("[HEADERS] %v", *r.Headers)
}
}
// Response timing middleware
func responseTimingMiddleware() func(*colly.Response) {
return func(r *colly.Response) {
// Ctx.Get returns a string, so use GetAny to recover the stored time.Time
if start, ok := r.Ctx.GetAny("start_time").(time.Time); ok {
duration := time.Since(start)
log.Printf("[RESPONSE] %s completed in %v (Status: %d, Size: %d bytes)",
r.Request.URL.String(), duration, r.StatusCode, len(r.Body))
}
}
}
func main() {
c := colly.NewCollector()
c.OnRequest(loggingMiddleware())
c.OnResponse(responseTimingMiddleware())
c.Visit("https://example.com")
}
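Logging every header also writes credentials such as Authorization or Cookie values into the log. One way to avoid that is to mask them first, as in this minimal stdlib-only sketch. The redactHeaders helper and the sensitiveHeaders list are assumptions for illustration, not part of Colly.

```go
package main

import (
	"fmt"
	"net/http"
)

// sensitiveHeaders lists header names whose values should not reach logs.
// The list is an assumption for this sketch; extend it for your targets.
var sensitiveHeaders = []string{"Authorization", "Proxy-Authorization", "Cookie"}

// redactHeaders returns a copy of h with sensitive values masked.
// The original header map is left untouched.
func redactHeaders(h http.Header) http.Header {
	out := h.Clone()
	for _, name := range sensitiveHeaders {
		if len(out.Values(name)) > 0 {
			out.Set(name, "[REDACTED]")
		}
	}
	return out
}

func main() {
	h := http.Header{}
	h.Set("Authorization", "Bearer your-jwt-token")
	h.Set("Accept", "text/html")
	fmt.Println(redactHeaders(h))
}
```

Inside an OnRequest callback this could be used as `log.Printf("[HEADERS] %v", redactHeaders(*r.Headers))`, since Colly exposes the request headers as a pointer to an http.Header.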
Rate Limiting and Retry Middleware
Implement custom rate limiting and retry logic through middleware:
package main
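import (
"fmt"
"sync"
"time"
"github.com/gocolly/colly/v2"
)
// Rate limiting middleware
func rateLimitMiddleware(delay time.Duration) func(*colly.Request) {
var (
mu sync.Mutex // guards lastRequest when requests run concurrently
lastRequest time.Time
)
return func(r *colly.Request) {
mu.Lock()
defer mu.Unlock()
// Sleep only for the part of the delay that has not elapsed yet
if wait := delay - time.Since(lastRequest); wait > 0 {
time.Sleep(wait)
}
lastRequest = time.Now()
}
}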
// Retry middleware for failed requests
func retryMiddleware(maxRetries int) func(*colly.Response, error) {
return func(r *colly.Response, err error) {
retryCount := 0
// Ctx.Get returns a string, so use GetAny to read back the stored int
if val, ok := r.Ctx.GetAny("retry_count").(int); ok {
retryCount = val
}
if retryCount < maxRetries {
fmt.Printf("Retrying request %s (attempt %d/%d)\n",
r.Request.URL.String(), retryCount+1, maxRetries)
r.Request.Ctx.Put("retry_count", retryCount+1)
r.Request.Retry()
} else {
fmt.Printf("Max retries exceeded for %s\n", r.Request.URL.String())
}
}
}
func main() {
c := colly.NewCollector()
// Apply rate limiting (1 second between requests)
c.OnRequest(rateLimitMiddleware(1 * time.Second))
// Apply retry logic (max 3 retries)
c.OnError(retryMiddleware(3))
c.Visit("https://example.com")
}
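The retry handler above retries immediately. A common refinement is to wait longer after each failed attempt. The sketch below computes such a delay with capped exponential backoff. The backoffDelay helper, the 100 ms base, and the 2 s cap are illustrative assumptions rather than anything Colly provides.

```go
package main

import (
	"fmt"
	"time"
)

// backoffDelay returns base doubled once per previous attempt,
// never exceeding limit. Attempt 0 yields base itself.
func backoffDelay(attempt int, base, limit time.Duration) time.Duration {
	d := base
	for i := 0; i < attempt; i++ {
		if d >= limit/2 {
			// Doubling again would reach or pass the cap.
			return limit
		}
		d *= 2
	}
	if d > limit {
		return limit
	}
	return d
}

func main() {
	for attempt := 0; attempt < 6; attempt++ {
		wait := backoffDelay(attempt, 100*time.Millisecond, 2*time.Second)
		fmt.Printf("attempt %d: wait %v\n", attempt, wait)
	}
}
```

In the retry handler, one option is to call `time.Sleep(backoffDelay(retryCount, 100*time.Millisecond, 2*time.Second))` just before `r.Request.Retry()`. Note that sleeping inside a callback blocks that request's goroutine.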
Custom Header Management Middleware
Create sophisticated header management for different scenarios:
package main
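import (
"math/rand"
"github.com/gocolly/colly/v2"
)
// User-Agent rotation middleware
func userAgentRotationMiddleware(userAgents []string) func(*colly.Request) {
// Since Go 1.20 the global math/rand source is seeded automatically and
// rand.Seed is deprecated. On older Go versions, seed it once at startup.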
return func(r *colly.Request) {
if len(userAgents) > 0 {
ua := userAgents[rand.Intn(len(userAgents))]
r.Headers.Set("User-Agent", ua)
}
}
}
// Referrer middleware
func referrerMiddleware(baseURL string) func(*colly.Request) {
return func(r *colly.Request) {
if r.URL.String() != baseURL {
r.Headers.Set("Referer", baseURL)
}
}
}
// Browser-like headers middleware
func browserHeadersMiddleware() func(*colly.Request) {
return func(r *colly.Request) {
r.Headers.Set("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8")
r.Headers.Set("Accept-Language", "en-US,en;q=0.5")
r.Headers.Set("Accept-Encoding", "gzip, deflate")
r.Headers.Set("Connection", "keep-alive")
r.Headers.Set("Upgrade-Insecure-Requests", "1")
}
}
func main() {
c := colly.NewCollector()
userAgents := []string{
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
}
c.OnRequest(userAgentRotationMiddleware(userAgents))
c.OnRequest(browserHeadersMiddleware())
c.OnRequest(referrerMiddleware("https://google.com"))
c.Visit("https://example.com")
}
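Random selection can pick the same User-Agent several times in a row. If an even spread matters, a round-robin rotation is a simple alternative to the random approach above. This is a minimal sketch using an atomic counter so it stays safe under concurrent requests. The rotator type and its method names are hypothetical.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// rotator hands out its items in order, wrapping around at the end.
type rotator struct {
	items []string
	next  atomic.Uint64 // index of the next item, incremented atomically
}

func newRotator(items []string) *rotator {
	return &rotator{items: items}
}

// Next returns the next item, or "" when the list is empty.
func (r *rotator) Next() string {
	if len(r.items) == 0 {
		return ""
	}
	i := r.next.Add(1) - 1
	return r.items[i%uint64(len(r.items))]
}

func main() {
	rot := newRotator([]string{"agent-a", "agent-b", "agent-c"})
	for i := 0; i < 4; i++ {
		fmt.Println(rot.Next())
	}
}
```

In an OnRequest callback, the result of `rot.Next()` can be passed to `r.Headers.Set("User-Agent", ...)` when it is non-empty.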
Session and Cookie Management Middleware
Handle sessions and cookies across multiple requests:
package main
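import (
"fmt"
"net/http"
"strings"
"github.com/gocolly/colly/v2"
)
// Session cookie middleware
func sessionCookieMiddleware(cookieJar http.CookieJar) func(*colly.Request) {
return func(r *colly.Request) {
if cookieJar == nil {
return
}
// Send all cookies in a single Cookie header, separated by "; "
pairs := []string{}
for _, cookie := range cookieJar.Cookies(r.URL) {
pairs = append(pairs, fmt.Sprintf("%s=%s", cookie.Name, cookie.Value))
}
if len(pairs) > 0 {
r.Headers.Set("Cookie", strings.Join(pairs, "; "))
}
}
}
// Custom cookie setter middleware
func customCookieMiddleware(cookies map[string]string) func(*colly.Request) {
return func(r *colly.Request) {
// Build one Cookie header; Set replaces any Cookie header set earlier
pairs := make([]string, 0, len(cookies))
for name, value := range cookies {
pairs = append(pairs, fmt.Sprintf("%s=%s", name, value))
}
if len(pairs) > 0 {
r.Headers.Set("Cookie", strings.Join(pairs, "; "))
}
}
}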
func main() {
c := colly.NewCollector()
// Custom cookies
customCookies := map[string]string{
"session_id": "abc123",
"preferences": "dark_mode=true",
}
c.OnRequest(customCookieMiddleware(customCookies))
// Handle response cookies
c.OnResponse(func(r *colly.Response) {
// Values returns every Set-Cookie header; Get would return only the first
fmt.Printf("Response cookies: %v\n", r.Headers.Values("Set-Cookie"))
})
c.Visit("https://example.com")
}
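The sessionCookieMiddleware above expects an http.CookieJar, but the example never creates one. The following stdlib-only sketch shows one way to build a jar with net/http/cookiejar and turn its contents into a single Cookie header value. The helpers newSessionJar and cookieHeader are hypothetical names, and the cookie values are placeholders.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/cookiejar"
	"net/url"
	"strings"
)

// newSessionJar creates a jar holding the given name/value pairs for rawURL.
func newSessionJar(rawURL string, pairs ...string) (http.CookieJar, *url.URL, error) {
	if len(pairs)%2 != 0 {
		return nil, nil, fmt.Errorf("expected name/value pairs, got %d items", len(pairs))
	}
	u, err := url.Parse(rawURL)
	if err != nil {
		return nil, nil, err
	}
	jar, err := cookiejar.New(nil)
	if err != nil {
		return nil, nil, err
	}
	cookies := make([]*http.Cookie, 0, len(pairs)/2)
	for i := 0; i < len(pairs); i += 2 {
		cookies = append(cookies, &http.Cookie{Name: pairs[i], Value: pairs[i+1]})
	}
	jar.SetCookies(u, cookies)
	return jar, u, nil
}

// cookieHeader joins the jar's cookies for u into one Cookie header value.
func cookieHeader(jar http.CookieJar, u *url.URL) string {
	parts := []string{}
	for _, c := range jar.Cookies(u) {
		parts = append(parts, c.Name+"="+c.Value)
	}
	return strings.Join(parts, "; ")
}

func main() {
	jar, u, err := newSessionJar("https://example.com/", "session_id", "abc123", "theme", "dark")
	if err != nil {
		fmt.Println("creating jar failed:", err)
		return
	}
	fmt.Println(cookieHeader(jar, u))
}
```

Colly's Collector also has a SetCookieJar method, so in many cases handing the jar to the collector may be simpler than writing the Cookie header by hand.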
Conditional Middleware Application
Apply middleware conditionally based on URL patterns or other criteria:
package main
import (
"strings"
"github.com/gocolly/colly/v2"
)
// Conditional middleware wrapper
func conditionalMiddleware(condition func(*colly.Request) bool, middleware func(*colly.Request)) func(*colly.Request) {
return func(r *colly.Request) {
if condition(r) {
middleware(r)
}
}
}
// URL pattern matcher
func urlContains(pattern string) func(*colly.Request) bool {
return func(r *colly.Request) bool {
return strings.Contains(r.URL.String(), pattern)
}
}
func main() {
c := colly.NewCollector()
// Apply special headers only for API endpoints
apiHeadersMiddleware := func(r *colly.Request) {
r.Headers.Set("Content-Type", "application/json")
r.Headers.Set("X-API-Version", "v2")
}
// Apply middleware conditionally
c.OnRequest(conditionalMiddleware(
urlContains("/api/"),
apiHeadersMiddleware,
))
// Regular headers for non-API requests
c.OnRequest(conditionalMiddleware(
func(r *colly.Request) bool { return !strings.Contains(r.URL.String(), "/api/") },
func(r *colly.Request) {
r.Headers.Set("User-Agent", "WebBrowser/1.0")
},
))
c.Visit("https://example.com/api/data")
c.Visit("https://example.com/page")
}
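The second registration above repeats the "/api/" check inline with a negation. Small predicate combinators can express that more directly. The sketch below is generic, so the same helpers would work with conditions over *colly.Request. It uses plain strings only to stay runnable without Colly, and the names contains, not, and allOf are illustrative.

```go
package main

import (
	"fmt"
	"strings"
)

// contains reports whether the input includes pattern.
func contains(pattern string) func(string) bool {
	return func(s string) bool { return strings.Contains(s, pattern) }
}

// not inverts a condition.
func not[T any](cond func(T) bool) func(T) bool {
	return func(v T) bool { return !cond(v) }
}

// allOf is true only when every condition holds.
func allOf[T any](conds ...func(T) bool) func(T) bool {
	return func(v T) bool {
		for _, cond := range conds {
			if !cond(v) {
				return false
			}
		}
		return true
	}
}

func main() {
	isPage := allOf(contains("example.com"), not(contains("/api/")))
	fmt.Println(isPage("https://example.com/page"))     // true
	fmt.Println(isPage("https://example.com/api/data")) // false
}
```

With these helpers, the non-API branch could be registered as `conditionalMiddleware(not(urlContains("/api/")), ...)`.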
Middleware Chaining and Order
The order of middleware registration matters: callbacks of the same type (for example, all OnRequest callbacks) run in the order they were registered:
package main
import (
"fmt"
"github.com/gocolly/colly/v2"
)
func main() {
c := colly.NewCollector()
// First middleware - executed first
c.OnRequest(func(r *colly.Request) {
fmt.Println("Middleware 1: Before request")
})
// Second middleware - executed second
c.OnRequest(func(r *colly.Request) {
fmt.Println("Middleware 2: Adding headers")
r.Headers.Set("Custom-Header", "value")
})
// Third middleware - executed last
c.OnRequest(func(r *colly.Request) {
fmt.Println("Middleware 3: Final preparations")
})
c.Visit("https://example.com")
}
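Because later callbacks see the changes made by earlier ones, a later callback that sets the same header wins. The stdlib-only sketch below illustrates this with a small chain helper that runs functions in order. The names chain and setHeader are assumptions for illustration, and a plain http.Header stands in for the request headers.

```go
package main

import (
	"fmt"
	"net/http"
)

// chain combines several functions into one that runs them in order.
func chain[T any](mws ...func(T)) func(T) {
	return func(v T) {
		for _, mw := range mws {
			mw(v)
		}
	}
}

// setHeader returns a function that sets one header value.
func setHeader(name, value string) func(http.Header) {
	return func(h http.Header) { h.Set(name, value) }
}

func main() {
	apply := chain(
		setHeader("User-Agent", "MyBot/1.0"),
		setHeader("User-Agent", "MyBot/2.0"), // runs later, so it overwrites
	)
	h := http.Header{}
	apply(h)
	fmt.Println(h.Get("User-Agent")) // MyBot/2.0
}
```

The same helper could wrap several `func(*colly.Request)` values into a single callback when a fixed group of middleware should always be registered together.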
Error Handling in Middleware
Proper error handling within middleware is crucial for robust scraping:
func errorHandlingMiddleware() func(*colly.Request) {
return func(r *colly.Request) {
defer func() {
if err := recover(); err != nil {
fmt.Printf("Middleware panic recovered: %v\n", err)
}
}()
// Your middleware logic here
r.Headers.Set("Custom-Header", "value")
}
}
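Rather than repeating the defer and recover block in every middleware, the recovery logic can live in one wrapper. This is a minimal generic sketch. The withRecover helper is a hypothetical name, and the panicking function in main exists only to demonstrate the behavior.

```go
package main

import (
	"fmt"
	"log"
)

// withRecover wraps mw so that a panic inside it is logged instead of
// crashing the program. name identifies the middleware in the log line.
func withRecover[T any](name string, mw func(T)) func(T) {
	return func(v T) {
		defer func() {
			if rec := recover(); rec != nil {
				log.Printf("middleware %q panicked: %v", name, rec)
			}
		}()
		mw(v)
	}
}

func main() {
	risky := func(headers map[string]string) {
		var missing map[string]string
		missing["X-Test"] = "1" // panics: assignment to entry in nil map
	}
	safe := withRecover("risky", risky)
	safe(map[string]string{})
	fmt.Println("crawler still running")
}
```

Keep in mind that a recovered middleware has skipped the rest of its work, so the request may go out without the headers it was meant to add.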
Best Practices for Custom Middleware
- Keep middleware focused: Each middleware should have a single responsibility
- Handle errors gracefully: Use defer and recover for panic handling
- Consider performance: Minimize processing time in middleware
- Use context for data sharing: Pass data between middleware using r.Ctx
- Test middleware independently: Create unit tests for complex middleware logic
For more advanced web scraping scenarios that require JavaScript rendering, consider exploring browser automation tools that can handle dynamic content loading or manage complex authentication flows.
Conclusion
Custom request middleware in Colly provides powerful capabilities for modifying requests, implementing authentication, logging, rate limiting, and much more. By understanding how to create and chain middleware functions, you can build sophisticated web scraping applications that handle complex scenarios while maintaining clean, modular code. Remember to consider the order of middleware execution and implement proper error handling for production-ready applications.