How to Set Up a User Agent String in Colly
Setting up a proper user agent string is crucial for successful web scraping with Colly. Many websites check user agent headers to identify bots and may block requests with suspicious or missing user agent strings. This guide covers everything you need to know about configuring user agents in Colly.
Understanding User Agent Strings
A user agent string is an HTTP header that identifies the client making the request. It typically contains information about the browser, operating system, and device. Websites use this information to serve appropriate content and detect automated requests.
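For example, a current desktop Chrome string is really a series of tokens, each of which detection systems may inspect. The breakdown below uses Go string concatenation purely to annotate the parts:

const desktopChromeUA = "Mozilla/5.0" + // historical compatibility prefix sent by all modern browsers
	" (Windows NT 10.0; Win64; x64)" + // operating system and architecture
	" AppleWebKit/537.36 (KHTML, like Gecko)" + // rendering engine tokens
	" Chrome/120.0.0.0" + // browser name and version
	" Safari/537.36" // trailing compatibility token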
Basic User Agent Configuration
Setting a Single User Agent
The simplest way to set a user agent in Colly is to assign the collector's UserAgent field (or, equivalently, pass the colly.UserAgent() option to NewCollector):
package main

import (
	"fmt"

	"github.com/gocolly/colly/v2"
)

func main() {
	c := colly.NewCollector()

	// Set a single user agent for every request this collector makes
	c.UserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"

	c.OnHTML("title", func(e *colly.HTMLElement) {
		fmt.Println("Title:", e.Text)
	})

	c.Visit("https://example.com")
}
Using OnRequest for Dynamic User Agents
For more flexibility, you can set the user agent dynamically in an OnRequest callback:
c.OnRequest(func(r *colly.Request) {
	r.Headers.Set("User-Agent", "Custom Bot 1.0")
})
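Because the callback runs before every request, you can also branch on the request itself. Here is a small sketch, assuming the strings package is imported; the m. host prefix rule is purely illustrative:

c.OnRequest(func(r *colly.Request) {
	// Hypothetical rule: send a mobile user agent to mobile subdomains only
	if strings.HasPrefix(r.URL.Host, "m.") {
		r.Headers.Set("User-Agent", "Mozilla/5.0 (iPhone; CPU iPhone OS 14_6 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Mobile/15E148 Safari/604.1")
	} else {
		r.Headers.Set("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36")
	}
})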
Advanced User Agent Strategies
User Agent Rotation
Rotating user agents helps avoid detection and mimics real user behavior:
package main

import (
	"fmt"
	"math/rand"

	"github.com/gocolly/colly/v2"
)

func main() {
	c := colly.NewCollector()

	// List of common user agents
	userAgents := []string{
		"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
		"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
		"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
		"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0",
		"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:89.0) Gecko/20100101 Firefox/89.0",
	}

	// Note: math/rand is seeded automatically as of Go 1.20, so no
	// rand.Seed call is needed

	c.OnRequest(func(r *colly.Request) {
		// Select a random user agent for each request
		randomUA := userAgents[rand.Intn(len(userAgents))]
		r.Headers.Set("User-Agent", randomUA)
	})

	c.OnHTML("title", func(e *colly.HTMLElement) {
		fmt.Println("Title:", e.Text)
	})

	// Visit multiple pages
	urls := []string{
		"https://example.com",
		"https://httpbin.org/user-agent",
	}
	for _, url := range urls {
		c.Visit(url)
	}
}
Mobile User Agents
When scraping mobile-specific content, use mobile user agents:
mobileUserAgents := []string{
	"Mozilla/5.0 (iPhone; CPU iPhone OS 14_6 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Mobile/15E148 Safari/604.1",
	"Mozilla/5.0 (Linux; Android 11; SM-G991B) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.120 Mobile Safari/537.36",
	"Mozilla/5.0 (iPad; CPU OS 14_6 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Mobile/15E148 Safari/604.1",
}

c.OnRequest(func(r *colly.Request) {
	randomMobileUA := mobileUserAgents[rand.Intn(len(mobileUserAgents))]
	r.Headers.Set("User-Agent", randomMobileUA)
})
User Agent Best Practices
1. Use Recent and Realistic User Agents
Always use current, realistic user agent strings that match actual browsers:
// Good: recent Chrome user agent
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"

// Bad: outdated or obviously fake user agents
"OldBot/1.0"
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
2. Match User Agent with Other Headers
Ensure consistency between user agent and other headers:
c.OnRequest(func(r *colly.Request) {
	// Set a full, realistic user agent
	r.Headers.Set("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36")

	// Match it with the headers a real browser would send
	r.Headers.Set("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8")
	r.Headers.Set("Accept-Language", "en-US,en;q=0.5")
	// Caution: setting Accept-Encoding yourself disables Go's transparent
	// gzip decompression, so only send it if you handle decoding
	r.Headers.Set("Accept-Encoding", "gzip, deflate")
	r.Headers.Set("Connection", "keep-alive")
	r.Headers.Set("Upgrade-Insecure-Requests", "1")
})
3. Implement User Agent Pools
Create structured user agent management:
type UserAgentPool struct {
	mu     sync.Mutex // guards index when requests run in parallel
	agents []string
	index  int
}

func NewUserAgentPool() *UserAgentPool {
	return &UserAgentPool{
		agents: []string{
			"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
			"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
			"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
		},
	}
}

// GetRandom returns an arbitrary user agent from the pool
func (uap *UserAgentPool) GetRandom() string {
	return uap.agents[rand.Intn(len(uap.agents))]
}

// GetNext cycles through the pool in order; safe for concurrent use
func (uap *UserAgentPool) GetNext() string {
	uap.mu.Lock()
	defer uap.mu.Unlock()
	ua := uap.agents[uap.index]
	uap.index = (uap.index + 1) % len(uap.agents)
	return ua
}

// Usage
pool := NewUserAgentPool()
c.OnRequest(func(r *colly.Request) {
	r.Headers.Set("User-Agent", pool.GetRandom())
})
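GetRandom hides patterns better, but if you want every agent used evenly, swap in GetNext for deterministic round-robin rotation:

c.OnRequest(func(r *colly.Request) {
	r.Headers.Set("User-Agent", pool.GetNext()) // cycles through the pool in order
})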
Debugging User Agent Issues
Verify User Agent is Set
Test your user agent configuration:
c.OnResponse(func(r *colly.Response) {
	fmt.Printf("Request sent with User-Agent: %s\n", r.Request.Headers.Get("User-Agent"))
})
Use Online Tools
Test user agent detection with services like httpbin.org:
c.OnHTML("body", func(e *colly.HTMLElement) {
fmt.Println("Response:", e.Text)
})
c.Visit("https://httpbin.org/user-agent")
Integration with Other Colly Features
Combining with Delays and Rate Limiting
When rotating user agents, combine the rotation with proper rate limiting so your request volume looks as natural as your headers:
import "github.com/gocolly/colly/v2/debug"
c := colly.NewCollector(
colly.Debugger(&debug.LogDebugger{}),
)
// Add rate limiting
c.Limit(&colly.LimitRule{
DomainGlob: "*",
Parallelism: 2,
Delay: 1 * time.Second,
})
c.OnRequest(func(r *colly.Request) {
r.Headers.Set("User-Agent", pool.GetRandom())
})
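Note that Parallelism only takes effect when the collector actually issues concurrent requests, which requires async mode; a blocking collector visits pages one at a time regardless. A sketch of the async variant:

c := colly.NewCollector(colly.Async(true)) // async collector: Visit queues instead of blocking

c.Limit(&colly.LimitRule{
	DomainGlob:  "*",
	Parallelism: 2,               // at most two concurrent requests for matching domains
	Delay:       1 * time.Second, // pause between requests to the same domain
})

// ... register callbacks and call Visit as usual ...

c.Wait() // block until every queued request has finished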
With Authentication Headers
When scraping authenticated content, ensure user agents work with authentication mechanisms:
c.OnRequest(func(r *colly.Request) {
	r.Headers.Set("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36")
	r.Headers.Set("Authorization", "Bearer your-token-here")
})
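One caveat: if the site ties the session to a cookie or token, rotating user agents mid-session can itself look suspicious, since a real browser keeps one user agent for the whole visit. A common compromise, reusing the pool from earlier, is to pick one user agent per collector rather than per request:

// Pick a single user agent for the lifetime of this collector/session
sessionUA := pool.GetRandom()

c.OnRequest(func(r *colly.Request) {
	r.Headers.Set("User-Agent", sessionUA)
	r.Headers.Set("Authorization", "Bearer your-token-here")
})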
Common Pitfalls to Avoid
1. Don't Use Default User Agent
Never rely on Colly's default user agent:
// Bad: using the default user agent ("colly - https://github.com/gocolly/colly")
c := colly.NewCollector()
c.Visit("https://example.com")

// Good: always set a custom user agent
c := colly.NewCollector()
c.UserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
c.Visit("https://example.com")
2. Avoid Suspicious Patterns
Don't use the same user agent for all requests from the same IP:
// Bad: the same user agent for every request
c.UserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"

// Good: rotate user agents
c.OnRequest(func(r *colly.Request) {
	r.Headers.Set("User-Agent", pool.GetRandom())
})
3. Keep User Agents Updated
Regularly update your user agent strings to match current browser versions:
# Check current Chrome version
google-chrome --version
# Check current Firefox version
firefox --version
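To keep the list fresh without recompiling, you can also load user agents from a plain text file, one per line. A minimal sketch, assuming the os and strings packages are imported (the user_agents.txt filename is illustrative):

func loadUserAgents(path string) ([]string, error) {
	// Read the whole file and split it into non-empty lines
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	var agents []string
	for _, line := range strings.Split(string(data), "\n") {
		if ua := strings.TrimSpace(line); ua != "" {
			agents = append(agents, ua)
		}
	}
	return agents, nil
}

// Usage: agents, err := loadUserAgents("user_agents.txt")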
Testing User Agent Configuration
Create a simple test to verify your setup:
package main

import (
	"fmt"

	"github.com/gocolly/colly/v2"
)

func testUserAgent() {
	c := colly.NewCollector()

	// Set a test user agent
	testUA := "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
	c.UserAgent = testUA

	// httpbin.org/user-agent returns JSON, so inspect the raw response body
	c.OnResponse(func(r *colly.Response) {
		fmt.Printf("User agent test result: %s\n", string(r.Body))
	})

	c.OnError(func(r *colly.Response, err error) {
		fmt.Printf("Error: %s\n", err.Error())
	})

	c.Visit("https://httpbin.org/user-agent")
}

func main() {
	testUserAgent()
}
Conclusion
Setting up proper user agent strings in Colly is essential for successful web scraping. By implementing rotation strategies, using realistic user agents, and following best practices, you can significantly improve your scraper's success rate while avoiding detection. Remember to always respect robots.txt files and website terms of service when scraping.
The key is to make your requests appear as natural as possible by using current browser user agents and rotating them appropriately. Combined with proper rate limiting and header management, a well-configured user agent strategy will make your Colly scraper much more effective and reliable.