How to Set Up a User Agent String in Colly

Setting up a proper user agent string is crucial for successful web scraping with Colly. Many websites check user agent headers to identify bots and may block requests with suspicious or missing user agent strings. This guide covers everything you need to know about configuring user agents in Colly.

Understanding User Agent Strings

A user agent string is an HTTP header that identifies the client making the request. It typically contains information about the browser, operating system, and device. Websites use this information to serve appropriate content and detect automated requests.

Basic User Agent Configuration

Setting a Single User Agent

The simplest way to set a user agent in Colly is to assign the collector's UserAgent field:

package main

import (
    "fmt"
    "github.com/gocolly/colly/v2"
)

func main() {
    c := colly.NewCollector()

    // Set a single user agent for all requests
    c.UserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"

    c.OnHTML("title", func(e *colly.HTMLElement) {
        fmt.Println("Title:", e.Text)
    })

    c.Visit("https://example.com")
}

Using OnRequest for Dynamic User Agents

For more flexibility, you can set the user agent dynamically using the OnRequest callback:

c.OnRequest(func(r *colly.Request) {
    r.Headers.Set("User-Agent", "Custom Bot 1.0")
})

Advanced User Agent Strategies

User Agent Rotation

Rotating user agents helps avoid detection and mimics real user behavior:

package main

import (
    "fmt"
    "math/rand"
    "time"

    "github.com/gocolly/colly/v2"
)

func main() {
    c := colly.NewCollector()

    // List of common user agents
    userAgents := []string{
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:89.0) Gecko/20100101 Firefox/89.0",
    }

    // Seed the random number generator (unnecessary on Go 1.20+,
    // where the global source is seeded automatically)
    rand.Seed(time.Now().UnixNano())

    c.OnRequest(func(r *colly.Request) {
        // Select random user agent
        randomUA := userAgents[rand.Intn(len(userAgents))]
        r.Headers.Set("User-Agent", randomUA)
    })

    c.OnHTML("title", func(e *colly.HTMLElement) {
        fmt.Println("Title:", e.Text)
    })

    // Visit multiple pages
    urls := []string{
        "https://example.com",
        "https://httpbin.org/user-agent",
    }

    for _, url := range urls {
        c.Visit(url)
    }
}

Mobile User Agents

When scraping mobile-specific content, use mobile user agents:

mobileUserAgents := []string{
    "Mozilla/5.0 (iPhone; CPU iPhone OS 14_6 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Mobile/15E148 Safari/604.1",
    "Mozilla/5.0 (Linux; Android 11; SM-G991B) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.120 Mobile Safari/537.36",
    "Mozilla/5.0 (iPad; CPU OS 14_6 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Mobile/15E148 Safari/604.1",
}

c.OnRequest(func(r *colly.Request) {
    randomMobileUA := mobileUserAgents[rand.Intn(len(mobileUserAgents))]
    r.Headers.Set("User-Agent", randomMobileUA)
})

User Agent Best Practices

1. Use Recent and Realistic User Agents

Always use current, realistic user agent strings that match actual browsers:

// Good: Recent Chrome user agent
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"

// Bad: Outdated or obviously fake user agents
"OldBot/1.0"
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"

2. Match User Agent with Other Headers

Ensure consistency between user agent and other headers:

c.OnRequest(func(r *colly.Request) {
    // Set a full, realistic user agent
    r.Headers.Set("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36")

    // Match with appropriate headers
    r.Headers.Set("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8")
    r.Headers.Set("Accept-Language", "en-US,en;q=0.5")
    r.Headers.Set("Accept-Encoding", "gzip, deflate")
    r.Headers.Set("Connection", "keep-alive")
    r.Headers.Set("Upgrade-Insecure-Requests", "1")
})

3. Implement User Agent Pools

Create structured user agent management:

type UserAgentPool struct {
    agents []string
    index  int
}

func NewUserAgentPool() *UserAgentPool {
    return &UserAgentPool{
        agents: []string{
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
        },
    }
}

func (uap *UserAgentPool) GetRandom() string {
    return uap.agents[rand.Intn(len(uap.agents))]
}

func (uap *UserAgentPool) GetNext() string {
    ua := uap.agents[uap.index]
    uap.index = (uap.index + 1) % len(uap.agents)
    return ua
}

// Usage
pool := NewUserAgentPool()
c.OnRequest(func(r *colly.Request) {
    r.Headers.Set("User-Agent", pool.GetRandom())
})

Debugging User Agent Issues

Verify User Agent is Set

Test your user agent configuration:

c.OnResponse(func(r *colly.Response) {
    fmt.Printf("Request sent with User-Agent: %s\n", r.Request.Headers.Get("User-Agent"))
})

Use Online Tools

Test user agent detection with services like httpbin.org:

c.OnHTML("body", func(e *colly.HTMLElement) {
    fmt.Println("Response:", e.Text)
})

c.Visit("https://httpbin.org/user-agent")

Integration with Other Colly Features

Combining with Delays and Rate Limiting

When using user agent rotation, combine it with proper rate limiting:

import "github.com/gocolly/colly/v2/debug"

c := colly.NewCollector(
    colly.Debugger(&debug.LogDebugger{}),
)

// Add rate limiting
c.Limit(&colly.LimitRule{
    DomainGlob:  "*",
    Parallelism: 2,
    Delay:       1 * time.Second,
})

c.OnRequest(func(r *colly.Request) {
    r.Headers.Set("User-Agent", pool.GetRandom())
})

With Authentication Headers

When scraping authenticated content, ensure user agents work with authentication mechanisms:

c.OnRequest(func(r *colly.Request) {
    r.Headers.Set("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36")
    r.Headers.Set("Authorization", "Bearer your-token-here")
})

Common Pitfalls to Avoid

1. Don't Use Default User Agent

Never rely on Colly's default user agent:

// Bad: Using the default user agent ("colly - https://github.com/gocolly/colly")
c := colly.NewCollector()
c.Visit("https://example.com")

// Good: Always set a custom user agent
c := colly.NewCollector()
c.UserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
c.Visit("https://example.com")

2. Avoid Suspicious Patterns

Don't use the same user agent for all requests from the same IP:

// Bad: Same user agent for every request
c.UserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"

// Good: Rotate user agents
c.OnRequest(func(r *colly.Request) {
    r.Headers.Set("User-Agent", pool.GetRandom())
})

3. Keep User Agents Updated

Regularly update your user agent strings to match current browser versions:

# Check current Chrome version
google-chrome --version

# Check current Firefox version
firefox --version

Testing User Agent Configuration

Create a simple test to verify your setup:

package main

import (
    "fmt"
    "github.com/gocolly/colly/v2"
)

func testUserAgent() {
    c := colly.NewCollector()

    // Set test user agent
    testUA := "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    c.UserAgent = testUA

    c.OnHTML("body", func(e *colly.HTMLElement) {
        fmt.Printf("User agent test result: %s\n", e.Text)
    })

    c.OnError(func(r *colly.Response, err error) {
        fmt.Printf("Error: %s\n", err.Error())
    })

    c.Visit("https://httpbin.org/user-agent")
}

func main() {
    testUserAgent()
}

Conclusion

Setting up proper user agent strings in Colly is essential for successful web scraping. By implementing rotation strategies, using realistic user agents, and following best practices, you can significantly improve your scraper's success rate while avoiding detection. Remember to always respect robots.txt files and website terms of service when scraping.

The key is to make your requests appear as natural as possible by using current browser user agents and rotating them appropriately. Combined with proper rate limiting and header management, a well-configured user agent strategy will make your Colly scraper much more effective and reliable.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
