Can Colly rotate user agents or proxies to mimic human behavior?

Yes. Colly, a popular web scraping framework for Go, lets you rotate both user agents and proxies, so that your requests look less uniform and are less likely to be detected and blocked by web servers.

Rotating User Agents

To rotate user agents, you can create a slice of strings containing various user agents and then select one at random for each request. Here's a simplified example in Go:

package main

import (
    "fmt"
    "math/rand"
    "time"

    "github.com/gocolly/colly"
)

func main() {
    userAgents := []string{
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.5 Safari/605.1.15",
        // Add more user agents as needed
    }

    c := colly.NewCollector()

    // Randomly set a User-Agent header for each request
    c.OnRequest(func(r *colly.Request) {
        r.Headers.Set("User-Agent", userAgents[rand.Intn(len(userAgents))])
    })

    c.OnResponse(func(r *colly.Response) {
        fmt.Println("Visited", r.Request.URL)
    })

    // Start scraping
    c.Visit("https://httpbin.org/user-agent")
}

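func init() {
    // Seed the global generator so each run picks a different sequence.
    // Go 1.20 and later seed it automatically and deprecate rand.Seed.
    rand.Seed(time.Now().UnixNano())
}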

This script initializes a slice of user agents, then sets the User-Agent header of each request to a random selection from this slice.
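
If you would rather cycle through the user agents in a fixed order than pick at random, a small rotator can hand them out one at a time. The following is a minimal sketch using only the standard library; uaRotator, newUARotator and Next are hypothetical names made up for this illustration and are not part of Colly. A mutex guards the index because Colly may issue requests concurrently.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// uaRotator hands out user agents in a fixed cycle.
// Hypothetical helper for this sketch, not a Colly type.
type uaRotator struct {
	mu     sync.Mutex
	agents []string
	next   int // index of the agent returned by the next call
}

// newUARotator rejects an empty list so Next can never index out of range.
func newUARotator(agents []string) (*uaRotator, error) {
	if len(agents) == 0 {
		return nil, errors.New("no user agents supplied")
	}
	return &uaRotator{agents: agents}, nil
}

// Next returns the current user agent and advances the cycle.
func (r *uaRotator) Next() string {
	r.mu.Lock()
	defer r.mu.Unlock()
	ua := r.agents[r.next]
	r.next = (r.next + 1) % len(r.agents)
	return ua
}

func main() {
	// Placeholder strings stand in for real user agent values.
	r, err := newUARotator([]string{"agent-A", "agent-B", "agent-C"})
	if err != nil {
		fmt.Println(err)
		return
	}
	for i := 0; i < 4; i++ {
		fmt.Println(r.Next()) // agent-A, agent-B, agent-C, then agent-A again
	}
}
```

To use it with the script above, you would create the rotator once before registering callbacks and call Next() inside the OnRequest handler in place of the rand.Intn lookup.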

Rotating Proxies

To rotate proxies, you can follow a similar approach by using Colly's SetProxyFunc method. Here's an example:

package main

import (
    "fmt"
    "log"

    "github.com/gocolly/colly"
    "github.com/gocolly/colly/proxy"
)

func main() {
    proxies := []string{
        "http://proxy1.com:8080",
        "http://proxy2.com:8080",
        // Add more proxies as needed
    }

    c := colly.NewCollector()

    // Rotate proxies
    rp, err := proxy.RoundRobinProxySwitcher(proxies...)
    if err != nil {
        log.Fatal(err)
    }
    c.SetProxyFunc(rp)

    c.OnResponse(func(r *colly.Response) {
        fmt.Println("Visited", r.Request.URL)
    })

    // Start scraping
    c.Visit("https://httpbin.org/ip")
}

In this code, we specify a list of proxy servers and then use proxy.RoundRobinProxySwitcher to create a proxy switcher that rotates through them in a round-robin fashion. We then set this as the proxy function for our Colly collector.
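
Round-robin order is predictable, so you may prefer to choose a proxy at random for each request. The sketch below assumes that SetProxyFunc accepts a function of the form func(*http.Request) (*url.URL, error), the same shape the standard library uses for http.Transport.Proxy. randomProxyFunc is a hypothetical helper written for this illustration, and the proxy addresses are the placeholders from the example above.

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"net/http"
	"net/url"
)

// randomProxyFunc parses the proxy addresses once and returns a function
// that picks one of them at random on every call.
// Hypothetical helper for this sketch, not part of Colly.
func randomProxyFunc(addrs ...string) (func(*http.Request) (*url.URL, error), error) {
	if len(addrs) == 0 {
		return nil, errors.New("no proxies supplied")
	}
	parsed := make([]*url.URL, 0, len(addrs))
	for _, a := range addrs {
		u, err := url.Parse(a)
		if err != nil {
			return nil, fmt.Errorf("parse proxy %q: %w", a, err)
		}
		parsed = append(parsed, u)
	}
	// The request is ignored; every call makes an independent random choice.
	return func(_ *http.Request) (*url.URL, error) {
		return parsed[rand.Intn(len(parsed))], nil
	}, nil
}

func main() {
	pick, err := randomProxyFunc("http://proxy1.com:8080", "http://proxy2.com:8080")
	if err != nil {
		fmt.Println(err)
		return
	}
	req, err := http.NewRequest(http.MethodGet, "https://httpbin.org/ip", nil)
	if err != nil {
		fmt.Println(err)
		return
	}
	u, err := pick(req)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("selected proxy:", u.Host)
}
```

Under that assumption, the returned function could be passed to c.SetProxyFunc in place of the round-robin switcher. Parsing the addresses up front means a malformed proxy URL is reported once at startup rather than on some later request.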

Notes on Usage

  • Web scraping should always be performed ethically and within the bounds of the website's terms of service and applicable laws, such as the Computer Fraud and Abuse Act in the United States or the GDPR in the European Union. Always respect robots.txt rules and request rate limits.
  • Rotating user agents and proxies reduces the likelihood of being blocked by the target website, but it does not guarantee that you will not be blocked. Some websites employ more advanced detection mechanisms that rotation alone does not defeat.
  • When using proxies, remember that free proxies can be unreliable and may compromise the privacy and security of your scraping operations. It may be better to use a trusted proxy or VPN service.
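  • On Go versions before 1.20, seed the random number generator so that each run picks a different sequence of user agents. The user agent example above does this with rand.Seed(time.Now().UnixNano()). From Go 1.20 on, the global generator is seeded automatically and rand.Seed is deprecated. The round-robin proxy switcher does not use random numbers, so it needs no seeding.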
