Pholcus is a distributed, high-concurrency, and powerful web crawler software written in Go. When using Pholcus or any web scraping tool, it's important to respect the rate limits of the target website to avoid getting your IP address banned.
Managing rate limits typically involves:
Respecting robots.txt: Start by checking the target website's robots.txt file to understand their scraping policy.
Limiting Request Rate: Slowing down the rate at which you make requests to the website.
User-Agent Rotation: Rotating user agents to mimic different browsers.
IP Rotation: Using proxy servers to rotate IP addresses.
Retry with Backoff: Implementing a retry mechanism with exponential backoff in case requests are throttled.
Pholcus provides features that help with some of these strategies, most notably limiting the concurrency level, which effectively controls the rate of requests.
Here's how you can manage rate limits with Pholcus:
Limiting Request Rate
To avoid hitting rate limits, you can limit the number of concurrent requests that Pholcus makes. This can be done by setting the ThreadNum parameter, which controls the number of threads (goroutines in Go) used for crawling.
package main

import (
	"github.com/henrylee2cn/pholcus/exec"
	_ "github.com/henrylee2cn/pholcus_lib" // Any spider must be imported.
	// "github.com/henrylee2cn/pholcus_lib_pte" // If using a distributed server, import this.

	"github.com/henrylee2cn/pholcus/runtime/cache"
)

func main() {
	// Limit the number of concurrent download threads (goroutines)
	// via the global runtime config to avoid hitting rate limits.
	cache.Task.ThreadNum = 5

	// Run Pholcus with the web UI front end.
	exec.DefaultRun("web")
}
User-Agent Rotation
Although Pholcus doesn't provide a built-in feature for rotating user agents, you can implement a list of user agents and pick one randomly for each request in your spider's code.
import (
	"math/rand"
	"time"

	// The *Context type used by spider rules comes from Pholcus's spider package.
	. "github.com/henrylee2cn/pholcus/app/spider"
)

// ...

func (self *mySpider) GetPage(ctx *Context) {
	// ... other code ...

	userAgents := []string{
		"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3",
		// ... add more user agents ...
	}

	// Seed the random number generator (ideally do this once, e.g. in init(),
	// rather than on every request).
	rand.Seed(time.Now().UnixNano())

	// Pick a random user agent
	randomUserAgent := userAgents[rand.Intn(len(userAgents))]

	// Add the User-Agent header to the request
	ctx.SetHeader("User-Agent", randomUserAgent)

	// ... other code ...
}
IP Rotation
For rotating IP addresses, Pholcus does not have built-in support for proxies. However, you can modify your Go code to use a proxy client by setting up an http.Transport with the desired proxy settings and attaching it to the http.Client you use in your spider.
import (
	"log"
	"net/http"

	"golang.org/x/net/proxy"
)

// ...

func getProxiedHttpClient(proxyAddr string) (*http.Client, error) {
	// Create a SOCKS5 dialer using the proxy address
	dialer, err := proxy.SOCKS5("tcp", proxyAddr, nil, proxy.Direct)
	if err != nil {
		return nil, err
	}

	// Set up the HTTP transport to use the dialer (proxy)
	transport := &http.Transport{
		Dial: dialer.Dial,
	}

	// Create an HTTP client with the transport
	client := &http.Client{
		Transport: transport,
	}
	return client, nil
}

// ...

func (self *mySpider) GetPage(ctx *Context) {
	// ... other code ...

	// Example proxy address
	proxyAddr := "127.0.0.1:1080"

	// Get a proxied HTTP client
	client, err := getProxiedHttpClient(proxyAddr)
	if err != nil {
		// Handle the error, e.g. log it and skip the request or fall back to a direct connection
		log.Printf("failed to create proxied client: %v", err)
		return
	}

	// Use the client for HTTP requests
	// resp, err := client.Get("http://example.com")
	_ = client

	// ... other code ...
}
Remember to maintain a pool of proxy servers and rotate through them, so that no single IP address is used for too many requests.
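As an illustration, here is a minimal sketch of such a rotation. The proxyPool type and newProxyPool constructor are hypothetical helpers written for this example, not part of Pholcus:

import (
	"sync"
)

// proxyPool cycles through a fixed list of proxy addresses in round-robin order.
type proxyPool struct {
	mu    sync.Mutex
	addrs []string
	next  int
}

func newProxyPool(addrs []string) *proxyPool {
	return &proxyPool{addrs: addrs}
}

// Next returns the next proxy address, wrapping around at the end of the list.
func (p *proxyPool) Next() string {
	p.mu.Lock()
	defer p.mu.Unlock()
	if len(p.addrs) == 0 {
		return ""
	}
	addr := p.addrs[p.next]
	p.next = (p.next + 1) % len(p.addrs)
	return addr
}

Each request could then obtain its client with something like client, err := getProxiedHttpClient(pool.Next()), reusing the helper from the previous snippet.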
Retry with Backoff
Implementing a retry mechanism with exponential backoff isn't directly provided by Pholcus, but you can write this logic yourself within your spider's task functions. You'll have to handle the retry logic manually after detecting a failed request or a response indicating that you've been rate-limited (for example, an HTTP 429 Too Many Requests status).
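As a rough illustration, here is a minimal, generic sketch of exponential backoff in Go. The fetchWithRetry helper, the attempt limit, and the base delay are assumptions made for this example rather than anything Pholcus provides:

import (
	"fmt"
	"net/http"
	"time"
)

// fetchWithRetry is a hypothetical helper: it retries a GET request with
// exponentially growing delays when the request fails or the server
// answers with 429 Too Many Requests.
func fetchWithRetry(client *http.Client, url string, maxAttempts int) (*http.Response, error) {
	baseDelay := time.Second

	for attempt := 0; attempt < maxAttempts; attempt++ {
		resp, err := client.Get(url)
		if err == nil && resp.StatusCode != http.StatusTooManyRequests {
			return resp, nil // success, or at least not rate-limited
		}
		if resp != nil {
			resp.Body.Close() // discard the throttled response before retrying
		}

		// Exponential backoff: 1s, 2s, 4s, 8s, ...
		delay := baseDelay * time.Duration(1<<uint(attempt))
		time.Sleep(delay)
	}
	return nil, fmt.Errorf("giving up on %s after %d attempts", url, maxAttempts)
}

Inside a Pholcus rule you would wrap the page fetch in a loop like this (or call a helper of your own) before giving up on the URL.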
Conclusion
When using Pholcus, or any web scraping tool, always be aware of the website's scraping policies and terms of service. It's important to scrape responsibly and ethically to avoid legal issues and maintain a good relationship with web service providers. If a website provides an API, prefer using it over web scraping, as APIs are designed to handle requests efficiently and are less likely to result in bans when used correctly.