Pholcus is a distributed, high-concurrency web crawler framework written in Go. To limit scraping speed and mimic human browsing behavior, you can adjust Pholcus's configuration to introduce delays between requests or control concurrency, thereby slowing the request rate.
Here are some strategies you can use to limit the scraping speed in Pholcus:
1. Set a delay between requests: Introduce a pause between each request to mimic the time a human might take to read a page before moving on to the next one.
2. Control concurrency: Limit the number of concurrent requests to reduce the load on the target server and make the scraping activity less aggressive.
3. Randomize delays: Instead of a fixed delay, use a random delay within a range to better simulate human behavior.
4. Respect robots.txt: Make sure your crawler honors the robots.txt file of the target website, which may specify a Crawl-delay for your user-agent.
Here is an example of how you can adjust the configuration in Pholcus to limit the scraping speed:
package main

import (
    "math/rand"
    "time"

    "github.com/henrylee2cn/pholcus/exec"
    "github.com/henrylee2cn/pholcus/spider"
)

func main() {
    // Seed the global random source so delays differ between runs
    // (unnecessary on Go 1.20+, where it is seeded automatically).
    rand.Seed(time.Now().UnixNano())

    // Create a new spider
    mySpider := &spider.Spider{}

    // Randomize the delay between 5 and 10 seconds per request
    mySpider.SetPausetime(func() time.Duration {
        return time.Duration(rand.Intn(6)+5) * time.Second
    })

    // Limit concurrency to a single worker so only one request
    // is processed at a time
    mySpider.SetThreadnum(1)

    // Register the spider, then start the Pholcus runtime with the web UI
    exec.AddSpider(mySpider)
    exec.DefaultRun("web")
}
In this example, SetPausetime is used to set a randomized delay between requests, which helps mimic human browsing patterns, and SetThreadnum is set to 1 to ensure only one request is processed at a time.
Please note that the actual implementation details might be different depending on the version of Pholcus and the specific requirements of your scraping task. Always refer to the official Pholcus documentation for the most accurate and up-to-date information.
Remember that web scraping should be done responsibly and ethically. Always comply with the website's terms of service, and avoid putting excessive load on the servers you are scraping. If possible, try to obtain the needed data through official APIs or by seeking permission from the website owners.