Colly is a Go framework designed for building efficient and elegant web scraping applications. It is known for its simplicity, flexibility, and clean API, which make it a popular choice among developers who are comfortable with the Go programming language. Colly provides a range of features that simplify tasks such as making HTTP requests, parsing HTML documents, extracting data, and managing concurrency.
Here's how Colly is typically used for web scraping:
Installation
Before you can use Colly, you need Go installed and a Go module initialized for your project. Once that's done, you can install Colly using the following command:
go get -u github.com/gocolly/colly/v2
Basic Usage
Here's a simple example of how to use Colly to scrape data from a website:
package main

import (
    "fmt"

    "github.com/gocolly/colly/v2"
)

func main() {
    // Create a new collector
    c := colly.NewCollector()

    // On every <a> element that has an href attribute, call the callback
    c.OnHTML("a[href]", func(e *colly.HTMLElement) {
        link := e.Attr("href")
        // Print the link text and target
        fmt.Printf("Link found: %q -> %s\n", e.Text, link)
    })

    // Before making a request, print "Visiting ..."
    c.OnRequest(func(r *colly.Request) {
        fmt.Println("Visiting", r.URL.String())
    })

    // Start scraping on https://hackerspaces.org
    c.Visit("https://hackerspaces.org/")
}
In this example, a new Collector is created, which is the core object of Colly. The OnHTML method registers a callback function that processes HTML elements matching the given CSS selector, in this case <a> tags with an href attribute. The OnRequest method sets a callback function that is called before each request is made, allowing you to log or modify requests on the fly.
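The same callback mechanism extends beyond link extraction. As a minimal sketch (the article and h2 selectors below are illustrative, not taken from any particular site), the HTMLElement's ChildText and ChildAttr helpers can pull structured data out of each matched element:

package main

import (
    "fmt"

    "github.com/gocolly/colly/v2"
)

func main() {
    c := colly.NewCollector()

    // For every matched <article>, read data from its subtree
    c.OnHTML("article", func(e *colly.HTMLElement) {
        title := e.ChildText("h2")       // concatenated text of <h2> elements inside the match
        link := e.ChildAttr("a", "href") // href of the first <a> inside the match
        fmt.Printf("%s -> %s\n", title, link)
    })

    c.Visit("https://hackerspaces.org/")
}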
Advanced Features
Colly provides several advanced features for more complex scraping tasks; caching, proxy switching, and error handling are sketched together after this list:
- Caching: Colly can cache responses so repeated requests for the same page are served locally instead of re-downloaded.
- Concurrency: Colly allows you to control the number of concurrent requests made by the scraper.
- Cookies and Session Handling: Colly can maintain session information across requests.
- Rate Limiting: Colly can limit the rate at which requests are made to avoid overwhelming the server.
- Proxy Switcher: Colly can rotate between different proxies for each request.
- Error Handling: Colly provides error handling mechanisms to deal with network issues or unexpected content.
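As a combined illustration of caching, proxy switching, and error handling, here is a minimal sketch. The cache directory and proxy addresses are placeholders, the proxy helper comes from Colly's proxy subpackage, and the target URL simply reuses the article's earlier example:

package main

import (
    "fmt"

    "github.com/gocolly/colly/v2"
    "github.com/gocolly/colly/v2/proxy"
)

func main() {
    // Enable on-disk caching: responses are stored in ./colly_cache,
    // so repeated visits to the same URL are served from disk
    c := colly.NewCollector(
        colly.CacheDir("./colly_cache"),
    )

    // Rotate between two proxies on successive requests
    // (the addresses below are placeholders)
    rp, err := proxy.RoundRobinProxySwitcher(
        "socks5://127.0.0.1:1337",
        "socks5://127.0.0.1:1338",
    )
    if err != nil {
        fmt.Println("proxy setup failed:", err)
        return
    }
    c.SetProxyFunc(rp)

    // Handle network errors and non-2xx responses
    c.OnError(func(r *colly.Response, err error) {
        fmt.Println("request to", r.Request.URL, "failed:", err)
    })

    c.Visit("https://hackerspaces.org/")
}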
Example with Concurrency and Rate Limiting
package main

import (
    "fmt"
    "time"

    "github.com/gocolly/colly/v2"
    "github.com/gocolly/colly/v2/queue"
)

func main() {
    // Instantiate an asynchronous collector
    c := colly.NewCollector(
        colly.Async(true),
    )

    // Limit the collector to two parallel requests per domain,
    // with a one-second delay between requests
    c.Limit(&colly.LimitRule{
        DomainGlob:  "*",
        Parallelism: 2,
        Delay:       1 * time.Second,
    })

    // Create a request queue with 2 consumer threads,
    // backed by in-memory storage holding up to 10000 URLs
    q, _ := queue.New(
        2,
        &queue.InMemoryQueueStorage{MaxSize: 10000},
    )

    c.OnHTML("a[href]", func(e *colly.HTMLElement) {
        link := e.Request.AbsoluteURL(e.Attr("href"))
        // Add each discovered link to the queue
        q.AddURL(link)
    })

    // Seed the queue with the starting URL
    q.AddURL("https://hackerspaces.org/")

    // Consume URLs from the queue, then wait for the
    // asynchronous requests to finish
    q.Run(c)
    c.Wait()

    fmt.Println("Done")
}
In this example, the scraper is set up with a concurrency limit and a built-in request queue. The Limit method specifies the concurrency rules, the queue is created by queue.New and drained by q.Run, and, because the collector is asynchronous, c.Wait blocks until every outstanding request has finished.
Colly is a powerful tool for developers who are familiar with Go, providing a structured way to build web scraping solutions. It's important to note that when scraping websites, you should always respect the site's robots.txt file and terms of service, and avoid overloading servers with too many requests.
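On that note, Colly's Collector exposes an IgnoreRobotsTxt field, and it is enabled by default, so honoring robots.txt is an explicit opt-in. A minimal sketch, assuming default collector settings:

package main

import (
    "fmt"

    "github.com/gocolly/colly/v2"
)

func main() {
    c := colly.NewCollector()

    // IgnoreRobotsTxt defaults to true; setting it to false makes the
    // collector fetch robots.txt and refuse to visit disallowed URLs
    c.IgnoreRobotsTxt = false

    // A visit blocked by robots.txt is reported as an error
    if err := c.Visit("https://hackerspaces.org/"); err != nil {
        fmt.Println("visit failed (possibly disallowed by robots.txt):", err)
    }
}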