Colly is a popular scraping framework for Go (Golang), designed for simplicity and ease of use. To start the scraping process with Colly, you typically use the Collector object, which provides the Visit function. The Visit function initiates a GET request to the specified URL, and it is where your scraping process begins.
Here's a step-by-step guide on how to use the c.Visit function in Colly:
Step 1: Install Colly
First, you need to have Colly installed. You can install it using the go get command:
go get -u github.com/gocolly/colly/v2
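If you are working inside a Go module (the default for modern Go toolchains), you may need to initialize the module before adding the dependency. The module path below is only a placeholder; use your own:
go mod init example.com/scraper
go get -u github.com/gocolly/colly/v2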
Step 2: Import Colly in Your Go Program
Create a new .go file and import Colly at the beginning of your file:
package main

import (
	"fmt"

	"github.com/gocolly/colly/v2"
)
Step 3: Create a New Colly Collector
Instantiate a new Colly Collector:
func main() {
	// Instantiate the default collector
	c := colly.NewCollector(
		// Optionally configure the collector
		colly.AllowedDomains("example.com", "www.example.com"),
		// ... other options
	)

	// ... set up callbacks and options for the collector
}
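As a rough sketch of what other options can look like, here is a collector configured with a custom user agent and a depth limit; the user agent string and depth value are arbitrary examples, not required settings:
c := colly.NewCollector(
	colly.AllowedDomains("example.com", "www.example.com"),
	// Example values only; adjust to your needs
	colly.UserAgent("my-colly-scraper/1.0"),
	colly.MaxDepth(2), // stop following links deeper than two levels
)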
Step 4: Set up Callbacks
Before you start visiting URLs, you typically want to set up callbacks to handle the data that is scraped:
func main() {
	c := colly.NewCollector(
		// ... collector options
	)

	// On every <a> element which has an href attribute, call the callback
	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		link := e.Attr("href")
		fmt.Printf("Link found: %q -> %s\n", e.Text, link)
		// Visit the link found on the page
		// (only links within the allowed domains will be followed)
		e.Request.Visit(link)
	})

	// Before making a request, print "Visiting ..."
	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Visiting", r.URL.String())
	})

	// ... other callbacks
}
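Colly offers more callbacks than OnHTML and OnRequest. As a small sketch (the log messages here are made up for illustration), OnResponse and OnError can be registered on the same collector to observe successful responses and failed requests:
func main() {
	c := colly.NewCollector(
		// ... collector options
	)

	// After a response arrives, report where it came from and its status code
	c.OnResponse(func(r *colly.Response) {
		fmt.Println("Got response from", r.Request.URL, "with status", r.StatusCode)
	})

	// If a request fails, log the error
	c.OnError(func(r *colly.Response, err error) {
		fmt.Println("Request to", r.Request.URL, "failed:", err)
	})

	// ... start scraping with c.Visit(...)
}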
Step 5: Start Scraping
Now you are ready to start scraping by calling the Visit function with the URL you want to scrape:
func main() {
	// ... set up the Collector and callbacks

	// Start scraping on http://example.com
	err := c.Visit("http://example.com")
	if err != nil {
		fmt.Println("Error visiting the page:", err)
	}
}
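Visit can also be called more than once on the same collector, so scraping several start pages is just a loop; the second URL below is a made-up placeholder:
func main() {
	// ... set up the Collector and callbacks

	startURLs := []string{
		"http://example.com",
		"http://www.example.com/about", // placeholder URL for illustration
	}
	for _, u := range startURLs {
		if err := c.Visit(u); err != nil {
			fmt.Println("Error visiting", u, "-", err)
		}
	}
}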
That's it! When c.Visit("http://example.com") is called, Colly sends a GET request to the specified URL. The callbacks you defined will be triggered based on the response received. For example, the OnHTML callback will be called for every HTML element that matches the selector you specified.
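For instance, if you also registered a callback for the title selector (a hypothetical addition, not part of the complete example below), it would fire once per page and hand you the page title through e.Text:
c.OnHTML("title", func(e *colly.HTMLElement) {
	fmt.Println("Page title:", e.Text)
})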
Here's a complete example that puts it all together:
package main

import (
	"fmt"

	"github.com/gocolly/colly/v2"
)

func main() {
	c := colly.NewCollector(
		colly.AllowedDomains("example.com", "www.example.com"),
	)

	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		link := e.Attr("href")
		fmt.Printf("Link found: %q -> %s\n", e.Text, link)
	})

	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Visiting", r.URL.String())
	})

	err := c.Visit("http://example.com")
	if err != nil {
		fmt.Println("Error visiting the page:", err)
	}
}
Remember to handle errors appropriately and to respect the website's robots.txt rules and terms of service so that your scraping activities stay ethical and legal.
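One practical way to stay polite is to limit how fast and how concurrently you crawl. The sketch below assumes you want asynchronous requests with at most two parallel downloads per domain and a one-second delay between them; Wait() blocks until all queued requests have finished:
package main

import (
	"fmt"
	"time"

	"github.com/gocolly/colly/v2"
)

func main() {
	c := colly.NewCollector(
		colly.AllowedDomains("example.com", "www.example.com"),
		colly.Async(true), // send requests concurrently
	)

	// Allow at most 2 parallel requests per domain, spaced 1 second apart
	if err := c.Limit(&colly.LimitRule{
		DomainGlob:  "*",
		Parallelism: 2,
		Delay:       1 * time.Second,
	}); err != nil {
		fmt.Println("Error setting limit rule:", err)
	}

	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Visiting", r.URL.String())
	})

	if err := c.Visit("http://example.com"); err != nil {
		fmt.Println("Error visiting the page:", err)
	}

	// Wait until all asynchronous requests have completed
	c.Wait()
}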