Colly is a powerful Go web scraping framework that makes data extraction simple and efficient. This guide covers the complete installation process and provides practical examples to get you started.
Prerequisites
Before installing Colly, ensure you have:
- Go 1.16 or later installed on your system
- Basic familiarity with Go modules
- A properly configured $GOPATH (only if using Go < 1.16)
Installation Steps
1. Initialize Your Go Project
First, create a new directory and initialize a Go module:
mkdir colly-scraper
cd colly-scraper
go mod init colly-scraper
For existing projects, navigate to your project directory:
cd path/to/your/existing/project
2. Install Colly
Install the latest version of Colly v2 using go get:
go get github.com/gocolly/colly/v2
This command will:
- Download Colly and its dependencies
- Update your go.mod file automatically
- Create a go.sum file for dependency verification
3. Verify Installation
Check your go.mod file to confirm Colly was added (the exact versions will vary depending on when you install):
module colly-scraper
go 1.21
require github.com/gocolly/colly/v2 v2.1.0
require (
	github.com/PuerkitoBio/goquery v1.8.1 // indirect
	github.com/andybalholm/cascadia v1.3.1 // indirect
	github.com/antchfx/htmlquery v1.3.0 // indirect
	github.com/antchfx/xmlquery v1.3.17 // indirect
	github.com/antchfx/xpath v1.2.4 // indirect
	github.com/gobwas/glob v0.2.3 // indirect
	github.com/golang/groupcache v0.0.0-20210331224755-41bb18bfe9da // indirect
	github.com/golang/protobuf v1.5.3 // indirect
	github.com/kennygrant/sanitize v1.2.4 // indirect
	github.com/saintfish/chardet v0.0.0-20230101081208-5e3ef4b5456d // indirect
	github.com/temoto/robotstxt v1.1.2 // indirect
	golang.org/x/net v0.12.0 // indirect
	golang.org/x/text v0.11.0 // indirect
	google.golang.org/appengine v1.6.7 // indirect
	google.golang.org/protobuf v1.31.0 // indirect
)
Basic Usage Examples
Simple Web Scraper
Create a main.go file with this basic example:
package main

import (
	"fmt"
	"log"

	"github.com/gocolly/colly/v2"
)

func main() {
	// Create a new collector restricted to a specific domain
	c := colly.NewCollector(
		colly.AllowedDomains("example.com"),
	)

	// Set up callbacks
	c.OnHTML("h1", func(e *colly.HTMLElement) {
		fmt.Printf("Title: %s\n", e.Text)
	})

	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		link := e.Attr("href")
		fmt.Printf("Link: %s -> %s\n", e.Text, link)
	})

	c.OnRequest(func(r *colly.Request) {
		fmt.Printf("Visiting: %s\n", r.URL.String())
	})

	c.OnError(func(r *colly.Response, err error) {
		log.Printf("Request URL: %s failed with response: %v\nError: %s",
			r.Request.URL, r, err)
	})

	// Start scraping
	if err := c.Visit("https://example.com"); err != nil {
		log.Fatal(err)
	}
}
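To see what the collector is doing for you, here is a rough stdlib-only sketch of the fetch-then-extract cycle behind OnHTML. It uses a local httptest server in place of a real site and a regexp in place of goquery's CSS selector engine — both are deliberate simplifications for illustration, not how Colly is implemented:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"regexp"
)

// extractTitle pulls the first <h1> text out of raw HTML with a regexp.
// Colly resolves selectors like "h1" with a real HTML parser instead.
func extractTitle(html string) string {
	m := regexp.MustCompile(`<h1>(.*?)</h1>`).FindStringSubmatch(html)
	if len(m) > 1 {
		return m[1]
	}
	return ""
}

func main() {
	// A throwaway local server standing in for a real website.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, `<html><body><h1>Example Domain</h1></body></html>`)
	}))
	defer srv.Close()

	// Fetch and read the page, roughly what Visit does before
	// dispatching the parsed document to OnHTML callbacks.
	resp, err := http.Get(srv.URL)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)

	fmt.Printf("Title: %s\n", extractTitle(string(body)))
}
```

Colly layers domain filtering, callback dispatch, cookie handling, and robots.txt support on top of this basic loop, which is why the real examples above stay so short.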
Advanced Example with Rate Limiting
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/gocolly/colly/v2"
	"github.com/gocolly/colly/v2/debug"
)

func main() {
	c := colly.NewCollector(
		colly.Async(true),                    // Run requests concurrently so Wait() is meaningful
		colly.Debugger(&debug.LogDebugger{}), // Enable debugging
	)

	// Limit parallelism and add a delay between requests
	err := c.Limit(&colly.LimitRule{
		DomainGlob:  "*",
		Parallelism: 2,
		Delay:       1 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}

	// Set custom headers
	c.OnRequest(func(r *colly.Request) {
		r.Headers.Set("User-Agent", "MyBot 1.0")
	})

	c.OnHTML("title", func(e *colly.HTMLElement) {
		fmt.Printf("Page title: %s\n", e.Text)
	})

	if err := c.Visit("https://httpbin.org/"); err != nil {
		log.Fatal(err)
	}
	c.Wait() // Wait for all in-flight requests to complete
}
Running Your Scraper
Execute your scraper with:
go run main.go
For production builds:
go build -o scraper main.go
./scraper
Installation Troubleshooting
Common Issues
Module not found error:
go mod tidy
go get github.com/gocolly/colly/v2@latest
Permission denied or blocked proxy errors:
go env -w GOPROXY=direct
SSL certificate or checksum errors:
go env -w GOSUMDB=off
Note that GOSUMDB=off disables checksum verification of downloaded modules; re-enable it with go env -w GOSUMDB=sum.golang.org once the underlying issue is fixed.
Next Steps
- Explore Colly's debugging capabilities
- Learn about extensions for additional features
- Check the official documentation for advanced usage patterns
You now have Colly successfully installed and ready for web scraping in your Go project!