What are the best Go libraries for web scraping?
Go offers several powerful libraries for web scraping, each with unique strengths and use cases. Here's a comprehensive guide to the best Go libraries for web scraping, complete with code examples and practical implementations.
1. Colly - The Most Popular Go Scraping Framework
Colly is the most popular web scraping framework for Go, offering a clean API and excellent performance for crawling websites at scale.
Key Features
- Fast, lean API built on Go's net/http (HTTP/1.1 and HTTP/2)
- Synchronous, asynchronous, and parallel scraping modes
- Distributed scraping support
- Automatic cookie and session handling
- Built-in caching and rate limiting
Basic Colly Example
package main
import (
"fmt"
"github.com/gocolly/colly/v2"
"github.com/gocolly/colly/v2/debug"
)
func main() {
// Create a new collector
c := colly.NewCollector(
colly.Debugger(&debug.LogDebugger{}),
)
// Set user agent
c.UserAgent = "Mozilla/5.0 (compatible; Go-Scraper/1.0)"
// Find and visit all links
c.OnHTML("a[href]", func(e *colly.HTMLElement) {
link := e.Attr("href")
fmt.Printf("Found link: %s\n", link)
e.Request.Visit(link)
})
// Extract data
c.OnHTML("h1", func(e *colly.HTMLElement) {
fmt.Printf("Title: %s\n", e.Text)
})
// Start scraping
c.Visit("https://example.com")
}
Advanced Colly with Rate Limiting
package main
import (
"fmt"
"time"
"github.com/gocolly/colly/v2"
"github.com/gocolly/colly/v2/extensions"
)
func main() {
c := colly.NewCollector()
// Add random user agent
extensions.RandomUserAgent(c)
// Limit requests per domain
c.Limit(&colly.LimitRule{
DomainGlob: "*",
Parallelism: 2,
Delay: 1 * time.Second,
})
// Handle forms and extract data
c.OnHTML("form", func(e *colly.HTMLElement) {
action := e.Attr("action")
method := e.Attr("method")
fmt.Printf("Form: %s %s\n", method, action)
})
c.Visit("https://example.com")
}
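Response Caching with Colly
The built-in caching mentioned in the feature list is enabled with a single collector option. A minimal sketch follows; cachedCollector is a hypothetical helper name and "./colly_cache" an arbitrary path:
// cachedCollector is an illustrative helper; "./colly_cache" is an arbitrary path
func cachedCollector() *colly.Collector {
	c := colly.NewCollector(
		// CacheDir stores GET responses on disk and replays them on later runs
		colly.CacheDir("./colly_cache"),
	)
	c.OnResponse(func(r *colly.Response) {
		fmt.Printf("Fetched %s (%d bytes)\n", r.Request.URL, len(r.Body))
	})
	return c
}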
2. GoQuery - jQuery-like HTML Parsing
GoQuery provides jQuery-like syntax for HTML parsing and manipulation, making it familiar for developers with web development experience.
Key Features
- jQuery-like selector syntax
- CSS selector support
- DOM traversal and manipulation
- Works well with standard HTTP clients
GoQuery Example
package main
import (
"fmt"
"log"
"net/http"
"github.com/PuerkitoBio/goquery"
)
func main() {
// Make HTTP request
res, err := http.Get("https://example.com")
if err != nil {
log.Fatal(err)
}
defer res.Body.Close()
// Parse HTML
doc, err := goquery.NewDocumentFromReader(res.Body)
if err != nil {
log.Fatal(err)
}
// Extract data using CSS selectors
doc.Find("h2").Each(func(i int, s *goquery.Selection) {
title := s.Text()
link, exists := s.Find("a").Attr("href")
fmt.Printf("Title %d: %s\n", i, title)
if exists {
fmt.Printf("Link: %s\n", link)
}
})
// Extract metadata
doc.Find("meta").Each(func(i int, s *goquery.Selection) {
name, _ := s.Attr("name")
content, _ := s.Attr("content")
if name != "" {
fmt.Printf("Meta %s: %s\n", name, content)
}
})
}
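Combining GoQuery with a Configured HTTP Client
GoQuery only parses HTML, so real scrapers usually pair it with an http.Client that sets a timeout and a User-Agent. A sketch of that pairing; fetchDocument, the timeout, and the User-Agent string are illustrative choices, not part of the GoQuery API:
import (
	"fmt"
	"net/http"
	"time"

	"github.com/PuerkitoBio/goquery"
)

// fetchDocument is an illustrative helper combining an HTTP client with GoQuery
func fetchDocument(url string) (*goquery.Document, error) {
	client := &http.Client{Timeout: 10 * time.Second}
	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("User-Agent", "Mozilla/5.0 (compatible; Go-Scraper/1.0)")
	res, err := client.Do(req)
	if err != nil {
		return nil, err
	}
	defer res.Body.Close()
	if res.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status: %s", res.Status)
	}
	return goquery.NewDocumentFromReader(res.Body)
}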
3. Chromedp - Chrome DevTools Protocol
Chromedp is a Go library for controlling Chrome/Chromium browsers programmatically, perfect for JavaScript-heavy websites.
Key Features
- Full browser automation
- JavaScript execution
- Screenshot capture
- PDF generation
- Network interception
Chromedp Example
package main
import (
"context"
"fmt"
"log"
"time"
"github.com/chromedp/chromedp"
"github.com/chromedp/cdproto/cdp"
)
func main() {
// Create context
ctx, cancel := chromedp.NewContext(context.Background())
defer cancel()
// Set timeout
ctx, cancel = context.WithTimeout(ctx, 15*time.Second)
defer cancel()
var title string
var nodes []*cdp.Node
// Navigate and extract data
err := chromedp.Run(ctx,
chromedp.Navigate("https://example.com"),
chromedp.WaitVisible("body"),
chromedp.Title(&title),
chromedp.Nodes("h1, h2, h3", &nodes, chromedp.ByQueryAll),
)
if err != nil {
log.Fatal(err)
}
fmt.Printf("Page Title: %s\n", title)
// Extract text from nodes
for _, node := range nodes {
var text string
		err := chromedp.Run(ctx, chromedp.Text([]cdp.NodeID{node.NodeID}, &text, chromedp.ByNodeID))
if err == nil {
fmt.Printf("Heading: %s\n", text)
}
}
}
JavaScript Execution with Chromedp
func scrapeWithJS() {
ctx, cancel := chromedp.NewContext(context.Background())
defer cancel()
var result string
err := chromedp.Run(ctx,
chromedp.Navigate("https://spa-example.com"),
chromedp.WaitVisible("#dynamic-content"),
chromedp.Evaluate(`
JSON.stringify({
title: document.title,
links: Array.from(document.querySelectorAll('a')).map(a => a.href),
text: document.body.innerText.substring(0, 100)
})
`, &result),
)
if err != nil {
log.Fatal(err)
}
fmt.Printf("Scraped data: %s\n", result)
}
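Screenshot Capture with Chromedp
The screenshot support listed under Chromedp's features follows the same Run pattern. A brief sketch, assuming the same imports as above plus os; captureScreenshot and the output filename are arbitrary:
func captureScreenshot() {
	ctx, cancel := chromedp.NewContext(context.Background())
	defer cancel()

	var buf []byte
	err := chromedp.Run(ctx,
		chromedp.Navigate("https://example.com"),
		// Quality 100 keeps PNG output; lower values switch to JPEG
		chromedp.FullScreenshot(&buf, 100),
	)
	if err != nil {
		log.Fatal(err)
	}
	if err := os.WriteFile("screenshot.png", buf, 0o644); err != nil {
		log.Fatal(err)
	}
}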
4. Rod - DevTools Protocol Alternative
Rod is another Chrome DevTools Protocol library for browser automation. Its chained Must* API keeps simple scripts concise, and it can download and manage a browser binary automatically.
Rod Example
package main
import (
"fmt"
"github.com/go-rod/rod"
)
func main() {
// Launch browser
browser := rod.New().MustConnect()
defer browser.MustClose()
// Navigate to page
page := browser.MustPage("https://example.com")
// Wait for element and extract text
title := page.MustElement("h1").MustText()
fmt.Printf("Title: %s\n", title)
// Extract all links
links := page.MustElements("a")
for _, link := range links {
		href := link.MustAttribute("href") // nil when the attribute is absent
		text := link.MustText()
		if href == nil {
			continue
		}
		fmt.Printf("Link: %s -> %s\n", text, *href)
}
}
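Rod Without the Must Helpers
The Must* methods above panic on any failure, which suits short scripts but not long-running scrapers. Rod also exposes error-returning variants; a rough sketch of that style, assuming an extra import of github.com/go-rod/rod/lib/proto (the function name is illustrative):
func scrapeWithRodErrors(url string) (string, error) {
	browser := rod.New()
	if err := browser.Connect(); err != nil {
		return "", err
	}
	defer browser.MustClose()

	page, err := browser.Page(proto.TargetCreateTarget{URL: url})
	if err != nil {
		return "", err
	}
	el, err := page.Element("h1")
	if err != nil {
		return "", err
	}
	return el.Text()
}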
5. Surf - Stateful Web Browsing
Surf provides a stateful browsing experience with form handling and cookie management.
Surf Example
package main
import (
	"fmt"

	"github.com/PuerkitoBio/goquery"
	"github.com/headzoo/surf"
)
func main() {
// Create browser
bow := surf.NewBrowser()
// Visit page
err := bow.Open("https://example.com")
if err != nil {
panic(err)
}
// Extract data
fmt.Printf("Title: %s\n", bow.Title())
fmt.Printf("URL: %s\n", bow.Url())
// Find forms
bow.Find("form").Each(func(_ int, s *goquery.Selection) {
action, _ := s.Attr("action")
method, _ := s.Attr("method")
fmt.Printf("Form: %s %s\n", method, action)
})
}
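Form Submission with Surf
The form handling Surf is built around looks roughly like this. A sketch only: the login URL, the form#login selector, and the field names user and pass are all hypothetical:
func submitLoginForm() error {
	bow := surf.NewBrowser()
	if err := bow.Open("https://example.com/login"); err != nil {
		return err
	}
	// Look up the form by CSS selector and fill its fields
	fm, err := bow.Form("form#login")
	if err != nil {
		return err
	}
	fm.Input("user", "myuser")
	fm.Input("pass", "mypassword")
	// Submitting keeps cookies in the browser, so later requests stay authenticated
	if err := fm.Submit(); err != nil {
		return err
	}
	fmt.Println("Now at:", bow.Url())
	return nil
}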
Performance Comparison and Use Cases
When to Use Each Library
Colly - Best for:
- Large-scale web crawling
- Sites requiring rate limiting
- Distributed scraping
- Performance-critical applications
GoQuery - Best for:
- Simple HTML parsing
- Static content extraction
- When you need jQuery-like syntax
- Lightweight scraping tasks
Chromedp/Rod - Best for:
- JavaScript-heavy websites
- Single Page Applications (SPAs)
- When you need browser automation
- Screenshot/PDF generation
Surf - Best for:
- Form submissions
- Session management
- Stateful browsing
Best Practices for Go Web Scraping
1. Respect robots.txt
import "github.com/temoto/robotstxt"
func checkRobots(url string) bool {
robots, err := robotstxt.FromURL(url + "/robots.txt")
if err != nil {
return true // Allow if robots.txt not found
}
return robots.TestAgent(url, "Go-Scraper")
}
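If you crawl with Colly, it can also enforce robots.txt for you; a one-line sketch, assuming the collector's IgnoreRobotsTxt field keeps its documented behavior:
c := colly.NewCollector()
c.IgnoreRobotsTxt = false // fetch and honor robots.txt before visiting pages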
2. Implement Proper Error Handling
func scrapeWithRetry(url string, maxRetries int) error {
	// AllowURLRevisit lets the same URL be requested again on retry;
	// by default Colly refuses to revisit a URL it has already seen
	c := colly.NewCollector(colly.AllowURLRevisit())
var lastErr error
for i := 0; i < maxRetries; i++ {
err := c.Visit(url)
if err == nil {
return nil
}
lastErr = err
time.Sleep(time.Duration(i+1) * time.Second)
}
return lastErr
}
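Colly also surfaces failures through an OnError callback, which is often simpler than wrapping Visit in a retry loop; a brief sketch (logErrors is an illustrative helper):
func logErrors(c *colly.Collector) {
	// OnError fires when a request fails or the server returns an error status
	c.OnError(func(r *colly.Response, err error) {
		fmt.Printf("request to %s failed (status %d): %v\n", r.Request.URL, r.StatusCode, err)
	})
}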
3. Use Concurrent Processing
func concurrentScraping(urls []string) {
c := colly.NewCollector(colly.Async(true))
c.Limit(&colly.LimitRule{DomainGlob: "*", Parallelism: 10})
c.OnHTML("title", func(e *colly.HTMLElement) {
fmt.Printf("Title: %s\n", e.Text)
})
for _, url := range urls {
c.Visit(url)
}
c.Wait()
}
4. Handle HTTP Headers and User Agents
func setupAdvancedColly() *colly.Collector {
c := colly.NewCollector()
// Set custom headers
c.OnRequest(func(r *colly.Request) {
r.Headers.Set("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8")
r.Headers.Set("Accept-Language", "en-US,en;q=0.5")
r.Headers.Set("Accept-Encoding", "gzip, deflate")
r.Headers.Set("DNT", "1")
r.Headers.Set("Connection", "keep-alive")
r.Headers.Set("Upgrade-Insecure-Requests", "1")
})
return c
}
Installing Go Web Scraping Libraries
Installation Commands
# Install Colly
go mod init webscraper
go get -u github.com/gocolly/colly/v2
# Install GoQuery
go get github.com/PuerkitoBio/goquery
# Install Chromedp
go get -u github.com/chromedp/chromedp
# Install Rod
go get github.com/go-rod/rod
# Install Surf
go get github.com/headzoo/surf
Dependencies Setup
// go.mod example
module webscraper
go 1.19
require (
github.com/PuerkitoBio/goquery v1.8.1
github.com/chromedp/chromedp v0.9.2
github.com/go-rod/rod v0.112.0
github.com/gocolly/colly/v2 v2.1.0
github.com/headzoo/surf v1.0.1
)
Handling Dynamic Content and JavaScript
For JavaScript-heavy sites, where the content you want is injected by AJAX calls after the initial page load, use Chromedp or Rod to wait for the dynamic content to appear:
// Wait for dynamic content with Chromedp
func waitForDynamicContent() {
ctx, cancel := chromedp.NewContext(context.Background())
defer cancel()
var result string
err := chromedp.Run(ctx,
chromedp.Navigate("https://spa-site.com"),
chromedp.WaitVisible("#dynamic-content", chromedp.ByID),
chromedp.Sleep(2*time.Second), // Additional wait
chromedp.InnerHTML("#content", &result),
)
if err != nil {
log.Fatal(err)
}
fmt.Printf("Dynamic content: %s\n", result)
}
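The Rod equivalent is shorter because element lookups block until the selector appears. A sketch, with the URL, selector, and timeout chosen arbitrarily:
// Wait for dynamic content with Rod
func waitWithRod() {
	browser := rod.New().MustConnect()
	defer browser.MustClose()

	page := browser.MustPage("https://spa-site.com")
	// MustElement blocks until the selector matches; Timeout bounds the wait
	el := page.Timeout(15 * time.Second).MustElement("#dynamic-content")
	fmt.Println(el.MustHTML())
}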
Conclusion
Go offers excellent libraries for web scraping, each suited for different scenarios. Colly excels for large-scale crawling, GoQuery provides familiar jQuery syntax, while Chromedp and Rod handle JavaScript-heavy sites effectively. Choose the library that best fits your specific scraping requirements, considering factors like performance, complexity, and the type of content you're extracting.
For projects requiring sophisticated session management and browser automation capabilities, these Go libraries provide the necessary tools to handle complex navigation patterns and dynamic content loading efficiently.