What are the best Go libraries for web scraping?

Go offers several powerful libraries for web scraping, each with unique strengths and use cases. Here's a comprehensive guide to the best Go libraries for web scraping, complete with code examples and practical implementations.

1. Colly - The Most Popular Go Scraping Framework

Colly is the most popular web scraping framework for Go, offering a clean API and excellent performance for crawling websites at scale. Note that Colly fetches and parses static HTML; for JavaScript rendering, see Chromedp and Rod below.

Key Features

  • Fast, callback-based API built on Go's standard net/http
  • Sync, async, and parallel scraping modes
  • Distributed scraping support
  • Automatic cookie and session handling
  • Built-in caching, rate limiting, and robots.txt support (see the caching sketch below)

Basic Colly Example

package main

import (
    "fmt"
    "log"
    "github.com/gocolly/colly/v2"
    "github.com/gocolly/colly/v2/debug"
)

func main() {
    // Create a new collector with debug logging
    c := colly.NewCollector(
        colly.Debugger(&debug.LogDebugger{}),
    )

    // Set user agent
    c.UserAgent = "Mozilla/5.0 (compatible; Go-Scraper/1.0)"

    // Find and visit all links
    c.OnHTML("a[href]", func(e *colly.HTMLElement) {
        link := e.Attr("href")
        fmt.Printf("Found link: %s\n", link)
        // The returned error (e.g. for already-visited URLs) is safe to ignore here
        e.Request.Visit(link)
    })

    // Extract data
    c.OnHTML("h1", func(e *colly.HTMLElement) {
        fmt.Printf("Title: %s\n", e.Text)
    })

    // Start scraping
    if err := c.Visit("https://example.com"); err != nil {
        log.Fatal(err)
    }
}
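
Caching with Colly

The built-in caching from the feature list is a one-line collector option. Here's a minimal sketch; the cache directory path is an arbitrary choice:

package main

import (
    "fmt"
    "github.com/gocolly/colly/v2"
)

func main() {
    // CacheDir stores responses on disk, so repeated runs during
    // development don't re-fetch unchanged pages
    c := colly.NewCollector(
        colly.CacheDir("./colly_cache"),
    )

    c.OnHTML("title", func(e *colly.HTMLElement) {
        fmt.Printf("Title: %s\n", e.Text)
    })

    // Second and later runs are served from the cache
    c.Visit("https://example.com")
}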

Advanced Colly with Rate Limiting

package main

import (
    "fmt"
    "log"
    "time"
    "github.com/gocolly/colly/v2"
    "github.com/gocolly/colly/v2/extensions"
)

func main() {
    c := colly.NewCollector()

    // Rotate through random user agents
    extensions.RandomUserAgent(c)

    // Limit requests per domain: at most 2 in parallel, 1 second apart
    if err := c.Limit(&colly.LimitRule{
        DomainGlob:  "*",
        Parallelism: 2,
        Delay:       1 * time.Second,
    }); err != nil {
        log.Fatal(err)
    }

    // Handle forms and extract data
    c.OnHTML("form", func(e *colly.HTMLElement) {
        action := e.Attr("action")
        method := e.Attr("method")
        fmt.Printf("Form: %s %s\n", method, action)
    })

    if err := c.Visit("https://example.com"); err != nil {
        log.Fatal(err)
    }
}

2. GoQuery - jQuery-like HTML Parsing

GoQuery provides jQuery-like syntax for HTML parsing and manipulation, making it familiar for developers with web development experience.

Key Features

  • jQuery-like selector syntax
  • CSS selector support
  • DOM traversal and manipulation
  • Works well with standard HTTP clients

GoQuery Example

package main

import (
    "fmt"
    "log"
    "net/http"
    "github.com/PuerkitoBio/goquery"
)

func main() {
    // Make HTTP request
    res, err := http.Get("https://example.com")
    if err != nil {
        log.Fatal(err)
    }
    defer res.Body.Close()
    if res.StatusCode != http.StatusOK {
        log.Fatalf("status code error: %d %s", res.StatusCode, res.Status)
    }

    // Parse HTML
    doc, err := goquery.NewDocumentFromReader(res.Body)
    if err != nil {
        log.Fatal(err)
    }

    // Extract data using CSS selectors
    doc.Find("h2").Each(func(i int, s *goquery.Selection) {
        title := s.Text()
        link, exists := s.Find("a").Attr("href")

        fmt.Printf("Title %d: %s\n", i, title)
        if exists {
            fmt.Printf("Link: %s\n", link)
        }
    })

    // Extract metadata
    doc.Find("meta").Each(func(i int, s *goquery.Selection) {
        name, _ := s.Attr("name")
        content, _ := s.Attr("content")
        if name != "" {
            fmt.Printf("Meta %s: %s\n", name, content)
        }
    })
}
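
DOM Traversal with GoQuery

The traversal helpers listed above (Parent, Next, Children, and friends) follow jQuery semantics. A minimal sketch using an inline HTML fragment, so no network access is needed:

package main

import (
    "fmt"
    "log"
    "strings"
    "github.com/PuerkitoBio/goquery"
)

func main() {
    html := `<ul><li><a href="/a">First</a></li><li><a href="/b">Second</a></li></ul>`
    doc, err := goquery.NewDocumentFromReader(strings.NewReader(html))
    if err != nil {
        log.Fatal(err)
    }

    // Walk from each anchor up to its parent <li>, then over to the next item
    doc.Find("a").Each(func(i int, s *goquery.Selection) {
        next := s.Parent().Next().Find("a").Text()
        fmt.Printf("Item %d: %s (next: %q)\n", i, s.Text(), next)
    })
}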

3. Chromedp - Chrome DevTools Protocol

Chromedp is a Go library for controlling Chrome/Chromium browsers programmatically, perfect for JavaScript-heavy websites.

Key Features

  • Full browser automation
  • JavaScript execution
  • Screenshot capture
  • PDF generation
  • Network interception

Chromedp Example

package main

import (
    "context"
    "fmt"
    "log"
    "time"
    "github.com/chromedp/chromedp"
    "github.com/chromedp/cdproto/cdp"
)

func main() {
    // Create context
    ctx, cancel := chromedp.NewContext(context.Background())
    defer cancel()

    // Set timeout
    ctx, cancel = context.WithTimeout(ctx, 15*time.Second)
    defer cancel()

    var title string
    var nodes []*cdp.Node

    // Navigate and extract data
    err := chromedp.Run(ctx,
        chromedp.Navigate("https://example.com"),
        chromedp.WaitVisible("body"),
        chromedp.Title(&title),
        chromedp.Nodes("h1, h2, h3", &nodes, chromedp.ByQueryAll),
    )
    if err != nil {
        log.Fatal(err)
    }

    fmt.Printf("Page Title: %s\n", title)

    // Extract text from each node, selecting by node ID
    for _, node := range nodes {
        var text string
        err := chromedp.Run(ctx, chromedp.Text([]cdp.NodeID{node.NodeID}, &text, chromedp.ByNodeID))
        if err == nil {
            fmt.Printf("Heading: %s\n", text)
        }
    }
}

JavaScript Execution with Chromedp

func scrapeWithJS() {
    ctx, cancel := chromedp.NewContext(context.Background())
    defer cancel()

    var result string

    err := chromedp.Run(ctx,
        chromedp.Navigate("https://spa-example.com"),
        chromedp.WaitVisible("#dynamic-content"),
        chromedp.Evaluate(`
            JSON.stringify({
                title: document.title,
                links: Array.from(document.querySelectorAll('a')).map(a => a.href),
                text: document.body.innerText.substring(0, 100)
            })
        `, &result),
    )

    if err != nil {
        log.Fatal(err)
    }

    fmt.Printf("Scraped data: %s\n", result)
}
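
Screenshot Capture with Chromedp

The feature list above also mentions screenshots. Here's a minimal sketch using the FullScreenshot action; the output filename is an arbitrary choice:

package main

import (
    "context"
    "log"
    "os"
    "github.com/chromedp/chromedp"
)

func main() {
    ctx, cancel := chromedp.NewContext(context.Background())
    defer cancel()

    var buf []byte
    // FullScreenshot captures the entire page; a quality below 100
    // produces JPEG, while 100 produces lossless PNG
    err := chromedp.Run(ctx,
        chromedp.Navigate("https://example.com"),
        chromedp.FullScreenshot(&buf, 90),
    )
    if err != nil {
        log.Fatal(err)
    }

    if err := os.WriteFile("screenshot.jpg", buf, 0o644); err != nil {
        log.Fatal(err)
    }
}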

4. Rod - DevTools Protocol Alternative

Rod is another Chrome DevTools Protocol library with a terser, chainable API than Chromedp; it can also download a compatible browser automatically if none is installed.

Rod Example

package main

import (
    "fmt"
    "github.com/go-rod/rod"
)

func main() {
    // Launch browser
    browser := rod.New().MustConnect()
    defer browser.MustClose()

    // Navigate to page
    page := browser.MustPage("https://example.com")

    // Wait for element and extract text
    title := page.MustElement("h1").MustText()
    fmt.Printf("Title: %s\n", title)

    // Extract all links
    links := page.MustElements("a")
    for _, link := range links {
        href := link.MustAttribute("href") // nil when the attribute is absent
        text := link.MustText()
        if href != nil {
            fmt.Printf("Link: %s -> %s\n", text, *href)
        }
    }
}
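
Waiting for Dynamic Content with Rod

For JavaScript-rendered pages, Rod's MustElement blocks until the selector matches, and MustWaitLoad waits for the load event. A minimal sketch, assuming the page eventually renders an h1:

package main

import (
    "fmt"
    "time"
    "github.com/go-rod/rod"
)

func main() {
    browser := rod.New().MustConnect()
    defer browser.MustClose()

    // Timeout bounds every subsequent operation on this page value
    page := browser.MustPage("https://example.com").Timeout(15 * time.Second)

    // Wait for the load event, then block until the element appears
    page.MustWaitLoad()
    heading := page.MustElement("h1").MustText()
    fmt.Printf("Heading: %s\n", heading)
}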

5. Surf - Stateful Web Browsing

Surf provides a stateful browsing experience with form handling and cookie management.

Surf Example

package main

import (
    "fmt"
    "github.com/PuerkitoBio/goquery"
    "github.com/headzoo/surf"
)

func main() {
    // Create browser
    bow := surf.NewBrowser()

    // Visit page
    err := bow.Open("https://example.com")
    if err != nil {
        panic(err)
    }

    // Extract data
    fmt.Printf("Title: %s\n", bow.Title())
    fmt.Printf("URL: %s\n", bow.Url())

    // Find forms
    bow.Find("form").Each(func(_ int, s *goquery.Selection) {
        action, _ := s.Attr("action")
        method, _ := s.Attr("method")
        fmt.Printf("Form: %s %s\n", method, action)
    })
}
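
Form Submission with Surf

Since stateful form handling is Surf's main draw, here's a minimal login-flow sketch; the URL, form selector, and field names are placeholders for whatever the target form actually uses:

package main

import (
    "fmt"
    "log"
    "github.com/headzoo/surf"
)

func main() {
    bow := surf.NewBrowser()

    if err := bow.Open("https://example.com/login"); err != nil {
        log.Fatal(err)
    }

    // Look up the form by CSS selector, fill its fields, and submit;
    // cookies from the response carry over to later requests
    fm, err := bow.Form("form#login")
    if err != nil {
        log.Fatal(err)
    }
    fm.Input("username", "user@example.com")
    fm.Input("password", "secret")
    if err := fm.Submit(); err != nil {
        log.Fatal(err)
    }

    fmt.Printf("Now at: %s\n", bow.Url())
}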

Library Comparison and Use Cases

When to Use Each Library

Colly - Best for:

  • Large-scale web crawling
  • Sites requiring rate limiting
  • Distributed scraping
  • Performance-critical applications

GoQuery - Best for:

  • Simple HTML parsing
  • Static content extraction
  • When you need jQuery-like syntax
  • Lightweight scraping tasks

Chromedp/Rod - Best for:

  • JavaScript-heavy websites
  • Single Page Applications (SPAs)
  • Browser automation
  • Screenshot/PDF generation

Surf - Best for:

  • Form submissions
  • Session management
  • Stateful browsing

Best Practices for Go Web Scraping

1. Respect robots.txt

import "github.com/temoto/robotstxt"

func checkRobots(url string) bool {
    robots, err := robotstxt.FromURL(url + "/robots.txt")
    if err != nil {
        return true // Allow if robots.txt not found
    }

    return robots.TestAgent(url, "Go-Scraper")
}

2. Implement Proper Error Handling

func scrapeWithRetry(url string, maxRetries int) error {
    // AllowURLRevisit is required: without it, colly refuses to
    // re-visit a URL it has already attempted
    c := colly.NewCollector(colly.AllowURLRevisit())

    var lastErr error
    for i := 0; i < maxRetries; i++ {
        err := c.Visit(url)
        if err == nil {
            return nil
        }
        lastErr = err
        time.Sleep(time.Duration(i+1) * time.Second) // simple linear backoff
    }

    return lastErr
}

3. Use Concurrent Processing

func concurrentScraping(urls []string) {
    // Async mode makes Visit non-blocking; the limit rule caps
    // concurrency at 10 parallel requests per domain
    c := colly.NewCollector(colly.Async(true))
    c.Limit(&colly.LimitRule{DomainGlob: "*", Parallelism: 10})

    c.OnHTML("title", func(e *colly.HTMLElement) {
        fmt.Printf("Title: %s\n", e.Text)
    })

    for _, url := range urls {
        c.Visit(url)
    }

    // Wait blocks until all in-flight requests have finished
    c.Wait()
}

4. Handle HTTP Headers and User Agents

func setupAdvancedColly() *colly.Collector {
    c := colly.NewCollector()

    // Set browser-like headers on every request. Accept-Encoding is left
    // to Go's HTTP transport, which then decompresses gzip responses
    // transparently; setting it manually would disable that.
    c.OnRequest(func(r *colly.Request) {
        r.Headers.Set("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8")
        r.Headers.Set("Accept-Language", "en-US,en;q=0.5")
        r.Headers.Set("DNT", "1")
        r.Headers.Set("Connection", "keep-alive")
        r.Headers.Set("Upgrade-Insecure-Requests", "1")
    })

    return c
}

Installing Go Web Scraping Libraries

Installation Commands

# Install Colly
go mod init scraper
go get -u github.com/gocolly/colly/v2

# Install GoQuery
go get github.com/PuerkitoBio/goquery

# Install Chromedp
go get -u github.com/chromedp/chromedp

# Install Rod
go get github.com/go-rod/rod

# Install Surf
go get github.com/headzoo/surf

Dependencies Setup

// go.mod example
module webscraper

go 1.19

require (
    github.com/PuerkitoBio/goquery v1.8.1
    github.com/chromedp/chromedp v0.9.2
    github.com/go-rod/rod v0.112.0
    github.com/gocolly/colly/v2 v2.1.0
    github.com/headzoo/surf v1.0.1
)

Handling Dynamic Content and JavaScript

For JavaScript-heavy sites, you can use Chromedp or Rod to wait for dynamic content before extracting it, much as other browser automation tools handle AJAX-driven pages:

// Wait for dynamic content with Chromedp
func waitForDynamicContent() {
    ctx, cancel := chromedp.NewContext(context.Background())
    defer cancel()

    var result string

    err := chromedp.Run(ctx,
        chromedp.Navigate("https://spa-site.com"),
        chromedp.WaitVisible("#dynamic-content", chromedp.ByID),
        chromedp.Sleep(2*time.Second), // extra settling time for late AJAX updates
        chromedp.InnerHTML("#dynamic-content", &result),
    )

    if err != nil {
        log.Fatal(err)
    }

    fmt.Printf("Dynamic content: %s\n", result)
}

Conclusion

Go offers excellent libraries for web scraping, each suited for different scenarios. Colly excels for large-scale crawling, GoQuery provides familiar jQuery syntax, while Chromedp and Rod handle JavaScript-heavy sites effectively. Choose the library that best fits your specific scraping requirements, considering factors like performance, complexity, and the type of content you're extracting.

For projects requiring sophisticated session management and browser automation capabilities, these Go libraries provide the necessary tools to handle complex navigation patterns and dynamic content loading efficiently.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
