What are some popular Go packages for web scraping?

Web scraping in Go (often referred to as "Golang") can be performed with a variety of packages. Each offers different features and capabilities, so the right choice depends on the specific needs of the project. Here are some popular Go packages for web scraping:

  • Colly
    • Repository: https://github.com/gocolly/colly
    • Description: Colly is one of the most popular web scraping frameworks for Go. It provides a clean and intuitive API for visiting pages, extracting data, and handling concurrency. It is also extensible, with support for caching, rate limiting, and other advanced features.
   package main

   import (
       "fmt"
       "log"

       "github.com/gocolly/colly/v2"
   )

   func main() {
       c := colly.NewCollector()

       // Called for every anchor element with an href attribute.
       c.OnHTML("a[href]", func(e *colly.HTMLElement) {
           link := e.Attr("href")
           fmt.Printf("Link found: %q -> %s\n", e.Text, link)
       })

       // Called before each request is sent.
       c.OnRequest(func(r *colly.Request) {
           fmt.Println("Visiting", r.URL)
       })

       if err := c.Visit("http://go-colly.org/"); err != nil {
           log.Fatal(err)
       }
   }
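
   Colly can also crawl concurrently. A minimal sketch of its concurrency features, using the async collector option and a limit rule (the target URLs below are placeholders):

   package main

   import (
       "fmt"
       "log"

       "github.com/gocolly/colly/v2"
   )

   func main() {
       // Async(true) makes Visit non-blocking so pages are fetched concurrently.
       c := colly.NewCollector(colly.Async(true))

       // Throttle to at most two parallel requests per matching domain.
       if err := c.Limit(&colly.LimitRule{DomainGlob: "*", Parallelism: 2}); err != nil {
           log.Fatal(err)
       }

       c.OnResponse(func(r *colly.Response) {
           fmt.Println("Fetched", r.Request.URL, "-", len(r.Body), "bytes")
       })

       // Placeholder URLs; Visit errors are ignored here for brevity.
       for _, url := range []string{"https://example.com/", "https://example.org/"} {
           c.Visit(url)
       }

       // Wait blocks until all queued requests have finished.
       c.Wait()
   }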
  • GoQuery
    • Repository: https://github.com/PuerkitoBio/goquery
    • Description: GoQuery is a powerful and expressive library that brings jQuery-like syntax to Go for parsing and manipulating HTML documents. It is particularly useful for tasks that involve searching or transforming HTML documents.
   package main

   import (
       "fmt"
       "log"
       "net/http"

       "github.com/PuerkitoBio/goquery"
   )

   func main() {
       res, err := http.Get("http://example.com/")
       if err != nil {
           log.Fatal(err)
       }
       defer res.Body.Close()

       if res.StatusCode != http.StatusOK {
           log.Fatalf("status code error: %d %s", res.StatusCode, res.Status)
       }

       // Parse the response body into a queryable document.
       doc, err := goquery.NewDocumentFromReader(res.Body)
       if err != nil {
           log.Fatal(err)
       }

       // Select every anchor element and print its href attribute.
       doc.Find("a").Each(func(i int, s *goquery.Selection) {
           href, _ := s.Attr("href")
           fmt.Printf("Link %d: %s\n", i, href)
       })
   }
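
   Beyond searching, GoQuery can also modify a document in place. A minimal sketch, using inline placeholder markup, that rewrites relative links to absolute URLs:

   package main

   import (
       "fmt"
       "log"
       "strings"

       "github.com/PuerkitoBio/goquery"
   )

   func main() {
       // Inline markup stands in for a fetched page (placeholder content).
       const html = `<ul><li><a href="/a">A</a></li><li><a href="/b">B</a></li></ul>`

       doc, err := goquery.NewDocumentFromReader(strings.NewReader(html))
       if err != nil {
           log.Fatal(err)
       }

       // Rewrite each relative href to an absolute URL in place.
       doc.Find("a[href]").Each(func(_ int, s *goquery.Selection) {
           href, _ := s.Attr("href")
           s.SetAttr("href", "https://example.com"+href)
       })

       out, err := doc.Html()
       if err != nil {
           log.Fatal(err)
       }
       fmt.Println(out)
   }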
  • Rod
    • Repository: https://github.com/go-rod/rod
    • Description: Rod is a high-level driver based directly on the Chrome DevTools Protocol. It can be used for web scraping as well as browser automation, and it provides features to simulate browser actions such as clicking, typing, and navigation.
   package main

   import (
       "github.com/go-rod/rod"
   )

   func main() {
       // Launch (or connect to) a browser; Must* helpers panic on error.
       browser := rod.New().MustConnect()
       defer browser.MustClose()

       // Open a page, wait for the load event, and save a screenshot.
       page := browser.MustPage("https://example.com/")
       page.MustWaitLoad().MustScreenshot("example.png")
   }
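
   A minimal sketch of Rod's simulated input, assuming a hypothetical page with a search form (the URL, selectors, and query are placeholders):

   package main

   import (
       "fmt"

       "github.com/go-rod/rod"
   )

   func main() {
       browser := rod.New().MustConnect()
       defer browser.MustClose()

       page := browser.MustPage("https://example.com/search")

       // Type into an input field, then submit by clicking a button.
       page.MustElement(`input[name="q"]`).MustInput("golang scraping")
       page.MustElement(`button[type="submit"]`).MustClick()

       // Wait for the resulting page to load, then read some text from it.
       page.MustWaitLoad()
       fmt.Println(page.MustElement("h1").MustText())
   }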
  • Surf
    • Repository: https://github.com/headzoo/surf
    • Description: Surf is a stateful web browser for Go that mimics real browser interactions, including handling cookies and session data. It is built on top of the GoQuery library.
   package main

   import (
       "fmt"
       "log"

       "github.com/headzoo/surf"
   )

   func main() {
       bow := surf.NewBrowser()

       // The browser keeps cookies and history across requests.
       if err := bow.Open("http://example.com/"); err != nil {
           log.Fatal(err)
       }

       fmt.Println(bow.Title())
   }
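
   Because Surf keeps cookies across requests, it can maintain a logged-in session. A minimal sketch, assuming a hypothetical login form (the URLs, form selector, and field names are placeholders):

   package main

   import (
       "fmt"
       "log"

       "github.com/headzoo/surf"
   )

   func main() {
       bow := surf.NewBrowser()

       // Load the login page; the URL is a placeholder.
       if err := bow.Open("https://example.com/login"); err != nil {
           log.Fatal(err)
       }

       // Fill and submit the form; selector and field names are placeholders.
       fm, err := bow.Form("form#login")
       if err != nil {
           log.Fatal(err)
       }
       fm.Input("username", "alice")
       fm.Input("password", "secret")
       if err := fm.Submit(); err != nil {
           log.Fatal(err)
       }

       // Cookies set during login are sent automatically on later requests.
       if err := bow.Open("https://example.com/account"); err != nil {
           log.Fatal(err)
       }
       fmt.Println(bow.Title())
   }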

These packages are robust solutions for web scraping in Go. Your choice depends on whether you need a high-level API (like Colly), direct DOM manipulation (like GoQuery), browser automation (like Rod), or a stateful browser simulation (like Surf). Always remember to respect the terms of service and robots.txt of the websites you scrape, and consider the legal and ethical implications of your web scraping activities.
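
For the robots.txt check mentioned above, one option is a dedicated parser such as the third-party github.com/temoto/robotstxt package. A minimal sketch (the site, user agent, and path are placeholders):

   package main

   import (
       "fmt"
       "log"
       "net/http"

       "github.com/temoto/robotstxt"
   )

   func main() {
       // Fetch the site's robots.txt; the host is a placeholder.
       res, err := http.Get("https://example.com/robots.txt")
       if err != nil {
           log.Fatal(err)
       }
       defer res.Body.Close()

       robots, err := robotstxt.FromResponse(res)
       if err != nil {
           log.Fatal(err)
       }

       // Ask whether our user agent may crawl a given path.
       group := robots.FindGroup("MyScraperBot")
       fmt.Println("Allowed to fetch /private/:", group.Test("/private/"))
   }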
