What is Colly and how is it used for web scraping?

Colly is a Golang framework for building fast, elegant web scraping applications. It is known for its simplicity, flexibility, and clean API, which make it a popular choice among developers comfortable with the Go programming language. Colly provides features that simplify making HTTP requests, parsing HTML documents, extracting data, and managing concurrency.

Here's how Colly is typically used for web scraping:

Installation

Before you can use Colly, you need Go installed and a Go module set up for your project. Once that's done, you can add Colly with the following command:

go get -u github.com/gocolly/colly/v2

Basic Usage

Here's a simple example of how to use Colly to scrape data from a website:

package main

import (
    "fmt"
    "github.com/gocolly/colly/v2"
)

func main() {
    // Create a new collector
    c := colly.NewCollector()

    // On every <a> element that has an href attribute, call the callback
    c.OnHTML("a[href]", func(e *colly.HTMLElement) {
        link := e.Attr("href")
        // Print link
        fmt.Printf("Link found: %q -> %s\n", e.Text, link)
    })

    // Before making a request print "Visiting ..."
    c.OnRequest(func(r *colly.Request) {
        fmt.Println("Visiting", r.URL.String())
    })

    // Start scraping on https://hackerspaces.org
    c.Visit("https://hackerspaces.org/")
}

In this example, a new Collector is created, which is the core of Colly. The OnHTML method is used to specify a callback function that processes HTML elements matching the given CSS selector—in this case, <a> tags with an href attribute. The OnRequest method sets a callback function that is called before each request is made, allowing you to log or modify requests on the fly.

Advanced Features

Colly provides several advanced features for more complex scraping tasks:

  • Caching: Colly supports caching of requests to avoid hitting the same page multiple times.
  • Concurrency: Colly allows you to control the number of concurrent requests made by the scraper.
  • Cookies and Session Handling: Colly can maintain session information across requests.
  • Rate Limiting: Colly can limit the rate at which requests are made to avoid overwhelming the server.
  • Proxy Switcher: Colly can rotate between different proxies for each request.
  • Error Handling: Colly provides error handling mechanisms to deal with network issues or unexpected content.

Example with Concurrency and Rate Limiting

package main

import (
    "fmt"
    "time"

    "github.com/gocolly/colly/v2"
    "github.com/gocolly/colly/v2/queue"
)

func main() {
    // Instantiate default collector
    c := colly.NewCollector(
        colly.Async(true),
    )

    // Allow at most two concurrent requests per domain,
    // with a one-second delay between requests
    c.Limit(&colly.LimitRule{
        DomainGlob:  "*",
        Parallelism: 2,
        Delay:       1 * time.Second,
    })

    // Create a request queue with 2 consumer threads
    q, _ := queue.New(
        2, // Number of consumer threads
        &queue.InMemoryQueueStorage{MaxSize: 10000}, // Keep pending URLs in memory
    )

    c.OnHTML("a[href]", func(e *colly.HTMLElement) {
        link := e.Request.AbsoluteURL(e.Attr("href"))
        fmt.Println("Queueing", link)
        // Add the link to the queue
        q.AddURL(link)
    })

    // Seed the queue with the start URL; the consumer threads
    // pick up links discovered from there (visited URLs are
    // deduplicated, so adding the same URL twice has no effect)
    q.AddURL("https://hackerspaces.org/")
    // Consume URLs
    q.Run(c)
}

In this example, the collector runs asynchronously with a concurrency limit and a request queue. The Limit method caps parallelism at two requests per domain and enforces a one-second delay between them, while queue.New creates the queue and q.Run feeds its URLs to the collector.

Colly is a powerful tool for developers who are familiar with Go, providing a structured way to build web scraping solutions. It's important to note that when scraping websites, you should always respect the site's robots.txt file and terms of service, and avoid overloading the servers with too many requests.
