How do I extract data from a website using Colly's OnHTML function?

Colly is a popular web scraping framework for Go that provides a convenient way to extract data from websites. OnHTML is one of Colly's key features: it registers a callback function that is invoked whenever an HTML element matching a given selector is found during the scraping process.

Here's a step-by-step guide on how to use Colly's OnHTML function:

Step 1: Install Colly

First, you need to install Colly. You can do this by running the following command in your terminal:

go get -u github.com/gocolly/colly/v2

Step 2: Set Up a Colly Collector

Next, you need to create a new Colly collector, which is the scraper instance:

package main

import (
    "fmt"
    "github.com/gocolly/colly/v2"
)

func main() {
    // Create a new collector
    c := colly.NewCollector(
        // Optionally, you can set various options on the collector
        colly.AllowedDomains("example.com", "www.example.com"),
    )

    // ... setup OnHTML and other callbacks
}
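
NewCollector accepts functional options beyond AllowedDomains. Here is a minimal sketch of a few commonly used ones; the depth limit and User-Agent string are illustrative values, not recommendations:

package main

import (
    "github.com/gocolly/colly/v2"
)

func main() {
    c := colly.NewCollector(
        colly.AllowedDomains("example.com", "www.example.com"),
        // Limit how many links deep the collector will follow (illustrative value)
        colly.MaxDepth(2),
        // Identify the scraper with a custom User-Agent string (illustrative value)
        colly.UserAgent("my-colly-tutorial/1.0"),
    )

    // ... register callbacks here (Step 3), then start the crawl (Step 4)
    c.Visit("http://example.com")
}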

Step 3: Use OnHTML to Define Callbacks

Now you can use OnHTML to define what should happen when the scraper encounters specific HTML elements. You provide a selector string and a callback function. The selector is a standard CSS selector (Colly uses goquery for matching), and the callback function processes each matched element.

For example, to scrape all the article titles from a blog, you might use an OnHTML function like this:

// ...
func main() {
    c := colly.NewCollector(
        colly.AllowedDomains("example.com", "www.example.com"),
    )

    // On every <a> element with the class "article-title" call the callback
    c.OnHTML("a.article-title", func(e *colly.HTMLElement) {
        // e.Attr("href") will get the href attribute from the <a> element
        link := e.Attr("href")
        // e.Text will get the text content of the <a> element
        fmt.Printf("Article found: %q -> %s\n", e.Text, link)
    })

    // ... start the collector
}
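
Attr and Text are not the only accessors. When you match a container element, the HTMLElement passed to the callback also exposes ChildText, ChildAttr, and ForEach for pulling structured data out of the matched block. A sketch, assuming hypothetical div.article, h2, and span.tag markup:

// ...
func main() {
    c := colly.NewCollector()

    // Match each article card and read its parts with the child helpers
    c.OnHTML("div.article", func(e *colly.HTMLElement) {
        title := e.ChildText("h2")       // text of the card's <h2> heading
        link := e.ChildAttr("a", "href") // href of the first matching <a>
        fmt.Printf("%s -> %s\n", title, link)

        // Iterate over repeated children, e.g. tag labels
        e.ForEach("span.tag", func(_ int, tag *colly.HTMLElement) {
            fmt.Println("  tag:", tag.Text)
        })
    })

    c.Visit("http://example.com")
}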

Step 4: Start the Scraping Process

Finally, you need to start the scraping process by telling the collector to visit a URL:

// ...
func main() {
    c := colly.NewCollector(
        // ... same as above
    )

    // ... OnHTML callbacks

    // Start scraping on http://example.com; Visit returns an error
    // if the request fails or is disallowed
    if err := c.Visit("http://example.com"); err != nil {
        fmt.Println("Visit failed:", err)
    }
}

The collector will visit the given URL and process the page according to your OnHTML callbacks. By default this call is synchronous and blocks until scraping is finished; if you create the collector with colly.Async(true), Visit returns immediately and you must call c.Wait().
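
Pages that fail to load never reach your OnHTML callbacks, so it is worth registering Colly's OnError callback alongside them. A minimal sketch:

// ...
func main() {
    c := colly.NewCollector()

    // Called whenever a request fails or the server returns an error status
    c.OnError(func(r *colly.Response, err error) {
        fmt.Println("Request to", r.Request.URL, "failed:", err)
    })

    c.Visit("http://example.com")
}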

Full Example

Putting it all together, here's a full example that scrapes article titles and links from a hypothetical blog:

package main

import (
    "fmt"
    "log"

    "github.com/gocolly/colly/v2"
)

func main() {
    c := colly.NewCollector(
        colly.AllowedDomains("blog.example.com"),
    )

    c.OnHTML("a.article-title", func(e *colly.HTMLElement) {
        // Resolve relative hrefs against the URL of the current page
        link := e.Request.AbsoluteURL(e.Attr("href"))
        fmt.Printf("Article found: %q -> %s\n", e.Text, link)
    })

    c.OnRequest(func(r *colly.Request) {
        fmt.Println("Visiting", r.URL.String())
    })

    if err := c.Visit("http://blog.example.com"); err != nil {
        log.Fatal(err)
    }
}
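
If the articles span multiple pages, you can follow links from inside a callback. e.Request.Visit resolves relative URLs against the current page and still respects AllowedDomains; the a.next-page selector below is hypothetical, and the callback would be registered alongside the others in main:

c.OnHTML("a.next-page", func(e *colly.HTMLElement) {
    // Queue the next page for visiting; the relative URL is resolved
    // against the current page, and AllowedDomains still applies
    e.Request.Visit(e.Attr("href"))
})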

Finally, remember to handle any errors that may occur, and respect the website's robots.txt rules and terms of service.
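
One practical way to stay polite is Colly's built-in rate limiting via Limit and LimitRule. A minimal sketch; the delay and parallelism values are arbitrary examples:

// ...
func main() {
    c := colly.NewCollector(
        colly.AllowedDomains("blog.example.com"),
    )

    // Allow at most 2 concurrent requests to matching domains and wait
    // 1 second between them (both values are illustrative)
    c.Limit(&colly.LimitRule{
        DomainGlob:  "*example.com*",
        Parallelism: 2,
        Delay:       1 * time.Second,
    })

    // ... register callbacks and call c.Visit as before
}

Happy scraping!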
