Colly is a popular web scraping framework for Go that provides a convenient way to extract data from websites. The OnHTML function is one of Colly's key features: it registers a callback that runs whenever a matching HTML element is found during the scraping process.
Here's a step-by-step guide on how to use Colly's OnHTML function:
Step 1: Install Colly
First, you need to install Colly. You can do this by running the following command in your terminal:
go get -u github.com/gocolly/colly/v2
Step 2: Setup a Colly Collector
Next, you need to create a new Colly collector, which is the scraper instance:
package main

import (
	"fmt"

	"github.com/gocolly/colly/v2"
)

func main() {
	// Create a new collector
	c := colly.NewCollector(
		// Optionally, you can set various options on the collector
		colly.AllowedDomains("example.com", "www.example.com"),
	)
	// ... setup OnHTML and other callbacks
}
Step 3: Use OnHTML to Define Callbacks
Now you can use OnHTML to define what should happen when the scraper encounters specific HTML elements. You provide a selector string and a callback function: the selector is a CSS-style selector (Colly matches elements via goquery) that picks out the elements you're interested in, and the callback processes each matched element.
For example, to scrape all the article titles from a blog, you might use an OnHTML callback like this:
// ...
func main() {
	c := colly.NewCollector(
		colly.AllowedDomains("example.com", "www.example.com"),
	)
	// On every <a> element with the class "article-title", call the callback
	c.OnHTML("a.article-title", func(e *colly.HTMLElement) {
		// e.Attr("href") returns the href attribute of the <a> element
		link := e.Attr("href")
		// e.Text holds the text content of the <a> element
		fmt.Printf("Article found: %q -> %s\n", e.Text, link)
	})
	// ... start the collector
}
Step 4: Start the Scraping Process
Finally, you need to start the scraping process by telling the collector to visit a URL:
// ...
func main() {
	c := colly.NewCollector(
		// ... same as above
	)
	// ... OnHTML callbacks
	// Start scraping on http://example.com
	c.Visit("http://example.com")
}
The collector will visit the given URL and start processing the page according to your OnHTML callbacks.
Full Example
Putting it all together, here's a full example that scrapes article titles and links from a hypothetical blog:
package main

import (
	"fmt"

	"github.com/gocolly/colly/v2"
)

func main() {
	c := colly.NewCollector(
		colly.AllowedDomains("blog.example.com"),
	)

	c.OnHTML("a.article-title", func(e *colly.HTMLElement) {
		// Resolve relative links against the page URL
		link := e.Request.AbsoluteURL(e.Attr("href"))
		fmt.Printf("Article found: %q -> %s\n", e.Text, link)
	})

	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Visiting", r.URL.String())
	})

	if err := c.Visit("http://blog.example.com"); err != nil {
		fmt.Println("Error visiting site:", err)
	}
}
Remember to handle any errors that may occur and to respect the website's robots.txt rules and terms of service. Happy scraping!