Colly is a popular scraping framework for Go (Golang) that makes it easy to build web scrapers. Handling pagination with Colly is a common task when scraping data from websites that have their content spread across multiple pages.
To handle pagination with Colly, you'll typically need to:
- Identify the pattern or the link that leads to the next page.
- Use Colly's methods to visit the next page URL.
- Implement a callback function that Colly will call for each visited page.
- Make sure to avoid infinite loops by setting conditions for pagination to stop.
Here's a step-by-step example of how to handle pagination with Colly:
```go
package main

import (
	"fmt"
	"log"

	"github.com/gocolly/colly"
)

func main() {
	// Create a new collector
	c := colly.NewCollector(
		// Optionally restrict the domains to visit
		colly.AllowedDomains("example.com"),
	)

	// Find the link to the next page and follow it
	c.OnHTML("a.next", func(e *colly.HTMLElement) {
		nextPage := e.Attr("href")
		if nextPage != "" {
			// Visit the next page
			e.Request.Visit(nextPage)
		}
	})

	// Extract the content of interest from each page
	c.OnHTML("div.content", func(e *colly.HTMLElement) {
		// For example, you might extract articles, products, etc.
		fmt.Println("Content found:", e.Text)
	})

	// Called before Colly makes each request
	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Visiting", r.URL.String())
	})

	// Start scraping on the first page
	if err := c.Visit("http://example.com/start"); err != nil {
		log.Fatal(err)
	}
}
```
In this example, `a.next` is a CSS selector that targets the link to the next page. This could be different on the website you are scraping, such as `a.pagination__next`, `li.next > a`, etc. You'll need to inspect the website's HTML structure to determine the correct selector.
The `OnHTML` callback with the `a.next` selector is used to find the next page link. We then use `e.Request.Visit(nextPage)` to tell Colly to visit the next page.
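One of the steps listed earlier, avoiding infinite loops, isn't shown in the example above. Colly provides a `colly.MaxDepth` collector option that caps how many links deep the crawler will follow, and by default it won't revisit a URL it has already seen. You can also keep your own page counter and stop following next links after a limit. Here is a minimal sketch of such a counter-based guard; the `pageGuard` helper and the limit of 3 are hypothetical, not part of Colly's API:

```go
package main

import "fmt"

// pageGuard returns a function that reports whether another page
// may be visited, returning false after maxPages calls.
func pageGuard(maxPages int) func() bool {
	visited := 0
	return func() bool {
		if visited >= maxPages {
			return false
		}
		visited++
		return true
	}
}

func main() {
	follow := pageGuard(3)
	for i := 1; i <= 5; i++ {
		// Pages 1-3 print follow=true; pages 4-5 print follow=false
		fmt.Printf("page %d: follow=%v\n", i, follow())
	}
}
```

Inside the `a.next` callback you would then wrap the call as `if follow() { e.Request.Visit(nextPage) }`, so pagination stops even if the site keeps emitting next links.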
The other `OnHTML` callback is an example of how you might process the content on each page. Replace `div.content` with the appropriate selector for the content you're interested in.
Finally, we start the scraping process by calling `c.Visit` with the URL of the first page.
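Not every site exposes a "next" link. When pages follow a predictable URL pattern instead, you can generate the page URLs up front and pass each one to `c.Visit`. A minimal sketch, assuming a hypothetical `?page=N` query pattern (adjust the base URL and pattern to the site you're scraping):

```go
package main

import "fmt"

// pageURLs builds the URLs for pages 1..n of a site whose pagination
// follows a ?page=N query pattern (hypothetical; adjust to the real site).
func pageURLs(base string, n int) []string {
	urls := make([]string, 0, n)
	for i := 1; i <= n; i++ {
		urls = append(urls, fmt.Sprintf("%s?page=%d", base, i))
	}
	return urls
}

func main() {
	// Prints the three generated page URLs
	for _, u := range pageURLs("http://example.com/items", 3) {
		fmt.Println(u)
	}
}
```

With a collector in scope, you would loop over these URLs and call `c.Visit(u)` for each, which also makes it trivial to bound how many pages are fetched.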
Remember to handle pagination carefully to respect the website's terms of service and to avoid overloading the server with requests. Consider adding delays or obeying `robots.txt` rules by setting appropriate options on your Colly collector:
```go
c := colly.NewCollector(
	colly.AllowedDomains("example.com"),
)

// Limit Colly to two concurrent requests, with a one-second delay,
// when visiting links whose domain matches the "*example.*" glob.
// (Requires importing "time".)
err := c.Limit(&colly.LimitRule{
	DomainGlob:  "*example.*",
	Parallelism: 2,
	Delay:       1 * time.Second,
})
if err != nil {
	log.Fatal(err)
}
```
Using these settings, you can ensure that your scraper behaves in a more polite manner by limiting its concurrency and adding delays between requests.