How Do I Handle Pagination When Scraping Multiple Pages with Colly?
Pagination is one of the most common challenges in web scraping, especially when dealing with e-commerce sites, search results, or any content that spans multiple pages. Colly, the Go web scraping framework, provides several approaches to handle pagination effectively. This guide covers different pagination patterns and how to implement them using Colly.
Understanding Pagination Patterns
Before diving into implementation, it's important to understand the common pagination patterns you'll encounter:
- Numbered pagination (1, 2, 3... Next)
- Offset-based pagination (URLs with page parameters; a minimal sketch of this case follows the list)
- Load more buttons (AJAX-based pagination)
- Infinite scroll (requires JavaScript execution)
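When the URL pattern and the total number of pages are known up front, the offset-based case needs no link discovery at all. Here is a minimal sketch assuming a hypothetical "?page=N" URL scheme and a fixed page count; the domain and selectors are placeholders:
package main

import (
    "fmt"
    "log"

    "github.com/gocolly/colly/v2"
)

func main() {
    c := colly.NewCollector()

    c.OnHTML(".item", func(e *colly.HTMLElement) {
        fmt.Println(e.ChildText(".title"))
    })

    // Visit each page directly since the URL pattern is known in advance
    for page := 1; page <= 5; page++ {
        pageURL := fmt.Sprintf("https://example.com/products?page=%d", page)
        if err := c.Visit(pageURL); err != nil {
            log.Printf("visit %s failed: %v", pageURL, err)
        }
    }
}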
Basic Pagination Setup
Here's a fundamental approach to handle pagination in Colly:
package main
import (
"fmt"
"strings"
"time"

"github.com/gocolly/colly/v2"
)
func main() {
c := colly.NewCollector()
// Set up rate limiting
c.Limit(&colly.LimitRule{
DomainGlob: "*",
Parallelism: 2,
Delay: 1 * time.Second,
})
// Handle page content
c.OnHTML(".item", func(e *colly.HTMLElement) {
// Extract data from each item
title := e.ChildText(".title")
price := e.ChildText(".price")
fmt.Printf("Item: %s - Price: %s\n", title, price)
})
// Handle pagination links; colly skips URLs it has already visited, so re-matching page 1 is harmless
c.OnHTML("a[href]", func(e *colly.HTMLElement) {
link := e.Attr("href")
if strings.Contains(link, "page=") || strings.Contains(e.Text, "Next") {
c.Visit(e.Request.AbsoluteURL(link))
}
})
// Start scraping
c.Visit("https://example.com/products?page=1")
}
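One detail worth noting: the Parallelism field in the limit rule above only has a practical effect when the collector runs asynchronously; a synchronous collector issues one request at a time regardless. A minimal sketch of the asynchronous variant, reusing the same placeholder selectors and URL:
// Async collectors return from Visit immediately and fetch in the background,
// so Wait() is needed to block until all queued requests have finished.
c := colly.NewCollector(colly.Async(true))

c.Limit(&colly.LimitRule{
    DomainGlob:  "*",
    Parallelism: 2,
    Delay:       1 * time.Second,
})

c.OnHTML(".item", func(e *colly.HTMLElement) {
    fmt.Printf("Item: %s\n", e.ChildText(".title"))
})

c.Visit("https://example.com/products?page=1")
c.Wait()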
Numbered Pagination Strategy
For sites with numbered pagination, you can implement a more controlled approach:
package main
import (
"fmt"
"time"

"github.com/gocolly/colly/v2"
)
func scrapeWithNumberedPagination() {
c := colly.NewCollector()
// Track visited pages to avoid infinite loops
visitedPages := make(map[string]bool)
maxPages := 50 // Set a reasonable limit
c.OnHTML(".product", func(e *colly.HTMLElement) {
title := e.ChildText("h3")
price := e.ChildText(".price")
fmt.Printf("Product: %s - %s\n", title, price)
})
// Handle next page links
c.OnHTML("a.next-page", func(e *colly.HTMLElement) {
// Resolve to an absolute URL so the visited map and Visit agree on the key
nextURL := e.Request.AbsoluteURL(e.Attr("href"))
if !visitedPages[nextURL] && len(visitedPages) < maxPages {
visitedPages[nextURL] = true
time.Sleep(2 * time.Second) // Respectful delay
c.Visit(nextURL)
}
})
// Start from the first page
startURL := "https://example.com/products"
visitedPages[startURL] = true
c.Visit(startURL)
}
Parameter-Based Pagination
Many sites use URL parameters for pagination. Here's how to handle this pattern:
package main
import (
"fmt"
"net/url"
"strconv"
"time"
"github.com/gocolly/colly/v2"
)
func scrapeWithParameterPagination() {
c := colly.NewCollector()
baseURL := "https://example.com/api/products"
currentPage := 1
itemsPerPage := 20
c.OnHTML(".product-list", func(e *colly.HTMLElement) {
// Check if we have items on this page
items := e.ChildTexts(".product")
if len(items) == 0 {
// An empty page means we have gone past the last page of results
return
}
// Process items
for _, item := range items {
fmt.Printf("Item: %s\n", item)
}
// If we have a full page, there might be more
if len(items) == itemsPerPage {
currentPage++
nextURL := buildPaginationURL(baseURL, currentPage, itemsPerPage)
time.Sleep(1 * time.Second)
c.Visit(nextURL)
}
})
// Start scraping
firstPageURL := buildPaginationURL(baseURL, currentPage, itemsPerPage)
c.Visit(firstPageURL)
}
func buildPaginationURL(baseURL string, page, limit int) string {
u, _ := url.Parse(baseURL)
q := u.Query()
q.Set("page", strconv.Itoa(page))
q.Set("limit", strconv.Itoa(limit))
u.RawQuery = q.Encode()
return u.String()
}
Advanced Pagination with Context
For more complex scenarios, you can use Colly's context feature to track pagination state:
package main
import (
"fmt"
"strconv"
"time"

"github.com/gocolly/colly/v2"
)
type PaginationContext struct {
CurrentPage int
MaxPages int
TotalItems int
}
func scrapeWithContext() {
c := colly.NewCollector()
c.OnHTML(".pagination-info", func(e *colly.HTMLElement) {
// Extract pagination metadata
totalPagesText := e.ChildText(".total-pages")
if totalPages, err := strconv.Atoi(totalPagesText); err == nil {
ctx := &PaginationContext{
CurrentPage: 1,
MaxPages: totalPages,
TotalItems: 0,
}
e.Request.Ctx.Put("pagination", ctx)
}
})
c.OnHTML(".product", func(e *colly.HTMLElement) {
// Extract product data
title := e.ChildText("h2")
price := e.ChildText(".price")
fmt.Printf("Product: %s - %s\n", title, price)
// Update context
if pagination := e.Request.Ctx.GetAny("pagination"); pagination != nil {
ctx := pagination.(*PaginationContext)
ctx.TotalItems++
}
})
c.OnHTML("a.next", func(e *colly.HTMLElement) {
if pagination := e.Request.Ctx.GetAny("pagination"); pagination != nil {
ctx := pagination.(*PaginationContext)
if ctx.CurrentPage < ctx.MaxPages {
ctx.CurrentPage++
nextURL := e.Request.AbsoluteURL(e.Attr("href"))
time.Sleep(2 * time.Second)
// Reuse the current request's context so pagination state carries over to the next page
c.Request("GET", nextURL, nil, e.Request.Ctx, nil)
}
}
})
c.Visit("https://example.com/products")
}
Handling Dynamic Pagination
For sites with JavaScript-heavy pagination, you might need to extract pagination URLs from JSON responses or API calls:
package main
import (
"encoding/json"
"fmt"
"time"
"github.com/gocolly/colly/v2"
)
type APIResponse struct {
Data []Product `json:"data"`
NextPage string `json:"next_page_url"`
HasMore bool `json:"has_more"`
}
type Product struct {
ID int `json:"id"`
Name string `json:"name"`
Price string `json:"price"`
}
func scrapeAPIWithPagination() {
c := colly.NewCollector()
c.OnResponse(func(r *colly.Response) {
var apiResponse APIResponse
if err := json.Unmarshal(r.Body, &apiResponse); err != nil {
fmt.Printf("Error parsing JSON: %v\n", err)
return
}
// Process products
for _, product := range apiResponse.Data {
fmt.Printf("Product: %s - %s\n", product.Name, product.Price)
}
// Continue to next page if available
if apiResponse.HasMore && apiResponse.NextPage != "" {
time.Sleep(1 * time.Second)
c.Visit(apiResponse.NextPage)
}
})
// Start with the first page of the API
c.Visit("https://api.example.com/products?page=1")
}
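If the endpoint serves both HTML and JSON, it can help to request JSON explicitly. A small addition to the collector above, set in an OnRequest callback; the header value below is the conventional one, so adjust it to whatever the actual API expects:
c.OnRequest(func(r *colly.Request) {
    // Hint to the (hypothetical) API that we want a JSON response
    r.Headers.Set("Accept", "application/json")
})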
Best Practices for Pagination Scraping
1. Implement Rate Limiting
Always respect the target website by implementing appropriate delays:
c.Limit(&colly.LimitRule{
DomainGlob: "*example.com",
Parallelism: 1,
Delay: 2 * time.Second,
})
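The limit rule also accepts a RandomDelay field, which adds jitter on top of the fixed delay so request timing looks less mechanical:
c.Limit(&colly.LimitRule{
    DomainGlob:  "*example.com",
    Parallelism: 1,
    Delay:       2 * time.Second,
    RandomDelay: 1 * time.Second, // up to 1s of extra, randomized delay per request
})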
2. Set Maximum Page Limits
Prevent infinite loops by setting reasonable limits:
const maxPages = 100
var pageCount = 0
c.OnHTML("a.next", func(e *colly.HTMLElement) {
pageCount++
if pageCount >= maxPages {
return
}
// Continue pagination logic
})
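If next-page links are followed with e.Request.Visit instead of c.Visit, colly's MaxDepth collector option can act as an additional safety net, because child requests inherit and increment the request depth. A short sketch; the depth value is arbitrary:
c := colly.NewCollector(
    colly.MaxDepth(100), // requests more than 100 links deep are skipped
)

c.OnHTML("a.next", func(e *colly.HTMLElement) {
    // e.Request.Visit creates a child request at depth+1, so MaxDepth is enforced;
    // calling c.Visit here would start over at depth 1 and bypass the limit
    e.Request.Visit(e.Attr("href"))
})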
3. Handle Errors Gracefully
Implement proper error handling for failed requests:
c.OnError(func(r *colly.Response, err error) {
fmt.Printf("Error on page %s: %v\n", r.Request.URL, err)
// Optionally retry the request
time.Sleep(5 * time.Second)
r.Request.Retry()
})
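Retrying unconditionally can loop forever on a permanently failing URL. One way to bound it is to count attempts in the request context; in the sketch below, the "retries" key and the limit of 3 are arbitrary choices:
c.OnError(func(r *colly.Response, err error) {
    // Track attempts per request in its context ("retries" is an arbitrary key)
    retries, _ := r.Ctx.GetAny("retries").(int)
    if retries >= 3 {
        fmt.Printf("Giving up on %s after %d retries: %v\n", r.Request.URL, retries, err)
        return
    }
    r.Ctx.Put("retries", retries+1)
    time.Sleep(5 * time.Second)
    r.Request.Retry()
})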
4. Use Caching for Development
During development, cache responses to avoid repeated requests:
c.CacheDir = "./colly_cache"
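The same setting is available as a collector option, which makes it easy to turn on only during development; a small sketch where the environment variable name is made up for illustration (requires the os package):
// Enable on-disk response caching only when SCRAPER_DEV is set
opts := []colly.CollectorOption{}
if os.Getenv("SCRAPER_DEV") != "" {
    opts = append(opts, colly.CacheDir("./colly_cache"))
}
c := colly.NewCollector(opts...)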
Debugging Pagination Issues
When pagination isn't working as expected, add debug output:
c.OnRequest(func(r *colly.Request) {
fmt.Printf("Visiting: %s\n", r.URL)
})
c.OnHTML("a[href]", func(e *colly.HTMLElement) {
link := e.Attr("href")
text := strings.TrimSpace(e.Text)
fmt.Printf("Found link: %s -> %s\n", text, link)
})
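Colly also ships a built-in debugger that logs every collector event, which is often quicker than sprinkling print statements:
import (
    "github.com/gocolly/colly/v2"
    "github.com/gocolly/colly/v2/debug"
)

// LogDebugger writes each request, response, and error event to standard error
c := colly.NewCollector(
    colly.Debugger(&debug.LogDebugger{}),
)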
Conclusion
Handling pagination with Colly requires understanding the specific pagination pattern used by your target website. Whether it's numbered pages, parameter-based URLs, or API responses, Colly provides the flexibility to handle various scenarios effectively. Remember to always implement rate limiting, error handling, and reasonable limits to create robust and respectful scrapers.
For more complex scenarios involving JavaScript-heavy pagination, you might want to look into how to handle AJAX requests using Puppeteer or how to run multiple pages in parallel with Puppeteer for more advanced scraping setups.
The key is to start simple, understand the pagination pattern, and gradually add complexity as needed while maintaining respectful scraping practices.