How Do I Handle Pagination When Scraping Multiple Pages with Colly?
Pagination is one of the most common challenges in web scraping, especially when dealing with e-commerce sites, search results, or any content that spans multiple pages. Colly, the Go web scraping framework, provides several approaches to handle pagination effectively. This guide covers different pagination patterns and how to implement them using Colly.
Understanding Pagination Patterns
Before diving into implementation, it's important to understand the common pagination patterns you'll encounter:
- Numbered pagination (1, 2, 3... Next)
- Offset-based pagination (URLs with page parameters; a minimal sketch of this case follows the list)
- Load more buttons (AJAX-based pagination)
- Infinite scroll (requires JavaScript execution)
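When the URL pattern and the total number of pages are known up front, the offset-based case needs no link discovery at all. Here is a minimal sketch assuming a hypothetical "?page=N" URL scheme and a fixed page count; the domain and selectors are placeholders:
package main

import (
    "fmt"
    "log"

    "github.com/gocolly/colly/v2"
)

func main() {
    c := colly.NewCollector()

    c.OnHTML(".item", func(e *colly.HTMLElement) {
        fmt.Println(e.ChildText(".title"))
    })

    // Visit each page directly since the URL pattern is known in advance
    for page := 1; page <= 5; page++ {
        pageURL := fmt.Sprintf("https://example.com/products?page=%d", page)
        if err := c.Visit(pageURL); err != nil {
            log.Printf("visit %s failed: %v", pageURL, err)
        }
    }
}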
Basic Pagination Setup
Here's a fundamental approach to handle pagination in Colly:
package main
import (
"fmt"
"strings"
"time"

"github.com/gocolly/colly/v2"
)
func main() {
c := colly.NewCollector()
// Set up rate limiting
c.Limit(&colly.LimitRule{
DomainGlob: "*",
Parallelism: 2,
Delay: 1 * time.Second,
})
// Handle page content
c.OnHTML(".item", func(e *colly.HTMLElement) {
// Extract data from each item
title := e.ChildText(".title")
price := e.ChildText(".price")
fmt.Printf("Item: %s - Price: %s\n", title, price)
})
// Handle pagination links; colly skips URLs it has already visited, so re-matching page 1 is harmless
c.OnHTML("a[href]", func(e *colly.HTMLElement) {
link := e.Attr("href")
if strings.Contains(link, "page=") || strings.Contains(e.Text, "Next") {
c.Visit(e.Request.AbsoluteURL(link))
}
})
// Start scraping
c.Visit("https://example.com/products?page=1")
}
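One detail worth noting: the Parallelism field in the limit rule above only has a practical effect when the collector runs asynchronously; a synchronous collector issues one request at a time regardless. A minimal sketch of the asynchronous variant, reusing the same placeholder selectors and URL:
// Async collectors return from Visit immediately and fetch in the background,
// so Wait() is needed to block until all queued requests have finished.
c := colly.NewCollector(colly.Async(true))

c.Limit(&colly.LimitRule{
    DomainGlob:  "*",
    Parallelism: 2,
    Delay:       1 * time.Second,
})

c.OnHTML(".item", func(e *colly.HTMLElement) {
    fmt.Printf("Item: %s\n", e.ChildText(".title"))
})

c.Visit("https://example.com/products?page=1")
c.Wait()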
Numbered Pagination Strategy
For sites with numbered pagination, you can implement a more controlled approach:
package main
import (
"fmt"
"time"

"github.com/gocolly/colly/v2"
)
func scrapeWithNumberedPagination() {
c := colly.NewCollector()
// Track visited pages to avoid infinite loops
visitedPages := make(map[string]bool)
maxPages := 50 // Set a reasonable limit
c.OnHTML(".product", func(e *colly.HTMLElement) {
title := e.ChildText("h3")
price := e.ChildText(".price")
fmt.Printf("Product: %s - %s\n", title, price)
})
// Handle next page links
c.OnHTML("a.next-page", func(e *colly.HTMLElement) {
// Resolve to an absolute URL so the visited map and Visit agree on the key
nextURL := e.Request.AbsoluteURL(e.Attr("href"))
if !visitedPages[nextURL] && len(visitedPages) < maxPages {
visitedPages[nextURL] = true
time.Sleep(2 * time.Second) // Respectful delay
c.Visit(nextURL)
}
})
// Start from the first page
startURL := "https://example.com/products"
visitedPages[startURL] = true
c.Visit(startURL)
}
Parameter-Based Pagination
Many sites use URL parameters for pagination. Here's how to handle this pattern:
package main
import (
"fmt"
"net/url"
"strconv"
"time"
"github.com/gocolly/colly/v2"
)
func scrapeWithParameterPagination() {
c := colly.NewCollector()
baseURL := "https://example.com/api/products"
currentPage := 1
itemsPerPage := 20
c.OnHTML(".product-list", func(e *colly.HTMLElement) {
// Check if we have items on this page
items := e.ChildTexts(".product")
if len(items) == 0 {
// An empty page means we have gone past the last page of results
return
}
// Process items
for _, item := range items {
fmt.Printf("Item: %s\n", item)
}
// If we have a full page, there might be more
if len(items) == itemsPerPage {
currentPage++
nextURL := buildPaginationURL(baseURL, currentPage, itemsPerPage)
time.Sleep(1 * time.Second)
c.Visit(nextURL)
}
})
// Start scraping
firstPageURL := buildPaginationURL(baseURL, currentPage, itemsPerPage)
c.Visit(firstPageURL)
}
func buildPaginationURL(baseURL string, page, limit int) string {
u, _ := url.Parse(baseURL)
q := u.Query()
q.Set("page", strconv.Itoa(page))
q.Set("limit", strconv.Itoa(limit))
u.RawQuery = q.Encode()
return u.String()
}
Advanced Pagination with Context
For more complex scenarios, you can use Colly's context feature to track pagination state:
package main
import (
"fmt"
"strconv"
"time"

"github.com/gocolly/colly/v2"
)
type PaginationContext struct {
CurrentPage int
MaxPages int
TotalItems int
}
func scrapeWithContext() {
c := colly.NewCollector()
c.OnHTML(".pagination-info", func(e *colly.HTMLElement) {
// Extract pagination metadata
totalPagesText := e.ChildText(".total-pages")
if totalPages, err := strconv.Atoi(totalPagesText); err == nil {
ctx := &PaginationContext{
CurrentPage: 1,
MaxPages: totalPages,
TotalItems: 0,
}
e.Request.Ctx.Put("pagination", ctx)
}
})
c.OnHTML(".product", func(e *colly.HTMLElement) {
// Extract product data
title := e.ChildText("h2")
price := e.ChildText(".price")
fmt.Printf("Product: %s - %s\n", title, price)
// Update context
if pagination := e.Request.Ctx.GetAny("pagination"); pagination != nil {
ctx := pagination.(*PaginationContext)
ctx.TotalItems++
}
})
c.OnHTML("a.next", func(e *colly.HTMLElement) {
if pagination := e.Request.Ctx.GetAny("pagination"); pagination != nil {
ctx := pagination.(*PaginationContext)
if ctx.CurrentPage < ctx.MaxPages {
ctx.CurrentPage++
nextURL := e.Request.AbsoluteURL(e.Attr("href"))
time.Sleep(2 * time.Second)
// Reuse the current request's context so pagination state carries over to the next page
c.Request("GET", nextURL, nil, e.Request.Ctx, nil)
}
}
})
c.Visit("https://example.com/products")
}
Handling Dynamic Pagination
For sites with JavaScript-heavy pagination, you might need to extract pagination URLs from JSON responses or API calls:
package main
import (
"encoding/json"
"fmt"
"time"
"github.com/gocolly/colly/v2"
)
type APIResponse struct {
Data []Product `json:"data"`
NextPage string `json:"next_page_url"`
HasMore bool `json:"has_more"`
}
type Product struct {
ID int `json:"id"`
Name string `json:"name"`
Price string `json:"price"`
}
func scrapeAPIWithPagination() {
c := colly.NewCollector()
c.OnResponse(func(r *colly.Response) {
var apiResponse APIResponse
if err := json.Unmarshal(r.Body, &apiResponse); err != nil {
fmt.Printf("Error parsing JSON: %v\n", err)
return
}
// Process products
for _, product := range apiResponse.Data {
fmt.Printf("Product: %s - %s\n", product.Name, product.Price)
}
// Continue to next page if available
if apiResponse.HasMore && apiResponse.NextPage != "" {
time.Sleep(1 * time.Second)
c.Visit(apiResponse.NextPage)
}
})
// Start with the first page of the API
c.Visit("https://api.example.com/products?page=1")
}
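If the endpoint serves both HTML and JSON, it can help to request JSON explicitly. A small addition to the collector above, set in an OnRequest callback; the header value below is the conventional one, so adjust it to whatever the actual API expects:
c.OnRequest(func(r *colly.Request) {
    // Hint to the (hypothetical) API that we want a JSON response
    r.Headers.Set("Accept", "application/json")
})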
Best Practices for Pagination Scraping
1. Implement Rate Limiting
Always respect the target website by implementing appropriate delays:
c.Limit(&colly.LimitRule{
DomainGlob: "*example.com",
Parallelism: 1,
Delay: 2 * time.Second,
})
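The limit rule also accepts a RandomDelay field, which adds jitter on top of the fixed delay so request timing looks less mechanical:
c.Limit(&colly.LimitRule{
    DomainGlob:  "*example.com",
    Parallelism: 1,
    Delay:       2 * time.Second,
    RandomDelay: 1 * time.Second, // up to 1s of extra, randomized delay per request
})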
2. Set Maximum Page Limits
Prevent infinite loops by setting reasonable limits:
const maxPages = 100
var pageCount = 0
c.OnHTML("a.next", func(e *colly.HTMLElement) {
pageCount++
if pageCount >= maxPages {
return
}
// Continue pagination logic
})
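If next-page links are followed with e.Request.Visit instead of c.Visit, colly's MaxDepth collector option can act as an additional safety net, because child requests inherit and increment the request depth. A short sketch; the depth value is arbitrary:
c := colly.NewCollector(
    colly.MaxDepth(100), // requests more than 100 links deep are skipped
)

c.OnHTML("a.next", func(e *colly.HTMLElement) {
    // e.Request.Visit creates a child request at depth+1, so MaxDepth is enforced;
    // calling c.Visit here would start over at depth 1 and bypass the limit
    e.Request.Visit(e.Attr("href"))
})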
3. Handle Errors Gracefully
Implement proper error handling for failed requests:
c.OnError(func(r *colly.Response, err error) {
fmt.Printf("Error on page %s: %v\n", r.Request.URL, err)
// Optionally retry the request
time.Sleep(5 * time.Second)
r.Request.Retry()
})
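Retrying unconditionally can loop forever on a permanently failing URL. One way to bound it is to count attempts in the request context; in the sketch below, the "retries" key and the limit of 3 are arbitrary choices:
c.OnError(func(r *colly.Response, err error) {
    // Track attempts per request in its context ("retries" is an arbitrary key)
    retries, _ := r.Ctx.GetAny("retries").(int)
    if retries >= 3 {
        fmt.Printf("Giving up on %s after %d retries: %v\n", r.Request.URL, retries, err)
        return
    }
    r.Ctx.Put("retries", retries+1)
    time.Sleep(5 * time.Second)
    r.Request.Retry()
})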
4. Use Caching for Development
During development, cache responses to avoid repeated requests:
c.CacheDir = "./colly_cache"
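The same setting is available as a collector option, which makes it easy to turn on only during development; a small sketch where the environment variable name is made up for illustration (requires the os package):
// Enable on-disk response caching only when SCRAPER_DEV is set
opts := []colly.CollectorOption{}
if os.Getenv("SCRAPER_DEV") != "" {
    opts = append(opts, colly.CacheDir("./colly_cache"))
}
c := colly.NewCollector(opts...)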
Debugging Pagination Issues
When pagination isn't working as expected, add debug output:
c.OnRequest(func(r *colly.Request) {
fmt.Printf("Visiting: %s\n", r.URL)
})
c.OnHTML("a[href]", func(e *colly.HTMLElement) {
link := e.Attr("href")
text := strings.TrimSpace(e.Text)
fmt.Printf("Found link: %s -> %s\n", text, link)
})
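Colly also ships a built-in debugger that logs every collector event, which is often quicker than sprinkling print statements:
import (
    "github.com/gocolly/colly/v2"
    "github.com/gocolly/colly/v2/debug"
)

// LogDebugger writes each request, response, and error event to standard error
c := colly.NewCollector(
    colly.Debugger(&debug.LogDebugger{}),
)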
Conclusion
Handling pagination with Colly requires understanding the specific pagination pattern used by your target website. Whether it's numbered pages, parameter-based URLs, or API responses, Colly provides the flexibility to handle various scenarios effectively. Remember to always implement rate limiting, error handling, and reasonable limits to create robust and respectful scrapers.
For more complex scenarios involving JavaScript-heavy pagination, you might want to look into how to handle AJAX requests using Puppeteer or how to run multiple pages in parallel with Puppeteer for more advanced scraping setups.
The key is to start simple, understand the pagination pattern, and gradually add complexity as needed while maintaining respectful scraping practices.