How do I handle pagination when scraping with GoQuery?

When scraping paginated websites with GoQuery in Go (Golang), first identify how the website implements pagination. It typically takes one of these forms:

  1. Query parameters: The URL changes via a query parameter (e.g., ?page=2); see the sketch after this list.
  2. Path segments: The URL changes via a path segment (e.g., /page/2/).
  3. Asynchronous requests: The content for the next page is loaded asynchronously through an API (XHR requests). GoQuery parses static HTML and cannot execute JavaScript, so in this case you would typically call the underlying API endpoint directly.
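
For the query-parameter case, here is a minimal sketch of building page URLs with the standard net/url package; the example.com URL and the "page" parameter name are placeholders for whatever the target site actually uses:

package main

import (
    "fmt"
    "log"
    "net/url"
)

// buildPageURL sets the page query parameter on a base URL,
// preserving any other query parameters already present.
func buildPageURL(base string, page int) (string, error) {
    u, err := url.Parse(base)
    if err != nil {
        return "", err
    }
    q := u.Query()
    q.Set("page", fmt.Sprint(page))
    u.RawQuery = q.Encode()
    return u.String(), nil
}

func main() {
    for page := 1; page <= 3; page++ {
        pageURL, err := buildPageURL("http://example.com/products?sort=asc", page)
        if err != nil {
            log.Fatal(err)
        }
        fmt.Println(pageURL) // e.g., http://example.com/products?page=1&sort=asc
    }
}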

Here's a step-by-step guide to handling pagination with GoQuery:

Step 1: Install GoQuery

If you haven't already, install GoQuery using go get:

go get github.com/PuerkitoBio/goquery

Step 2: Analyze the Pagination Structure

Before writing code, manually inspect the website and understand how pagination is structured. Look for patterns in the URL or the HTML structure that you can use to iterate over pages.

Step 3: Scrape a Single Page

First, write code to scrape a single page. For instance:

package main

import (
    "fmt"
    "log"
    "net/http"

    "github.com/PuerkitoBio/goquery"
)

func scrapePage(url string) {
    // Make the HTTP GET request
    response, err := http.Get(url)
    if err != nil {
        log.Fatal(err)
    }
    defer response.Body.Close()

    // Fail fast on error pages, which would otherwise just yield no matches
    if response.StatusCode != http.StatusOK {
        log.Fatalf("status code error: %d %s", response.StatusCode, response.Status)
    }

    // Create a goquery document from the HTTP response
    document, err := goquery.NewDocumentFromReader(response.Body)
    if err != nil {
        log.Fatal("Error loading HTTP response body. ", err)
    }

    // Find and iterate over the desired elements (the selectors are site-specific)
    document.Find(".item").Each(func(index int, element *goquery.Selection) {
        title := element.Find(".title").Text()
        fmt.Printf("Title %d: %s\n", index, title)
    })
}

func main() {
    scrapePage("http://example.com/page/1/")
}

Step 4: Loop Through Pages

Once you can successfully scrape a single page, modify your code to loop through pages. You can either:

  • Use a for loop with a known number of pages.
  • Use a for loop that breaks when a certain condition is met (e.g., no next page link).

Here's an example using a for loop with a predefined number of pages:

func main() {
    baseURL := "http://example.com/page/"
    for i := 1; i <= 5; i++ { // Assuming there are 5 pages
        scrapePage(fmt.Sprintf("%s%d/", baseURL, i))
    }
}

If you don't know the total number of pages, you might need to look for a "next" link or button on each page:

func main() {
    baseURL := "http://example.com/page/"
    page := 1

    for {
        url := fmt.Sprintf("%s%d/", baseURL, page)
        response, err := http.Get(url)
        if err != nil {
            log.Println(err) // log.Fatal would exit before a break could run
            break
        }

        // We only need the status code here, so close the body right away
        // to avoid leaking connections
        statusCode := response.StatusCode
        response.Body.Close()

        // A 404 signals that we've run past the last page
        if statusCode == http.StatusNotFound {
            break
        }

        scrapePage(url) // note: this fetches the page a second time

        // Increment page number
        page++

        // Optionally, add a delay between requests to be polite to the server
        // (add "time" to the import list)
        time.Sleep(2 * time.Second)
    }
}

Step 5: Extract the "Next" Link (If Needed)

If the pagination relies on a "next" link, you can adjust the loop to look for this link:

func main() {
    pageURL := "http://example.com/page/1/"

    for {
        response, err := http.Get(pageURL)
        if err != nil {
            log.Println(err) // log.Fatal would exit before a break could run
            break
        }

        document, err := goquery.NewDocumentFromReader(response.Body)
        response.Body.Close() // close the body once it has been parsed
        if err != nil {
            log.Fatal("Error loading HTTP response body. ", err)
        }

        // Scrape the current page (note: scrapePage fetches the page again;
        // you could instead extract the items from `document` directly)
        scrapePage(pageURL)

        // The a.next selector is site-specific; adjust it to the target markup
        nextSelector := document.Find("a.next")
        if nextSelector.Length() == 0 {
            break // No next page
        }

        nextPage, exists := nextSelector.Attr("href")
        if !exists {
            break // Next page link not found
        }

        // The href may be relative; see the note below on resolving it
        pageURL = nextPage
    }
}
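
Note that the extracted href may be a relative URL (e.g., /page/2/ rather than an absolute one). Here is a minimal sketch of resolving it against the page it was found on, using the standard net/url package:

// resolveHref resolves a possibly-relative href against the URL
// of the page it was found on.
func resolveHref(pageURL, href string) (string, error) {
    base, err := url.Parse(pageURL)
    if err != nil {
        return "", err
    }
    ref, err := url.Parse(href)
    if err != nil {
        return "", err
    }
    return base.ResolveReference(ref).String(), nil
}

With this helper (and "net/url" added to the imports), the last line of the loop becomes pageURL, err = resolveHref(pageURL, nextPage), with the error checked before continuing.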

Remember to respect the website's robots.txt file and terms of service when web scraping, and consider the load your script might put on the website's server. It's best practice to include delays between requests and possibly rotate user agents or IP addresses if scraping at a larger scale.
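
For example, here is a minimal sketch of a politer fetch that sets an explicit User-Agent header via http.NewRequest instead of the bare http.Get used above; the User-Agent string is a placeholder, and you should identify your scraper honestly:

// fetch performs a GET request with an explicit User-Agent header.
func fetch(pageURL string) (*http.Response, error) {
    req, err := http.NewRequest("GET", pageURL, nil)
    if err != nil {
        return nil, err
    }
    req.Header.Set("User-Agent", "my-scraper/1.0 (+https://example.com/contact)")
    return http.DefaultClient.Do(req)
}

You can then replace each http.Get(url) call with fetch(url); the caller remains responsible for closing the response body.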
