When scraping websites with pagination using GoQuery in Go (Golang), you need to identify how the website implements pagination. Typically, this can be in the form of:
- Query parameters: The URL changes by a query parameter (e.g., `?page=2`).
- Path segments: The URL changes by a path segment (e.g., `/page/2/`).
- Asynchronous requests: The content for the next page is loaded asynchronously through an API (XHR requests); see the sketch below.
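The first two cases can be handled with GoQuery alone. For the third, GoQuery only parses HTML and cannot execute JavaScript, so you would typically call the underlying API endpoint directly and decode its JSON. Here's a minimal sketch of that approach; the endpoint URL and the `Item` field are hypothetical placeholders you'd replace after inspecting the site's network traffic:

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

// Item mirrors the (hypothetical) JSON payload returned by the API.
type Item struct {
	Title string `json:"title"`
}

func fetchAPIPage(page int) ([]Item, error) {
	// Hypothetical endpoint; find the real one in your browser's network tab.
	url := fmt.Sprintf("http://example.com/api/items?page=%d", page)
	response, err := http.Get(url)
	if err != nil {
		return nil, err
	}
	defer response.Body.Close()

	var items []Item
	if err := json.NewDecoder(response.Body).Decode(&items); err != nil {
		return nil, err
	}
	return items, nil
}

func main() {
	items, err := fetchAPIPage(1)
	if err != nil {
		log.Fatal(err)
	}
	for _, item := range items {
		fmt.Println(item.Title)
	}
}
```

The remaining steps focus on HTML-based pagination.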
Here's a step-by-step guide to handling pagination with GoQuery:
Step 1: Install GoQuery
If you haven't already, install GoQuery using `go get`:

```bash
go get github.com/PuerkitoBio/goquery
```
Step 2: Analyze the Pagination Structure
Before writing code, manually inspect the website and understand how pagination is structured. Look for patterns in the URL or the HTML structure that you can use to iterate over pages.
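A quick script can assist with this inspection. The sketch below fetches one page and prints every link inside a pagination container so the URL pattern stands out; the `.pagination a` selector is an assumption you'd adjust to the site at hand:

```go
package main

import (
	"fmt"
	"log"
	"net/http"

	"github.com/PuerkitoBio/goquery"
)

func main() {
	response, err := http.Get("http://example.com/page/1/")
	if err != nil {
		log.Fatal(err)
	}
	defer response.Body.Close()

	document, err := goquery.NewDocumentFromReader(response.Body)
	if err != nil {
		log.Fatal(err)
	}
	// Print each pagination link's text and href so the URL pattern
	// stands out. ".pagination a" is an assumed selector; adjust it
	// to whatever the site actually uses.
	document.Find(".pagination a").Each(func(_ int, link *goquery.Selection) {
		if href, exists := link.Attr("href"); exists {
			fmt.Printf("%q -> %s\n", link.Text(), href)
		}
	})
}
```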
Step 3: Scrape a Single Page
First, write code to scrape a single page. For instance:
```go
package main

import (
	"fmt"
	"log"
	"net/http"

	"github.com/PuerkitoBio/goquery"
)

func scrapePage(url string) {
	// Make an HTTP GET request
	response, err := http.Get(url)
	if err != nil {
		log.Fatal(err)
	}
	defer response.Body.Close()

	// Create a goquery document from the HTTP response
	document, err := goquery.NewDocumentFromReader(response.Body)
	if err != nil {
		log.Fatal("Error loading HTTP response body. ", err)
	}

	// Find and iterate over the desired elements
	document.Find(".item").Each(func(index int, element *goquery.Selection) {
		title := element.Find(".title").Text()
		fmt.Printf("Title %d: %s\n", index, title)
	})
}

func main() {
	scrapePage("http://example.com/page/1/")
}
```
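One optional refinement before moving on: `log.Fatal` terminates the whole program on the first failed page, which is harsh inside a pagination loop. Here's a sketch of a variant that returns the error to the caller instead, so the loop can decide whether to skip a page or stop:

```go
// scrapePage fetches and scrapes a single page, returning any error
// instead of terminating the program.
func scrapePage(url string) error {
	response, err := http.Get(url)
	if err != nil {
		return err
	}
	defer response.Body.Close()

	document, err := goquery.NewDocumentFromReader(response.Body)
	if err != nil {
		return fmt.Errorf("loading response body for %s: %w", url, err)
	}
	document.Find(".item").Each(func(index int, element *goquery.Selection) {
		title := element.Find(".title").Text()
		fmt.Printf("Title %d: %s\n", index, title)
	})
	return nil
}
```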
Step 4: Loop Through Pages
Once you can successfully scrape a single page, modify your code to loop through pages. You can either:
- Use a `for` loop with a known number of pages.
- Use a `for` loop that breaks when a certain condition is met (e.g., no "next" page link).

Here's an example using a `for` loop with a predefined number of pages:
```go
func main() {
	baseURL := "http://example.com/page/"
	for i := 1; i <= 5; i++ { // Assuming there are 5 pages
		scrapePage(fmt.Sprintf("%s%d/", baseURL, i))
	}
}
```
If you don't know the total number of pages, you might need to look for a "next" link or button on each page:
```go
func main() {
	baseURL := "http://example.com/page/"
	page := 1
	for {
		url := fmt.Sprintf("%s%d/", baseURL, page)
		response, err := http.Get(url)
		if err != nil {
			log.Println(err) // log.Fatal would exit before a break could run
			break
		}
		// Check for a status indicating we've run past the last page
		if response.StatusCode == http.StatusNotFound {
			response.Body.Close()
			break
		}
		response.Body.Close()

		scrapePage(url)

		// Increment the page number
		page++

		// Add a delay between requests to be polite to the server
		// (this requires "time" in your import list)
		time.Sleep(2 * time.Second)
	}
}
```
Step 5: Extract the "Next" Link (If Needed)
If the pagination relies on a "next" link, you can adjust the loop to look for this link:
```go
func main() {
	url := "http://example.com/page/1/"
	for {
		response, err := http.Get(url)
		if err != nil {
			log.Println(err) // log.Fatal would exit before a break could run
			break
		}
		document, err := goquery.NewDocumentFromReader(response.Body)
		response.Body.Close()
		if err != nil {
			log.Println("Error loading HTTP response body. ", err)
			break
		}

		// Note: scrapePage fetches the page a second time; in production,
		// extract the items from `document` directly to avoid the extra request.
		scrapePage(url)

		// Look for the "next" link
		nextSelector := document.Find("a.next")
		if nextSelector.Length() == 0 {
			break // No next page
		}
		nextPage, exists := nextSelector.Attr("href")
		if !exists {
			break // Next page link not found
		}
		url = nextPage
	}
}
```
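One caveat with this loop: the `href` of a "next" link is often relative (e.g., `/page/2/`), while `http.Get` needs an absolute URL. Here's a small sketch using the standard `net/url` package to resolve it against the current page; in the loop above, you would resolve `nextPage` against `url` this way before the next iteration:

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveNext turns a possibly-relative href (e.g., "/page/2/") into an
// absolute URL by resolving it against the page it was found on.
func resolveNext(currentPage, href string) (string, error) {
	base, err := url.Parse(currentPage)
	if err != nil {
		return "", err
	}
	ref, err := url.Parse(href)
	if err != nil {
		return "", err
	}
	return base.ResolveReference(ref).String(), nil
}

func main() {
	next, err := resolveNext("http://example.com/page/1/", "/page/2/")
	if err != nil {
		panic(err)
	}
	fmt.Println(next) // http://example.com/page/2/
}
```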
Remember to respect the website's `robots.txt` file and terms of service when web scraping, and consider the load your script might put on the website's server. It's best practice to include delays between requests and, when scraping at a larger scale, to rotate user agents or IP addresses.
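As one way to act on that advice with only the standard library, here's a sketch that sets a descriptive User-Agent header and spaces out requests; the header value and two-second delay are placeholder choices:

```go
package main

import (
	"log"
	"net/http"
	"time"
)

// politeGet issues a GET request with an explicit User-Agent header.
func politeGet(client *http.Client, url, userAgent string) (*http.Response, error) {
	request, err := http.NewRequest("GET", url, nil)
	if err != nil {
		return nil, err
	}
	// Identify your scraper honestly; the value used below is a placeholder.
	request.Header.Set("User-Agent", userAgent)
	return client.Do(request)
}

func main() {
	client := &http.Client{Timeout: 10 * time.Second}
	urls := []string{
		"http://example.com/page/1/",
		"http://example.com/page/2/",
	}
	for _, u := range urls {
		response, err := politeGet(client, u, "my-scraper/1.0 (contact@example.com)")
		if err != nil {
			log.Println(err)
			continue
		}
		response.Body.Close()
		// Pause between requests to keep the load on the server low.
		time.Sleep(2 * time.Second)
	}
}
```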