How do I scrape images and download them using GoQuery?

GoQuery is a library for Go that allows you to scrape and manipulate HTML documents in a manner similar to jQuery. While GoQuery itself does not directly handle downloading of binary data such as images, it can be used to parse HTML and extract image URLs, which you can then download using Go's HTTP client.

Here's a step-by-step guide on how to scrape images and download them using GoQuery:

  1. Install GoQuery: Make sure you have Go installed on your machine. Then, install GoQuery by running:
   go get github.com/PuerkitoBio/goquery
  2. Write a Go Program to Scrape Image URLs: Use GoQuery to load the webpage and extract the image URLs.
   package main

   import (
       "fmt"
       "log"
       "net/http"

       "github.com/PuerkitoBio/goquery"
   )

   func extractImageUrls(url string) ([]string, error) {
       // Slice to hold the image URLs
       var imageUrls []string

       // Make HTTP GET request
       res, err := http.Get(url)
       if err != nil {
           return nil, err
       }
       defer res.Body.Close()

       if res.StatusCode != 200 {
           return nil, fmt.Errorf("status code error: %d %s", res.StatusCode, res.Status)
       }

       // Load the HTML document
       doc, err := goquery.NewDocumentFromReader(res.Body)
       if err != nil {
           return nil, err
       }

       // Find and iterate through all image elements
       doc.Find("img").Each(func(i int, s *goquery.Selection) {
           // For each item, get the src attribute
           src, exists := s.Attr("src")
           if exists {
               imageUrls = append(imageUrls, src)
           }
       })

       return imageUrls, nil
   }

   func main() {
       // The URL of the page you want to scrape
       url := "http://example.com"

       // Extract all image URLs
       imageUrls, err := extractImageUrls(url)
       if err != nil {
           log.Fatal(err)
       }

       // Print out all image URLs
       for _, imgUrl := range imageUrls {
           fmt.Println(imgUrl)
       }
   }
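Note that `src` attributes are often relative (for example `/images/logo.png`), and `http.Get` needs an absolute URL. Here is a minimal sketch of resolving them against the page URL with the standard `net/url` package; the function name and example URLs are illustrative:

```go
package main

import (
	"fmt"
	"log"
	"net/url"
)

// resolveImageURL converts a possibly relative src attribute into an
// absolute URL, using the page URL as the base.
func resolveImageURL(pageURL, src string) (string, error) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return "", err
	}
	ref, err := url.Parse(src)
	if err != nil {
		return "", err
	}
	// ResolveReference handles absolute, root-relative, and
	// path-relative src values uniformly.
	return base.ResolveReference(ref).String(), nil
}

func main() {
	abs, err := resolveImageURL("http://example.com/articles/page.html", "/images/logo.png")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(abs) // http://example.com/images/logo.png
}
```

You can call a helper like this on each `src` value before appending it to `imageUrls`, so the download step always receives fetchable URLs.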
  3. Download the Images: After extracting the URLs, use http.Get to download each image and save it to a file. Add "io" and "os" to the import block, since downloadImage uses io.Copy and os.Create, and replace the previous main with this version.
   func downloadImage(url, filePath string) error {
       // Get the data
       resp, err := http.Get(url)
       if err != nil {
           return err
       }
       defer resp.Body.Close()

       // Check server response
       if resp.StatusCode != http.StatusOK {
           return fmt.Errorf("bad status: %s", resp.Status)
       }

       // Create the file
       out, err := os.Create(filePath)
       if err != nil {
           return err
       }
       defer out.Close()

       // Write the body to file
       _, err = io.Copy(out, resp.Body)
       return err
   }

   func main() {
       // ... previous code ...

       // Download each image
       for i, imgUrl := range imageUrls {
           // Determine the local file path (you may want to create a dedicated folder and check for duplicates)
           filePath := fmt.Sprintf("image_%d.jpg", i)
           err := downloadImage(imgUrl, filePath)
           if err != nil {
               log.Printf("Failed to download %s: %v", imgUrl, err)
           } else {
               log.Printf("Downloaded %s to %s\n", imgUrl, filePath)
           }
       }
   }
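The loop above hardcodes a .jpg extension, which will mislabel PNGs or GIFs. One way to do better is to derive the extension from the URL path; the helper name below is hypothetical, and the .jpg fallback is an arbitrary choice for URLs with no extension:

```go
package main

import (
	"fmt"
	"net/url"
	"path"
)

// fileNameForImage derives a local file name from the image URL,
// falling back to ".jpg" when the URL path has no extension.
func fileNameForImage(imgURL string, index int) string {
	ext := ".jpg" // fallback extension; an assumption, not detected content
	if u, err := url.Parse(imgURL); err == nil {
		// path.Ext ignores the query string because we pass u.Path only.
		if e := path.Ext(u.Path); e != "" {
			ext = e
		}
	}
	return fmt.Sprintf("image_%d%s", index, ext)
}

func main() {
	fmt.Println(fileNameForImage("http://example.com/photos/cat.png?size=large", 0)) // image_0.png
}
```

In the download loop, `filePath := fileNameForImage(imgUrl, i)` would replace the hardcoded `fmt.Sprintf("image_%d.jpg", i)`. For a fully reliable extension you would sniff the response's Content-Type instead, but the URL path is usually good enough.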
  4. Run Your Go Program: Save the combined code to a .go file, for example scrape_images.go, and run it using:
   go run scrape_images.go

Please note that when scraping websites, you should always check the site's robots.txt file and Terms of Service to understand the scraping rules, and ensure that you are not violating any terms or causing excessive load on the website. Additionally, when saving files, ensure you have the right to download and use the images as per the website's copyright and licensing policies.
