How do I extract images and files during web scraping with Go?

To extract images and files during web scraping with Go (Golang), you can use standard library packages such as net/http for making requests and io for reading and writing data, along with os for file system operations.

Here's a step-by-step outline of the process:

  1. Perform a GET request to the webpage where the images or files are located.
  2. Parse the HTML content to find the URLs of the images or files.
  3. Perform a GET request for each URL to fetch the image or file data.
  4. Save the fetched data to a file.

A popular package for parsing HTML in Go is goquery, which provides a jQuery-like API for traversing and manipulating the DOM.
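Assuming your project uses Go modules, you can add goquery to it with:

go get github.com/PuerkitoBio/goquery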

Here's an example of how you could extract images from a web page and save them to your local machine:

package main

import (
    "fmt"
    "io"
    "net/http"
    "net/url"
    "os"
    "path"

    "github.com/PuerkitoBio/goquery"
)

func main() {
    // The URL of the page you want to scrape
    pageURL := "http://example.com"

    // Fetch the page
    resp, err := http.Get(pageURL)
    if err != nil {
        fmt.Println("Error fetching the page:", err)
        return
    }
    defer resp.Body.Close()

    // Abort if the page itself did not load successfully
    if resp.StatusCode != http.StatusOK {
        fmt.Println("Unexpected status fetching the page:", resp.Status)
        return
    }

    // Parse the page with goquery
    doc, err := goquery.NewDocumentFromReader(resp.Body)
    if err != nil {
        fmt.Println("Error parsing the page:", err)
        return
    }

    // Find all image tags and get the src attribute
    doc.Find("img").Each(func(index int, item *goquery.Selection) {
        src, exists := item.Attr("src")
        if exists {
            fmt.Println("Found image:", src)

            // Get the absolute URL of the image
            imageURL := getAbsoluteURL(src, pageURL)

            // Download the image
            err := downloadFile(imageURL)
            if err != nil {
                fmt.Printf("Error downloading file from %s: %v\n", imageURL, err)
            }
        }
    })
}

// downloadFile takes a URL, fetches the content and saves it to a local file
func downloadFile(fileURL string) error {
    // Get the data
    resp, err := http.Get(fileURL)
    if err != nil {
        return err
    }
    defer resp.Body.Close()

    // Skip non-200 responses so we don't save error pages as files
    if resp.StatusCode != http.StatusOK {
        return fmt.Errorf("unexpected status %s for %s", resp.Status, fileURL)
    }

    // Derive a file name from the URL path, ignoring any query string
    parsed, err := url.Parse(fileURL)
    if err != nil {
        return err
    }
    name := path.Base(parsed.Path)
    if name == "." || name == "/" {
        name = "download"
    }

    // Create the file
    out, err := os.Create(name)
    if err != nil {
        return err
    }
    defer out.Close()

    // Write the body to the file
    _, err = io.Copy(out, resp.Body)
    return err
}

// getAbsoluteURL resolves a potentially relative href against the page URL,
// handling absolute, protocol-relative, and path-relative references
func getAbsoluteURL(href, base string) string {
    baseURL, err := url.Parse(base)
    if err != nil {
        return href
    }
    hrefURL, err := url.Parse(href)
    if err != nil {
        return href
    }
    return baseURL.ResolveReference(hrefURL).String()
}

In this example, we are doing the following:

  - We fetch the content of the webpage using http.Get.
  - We parse the HTML content of the page using goquery.
  - We find all img elements and extract the src attribute, which contains the URL of the image.
  - We resolve each src against the page URL with net/url so relative paths become absolute URLs.
  - We fetch each image using http.Get and save it using io.Copy to write the response body to a file.
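The same pattern extends to arbitrary file downloads, not just images. As a sketch, the following snippet could be dropped into the main function of the example above to collect links to PDF files: it selects a elements whose href attribute ends in ".pdf" (a CSS attribute selector supported by goquery) and reuses the getAbsoluteURL and downloadFile helpers defined earlier.

// Find links to PDF files and download each one
doc.Find(`a[href$=".pdf"]`).Each(func(index int, item *goquery.Selection) {
    href, exists := item.Attr("href")
    if !exists {
        return
    }
    fileURL := getAbsoluteURL(href, pageURL)
    fmt.Println("Found file:", fileURL)
    if err := downloadFile(fileURL); err != nil {
        fmt.Printf("Error downloading file from %s: %v\n", fileURL, err)
    }
})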

Please note that this is a basic example. In a real-world scenario, you may need to handle redirects, timeouts, retries, and errors more gracefully. Also, ensure you respect the terms of service of the websites you scrape and that you're not violating any copyright laws by downloading images or files.
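For instance, one common hardening step is to replace the plain http.Get calls with a custom http.Client that sets a request timeout and an explicit User-Agent header. Here's a minimal sketch of such a helper; the "my-scraper/1.0" identifier is a hypothetical placeholder you would replace with your own:

import (
    "net/http"
    "time"
)

// fetch performs a GET request with a timeout and a custom User-Agent,
// which many sites expect from well-behaved scrapers
func fetch(rawURL string) (*http.Response, error) {
    client := &http.Client{Timeout: 30 * time.Second}

    req, err := http.NewRequest(http.MethodGet, rawURL, nil)
    if err != nil {
        return nil, err
    }
    // "my-scraper/1.0" is a hypothetical identifier; use your own
    req.Header.Set("User-Agent", "my-scraper/1.0")

    return client.Do(req)
}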
