Can Colly handle file downloads and binary content?
Yes, Colly can efficiently handle file downloads and binary content including images, PDFs, documents, and other binary files. The framework provides built-in support for downloading and processing binary data through its response handling mechanisms and callback functions.
Understanding Binary Content in Colly
Colly treats all HTTP responses as byte arrays, making it naturally capable of handling both text and binary content. When you access the response body through r.Body
, you're working with raw bytes that can represent any type of content.
Basic File Download with Colly
Here's a simple example of downloading a file using Colly:
package main
import (
"fmt"
"io/ioutil"
"log"
"net/url"
"path"
"path/filepath"
"github.com/gocolly/colly/v2"
)
func main() {
c := colly.NewCollector()
// Handle file downloads
c.OnResponse(func(r *colly.Response) {
// Get the filename from URL or Content-Disposition header
filename := getFilename(r)
// Save the file
err := r.Save(filename)
if err != nil {
log.Printf("Error saving file %s: %v", filename, err)
return
}
fmt.Printf("Downloaded: %s (%d bytes)\n", filename, len(r.Body))
})
// Download a PDF file
c.Visit("https://example.com/document.pdf")
// Download an image
c.Visit("https://example.com/image.jpg")
}
func getFilename(r *colly.Response) string {
// Try to get filename from Content-Disposition header
disposition := r.Headers.Get("Content-Disposition")
if disposition != "" {
// Parse Content-Disposition header for filename
// This is simplified - you might want to use a proper parser
return "downloaded_file"
}
// Fallback to URL path
u, err := url.Parse(r.Request.URL.String())
if err != nil {
return "unknown_file"
}
return path.Base(u.Path)
}
Advanced File Download with Error Handling
For production use, you'll want more robust error handling and file management:
package main
import (
"fmt"
"io"
"log"
"mime"
"net/url"
"os"
"path"
"path/filepath"
"strings"
"github.com/gocolly/colly/v2"
)
func main() {
c := colly.NewCollector()
// Set user agent and other headers
c.UserAgent = "FileDownloader/1.0"
// Handle different content types
c.OnResponse(func(r *colly.Response) {
contentType := r.Headers.Get("Content-Type")
// Check if it's a binary file type
if isBinaryContent(contentType) {
err := downloadBinaryFile(r)
if err != nil {
log.Printf("Error downloading binary file: %v", err)
}
} else {
log.Printf("Skipping non-binary content: %s", contentType)
}
})
c.OnError(func(r *colly.Response, err error) {
log.Printf("Error downloading %s: %v", r.Request.URL.String(), err)
})
// Download various file types
urls := []string{
"https://example.com/document.pdf",
"https://example.com/image.png",
"https://example.com/archive.zip",
"https://example.com/video.mp4",
}
for _, url := range urls {
c.Visit(url)
}
}
func isBinaryContent(contentType string) bool {
binaryTypes := []string{
"image/", "video/", "audio/", "application/pdf",
"application/zip", "application/octet-stream",
"application/msword", "application/vnd.ms-excel",
}
for _, binaryType := range binaryTypes {
if strings.HasPrefix(contentType, binaryType) {
return true
}
}
return false
}
func downloadBinaryFile(r *colly.Response) error {
// Create downloads directory if it doesn't exist
downloadDir := "downloads"
if err := os.MkdirAll(downloadDir, 0755); err != nil {
return fmt.Errorf("failed to create download directory: %v", err)
}
// Get filename
filename := getFilenameFromResponse(r)
filePath := filepath.Join(downloadDir, filename)
// Create the file
file, err := os.Create(filePath)
if err != nil {
return fmt.Errorf("failed to create file %s: %v", filePath, err)
}
defer file.Close()
// Write binary data to file
_, err = file.Write(r.Body)
if err != nil {
return fmt.Errorf("failed to write file %s: %v", filePath, err)
}
fmt.Printf("Successfully downloaded: %s (%d bytes)\n", filePath, len(r.Body))
return nil
}
func getFilenameFromResponse(r *colly.Response) string {
// Try Content-Disposition header first
disposition := r.Headers.Get("Content-Disposition")
if disposition != "" {
_, params, err := mime.ParseMediaType(disposition)
if err == nil && params["filename"] != "" {
return params["filename"]
}
}
// Fallback to URL path
u, err := url.Parse(r.Request.URL.String())
if err != nil {
return "unknown_file"
}
filename := path.Base(u.Path)
if filename == "." || filename == "/" {
// Generate filename based on content type
contentType := r.Headers.Get("Content-Type")
ext, _ := mime.ExtensionsByType(contentType)
if len(ext) > 0 {
filename = "download" + ext[0]
} else {
filename = "download"
}
}
return filename
}
Downloading Images and Processing Metadata
Colly can be combined with image processing libraries to handle image downloads and extract metadata:
package main
import (
"fmt"
"image"
"image/jpeg"
"image/png"
"log"
"os"
"path/filepath"
"strings"
"github.com/gocolly/colly/v2"
)
func main() {
c := colly.NewCollector()
// Find and download images
c.OnHTML("img[src]", func(e *colly.HTMLElement) {
imgSrc := e.Attr("src")
imgURL := e.Request.AbsoluteURL(imgSrc)
// Visit the image URL
c.Visit(imgURL)
})
// Handle image downloads
c.OnResponse(func(r *colly.Response) {
contentType := r.Headers.Get("Content-Type")
if strings.HasPrefix(contentType, "image/") {
err := processImageDownload(r)
if err != nil {
log.Printf("Error processing image: %v", err)
}
}
})
// Start scraping from a webpage with images
c.Visit("https://example.com/gallery")
}
func processImageDownload(r *colly.Response) error {
// Save the image file
filename := getImageFilename(r)
filePath := filepath.Join("images", filename)
// Create images directory
os.MkdirAll("images", 0755)
// Save file
err := r.Save(filePath)
if err != nil {
return err
}
// Get image dimensions
config, format, err := image.DecodeConfig(strings.NewReader(string(r.Body)))
if err == nil {
fmt.Printf("Downloaded %s: %s (%dx%d)\n",
filename, format, config.Width, config.Height)
}
return nil
}
func getImageFilename(r *colly.Response) string {
// Implementation similar to previous examples
// but with image-specific logic
return "image.jpg" // Simplified for example
}
Handling Large File Downloads
For large files, you might want to implement streaming downloads to avoid memory issues:
package main
import (
"fmt"
"io"
"net/http"
"os"
"path/filepath"
"github.com/gocolly/colly/v2"
)
func main() {
c := colly.NewCollector()
// For large files, use a custom transport to stream directly
c.OnResponse(func(r *colly.Response) {
contentLength := r.Headers.Get("Content-Length")
if contentLength != "" {
fmt.Printf("File size: %s bytes\n", contentLength)
}
// For very large files, consider implementing streaming
if len(r.Body) > 10*1024*1024 { // 10MB threshold
log.Printf("Large file detected, consider streaming approach")
}
filename := getFilename(r)
r.Save(filepath.Join("downloads", filename))
})
c.Visit("https://example.com/largefile.zip")
}
// Alternative streaming approach for very large files
func streamDownload(url, filename string) error {
resp, err := http.Get(url)
if err != nil {
return err
}
defer resp.Body.Close()
file, err := os.Create(filename)
if err != nil {
return err
}
defer file.Close()
_, err = io.Copy(file, resp.Body)
return err
}
Binary Content Type Detection
Colly can automatically detect and handle different binary content types:
package main
import (
"fmt"
"net/http"
"strings"
"github.com/gocolly/colly/v2"
)
func main() {
c := colly.NewCollector()
c.OnResponse(func(r *colly.Response) {
contentType := r.Headers.Get("Content-Type")
// Detect content type from body if header is missing
if contentType == "" {
contentType = http.DetectContentType(r.Body)
}
switch {
case strings.HasPrefix(contentType, "image/"):
handleImageDownload(r)
case strings.HasPrefix(contentType, "application/pdf"):
handlePDFDownload(r)
case strings.HasPrefix(contentType, "application/zip"):
handleArchiveDownload(r)
case strings.HasPrefix(contentType, "video/"):
handleVideoDownload(r)
default:
fmt.Printf("Unknown binary type: %s\n", contentType)
}
})
}
func handleImageDownload(r *colly.Response) {
fmt.Printf("Downloading image: %s\n", r.Request.URL.String())
// Implementation here
}
func handlePDFDownload(r *colly.Response) {
fmt.Printf("Downloading PDF: %s\n", r.Request.URL.String())
// Implementation here
}
func handleArchiveDownload(r *colly.Response) {
fmt.Printf("Downloading archive: %s\n", r.Request.URL.String())
// Implementation here
}
func handleVideoDownload(r *colly.Response) {
fmt.Printf("Downloading video: %s\n", r.Request.URL.String())
// Implementation here
}
Best Practices for Binary Downloads
1. Memory Management
- Monitor memory usage when downloading large files
- Consider streaming for files larger than available RAM
- Use appropriate buffer sizes for I/O operations
2. Error Handling
- Always check for HTTP status codes
- Implement retry logic for failed downloads
- Validate file integrity after download
3. Rate Limiting
- Respect server resources with appropriate delays
- Implement concurrent download limits
- Monitor bandwidth usage
4. File Organization
// Organize downloads by content type
func getDownloadPath(contentType, filename string) string {
var subdir string
switch {
case strings.HasPrefix(contentType, "image/"):
subdir = "images"
case strings.HasPrefix(contentType, "video/"):
subdir = "videos"
case strings.HasPrefix(contentType, "application/pdf"):
subdir = "documents"
default:
subdir = "misc"
}
return filepath.Join("downloads", subdir, filename)
}
Comparison with Other Tools
While Colly excels at file downloads, you might also consider how to handle file downloads in Puppeteer for JavaScript-heavy sites that require browser automation. For simpler HTTP-based downloads, Colly often provides better performance and resource efficiency.
Conclusion
Colly provides robust support for downloading and handling binary content through its flexible response handling system. Whether you're downloading images, documents, or other binary files, Colly's built-in methods like r.Save()
and direct access to response bodies make it an excellent choice for file download tasks. The framework's efficiency and Go's excellent standard library support for binary data processing make Colly particularly well-suited for bulk download operations and content archiving workflows.
Remember to always respect robots.txt files, implement appropriate rate limiting, and handle errors gracefully when downloading files at scale.