What are the best practices for structuring Colly project code?
Structuring Colly projects properly is crucial for building maintainable, scalable, and robust web scraping applications. This guide covers the essential patterns, architectural decisions, and code organization strategies that will help you create professional-grade Colly scrapers.
Project Structure Overview
A well-structured Colly project should follow a clear separation of concerns with organized directories and modules. Here's a recommended project structure:
project-root/
├── cmd/
│   └── scraper/
│       └── main.go
├── internal/
│   ├── config/
│   │   └── config.go
│   ├── models/
│   │   └── data.go
│   ├── scrapers/
│   │   ├── base.go
│   │   └── product_scraper.go
│   ├── storage/
│   │   ├── interface.go
│   │   ├── csv.go
│   │   └── database.go
│   └── utils/
│       └── helpers.go
├── pkg/
│   └── scraping/
│       └── client.go
├── configs/
│   └── config.yaml
├── scripts/
│   └── build.sh
├── go.mod
├── go.sum
└── README.md
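The Go examples below import the internal packages through the module path declared in go.mod. The path example.com/scraper is a placeholder throughout — substitute your own. A minimal go.mod might look like this (versions are illustrative):
// go.mod
module example.com/scraper

go 1.21

require (
    github.com/gocolly/colly/v2 v2.1.0
    gopkg.in/yaml.v2 v2.4.0
)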
Configuration Management
Create a centralized configuration system to manage all scraper settings, timeouts, and external dependencies:
// internal/config/config.go
package config

import (
    "os"
    "time"

    "gopkg.in/yaml.v2"
)

type Config struct {
    Scraper ScraperConfig `yaml:"scraper"`
    Storage StorageConfig `yaml:"storage"`
    Logging LoggingConfig `yaml:"logging"`
}

type ScraperConfig struct {
    UserAgent string `yaml:"user_agent"`
    // yaml.v2 decodes duration strings such as "1s" or "30s"
    // directly into time.Duration fields.
    Delay           time.Duration `yaml:"delay"`
    Timeout         time.Duration `yaml:"timeout"`
    MaxRetries      int           `yaml:"max_retries"`
    ParallelWorkers int           `yaml:"parallel_workers"`
    RespectRobots   bool          `yaml:"respect_robots"`
}

type StorageConfig struct {
    Type     string `yaml:"type"`
    Path     string `yaml:"path"`
    Database struct {
        Host     string `yaml:"host"`
        Port     int    `yaml:"port"`
        Database string `yaml:"database"`
        User     string `yaml:"user"`
        Password string `yaml:"password"`
    } `yaml:"database"`
}

type LoggingConfig struct {
    Level  string `yaml:"level"`
    Format string `yaml:"format"`
    Output string `yaml:"output"`
}

// Load reads and parses a YAML configuration file.
func Load(path string) (*Config, error) {
    file, err := os.Open(path)
    if err != nil {
        return nil, err
    }
    defer file.Close()

    var config Config
    decoder := yaml.NewDecoder(file)
    if err := decoder.Decode(&config); err != nil {
        return nil, err
    }
    return &config, nil
}
Data Models and Interfaces
Define clear data structures and interfaces to ensure type safety and maintainability:
// internal/models/data.go
package models

import (
    "errors"
    "time"
)

type Product struct {
    ID          string    `json:"id" db:"id"`
    Name        string    `json:"name" db:"name"`
    Price       float64   `json:"price" db:"price"`
    Description string    `json:"description" db:"description"`
    ImageURL    string    `json:"image_url" db:"image_url"`
    Category    string    `json:"category" db:"category"`
    InStock     bool      `json:"in_stock" db:"in_stock"`
    ScrapedAt   time.Time `json:"scraped_at" db:"scraped_at"`
    SourceURL   string    `json:"source_url" db:"source_url"`
}

type ScrapingResult struct {
    Products []Product `json:"products"`
    Metadata struct {
        TotalPages  int       `json:"total_pages"`
        CurrentPage int       `json:"current_page"`
        ScrapedAt   time.Time `json:"scraped_at"`
        Duration    string    `json:"duration"`
        ErrorsCount int       `json:"errors_count"`
    } `json:"metadata"`
}

// Validation methods
func (p *Product) Validate() error {
    if p.Name == "" {
        return errors.New("product name is required")
    }
    if p.Price < 0 {
        return errors.New("product price cannot be negative")
    }
    return nil
}
Storage Interface Pattern
Implement a storage interface to support multiple output formats and databases. Because main.go below holds the CSV backend as a ProductStorage, the implementation must provide every method of that interface, so the CSV code includes simple batch and lookup methods alongside SaveProduct:
// internal/storage/interface.go
package storage

import "example.com/scraper/internal/models"

type Storage interface {
    Save(data interface{}) error
    SaveBatch(data []interface{}) error
    Close() error
}

type ProductStorage interface {
    Storage
    SaveProduct(product *models.Product) error
    SaveProducts(products []*models.Product) error
    GetProductByID(id string) (*models.Product, error)
}
// internal/storage/csv.go
package storage

import (
    "encoding/csv"
    "fmt"
    "os"
    "strconv"
    "sync"

    "example.com/scraper/internal/models"
)

type CSVStorage struct {
    file   *os.File
    writer *csv.Writer
    mu     sync.Mutex // csv.Writer is not safe for concurrent use
}

func NewCSVStorage(filename string) (*CSVStorage, error) {
    file, err := os.Create(filename)
    if err != nil {
        return nil, err
    }
    writer := csv.NewWriter(file)
    // Write CSV header
    header := []string{"ID", "Name", "Price", "Description", "ImageURL", "Category", "InStock", "ScrapedAt", "SourceURL"}
    if err := writer.Write(header); err != nil {
        file.Close()
        return nil, err
    }
    return &CSVStorage{
        file:   file,
        writer: writer,
    }, nil
}

func (c *CSVStorage) SaveProduct(product *models.Product) error {
    if err := product.Validate(); err != nil {
        return fmt.Errorf("product validation failed: %w", err)
    }
    record := []string{
        product.ID,
        product.Name,
        strconv.FormatFloat(product.Price, 'f', 2, 64),
        product.Description,
        product.ImageURL,
        product.Category,
        strconv.FormatBool(product.InStock),
        product.ScrapedAt.Format("2006-01-02 15:04:05"),
        product.SourceURL,
    }
    c.mu.Lock()
    defer c.mu.Unlock()
    return c.writer.Write(record)
}

func (c *CSVStorage) SaveProducts(products []*models.Product) error {
    for _, p := range products {
        if err := c.SaveProduct(p); err != nil {
            return err
        }
    }
    return nil
}

func (c *CSVStorage) Save(data interface{}) error {
    product, ok := data.(*models.Product)
    if !ok {
        return fmt.Errorf("csv storage cannot save type %T", data)
    }
    return c.SaveProduct(product)
}

func (c *CSVStorage) SaveBatch(data []interface{}) error {
    for _, item := range data {
        if err := c.Save(item); err != nil {
            return err
        }
    }
    return nil
}

// GetProductByID is not supported by the append-only CSV backend.
func (c *CSVStorage) GetProductByID(id string) (*models.Product, error) {
    return nil, fmt.Errorf("lookup by ID is not supported by CSV storage")
}

func (c *CSVStorage) Close() error {
    c.writer.Flush()
    if err := c.writer.Error(); err != nil {
        c.file.Close()
        return err
    }
    return c.file.Close()
}
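The project tree also lists internal/storage/database.go. As a rough sketch of how the same interface maps onto a relational backend (assuming Postgres via the github.com/lib/pq driver and a products table whose columns mirror the model; the driver choice, table, and column names are illustrative), the core methods might look like this — the remaining ProductStorage methods follow the same pattern as the CSV backend:
// internal/storage/database.go (sketch; adapt driver and schema to your setup)
package storage

import (
    "database/sql"
    "fmt"

    _ "github.com/lib/pq" // Postgres driver, registered for database/sql

    "example.com/scraper/internal/models"
)

type DBStorage struct {
    db *sql.DB
}

func NewDBStorage(dsn string) (*DBStorage, error) {
    db, err := sql.Open("postgres", dsn)
    if err != nil {
        return nil, err
    }
    // Fail fast if the database is unreachable
    if err := db.Ping(); err != nil {
        db.Close()
        return nil, err
    }
    return &DBStorage{db: db}, nil
}

func (d *DBStorage) SaveProduct(p *models.Product) error {
    if err := p.Validate(); err != nil {
        return fmt.Errorf("product validation failed: %w", err)
    }
    _, err := d.db.Exec(
        `INSERT INTO products
            (id, name, price, description, image_url, category, in_stock, scraped_at, source_url)
         VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9)`,
        p.ID, p.Name, p.Price, p.Description, p.ImageURL,
        p.Category, p.InStock, p.ScrapedAt, p.SourceURL,
    )
    return err
}

func (d *DBStorage) GetProductByID(id string) (*models.Product, error) {
    var p models.Product
    err := d.db.QueryRow(
        `SELECT id, name, price, description, image_url, category, in_stock, scraped_at, source_url
           FROM products WHERE id = $1`, id,
    ).Scan(&p.ID, &p.Name, &p.Price, &p.Description, &p.ImageURL,
        &p.Category, &p.InStock, &p.ScrapedAt, &p.SourceURL)
    if err != nil {
        return nil, err
    }
    return &p, nil
}

func (d *DBStorage) Close() error {
    return d.db.Close()
}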
Base Scraper Implementation
Create a base scraper struct that encapsulates common Colly functionality:
// internal/scrapers/base.go
package scrapers

import (
    "log"
    "sync"
    "time"

    "github.com/gocolly/colly/v2"
    "github.com/gocolly/colly/v2/extensions"

    "example.com/scraper/internal/config"
    "example.com/scraper/internal/storage"
)

type BaseScraper struct {
    collector *colly.Collector
    config    *config.Config
    storage   storage.Storage
    logger    *log.Logger
    mu        sync.Mutex // guards stats; callbacks run concurrently in async mode
    stats     ScrapingStats
}

type ScrapingStats struct {
    StartTime    time.Time
    Duration     time.Duration
    RequestCount int
    SuccessCount int
    ErrorCount   int
    ItemsScraped int
}

func NewBaseScraper(cfg *config.Config, store storage.Storage) *BaseScraper {
    c := colly.NewCollector(
        colly.UserAgent(cfg.Scraper.UserAgent),
    )

    // Honor the respect_robots setting
    c.IgnoreRobotsTxt = !cfg.Scraper.RespectRobots

    // Configure timeouts and delays
    c.SetRequestTimeout(cfg.Scraper.Timeout)
    if cfg.Scraper.ParallelWorkers > 1 {
        // LimitRule's Parallelism only takes effect on an async collector;
        // callers must Wait() for the request queue to drain (see
        // ScrapeProducts below).
        c.Async = true
    }
    if cfg.Scraper.Delay > 0 || cfg.Scraper.ParallelWorkers > 1 {
        c.Limit(&colly.LimitRule{
            DomainGlob:  "*",
            Parallelism: cfg.Scraper.ParallelWorkers,
            Delay:       cfg.Scraper.Delay,
        })
    }

    // Add extensions. Note that RandomUserAgent overrides the configured
    // user agent on each request; drop it if you need a fixed one.
    extensions.RandomUserAgent(c)
    extensions.Referer(c)

    scraper := &BaseScraper{
        collector: c,
        config:    cfg,
        storage:   store,
        stats:     ScrapingStats{StartTime: time.Now()},
    }
    scraper.setupCallbacks()
    return scraper
}

func (bs *BaseScraper) setupCallbacks() {
    bs.collector.OnRequest(func(r *colly.Request) {
        bs.mu.Lock()
        bs.stats.RequestCount++
        bs.mu.Unlock()
        if bs.logger != nil {
            bs.logger.Printf("Visiting: %s", r.URL.String())
        }
    })
    bs.collector.OnResponse(func(r *colly.Response) {
        bs.mu.Lock()
        bs.stats.SuccessCount++
        bs.mu.Unlock()
    })
    bs.collector.OnError(func(r *colly.Response, err error) {
        bs.mu.Lock()
        bs.stats.ErrorCount++
        bs.mu.Unlock()
        if bs.logger != nil {
            bs.logger.Printf("Error on %s: %v", r.Request.URL.String(), err)
        }
    })
}

func (bs *BaseScraper) GetStats() ScrapingStats {
    bs.mu.Lock()
    defer bs.mu.Unlock()
    bs.stats.Duration = time.Since(bs.stats.StartTime)
    return bs.stats
}
Specific Scraper Implementation
Build specific scrapers that extend the base functionality:
// internal/scrapers/product_scraper.go
package scrapers

import (
    "strconv"
    "strings"
    "time"

    "github.com/gocolly/colly/v2"

    "example.com/scraper/internal/config"
    "example.com/scraper/internal/models"
    "example.com/scraper/internal/storage"
)

type ProductScraper struct {
    *BaseScraper
    productStorage storage.ProductStorage
}

func NewProductScraper(cfg *config.Config, store storage.ProductStorage) *ProductScraper {
    base := NewBaseScraper(cfg, store)
    scraper := &ProductScraper{
        BaseScraper:    base,
        productStorage: store,
    }
    scraper.setupProductCallbacks()
    return scraper
}

func (ps *ProductScraper) setupProductCallbacks() {
    // Product listing page
    ps.collector.OnHTML(".product-item", func(e *colly.HTMLElement) {
        productURL := e.ChildAttr("a", "href")
        if productURL != "" {
            ps.collector.Visit(e.Request.AbsoluteURL(productURL))
        }
    })

    // Product detail page
    ps.collector.OnHTML(".product-detail", func(e *colly.HTMLElement) {
        product := ps.extractProduct(e)
        if err := ps.productStorage.SaveProduct(product); err != nil {
            if ps.logger != nil {
                ps.logger.Printf("Error saving product: %v", err)
            }
        } else {
            ps.mu.Lock()
            ps.stats.ItemsScraped++
            ps.mu.Unlock()
        }
    })

    // Pagination
    ps.collector.OnHTML(".pagination a.next", func(e *colly.HTMLElement) {
        nextPage := e.Attr("href")
        if nextPage != "" {
            ps.collector.Visit(e.Request.AbsoluteURL(nextPage))
        }
    })
}

func (ps *ProductScraper) extractProduct(e *colly.HTMLElement) *models.Product {
    priceText := strings.TrimSpace(e.ChildText(".price"))
    priceText = strings.ReplaceAll(priceText, "$", "")
    priceText = strings.ReplaceAll(priceText, ",", "")
    price, _ := strconv.ParseFloat(priceText, 64)

    return &models.Product{
        ID:          e.ChildAttr("[data-product-id]", "data-product-id"),
        Name:        strings.TrimSpace(e.ChildText(".product-name")),
        Price:       price,
        Description: strings.TrimSpace(e.ChildText(".product-description")),
        ImageURL:    e.ChildAttr(".product-image img", "src"),
        Category:    strings.TrimSpace(e.ChildText(".product-category")),
        InStock:     !strings.Contains(e.ChildText(".stock-status"), "Out of Stock"),
        ScrapedAt:   time.Now(),
        SourceURL:   e.Request.URL.String(),
    }
}

func (ps *ProductScraper) ScrapeProducts(startURL string) error {
    if err := ps.collector.Visit(startURL); err != nil {
        return err
    }
    // Wait is a no-op for synchronous collectors and blocks until all
    // queued requests finish when the collector runs in async mode.
    ps.collector.Wait()
    return nil
}
Error Handling and Retry Logic
Implement robust error handling with retry mechanisms:
// internal/utils/helpers.go
package utils

import (
    "math"
    "math/rand"
    "time"
)

type RetryConfig struct {
    MaxRetries int
    BaseDelay  time.Duration
    MaxDelay   time.Duration
}

// RetryWithBackoff runs fn until it succeeds or MaxRetries is exhausted,
// sleeping between attempts with exponential backoff and jitter.
func RetryWithBackoff(fn func() error, config RetryConfig) error {
    var lastErr error
    for attempt := 0; attempt <= config.MaxRetries; attempt++ {
        if err := fn(); err != nil {
            lastErr = err
            if attempt == config.MaxRetries {
                break
            }
            // Exponential backoff, capped at MaxDelay
            delay := time.Duration(math.Min(
                float64(config.BaseDelay)*math.Pow(2, float64(attempt)),
                float64(config.MaxDelay),
            ))
            // Add jitter (±25%) so concurrent retries don't synchronize
            jitter := time.Duration((rand.Float64()*0.5 - 0.25) * float64(delay))
            time.Sleep(delay + jitter)
            continue
        }
        return nil
    }
    return lastErr
}
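Because RetryWithBackoff takes a plain func() error, it can wrap any operation. A sketch of one possible wiring is shown below — saveWithRetry and the delay budgets are illustrative, not part of the code above. It deliberately retries a storage write rather than collector.Visit, since Colly's visited-URL cache makes a repeated Visit to the same URL fail immediately:
// Illustrative wiring of the retry helper around a storage write.
package scrapers

import (
    "time"

    "example.com/scraper/internal/models"
    "example.com/scraper/internal/storage"
    "example.com/scraper/internal/utils"
)

func saveWithRetry(store storage.ProductStorage, p *models.Product) error {
    return utils.RetryWithBackoff(func() error {
        return store.SaveProduct(p)
    }, utils.RetryConfig{
        MaxRetries: 3,
        BaseDelay:  500 * time.Millisecond,
        MaxDelay:   10 * time.Second,
    })
}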
Main Application Setup
Structure your main application to tie everything together:
// cmd/scraper/main.go
package main

import (
    "flag"
    "log"

    "example.com/scraper/internal/config"
    "example.com/scraper/internal/scrapers"
    "example.com/scraper/internal/storage"
)

func main() {
    configPath := flag.String("config", "configs/config.yaml", "Path to configuration file")
    targetURL := flag.String("url", "", "URL to scrape")
    flag.Parse()
    if *targetURL == "" {
        log.Fatal("URL is required")
    }

    // Load configuration
    cfg, err := config.Load(*configPath)
    if err != nil {
        log.Fatalf("Failed to load config: %v", err)
    }

    // Initialize storage
    var productStorage storage.ProductStorage
    switch cfg.Storage.Type {
    case "csv":
        productStorage, err = storage.NewCSVStorage(cfg.Storage.Path)
        if err != nil {
            log.Fatalf("Failed to initialize CSV storage: %v", err)
        }
    default:
        log.Fatalf("Unsupported storage type: %s", cfg.Storage.Type)
    }
    defer productStorage.Close()

    // Initialize scraper
    scraper := scrapers.NewProductScraper(cfg, productStorage)

    // Start scraping
    log.Printf("Starting scraping of %s", *targetURL)
    if err := scraper.ScrapeProducts(*targetURL); err != nil {
        // log.Fatalf skips deferred calls, so flush the output first.
        productStorage.Close()
        log.Fatalf("Scraping failed: %v", err)
    }

    // Print statistics
    stats := scraper.GetStats()
    log.Printf("Scraping completed:")
    log.Printf("  Duration: %v", stats.Duration)
    log.Printf("  Requests: %d", stats.RequestCount)
    log.Printf("  Success: %d", stats.SuccessCount)
    log.Printf("  Errors: %d", stats.ErrorCount)
    log.Printf("  Items scraped: %d", stats.ItemsScraped)
}
Testing Strategy
Implement comprehensive testing for your scrapers:
// internal/scrapers/product_scraper_test.go
package scrapers

import (
    "net/http"
    "net/http/httptest"
    "testing"

    "example.com/scraper/internal/config"
    "example.com/scraper/internal/models"
)

// MockProductStorage implements ProductStorage for testing
type MockProductStorage struct {
    SavedProducts []*models.Product
}

func (m *MockProductStorage) SaveProduct(product *models.Product) error {
    m.SavedProducts = append(m.SavedProducts, product)
    return nil
}

func (m *MockProductStorage) SaveProducts(products []*models.Product) error {
    m.SavedProducts = append(m.SavedProducts, products...)
    return nil
}

func (m *MockProductStorage) GetProductByID(id string) (*models.Product, error) {
    for _, product := range m.SavedProducts {
        if product.ID == id {
            return product, nil
        }
    }
    return nil, nil
}

func (m *MockProductStorage) Save(data interface{}) error        { return nil }
func (m *MockProductStorage) SaveBatch(data []interface{}) error { return nil }
func (m *MockProductStorage) Close() error                       { return nil }

func TestProductScraper(t *testing.T) {
    // Create test server
    server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        html := `
        <div class="product-detail">
            <div class="product-name">Test Product</div>
            <div class="price">$29.99</div>
            <div class="product-description">A great test product</div>
        </div>
        `
        w.Write([]byte(html))
    }))
    defer server.Close()

    // Setup test configuration
    cfg := &config.Config{
        Scraper: config.ScraperConfig{
            UserAgent: "test-agent",
            Delay:     0,
        },
    }

    // Mock storage
    mockStorage := &MockProductStorage{}

    // Create scraper
    scraper := NewProductScraper(cfg, mockStorage)

    // Test scraping
    if err := scraper.ScrapeProducts(server.URL); err != nil {
        t.Fatalf("Expected no error, got: %v", err)
    }

    // Verify results
    if len(mockStorage.SavedProducts) != 1 {
        t.Fatalf("Expected 1 product, got %d", len(mockStorage.SavedProducts))
    }
    product := mockStorage.SavedProducts[0]
    if product.Name != "Test Product" {
        t.Errorf("Expected product name 'Test Product', got '%s'", product.Name)
    }
}
Configuration File Example
Create external configuration files for different environments:
# configs/config.yaml
scraper:
  user_agent: "MyBot 1.0"
  delay: 1s
  timeout: 30s
  max_retries: 3
  parallel_workers: 2
  respect_robots: true

storage:
  type: "csv"
  path: "output/products.csv"
  database:
    host: "localhost"
    port: 5432
    database: "scraping_db"
    user: "scraper"
    password: "secret"

logging:
  level: "info"
  format: "json"
  output: "logs/scraper.log"
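Never commit real credentials in these files; the password above is a placeholder. One way to keep secrets out of version control — a sketch, where ApplyEnvOverrides and the variable names are illustrative and not part of the loader above — is to overlay environment variables after loading:
// internal/config/env.go (illustrative)
package config

import "os"

// ApplyEnvOverrides replaces selected settings with environment
// variables when they are set, so config.yaml can hold placeholders.
func ApplyEnvOverrides(cfg *Config) {
    if user := os.Getenv("SCRAPER_DB_USER"); user != "" {
        cfg.Storage.Database.User = user
    }
    if pw := os.Getenv("SCRAPER_DB_PASSWORD"); pw != "" {
        cfg.Storage.Database.Password = pw
    }
}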
Makefile for Build Automation
Automate common tasks with a Makefile:
# Makefile
.PHONY: build test clean run deps lint coverage

BINARY_NAME=scraper
MAIN_PATH=./cmd/scraper

build:
	go build -o bin/$(BINARY_NAME) $(MAIN_PATH)

test:
	go test -v ./...

clean:
	go clean
	rm -f bin/$(BINARY_NAME)

run: build
	./bin/$(BINARY_NAME) -config configs/config.yaml -url $(URL)

deps:
	go mod download
	go mod tidy

lint:
	golangci-lint run

coverage:
	go test -coverprofile=coverage.out ./...
	go tool cover -html=coverage.out
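With this in place, a typical invocation passes the target through the URL variable (the URL here is a placeholder):
make run URL="https://example.com/products"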
Docker Support
Add Docker support for consistent deployment:
# Dockerfile
FROM golang:1.21-alpine AS builder
WORKDIR /app
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN go build -o scraper ./cmd/scraper
FROM alpine:latest
RUN apk --no-cache add ca-certificates
WORKDIR /root/
COPY --from=builder /app/scraper .
COPY --from=builder /app/configs/ ./configs/
CMD ["./scraper"]
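Since the bare CMD exits immediately (the -url flag is required), pass explicit arguments when running the container; the image tag and URL below are placeholders:
docker build -t colly-scraper .
docker run --rm colly-scraper ./scraper -config configs/config.yaml -url "https://example.com/products"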
Best Practices Summary
- Separation of Concerns: Keep scrapers, storage, configuration, and business logic separate
- Interface-Driven Design: Use interfaces for storage and external dependencies to enable testing and flexibility
- Configuration Management: Centralize all configuration in external files to support different environments
- Error Handling: Implement comprehensive error handling with retry logic and graceful degradation
- Testing: Write unit tests for individual components and integration tests for complete workflows
- Logging: Implement structured logging for debugging and monitoring production systems
- Performance: Use connection pooling, rate limiting, and parallel processing appropriately
- Maintainability: Follow Go conventions, keep functions focused and small, and document your code
- Observability: Include metrics collection and monitoring for production deployments
- Security: Never hardcode credentials; use environment variables or a secrets manager, as sketched in the configuration section above
This structured approach ensures your Colly projects remain maintainable, testable, and scalable as they grow in complexity. For scenarios requiring JavaScript execution and complex interactions, consider exploring how to handle browser sessions and manage complex navigation patterns with headless browsers like Puppeteer.