What are the best practices for structuring Colly project code?

Structuring Colly projects properly is crucial for building maintainable, scalable, and robust web scraping applications. This guide covers the essential patterns, architectural decisions, and code organization strategies that will help you create professional-grade Colly scrapers.

Project Structure Overview

A well-structured Colly project should follow a clear separation of concerns with organized directories and modules. Here's a recommended project structure:

project-root/
├── cmd/
│   └── scraper/
│       └── main.go
├── internal/
│   ├── config/
│   │   └── config.go
│   ├── models/
│   │   └── data.go
│   ├── scrapers/
│   │   ├── base.go
│   │   └── product_scraper.go
│   ├── storage/
│   │   ├── interface.go
│   │   ├── csv.go
│   │   └── database.go
│   └── utils/
│       └── helpers.go
├── pkg/
│   └── scraping/
│       └── client.go
├── configs/
│   └── config.yaml
├── scripts/
│   └── build.sh
├── go.mod
├── go.sum
└── README.md

Configuration Management

Create a centralized configuration system to manage all scraper settings, timeouts, and external dependencies:

// internal/config/config.go
package config

import (
    "os"
    "time"

    "gopkg.in/yaml.v2"
)

type Config struct {
    Scraper ScraperConfig `yaml:"scraper"`
    Storage StorageConfig `yaml:"storage"`
    Logging LoggingConfig `yaml:"logging"`
}

type ScraperConfig struct {
    UserAgent       string        `yaml:"user_agent"`
    Delay           time.Duration `yaml:"delay"`
    Timeout         time.Duration `yaml:"timeout"`
    MaxRetries      int           `yaml:"max_retries"`
    ParallelWorkers int           `yaml:"parallel_workers"`
    RespectRobots   bool          `yaml:"respect_robots"`
}

type StorageConfig struct {
    Type     string `yaml:"type"`
    Path     string `yaml:"path"`
    Database struct {
        Host     string `yaml:"host"`
        Port     int    `yaml:"port"`
        Database string `yaml:"database"`
        User     string `yaml:"user"`
        Password string `yaml:"password"`
    } `yaml:"database"`
}

type LoggingConfig struct {
    Level  string `yaml:"level"`
    Format string `yaml:"format"`
    Output string `yaml:"output"`
}

func Load(path string) (*Config, error) {
    file, err := os.Open(path)
    if err != nil {
        return nil, err
    }
    defer file.Close()

    var config Config
    decoder := yaml.NewDecoder(file)
    if err := decoder.Decode(&config); err != nil {
        return nil, err
    }

    return &config, nil
}
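
One caveat with this loader: gopkg.in/yaml.v2 does not decode human-readable duration strings such as "1s" or "30s" into time.Duration on its own, so the delay and timeout values shown in the example config.yaml later in this guide would fail to parse with the struct above. A small wrapper type is one way around this; the sketch below assumes you then declare Delay and Timeout with this type and read their .Duration field:

// Duration lets YAML values like "1s" or "30s" be parsed with time.ParseDuration.
type Duration struct {
    time.Duration
}

func (d *Duration) UnmarshalYAML(unmarshal func(interface{}) error) error {
    var raw string
    if err := unmarshal(&raw); err != nil {
        return err
    }
    parsed, err := time.ParseDuration(raw)
    if err != nil {
        return err
    }
    d.Duration = parsed
    return nil
}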

Data Models and Interfaces

Define clear data structures and interfaces to ensure type safety and maintainability:

// internal/models/data.go
package models

import (
    "errors"
    "time"
)

type Product struct {
    ID          string    `json:"id" db:"id"`
    Name        string    `json:"name" db:"name"`
    Price       float64   `json:"price" db:"price"`
    Description string    `json:"description" db:"description"`
    ImageURL    string    `json:"image_url" db:"image_url"`
    Category    string    `json:"category" db:"category"`
    InStock     bool      `json:"in_stock" db:"in_stock"`
    ScrapedAt   time.Time `json:"scraped_at" db:"scraped_at"`
    SourceURL   string    `json:"source_url" db:"source_url"`
}

type ScrapingResult struct {
    Products []Product `json:"products"`
    Metadata struct {
        TotalPages    int       `json:"total_pages"`
        CurrentPage   int       `json:"current_page"`
        ScrapedAt     time.Time `json:"scraped_at"`
        Duration      string    `json:"duration"`
        ErrorsCount   int       `json:"errors_count"`
    } `json:"metadata"`
}

// Validation methods
func (p *Product) Validate() error {
    if p.Name == "" {
        return errors.New("product name is required")
    }
    if p.Price < 0 {
        return errors.New("product price cannot be negative")
    }
    return nil
}

Storage Interface Pattern

Implement a storage interface to support multiple output formats and databases:

// internal/storage/interface.go
package storage

import "internal/models"

type Storage interface {
    Save(data interface{}) error
    SaveBatch(data []interface{}) error
    Close() error
}

type ProductStorage interface {
    Storage
    SaveProduct(product *models.Product) error
    SaveProducts(products []*models.Product) error
    GetProductByID(id string) (*models.Product, error)
}

// internal/storage/csv.go
package storage

import (
    "encoding/csv"
    "fmt"
    "os"
    "strconv"

    "example.com/project/internal/models"
)

type CSVStorage struct {
    file   *os.File
    writer *csv.Writer
}

func NewCSVStorage(filename string) (*CSVStorage, error) {
    file, err := os.Create(filename)
    if err != nil {
        return nil, err
    }

    writer := csv.NewWriter(file)

    // Write CSV header
    header := []string{"ID", "Name", "Price", "Description", "ImageURL", "Category", "InStock", "ScrapedAt", "SourceURL"}
    if err := writer.Write(header); err != nil {
        return nil, err
    }

    return &CSVStorage{
        file:   file,
        writer: writer,
    }, nil
}

func (c *CSVStorage) SaveProduct(product *models.Product) error {
    if err := product.Validate(); err != nil {
        return fmt.Errorf("product validation failed: %w", err)
    }

    record := []string{
        product.ID,
        product.Name,
        strconv.FormatFloat(product.Price, 'f', 2, 64),
        product.Description,
        product.ImageURL,
        product.Category,
        strconv.FormatBool(product.InStock),
        product.ScrapedAt.Format("2006-01-02 15:04:05"),
        product.SourceURL,
    }

    return c.writer.Write(record)
}

func (c *CSVStorage) Close() error {
    c.writer.Flush()
    return c.file.Close()
}
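
As written, CSVStorage only implements SaveProduct and Close, but the main application below assigns it to the ProductStorage interface, so the remaining methods need to exist as well. A minimal sketch of what they could look like for an append-only CSV backend (a real implementation might keep an in-memory index for lookups):

// SaveProducts writes a slice of products, stopping at the first error.
func (c *CSVStorage) SaveProducts(products []*models.Product) error {
    for _, p := range products {
        if err := c.SaveProduct(p); err != nil {
            return err
        }
    }
    return nil
}

// Save and SaveBatch satisfy the generic Storage interface by accepting
// *models.Product values and rejecting anything else.
func (c *CSVStorage) Save(data interface{}) error {
    p, ok := data.(*models.Product)
    if !ok {
        return fmt.Errorf("csv storage: unsupported type %T", data)
    }
    return c.SaveProduct(p)
}

func (c *CSVStorage) SaveBatch(data []interface{}) error {
    for _, item := range data {
        if err := c.Save(item); err != nil {
            return err
        }
    }
    return nil
}

// GetProductByID is not supported by the append-only CSV backend.
func (c *CSVStorage) GetProductByID(id string) (*models.Product, error) {
    return nil, fmt.Errorf("csv storage: lookup by id is not supported")
}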

Base Scraper Implementation

Create a base scraper struct that encapsulates common Colly functionality:

// internal/scrapers/base.go
package scrapers

import (
    "log"
    "time"

    "github.com/gocolly/colly/v2"
    "github.com/gocolly/colly/v2/extensions"

    "example.com/project/internal/config"
    "example.com/project/internal/storage"
)

type BaseScraper struct {
    collector *colly.Collector
    config    *config.Config
    storage   storage.Storage
    logger    *log.Logger
    stats     ScrapingStats
}

type ScrapingStats struct {
    StartTime     time.Time
    Duration      time.Duration
    RequestCount  int
    SuccessCount  int
    ErrorCount    int
    ItemsScraped  int
}

func NewBaseScraper(cfg *config.Config, store storage.Storage) *BaseScraper {
    c := colly.NewCollector(
        colly.UserAgent(cfg.Scraper.UserAgent),
    )

    // Configure timeouts and delays
    c.SetRequestTimeout(cfg.Scraper.Timeout)
    if cfg.Scraper.Delay > 0 {
        c.Limit(&colly.LimitRule{
            DomainGlob:  "*",
            Parallelism: cfg.Scraper.ParallelWorkers,
            Delay:       cfg.Scraper.Delay,
        })
    }

    // Optional extensions. Note that RandomUserAgent overrides the UserAgent
    // configured above on every request; drop it if you need a fixed value.
    extensions.RandomUserAgent(c)
    extensions.Referer(c)

    scraper := &BaseScraper{
        collector: c,
        config:    cfg,
        storage:   store,
        stats:     ScrapingStats{StartTime: time.Now()},
    }

    scraper.setupCallbacks()
    return scraper
}

func (bs *BaseScraper) setupCallbacks() {
    bs.collector.OnRequest(func(r *colly.Request) {
        bs.stats.RequestCount++
        if bs.logger != nil {
            bs.logger.Printf("Visiting: %s", r.URL.String())
        }
    })

    bs.collector.OnResponse(func(r *colly.Response) {
        bs.stats.SuccessCount++
    })

    bs.collector.OnError(func(r *colly.Response, err error) {
        bs.stats.ErrorCount++
        if bs.logger != nil {
            bs.logger.Printf("Error on %s: %v", r.Request.URL.String(), err)
        }
    })
}

func (bs *BaseScraper) GetStats() ScrapingStats {
    bs.stats.Duration = time.Since(bs.stats.StartTime)
    return bs.stats
}
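
Nothing above assigns the logger field, even though the callbacks check it for nil. One simple way to wire it up is a setter on the base scraper (a sketch; you could equally pass the logger into NewBaseScraper):

// SetLogger attaches a standard-library logger used by the request,
// response, and error callbacks above.
func (bs *BaseScraper) SetLogger(logger *log.Logger) {
    bs.logger = logger
}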

Specific Scraper Implementation

Build specific scrapers that extend the base functionality:

// internal/scrapers/product_scraper.go
package scrapers

import (
    "strconv"
    "strings"
    "time"

    "github.com/gocolly/colly/v2"

    "example.com/project/internal/config"
    "example.com/project/internal/models"
    "example.com/project/internal/storage"
)

type ProductScraper struct {
    *BaseScraper
    productStorage storage.ProductStorage
}

func NewProductScraper(cfg *config.Config, store storage.ProductStorage) *ProductScraper {
    base := NewBaseScraper(cfg, store)

    scraper := &ProductScraper{
        BaseScraper:    base,
        productStorage: store,
    }

    scraper.setupProductCallbacks()
    return scraper
}

func (ps *ProductScraper) setupProductCallbacks() {
    // Product listing page
    ps.collector.OnHTML(".product-item", func(e *colly.HTMLElement) {
        productURL := e.ChildAttr("a", "href")
        if productURL != "" {
            ps.collector.Visit(e.Request.AbsoluteURL(productURL))
        }
    })

    // Product detail page
    ps.collector.OnHTML(".product-detail", func(e *colly.HTMLElement) {
        product := ps.extractProduct(e)
        if err := ps.productStorage.SaveProduct(product); err != nil {
            if ps.logger != nil {
                ps.logger.Printf("Error saving product: %v", err)
            }
        } else {
            ps.stats.ItemsScraped++
        }
    })

    // Pagination
    ps.collector.OnHTML(".pagination a.next", func(e *colly.HTMLElement) {
        nextPage := e.Attr("href")
        if nextPage != "" {
            ps.collector.Visit(e.Request.AbsoluteURL(nextPage))
        }
    })
}

func (ps *ProductScraper) extractProduct(e *colly.HTMLElement) *models.Product {
    priceText := strings.TrimSpace(e.ChildText(".price"))
    priceText = strings.ReplaceAll(priceText, "$", "")
    priceText = strings.ReplaceAll(priceText, ",", "")

    price, _ := strconv.ParseFloat(priceText, 64)

    return &models.Product{
        ID:          e.ChildAttr("[data-product-id]", "data-product-id"),
        Name:        strings.TrimSpace(e.ChildText(".product-name")),
        Price:       price,
        Description: strings.TrimSpace(e.ChildText(".product-description")),
        ImageURL:    e.ChildAttr(".product-image img", "src"),
        Category:    strings.TrimSpace(e.ChildText(".product-category")),
        InStock:     !strings.Contains(e.ChildText(".stock-status"), "Out of Stock"),
        ScrapedAt:   time.Now(),
        SourceURL:   e.Request.URL.String(),
    }
}

func (ps *ProductScraper) ScrapeProducts(startURL string) error {
    return ps.collector.Visit(startURL)
}
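
Keep in mind that the Parallelism value in the LimitRule only takes effect when the collector runs asynchronously, i.e. when colly.Async(true) is passed to colly.NewCollector in NewBaseScraper. An async collector also has to be waited on before the program reads stats or exits, so in that case ScrapeProducts might look like this sketch instead:

func (ps *ProductScraper) ScrapeProducts(startURL string) error {
    if err := ps.collector.Visit(startURL); err != nil {
        return err
    }
    ps.collector.Wait() // returns immediately for a synchronous collector
    return nil
}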

Error Handling and Retry Logic

Implement robust error handling with retry mechanisms:

// internal/utils/helpers.go
package utils

import (
    "math"
    "math/rand"
    "time"
)

type RetryConfig struct {
    MaxRetries int
    BaseDelay  time.Duration
    MaxDelay   time.Duration
}

func RetryWithBackoff(fn func() error, config RetryConfig) error {
    var lastErr error

    for attempt := 0; attempt <= config.MaxRetries; attempt++ {
        if err := fn(); err != nil {
            lastErr = err

            if attempt == config.MaxRetries {
                break
            }

            // Exponential backoff with jitter
            delay := time.Duration(math.Min(
                float64(config.BaseDelay)*math.Pow(2, float64(attempt)),
                float64(config.MaxDelay),
            ))

            // Add jitter (±25%)
            jitter := time.Duration(rand.Float64() * 0.5 * float64(delay))
            if rand.Float64() < 0.5 {
                delay -= jitter
            } else {
                delay += jitter
            }

            time.Sleep(delay)
            continue
        }

        return nil
    }

    return lastErr
}
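
How you use the helper is up to you; one hedged example wraps the entry-point call from the main function shown in the next section (cfg, scraper, and targetURL are assumed to be in scope there), while retrying individual requests would instead hook into the collector's OnError callback:

// Retry the initial visit with exponential backoff.
retryCfg := utils.RetryConfig{
    MaxRetries: cfg.Scraper.MaxRetries,
    BaseDelay:  time.Second,
    MaxDelay:   30 * time.Second,
}

if err := utils.RetryWithBackoff(func() error {
    return scraper.ScrapeProducts(*targetURL)
}, retryCfg); err != nil {
    log.Fatalf("Scraping failed after retries: %v", err)
}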

Main Application Setup

Structure your main application to tie everything together:

// cmd/scraper/main.go
package main

import (
    "flag"
    "log"

    "example.com/project/internal/config"
    "example.com/project/internal/scrapers"
    "example.com/project/internal/storage"
)

func main() {
    configPath := flag.String("config", "configs/config.yaml", "Path to configuration file")
    targetURL := flag.String("url", "", "URL to scrape")
    flag.Parse()

    if *targetURL == "" {
        log.Fatal("URL is required")
    }

    // Load configuration
    cfg, err := config.Load(*configPath)
    if err != nil {
        log.Fatalf("Failed to load config: %v", err)
    }

    // Initialize storage
    var productStorage storage.ProductStorage
    switch cfg.Storage.Type {
    case "csv":
        productStorage, err = storage.NewCSVStorage(cfg.Storage.Path)
        if err != nil {
            log.Fatalf("Failed to initialize CSV storage: %v", err)
        }
    default:
        log.Fatalf("Unsupported storage type: %s", cfg.Storage.Type)
    }
    defer productStorage.Close()

    // Initialize scraper
    scraper := scrapers.NewProductScraper(cfg, productStorage)

    // Start scraping
    log.Printf("Starting scraping of %s", *targetURL)
    if err := scraper.ScrapeProducts(*targetURL); err != nil {
        log.Fatalf("Scraping failed: %v", err)
    }

    // Print statistics
    stats := scraper.GetStats()
    log.Printf("Scraping completed:")
    log.Printf("  Duration: %v", stats.Duration)
    log.Printf("  Requests: %d", stats.RequestCount)
    log.Printf("  Success: %d", stats.SuccessCount)
    log.Printf("  Errors: %d", stats.ErrorCount)
    log.Printf("  Items scraped: %d", stats.ItemsScraped)
}
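
Nothing in main wires up the LoggingConfig section yet. A minimal sketch that honours the output setting is a small helper added to main.go (it needs the os package imported); call scraper.SetLogger with its result, using the setter sketched earlier, right after constructing the scraper:

// setupLogger opens the configured log destination; it falls back to stderr
// when no output path is set. The file handle lives for the whole process.
func setupLogger(cfg *config.Config) (*log.Logger, error) {
    if cfg.Logging.Output == "" {
        return log.New(os.Stderr, "scraper: ", log.LstdFlags), nil
    }
    f, err := os.OpenFile(cfg.Logging.Output, os.O_CREATE|os.O_APPEND|os.O_WRONLY, 0o644)
    if err != nil {
        return nil, err
    }
    return log.New(f, "scraper: ", log.LstdFlags), nil
}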

Testing Strategy

Implement comprehensive testing for your scrapers:

// internal/scrapers/product_scraper_test.go
package scrapers

import (
    "net/http"
    "net/http/httptest"
    "testing"

    "example.com/project/internal/config"
    "example.com/project/internal/models"
)

// MockProductStorage implements ProductStorage for testing
type MockProductStorage struct {
    SavedProducts []*models.Product
}

func (m *MockProductStorage) SaveProduct(product *models.Product) error {
    m.SavedProducts = append(m.SavedProducts, product)
    return nil
}

func (m *MockProductStorage) SaveProducts(products []*models.Product) error {
    m.SavedProducts = append(m.SavedProducts, products...)
    return nil
}

func (m *MockProductStorage) GetProductByID(id string) (*models.Product, error) {
    for _, product := range m.SavedProducts {
        if product.ID == id {
            return product, nil
        }
    }
    return nil, nil
}

func (m *MockProductStorage) Save(data interface{}) error { return nil }
func (m *MockProductStorage) SaveBatch(data []interface{}) error { return nil }
func (m *MockProductStorage) Close() error { return nil }

func TestProductScraper(t *testing.T) {
    // Create test server
    server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        html := `
        <div class="product-detail">
            <div class="product-name">Test Product</div>
            <div class="price">$29.99</div>
            <div class="product-description">A great test product</div>
        </div>
        `
        w.Write([]byte(html))
    }))
    defer server.Close()

    // Setup test configuration
    cfg := &config.Config{
        Scraper: config.ScraperConfig{
            UserAgent: "test-agent",
            Delay:     0,
        },
    }

    // Mock storage
    mockStorage := &MockProductStorage{}

    // Create scraper
    scraper := NewProductScraper(cfg, mockStorage)

    // Test scraping
    err := scraper.ScrapeProducts(server.URL)
    if err != nil {
        t.Fatalf("Expected no error, got: %v", err)
    }

    // Verify results
    if len(mockStorage.SavedProducts) != 1 {
        t.Fatalf("Expected 1 product, got %d", len(mockStorage.SavedProducts))
    }

    product := mockStorage.SavedProducts[0]
    if product.Name != "Test Product" {
        t.Errorf("Expected product name 'Test Product', got '%s'", product.Name)
    }
}
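
Not every test needs a live collector: pure functions such as Product.Validate are natural candidates for table-driven unit tests, for example:

// internal/models/data_test.go
package models

import (
    "testing"
    "time"
)

func TestProductValidate(t *testing.T) {
    cases := []struct {
        name    string
        product Product
        wantErr bool
    }{
        {"valid product", Product{Name: "Widget", Price: 9.99, ScrapedAt: time.Now()}, false},
        {"missing name", Product{Price: 9.99}, true},
        {"negative price", Product{Name: "Widget", Price: -1}, true},
    }

    for _, tc := range cases {
        t.Run(tc.name, func(t *testing.T) {
            err := tc.product.Validate()
            if (err != nil) != tc.wantErr {
                t.Errorf("Validate() error = %v, wantErr %v", err, tc.wantErr)
            }
        })
    }
}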

Configuration File Example

Create external configuration files for different environments:

# configs/config.yaml
scraper:
  user_agent: "MyBot 1.0"
  delay: 1s
  timeout: 30s
  max_retries: 3
  parallel_workers: 2
  respect_robots: true

storage:
  type: "csv"
  path: "output/products.csv"
  database:
    host: "localhost"
    port: 5432
    database: "scraping_db"
    user: "scraper"
    password: "secret"

logging:
  level: "info"
  format: "json"
  output: "logs/scraper.log"
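
Avoid committing real credentials in config.yaml. One common approach is to leave the password empty in the file and let config.Load fall back to an environment variable after decoding; the variable name below is only an example:

// In internal/config/config.go, after decoder.Decode(&config):
if envPassword := os.Getenv("SCRAPER_DB_PASSWORD"); envPassword != "" {
    config.Storage.Database.Password = envPassword
}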

Makefile for Build Automation

Automate common tasks with a Makefile (remember that recipe lines must be indented with a tab character, not spaces):

# Makefile
.PHONY: build test clean run

BINARY_NAME=scraper
MAIN_PATH=./cmd/scraper

build:
    go build -o bin/$(BINARY_NAME) $(MAIN_PATH)

test:
    go test -v ./...

clean:
    go clean
    rm -f bin/$(BINARY_NAME)

run: build
    ./bin/$(BINARY_NAME) -config configs/config.yaml -url $(URL)

deps:
    go mod download
    go mod tidy

lint:
    golangci-lint run

coverage:
    go test -coverprofile=coverage.out ./...
    go tool cover -html=coverage.out
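
With these targets in place, a full build-and-run cycle is a single command; the URL variable is passed through to the binary:

make run URL="https://example.com/products"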

Docker Support

Add Docker support for consistent deployment:

# Dockerfile
FROM golang:1.21-alpine AS builder

WORKDIR /app
COPY go.mod go.sum ./
RUN go mod download

COPY . .
RUN go build -o scraper ./cmd/scraper

FROM alpine:latest
RUN apk --no-cache add ca-certificates
WORKDIR /root/

COPY --from=builder /app/scraper .
COPY --from=builder /app/configs/ ./configs/

CMD ["./scraper"]
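
Building and running the image then follows the usual Docker workflow; the image tag and target URL below are only examples, and arguments given after the image name override the default CMD:

docker build -t colly-scraper .
docker run --rm colly-scraper ./scraper -config configs/config.yaml -url "https://example.com/products"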

Best Practices Summary

  1. Separation of Concerns: Keep scrapers, storage, configuration, and business logic separate
  2. Interface-Driven Design: Use interfaces for storage and external dependencies to enable testing and flexibility
  3. Configuration Management: Centralize all configuration in external files to support different environments
  4. Error Handling: Implement comprehensive error handling with retry logic and graceful degradation
  5. Testing: Write unit tests for individual components and integration tests for complete workflows
  6. Logging: Implement structured logging for debugging and monitoring production systems
  7. Performance: Use connection pooling, rate limiting, and parallel processing appropriately
  8. Maintainability: Follow Go conventions, keep functions focused and small, and document your code
  9. Observability: Include metrics collection and monitoring for production deployments
  10. Security: Never hardcode credentials; use environment variables or secure configuration management instead

This structured approach keeps your Colly projects maintainable, testable, and scalable as they grow in complexity. For scenarios that require JavaScript execution and complex interactions, consider how to handle browser sessions and manage complex navigation patterns with headless browsers such as Puppeteer.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
