What are the Best Practices for Structuring Go Scraping Projects?

Structuring a Go web scraping project well from the start is crucial for maintainability, scalability, and team collaboration: a clear layout makes it easier to add new features, debug issues, and grow your scraping operations. This guide covers the essential best practices for organizing Go scraping projects.

Project Directory Structure

Standard Go Project Layout

Following Go's conventional project structure helps other developers understand your codebase quickly:

go-scraper/
├── cmd/
│   ├── scraper/
│   │   └── main.go
│   └── worker/
│       └── main.go
├── internal/
│   ├── config/
│   │   └── config.go
│   ├── scraper/
│   │   ├── client.go
│   │   ├── parser.go
│   │   └── models.go
│   ├── storage/
│   │   ├── database.go
│   │   └── file.go
│   └── queue/
│       └── redis.go
├── pkg/
│   ├── utils/
│   │   └── retry.go
│   └── validator/
│       └── url.go
├── configs/
│   ├── config.yaml
│   └── config.prod.yaml
├── scripts/
│   └── deploy.sh
├── tests/
│   └── integration/
├── go.mod
├── go.sum
├── Makefile
└── README.md

Directory Purpose Explanation

  • cmd/: Contains main applications for the project. Each subdirectory is an executable.
  • internal/: Private application and library code that shouldn't be imported by other applications.
  • pkg/: Library code that can be used by external applications.
  • configs/: Configuration file templates and default configs.
  • scripts/: Scripts for building, installing, analysis, etc.
  • tests/: Additional external test apps and test data.
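
Each directory under cmd/ should stay a thin entry point that only wires the internal packages together. Below is a minimal sketch of cmd/scraper/main.go using the components defined later in this guide; the module path go-scraper and the flag handling are assumptions for illustration.

// cmd/scraper/main.go: minimal wiring sketch (module path "go-scraper" is assumed)
package main

import (
    "flag"
    "log"

    "go-scraper/internal/config"
    "go-scraper/internal/scraper"
)

func main() {
    configPath := flag.String("config", "configs/config.yaml", "path to the config file")
    flag.Parse()

    cfg, err := config.Load(*configPath)
    if err != nil {
        log.Fatalf("loading config: %v", err)
    }

    client := scraper.NewClient(scraper.ClientConfig{
        UserAgent:  cfg.Scraper.UserAgent,
        Timeout:    cfg.Scraper.Timeout,
        MaxRetries: cfg.Scraper.MaxRetries,
        RateLimit:  cfg.Scraper.RateLimit,
    })

    // Concrete parser, storage, and queue implementations are project-specific;
    // anything satisfying the interfaces in internal/scraper works here.
    var (
        parser  scraper.Parser
        storage scraper.Storage
        queue   scraper.Queue
    )

    service := scraper.NewService(client, parser, storage, queue, log.Default())
    _ = service // hand the service to an HTTP server or a worker loop
}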

Core Components Architecture

1. Configuration Management

Create a centralized configuration system:

// internal/config/config.go
package config

import (
    "fmt"
    "time"

    "github.com/spf13/viper"
)

type Config struct {
    Server   ServerConfig   `mapstructure:"server"`
    Database DatabaseConfig `mapstructure:"database"`
    Scraper  ScraperConfig  `mapstructure:"scraper"`
    Redis    RedisConfig    `mapstructure:"redis"`
}

type ServerConfig struct {
    Port         int           `mapstructure:"port"`
    ReadTimeout  time.Duration `mapstructure:"read_timeout"`
    WriteTimeout time.Duration `mapstructure:"write_timeout"`
}

type ScraperConfig struct {
    UserAgent     string        `mapstructure:"user_agent"`
    Timeout       time.Duration `mapstructure:"timeout"`
    MaxRetries    int           `mapstructure:"max_retries"`
    RateLimit     int           `mapstructure:"rate_limit"`
    ConcurrentMax int           `mapstructure:"concurrent_max"`
}

// DatabaseConfig and RedisConfig mirror the database and redis sections of config.yaml.
type DatabaseConfig struct {
    Host     string `mapstructure:"host"`
    Port     int    `mapstructure:"port"`
    Name     string `mapstructure:"name"`
    User     string `mapstructure:"user"`
    Password string `mapstructure:"password"`
}

type RedisConfig struct {
    Host string `mapstructure:"host"`
    Port int    `mapstructure:"port"`
    DB   int    `mapstructure:"db"`
}

func Load(configPath string) (*Config, error) {
    viper.SetConfigFile(configPath)
    viper.AutomaticEnv()

    if err := viper.ReadInConfig(); err != nil {
        return nil, fmt.Errorf("error reading config file: %w", err)
    }

    var config Config
    if err := viper.Unmarshal(&config); err != nil {
        return nil, fmt.Errorf("error unmarshaling config: %w", err)
    }

    return &config, nil
}

2. HTTP Client Abstraction

Create a reusable HTTP client with proper configuration:

// internal/scraper/client.go
package scraper

import (
    "context"
    "net/http"
    "time"

    "golang.org/x/time/rate"
)

// ClientConfig holds the settings the HTTP client needs; in practice it can be
// populated from config.ScraperConfig.
type ClientConfig struct {
    UserAgent  string
    Timeout    time.Duration
    MaxRetries int
    RateLimit  int
}

type Client struct {
    httpClient  *http.Client
    userAgent   string
    rateLimiter *rate.Limiter
    maxRetries  int
}

func NewClient(config ClientConfig) *Client {
    return &Client{
        httpClient: &http.Client{
            Timeout: config.Timeout,
            Transport: &http.Transport{
                MaxIdleConns:        100,
                MaxIdleConnsPerHost: 10,
                IdleConnTimeout:     90 * time.Second,
            },
        },
        userAgent:   config.UserAgent,
        rateLimiter: rate.NewLimiter(rate.Limit(config.RateLimit), 1),
        maxRetries:  config.MaxRetries,
    }
}

func (c *Client) Get(ctx context.Context, url string) (*http.Response, error) {
    // Wait for rate limiter
    if err := c.rateLimiter.Wait(ctx); err != nil {
        return nil, err
    }

    req, err := http.NewRequestWithContext(ctx, "GET", url, nil)
    if err != nil {
        return nil, err
    }

    req.Header.Set("User-Agent", c.userAgent)

    return c.httpClient.Do(req)
}
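
Note that maxRetries is stored but not yet consumed by Get. One way to use it is a thin retry wrapper with exponential backoff; the GetWithRetry method below is a hypothetical sketch, and the same logic could live in pkg/utils/retry.go as a generic helper.

// internal/scraper/retry.go (hypothetical): retry wrapper around Client.Get
package scraper

import (
    "context"
    "fmt"
    "net/http"
    "time"
)

// GetWithRetry retries transient failures (network errors and 5xx responses)
// up to maxRetries times with simple exponential backoff.
func (c *Client) GetWithRetry(ctx context.Context, url string) (*http.Response, error) {
    var lastErr error

    for attempt := 0; attempt <= c.maxRetries; attempt++ {
        if attempt > 0 {
            backoff := time.Duration(1<<uint(attempt-1)) * time.Second
            select {
            case <-time.After(backoff):
            case <-ctx.Done():
                return nil, ctx.Err()
            }
        }

        resp, err := c.Get(ctx, url)
        if err != nil {
            lastErr = err
            continue
        }

        // Retry on server errors; hand everything else back to the caller.
        if resp.StatusCode >= 500 {
            resp.Body.Close()
            lastErr = fmt.Errorf("server returned status %d", resp.StatusCode)
            continue
        }

        return resp, nil
    }

    return nil, fmt.Errorf("request to %s failed after %d retries: %w", url, c.maxRetries, lastErr)
}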

3. Data Models and Structures

Define clear data structures for scraped content:

// internal/scraper/models.go
package scraper

import (
    "time"
)

type ScrapingJob struct {
    ID          string            `json:"id"`
    URL         string            `json:"url"`
    Status      JobStatus         `json:"status"`
    Priority    int               `json:"priority"`
    Metadata    map[string]string `json:"metadata"`
    CreatedAt   time.Time         `json:"created_at"`
    CompletedAt *time.Time        `json:"completed_at,omitempty"`
    Error       string            `json:"error,omitempty"`
}

type JobStatus string

const (
    JobStatusPending   JobStatus = "pending"
    JobStatusRunning   JobStatus = "running"
    JobStatusCompleted JobStatus = "completed"
    JobStatusFailed    JobStatus = "failed"
)

type ScrapedData struct {
    URL         string            `json:"url"`
    Title       string            `json:"title"`
    Content     string            `json:"content"`
    Links       []string          `json:"links"`
    Images      []string          `json:"images"`
    Metadata    map[string]string `json:"metadata"`
    ScrapedAt   time.Time         `json:"scraped_at"`
}

type ScrapeResult struct {
    Data  *ScrapedData `json:"data,omitempty"`
    Error error        `json:"error,omitempty"`
}

Modular Design Patterns

1. Interface-Based Architecture

Use interfaces to make components easily testable and replaceable:

// internal/scraper/interfaces.go
package scraper

import "context"

type Scraper interface {
    Scrape(ctx context.Context, url string) (*ScrapedData, error)
}

type Parser interface {
    Parse(html string) (*ScrapedData, error)
}

type Storage interface {
    Store(ctx context.Context, data *ScrapedData) error
    Get(ctx context.Context, id string) (*ScrapedData, error)
}

type Queue interface {
    Push(ctx context.Context, job *ScrapingJob) error
    Pop(ctx context.Context) (*ScrapingJob, error)
}
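
For reference, here is one possible Parser implementation (the parser.go file from the layout above), sketched with the goquery library (github.com/PuerkitoBio/goquery). Using goquery is an assumption rather than a requirement; any parser that produces a *ScrapedData satisfies the interface.

// internal/scraper/parser.go: a Parser sketch built on goquery (assumed dependency)
package scraper

import (
    "strings"
    "time"

    "github.com/PuerkitoBio/goquery"
)

type HTMLParser struct{}

func NewHTMLParser() *HTMLParser {
    return &HTMLParser{}
}

func (p *HTMLParser) Parse(html string) (*ScrapedData, error) {
    doc, err := goquery.NewDocumentFromReader(strings.NewReader(html))
    if err != nil {
        return nil, err
    }

    // The URL field is left for the caller, which knows the request URL.
    data := &ScrapedData{
        Title:     strings.TrimSpace(doc.Find("title").First().Text()),
        Content:   strings.TrimSpace(doc.Find("body").Text()),
        Metadata:  map[string]string{},
        ScrapedAt: time.Now(),
    }

    // Collect outgoing links and image sources.
    doc.Find("a[href]").Each(func(_ int, s *goquery.Selection) {
        if href, ok := s.Attr("href"); ok {
            data.Links = append(data.Links, href)
        }
    })
    doc.Find("img[src]").Each(func(_ int, s *goquery.Selection) {
        if src, ok := s.Attr("src"); ok {
            data.Images = append(data.Images, src)
        }
    })

    return data, nil
}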

2. Service Layer Implementation

Implement the service layer that coordinates different components:

// internal/scraper/service.go
package scraper

import (
    "context"
    "fmt"
    "io"
    "log"
)

type Service struct {
    client  Fetcher
    parser  Parser
    storage Storage
    queue   Queue
    logger  *log.Logger
}

func NewService(client Fetcher, parser Parser, storage Storage, queue Queue, logger *log.Logger) *Service {
    return &Service{
        client:  client,
        parser:  parser,
        storage: storage,
        queue:   queue,
        logger:  logger,
    }
}

func (s *Service) ProcessJob(ctx context.Context, job *ScrapingJob) error {
    s.logger.Printf("Processing job %s for URL: %s", job.ID, job.URL)

    // Fetch the page
    resp, err := s.client.Get(ctx, job.URL)
    if err != nil {
        return fmt.Errorf("failed to fetch URL %s: %w", job.URL, err)
    }
    defer resp.Body.Close()

    // Read the body and parse the content
    body, err := io.ReadAll(resp.Body)
    if err != nil {
        return fmt.Errorf("failed to read response body: %w", err)
    }

    data, err := s.parser.Parse(string(body))
    if err != nil {
        return fmt.Errorf("failed to parse content: %w", err)
    }

    // Store the data
    if err := s.storage.Store(ctx, data); err != nil {
        return fmt.Errorf("failed to store data: %w", err)
    }

    s.logger.Printf("Successfully processed job %s", job.ID)
    return nil
}

Error Handling and Logging

Structured Error Handling

Implement comprehensive error handling with proper error types:

// pkg/errors/errors.go
package errors

import "fmt"

type ScrapingError struct {
    Type    ErrorType
    URL     string
    Message string
    Err     error
}

type ErrorType string

const (
    ErrorTypeNetwork    ErrorType = "network"
    ErrorTypeParsing    ErrorType = "parsing"
    ErrorTypeValidation ErrorType = "validation"
    ErrorTypeStorage    ErrorType = "storage"
)

func (e *ScrapingError) Error() string {
    if e.Err != nil {
        return fmt.Sprintf("%s error for URL %s: %s (%v)", e.Type, e.URL, e.Message, e.Err)
    }
    return fmt.Sprintf("%s error for URL %s: %s", e.Type, e.URL, e.Message)
}

func NewNetworkError(url, message string, err error) *ScrapingError {
    return &ScrapingError{
        Type:    ErrorTypeNetwork,
        URL:     url,
        Message: message,
        Err:     err,
    }
}
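
Callers can then branch on the error type using the standard errors package. A short sketch, assuming the module path go-scraper and an illustrative handleFailure helper inside a worker package:

// example: deciding whether to retry based on the error type
package worker

import (
    "errors"
    "log"

    scrapererrors "go-scraper/pkg/errors"
)

// handleFailure reports whether a failed job should be retried.
func handleFailure(err error) (retry bool) {
    var scrapeErr *scrapererrors.ScrapingError
    if !errors.As(err, &scrapeErr) {
        log.Printf("unexpected error: %v", err)
        return false
    }

    switch scrapeErr.Type {
    case scrapererrors.ErrorTypeNetwork:
        log.Printf("transient network failure for %s, will retry", scrapeErr.URL)
        return true
    default:
        log.Printf("%s failure for %s, skipping", scrapeErr.Type, scrapeErr.URL)
        return false
    }
}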

Structured Logging

Use structured logging for better debugging:

// internal/logger/logger.go
package logger

import (
    "log/slog"
    "os"
)

func New(level string) *slog.Logger {
    var logLevel slog.Level
    switch level {
    case "debug":
        logLevel = slog.LevelDebug
    case "info":
        logLevel = slog.LevelInfo
    case "warn":
        logLevel = slog.LevelWarn
    case "error":
        logLevel = slog.LevelError
    default:
        logLevel = slog.LevelInfo
    }

    opts := &slog.HandlerOptions{
        Level: logLevel,
    }

    handler := slog.NewJSONHandler(os.Stdout, opts)
    return slog.New(handler)
}
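
The payoff is that log attributes become JSON fields you can filter and aggregate. A brief usage sketch (the attribute keys are arbitrary, and the module path is assumed):

// example usage of the structured logger
package main

import "go-scraper/internal/logger"

func main() {
    log := logger.New("debug")

    // Attributes are emitted as JSON fields, easy to query in a log aggregator.
    log.Info("job started", "job_id", "test-job", "url", "https://example.com")
    log.Error("fetch failed", "job_id", "test-job", "status", 503, "attempt", 2)
}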

Concurrency and Performance

Worker Pool Pattern

Implement a worker pool for concurrent scraping:

// internal/worker/pool.go
package worker

import (
    "context"
    "sync"
)

type Pool struct {
    workers    int
    jobChan    chan Job
    resultChan chan Result
    wg         sync.WaitGroup
}

type Job struct {
    ID  string
    URL string
}

type Result struct {
    JobID string
    Data  interface{}
    Error error
}

func NewPool(workers int) *Pool {
    return &Pool{
        workers:    workers,
        jobChan:    make(chan Job, workers*2),
        resultChan: make(chan Result, workers*2),
    }
}

func (p *Pool) Start(ctx context.Context, processor func(Job) Result) {
    for i := 0; i < p.workers; i++ {
        p.wg.Add(1)
        go func() {
            defer p.wg.Done()
            for {
                select {
                case job, ok := <-p.jobChan:
                    if !ok {
                        return
                    }
                    result := processor(job)
                    select {
                    case p.resultChan <- result:
                    case <-ctx.Done():
                        return
                    }
                case <-ctx.Done():
                    return
                }
            }
        }()
    }
}

func (p *Pool) Submit(job Job) {
    p.jobChan <- job
}

func (p *Pool) Results() <-chan Result {
    return p.resultChan
}

func (p *Pool) Close() {
    close(p.jobChan)
    p.wg.Wait()
    close(p.resultChan)
}
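
Because Close closes the job channel, waits for the workers, and only then closes the result channel, results must be drained concurrently with submission or the workers can block on a full result channel. A usage sketch with a stand-in processor instead of real scraping work:

// example usage of the worker pool (module path is assumed)
package main

import (
    "context"
    "fmt"
    "strings"

    "go-scraper/internal/worker"
)

func main() {
    ctx := context.Background()
    pool := worker.NewPool(5)

    // Stand-in processor; a real one would call the scraper service.
    pool.Start(ctx, func(job worker.Job) worker.Result {
        return worker.Result{JobID: job.ID, Data: strings.ToUpper(job.URL)}
    })

    // Drain results concurrently so workers never block on a full result channel.
    done := make(chan struct{})
    go func() {
        defer close(done)
        for res := range pool.Results() {
            if res.Error != nil {
                fmt.Printf("job %s failed: %v\n", res.JobID, res.Error)
                continue
            }
            fmt.Printf("job %s done: %v\n", res.JobID, res.Data)
        }
    }()

    for i := 0; i < 20; i++ {
        pool.Submit(worker.Job{ID: fmt.Sprintf("job-%d", i), URL: "https://example.com"})
    }

    pool.Close() // close jobs, wait for workers, then close results
    <-done       // wait until every result has been handled
}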

Testing Strategy

Unit Testing with Mocks

Create testable components using interfaces:

// internal/scraper/service_test.go
package scraper

import (
    "context"
    "io"
    "log"
    "net/http"
    "strings"
    "testing"

    "github.com/stretchr/testify/assert"
    "github.com/stretchr/testify/mock"
)

// Mock implementations of the Fetcher, Parser, and Storage interfaces.
type MockClient struct {
    mock.Mock
}

func (m *MockClient) Get(ctx context.Context, url string) (*http.Response, error) {
    args := m.Called(ctx, url)
    return args.Get(0).(*http.Response), args.Error(1)
}

type MockParser struct {
    mock.Mock
}

func (m *MockParser) Parse(html string) (*ScrapedData, error) {
    args := m.Called(html)
    return args.Get(0).(*ScrapedData), args.Error(1)
}

type MockStorage struct {
    mock.Mock
}

func (m *MockStorage) Store(ctx context.Context, data *ScrapedData) error {
    args := m.Called(ctx, data)
    return args.Error(0)
}

func (m *MockStorage) Get(ctx context.Context, id string) (*ScrapedData, error) {
    args := m.Called(ctx, id)
    return args.Get(0).(*ScrapedData), args.Error(1)
}

func TestService_ProcessJob(t *testing.T) {
    // Setup mocks
    mockClient := new(MockClient)
    mockParser := new(MockParser)
    mockStorage := new(MockStorage)

    service := NewService(mockClient, mockParser, mockStorage, nil, log.New(io.Discard, "", 0))

    job := &ScrapingJob{
        ID:  "test-job",
        URL: "https://example.com",
    }

    // Setup expectations
    resp := &http.Response{
        StatusCode: http.StatusOK,
        Body:       io.NopCloser(strings.NewReader("<html><title>Example</title></html>")),
    }
    mockClient.On("Get", mock.Anything, "https://example.com").Return(resp, nil)
    mockParser.On("Parse", mock.Anything).Return(&ScrapedData{URL: "https://example.com"}, nil)
    mockStorage.On("Store", mock.Anything, mock.Anything).Return(nil)

    err := service.ProcessJob(context.Background(), job)
    assert.NoError(t, err)

    mockClient.AssertExpectations(t)
    mockParser.AssertExpectations(t)
    mockStorage.AssertExpectations(t)
}

Configuration and Environment Management

Environment-Specific Configurations

Use configuration files for different environments:

# configs/config.yaml
server:
  port: 8080
  read_timeout: 30s
  write_timeout: 30s

database:
  host: localhost
  port: 5432
  name: scraper_db
  user: scraper
  password: password

scraper:
  user_agent: "Go-Scraper/1.0"
  timeout: 30s
  max_retries: 3
  rate_limit: 10
  concurrent_max: 50

redis:
  host: localhost
  port: 6379
  db: 0
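
One simple way to pick between these files is an environment variable checked at startup. A minimal sketch, assuming an APP_ENV variable and the file naming shown above (the PathForEnv helper is hypothetical):

// internal/config/env.go (hypothetical): choose the config file by environment
package config

import "os"

// PathForEnv picks the config file based on the APP_ENV environment variable.
func PathForEnv() string {
    if os.Getenv("APP_ENV") == "prod" {
        return "configs/config.prod.yaml"
    }
    return "configs/config.yaml"
}

The entry point can then call config.Load(config.PathForEnv()) instead of hard-coding a path.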

Deployment and Operations

Makefile for Build Automation

Create a Makefile for common operations:

# Makefile
.PHONY: build test lint clean docker-build run deps

BINARY_NAME=scraper
VERSION=$(shell git describe --tags --always --dirty)

build:
    go build -ldflags="-X main.version=$(VERSION)" -o bin/$(BINARY_NAME) cmd/scraper/main.go

test:
    go test -v ./...

lint:
    golangci-lint run

clean:
    go clean
    rm -rf bin/

docker-build:
    docker build -t $(BINARY_NAME):$(VERSION) .

run:
    go run cmd/scraper/main.go

deps:
    go mod download
    go mod tidy

Docker Configuration

Create a production-ready Dockerfile:

# Build stage
FROM golang:1.21-alpine AS builder

WORKDIR /app
COPY go.mod go.sum ./
RUN go mod download

COPY . .
RUN CGO_ENABLED=0 GOOS=linux go build -a -installsuffix cgo -o scraper cmd/scraper/main.go

# Production stage
FROM alpine:latest

RUN apk --no-cache add ca-certificates
WORKDIR /root/

COPY --from=builder /app/scraper .
COPY --from=builder /app/configs ./configs

CMD ["./scraper"]

Performance Optimization Tips

  1. Use Connection Pooling: Configure HTTP client with appropriate connection pool settings
  2. Implement Caching: Cache frequently accessed data using Redis or in-memory stores
  3. Rate Limiting: Respect target websites with proper rate limiting
  4. Monitoring: Implement metrics collection using Prometheus or similar tools
  5. Graceful Shutdown: Handle shutdown signals properly to avoid data loss (a minimal sketch follows this list)
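
For the last point, signal.NotifyContext from os/signal yields a context that is cancelled on SIGINT or SIGTERM, and the worker pool and HTTP client shown earlier already honor context cancellation. A minimal sketch, assuming the worker package from this guide:

// graceful shutdown sketch: cancel in-flight work on SIGINT/SIGTERM
package main

import (
    "context"
    "log"
    "os/signal"
    "syscall"

    "go-scraper/internal/worker" // module path is assumed
)

func main() {
    // ctx is cancelled when the process receives SIGINT or SIGTERM.
    ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)
    defer stop()

    pool := worker.NewPool(10)
    pool.Start(ctx, func(job worker.Job) worker.Result {
        // Real scraping work goes here; it should also honor ctx.
        return worker.Result{JobID: job.ID}
    })

    // ... submit jobs and drain results as shown in the worker pool example ...

    <-ctx.Done() // block until a shutdown signal arrives
    log.Println("shutdown signal received, draining workers")

    pool.Close() // stop accepting jobs and wait for in-flight work to finish
}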

Integration with External Services

When building complex scraping applications, you might need to integrate with various external services and APIs. For JavaScript-heavy websites, consider handling dynamic content that loads after page load using headless browsers, which can be orchestrated from your Go application.

For applications requiring browser automation capabilities, you can run multiple pages in parallel with Puppeteer and coordinate these operations through your Go service layer.

Conclusion

Structuring Go scraping projects with these best practices ensures maintainable, scalable, and robust applications. Focus on separation of concerns, proper error handling, comprehensive testing, and clear interfaces. This foundation will serve you well as your scraping requirements grow and evolve.

Remember to always respect robots.txt files, implement proper rate limiting, and consider the legal implications of web scraping in your jurisdiction. Following these architectural patterns will help you build professional-grade scraping applications that can handle production workloads effectively.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"

