What are the Best Practices for Structuring Go Scraping Projects?
How you structure a Go web scraping project from the start determines how easy it is to maintain, scale, and collaborate on. A well-organized layout makes it simpler to add features, debug issues, and grow your scraping operations. This guide covers the essential practices for organizing Go scraping projects.
Project Directory Structure
Standard Go Project Layout
Following Go's conventional project structure helps other developers understand your codebase quickly:
go-scraper/
├── cmd/
│   ├── scraper/
│   │   └── main.go
│   └── worker/
│       └── main.go
├── internal/
│   ├── config/
│   │   └── config.go
│   ├── scraper/
│   │   ├── client.go
│   │   ├── parser.go
│   │   └── models.go
│   ├── storage/
│   │   ├── database.go
│   │   └── file.go
│   └── queue/
│       └── redis.go
├── pkg/
│   ├── utils/
│   │   └── retry.go
│   └── validator/
│       └── url.go
├── configs/
│   ├── config.yaml
│   └── config.prod.yaml
├── scripts/
│   └── deploy.sh
├── tests/
│   └── integration/
├── go.mod
├── go.sum
├── Makefile
└── README.md
Directory Purpose Explanation
- cmd/: Main applications for the project; each subdirectory builds to its own executable.
- internal/: Private application and library code that other projects cannot import (the Go toolchain enforces this).
- pkg/: Library code that external applications may import (see the validator sketch after this list).
- configs/: Configuration file templates and default configs.
- scripts/: Scripts for building, installing, analysis, etc.
- tests/: Additional external test apps and test data.
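As an example of the kind of small, dependency-light code that belongs in pkg/, here is a minimal sketch of the URL validator referenced in the tree above (pkg/validator/url.go). The exact validation rules are an assumption; adjust them to your own scraping policy.
// pkg/validator/url.go
package validator

import (
	"fmt"
	"net/url"
)

// ValidateURL checks that a target is an absolute http(s) URL before it is queued.
func ValidateURL(raw string) error {
	u, err := url.Parse(raw)
	if err != nil {
		return fmt.Errorf("invalid URL %q: %w", raw, err)
	}
	if u.Scheme != "http" && u.Scheme != "https" {
		return fmt.Errorf("unsupported scheme %q in URL %q", u.Scheme, raw)
	}
	if u.Host == "" {
		return fmt.Errorf("missing host in URL %q", raw)
	}
	return nil
}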
Core Components Architecture
1. Configuration Management
Create a centralized configuration system:
// internal/config/config.go
package config
import (
"fmt"
"time"
"github.com/spf13/viper"
)
type Config struct {
Server ServerConfig `mapstructure:"server"`
Database DatabaseConfig `mapstructure:"database"`
Scraper ScraperConfig `mapstructure:"scraper"`
Redis RedisConfig `mapstructure:"redis"`
}
type ServerConfig struct {
Port int `mapstructure:"port"`
ReadTimeout time.Duration `mapstructure:"read_timeout"`
WriteTimeout time.Duration `mapstructure:"write_timeout"`
}
type ScraperConfig struct {
UserAgent string `mapstructure:"user_agent"`
Timeout time.Duration `mapstructure:"timeout"`
MaxRetries int `mapstructure:"max_retries"`
RateLimit int `mapstructure:"rate_limit"`
ConcurrentMax int `mapstructure:"concurrent_max"`
}
func Load(configPath string) (*Config, error) {
viper.SetConfigFile(configPath)
viper.AutomaticEnv()
if err := viper.ReadInConfig(); err != nil {
return nil, fmt.Errorf("error reading config file: %w", err)
}
var config Config
if err := viper.Unmarshal(&config); err != nil {
return nil, fmt.Errorf("error unmarshaling config: %w", err)
}
return &config, nil
}
2. HTTP Client Abstraction
Create a reusable HTTP client with proper configuration:
// internal/scraper/client.go
package scraper

import (
	"context"
	"net/http"
	"time"

	"golang.org/x/time/rate"
)

// ClientConfig holds the settings the client needs; in practice these map
// directly from config.ScraperConfig.
type ClientConfig struct {
	UserAgent  string
	Timeout    time.Duration
	MaxRetries int
	RateLimit  int
}

type Client struct {
	httpClient  *http.Client
	userAgent   string
	rateLimiter *rate.Limiter
	maxRetries  int
}

func NewClient(config ClientConfig) *Client {
	return &Client{
		httpClient: &http.Client{
			Timeout: config.Timeout,
			Transport: &http.Transport{
				MaxIdleConns:        100,
				MaxIdleConnsPerHost: 10,
				IdleConnTimeout:     90 * time.Second,
			},
		},
		userAgent:   config.UserAgent,
		rateLimiter: rate.NewLimiter(rate.Limit(config.RateLimit), 1),
		maxRetries:  config.MaxRetries,
	}
}

func (c *Client) Get(ctx context.Context, url string) (*http.Response, error) {
	// Wait for the rate limiter before every request
	if err := c.rateLimiter.Wait(ctx); err != nil {
		return nil, err
	}
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("User-Agent", c.userAgent)
	// maxRetries can be honoured by wrapping this call in a retry helper (see below).
	return c.httpClient.Do(req)
}
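The maxRetries setting above is easiest to apply through a small, reusable retry helper, which is what the pkg/utils/retry.go file in the tree is for. A minimal sketch using simple exponential backoff; the helper name and backoff values are assumptions, not a fixed API:
// pkg/utils/retry.go
package utils

import (
	"context"
	"fmt"
	"time"
)

// Do runs fn up to attempts times, doubling the delay between tries,
// and stops early if the context is cancelled.
func Do(ctx context.Context, attempts int, baseDelay time.Duration, fn func() error) error {
	var err error
	delay := baseDelay
	for i := 0; i < attempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
		select {
		case <-time.After(delay):
			delay *= 2
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return fmt.Errorf("all %d attempts failed: %w", attempts, err)
}
The client's Get method can then wrap httpClient.Do in utils.Do with c.maxRetries attempts instead of calling it directly.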
3. Data Models and Structures
Define clear data structures for scraped content:
// internal/scraper/models.go
package scraper
import (
"time"
)
type ScrapingJob struct {
ID string `json:"id"`
URL string `json:"url"`
Status JobStatus `json:"status"`
Priority int `json:"priority"`
Metadata map[string]string `json:"metadata"`
CreatedAt time.Time `json:"created_at"`
CompletedAt *time.Time `json:"completed_at,omitempty"`
Error string `json:"error,omitempty"`
}
type JobStatus string
const (
JobStatusPending JobStatus = "pending"
JobStatusRunning JobStatus = "running"
JobStatusCompleted JobStatus = "completed"
JobStatusFailed JobStatus = "failed"
)
type ScrapedData struct {
URL string `json:"url"`
Title string `json:"title"`
Content string `json:"content"`
Links []string `json:"links"`
Images []string `json:"images"`
Metadata map[string]string `json:"metadata"`
ScrapedAt time.Time `json:"scraped_at"`
}
type ScrapeResult struct {
Data *ScrapedData `json:"data,omitempty"`
Error error `json:"error,omitempty"`
}
Modular Design Patterns
1. Interface-Based Architecture
Use interfaces to make components easily testable and replaceable:
// internal/scraper/interfaces.go
package scraper
import "context"
type Scraper interface {
Scrape(ctx context.Context, url string) (*ScrapedData, error)
}
type Parser interface {
Parse(html string) (*ScrapedData, error)
}
type Storage interface {
Store(ctx context.Context, data *ScrapedData) error
Get(ctx context.Context, id string) (*ScrapedData, error)
}
type Queue interface {
Push(ctx context.Context, job *ScrapingJob) error
Pop(ctx context.Context) (*ScrapingJob, error)
}
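The Parser interface can be satisfied by an HTML parser in internal/scraper/parser.go (the file shown in the project tree). A sketch using goquery, which is one common choice; the HTMLParser name and the selectors are assumptions to adapt to your targets:
// internal/scraper/parser.go
package scraper

import (
	"strings"
	"time"

	"github.com/PuerkitoBio/goquery"
)

type HTMLParser struct{}

func (p *HTMLParser) Parse(html string) (*ScrapedData, error) {
	doc, err := goquery.NewDocumentFromReader(strings.NewReader(html))
	if err != nil {
		return nil, err
	}
	data := &ScrapedData{
		Title:     strings.TrimSpace(doc.Find("title").First().Text()),
		Content:   strings.TrimSpace(doc.Find("body").Text()),
		Metadata:  map[string]string{},
		ScrapedAt: time.Now(),
	}
	// Collect link and image URLs as-is; resolving them against the page URL is left to the caller.
	doc.Find("a[href]").Each(func(_ int, s *goquery.Selection) {
		if href, ok := s.Attr("href"); ok {
			data.Links = append(data.Links, href)
		}
	})
	doc.Find("img[src]").Each(func(_ int, s *goquery.Selection) {
		if src, ok := s.Attr("src"); ok {
			data.Images = append(data.Images, src)
		}
	})
	return data, nil
}
Note that Parse only sees raw HTML, so the caller (the service below) is responsible for stamping the URL on the result.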
2. Service Layer Implementation
Implement a service layer that coordinates the components. Defining a small Fetcher interface at the consumer keeps the HTTP client mockable in tests:
// internal/scraper/service.go
package scraper

import (
	"context"
	"fmt"
	"io"
	"log"
	"net/http"
)

// Fetcher abstracts the HTTP client so the service can be unit-tested with mocks.
// *Client satisfies it.
type Fetcher interface {
	Get(ctx context.Context, url string) (*http.Response, error)
}

type Service struct {
	client  Fetcher
	parser  Parser
	storage Storage
	queue   Queue
	logger  *log.Logger
}

func NewService(client Fetcher, parser Parser, storage Storage, queue Queue, logger *log.Logger) *Service {
	return &Service{
		client:  client,
		parser:  parser,
		storage: storage,
		queue:   queue,
		logger:  logger,
	}
}

func (s *Service) ProcessJob(ctx context.Context, job *ScrapingJob) error {
	s.logger.Printf("Processing job %s for URL: %s", job.ID, job.URL)

	// Fetch the page
	resp, err := s.client.Get(ctx, job.URL)
	if err != nil {
		return fmt.Errorf("failed to fetch URL %s: %w", job.URL, err)
	}
	defer resp.Body.Close()

	// Read and parse the content
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return fmt.Errorf("failed to read response body: %w", err)
	}
	data, err := s.parser.Parse(string(body))
	if err != nil {
		return fmt.Errorf("failed to parse content: %w", err)
	}
	data.URL = job.URL // record which page the data came from

	// Store the data
	if err := s.storage.Store(ctx, data); err != nil {
		return fmt.Errorf("failed to store data: %w", err)
	}

	s.logger.Printf("Successfully processed job %s", job.ID)
	return nil
}
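For completeness, cmd/scraper/main.go from the tree above just wires these pieces together. A minimal sketch, assuming the module path go-scraper, the ClientConfig shown earlier, and the HTMLParser sketch above; the in-memory storage stand-in only exists to keep the example self-contained:
// cmd/scraper/main.go
package main

import (
	"context"
	"flag"
	"log"

	"go-scraper/internal/config"
	"go-scraper/internal/scraper"
)

// memoryStorage is a stand-in so this sketch compiles on its own;
// real implementations live in internal/storage.
type memoryStorage struct{ items map[string]*scraper.ScrapedData }

func (m *memoryStorage) Store(ctx context.Context, d *scraper.ScrapedData) error {
	m.items[d.URL] = d
	return nil
}

func (m *memoryStorage) Get(ctx context.Context, id string) (*scraper.ScrapedData, error) {
	return m.items[id], nil
}

func main() {
	configPath := flag.String("config", "configs/config.yaml", "path to config file")
	flag.Parse()

	cfg, err := config.Load(*configPath)
	if err != nil {
		log.Fatalf("loading config: %v", err)
	}

	client := scraper.NewClient(scraper.ClientConfig{
		UserAgent:  cfg.Scraper.UserAgent,
		Timeout:    cfg.Scraper.Timeout,
		MaxRetries: cfg.Scraper.MaxRetries,
		RateLimit:  cfg.Scraper.RateLimit,
	})
	store := &memoryStorage{items: map[string]*scraper.ScrapedData{}}

	// The queue is omitted (nil) here because ProcessJob does not use it.
	svc := scraper.NewService(client, &scraper.HTMLParser{}, store, nil, log.Default())

	job := &scraper.ScrapingJob{ID: "example", URL: "https://example.com"}
	if err := svc.ProcessJob(context.Background(), job); err != nil {
		log.Fatalf("processing job: %v", err)
	}
}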
Error Handling and Logging
Structured Error Handling
Implement comprehensive error handling with proper error types:
// pkg/errors/errors.go
package errors

import "fmt"

type ScrapingError struct {
	Type    ErrorType
	URL     string
	Message string
	Err     error
}

type ErrorType string

const (
	ErrorTypeNetwork    ErrorType = "network"
	ErrorTypeParsing    ErrorType = "parsing"
	ErrorTypeValidation ErrorType = "validation"
	ErrorTypeStorage    ErrorType = "storage"
)

func (e *ScrapingError) Error() string {
	if e.Err != nil {
		return fmt.Sprintf("%s error for URL %s: %s (%v)", e.Type, e.URL, e.Message, e.Err)
	}
	return fmt.Sprintf("%s error for URL %s: %s", e.Type, e.URL, e.Message)
}

// Unwrap lets the standard errors.Is and errors.As inspect the wrapped error.
func (e *ScrapingError) Unwrap() error {
	return e.Err
}

func NewNetworkError(url, message string, err error) *ScrapingError {
	return &ScrapingError{
		Type:    ErrorTypeNetwork,
		URL:     url,
		Message: message,
		Err:     err,
	}
}
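Because the error carries a Type and interoperates with the standard errors package, callers can branch on failure categories, for example to decide whether a job is worth retrying. A brief, illustrative sketch; the module path go-scraper and the isRetryable helper are assumptions:
// Example caller-side handling (illustrative).
package main

import (
	"errors"
	"fmt"

	scrapeerr "go-scraper/pkg/errors"
)

// isRetryable reports whether a failure is a network error, which is usually worth retrying.
func isRetryable(err error) bool {
	var se *scrapeerr.ScrapingError
	return errors.As(err, &se) && se.Type == scrapeerr.ErrorTypeNetwork
}

func main() {
	err := fmt.Errorf("job failed: %w",
		scrapeerr.NewNetworkError("https://example.com", "connection reset", errors.New("EOF")))
	fmt.Println(isRetryable(err)) // true: the ScrapingError is found through the wrap chain
}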
Structured Logging
Use structured logging for better debugging:
// internal/logger/logger.go
package logger
import (
"log/slog"
"os"
)
func New(level string) *slog.Logger {
var logLevel slog.Level
switch level {
case "debug":
logLevel = slog.LevelDebug
case "info":
logLevel = slog.LevelInfo
case "warn":
logLevel = slog.LevelWarn
case "error":
logLevel = slog.LevelError
default:
logLevel = slog.LevelInfo
}
opts := &slog.HandlerOptions{
Level: logLevel,
}
handler := slog.NewJSONHandler(os.Stdout, opts)
return slog.New(handler)
}
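With a *slog.Logger in hand, scraping events can be recorded as structured key/value pairs, which makes them easy to filter and aggregate later. A brief usage sketch; the job values are illustrative:
// Example usage (illustrative job values).
package main

import "go-scraper/internal/logger"

func main() {
	log := logger.New("info")
	// Key/value attributes instead of formatted strings.
	log.Info("job completed",
		"job_id", "abc-123",
		"url", "https://example.com",
		"duration_ms", 842,
	)
	log.Error("fetch failed", "url", "https://example.com", "error", "connection reset")
}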
Concurrency and Performance
Worker Pool Pattern
Implement a worker pool for concurrent scraping:
// internal/worker/pool.go
package worker
import (
"context"
"sync"
)
type Pool struct {
workers int
jobChan chan Job
resultChan chan Result
wg sync.WaitGroup
}
type Job struct {
ID string
URL string
}
type Result struct {
JobID string
Data interface{}
Error error
}
func NewPool(workers int) *Pool {
return &Pool{
workers: workers,
jobChan: make(chan Job, workers*2),
resultChan: make(chan Result, workers*2),
}
}
func (p *Pool) Start(ctx context.Context, processor func(Job) Result) {
for i := 0; i < p.workers; i++ {
p.wg.Add(1)
go func() {
defer p.wg.Done()
for {
select {
case job, ok := <-p.jobChan:
if !ok {
return
}
result := processor(job)
select {
case p.resultChan <- result:
case <-ctx.Done():
return
}
case <-ctx.Done():
return
}
}
}()
}
}
func (p *Pool) Submit(job Job) {
p.jobChan <- job
}
func (p *Pool) Results() <-chan Result {
return p.resultChan
}
func (p *Pool) Close() {
close(p.jobChan)
p.wg.Wait()
close(p.resultChan)
}
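Putting the pool to work means starting it with a processor function, submitting jobs, closing it once everything is queued, and draining the results channel. A minimal usage sketch; the URLs and processor body are placeholders, and the module path go-scraper is an assumption:
// Example usage of the worker pool (illustrative processor).
package main

import (
	"context"
	"fmt"

	"go-scraper/internal/worker"
)

func main() {
	ctx := context.Background()
	pool := worker.NewPool(5)

	pool.Start(ctx, func(j worker.Job) worker.Result {
		// Real code would call the scraper service here.
		return worker.Result{JobID: j.ID, Data: "fetched " + j.URL}
	})

	urls := []string{"https://example.com/a", "https://example.com/b"}
	go func() {
		for i, u := range urls {
			pool.Submit(worker.Job{ID: fmt.Sprint(i), URL: u})
		}
		pool.Close() // no more jobs; lets workers drain and closes Results
	}()

	for res := range pool.Results() {
		if res.Error != nil {
			fmt.Println("job failed:", res.JobID, res.Error)
			continue
		}
		fmt.Println("job done:", res.JobID, res.Data)
	}
}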
Testing Strategy
Unit Testing with Mocks
Create testable components using interfaces:
// internal/scraper/service_test.go
package scraper

import (
	"context"
	"io"
	"log"
	"net/http"
	"strings"
	"testing"

	"github.com/stretchr/testify/assert"
	"github.com/stretchr/testify/mock"
)

type MockClient struct{ mock.Mock }

func (m *MockClient) Get(ctx context.Context, url string) (*http.Response, error) {
	args := m.Called(ctx, url)
	return args.Get(0).(*http.Response), args.Error(1)
}

type MockParser struct{ mock.Mock }

func (m *MockParser) Parse(html string) (*ScrapedData, error) {
	args := m.Called(html)
	return args.Get(0).(*ScrapedData), args.Error(1)
}

type MockStorage struct{ mock.Mock }

func (m *MockStorage) Store(ctx context.Context, data *ScrapedData) error {
	return m.Called(ctx, data).Error(0)
}

func (m *MockStorage) Get(ctx context.Context, id string) (*ScrapedData, error) {
	args := m.Called(ctx, id)
	return args.Get(0).(*ScrapedData), args.Error(1)
}

func TestService_ProcessJob(t *testing.T) {
	// Set up mocks for every dependency ProcessJob touches
	mockClient := new(MockClient)
	mockParser := new(MockParser)
	mockStorage := new(MockStorage)
	service := NewService(mockClient, mockParser, mockStorage, nil, log.New(io.Discard, "", 0))

	job := &ScrapingJob{
		ID:  "test-job",
		URL: "https://example.com",
	}

	// The mocked response needs a real body the service can read and close
	resp := &http.Response{Body: io.NopCloser(strings.NewReader("<html></html>"))}
	mockClient.On("Get", mock.Anything, "https://example.com").Return(resp, nil)
	mockParser.On("Parse", mock.Anything).Return(&ScrapedData{Title: "example"}, nil)
	mockStorage.On("Store", mock.Anything, mock.Anything).Return(nil)

	err := service.ProcessJob(context.Background(), job)

	assert.NoError(t, err)
	mockClient.AssertExpectations(t)
	mockParser.AssertExpectations(t)
	mockStorage.AssertExpectations(t)
}
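Alongside mock-based unit tests, the tests/integration directory can hold tests that exercise the real Client against a local httptest server, so no external site is ever hit. A brief sketch; the module path go-scraper is an assumption:
// tests/integration/client_test.go
package integration

import (
	"context"
	"io"
	"net/http"
	"net/http/httptest"
	"testing"
	"time"

	"go-scraper/internal/scraper"
)

func TestClientGet(t *testing.T) {
	// A local server stands in for a real target site.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		io.WriteString(w, "<html><title>ok</title></html>")
	}))
	defer srv.Close()

	client := scraper.NewClient(scraper.ClientConfig{
		UserAgent:  "Go-Scraper-Test/1.0",
		Timeout:    5 * time.Second,
		MaxRetries: 1,
		RateLimit:  100,
	})

	resp, err := client.Get(context.Background(), srv.URL)
	if err != nil {
		t.Fatalf("Get returned error: %v", err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	if resp.StatusCode != http.StatusOK || len(body) == 0 {
		t.Fatalf("unexpected response: status=%d body=%q", resp.StatusCode, body)
	}
}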
Configuration and Environment Management
Environment-Specific Configurations
Use configuration files for different environments:
# configs/config.yaml
server:
  port: 8080
  read_timeout: 30s
  write_timeout: 30s
database:
  host: localhost
  port: 5432
  name: scraper_db
  user: scraper
  password: password
scraper:
  user_agent: "Go-Scraper/1.0"
  timeout: 30s
  max_retries: 3
  rate_limit: 10
  concurrent_max: 50
redis:
  host: localhost
  port: 6379
  db: 0
Deployment and Operations
Makefile for Build Automation
Create a Makefile for common operations:
# Makefile
# Note: recipe lines must be indented with a tab character.
.PHONY: build test lint clean docker-build run deps

BINARY_NAME=scraper
VERSION=$(shell git describe --tags --always --dirty)

build:
	go build -ldflags="-X main.version=$(VERSION)" -o bin/$(BINARY_NAME) cmd/scraper/main.go

test:
	go test -v ./...

lint:
	golangci-lint run

clean:
	go clean
	rm -rf bin/

docker-build:
	docker build -t $(BINARY_NAME):$(VERSION) .

run:
	go run cmd/scraper/main.go

deps:
	go mod download
	go mod tidy
Docker Configuration
Create a production-ready Dockerfile:
# Build stage
FROM golang:1.21-alpine AS builder
WORKDIR /app
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 GOOS=linux go build -a -installsuffix cgo -o scraper cmd/scraper/main.go
# Production stage
FROM alpine:latest
RUN apk --no-cache add ca-certificates
WORKDIR /root/
COPY --from=builder /app/scraper .
COPY --from=builder /app/configs ./configs
CMD ["./scraper"]
Performance Optimization Tips
- Use Connection Pooling: Configure HTTP client with appropriate connection pool settings
- Implement Caching: Cache frequently accessed data using Redis or in-memory stores
- Rate Limiting: Respect target websites with proper rate limiting
- Monitoring: Implement metrics collection using Prometheus or similar tools
- Graceful Shutdown: Handle shutdown signals properly to avoid data loss (see the sketch below)
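A minimal graceful shutdown sketch using signal.NotifyContext: cancelling the shared context stops the workers, and closing the pool lets in-flight jobs finish before the process exits. The wiring around the worker pool is illustrative, and the module path go-scraper is an assumption:
// Graceful shutdown sketch for cmd/worker/main.go (illustrative wiring).
package main

import (
	"context"
	"log"
	"os"
	"os/signal"
	"syscall"

	"go-scraper/internal/worker"
)

func main() {
	// ctx is cancelled on SIGINT or SIGTERM.
	ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
	defer stop()

	pool := worker.NewPool(10)
	pool.Start(ctx, func(j worker.Job) worker.Result {
		// Real code would scrape j.URL here.
		return worker.Result{JobID: j.ID}
	})

	<-ctx.Done() // block until a shutdown signal arrives
	log.Println("shutdown signal received, draining workers...")
	pool.Close() // stop accepting jobs and wait for in-flight work
	log.Println("shutdown complete")
}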
Integration with External Services
When building complex scraping applications, you might need to integrate with various external services and APIs. For JavaScript-heavy websites, consider handling dynamic content that loads after page load using headless browsers, which can be orchestrated from your Go application.
For applications requiring browser automation capabilities, you can run multiple pages in parallel with Puppeteer and coordinate these operations through your Go service layer.
Conclusion
Structuring Go scraping projects with these best practices ensures maintainable, scalable, and robust applications. Focus on separation of concerns, proper error handling, comprehensive testing, and clear interfaces. This foundation will serve you well as your scraping requirements grow and evolve.
Remember to always respect robots.txt files, implement proper rate limiting, and consider the legal implications of web scraping in your jurisdiction. Following these architectural patterns will help you build professional-grade scraping applications that can handle production workloads effectively.