```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: scraper-config
data:
  config.yaml: |
    targets:
      - url: "https://example.com"
        delay: "2s"
        timeout: "30s"
```
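The scraper can consume this file from wherever the ConfigMap is mounted in the Pod. Below is a minimal sketch of loading it at startup, assuming a mount path of `/etc/scraper/config.yaml` and the `gopkg.in/yaml.v3` package; the struct and field names are illustrative, not part of the original setup.

```go
package main

import (
    "log"
    "os"

    "gopkg.in/yaml.v3"
)

// Target and Config mirror the keys in config.yaml above; durations are kept
// as strings and can be parsed with time.ParseDuration where needed.
type Target struct {
    URL     string `yaml:"url"`
    Delay   string `yaml:"delay"`
    Timeout string `yaml:"timeout"`
}

type Config struct {
    Targets []Target `yaml:"targets"`
}

func loadConfig(path string) (*Config, error) {
    data, err := os.ReadFile(path)
    if err != nil {
        return nil, err
    }
    var cfg Config
    if err := yaml.Unmarshal(data, &cfg); err != nil {
        return nil, err
    }
    return &cfg, nil
}

func main() {
    // Assumes the ConfigMap is mounted at /etc/scraper inside the container
    cfg, err := loadConfig("/etc/scraper/config.yaml")
    if err != nil {
        log.Fatalf("failed to load config: %v", err)
    }
    log.Printf("loaded %d scrape targets", len(cfg.Targets))
}
```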
Best Practices for Dockerized Colly Applications
1. Use Non-Root Users
```dockerfile
RUN adduser -D -s /bin/sh scraper
USER scraper
```
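For context, here is a hedged sketch of where these lines might sit in a multi-stage Dockerfile; the Go version, image tags, and paths are assumptions rather than requirements.

```dockerfile
# Build stage
FROM golang:1.22-alpine AS build
WORKDIR /src
COPY . .
RUN go build -o /scraper .

# Runtime stage: small base image, unprivileged user
FROM alpine:3.20
RUN apk add --no-cache ca-certificates
COPY --from=build /scraper /usr/local/bin/scraper
RUN adduser -D -s /bin/sh scraper
USER scraper
ENTRYPOINT ["/usr/local/bin/scraper"]
```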
2. Handle Graceful Shutdowns
```go
package main

import (
    "context"
    "log"
    "os"
    "os/signal"
    "syscall"
)

func main() {
    // Set up signal handling
    sigChan := make(chan os.Signal, 1)
    signal.Notify(sigChan, syscall.SIGINT, syscall.SIGTERM)

    // Create a context that is cancelled on shutdown
    ctx, cancel := context.WithCancel(context.Background())
    defer cancel()

    // Handle the shutdown signal
    go func() {
        <-sigChan
        log.Println("Received shutdown signal, gracefully stopping...")
        cancel()
    }()

    // Run the scraper with the cancellable context
    runScraper(ctx)
}
```
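The `runScraper` function isn't shown in full here; the sketch below is one way it could respect cancellation, assuming the `github.com/gocolly/colly/v2` package and a `TARGET_URL` environment variable. Colly has no built-in context support, so the pattern checks the context in `OnRequest` and aborts new requests once shutdown begins.

```go
// A possible runScraper; URL source and callbacks are illustrative.
func runScraper(ctx context.Context) {
    c := colly.NewCollector()

    // Abort new requests once the context has been cancelled
    c.OnRequest(func(r *colly.Request) {
        select {
        case <-ctx.Done():
            log.Println("shutdown in progress, aborting request to", r.URL)
            r.Abort()
        default:
        }
    })

    c.OnHTML("a[href]", func(e *colly.HTMLElement) {
        log.Println("found link:", e.Attr("href"))
    })

    if err := c.Visit(os.Getenv("TARGET_URL")); err != nil {
        log.Println("visit failed:", err)
    }
}
```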
3. Implement Health Checks
```go
import (
    "fmt"
    "net/http"
    "time"
)

func healthCheck() error {
    // Simple HTTP request to verify outbound connectivity
    client := &http.Client{Timeout: 5 * time.Second}
    resp, err := client.Get("https://httpbin.org/status/200")
    if err != nil {
        return err
    }
    defer resp.Body.Close()

    if resp.StatusCode != http.StatusOK {
        return fmt.Errorf("health check failed with status %d", resp.StatusCode)
    }
    return nil
}
```
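How the check is exposed depends on the deployment. One option, sketched below, is a small `/healthz` HTTP endpoint that a Docker `HEALTHCHECK` or Kubernetes probe can call; the port and path are assumptions, not part of the original setup.

```go
// Serve healthCheck on /healthz (port 8080 is an assumed choice)
func startHealthEndpoint() {
    http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
        if err := healthCheck(); err != nil {
            http.Error(w, err.Error(), http.StatusServiceUnavailable)
            return
        }
        w.WriteHeader(http.StatusOK)
    })
    go func() {
        if err := http.ListenAndServe(":8080", nil); err != nil {
            log.Println("health endpoint stopped:", err)
        }
    }()
}
```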
Monitoring and Logging
Configure proper logging for containerized environments:
```go
import (
    "os"

    "github.com/sirupsen/logrus"
)

func setupLogging() {
    // Use JSON format for better log aggregation
    logrus.SetFormatter(&logrus.JSONFormatter{})

    // Set the log level from the LOG_LEVEL environment variable,
    // defaulting to info if it is unset or invalid
    level, err := logrus.ParseLevel(os.Getenv("LOG_LEVEL"))
    if err != nil {
        level = logrus.InfoLevel
    }
    logrus.SetLevel(level)
}
```
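With the JSON formatter in place, per-request fields can be attached inside Colly's callbacks so each log line is machine-parseable. The field names below are illustrative, and `c` is assumed to be the collector.

```go
c.OnResponse(func(r *colly.Response) {
    logrus.WithFields(logrus.Fields{
        "url":    r.Request.URL.String(),
        "status": r.StatusCode,
        "bytes":  len(r.Body),
    }).Info("page fetched")
})
```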
Scaling with Docker Swarm
Deploy your scraper across multiple nodes using Docker Swarm:
```yaml
version: '3.8'

services:
  colly-scraper:
    image: colly-scraper:latest
    deploy:
      replicas: 5
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3
      resources:
        limits:
          memory: 128M
        reservations:
          memory: 64M
    environment:
      - TARGET_URL=https://example.com
      - SCRAPER_DELAY=3
    networks:
      - overlay-network

networks:
  overlay-network:
    driver: overlay
```
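Each replica receives `TARGET_URL` and `SCRAPER_DELAY` from the environment. One way the scraper might apply the delay is through Colly's rate limiting; the fallback value and domain glob below are assumptions, and `c` is the collector.

```go
// Apply SCRAPER_DELAY (in seconds) as a per-domain crawl delay
delaySec, err := strconv.Atoi(os.Getenv("SCRAPER_DELAY"))
if err != nil || delaySec <= 0 {
    delaySec = 1 // assumed fallback when the variable is unset or invalid
}
if err := c.Limit(&colly.LimitRule{
    DomainGlob: "*",
    Delay:      time.Duration(delaySec) * time.Second,
}); err != nil {
    log.Fatal(err)
}
```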
Conclusion
Containerizing Colly scrapers with Docker provides numerous benefits, including consistent deployments, easy scaling, and improved resource management. By following the patterns and examples in this guide, you can build robust, production-ready web scraping applications that take full advantage of Docker's containerization capabilities.
Similar to how Puppeteer can be used with Docker for browser-based scraping, Colly offers a lightweight alternative for scenarios where JavaScript execution isn't required. The containerization principles remain similar, but Colly's lower resource requirements make it ideal for high-throughput scraping scenarios in container orchestration platforms.
Remember to implement proper error handling, monitoring, and resource limits to ensure your containerized scrapers run reliably in production environments.