How do I use Colly with Docker containers?

If you orchestrate the containers with Kubernetes, externalize scraper settings in a ConfigMap:

apiVersion: v1
kind: ConfigMap
metadata:
  name: scraper-config
data:
  config.yaml: |
    targets:
      - url: "https://example.com"
        delay: "2s"
        timeout: "30s"
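
One way to consume that ConfigMap is to mount it into the scraper pod as a file; a sketch, where the Deployment name, image tag, and mount path are illustrative assumptions:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: colly-scraper
spec:
  replicas: 1
  selector:
    matchLabels:
      app: colly-scraper
  template:
    metadata:
      labels:
        app: colly-scraper
    spec:
      containers:
        - name: scraper
          image: colly-scraper:latest
          volumeMounts:
            - name: config
              mountPath: /etc/scraper   # config.yaml appears here
      volumes:
        - name: config
          configMap:
            name: scraper-config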

Best Practices for Dockerized Colly Applications

1. Use Non-Root Users

Run the scraper as an unprivileged user so a compromised process has no root privileges inside the container:

# Create a dedicated user (Alpine adduser syntax) and switch to it
RUN adduser -D -s /bin/sh scraper
USER scraper
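
For context, here is how those two lines might sit in a minimal multi-stage Dockerfile; the image tags, paths, and binary name are illustrative assumptions, not requirements of Colly:

# Build stage: compile a static scraper binary
FROM golang:1.22-alpine AS build
WORKDIR /src
COPY . .
RUN CGO_ENABLED=0 go build -o /app/scraper .

# Runtime stage: small image, non-root user
FROM alpine:3.20
RUN adduser -D -s /bin/sh scraper
COPY --from=build --chown=scraper /app/scraper /app/scraper
USER scraper
ENTRYPOINT ["/app/scraper"]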

2. Handle Graceful Shutdowns

package main

import (
    "context"
    "log"
    "os"
    "os/signal"
    "syscall"
)

func main() {
    // Set up signal handling
    sigChan := make(chan os.Signal, 1)
    signal.Notify(sigChan, syscall.SIGINT, syscall.SIGTERM)

    // Create a context that is cancelled on shutdown
    ctx, cancel := context.WithCancel(context.Background())
    defer cancel()

    // Handle the shutdown signal
    go func() {
        <-sigChan
        log.Println("Received shutdown signal, gracefully stopping...")
        cancel()
    }()

    // Run the scraper with the cancellable context
    runScraper(ctx)
}
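
The runScraper function referenced above is not part of Colly; a minimal sketch of what it might look like, assuming github.com/gocolly/colly/v2 is added to the imports (the target URL is a placeholder):

func runScraper(ctx context.Context) {
    c := colly.NewCollector(colly.Async(true))

    // Stop issuing new requests once the context is cancelled
    c.OnRequest(func(r *colly.Request) {
        select {
        case <-ctx.Done():
            r.Abort()
        default:
        }
    })

    c.OnHTML("a[href]", func(e *colly.HTMLElement) {
        log.Println("found link:", e.Attr("href"))
    })

    if err := c.Visit("https://example.com"); err != nil {
        log.Println("visit error:", err)
    }
    c.Wait() // let in-flight requests drain before returning
}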

3. Implement Health Checks

import (
    "fmt"
    "net/http"
    "time"
)

func healthCheck() error {
    // Simple HTTP request to verify outbound connectivity
    client := &http.Client{Timeout: 5 * time.Second}
    resp, err := client.Get("https://httpbin.org/status/200")
    if err != nil {
        return err
    }
    defer resp.Body.Close()

    if resp.StatusCode != http.StatusOK {
        return fmt.Errorf("health check failed with status %d", resp.StatusCode)
    }

    return nil
}
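
For Docker to act on the result, the check has to be reachable from a HEALTHCHECK instruction. One common pattern, sketched here (the -healthcheck flag and /app/scraper path are assumptions of this guide, not Colly features), is to run the same binary with a flag that exits non-zero on failure:

// In main(), before starting the scraper
if len(os.Args) > 1 && os.Args[1] == "-healthcheck" {
    if err := healthCheck(); err != nil {
        log.Fatal(err) // non-zero exit marks the container unhealthy
    }
    os.Exit(0)
}

Then reference it from the Dockerfile:

# Mark the container unhealthy after three consecutive failures
HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
    CMD ["/app/scraper", "-healthcheck"]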

Monitoring and Logging

Configure proper logging for containerized environments:

import (
    "os"

    "github.com/sirupsen/logrus"
)

func setupLogging() {
    // Use JSON format for better log aggregation
    logrus.SetFormatter(&logrus.JSONFormatter{})

    // Set the log level from the environment, defaulting to info
    level, err := logrus.ParseLevel(os.Getenv("LOG_LEVEL"))
    if err != nil {
        level = logrus.InfoLevel
    }
    logrus.SetLevel(level)
}
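
With JSON output enabled, structured fields keep per-request logs easy to filter in a log aggregator. A sketch of how that might look inside a Colly response callback (the field names are this guide's choice):

c.OnResponse(func(r *colly.Response) {
    logrus.WithFields(logrus.Fields{
        "url":    r.Request.URL.String(),
        "status": r.StatusCode,
        "bytes":  len(r.Body),
    }).Info("response received")
})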

Scaling with Docker Swarm

Deploy your scraper across multiple nodes using Docker Swarm:

version: '3.8'

services:
  colly-scraper:
    image: colly-scraper:latest
    deploy:
      replicas: 5
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3
      resources:
        limits:
          memory: 128M
        reservations:
          memory: 64M
    environment:
      - TARGET_URL=https://example.com
      - SCRAPER_DELAY=3
    networks:
      - overlay-network

networks:
  overlay-network:
    driver: overlay
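
Assuming the file above is saved as docker-stack.yml, deploy it to an initialized swarm and verify the replicas:

docker swarm init                                    # once, on the manager node
docker stack deploy -c docker-stack.yml colly-stack
docker service ls                                    # should show 5/5 replicas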

Conclusion

Containerizing Colly scrapers with Docker provides numerous benefits including consistent deployments, easy scaling, and improved resource management. By following the patterns and examples in this guide, you can build robust, production-ready web scraping applications that leverage Docker's containerization capabilities.

Similar to how Puppeteer can be used with Docker for browser-based scraping, Colly offers a lightweight alternative for scenarios where JavaScript execution isn't required. The containerization principles remain similar, but Colly's lower resource requirements make it ideal for high-throughput scraping scenarios in container orchestration platforms.

Remember to implement proper error handling, monitoring, and resource limits to ensure your containerized scrapers run reliably in production environments.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What%20is%20the%20main%20topic%3F&api_key=YOUR_API_KEY"

Extract structured data:

curl -g "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page%20title&fields[price]=Product%20price&api_key=YOUR_API_KEY"

The -g flag stops curl from interpreting the square brackets as glob patterns.
