Can Colly handle scraping multiple pages in parallel?

Yes, Colly handles parallel web scraping very well. The Go framework ships with an asynchronous mode and concurrency controls, so you can fetch many pages simultaneously while still respecting rate limits and the target server's capacity.

Key Features for Parallel Scraping

  • Async Mode: Enable with colly.Async(true) for non-blocking requests
  • Concurrency Control: Set parallelism limits using LimitRule
  • Rate Limiting: Built-in delays and domain-specific rules
  • Goroutine Management: Colly spawns and synchronizes worker goroutines for you; Wait() blocks until all requests finish

Basic Parallel Scraping Example

package main

import (
    "fmt"
    "time"

    "github.com/gocolly/colly"
)

func main() {
    // Create a collector with async support
    c := colly.NewCollector(
        colly.Async(true),
    )

    // Configure concurrency limits
    c.Limit(&colly.LimitRule{
        DomainGlob:  "*",
        Parallelism: 4,               // 4 concurrent requests
        Delay:       1 * time.Second, // 1 second between requests
    })

    // Set up data extraction callback
    c.OnHTML("title", func(e *colly.HTMLElement) {
        fmt.Printf("Title: %s | URL: %s\n", e.Text, e.Request.URL)
    })

    // Error handling
    c.OnError(func(r *colly.Response, err error) {
        fmt.Printf("Error: %s | URL: %s\n", err.Error(), r.Request.URL)
    })

    // URLs to scrape
    urls := []string{
        "https://example.com/page1",
        "https://example.com/page2",
        "https://example.com/page3",
        "https://example.com/page4",
    }

    // Queue the visits; in async mode Visit returns immediately
    for _, url := range urls {
        c.Visit(url)
    }

    // Block until all in-flight requests and callbacks finish
    c.Wait()
}

Advanced Configuration with Domain-Specific Limits

func setupDomainLimits(c *colly.Collector) {
    // Different limits per domain
    c.Limit(&colly.LimitRule{
        DomainGlob:  "*.fast-site.com",
        Parallelism: 8,
        Delay:       500 * time.Millisecond,
    })

    c.Limit(&colly.LimitRule{
        DomainGlob:  "*.slow-site.com", 
        Parallelism: 2,
        Delay:       2 * time.Second,
    })
}
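
To apply these rules, call the helper right after constructing the collector. A minimal usage sketch (the host names are placeholders):

c := colly.NewCollector(colly.Async(true))
setupDomainLimits(c)

c.Visit("https://news.fast-site.com/") // matched by the *.fast-site.com rule
c.Visit("https://data.slow-site.com/") // matched by the *.slow-site.com rule
c.Wait()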

Complete Example with Data Collection

package main

import (
    "encoding/json"
    "fmt"
    "log"
    "sync"
    "time"

    "github.com/gocolly/colly"
)

type PageData struct {
    URL   string `json:"url"`
    Title string `json:"title"`
    Links int    `json:"link_count"`
}

func main() {
    c := colly.NewCollector(colly.Async(true))

    // Configure for parallel scraping
    c.Limit(&colly.LimitRule{
        DomainGlob:  "*",
        Parallelism: 3,
        Delay:       800 * time.Millisecond,
    })

    var results []PageData
    var mu sync.Mutex // guards results; callbacks run concurrently in async mode

    // Extract page data
    c.OnHTML("html", func(e *colly.HTMLElement) {
        data := PageData{
            URL:   e.Request.URL.String(),
            Title: e.ChildText("title"),
            Links: len(e.ChildAttrs("a[href]", "href")),
        }

        mu.Lock()
        results = append(results, data)
        mu.Unlock()
    })

    c.OnError(func(r *colly.Response, err error) {
        log.Printf("Request failed: %s | URL: %s", err.Error(), r.Request.URL)
    })

    urls := []string{
        "https://example.com",
        "https://httpbin.org",
        "https://jsonplaceholder.typicode.com",
    }

    for _, url := range urls {
        c.Visit(url)
    }

    // Wait for all async requests and callbacks to complete
    c.Wait()

    // Output results
    output, err := json.MarshalIndent(results, "", "  ")
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(string(output))
}

Best Practices for Parallel Scraping

1. Respect Server Limits

  • Start with low parallelism (2-4 concurrent requests)
  • Monitor response times and error rates
  • Implement exponential backoff for failures (see the sketch below)
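
Here is a minimal exponential-backoff sketch built on Colly's request context (which Retry() preserves across attempts). The "attempt" key and maxRetries value are illustrative names, not part of Colly's API, and the snippet assumes log and time are imported:

const maxRetries = 3

c.OnError(func(r *colly.Response, err error) {
    // Track the attempt count in the request context
    attempt, _ := r.Request.Ctx.GetAny("attempt").(int)
    if attempt >= maxRetries {
        log.Printf("giving up on %s after %d attempts", r.Request.URL, attempt+1)
        return
    }
    // Double the wait each time: 1s, 2s, 4s (blocks one worker goroutine)
    time.Sleep(time.Duration(1<<attempt) * time.Second)
    r.Request.Ctx.Put("attempt", attempt+1)
    r.Request.Retry()
})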

2. Use Appropriate Delays

c.Limit(&colly.LimitRule{
    DomainGlob:  "*",
    Parallelism: 3,
    Delay:       1 * time.Second,
    RandomDelay: 500 * time.Millisecond, // adds up to 500ms of random jitter on top of Delay
})

3. Handle Errors Gracefully

c.OnError(func(r *colly.Response, err error) {
    if r.StatusCode == 429 { // Rate limited
        time.Sleep(5 * time.Second) // note: this blocks one worker goroutine
        // In real code, cap the number of retries (see the backoff
        // sketch above) to avoid retrying forever
        r.Request.Retry()
    }
})

4. Monitor Performance

  • Track request completion rates (see the sketch below)
  • Adjust parallelism based on server response
  • Log failed requests for retry logic
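
A simple way to track completion rates is a pair of atomic counters updated in the response and error callbacks. A sketch, assuming sync/atomic is imported (the counter names are illustrative):

var succeeded, failed int64

c.OnResponse(func(r *colly.Response) {
    atomic.AddInt64(&succeeded, 1)
})

c.OnError(func(r *colly.Response, err error) {
    atomic.AddInt64(&failed, 1)
})

// After c.Wait() returns:
fmt.Printf("succeeded: %d, failed: %d\n",
    atomic.LoadInt64(&succeeded), atomic.LoadInt64(&failed))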

Common Pitfalls to Avoid

  • Over-aggressive parallelism: Can overwhelm target servers
  • Missing error handling: Failed requests should be properly managed
  • Ignoring robots.txt: Always check site scraping policies (see the snippet below)
  • No rate limiting: Can lead to IP blocking or legal issues
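
On the robots.txt point: to the best of my knowledge Colly ignores robots.txt by default, so respecting it is an explicit opt-in via the collector's IgnoreRobotsTxt field:

c := colly.NewCollector(colly.Async(true))
// Fetch and honor robots.txt; Visit then returns an error
// (colly.ErrRobotsTxtBlocked) for disallowed URLs
c.IgnoreRobotsTxt = false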

Colly's parallel scraping capabilities make it an excellent choice for high-performance web scraping tasks when used responsibly and with proper configuration.
