Yes. Colly, a popular scraping framework for Go, can scrape multiple pages in parallel. It has built-in support for asynchronous requests and per-domain rate limiting, so you can fetch many pages concurrently while still controlling request delays and concurrency limits.
Here's a basic example of how you can use Colly to scrape multiple pages in parallel:
```go
package main

import (
	"fmt"
	"log"

	"github.com/gocolly/colly"
)

func main() {
	// Create a Collector with asynchronous requests enabled
	c := colly.NewCollector(
		colly.Async(true),
	)

	// Limit the number of concurrent requests
	if err := c.Limit(&colly.LimitRule{
		DomainGlob:  "*",
		Parallelism: 2, // Adjust the number of concurrent requests
	}); err != nil {
		log.Fatal(err)
	}

	// Callback fired for each scraped page that has a <title> element
	c.OnHTML("title", func(e *colly.HTMLElement) {
		fmt.Println("Page Title:", e.Text)
	})

	// List of URLs to scrape
	urls := []string{
		"http://example.com/page1",
		"http://example.com/page2",
		"http://example.com/page3",
		// Add more URLs as needed
	}

	// In async mode, Visit returns immediately and the requests run in parallel
	for _, url := range urls {
		if err := c.Visit(url); err != nil {
			log.Println("Visit error:", err)
		}
	}

	// Block until all in-flight requests and their callbacks have finished
	c.Wait()
}
```
In this code example, we:
- Create a new Colly collector and enable asynchronous requests with `colly.Async(true)`.
- Set a limit rule to control the number of concurrent requests via `Parallelism`.
- Register an `OnHTML` callback that prints the title of each scraped page.
- Visit multiple URLs by calling `c.Visit(url)` in a loop; in asynchronous mode these calls return immediately.
- Call `c.Wait()` to block until all pending requests and callbacks have completed before the program exits.

Note that `c.Wait()` makes a manual `sync.WaitGroup` unnecessary: Colly tracks its own in-flight requests, and tying `wg.Done()` to an `OnHTML` callback would deadlock whenever a page fails to load or has no `<title>` element.
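One thing to watch out for: with `colly.Async(true)` and `Parallelism` above 1, callbacks can run on multiple goroutines at once, so any shared state they touch needs synchronization. Below is a minimal sketch (the `titles` slice and its mutex are illustrative, not part of Colly's API) that collects page titles safely and registers an `OnError` callback so failed requests are logged rather than silently dropped:

```go
package main

import (
	"fmt"
	"sync"

	"github.com/gocolly/colly"
)

func main() {
	c := colly.NewCollector(colly.Async(true))

	var (
		mu     sync.Mutex
		titles []string // shared across callback goroutines; guard with mu
	)

	c.OnHTML("title", func(e *colly.HTMLElement) {
		mu.Lock()
		titles = append(titles, e.Text)
		mu.Unlock()
	})

	// Log failed requests instead of dropping them silently
	c.OnError(func(r *colly.Response, err error) {
		fmt.Println("request failed:", r.Request.URL, "-", err)
	})

	for _, url := range []string{
		"http://example.com/page1",
		"http://example.com/page2",
	} {
		c.Visit(url)
	}
	c.Wait()

	fmt.Println("collected titles:", titles)
}
```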
Keep in mind that when you scrape websites in parallel, you should always respect the site's terms and conditions as well as its robots.txt file, and set limits that avoid overwhelming the site's servers. Adjust the `Parallelism` value to the website's capacity and the politeness policy you want to follow; Colly's `LimitRule` can also add delays between requests, as sketched below.
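For a more polite setup, `colly.LimitRule` also accepts `Delay` and `RandomDelay` fields that space out requests to the same domain. Here is a minimal sketch with illustrative timing values; the `IgnoreRobotsTxt` line reflects my understanding that Colly skips robots.txt checks unless told otherwise, so treat that default as an assumption to verify:

```go
package main

import (
	"log"
	"time"

	"github.com/gocolly/colly"
)

func main() {
	c := colly.NewCollector(colly.Async(true))

	// Ask Colly to fetch and honor robots.txt; as far as I know,
	// the collector ignores it by default.
	c.IgnoreRobotsTxt = false

	// Throttle: at most 2 parallel requests per domain, with a 1s base
	// delay plus up to 1s of random jitter between requests. The values
	// are illustrative; tune them to the target site.
	if err := c.Limit(&colly.LimitRule{
		DomainGlob:  "*",
		Parallelism: 2,
		Delay:       1 * time.Second,
		RandomDelay: 1 * time.Second,
	}); err != nil {
		log.Fatal(err)
	}

	c.Visit("http://example.com/page1")
	c.Wait()
}
```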