Yes, Colly excels at parallel web scraping. The Go-based framework includes built-in asynchronous support and concurrency management, making it ideal for scraping multiple pages simultaneously while respecting rate limits and server capacity.
Key Features for Parallel Scraping
- Async Mode: Enable with colly.Async(true) for non-blocking requests
- Concurrency Control: Set parallelism limits using LimitRule
- Rate Limiting: Built-in delays and domain-specific rules
- Goroutine Management: Automatic handling of concurrent operations
Basic Parallel Scraping Example
package main

import (
    "fmt"
    "log"
    "time"

    "github.com/gocolly/colly"
)

func main() {
    // Create collector with async support
    c := colly.NewCollector(
        colly.Async(true),
    )

    // Configure concurrency limits
    c.Limit(&colly.LimitRule{
        DomainGlob:  "*",
        Parallelism: 4,               // 4 concurrent requests
        Delay:       1 * time.Second, // 1 second between requests
    })

    // Set up data extraction callback
    c.OnHTML("title", func(e *colly.HTMLElement) {
        fmt.Printf("Title: %s | URL: %s\n", e.Text, e.Request.URL)
    })

    // Error handling
    c.OnError(func(r *colly.Response, err error) {
        fmt.Printf("Error: %s | URL: %s\n", err.Error(), r.Request.URL)
    })

    // URLs to scrape
    urls := []string{
        "https://example.com/page1",
        "https://example.com/page2",
        "https://example.com/page3",
        "https://example.com/page4",
    }

    // Queue parallel visits; in async mode Visit returns immediately
    for _, url := range urls {
        if err := c.Visit(url); err != nil {
            log.Printf("Visit failed: %s | URL: %s", err, url)
        }
    }

    // Wait for all queued requests and callbacks to finish
    c.Wait()
}
Advanced Configuration with Domain-Specific Limits
func setupDomainLimits(c *colly.Collector) {
    // Different limits per domain
    c.Limit(&colly.LimitRule{
        DomainGlob:  "*.fast-site.com",
        Parallelism: 8,
        Delay:       500 * time.Millisecond,
    })

    c.Limit(&colly.LimitRule{
        DomainGlob:  "*.slow-site.com",
        Parallelism: 2,
        Delay:       2 * time.Second,
    })
}
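To use these rules, attach them to an async collector before queuing any visits. A minimal sketch follows; the URLs are placeholders chosen to match the globs above.

c := colly.NewCollector(colly.Async(true))
setupDomainLimits(c)

// Each domain is now throttled by its own parallelism and delay settings.
c.Visit("https://www.fast-site.com/")
c.Visit("https://www.slow-site.com/")
c.Wait()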
Complete Example with Data Collection
package main

import (
    "encoding/json"
    "fmt"
    "log"
    "sync"
    "time"

    "github.com/gocolly/colly"
)

type PageData struct {
    URL   string `json:"url"`
    Title string `json:"title"`
    Links int    `json:"link_count"`
}

func main() {
    c := colly.NewCollector(colly.Async(true))

    // Configure for parallel scraping
    c.Limit(&colly.LimitRule{
        DomainGlob:  "*",
        Parallelism: 3,
        Delay:       800 * time.Millisecond,
    })

    var results []PageData
    var mu sync.Mutex // guards results against concurrent appends

    // Extract page data
    c.OnHTML("html", func(e *colly.HTMLElement) {
        data := PageData{
            URL:   e.Request.URL.String(),
            Title: e.ChildText("title"),
            Links: len(e.ChildAttrs("a[href]", "href")),
        }

        mu.Lock()
        results = append(results, data)
        mu.Unlock()
    })

    c.OnError(func(r *colly.Response, err error) {
        log.Printf("Request failed: %s | URL: %s", err, r.Request.URL)
    })

    urls := []string{
        "https://example.com",
        "https://httpbin.org",
        "https://jsonplaceholder.typicode.com",
    }

    for _, url := range urls {
        if err := c.Visit(url); err != nil {
            log.Printf("Visit failed: %s | URL: %s", err, url)
        }
    }

    // Wait for all queued requests and callbacks to finish
    c.Wait()

    // Output results
    output, err := json.MarshalIndent(results, "", "  ")
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(string(output))
}
Best Practices for Parallel Scraping
1. Respect Server Limits
- Start with low parallelism (2-4 concurrent requests)
- Monitor response times and error rates
- Implement exponential backoff for failures (see the sketch below)
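A minimal sketch of one way to implement exponential backoff in Colly's OnError callback. It assumes the async collector c and the time import from the examples above, plus log and strconv; the "retries" key is our own convention stored on the request context, maxRetries and baseDelay are illustrative values, and the approach relies on Colly preserving the request Ctx across Request.Retry().

// Sketch: exponential backoff on failed requests.
const maxRetries = 3
baseDelay := 1 * time.Second

c.OnError(func(r *colly.Response, err error) {
    // Read the attempt count stored on the request's context ("retries" is our own key).
    attempts, _ := strconv.Atoi(r.Ctx.Get("retries"))
    if attempts >= maxRetries {
        log.Printf("giving up on %s after %d retries: %v", r.Request.URL, attempts, err)
        return
    }

    // Wait 1s, 2s, 4s, ... before retrying; Sleep only blocks this callback's goroutine.
    wait := baseDelay * time.Duration(1<<attempts)
    time.Sleep(wait)

    r.Ctx.Put("retries", strconv.Itoa(attempts+1))
    r.Request.Retry()
})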
2. Use Appropriate Delays
c.Limit(&colly.LimitRule{
    DomainGlob:  "*",
    Parallelism: 3,
    Delay:       1 * time.Second,
    RandomDelay: 500 * time.Millisecond, // adds up to 500ms of extra random delay on top of Delay
})
3. Handle Errors Gracefully
c.OnError(func(r *colly.Response, err error) {
    if r.StatusCode == 429 { // Rate limited
        time.Sleep(5 * time.Second) // back off before retrying; this blocks only the callback
        r.Request.Retry()           // consider capping retries (see the backoff sketch above)
    }
})
4. Monitor Performance
- Track request completion rates
- Adjust parallelism based on server response
- Log failed requests for retry logic (a counter-based sketch follows below)
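One lightweight way to monitor a run is to count responses and errors with atomic counters. This sketch assumes the async collector c from the earlier examples plus the log and sync/atomic imports; the counter names are arbitrary.

// Sketch: basic request metrics via atomic counters.
var succeeded, failed int64

c.OnResponse(func(r *colly.Response) {
    atomic.AddInt64(&succeeded, 1)
})

c.OnError(func(r *colly.Response, err error) {
    atomic.AddInt64(&failed, 1)
    log.Printf("failed: %s (%v)", r.Request.URL, err) // keep a record for later retries
})

// ... queue the visits, then:
c.Wait()
log.Printf("done: %d succeeded, %d failed", atomic.LoadInt64(&succeeded), atomic.LoadInt64(&failed))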
Common Pitfalls to Avoid
- Over-aggressive parallelism: Can overwhelm target servers
- Missing error handling: Failed requests should be properly managed
- Ignoring robots.txt: Always check site scraping policies
- No rate limiting: Can lead to IP blocking or legal issues
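A conservative starting configuration that addresses most of these pitfalls might look like the sketch below. The allowed domains, user agent, and limits are placeholders to adjust per target site, and the IgnoreRobotsTxt field is set explicitly because Colly does not necessarily respect robots.txt out of the box.

// Sketch: conservative defaults; domains, user agent, and limits are placeholders.
c := colly.NewCollector(
    colly.Async(true),
    colly.AllowedDomains("example.com", "www.example.com"), // stay on the intended site
    colly.UserAgent("my-scraper/1.0 (+https://example.com/contact)"),
)
c.IgnoreRobotsTxt = false // opt in to respecting robots.txt

// Low parallelism plus fixed and random delays to avoid hammering the server
c.Limit(&colly.LimitRule{
    DomainGlob:  "*",
    Parallelism: 2,
    Delay:       1 * time.Second,
    RandomDelay: 500 * time.Millisecond,
})

// Never drop failures silently
c.OnError(func(r *colly.Response, err error) {
    log.Printf("request failed: %s (%v)", r.Request.URL, err)
})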
Colly's parallel scraping capabilities make it an excellent choice for high-performance web scraping tasks when used responsibly and with proper configuration.