What is the Best Way to Handle Concurrent Web Scraping in Go?
Concurrent web scraping in Go uses the language's concurrency primitives, goroutines and channels, to fetch many pages at the same time instead of one after another. Because goroutines are lightweight and scheduling is built into the runtime, Go is an excellent choice for high-performance scrapers that need to process large numbers of URLs simultaneously.
Why Use Concurrent Web Scraping?
Sequential web scraping processes one URL at a time, which can be extremely slow when dealing with hundreds or thousands of pages. Concurrent scraping allows you to:
- Reduce total execution time by processing multiple requests simultaneously
- Maximize network throughput by overlapping I/O operations
- Handle rate limiting more effectively with controlled concurrency
- Scale efficiently to handle large-scale data extraction tasks
Core Concurrency Patterns for Web Scraping
1. Basic Goroutine Approach
The simplest approach uses goroutines with a WaitGroup to coordinate completion:
package main

import (
    "fmt"
    "io"
    "net/http"
    "sync"
    "time"
)

func scrapeURL(url string, wg *sync.WaitGroup) {
    defer wg.Done()

    client := &http.Client{
        Timeout: 10 * time.Second,
    }

    resp, err := client.Get(url)
    if err != nil {
        fmt.Printf("Error scraping %s: %v\n", url, err)
        return
    }
    defer resp.Body.Close()

    body, err := io.ReadAll(resp.Body)
    if err != nil {
        fmt.Printf("Error reading response from %s: %v\n", url, err)
        return
    }

    fmt.Printf("Scraped %s: %d bytes\n", url, len(body))
    // Process the scraped data here
}

func main() {
    urls := []string{
        "https://example.com",
        "https://httpbin.org/json",
        "https://httpbin.org/xml",
        // Add more URLs
    }

    var wg sync.WaitGroup
    for _, url := range urls {
        wg.Add(1)
        go scrapeURL(url, &wg)
    }

    wg.Wait()
    fmt.Println("All scraping completed")
}
2. Worker Pool Pattern
For better resource control and rate limiting, implement a worker pool:
package main

import (
    "fmt"
    "io"
    "net/http"
    "sync"
    "time"
)

type ScrapingJob struct {
    URL string
    ID  int
}

type ScrapingResult struct {
    Job   ScrapingJob
    Data  []byte
    Error error
}

func worker(id int, jobs <-chan ScrapingJob, results chan<- ScrapingResult, wg *sync.WaitGroup) {
    defer wg.Done()

    client := &http.Client{
        Timeout: 15 * time.Second,
    }

    for job := range jobs {
        fmt.Printf("Worker %d processing job %d: %s\n", id, job.ID, job.URL)

        resp, err := client.Get(job.URL)
        if err != nil {
            results <- ScrapingResult{Job: job, Error: err}
            continue
        }

        data, err := io.ReadAll(resp.Body)
        resp.Body.Close()

        results <- ScrapingResult{
            Job:   job,
            Data:  data,
            Error: err,
        }

        // Rate limiting - pause between requests
        time.Sleep(100 * time.Millisecond)
    }
}

func main() {
    urls := []string{
        "https://example.com",
        "https://httpbin.org/json",
        "https://httpbin.org/xml",
        "https://httpbin.org/html",
        "https://httpbin.org/robots.txt",
    }

    const numWorkers = 3
    jobs := make(chan ScrapingJob, len(urls))
    results := make(chan ScrapingResult, len(urls))

    // Start workers
    var wg sync.WaitGroup
    for i := 1; i <= numWorkers; i++ {
        wg.Add(1)
        go worker(i, jobs, results, &wg)
    }

    // Send jobs
    for i, url := range urls {
        jobs <- ScrapingJob{URL: url, ID: i + 1}
    }
    close(jobs)

    // Collect results
    go func() {
        wg.Wait()
        close(results)
    }()

    // Process results
    for result := range results {
        if result.Error != nil {
            fmt.Printf("Job %d failed: %v\n", result.Job.ID, result.Error)
        } else {
            fmt.Printf("Job %d completed: %d bytes from %s\n",
                result.Job.ID, len(result.Data), result.Job.URL)
        }
    }
}
3. Advanced Concurrent Scraper with Rate Limiting
For production use, implement sophisticated rate limiting and error handling:
package main

import (
    "context"
    "fmt"
    "io"
    "net/http"
    "sync"
    "time"

    "golang.org/x/time/rate"
)

type ConcurrentScraper struct {
    client      *http.Client
    rateLimiter *rate.Limiter
    maxWorkers  int
    timeout     time.Duration
}

type ScrapeRequest struct {
    URL     string
    Headers map[string]string
    Context context.Context
}

type ScrapeResponse struct {
    URL        string
    StatusCode int
    Data       []byte
    Headers    map[string][]string
    Error      error
    Duration   time.Duration
}

func NewConcurrentScraper(maxWorkers int, requestsPerSecond float64, timeout time.Duration) *ConcurrentScraper {
    return &ConcurrentScraper{
        client: &http.Client{
            Timeout: timeout,
            Transport: &http.Transport{
                MaxIdleConns:        100,
                MaxIdleConnsPerHost: 10,
                IdleConnTimeout:     90 * time.Second,
            },
        },
        rateLimiter: rate.NewLimiter(rate.Limit(requestsPerSecond), 1),
        maxWorkers:  maxWorkers,
        timeout:     timeout,
    }
}

func (s *ConcurrentScraper) worker(ctx context.Context, requests <-chan ScrapeRequest, responses chan<- ScrapeResponse, wg *sync.WaitGroup) {
    defer wg.Done()

    for {
        select {
        case req, ok := <-requests:
            if !ok {
                return
            }

            // Rate limiting
            if err := s.rateLimiter.Wait(req.Context); err != nil {
                responses <- ScrapeResponse{URL: req.URL, Error: err}
                continue
            }

            start := time.Now()
            response := s.scrapeURL(req)
            response.Duration = time.Since(start)

            select {
            case responses <- response:
            case <-ctx.Done():
                return
            }
        case <-ctx.Done():
            return
        }
    }
}

func (s *ConcurrentScraper) scrapeURL(req ScrapeRequest) ScrapeResponse {
    httpReq, err := http.NewRequestWithContext(req.Context, "GET", req.URL, nil)
    if err != nil {
        return ScrapeResponse{URL: req.URL, Error: err}
    }

    // Set custom headers
    for key, value := range req.Headers {
        httpReq.Header.Set(key, value)
    }

    // Set default User-Agent if not provided
    if httpReq.Header.Get("User-Agent") == "" {
        httpReq.Header.Set("User-Agent", "Go-Scraper/1.0")
    }

    resp, err := s.client.Do(httpReq)
    if err != nil {
        return ScrapeResponse{URL: req.URL, Error: err}
    }
    defer resp.Body.Close()

    data, err := io.ReadAll(resp.Body)
    if err != nil {
        return ScrapeResponse{URL: req.URL, StatusCode: resp.StatusCode, Error: err}
    }

    return ScrapeResponse{
        URL:        req.URL,
        StatusCode: resp.StatusCode,
        Data:       data,
        Headers:    resp.Header,
    }
}

func (s *ConcurrentScraper) ScrapeURLs(ctx context.Context, urls []string) <-chan ScrapeResponse {
    requests := make(chan ScrapeRequest, len(urls))
    responses := make(chan ScrapeResponse, len(urls))

    // Start workers
    var wg sync.WaitGroup
    for i := 0; i < s.maxWorkers; i++ {
        wg.Add(1)
        go s.worker(ctx, requests, responses, &wg)
    }

    // Send requests
    go func() {
        defer close(requests)
        for _, url := range urls {
            select {
            case requests <- ScrapeRequest{
                URL:     url,
                Context: ctx,
                Headers: map[string]string{
                    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
                },
            }:
            case <-ctx.Done():
                return
            }
        }
    }()

    // Close responses channel when all workers are done
    go func() {
        wg.Wait()
        close(responses)
    }()

    return responses
}

func main() {
    urls := []string{
        "https://example.com",
        "https://httpbin.org/json",
        "https://httpbin.org/xml",
        "https://httpbin.org/html",
        "https://httpbin.org/status/200",
    }

    scraper := NewConcurrentScraper(
        3,              // max workers
        2.0,            // requests per second
        30*time.Second, // timeout
    )

    ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
    defer cancel()

    start := time.Now()
    responses := scraper.ScrapeURLs(ctx, urls)

    successCount := 0
    errorCount := 0

    for response := range responses {
        if response.Error != nil {
            fmt.Printf("Failed to scrape %s: %v\n", response.URL, response.Error)
            errorCount++
        } else {
            fmt.Printf("Scraped %s (Status: %d, Size: %d bytes, Duration: %v)\n",
                response.URL, response.StatusCode, len(response.Data), response.Duration)
            successCount++
        }
    }

    fmt.Printf("\nResults: %d successful, %d errors, Total time: %v\n",
        successCount, errorCount, time.Since(start))
}
Best Practices for Concurrent Web Scraping
1. Implement Proper Rate Limiting
import "golang.org/x/time/rate"
// Create a rate limiter that allows 5 requests per second
limiter := rate.NewLimiter(5, 1)
// Wait for permission before making a request
if err := limiter.Wait(context.Background()); err != nil {
return err
}
2. Use Context for Cancellation
ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
defer cancel()
// Pass context to all operations
req, err := http.NewRequestWithContext(ctx, "GET", url, nil)
3. Handle Errors Gracefully
// scrapeURL here is assumed to be a variant that returns the response body and an error.
func scrapeWithRetry(url string, maxRetries int) ([]byte, error) {
    var lastErr error

    for i := 0; i < maxRetries; i++ {
        data, err := scrapeURL(url)
        if err == nil {
            return data, nil
        }
        lastErr = err

        // Exponential backoff: 1s, 2s, 4s, ...
        waitTime := time.Duration(1<<i) * time.Second
        time.Sleep(waitTime)
    }

    return nil, fmt.Errorf("failed after %d retries: %w", maxRetries, lastErr)
}
4. Monitor Resource Usage
import "runtime"
func monitorGoroutines() {
ticker := time.NewTicker(5 * time.Second)
defer ticker.Stop()
for range ticker.C {
fmt.Printf("Active goroutines: %d\n", runtime.NumGoroutine())
}
}
Integration with HTML Parsing
For processing scraped HTML content, combine concurrent scraping with parsing libraries:
import "github.com/PuerkitoBio/goquery"
func parseHTML(data []byte, url string) error {
doc, err := goquery.NewDocumentFromReader(strings.NewReader(string(data)))
if err != nil {
return err
}
// Extract data using CSS selectors
doc.Find("title").Each(func(i int, s *goquery.Selection) {
title := s.Text()
fmt.Printf("Title from %s: %s\n", url, title)
})
return nil
}
Performance Considerations
Optimal Worker Count
The optimal number of workers depends on several factors:
- CPU cores: Scraping is mostly I/O-bound, so the pool can exceed the core count; 2-4 workers per core is a common starting point
- Network latency: Higher latency favors more concurrent connections, since each request spends more of its time waiting on the network
- Target server capacity: Respect the server's ability to handle requests
- Memory constraints: Each worker consumes memory for connections and buffers
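There is no universal formula, so treat any starting value as something to measure and tune. The helper below is a minimal sketch of one such heuristic; the multiplier and cap are illustrative assumptions, not recommendations from any benchmark:
import "runtime"

// defaultWorkerCount returns a starting pool size for I/O-bound scraping:
// a few workers per core, capped so a large machine does not overwhelm the target.
// Tune both numbers against real measurements and the target site's limits.
func defaultWorkerCount() int {
    n := runtime.NumCPU() * 4 // illustrative multiplier for I/O-bound work
    if n > 32 {               // illustrative cap; be polite to the target server
        n = 32
    }
    return n
}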
Memory Management
// Use buffered channels to control memory usage
jobs := make(chan ScrapingJob, 100)       // Limit queued jobs
results := make(chan ScrapingResult, 100) // Limit pending results

// Process results immediately to free memory
go func() {
    for result := range results {
        processResult(result) // Handle each result as it arrives; don't accumulate results in memory
    }
}()
Common Pitfalls and Solutions
1. Too Many Goroutines
Problem: Creating unlimited goroutines can exhaust system resources.
Solution: Use worker pools to limit concurrency.
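A full worker pool is not the only option: a buffered channel used as a semaphore also caps the number of goroutines in flight. A minimal sketch, where the function name and the scrape callback are illustrative placeholders:
import "sync"

// scrapeAll caps the number of in-flight goroutines with a buffered channel
// used as a semaphore. scrape stands in for any per-URL scraping function.
func scrapeAll(urls []string, limit int, scrape func(string)) {
    sem := make(chan struct{}, limit)

    var wg sync.WaitGroup
    for _, url := range urls {
        wg.Add(1)
        sem <- struct{}{} // acquire a slot; blocks once `limit` goroutines are running
        go func(u string) {
            defer wg.Done()
            defer func() { <-sem }() // release the slot
            scrape(u)
        }(url)
    }
    wg.Wait()
}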
2. Ignoring Rate Limits
Problem: Overwhelming target servers with requests.
Solution: Implement proper rate limiting and respect robots.txt.
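When a scraper touches several domains, a single global limiter is often too blunt; keeping one limiter per host lets you stay polite to each site independently. A minimal sketch using golang.org/x/time/rate (the type and method names are illustrative, and the per-host rate should match each site's policy):
import (
    "context"
    "net/url"
    "sync"

    "golang.org/x/time/rate"
)

type hostLimiter struct {
    mu       sync.Mutex
    limiters map[string]*rate.Limiter
    perHost  rate.Limit
}

func newHostLimiter(perHost rate.Limit) *hostLimiter {
    return &hostLimiter{limiters: make(map[string]*rate.Limiter), perHost: perHost}
}

// wait blocks until a request to rawURL's host is allowed.
func (h *hostLimiter) wait(ctx context.Context, rawURL string) error {
    u, err := url.Parse(rawURL)
    if err != nil {
        return err
    }

    h.mu.Lock()
    lim, ok := h.limiters[u.Host]
    if !ok {
        lim = rate.NewLimiter(h.perHost, 1)
        h.limiters[u.Host] = lim
    }
    h.mu.Unlock()

    return lim.Wait(ctx)
}
Call wait before each request, for example at the top of a worker loop, and combine it with a robots.txt check for the same host.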
3. Poor Error Handling
Problem: Not handling network failures gracefully.
Solution: Implement retry logic with exponential backoff.
4. Resource Leaks
Problem: Not closing HTTP response bodies or channels.
Solution: Always use defer resp.Body.Close() and properly close channels.
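Beyond closing the body, draining any unread bytes before the close lets the HTTP client reuse the underlying keep-alive connection instead of tearing it down. A small helper along these lines (the function name is illustrative):
import (
    "context"
    "io"
    "net/http"
)

// fetchBody reads a response body and guarantees it is drained and closed,
// which lets the HTTP client reuse the underlying keep-alive connection.
func fetchBody(ctx context.Context, client *http.Client, rawURL string) ([]byte, error) {
    req, err := http.NewRequestWithContext(ctx, http.MethodGet, rawURL, nil)
    if err != nil {
        return nil, err
    }

    resp, err := client.Do(req)
    if err != nil {
        return nil, err
    }
    defer func() {
        _, _ = io.Copy(io.Discard, resp.Body) // drain any unread remainder
        resp.Body.Close()
    }()

    return io.ReadAll(resp.Body)
}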
Conclusion
Concurrent web scraping in Go provides significant performance improvements when implemented correctly. The key is to balance concurrency with respectful scraping practices, proper error handling, and resource management. For complex scraping scenarios involving JavaScript-heavy sites, you might want to consider how to run multiple pages in parallel with Puppeteer for browser-based scraping.
Start with simple goroutines for basic needs, graduate to worker pools for better control, and implement advanced patterns with rate limiting and context management for production applications. Remember that effective concurrent scraping requires finding the right balance between speed and server-friendly behavior.