What is the Best Way to Handle Concurrent Web Scraping in Go?
Concurrent web scraping in Go uses the language's concurrency primitives, goroutines and channels, to fetch many pages at the same time instead of one after another. Because goroutines are lightweight and scheduling is built into the runtime, Go is an excellent choice for high-performance scrapers that need to process large numbers of URLs simultaneously.
Why Use Concurrent Web Scraping?
Sequential web scraping processes one URL at a time, which can be extremely slow when dealing with hundreds or thousands of pages. Concurrent scraping allows you to:
- Reduce total execution time by processing multiple requests simultaneously
- Maximize network throughput by overlapping I/O operations
- Handle rate limiting more effectively with controlled concurrency
- Scale efficiently to handle large-scale data extraction tasks
Core Concurrency Patterns for Web Scraping
1. Basic Goroutine Approach
The simplest approach uses goroutines with a WaitGroup to coordinate completion:
package main

import (
    "fmt"
    "io"
    "net/http"
    "sync"
    "time"
)

func scrapeURL(url string, wg *sync.WaitGroup) {
    defer wg.Done()

    client := &http.Client{
        Timeout: 10 * time.Second,
    }

    resp, err := client.Get(url)
    if err != nil {
        fmt.Printf("Error scraping %s: %v\n", url, err)
        return
    }
    defer resp.Body.Close()

    body, err := io.ReadAll(resp.Body)
    if err != nil {
        fmt.Printf("Error reading response from %s: %v\n", url, err)
        return
    }

    fmt.Printf("Scraped %s: %d bytes\n", url, len(body))
    // Process the scraped data here
}

func main() {
    urls := []string{
        "https://example.com",
        "https://httpbin.org/json",
        "https://httpbin.org/xml",
        // Add more URLs
    }

    var wg sync.WaitGroup
    for _, url := range urls {
        wg.Add(1)
        go scrapeURL(url, &wg)
    }

    wg.Wait()
    fmt.Println("All scraping completed")
}
2. Worker Pool Pattern
For better resource control and rate limiting, implement a worker pool:
package main

import (
    "fmt"
    "io"
    "net/http"
    "sync"
    "time"
)

type ScrapingJob struct {
    URL string
    ID  int
}

type ScrapingResult struct {
    Job   ScrapingJob
    Data  []byte
    Error error
}

func worker(id int, jobs <-chan ScrapingJob, results chan<- ScrapingResult, wg *sync.WaitGroup) {
    defer wg.Done()

    client := &http.Client{
        Timeout: 15 * time.Second,
    }

    for job := range jobs {
        fmt.Printf("Worker %d processing job %d: %s\n", id, job.ID, job.URL)

        resp, err := client.Get(job.URL)
        if err != nil {
            results <- ScrapingResult{Job: job, Error: err}
            continue
        }

        data, err := io.ReadAll(resp.Body)
        resp.Body.Close()

        results <- ScrapingResult{
            Job:   job,
            Data:  data,
            Error: err,
        }

        // Rate limiting - pause between requests
        time.Sleep(100 * time.Millisecond)
    }
}

func main() {
    urls := []string{
        "https://example.com",
        "https://httpbin.org/json",
        "https://httpbin.org/xml",
        "https://httpbin.org/html",
        "https://httpbin.org/robots.txt",
    }

    const numWorkers = 3
    jobs := make(chan ScrapingJob, len(urls))
    results := make(chan ScrapingResult, len(urls))

    // Start workers
    var wg sync.WaitGroup
    for i := 1; i <= numWorkers; i++ {
        wg.Add(1)
        go worker(i, jobs, results, &wg)
    }

    // Send jobs
    for i, url := range urls {
        jobs <- ScrapingJob{URL: url, ID: i + 1}
    }
    close(jobs)

    // Collect results
    go func() {
        wg.Wait()
        close(results)
    }()

    // Process results
    for result := range results {
        if result.Error != nil {
            fmt.Printf("Job %d failed: %v\n", result.Job.ID, result.Error)
        } else {
            fmt.Printf("Job %d completed: %d bytes from %s\n",
                result.Job.ID, len(result.Data), result.Job.URL)
        }
    }
}
3. Advanced Concurrent Scraper with Rate Limiting
For production use, implement sophisticated rate limiting and error handling:
package main

import (
    "context"
    "fmt"
    "io"
    "net/http"
    "sync"
    "time"

    "golang.org/x/time/rate"
)

type ConcurrentScraper struct {
    client      *http.Client
    rateLimiter *rate.Limiter
    maxWorkers  int
    timeout     time.Duration
}

type ScrapeRequest struct {
    URL     string
    Headers map[string]string
    Context context.Context
}

type ScrapeResponse struct {
    URL        string
    StatusCode int
    Data       []byte
    Headers    map[string][]string
    Error      error
    Duration   time.Duration
}

func NewConcurrentScraper(maxWorkers int, requestsPerSecond float64, timeout time.Duration) *ConcurrentScraper {
    return &ConcurrentScraper{
        client: &http.Client{
            Timeout: timeout,
            Transport: &http.Transport{
                MaxIdleConns:        100,
                MaxIdleConnsPerHost: 10,
                IdleConnTimeout:     90 * time.Second,
            },
        },
        rateLimiter: rate.NewLimiter(rate.Limit(requestsPerSecond), 1),
        maxWorkers:  maxWorkers,
        timeout:     timeout,
    }
}

func (s *ConcurrentScraper) worker(ctx context.Context, requests <-chan ScrapeRequest, responses chan<- ScrapeResponse, wg *sync.WaitGroup) {
    defer wg.Done()

    for {
        select {
        case req, ok := <-requests:
            if !ok {
                return
            }

            // Rate limiting
            if err := s.rateLimiter.Wait(req.Context); err != nil {
                responses <- ScrapeResponse{URL: req.URL, Error: err}
                continue
            }

            start := time.Now()
            response := s.scrapeURL(req)
            response.Duration = time.Since(start)

            select {
            case responses <- response:
            case <-ctx.Done():
                return
            }
        case <-ctx.Done():
            return
        }
    }
}

func (s *ConcurrentScraper) scrapeURL(req ScrapeRequest) ScrapeResponse {
    httpReq, err := http.NewRequestWithContext(req.Context, "GET", req.URL, nil)
    if err != nil {
        return ScrapeResponse{URL: req.URL, Error: err}
    }

    // Set custom headers
    for key, value := range req.Headers {
        httpReq.Header.Set(key, value)
    }

    // Set default User-Agent if not provided
    if httpReq.Header.Get("User-Agent") == "" {
        httpReq.Header.Set("User-Agent", "Go-Scraper/1.0")
    }

    resp, err := s.client.Do(httpReq)
    if err != nil {
        return ScrapeResponse{URL: req.URL, Error: err}
    }
    defer resp.Body.Close()

    data, err := io.ReadAll(resp.Body)
    if err != nil {
        return ScrapeResponse{URL: req.URL, StatusCode: resp.StatusCode, Error: err}
    }

    return ScrapeResponse{
        URL:        req.URL,
        StatusCode: resp.StatusCode,
        Data:       data,
        Headers:    resp.Header,
    }
}

func (s *ConcurrentScraper) ScrapeURLs(ctx context.Context, urls []string) <-chan ScrapeResponse {
    requests := make(chan ScrapeRequest, len(urls))
    responses := make(chan ScrapeResponse, len(urls))

    // Start workers
    var wg sync.WaitGroup
    for i := 0; i < s.maxWorkers; i++ {
        wg.Add(1)
        go s.worker(ctx, requests, responses, &wg)
    }

    // Send requests
    go func() {
        defer close(requests)
        for _, url := range urls {
            select {
            case requests <- ScrapeRequest{
                URL:     url,
                Context: ctx,
                Headers: map[string]string{
                    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
                },
            }:
            case <-ctx.Done():
                return
            }
        }
    }()

    // Close responses channel when all workers are done
    go func() {
        wg.Wait()
        close(responses)
    }()

    return responses
}

func main() {
    urls := []string{
        "https://example.com",
        "https://httpbin.org/json",
        "https://httpbin.org/xml",
        "https://httpbin.org/html",
        "https://httpbin.org/status/200",
    }

    scraper := NewConcurrentScraper(
        3,              // max workers
        2.0,            // requests per second
        30*time.Second, // timeout
    )

    ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
    defer cancel()

    start := time.Now()
    responses := scraper.ScrapeURLs(ctx, urls)

    successCount := 0
    errorCount := 0

    for response := range responses {
        if response.Error != nil {
            fmt.Printf("Failed to scrape %s: %v\n", response.URL, response.Error)
            errorCount++
        } else {
            fmt.Printf("Scraped %s (Status: %d, Size: %d bytes, Duration: %v)\n",
                response.URL, response.StatusCode, len(response.Data), response.Duration)
            successCount++
        }
    }

    fmt.Printf("\nResults: %d successful, %d errors, Total time: %v\n",
        successCount, errorCount, time.Since(start))
}
Best Practices for Concurrent Web Scraping
1. Implement Proper Rate Limiting
import "golang.org/x/time/rate"
// Create a rate limiter that allows 5 requests per second
limiter := rate.NewLimiter(5, 1)
// Wait for permission before making a request
if err := limiter.Wait(context.Background()); err != nil {
return err
}
2. Use Context for Cancellation
ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
defer cancel()
// Pass context to all operations
req, err := http.NewRequestWithContext(ctx, "GET", url, nil)
3. Handle Errors Gracefully
// scrapeURL here is assumed to be a variant that returns the response body and an error.
func scrapeWithRetry(url string, maxRetries int) ([]byte, error) {
    var lastErr error

    for i := 0; i < maxRetries; i++ {
        data, err := scrapeURL(url)
        if err == nil {
            return data, nil
        }
        lastErr = err

        // Exponential backoff: 1s, 2s, 4s, ...
        waitTime := time.Duration(1<<i) * time.Second
        time.Sleep(waitTime)
    }

    return nil, fmt.Errorf("failed after %d retries: %w", maxRetries, lastErr)
}
4. Monitor Resource Usage
import "runtime"
func monitorGoroutines() {
ticker := time.NewTicker(5 * time.Second)
defer ticker.Stop()
for range ticker.C {
fmt.Printf("Active goroutines: %d\n", runtime.NumGoroutine())
}
}
Integration with HTML Parsing
For processing scraped HTML content, combine concurrent scraping with parsing libraries:
import "github.com/PuerkitoBio/goquery"
func parseHTML(data []byte, url string) error {
doc, err := goquery.NewDocumentFromReader(strings.NewReader(string(data)))
if err != nil {
return err
}
// Extract data using CSS selectors
doc.Find("title").Each(func(i int, s *goquery.Selection) {
title := s.Text()
fmt.Printf("Title from %s: %s\n", url, title)
})
return nil
}
Performance Considerations
Optimal Worker Count
The optimal number of workers depends on several factors:
- CPU cores: Scraping is mostly I/O-bound, so the pool can exceed the core count; 2-4 workers per core is a common starting point
- Network latency: Higher latency favors more concurrent connections, since each request spends more of its time waiting on the network
- Target server capacity: Respect the server's ability to handle requests
- Memory constraints: Each worker consumes memory for connections and buffers
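There is no universal formula, so treat any starting value as something to measure and tune. The helper below is a minimal sketch of one such heuristic; the multiplier and cap are illustrative assumptions, not recommendations from any benchmark:
import "runtime"

// defaultWorkerCount returns a starting pool size for I/O-bound scraping:
// a few workers per core, capped so a large machine does not overwhelm the target.
// Tune both numbers against real measurements and the target site's limits.
func defaultWorkerCount() int {
    n := runtime.NumCPU() * 4 // illustrative multiplier for I/O-bound work
    if n > 32 {               // illustrative cap; be polite to the target server
        n = 32
    }
    return n
}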
Memory Management
// Use buffered channels to control memory usage
jobs := make(chan ScrapingJob, 100)       // Limit queued jobs
results := make(chan ScrapingResult, 100) // Limit pending results

// Process results immediately to free memory
go func() {
    for result := range results {
        processResult(result) // Handle each result as it arrives; don't accumulate results in memory
    }
}()
Common Pitfalls and Solutions
1. Too Many Goroutines
Problem: Creating unlimited goroutines can exhaust system resources.
Solution: Use worker pools to limit concurrency.
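A full worker pool is not the only option: a buffered channel used as a semaphore also caps the number of goroutines in flight. A minimal sketch, where the function name and the scrape callback are illustrative placeholders:
import "sync"

// scrapeAll caps the number of in-flight goroutines with a buffered channel
// used as a semaphore. scrape stands in for any per-URL scraping function.
func scrapeAll(urls []string, limit int, scrape func(string)) {
    sem := make(chan struct{}, limit)

    var wg sync.WaitGroup
    for _, url := range urls {
        wg.Add(1)
        sem <- struct{}{} // acquire a slot; blocks once `limit` goroutines are running
        go func(u string) {
            defer wg.Done()
            defer func() { <-sem }() // release the slot
            scrape(u)
        }(url)
    }
    wg.Wait()
}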
2. Ignoring Rate Limits
Problem: Overwhelming target servers with requests.
Solution: Implement proper rate limiting and respect robots.txt.
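When a scraper touches several domains, a single global limiter is often too blunt; keeping one limiter per host lets you stay polite to each site independently. A minimal sketch using golang.org/x/time/rate (the type and method names are illustrative, and the per-host rate should match each site's policy):
import (
    "context"
    "net/url"
    "sync"

    "golang.org/x/time/rate"
)

type hostLimiter struct {
    mu       sync.Mutex
    limiters map[string]*rate.Limiter
    perHost  rate.Limit
}

func newHostLimiter(perHost rate.Limit) *hostLimiter {
    return &hostLimiter{limiters: make(map[string]*rate.Limiter), perHost: perHost}
}

// wait blocks until a request to rawURL's host is allowed.
func (h *hostLimiter) wait(ctx context.Context, rawURL string) error {
    u, err := url.Parse(rawURL)
    if err != nil {
        return err
    }

    h.mu.Lock()
    lim, ok := h.limiters[u.Host]
    if !ok {
        lim = rate.NewLimiter(h.perHost, 1)
        h.limiters[u.Host] = lim
    }
    h.mu.Unlock()

    return lim.Wait(ctx)
}
Call wait before each request, for example at the top of a worker loop, and combine it with a robots.txt check for the same host.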
3. Poor Error Handling
Problem: Not handling network failures gracefully.
Solution: Implement retry logic with exponential backoff.
4. Resource Leaks
Problem: Not closing HTTP response bodies or channels.
Solution: Always use defer resp.Body.Close() and properly close channels.
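Beyond closing the body, draining any unread bytes before the close lets the HTTP client reuse the underlying keep-alive connection instead of tearing it down. A small helper along these lines (the function name is illustrative):
import (
    "context"
    "io"
    "net/http"
)

// fetchBody reads a response body and guarantees it is drained and closed,
// which lets the HTTP client reuse the underlying keep-alive connection.
func fetchBody(ctx context.Context, client *http.Client, rawURL string) ([]byte, error) {
    req, err := http.NewRequestWithContext(ctx, http.MethodGet, rawURL, nil)
    if err != nil {
        return nil, err
    }

    resp, err := client.Do(req)
    if err != nil {
        return nil, err
    }
    defer func() {
        _, _ = io.Copy(io.Discard, resp.Body) // drain any unread remainder
        resp.Body.Close()
    }()

    return io.ReadAll(resp.Body)
}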
Conclusion
Concurrent web scraping in Go provides significant performance improvements when implemented correctly. The key is to balance concurrency with respectful scraping practices, proper error handling, and resource management. For complex scraping scenarios involving JavaScript-heavy sites, you might want to consider how to run multiple pages in parallel with Puppeteer for browser-based scraping.
Start with simple goroutines for basic needs, graduate to worker pools for better control, and implement advanced patterns with rate limiting and context management for production applications. Remember that effective concurrent scraping requires finding the right balance between speed and server-friendly behavior.