What are the performance considerations for Go web scraping?

Performance is crucial when building web scraping applications in Go, especially when dealing with large-scale data extraction. Go's built-in concurrency features and efficient memory management make it an excellent choice for high-performance scraping, but understanding the key performance considerations will help you build faster, more reliable scrapers.

Concurrency with Goroutines

Go's biggest performance advantage for web scraping comes from its lightweight goroutines. Unlike OS threads, goroutines start with only about 2 KB of stack, so you can spawn thousands of them without significant performance impact.

Basic Concurrent Scraping

package main

import (
    "fmt"
    "net/http"
    "sync"
    "time"
)

// scrapeURL fetches a single URL with a shared client and reports the result.
func scrapeURL(client *http.Client, url string, wg *sync.WaitGroup, results chan<- string) {
    defer wg.Done()

    resp, err := client.Get(url)
    if err != nil {
        results <- fmt.Sprintf("Error scraping %s: %v", url, err)
        return
    }
    defer resp.Body.Close()

    results <- fmt.Sprintf("Successfully scraped %s - Status: %d", url, resp.StatusCode)
}

func main() {
    urls := []string{
        "https://example.com/page1",
        "https://example.com/page2",
        "https://example.com/page3",
    }

    // Share one client so connections can be reused across goroutines.
    client := &http.Client{
        Timeout: 10 * time.Second,
    }

    var wg sync.WaitGroup
    results := make(chan string, len(urls))

    for _, url := range urls {
        wg.Add(1)
        go scrapeURL(client, url, &wg, results)
    }

    wg.Wait()
    close(results)

    for result := range results {
        fmt.Println(result)
    }
}

Limiting Concurrent Requests

While goroutines are lightweight, making too many concurrent HTTP requests can overwhelm target servers or your network. Use a semaphore pattern to limit concurrency:

package main

import (
    "fmt"
    "net/http"
    "sync"
    "time"
)

func scrapeConcurrentlyWithLimit(urls []string, maxConcurrency int) {
    semaphore := make(chan struct{}, maxConcurrency)
    var wg sync.WaitGroup

    // Share one client so every worker reuses the same connection pool.
    client := &http.Client{Timeout: 10 * time.Second}

    for _, url := range urls {
        wg.Add(1)
        go func(url string) {
            defer wg.Done()

            semaphore <- struct{}{}        // Acquire a slot
            defer func() { <-semaphore }() // Release the slot

            resp, err := client.Get(url)
            if err != nil {
                fmt.Printf("Error: %v\n", err)
                return
            }
            defer resp.Body.Close()

            fmt.Printf("Scraped %s - Status: %d\n", url, resp.StatusCode)
        }(url)
    }

    wg.Wait()
}
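
For reference, here is a minimal way to drive this from main; the URL list and the cap of five are placeholders you would tune for your workload:

func main() {
    urls := []string{
        "https://example.com/page1",
        "https://example.com/page2",
        "https://example.com/page3",
    }

    // Allow at most five requests in flight at any time.
    scrapeConcurrentlyWithLimit(urls, 5)
}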

HTTP Client Optimization

The HTTP client configuration significantly impacts scraping performance. Go's default transport keeps only two idle connections per host (http.DefaultMaxIdleConnsPerHost), which throttles throughput when you hit the same sites repeatedly.

Connection Pooling and Reuse

package main

import (
    "io"
    "net/http"
    "time"
)

func createOptimizedClient() *http.Client {
    transport := &http.Transport{
        MaxIdleConns:        100,              // Maximum idle connections
        MaxConnsPerHost:     10,               // Maximum connections per host
        MaxIdleConnsPerHost: 10,               // Maximum idle connections per host
        IdleConnTimeout:     90 * time.Second, // How long to keep idle connections
        DisableCompression:  false,            // Enable gzip compression
        ForceAttemptHTTP2:   true,             // Use HTTP/2 when possible
    }

    client := &http.Client{
        Transport: transport,
        Timeout:   30 * time.Second,
    }

    return client
}

func main() {
    client := createOptimizedClient()

    // Reuse this client for all requests.
    resp, err := client.Get("https://example.com")
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    // Drain the body so the connection returns to the idle pool for reuse.
    io.Copy(io.Discard, resp.Body)
}

DNS Optimization

DNS lookups can become a bottleneck in high-volume scraping. The standard library does not cache DNS results itself (it relies on the OS resolver), but a custom dialer gives you a hook where cached lookups can be plugged in:

package main

import (
    "context"
    "net"
    "net/http"
    "time"
)

func createClientWithDNSCache() *http.Client {
    dialer := &net.Dialer{
        Timeout:   5 * time.Second,
        KeepAlive: 30 * time.Second,
    }

    transport := &http.Transport{
        // DialContext is where a caching resolver can be wired in
        // before the TCP connection is made.
        DialContext: func(ctx context.Context, network, addr string) (net.Conn, error) {
            return dialer.DialContext(ctx, network, addr)
        },
        MaxIdleConns:        100,
        MaxConnsPerHost:     10,
        MaxIdleConnsPerHost: 10,
        IdleConnTimeout:     90 * time.Second,
    }

    return &http.Client{
        Transport: transport,
        Timeout:   30 * time.Second,
    }
}
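
If you want actual caching rather than just the hook, one option is to wrap the dialer so resolved addresses are remembered across requests. The sketch below is illustrative only: the cachedDialer type is not a real library, and it skips TTL handling and eviction entirely.

package main

import (
    "context"
    "net"
    "net/http"
    "sync"
    "time"
)

// cachedDialer memoizes DNS lookups in a plain map.
// Illustrative only: no TTLs, no eviction, first resolved IP always wins.
type cachedDialer struct {
    mu     sync.Mutex
    cache  map[string][]string
    dialer *net.Dialer
}

func (d *cachedDialer) DialContext(ctx context.Context, network, addr string) (net.Conn, error) {
    host, port, err := net.SplitHostPort(addr)
    if err != nil {
        return nil, err
    }

    d.mu.Lock()
    ips, ok := d.cache[host]
    d.mu.Unlock()

    if !ok {
        ips, err = net.DefaultResolver.LookupHost(ctx, host)
        if err != nil {
            return nil, err
        }
        d.mu.Lock()
        d.cache[host] = ips
        d.mu.Unlock()
    }

    // Dial the cached IP; TLS still verifies against the original host name.
    return d.dialer.DialContext(ctx, network, net.JoinHostPort(ips[0], port))
}

func newCachingClient() *http.Client {
    d := &cachedDialer{
        cache:  make(map[string][]string),
        dialer: &net.Dialer{Timeout: 5 * time.Second, KeepAlive: 30 * time.Second},
    }
    return &http.Client{
        Transport: &http.Transport{DialContext: d.DialContext},
        Timeout:   30 * time.Second,
    }
}

In production you would respect DNS TTLs or use a dedicated caching resolver; the point here is that DialContext is the single place where that logic plugs in.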

Memory Management

Efficient memory usage is crucial for long-running scrapers that process thousands of pages.

Streaming Response Processing

For large responses, avoid loading entire content into memory:

package main

import (
    "bufio"
    "fmt"
    "net/http"
    "strings"
)

func processResponseStream(url string) error {
    resp, err := http.Get(url)
    if err != nil {
        return err
    }
    defer resp.Body.Close()

    scanner := bufio.NewScanner(resp.Body)
    // HTML often contains lines longer than bufio.Scanner's default
    // 64 KB token limit, so give it a bigger buffer up front.
    scanner.Buffer(make([]byte, 64*1024), 1024*1024)

    for scanner.Scan() {
        line := scanner.Text()
        if strings.Contains(line, "target-data") {
            fmt.Printf("Found target data: %s\n", line)
        }
    }

    return scanner.Err()
}
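
Line scanning works for simple pattern matches, but HTML rarely lines up neatly. A streaming alternative is the tokenizer from golang.org/x/net/html, which walks the markup tag by tag without building a full document tree. A minimal sketch (link extraction is just an illustrative target):

package main

import (
    "fmt"
    "io"
    "net/http"

    "golang.org/x/net/html"
)

// extractLinks streams tokens from the response body instead of
// buffering the whole document in memory.
func extractLinks(url string) error {
    resp, err := http.Get(url)
    if err != nil {
        return err
    }
    defer resp.Body.Close()

    z := html.NewTokenizer(resp.Body)
    for {
        switch z.Next() {
        case html.ErrorToken:
            if z.Err() == io.EOF {
                return nil // reached the end of the document
            }
            return z.Err()
        case html.StartTagToken:
            name, hasAttr := z.TagName()
            if string(name) != "a" || !hasAttr {
                continue
            }
            for {
                key, val, more := z.TagAttr()
                if string(key) == "href" {
                    fmt.Printf("Found link: %s\n", val)
                }
                if !more {
                    break
                }
            }
        }
    }
}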

Pool Buffers for HTML Parsing

When using HTML parsing libraries like goquery, each document is built fresh from the response and cannot be reset for reuse. You can still reduce garbage collection pressure by pooling the byte buffers that hold response bodies with sync.Pool:

package main

import (
    "bytes"
    "fmt"
    "net/http"
    "sync"

    "github.com/PuerkitoBio/goquery"
)

// bufferPool reuses byte buffers between requests so large response
// bodies don't trigger a fresh allocation every time.
var bufferPool = sync.Pool{
    New: func() interface{} {
        return new(bytes.Buffer)
    },
}

func scrapeWithPool(url string) error {
    resp, err := http.Get(url)
    if err != nil {
        return err
    }
    defer resp.Body.Close()

    buf := bufferPool.Get().(*bytes.Buffer)
    buf.Reset()
    defer bufferPool.Put(buf)

    if _, err := buf.ReadFrom(resp.Body); err != nil {
        return err
    }

    doc, err := goquery.NewDocumentFromReader(bytes.NewReader(buf.Bytes()))
    if err != nil {
        return err
    }

    // Process the document.
    doc.Find("title").Each(func(i int, s *goquery.Selection) {
        fmt.Printf("Title: %s\n", s.Text())
    })

    return nil
}

Error Handling and Retry Logic

Robust error handling and retries with exponential backoff keep transient failures from degrading performance:

package main

import (
    "fmt"
    "math"
    "net/http"
    "time"
)

func scrapeWithRetry(url string, maxRetries int) (*http.Response, error) {
    client := &http.Client{Timeout: 10 * time.Second}

    for attempt := 0; attempt <= maxRetries; attempt++ {
        resp, err := client.Get(url)
        // Retry on network errors, 5xx responses, and 429 rate-limit responses.
        if err == nil && resp.StatusCode < 500 && resp.StatusCode != http.StatusTooManyRequests {
            return resp, nil
        }

        if resp != nil {
            resp.Body.Close()
        }

        if attempt < maxRetries {
            // Exponential backoff
            backoff := time.Duration(math.Pow(2, float64(attempt))) * time.Second
            fmt.Printf("Attempt %d failed, retrying in %v\n", attempt+1, backoff)
            time.Sleep(backoff)
        }
    }

    return nil, fmt.Errorf("failed to scrape %s after %d attempts", url, maxRetries+1)
}
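
The caller owns the returned response, so close it, and drain it if you want the connection reused. A minimal usage sketch, assuming "io" is added to the imports above:

func main() {
    resp, err := scrapeWithRetry("https://example.com", 3)
    if err != nil {
        fmt.Println(err)
        return
    }
    defer resp.Body.Close()

    // Drain the body so the underlying connection can be reused.
    io.Copy(io.Discard, resp.Body)

    fmt.Printf("Final status: %d\n", resp.StatusCode)
}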

Rate Limiting and Respectful Scraping

Proper rate limiting helps you avoid getting blocked while keeping throughput predictable:

package main

import (
    "context"
    "net/http"
    "time"

    "golang.org/x/time/rate"
)

type RateLimitedScraper struct {
    client  *http.Client
    limiter *rate.Limiter
}

func NewRateLimitedScraper(requestsPerSecond float64) *RateLimitedScraper {
    return &RateLimitedScraper{
        client: &http.Client{
            Timeout: 30 * time.Second,
        },
        limiter: rate.NewLimiter(rate.Limit(requestsPerSecond), 1),
    }
}

func (s *RateLimitedScraper) Scrape(url string) (*http.Response, error) {
    // Wait for rate limiter
    err := s.limiter.Wait(context.Background())
    if err != nil {
        return nil, err
    }

    return s.client.Get(url)
}
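
Usage looks like any other client call; for example, capping a hypothetical crawl at two requests per second (assuming "fmt" is added to the imports):

func main() {
    scraper := NewRateLimitedScraper(2) // at most 2 requests per second

    resp, err := scraper.Scrape("https://example.com")
    if err != nil {
        fmt.Println(err)
        return
    }
    defer resp.Body.Close()

    fmt.Printf("Status: %d\n", resp.StatusCode)
}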

Monitoring and Profiling

Use Go's built-in profiling tools to identify performance bottlenecks:

package main

import (
    "log"
    "net/http"
    _ "net/http/pprof"
    "runtime"
)

func main() {
    // Enable profiling endpoint
    go func() {
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()

    // GOMAXPROCS already defaults to the number of CPUs (since Go 1.5),
    // so this call is only needed if you want to override that default.
    runtime.GOMAXPROCS(runtime.NumCPU())

    // Your scraping code here
    // Access profiling at http://localhost:6060/debug/pprof/
}
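
With the endpoint running, you can pull profiles from another terminal, for example a 30-second CPU profile or a heap snapshot:

go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30
go tool pprof http://localhost:6060/debug/pprof/heap

Both commands open an interactive pprof session where top and web show the hottest functions.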

Performance Best Practices Summary

  1. Use goroutines wisely: Leverage concurrency but limit the number of concurrent requests
  2. Optimize HTTP clients: Configure connection pooling and timeouts appropriately
  3. Manage memory efficiently: Use streaming for large responses and object pooling
  4. Implement proper error handling: Use exponential backoff for retries
  5. Respect rate limits: Implement rate limiting to avoid getting blocked
  6. Monitor performance: Use profiling tools to identify bottlenecks

Conclusion

Go's performance advantages for web scraping come from its efficient concurrency model, excellent HTTP client library, and strong memory management. By following these performance considerations and implementing proper concurrency patterns, connection pooling, and error handling, you can build highly efficient scrapers capable of sustaining thousands of concurrent requests.

For JavaScript-heavy sites that require browser automation, consider integrating tools like Puppeteer with your Go application, or explore headless browser libraries designed for Go (such as chromedp) to maintain performance while handling dynamic content.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
