What is the best way to test Go web scraping code?

Testing Go web scraping code requires a multi-layered approach that covers unit tests, integration tests, and end-to-end testing scenarios. Since web scraping involves external dependencies like websites and network requests, proper testing strategies are crucial for building reliable and maintainable scraping applications.

Core Testing Strategies

1. Unit Testing with Mocked HTTP Responses

The foundation of testing web scraping code involves mocking HTTP responses to test your parsing logic without depending on external websites.

package scraper

import (
    "io"
    "net/http"
    "net/http/httptest"
    "strings"
    "testing"
)

// ProductScraper represents our scraper
type ProductScraper struct {
    client *http.Client
}

// Product represents scraped data
type Product struct {
    Name  string
    Price string
}

// ScrapeProduct scrapes product data from HTML
func (s *ProductScraper) ScrapeProduct(url string) (*Product, error) {
    resp, err := s.client.Get(url)
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()

    body, err := io.ReadAll(resp.Body)
    if err != nil {
        return nil, err
    }

    return s.parseProduct(string(body))
}

func (s *ProductScraper) parseProduct(html string) (*Product, error) {
    // Simple parsing logic for demonstration
    product := &Product{}

    if strings.Contains(html, `class="product-name">`) {
        start := strings.Index(html, `class="product-name">`) + len(`class="product-name">`)
        end := strings.Index(html[start:], "<")
        if end > 0 {
            product.Name = html[start : start+end]
        }
    }

    if strings.Contains(html, `class="price">`) {
        start := strings.Index(html, `class="price">`) + len(`class="price">`)
        end := strings.Index(html[start:], "<")
        if end > 0 {
            product.Price = html[start : start+end]
        }
    }

    return product, nil
}

// Test with mocked HTTP response
func TestScrapeProduct(t *testing.T) {
    // Create mock HTML response
    mockHTML := `
        <html>
            <body>
                <h1 class="product-name">Test Product</h1>
                <span class="price">$29.99</span>
            </body>
        </html>
    `

    // Create test server
    server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        w.Header().Set("Content-Type", "text/html")
        w.WriteHeader(http.StatusOK)
        w.Write([]byte(mockHTML))
    }))
    defer server.Close()

    // Test the scraper
    scraper := &ProductScraper{client: server.Client()}
    product, err := scraper.ScrapeProduct(server.URL)

    if err != nil {
        t.Fatalf("Expected no error, got %v", err)
    }

    if product.Name != "Test Product" {
        t.Errorf("Expected name 'Test Product', got '%s'", product.Name)
    }

    if product.Price != "$29.99" {
        t.Errorf("Expected price '$29.99', got '%s'", product.Price)
    }
}

2. Testing with Table-Driven Tests

Table-driven tests are excellent for testing different HTML structures and edge cases:

func TestParseProduct(t *testing.T) {
    scraper := &ProductScraper{}

    tests := []struct {
        name     string
        html     string
        expected Product
        hasError bool
    }{
        {
            name: "valid product",
            html: `<h1 class="product-name">Widget</h1><span class="price">$15.00</span>`,
            expected: Product{Name: "Widget", Price: "$15.00"},
            hasError: false,
        },
        {
            name: "missing price",
            html: `<h1 class="product-name">Widget</h1>`,
            expected: Product{Name: "Widget", Price: ""},
            hasError: false,
        },
        {
            name: "empty html",
            html: "",
            expected: Product{},
            hasError: false,
        },
        {
            name: "malformed html",
            html: `<h1 class="product-name">Unclosed tag`,
            // parseProduct needs a closing "<" after the name, so an
            // unclosed tag yields an empty Name rather than partial text.
            expected: Product{Name: "", Price: ""},
            hasError: false,
        },
    }

    for _, tt := range tests {
        t.Run(tt.name, func(t *testing.T) {
            result, err := scraper.parseProduct(tt.html)

            if (err != nil) != tt.hasError {
                t.Errorf("Expected hasError=%v, got error=%v", tt.hasError, err)
            }

            if result.Name != tt.expected.Name {
                t.Errorf("Expected name '%s', got '%s'", tt.expected.Name, result.Name)
            }

            if result.Price != tt.expected.Price {
                t.Errorf("Expected price '%s', got '%s'", tt.expected.Price, result.Price)
            }
        })
    }
}

3. Integration Testing with Real Websites

For integration tests, you can run against actual websites, but be mindful of rate limiting and site changes (a rate-limited variant follows the example below):

func TestIntegrationScrapeRealWebsite(t *testing.T) {
    if testing.Short() {
        t.Skip("Skipping integration test in short mode")
    }

    scraper := &ProductScraper{
        client: &http.Client{
            Timeout: 10 * time.Second,
        },
    }

    // Test against a stable, public API or your own test site
    product, err := scraper.ScrapeProduct("https://httpbin.org/html")

    if err != nil {
        t.Fatalf("Integration test failed: %v", err)
    }

    // Basic validation that we got some data
    if product == nil {
        t.Error("Expected product data, got nil")
    }
}
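
Because integration tests hit real servers, it's polite (and often necessary) to rate-limit your requests. Below is a minimal sketch using golang.org/x/time/rate; the one-request-per-second rate and the URL list are placeholder assumptions:

import (
    "context"
    "net/http"
    "testing"
    "time"

    "golang.org/x/time/rate"
)

func TestIntegrationWithRateLimit(t *testing.T) {
    if testing.Short() {
        t.Skip("Skipping integration test in short mode")
    }

    // A conservative placeholder rate: one request per second, no bursting.
    limiter := rate.NewLimiter(rate.Every(time.Second), 1)
    scraper := &ProductScraper{client: &http.Client{Timeout: 10 * time.Second}}

    // Placeholder URLs; point these at your own test pages.
    urls := []string{
        "https://httpbin.org/html",
        "https://httpbin.org/html",
    }

    for _, u := range urls {
        if err := limiter.Wait(context.Background()); err != nil {
            t.Fatalf("rate limiter: %v", err)
        }
        if _, err := scraper.ScrapeProduct(u); err != nil {
            t.Errorf("scraping %s failed: %v", u, err)
        }
    }
}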

4. Testing HTTP Client Behavior

Test different HTTP scenarios like timeouts, redirects, and error responses:

func TestHTTPClientBehavior(t *testing.T) {
    tests := []struct {
        name          string
        serverHandler http.HandlerFunc
        expectError   bool
    }{
        {
            name: "successful response",
            serverHandler: func(w http.ResponseWriter, r *http.Request) {
                w.WriteHeader(http.StatusOK)
                w.Write([]byte("<html></html>"))
            },
            expectError: false,
        },
        {
            name: "404 not found",
            serverHandler: func(w http.ResponseWriter, r *http.Request) {
                w.WriteHeader(http.StatusNotFound)
            },
            expectError: false, // ScrapeProduct never checks status codes; see the strict variant sketched below
        },
        {
            name: "server timeout",
            serverHandler: func(w http.ResponseWriter, r *http.Request) {
                time.Sleep(2 * time.Second) // Simulate slow server
                w.WriteHeader(http.StatusOK)
            },
            expectError: true, // Should timeout with 1-second client timeout
        },
    }

    for _, tt := range tests {
        t.Run(tt.name, func(t *testing.T) {
            server := httptest.NewServer(tt.serverHandler)
            defer server.Close()

            client := &http.Client{Timeout: 1 * time.Second}
            scraper := &ProductScraper{client: client}

            _, err := scraper.ScrapeProduct(server.URL)

            if (err != nil) != tt.expectError {
                t.Errorf("Expected error=%v, got error=%v", tt.expectError, err)
            }
        })
    }
}
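
The 404 case above passes with expectError: false only because ScrapeProduct never inspects the response status. If you prefer non-2xx responses to fail fast, one common strategy looks like the following sketch (ScrapeProductStrict is a hypothetical variant, and it additionally needs "fmt" imported):

func (s *ProductScraper) ScrapeProductStrict(url string) (*Product, error) {
    resp, err := s.client.Get(url)
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()

    // Treat any non-2xx status as an error instead of parsing the body.
    if resp.StatusCode < 200 || resp.StatusCode >= 300 {
        return nil, fmt.Errorf("unexpected status %d for %s", resp.StatusCode, url)
    }

    body, err := io.ReadAll(resp.Body)
    if err != nil {
        return nil, err
    }
    return s.parseProduct(string(body))
}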

Advanced Testing Techniques

5. Testing with the GoColly Framework

If you're using GoColly, here's how to test scrapers effectively:

package main

import (
    "net/http"
    "net/http/httptest"
    "testing"

    "github.com/gocolly/colly/v2"
)

func TestCollyScraper(t *testing.T) {
    // Mock HTML content
    mockHTML := `
        <html>
            <body>
                <div class="article">
                    <h2>Article Title 1</h2>
                    <p>Article content 1</p>
                </div>
                <div class="article">
                    <h2>Article Title 2</h2>
                    <p>Article content 2</p>
                </div>
            </body>
        </html>
    `

    // Create test server
    server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        w.Header().Set("Content-Type", "text/html")
        w.Write([]byte(mockHTML))
    }))
    defer server.Close()

    // Test the scraper
    var articles []string

    c := colly.NewCollector()
    c.OnHTML(".article h2", func(e *colly.HTMLElement) {
        articles = append(articles, e.Text)
    })

    err := c.Visit(server.URL)
    if err != nil {
        t.Fatalf("Visit failed: %v", err)
    }

    expected := []string{"Article Title 1", "Article Title 2"}
    if len(articles) != len(expected) {
        t.Errorf("Expected %d articles, got %d", len(expected), len(articles))
    }

    for i, title := range expected {
        if i >= len(articles) {
            t.Errorf("Missing article %d: expected '%s'", i, title)
            continue
        }
        if articles[i] != title {
            t.Errorf("Expected article %d to be '%s', got '%s'", i, title, articles[i])
        }
    }
}
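
You can also take the network out of Colly tests entirely by giving the collector a custom http.RoundTripper that serves fixture HTML. This is a sketch, assuming Colly v2's default of ignoring robots.txt; fixtureTransport and the fixture URL are invented for illustration, and the snippet additionally needs "io" and "strings" imported:

// fixtureTransport is a hypothetical RoundTripper that answers every
// request with canned HTML, so no test server or network is needed.
type fixtureTransport struct {
    html string
}

func (f *fixtureTransport) RoundTrip(req *http.Request) (*http.Response, error) {
    return &http.Response{
        StatusCode: http.StatusOK,
        Header:     http.Header{"Content-Type": []string{"text/html"}},
        Body:       io.NopCloser(strings.NewReader(f.html)),
        Request:    req,
    }, nil
}

func TestCollyWithFixtureTransport(t *testing.T) {
    var titles []string

    c := colly.NewCollector()
    c.WithTransport(&fixtureTransport{
        html: `<div class="article"><h2>Fixture Title</h2></div>`,
    })
    c.OnHTML(".article h2", func(e *colly.HTMLElement) {
        titles = append(titles, e.Text)
    })

    // The host is never resolved; the transport intercepts the request.
    if err := c.Visit("http://fixture.invalid/"); err != nil {
        t.Fatalf("Visit failed: %v", err)
    }

    if len(titles) != 1 || titles[0] != "Fixture Title" {
        t.Errorf("unexpected titles: %v", titles)
    }
}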

6. Benchmarking and Performance Testing

Performance testing is crucial for web scraping applications:

func BenchmarkScrapeProduct(b *testing.B) {
    mockHTML := `<h1 class="product-name">Test Product</h1><span class="price">$29.99</span>`

    server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        w.Write([]byte(mockHTML))
    }))
    defer server.Close()

    scraper := &ProductScraper{client: server.Client()}

    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        _, err := scraper.ScrapeProduct(server.URL)
        if err != nil {
            b.Fatal(err)
        }
    }
}

func BenchmarkParseProductOnly(b *testing.B) {
    mockHTML := `<h1 class="product-name">Test Product</h1><span class="price">$29.99</span>`
    scraper := &ProductScraper{}

    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        _, err := scraper.parseProduct(mockHTML)
        if err != nil {
            b.Fatal(err)
        }
    }
}

Testing Best Practices

7. Environment-Specific Testing

Use build tags and environment variables to control test execution (an environment-variable example follows the build-tag snippet):

//go:build integration

package scraper

import "testing"

func TestRealWebsiteIntegration(t *testing.T) {
    // This test only runs with: go test -tags=integration
    // Your integration test code here
}
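
Environment variables complement build tags by letting you point the same test at different targets. A minimal sketch, assuming a hypothetical SCRAPER_TEST_URL variable:

import (
    "net/http"
    "os"
    "testing"
    "time"
)

// Runs only when SCRAPER_TEST_URL is set, e.g.:
//   SCRAPER_TEST_URL=https://example.com go test ./...
func TestAgainstConfiguredSite(t *testing.T) {
    target := os.Getenv("SCRAPER_TEST_URL")
    if target == "" {
        t.Skip("SCRAPER_TEST_URL not set; skipping")
    }

    scraper := &ProductScraper{client: &http.Client{Timeout: 10 * time.Second}}
    if _, err := scraper.ScrapeProduct(target); err != nil {
        t.Fatalf("scraping %s failed: %v", target, err)
    }
}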

8. Testing Error Handling

Ensure your scraper handles network errors gracefully:

func TestErrorHandling(t *testing.T) {
    // Test with invalid URL
    scraper := &ProductScraper{client: &http.Client{}}
    _, err := scraper.ScrapeProduct("invalid-url")
    if err == nil {
        t.Error("Expected error for invalid URL")
    }

    // Test with an unreachable server (port 1 is almost always closed)
    _, err = scraper.ScrapeProduct("http://127.0.0.1:1")
    if err == nil {
        t.Error("Expected error for unreachable server")
    }
}
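
It's also worth verifying that slow responses are cut off by a context deadline. Since ScrapeProduct as written doesn't accept a context, this sketch drives the client directly with http.NewRequestWithContext (it needs "context" imported alongside the packages used above):

func TestContextTimeout(t *testing.T) {
    // Server that responds slower than the context allows.
    server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        time.Sleep(500 * time.Millisecond)
        w.WriteHeader(http.StatusOK)
    }))
    defer server.Close()

    ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
    defer cancel()

    req, err := http.NewRequestWithContext(ctx, http.MethodGet, server.URL, nil)
    if err != nil {
        t.Fatalf("building request: %v", err)
    }

    _, err = http.DefaultClient.Do(req)
    if err == nil {
        t.Error("Expected context deadline error, got nil")
    }
}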

Running Tests

Use these commands to run different types of tests:

# Run all tests
go test ./...

# Run only unit tests (exclude integration tests)
go test -short ./...

# Run integration tests
go test -tags=integration ./...

# Run tests with coverage
go test -cover ./...

# Run benchmarks
go test -bench=. ./...

# Run specific test
go test -run TestScrapeProduct ./...

# Run tests with verbose output
go test -v ./...

Conclusion

Testing Go web scraping code effectively requires combining multiple strategies: unit tests with mocked responses for parsing logic, integration tests for real-world scenarios, and performance benchmarks for optimization. By implementing comprehensive test coverage, you ensure your scraping applications are reliable, maintainable, and performant.

Remember to test edge cases, error conditions, and different HTML structures. When dealing with external websites, consider implementing robust retry mechanisms and proper timeout handling in your Go HTTP requests to make your scrapers more resilient in production environments.
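
As a starting point for that resilience, here's a minimal retry sketch with exponential backoff; scrapeWithRetry is a hypothetical helper, and the attempt limit and base delay are arbitrary placeholders:

// scrapeWithRetry retries ScrapeProduct up to maxAttempts times,
// doubling the delay between attempts.
func scrapeWithRetry(s *ProductScraper, url string, maxAttempts int) (*Product, error) {
    var lastErr error
    delay := 500 * time.Millisecond

    for attempt := 1; attempt <= maxAttempts; attempt++ {
        product, err := s.ScrapeProduct(url)
        if err == nil {
            return product, nil
        }
        lastErr = err
        if attempt < maxAttempts {
            time.Sleep(delay)
            delay *= 2 // exponential backoff
        }
    }
    return nil, fmt.Errorf("all %d attempts failed, last error: %w", maxAttempts, lastErr)
}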

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"

