How do I use Colly with testing frameworks for scraper validation?

Testing web scrapers built with Colly is crucial for ensuring reliability, catching regressions, and validating data extraction logic. This guide covers comprehensive testing strategies using Go's built-in testing framework and popular third-party libraries to create robust, maintainable test suites for your Colly scrapers.

Why Test Colly Scrapers?

Web scraping applications face unique challenges that make testing essential:

  • Website changes: Target sites frequently update their HTML structure
  • Network reliability: External dependencies can cause intermittent failures
  • Data validation: Ensuring extracted data meets quality standards
  • Rate limiting: Testing proper handling of API limits and delays
  • Error scenarios: Validating behavior under various failure conditions

Setting Up the Testing Environment

Basic Test Structure

Start by structuring the scraper around a collector field that tests can construct and swap out:

package scraper

import (
    "fmt"
    "strings"

    "github.com/gocolly/colly/v2"
    "github.com/gocolly/colly/v2/debug"
)

type ProductScraper struct {
    collector *colly.Collector
    baseURL   string
}

type Product struct {
    Name        string
    Price       string
    Description string
    ImageURL    string
}

func NewProductScraper(baseURL string) *ProductScraper {
    c := colly.NewCollector(
        colly.Debugger(&debug.LogDebugger{}),
    )

    return &ProductScraper{
        collector: c,
        baseURL:   baseURL,
    }
}

func (ps *ProductScraper) ScrapeProduct(productURL string) (*Product, error) {
    var product Product
    var err error

    ps.collector.OnHTML(".product-title", func(e *colly.HTMLElement) {
        product.Name = strings.TrimSpace(e.Text)
    })

    ps.collector.OnHTML(".price", func(e *colly.HTMLElement) {
        product.Price = strings.TrimSpace(e.Text)
    })

    ps.collector.OnHTML(".description", func(e *colly.HTMLElement) {
        product.Description = strings.TrimSpace(e.Text)
    })

    ps.collector.OnHTML("img.product-image", func(e *colly.HTMLElement) {
        product.ImageURL = e.Attr("src")
    })

    ps.collector.OnError(func(r *colly.Response, e error) {
        err = fmt.Errorf("request failed: %w", e)
    })

    visitErr := ps.collector.Visit(productURL)

    // Prefer the error wrapped by OnError: Visit also returns an error for
    // HTTP failures, but without the "request failed" context added above.
    if err != nil {
        return nil, err
    }
    if visitErr != nil {
        return nil, visitErr
    }

    return &product, nil
}
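
Note that ScrapeProduct registers its OnHTML and OnError callbacks on the shared collector every time it is called, so repeated calls accumulate handlers. A minimal sketch of one way to avoid that, using Colly's Clone() to get a fresh collector (same configuration, no callbacks) per scrape; ScrapeProductIsolated is a hypothetical variant, not part of the scraper above:

// ScrapeProductIsolated scrapes with a cloned collector so callbacks
// from earlier calls do not pile up on the shared instance.
func (ps *ProductScraper) ScrapeProductIsolated(productURL string) (*Product, error) {
    var product Product
    var scrapeErr error

    c := ps.collector.Clone() // same settings, empty callback list

    c.OnHTML(".product-title", func(e *colly.HTMLElement) {
        product.Name = strings.TrimSpace(e.Text)
    })
    c.OnError(func(r *colly.Response, e error) {
        scrapeErr = fmt.Errorf("request failed: %w", e)
    })

    if err := c.Visit(productURL); err != nil {
        if scrapeErr != nil {
            return nil, scrapeErr
        }
        return nil, err
    }

    return &product, nil
}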

Testing with Mock HTTP Servers

Using httptest for Unit Tests

Create isolated tests using Go's httptest package:

package scraper

import (
    "net/http"
    "net/http/httptest"
    "testing"

    "github.com/stretchr/testify/assert"
    "github.com/stretchr/testify/require"
)

func TestProductScraper_ScrapeProduct(t *testing.T) {
    tests := []struct {
        name            string
        htmlContent     string
        expectedProduct *Product
        expectError     bool
    }{
        {
            name: "successful product scraping",
            htmlContent: `
                <html>
                    <body>
                        <h1 class="product-title">Gaming Laptop</h1>
                        <span class="price">$1,299.99</span>
                        <p class="description">High-performance gaming laptop</p>
                        <img class="product-image" src="/images/laptop.jpg" alt="laptop">
                    </body>
                </html>
            `,
            expectedProduct: &Product{
                Name:        "Gaming Laptop",
                Price:       "$1,299.99",
                Description: "High-performance gaming laptop",
                ImageURL:    "/images/laptop.jpg",
            },
            expectError: false,
        },
        {
            name: "missing elements",
            htmlContent: `
                <html>
                    <body>
                        <h1 class="product-title">Incomplete Product</h1>
                    </body>
                </html>
            `,
            expectedProduct: &Product{
                Name:        "Incomplete Product",
                Price:       "",
                Description: "",
                ImageURL:    "",
            },
            expectError: false,
        },
    }

    for _, tt := range tests {
        t.Run(tt.name, func(t *testing.T) {
            // Create mock server
            server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
                w.Header().Set("Content-Type", "text/html")
                w.WriteHeader(http.StatusOK)
                w.Write([]byte(tt.htmlContent))
            }))
            defer server.Close()

            // Initialize scraper with mock server URL
            scraper := NewProductScraper(server.URL)

            // Execute scraping
            product, err := scraper.ScrapeProduct(server.URL + "/product/123")

            // Assertions
            if tt.expectError {
                assert.Error(t, err)
                assert.Nil(t, product)
            } else {
                require.NoError(t, err)
                require.NotNil(t, product)
                assert.Equal(t, tt.expectedProduct.Name, product.Name)
                assert.Equal(t, tt.expectedProduct.Price, product.Price)
                assert.Equal(t, tt.expectedProduct.Description, product.Description)
                assert.Equal(t, tt.expectedProduct.ImageURL, product.ImageURL)
            }
        })
    }
}

Testing Error Scenarios

Test how your scraper handles various error conditions:

func TestProductScraper_ErrorHandling(t *testing.T) {
    tests := []struct {
        name           string
        serverResponse func(w http.ResponseWriter, r *http.Request)
        expectError    bool
        errorContains  string
    }{
        {
            name: "404 not found",
            serverResponse: func(w http.ResponseWriter, r *http.Request) {
                w.WriteHeader(http.StatusNotFound)
                w.Write([]byte("Page not found"))
            },
            expectError:   true,
            errorContains: "request failed",
        },
        {
            name: "500 server error",
            serverResponse: func(w http.ResponseWriter, r *http.Request) {
                w.WriteHeader(http.StatusInternalServerError)
                w.Write([]byte("Internal server error"))
            },
            expectError:   true,
            errorContains: "request failed",
        },
        {
            name: "invalid HTML",
            serverResponse: func(w http.ResponseWriter, r *http.Request) {
                w.Header().Set("Content-Type", "text/html")
                w.WriteHeader(http.StatusOK)
                w.Write([]byte("<html><body><unclosed-tag></body></html>"))
            },
            expectError: false, // Colly handles malformed HTML gracefully
        },
    }

    for _, tt := range tests {
        t.Run(tt.name, func(t *testing.T) {
            server := httptest.NewServer(http.HandlerFunc(tt.serverResponse))
            defer server.Close()

            scraper := NewProductScraper(server.URL)
            product, err := scraper.ScrapeProduct(server.URL + "/product/123")

            if tt.expectError {
                assert.Error(t, err)
                assert.Contains(t, err.Error(), tt.errorContains)
                assert.Nil(t, product)
            } else {
                assert.NoError(t, err)
                assert.NotNil(t, product)
            }
        })
    }
}
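
Network-level failures such as timeouts are also worth covering. A minimal sketch, assuming a collector configured with SetRequestTimeout and a handler that deliberately outlasts it:

func TestProductScraper_Timeout(t *testing.T) {
    server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        // Respond only after the client has given up; returning when the
        // request context is cancelled keeps server.Close() fast.
        select {
        case <-time.After(500 * time.Millisecond):
        case <-r.Context().Done():
            return
        }
        w.Write([]byte(`<html><body></body></html>`))
    }))
    defer server.Close()

    c := colly.NewCollector()
    c.SetRequestTimeout(50 * time.Millisecond)
    scraper := &ProductScraper{collector: c, baseURL: server.URL}

    _, err := scraper.ScrapeProduct(server.URL + "/product/slow")
    assert.Error(t, err)
}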

Testing with HTML Fixtures

File-Based Test Data

Store HTML fixtures in separate files for better maintainability:

func TestProductScraper_WithFixtures(t *testing.T) {
    testCases := []struct {
        name          string
        fixtureFile   string
        expectedName  string
        expectedPrice string
    }{
        {
            name:         "amazon product page",
            fixtureFile:  "testdata/amazon_product.html",
            expectedName: "Amazon Echo Dot",
            expectedPrice: "$49.99",
        },
        {
            name:         "ebay product page",
            fixtureFile:  "testdata/ebay_product.html",
            expectedName: "Vintage Watch",
            expectedPrice: "$125.00",
        },
    }

    for _, tc := range testCases {
        t.Run(tc.name, func(t *testing.T) {
            // Read HTML fixture
            htmlContent, err := os.ReadFile(tc.fixtureFile)
            require.NoError(t, err)

            // Create mock server with fixture content
            server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
                w.Header().Set("Content-Type", "text/html")
                w.Write(htmlContent)
            }))
            defer server.Close()

            scraper := NewProductScraper(server.URL)
            product, err := scraper.ScrapeProduct(server.URL + "/product")

            require.NoError(t, err)
            assert.Equal(t, tc.expectedName, product.Name)
            assert.Equal(t, tc.expectedPrice, product.Price)
        })
    }
}
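
Fixtures drift out of date as target sites change their markup. One option is a small helper that refreshes them on demand; the UPDATE_FIXTURES environment variable and the helper itself are assumptions of this sketch, not a Colly feature:

// updateFixture re-downloads the live page and overwrites the stored
// fixture, but only when UPDATE_FIXTURES=1 is set explicitly.
func updateFixture(t *testing.T, liveURL, fixturePath string) {
    t.Helper()
    if os.Getenv("UPDATE_FIXTURES") != "1" {
        return
    }

    resp, err := http.Get(liveURL)
    require.NoError(t, err)
    defer resp.Body.Close()

    body, err := io.ReadAll(resp.Body)
    require.NoError(t, err)
    require.NoError(t, os.WriteFile(fixturePath, body, 0o644))
}

Running UPDATE_FIXTURES=1 go test ./... before a release then refreshes testdata deliberately instead of letting fixtures rot silently.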

Integration Testing with Real Websites

Controlled Integration Tests

For tests that exercise the full request path, guard them with testing.Short() so they can be skipped during fast runs, and prefer stable, controlled endpoints over pages you don't own:

func TestProductScraper_Integration(t *testing.T) {
    if testing.Short() {
        t.Skip("Skipping integration test in short mode")
    }

    // A controlled endpoint keeps the test deterministic while still
    // exercising the full HTTP round trip.
    server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        html := `
            <html>
                <body>
                    <h1 class="product-title">Integration Test Product</h1>
                    <span class="price">$99.99</span>
                </body>
            </html>
        `
        w.Header().Set("Content-Type", "text/html")
        w.Write([]byte(html))
    }))
    defer server.Close()

    scraper := NewProductScraper(server.URL)
    product, err := scraper.ScrapeProduct(server.URL)

    require.NoError(t, err)
    assert.Equal(t, "Integration Test Product", product.Name)
    assert.Equal(t, "$99.99", product.Price)
}
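
If you do need to hit a genuinely live site, gate the test behind an explicit opt-in on top of testing.Short(); the LIVE_SCRAPE_TESTS variable and the placeholder URL below are assumptions of this sketch:

func TestProductScraper_LiveSite(t *testing.T) {
    if testing.Short() {
        t.Skip("skipping live-site test in short mode")
    }
    if os.Getenv("LIVE_SCRAPE_TESTS") == "" {
        t.Skip("set LIVE_SCRAPE_TESTS=1 to run tests against live sites")
    }

    // Replace the placeholder with a page you control or one that is
    // stable enough to test against.
    scraper := NewProductScraper("https://example.com")
    product, err := scraper.ScrapeProduct("https://example.com")

    // Assert only coarse properties; live markup changes too often for
    // exact field comparisons.
    require.NoError(t, err)
    assert.NotNil(t, product)
}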

Testing Rate Limiting and Delays

Validating Rate Limiting Behavior

Test that your scraper properly implements rate limiting:

func TestProductScraper_RateLimiting(t *testing.T) {
    requestCount := 0
    requestTimes := make([]time.Time, 0)

    server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        requestCount++
        requestTimes = append(requestTimes, time.Now())

        w.Header().Set("Content-Type", "text/html")
        w.Write([]byte(`<html><body><h1 class="product-title">Test Product</h1></body></html>`))
    }))
    defer server.Close()

    // Create scraper with rate limiting
    c := colly.NewCollector()
    require.NoError(t, c.Limit(&colly.LimitRule{
        DomainGlob:  "*",
        Parallelism: 1,
        Delay:       500 * time.Millisecond,
    }))

    scraper := &ProductScraper{collector: c, baseURL: server.URL}

    // Make multiple requests
    for i := 0; i < 3; i++ {
        _, err := scraper.ScrapeProduct(server.URL + fmt.Sprintf("/product/%d", i))
        require.NoError(t, err)
    }

    // Verify rate limiting was applied
    assert.Equal(t, 3, requestCount)

    // Check that delays were respected (with some tolerance)
    for i := 1; i < len(requestTimes); i++ {
        timeDiff := requestTimes[i].Sub(requestTimes[i-1])
        assert.GreaterOrEqual(t, timeDiff, 450*time.Millisecond, "Rate limit not respected")
    }
}

Advanced Testing Patterns

Testing with Testify Suites

Organize complex test scenarios using testify suites:

package scraper

import (
    "net/http"
    "net/http/httptest"
    "testing"

    "github.com/stretchr/testify/suite"
)

type ScraperTestSuite struct {
    suite.Suite
    scraper *ProductScraper
    server  *httptest.Server
}

func (suite *ScraperTestSuite) SetupTest() {
    suite.server = httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        // Default response
        w.Header().Set("Content-Type", "text/html")
        w.Write([]byte(`<html><body><h1 class="product-title">Default Product</h1></body></html>`))
    }))

    suite.scraper = NewProductScraper(suite.server.URL)
}

func (suite *ScraperTestSuite) TearDownTest() {
    suite.server.Close()
}

func (suite *ScraperTestSuite) TestBasicScraping() {
    product, err := suite.scraper.ScrapeProduct(suite.server.URL + "/product")

    suite.NoError(err)
    suite.Equal("Default Product", product.Name)
}

func TestScraperTestSuite(t *testing.T) {
    suite.Run(t, new(ScraperTestSuite))
}

Property-Based Testing

Go's testing/quick package can generate random inputs to validate properties of your extraction logic:

func TestProductValidation_PropertyBased(t *testing.T) {
    // Property: the scraped name is always the trimmed input.
    property := func(name string) bool {
        trimmed := strings.TrimSpace(name)
        // Skip inputs the HTML parser would rewrite anyway (empty strings,
        // invalid UTF-8, control characters such as NUL or CR).
        if trimmed == "" || !utf8.ValidString(name) || strings.IndexFunc(name, unicode.IsControl) >= 0 {
            return true
        }

        // Escape the value so it survives HTML parsing unchanged.
        page := fmt.Sprintf(`<html><body><h1 class="product-title">  %s  </h1></body></html>`,
            html.EscapeString(name))

        server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            w.Header().Set("Content-Type", "text/html")
            w.Write([]byte(page))
        }))
        defer server.Close()

        scraper := NewProductScraper(server.URL)
        product, err := scraper.ScrapeProduct(server.URL)
        if err != nil {
            return false
        }

        return product.Name == trimmed
    }

    if err := quick.Check(property, nil); err != nil {
        t.Error(err)
    }
}

Best Practices for Colly Testing

1. Test Independence

Ensure each test is independent and can run in isolation:

func (suite *ScraperTestSuite) SetupTest() {
    // Create fresh instances for each test
    suite.scraper = NewProductScraper(suite.server.URL)
}

2. Use Dependency Injection

Make your scrapers testable by injecting dependencies:

type ScraperConfig struct {
    UserAgent string
    Timeout   time.Duration
    Retries   int
}

func NewConfigurableScraper(config ScraperConfig) *ProductScraper {
    c := colly.NewCollector()
    c.UserAgent = config.UserAgent
    c.SetRequestTimeout(config.Timeout)

    return &ProductScraper{collector: c}
}
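
A short usage sketch: because the configuration is injected, a test can spin up an httptest server and confirm that the configured User-Agent actually reaches the wire (the "scraper-tests/1.0" value is arbitrary):

func TestConfigurableScraper_UserAgent(t *testing.T) {
    var gotUserAgent string
    server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        gotUserAgent = r.Header.Get("User-Agent")
        w.Header().Set("Content-Type", "text/html")
        w.Write([]byte(`<html><body><h1 class="product-title">UA Test</h1></body></html>`))
    }))
    defer server.Close()

    scraper := NewConfigurableScraper(ScraperConfig{
        UserAgent: "scraper-tests/1.0",
        Timeout:   2 * time.Second,
    })

    _, err := scraper.ScrapeProduct(server.URL + "/product")
    require.NoError(t, err)
    assert.Equal(t, "scraper-tests/1.0", gotUserAgent)
}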

3. Validate Data Quality

Test not just extraction but also data validation:

func TestProductValidation(t *testing.T) {
    product := &Product{
        Name:  "Valid Product",
        Price: "$19.99",
    }

    assert.True(t, isValidPrice(product.Price))
    assert.True(t, len(product.Name) > 0)
}

func isValidPrice(price string) bool {
    // Dollar amount with cents, with or without comma grouping,
    // e.g. "$19.99" or "$1,299.99"
    matched, _ := regexp.MatchString(`^\$(\d+|\d{1,3}(,\d{3})+)\.\d{2}$`, price)
    return matched
}
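
A quick table-driven test keeps the validation rule itself honest, including the comma-grouped prices that appeared in the fixtures above:

func TestIsValidPrice(t *testing.T) {
    cases := map[string]bool{
        "$19.99":    true,
        "$1,299.99": true,
        "19.99":     false, // missing currency symbol
        "$19.9":     false, // needs two decimal places
        "":          false,
    }

    for price, want := range cases {
        assert.Equal(t, want, isValidPrice(price), "price %q", price)
    }
}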

Running Tests Efficiently

Test Organization

Structure your tests for optimal execution:

# Run only unit tests (fast)
go test -short ./...

# Run integration tests
go test -run Integration ./...

# Run with coverage
go test -cover ./...

# Run specific test suites
go test -run TestScraperTestSuite ./...
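
An alternative to name-based filtering is a build tag, which keeps slow tests out of default runs entirely; a minimal sketch:

//go:build integration

package scraper

// Files guarded by this tag only compile when the tag is supplied:
//
//   go test -tags=integration ./...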

Continuous Integration

Set up CI pipelines that separate fast and slow tests:

# .github/workflows/test.yml
name: Tests
on: [push, pull_request]
jobs:
  unit-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v5
        with:
          go-version: '1.22'
      - run: go test -short ./...

  integration-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v5
        with:
          go-version: '1.22'
      - run: go test -run Integration ./...

Conclusion

Testing Colly scrapers requires a multi-layered approach combining unit tests with mock servers, integration tests with real endpoints, and validation of error scenarios. By implementing comprehensive test suites, you can ensure your scrapers remain reliable as target websites evolve and your application grows.

The key to successful scraper testing is balancing thorough coverage with execution speed, using appropriate abstractions for testability, and maintaining test independence. With these patterns and practices, you'll build robust, maintainable web scraping applications that stand the test of time.

Remember to complement your Colly testing strategy with proper error handling and monitoring in production environments to catch issues that tests might miss.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
