How do I use Colly with testing frameworks for scraper validation?
Testing web scrapers built with Colly is crucial for ensuring reliability, catching regressions, and validating data extraction logic. This guide covers comprehensive testing strategies using Go's built-in testing framework and popular third-party libraries to create robust, maintainable test suites for your Colly scrapers.
Why Test Colly Scrapers?
Web scraping applications face unique challenges that make testing essential:
- Website changes: Target sites frequently update their HTML structure
- Network reliability: External dependencies can cause intermittent failures
- Data validation: Ensuring extracted data meets quality standards
- Rate limiting: Testing proper handling of API limits and delays
- Error scenarios: Validating behavior under various failure conditions
Setting Up the Testing Environment
Basic Test Structure
Start by structuring the scraper as a type whose collector and base URL are supplied through a constructor; this keeps it easy to point the scraper at test servers later:
package scraper

import (
    "fmt"
    "strings"

    "github.com/gocolly/colly/v2"
    "github.com/gocolly/colly/v2/debug"
)

type ProductScraper struct {
    collector *colly.Collector
    baseURL   string
}

type Product struct {
    Name        string
    Price       string
    Description string
    ImageURL    string
}

func NewProductScraper(baseURL string) *ProductScraper {
    c := colly.NewCollector(
        colly.Debugger(&debug.LogDebugger{}),
    )
    return &ProductScraper{
        collector: c,
        baseURL:   baseURL,
    }
}
func (ps *ProductScraper) ScrapeProduct(productURL string) (*Product, error) {
    // Clone the base collector so the handlers registered below do not
    // accumulate across repeated ScrapeProduct calls; the clone shares the
    // HTTP backend (and any rate limits) but starts with no callbacks.
    c := ps.collector.Clone()

    var product Product
    var scrapeErr error

    c.OnHTML(".product-title", func(e *colly.HTMLElement) {
        product.Name = strings.TrimSpace(e.Text)
    })
    c.OnHTML(".price", func(e *colly.HTMLElement) {
        product.Price = strings.TrimSpace(e.Text)
    })
    c.OnHTML(".description", func(e *colly.HTMLElement) {
        product.Description = strings.TrimSpace(e.Text)
    })
    c.OnHTML("img.product-image", func(e *colly.HTMLElement) {
        product.ImageURL = e.Attr("src")
    })
    c.OnError(func(r *colly.Response, e error) {
        scrapeErr = fmt.Errorf("request failed: %w", e)
    })

    if visitErr := c.Visit(productURL); visitErr != nil {
        // OnError usually fires for the same failure; prefer its wrapped
        // error so callers always see the "request failed" prefix.
        if scrapeErr != nil {
            return nil, scrapeErr
        }
        return nil, fmt.Errorf("request failed: %w", visitErr)
    }
    if scrapeErr != nil {
        return nil, scrapeErr
    }
    return &product, nil
}
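Before writing tests, it helps to see how the scraper is meant to be called. A minimal usage sketch; the module path and URLs are placeholders for your own project:
package main

import (
    "fmt"
    "log"

    "example.com/yourproject/scraper" // placeholder module path
)

func main() {
    ps := scraper.NewProductScraper("https://example.com")
    product, err := ps.ScrapeProduct("https://example.com/product/123")
    if err != nil {
        log.Fatalf("scrape failed: %v", err)
    }
    fmt.Printf("%s costs %s\n", product.Name, product.Price)
}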
Testing with Mock HTTP Servers
Using httptest for Unit Tests
Create isolated tests using Go's httptest package:
package scraper

import (
    "net/http"
    "net/http/httptest"
    "testing"

    "github.com/stretchr/testify/assert"
    "github.com/stretchr/testify/require"
)
func TestProductScraper_ScrapeProduct(t *testing.T) {
    tests := []struct {
        name            string
        htmlContent     string
        expectedProduct *Product
        expectError     bool
    }{
        {
            name: "successful product scraping",
            htmlContent: `
                <html>
                <body>
                    <h1 class="product-title">Gaming Laptop</h1>
                    <span class="price">$1,299.99</span>
                    <p class="description">High-performance gaming laptop</p>
                    <img class="product-image" src="/images/laptop.jpg" alt="laptop">
                </body>
                </html>
            `,
            expectedProduct: &Product{
                Name:        "Gaming Laptop",
                Price:       "$1,299.99",
                Description: "High-performance gaming laptop",
                ImageURL:    "/images/laptop.jpg",
            },
            expectError: false,
        },
        {
            name: "missing elements",
            htmlContent: `
                <html>
                <body>
                    <h1 class="product-title">Incomplete Product</h1>
                </body>
                </html>
            `,
            expectedProduct: &Product{
                Name:        "Incomplete Product",
                Price:       "",
                Description: "",
                ImageURL:    "",
            },
            expectError: false,
        },
    }

    for _, tt := range tests {
        t.Run(tt.name, func(t *testing.T) {
            // Create mock server
            server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
                w.Header().Set("Content-Type", "text/html")
                w.WriteHeader(http.StatusOK)
                w.Write([]byte(tt.htmlContent))
            }))
            defer server.Close()

            // Initialize scraper with mock server URL
            scraper := NewProductScraper(server.URL)

            // Execute scraping
            product, err := scraper.ScrapeProduct(server.URL + "/product/123")

            // Assertions
            if tt.expectError {
                assert.Error(t, err)
                assert.Nil(t, product)
            } else {
                require.NoError(t, err)
                require.NotNil(t, product)
                assert.Equal(t, tt.expectedProduct.Name, product.Name)
                assert.Equal(t, tt.expectedProduct.Price, product.Price)
                assert.Equal(t, tt.expectedProduct.Description, product.Description)
                assert.Equal(t, tt.expectedProduct.ImageURL, product.ImageURL)
            }
        })
    }
}
Testing Error Scenarios
Test how your scraper handles various error conditions:
func TestProductScraper_ErrorHandling(t *testing.T) {
    tests := []struct {
        name           string
        serverResponse func(w http.ResponseWriter, r *http.Request)
        expectError    bool
        errorContains  string
    }{
        {
            name: "404 not found",
            serverResponse: func(w http.ResponseWriter, r *http.Request) {
                w.WriteHeader(http.StatusNotFound)
                w.Write([]byte("Page not found"))
            },
            expectError:   true,
            errorContains: "request failed",
        },
        {
            name: "500 server error",
            serverResponse: func(w http.ResponseWriter, r *http.Request) {
                w.WriteHeader(http.StatusInternalServerError)
                w.Write([]byte("Internal server error"))
            },
            expectError:   true,
            errorContains: "request failed",
        },
        {
            name: "invalid HTML",
            serverResponse: func(w http.ResponseWriter, r *http.Request) {
                w.Header().Set("Content-Type", "text/html")
                w.WriteHeader(http.StatusOK)
                w.Write([]byte("<html><body><unclosed-tag></body></html>"))
            },
            expectError: false, // Colly handles malformed HTML gracefully
        },
    }

    for _, tt := range tests {
        t.Run(tt.name, func(t *testing.T) {
            server := httptest.NewServer(http.HandlerFunc(tt.serverResponse))
            defer server.Close()

            scraper := NewProductScraper(server.URL)
            product, err := scraper.ScrapeProduct(server.URL + "/product/123")

            if tt.expectError {
                assert.Error(t, err)
                assert.Contains(t, err.Error(), tt.errorContains)
                assert.Nil(t, product)
            } else {
                assert.NoError(t, err)
                assert.NotNil(t, product)
            }
        })
    }
}
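Timeouts are another failure mode worth covering. Below is a minimal sketch of one way to do it, assuming the collector's request timeout is lowered for the test; the 100 ms and 2 s durations are arbitrary values chosen for illustration:
func TestProductScraper_Timeout(t *testing.T) {
    // Server that responds more slowly than the scraper is willing to wait.
    server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        select {
        case <-time.After(2 * time.Second):
        case <-r.Context().Done(): // client gave up; stop stalling
        }
        w.Write([]byte(`<html><body><h1 class="product-title">Too Slow</h1></body></html>`))
    }))
    defer server.Close()

    c := colly.NewCollector()
    c.SetRequestTimeout(100 * time.Millisecond)
    scraper := &ProductScraper{collector: c, baseURL: server.URL}

    _, err := scraper.ScrapeProduct(server.URL + "/product/slow")
    assert.Error(t, err, "expected an error from the slow server")
}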
Testing with HTML Fixtures
File-Based Test Data
Store HTML fixtures in separate files for better maintainability:
func TestProductScraper_WithFixtures(t *testing.T) {
    testCases := []struct {
        name          string
        fixtureFile   string
        expectedName  string
        expectedPrice string
    }{
        {
            name:          "amazon product page",
            fixtureFile:   "testdata/amazon_product.html",
            expectedName:  "Amazon Echo Dot",
            expectedPrice: "$49.99",
        },
        {
            name:          "ebay product page",
            fixtureFile:   "testdata/ebay_product.html",
            expectedName:  "Vintage Watch",
            expectedPrice: "$125.00",
        },
    }

    for _, tc := range testCases {
        t.Run(tc.name, func(t *testing.T) {
            // Read HTML fixture
            htmlContent, err := os.ReadFile(tc.fixtureFile)
            require.NoError(t, err)

            // Create mock server with fixture content
            server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
                w.Header().Set("Content-Type", "text/html")
                w.Write(htmlContent)
            }))
            defer server.Close()

            scraper := NewProductScraper(server.URL)
            product, err := scraper.ScrapeProduct(server.URL + "/product")

            require.NoError(t, err)
            assert.Equal(t, tc.expectedName, product.Name)
            assert.Equal(t, tc.expectedPrice, product.Price)
        })
    }
}
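The fixtures live under testdata/, a directory name the Go toolchain ignores when building packages, and are typically saved copies of real pages. As a sketch, a minimal testdata/amazon_product.html matching the selectors and expected values above might look like this (the description and image are purely illustrative):
<html>
  <body>
    <h1 class="product-title">Amazon Echo Dot</h1>
    <span class="price">$49.99</span>
    <p class="description">Smart speaker with Alexa</p>
    <img class="product-image" src="/images/echo-dot.jpg" alt="Echo Dot">
  </body>
</html>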
Integration Testing with Real Websites
Controlled Integration Tests
When you do test against real websites, guard the tests so they only run when you ask for them, and prefer stable, predictable endpoints. The example below keeps the shape of such a test but serves the page from a local server so it stays deterministic:
func TestProductScraper_Integration(t *testing.T) {
    if testing.Short() {
        t.Skip("Skipping integration test in short mode")
    }

    // In a real integration test this server would be replaced by a stable,
    // predictable public endpoint; a local server keeps the example deterministic.
    server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        html := `
            <html>
            <body>
                <h1 class="product-title">Integration Test Product</h1>
                <span class="price">$99.99</span>
            </body>
            </html>
        `
        w.Header().Set("Content-Type", "text/html")
        w.Write([]byte(html))
    }))
    defer server.Close()

    scraper := NewProductScraper(server.URL)
    product, err := scraper.ScrapeProduct(server.URL)

    require.NoError(t, err)
    assert.Equal(t, "Integration Test Product", product.Name)
    assert.Equal(t, "$99.99", product.Price)
}
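For tests that really do hit a third-party site, an explicit opt-in keeps CI deterministic. A sketch of one common pattern, assuming a SCRAPER_INTEGRATION_URL environment variable that you define for your own project:
func TestProductScraper_LiveSite(t *testing.T) {
    target := os.Getenv("SCRAPER_INTEGRATION_URL")
    if target == "" {
        t.Skip("set SCRAPER_INTEGRATION_URL to run live-site integration tests")
    }

    scraper := NewProductScraper(target)
    product, err := scraper.ScrapeProduct(target)

    require.NoError(t, err)
    // Against a live page, assert on invariants rather than exact values.
    assert.NotEmpty(t, product.Name)
}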
Testing Rate Limiting and Delays
Validating Rate Limiting Behavior
Test that your scraper properly implements rate limiting:
func TestProductScraper_RateLimiting(t *testing.T) {
    requestCount := 0
    requestTimes := make([]time.Time, 0)

    server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        requestCount++
        requestTimes = append(requestTimes, time.Now())
        w.Header().Set("Content-Type", "text/html")
        w.Write([]byte(`<html><body><h1 class="product-title">Test Product</h1></body></html>`))
    }))
    defer server.Close()

    // Create scraper with rate limiting
    c := colly.NewCollector()
    err := c.Limit(&colly.LimitRule{
        DomainGlob:  "*",
        Parallelism: 1,
        Delay:       500 * time.Millisecond,
    })
    require.NoError(t, err)

    scraper := &ProductScraper{collector: c, baseURL: server.URL}

    // Make multiple requests
    for i := 0; i < 3; i++ {
        _, err := scraper.ScrapeProduct(server.URL + fmt.Sprintf("/product/%d", i))
        require.NoError(t, err)
    }

    // Verify rate limiting was applied
    assert.Equal(t, 3, requestCount)

    // Check that delays were respected (with some tolerance)
    for i := 1; i < len(requestTimes); i++ {
        timeDiff := requestTimes[i].Sub(requestTimes[i-1])
        assert.GreaterOrEqual(t, timeDiff, 450*time.Millisecond, "Rate limit not respected")
    }
}
Advanced Testing Patterns
Testing with Testify Suites
Organize complex test scenarios using testify suites:
package scraper

import (
    "net/http"
    "net/http/httptest"
    "testing"

    "github.com/stretchr/testify/suite"
)

type ScraperTestSuite struct {
    suite.Suite
    scraper *ProductScraper
    server  *httptest.Server
}

func (suite *ScraperTestSuite) SetupTest() {
    suite.server = httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        // Default response
        w.Header().Set("Content-Type", "text/html")
        w.Write([]byte(`<html><body><h1 class="product-title">Default Product</h1></body></html>`))
    }))
    suite.scraper = NewProductScraper(suite.server.URL)
}

func (suite *ScraperTestSuite) TearDownTest() {
    suite.server.Close()
}

func (suite *ScraperTestSuite) TestBasicScraping() {
    product, err := suite.scraper.ScrapeProduct(suite.server.URL + "/product")
    suite.NoError(err)
    suite.Equal("Default Product", product.Name)
}

func TestScraperTestSuite(t *testing.T) {
    suite.Run(t, new(ScraperTestSuite))
}
Property-Based Testing
Use property-based testing for robust validation:
func TestProductValidation_PropertyBased(t *testing.T) {
    // Property: the scraped name is always the trimmed version of the input.
    property := func(name string) bool {
        trimmed := strings.TrimSpace(name)
        if trimmed == "" {
            return true // skip empty inputs
        }
        // Skip inputs the HTML parser could rewrite (markup characters,
        // control characters, non-ASCII runes) so the property stays about
        // trimming rather than HTML escaping.
        if strings.ContainsAny(name, "<>&\"'") {
            return true
        }
        for _, r := range name {
            if r < 0x20 || r > 0x7e {
                return true
            }
        }

        html := fmt.Sprintf(`<html><body><h1 class="product-title"> %s </h1></body></html>`, name)
        server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            w.Header().Set("Content-Type", "text/html")
            w.Write([]byte(html))
        }))
        defer server.Close()

        scraper := NewProductScraper(server.URL)
        product, err := scraper.ScrapeProduct(server.URL)
        if err != nil {
            return false
        }
        return product.Name == trimmed
    }

    if err := quick.Check(property, nil); err != nil {
        t.Error(err)
    }
}
Best Practices for Colly Testing
1. Test Independence
Ensure each test is independent and can run in isolation:
func (suite *ScraperTestSuite) SetupTest() {
    // Create fresh instances for each test
    suite.scraper = NewProductScraper(suite.server.URL)
}
2. Use Dependency Injection
Make your scrapers testable by injecting dependencies:
type ScraperConfig struct {
    UserAgent string
    Timeout   time.Duration
    Retries   int
}

func NewConfigurableScraper(config ScraperConfig) *ProductScraper {
    c := colly.NewCollector()
    c.UserAgent = config.UserAgent
    c.SetRequestTimeout(config.Timeout)
    return &ProductScraper{collector: c}
}
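A test can then inject whatever configuration the scenario needs. A sketch using the ScraperConfig above; the User-Agent string is an arbitrary example value:
func TestConfigurableScraper_UsesInjectedConfig(t *testing.T) {
    cfg := ScraperConfig{
        UserAgent: "scraper-tests/1.0",
        Timeout:   2 * time.Second,
    }
    scraper := NewConfigurableScraper(cfg)

    server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        // The injected User-Agent should reach the server on every request.
        assert.Equal(t, "scraper-tests/1.0", r.Header.Get("User-Agent"))
        w.Header().Set("Content-Type", "text/html")
        w.Write([]byte(`<html><body><h1 class="product-title">Configured</h1></body></html>`))
    }))
    defer server.Close()

    product, err := scraper.ScrapeProduct(server.URL)
    require.NoError(t, err)
    assert.Equal(t, "Configured", product.Name)
}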
3. Validate Data Quality
Test not just extraction but also data validation:
func TestProductValidation(t *testing.T) {
    product := &Product{
        Name:  "Valid Product",
        Price: "$19.99",
    }
    assert.True(t, isValidPrice(product.Price))
    assert.True(t, len(product.Name) > 0)
}

func isValidPrice(price string) bool {
    // Implement price validation logic
    matched, _ := regexp.MatchString(`^\$\d+\.\d{2}$`, price)
    return matched
}
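Validation helpers like this are cheap to cover exhaustively with a small table. A sketch against the pattern above; note that it rejects thousands separators, so prices such as the earlier "$1,299.99" would need either normalization or a broader pattern:
func TestIsValidPrice(t *testing.T) {
    cases := map[string]bool{
        "$19.99":    true,
        "$0.99":     true,
        "$1,299.99": false, // commas are not matched by the current pattern
        "19.99":     false, // missing currency symbol
        "$19":       false, // missing cents
        "":          false,
    }
    for price, want := range cases {
        assert.Equal(t, want, isValidPrice(price), "price %q", price)
    }
}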
Running Tests Efficiently
Test Organization
Structure your tests for optimal execution:
# Run only unit tests (fast)
go test -short ./...
# Run integration tests
go test -run Integration ./...
# Run with coverage
go test -cover ./...
# Run specific test suites
go test -run TestScraperTestSuite ./...
Continuous Integration
Set up CI pipelines that separate fast and slow tests:
# .github/workflows/test.yml
name: Tests

on: [push, pull_request]

jobs:
  unit-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-go@v2
        with:
          go-version: 1.19
      - run: go test -short ./...

  integration-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-go@v2
        with:
          go-version: 1.19
      - run: go test -run Integration ./...
Conclusion
Testing Colly scrapers requires a multi-layered approach combining unit tests with mock servers, integration tests with real endpoints, and validation of error scenarios. By implementing comprehensive test suites, you can ensure your scrapers remain reliable as target websites evolve and your application grows.
The key to successful scraper testing is balancing thorough coverage with execution speed, using appropriate abstractions for testability, and maintaining test independence. With these patterns and practices, you'll build robust, maintainable web scraping applications that stand the test of time.
Remember to complement your Colly testing strategy with proper error handling and monitoring in production environments to catch issues that tests might miss.