How do I implement URL filtering and validation in Colly?
URL filtering and validation are essential parts of web scraping with Colly: they control which URLs your scraper visits and help keep the collected data clean. This guide covers the built-in filtering features, custom validation techniques, and best practices for robust URL management in your Colly scrapers.
Understanding Colly's URL Filtering System
Colly provides several built-in mechanisms for URL filtering and validation. The framework processes URLs through multiple stages:
- Pre-request filtering - Before making HTTP requests
- Response filtering - After receiving responses
- Link discovery filtering - When finding new URLs to crawl
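As a rough orientation (a minimal sketch, not an exhaustive list of Colly's hooks), the example below maps each stage to the collector callback most commonly used for it: OnRequest with Request.Abort for pre-request filtering, OnResponse for response filtering, and OnHTML for link discovery.

package main

import (
    "fmt"
    "strings"

    "github.com/gocolly/colly/v2"
)

func main() {
    c := colly.NewCollector()

    // Pre-request filtering: abort before the HTTP request is sent.
    c.OnRequest(func(r *colly.Request) {
        if strings.Contains(r.URL.Path, "/logout") {
            r.Abort()
        }
    })

    // Response filtering: inspect what came back.
    c.OnResponse(func(r *colly.Response) {
        fmt.Println("Got", len(r.Body), "bytes from", r.Request.URL)
    })

    // Link discovery filtering: decide which discovered links to follow.
    c.OnHTML("a[href]", func(e *colly.HTMLElement) {
        e.Request.Visit(e.Attr("href"))
    })

    c.Visit("https://example.com")
}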
Basic URL Filtering with AllowedDomains
The simplest form of URL filtering in Colly is domain restriction:
package main

import (
    "fmt"

    "github.com/gocolly/colly/v2"
)

func main() {
    c := colly.NewCollector()

    // Restrict crawling to specific domains
    c.AllowedDomains = []string{
        "example.com",
        "subdomain.example.com",
        "api.example.com",
    }

    c.OnHTML("a[href]", func(e *colly.HTMLElement) {
        link := e.Attr("href")
        // Visit only follows links whose domain is in AllowedDomains;
        // everything else is silently skipped.
        fmt.Println("Found link:", link)
        e.Request.Visit(link)
    })

    c.Visit("https://example.com")
}
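If you prefer configuring the collector entirely at construction time, the same restriction can be passed through the AllowedDomains collector option (DisallowedDomains exists for the inverse case). A minimal equivalent:

package main

import "github.com/gocolly/colly/v2"

func main() {
    // Equivalent configuration using a collector option.
    c := colly.NewCollector(
        colly.AllowedDomains("example.com", "subdomain.example.com", "api.example.com"),
    )

    c.Visit("https://example.com")
}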
Advanced URL Filtering with Regular Expressions
For more sophisticated filtering, use URL patterns with regular expressions:
package main

import (
    "fmt"
    "regexp"

    "github.com/gocolly/colly/v2"
    "github.com/gocolly/colly/v2/debug"
)

func main() {
    c := colly.NewCollector(
        colly.Debugger(&debug.LogDebugger{}),
    )

    // URL patterns to allow. The start page itself must also match one of
    // these patterns, otherwise the initial Visit call is rejected.
    allowedURLs := []*regexp.Regexp{
        regexp.MustCompile(`https://example\.com/?$`),
        regexp.MustCompile(`https://example\.com/products/.*`),
        regexp.MustCompile(`https://example\.com/categories/.*`),
        regexp.MustCompile(`https://api\.example\.com/v1/.*`),
    }

    // URL patterns to block
    blockedURLs := []*regexp.Regexp{
        regexp.MustCompile(`.*\.(jpg|jpeg|png|gif|pdf)$`),
        regexp.MustCompile(`.*/admin/.*`),
        regexp.MustCompile(`.*/logout.*`),
    }

    // URLFilters and DisallowedURLFilters are slices of *regexp.Regexp.
    // A request is rejected if it matches any DisallowedURLFilters entry,
    // and, when URLFilters is non-empty, also rejected if it matches none
    // of the URLFilters entries.
    c.DisallowedURLFilters = blockedURLs
    c.URLFilters = allowedURLs

    // Requests that pass both filters reach OnRequest.
    c.OnRequest(func(r *colly.Request) {
        fmt.Printf("Allowed URL: %s\n", r.URL)
    })

    c.OnHTML("a[href]", func(e *colly.HTMLElement) {
        link := e.Attr("href")
        if err := e.Request.Visit(link); err != nil {
            // Filtered or repeated links surface as errors such as
            // ErrForbiddenURL, ErrNoURLFiltersMatch or ErrAlreadyVisited.
            fmt.Printf("Blocked URL %s: %v\n", link, err)
        }
    })

    c.Visit("https://example.com")
}
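The same patterns can also be supplied as collector options via colly.URLFilters and colly.DisallowedURLFilters. A compact sketch (the patterns are shortened for brevity):

package main

import (
    "regexp"

    "github.com/gocolly/colly/v2"
)

func main() {
    c := colly.NewCollector(
        colly.URLFilters(
            regexp.MustCompile(`https://example\.com/?$`),
            regexp.MustCompile(`https://example\.com/products/.*`),
        ),
        colly.DisallowedURLFilters(
            regexp.MustCompile(`.*/admin/.*`),
        ),
    )

    c.Visit("https://example.com")
}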
Implementing Custom URL Validation
Create sophisticated validation logic using custom functions:
package main

import (
    "fmt"
    "net/url"
    "strings"

    "github.com/gocolly/colly/v2"
)

type URLValidator struct {
    maxDepth       int
    allowedSchemes []string
    requiredParams []string
    forbiddenPaths []string
}

func NewURLValidator() *URLValidator {
    return &URLValidator{
        maxDepth:       5,
        allowedSchemes: []string{"http", "https"},
        requiredParams: []string{},
        forbiddenPaths: []string{"/admin", "/private", "/internal"},
    }
}

func (v *URLValidator) ValidateURL(requestURL string, depth int) bool {
    // Parse the URL
    parsedURL, err := url.Parse(requestURL)
    if err != nil {
        fmt.Printf("Invalid URL format: %s\n", requestURL)
        return false
    }

    // Check depth limit
    if depth > v.maxDepth {
        fmt.Printf("URL exceeds max depth (%d): %s\n", v.maxDepth, requestURL)
        return false
    }

    // Validate scheme
    if !v.isAllowedScheme(parsedURL.Scheme) {
        fmt.Printf("Disallowed scheme (%s): %s\n", parsedURL.Scheme, requestURL)
        return false
    }

    // Check forbidden paths
    if v.isForbiddenPath(parsedURL.Path) {
        fmt.Printf("Forbidden path: %s\n", requestURL)
        return false
    }

    // Validate required parameters
    if !v.hasRequiredParams(parsedURL.Query()) {
        fmt.Printf("Missing required parameters: %s\n", requestURL)
        return false
    }

    // Additional custom validation logic
    if v.isSpamURL(requestURL) {
        fmt.Printf("Detected spam URL: %s\n", requestURL)
        return false
    }

    return true
}

func (v *URLValidator) isAllowedScheme(scheme string) bool {
    for _, allowed := range v.allowedSchemes {
        if scheme == allowed {
            return true
        }
    }
    return false
}

func (v *URLValidator) isForbiddenPath(path string) bool {
    for _, forbidden := range v.forbiddenPaths {
        if strings.HasPrefix(path, forbidden) {
            return true
        }
    }
    return false
}

func (v *URLValidator) hasRequiredParams(params url.Values) bool {
    for _, required := range v.requiredParams {
        if !params.Has(required) {
            return false
        }
    }
    return true
}

func (v *URLValidator) isSpamURL(requestURL string) bool {
    spamIndicators := []string{
        "spam", "malware", "phishing", "suspicious",
        "too-many-redirects", "infinite-loop",
    }
    lowercaseURL := strings.ToLower(requestURL)
    for _, indicator := range spamIndicators {
        if strings.Contains(lowercaseURL, indicator) {
            return true
        }
    }
    return false
}
func main() {
    c := colly.NewCollector()
    validator := NewURLValidator()

    // Require query parameters only if every target URL carries them, e.g.
    // validator.requiredParams = []string{"api_key"} for an API-only crawl.
    // Leaving it empty here keeps the start page and HTML links crawlable.

    // Abort requests that fail validation before they are sent.
    c.OnRequest(func(r *colly.Request) {
        if !validator.ValidateURL(r.URL.String(), r.Depth) {
            r.Abort()
        }
    })

    c.OnHTML("a[href]", func(e *colly.HTMLElement) {
        link := e.Attr("href")
        e.Request.Visit(link)
    })

    c.Visit("https://example.com")
}
URL Normalization and Deduplication
Implement URL normalization to prevent visiting duplicate content:
package main

import (
    "fmt"
    "net/url"
    "strings"

    "github.com/gocolly/colly/v2"
)

type URLNormalizer struct {
    visitedURLs map[string]bool
}

func NewURLNormalizer() *URLNormalizer {
    return &URLNormalizer{
        visitedURLs: make(map[string]bool),
    }
}

func (n *URLNormalizer) NormalizeURL(rawURL string) (string, error) {
    parsedURL, err := url.Parse(rawURL)
    if err != nil {
        return "", err
    }

    // Lowercase the scheme and host
    parsedURL.Scheme = strings.ToLower(parsedURL.Scheme)
    parsedURL.Host = strings.ToLower(parsedURL.Host)

    // Remove default ports
    if parsedURL.Scheme == "http" {
        parsedURL.Host = strings.TrimSuffix(parsedURL.Host, ":80")
    } else if parsedURL.Scheme == "https" {
        parsedURL.Host = strings.TrimSuffix(parsedURL.Host, ":443")
    }

    // Normalize an empty path to "/"
    if parsedURL.Path == "" {
        parsedURL.Path = "/"
    }

    // Sort query parameters (url.Values.Encode writes keys in sorted order)
    if parsedURL.RawQuery != "" {
        parsedURL.RawQuery = parsedURL.Query().Encode()
    }

    // Remove fragment
    parsedURL.Fragment = ""

    return parsedURL.String(), nil
}

func (n *URLNormalizer) IsURLVisited(rawURL string) bool {
    normalizedURL, err := n.NormalizeURL(rawURL)
    if err != nil {
        return true // Treat invalid URLs as visited so they are skipped
    }
    if n.visitedURLs[normalizedURL] {
        return true
    }
    n.visitedURLs[normalizedURL] = true
    return false
}
func main() {
    c := colly.NewCollector()
    normalizer := NewURLNormalizer()

    // Abort requests whose normalized form has already been visited.
    c.OnRequest(func(r *colly.Request) {
        if normalizer.IsURLVisited(r.URL.String()) {
            fmt.Printf("Skipping duplicate URL: %s\n", r.URL)
            r.Abort()
        }
    })

    c.OnHTML("a[href]", func(e *colly.HTMLElement) {
        link := e.Attr("href")
        e.Request.Visit(link)
    })

    c.Visit("https://example.com")
}
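Colly itself refuses to revisit a URL it has already fetched (unless AllowURLRevisit is enabled), but that check compares exact URL strings; normalization additionally collapses variants that differ only in case, fragment, default port, or query-parameter order. You can sanity-check the normalizer defined above with a quick snippet such as this:

n := NewURLNormalizer()
a, _ := n.NormalizeURL("HTTPS://Example.com:443/items?b=2&a=1#top")
b, _ := n.NormalizeURL("https://example.com/items?a=1&b=2")
fmt.Println(a == b) // true: both normalize to https://example.com/items?a=1&b=2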
Content-Based URL Filtering
Filter URLs based on response content characteristics:
package main

import (
    "fmt"
    "strings"

    "github.com/gocolly/colly/v2"
)

func main() {
    c := colly.NewCollector()

    // Classify responses by their content. Note that returning from this
    // callback only ends the OnResponse handler; other callbacks such as
    // OnHTML still run for the same response.
    c.OnResponse(func(r *colly.Response) {
        contentType := r.Headers.Get("Content-Type")
        contentLength := len(r.Body)

        // Skip non-HTML content
        if !strings.Contains(contentType, "text/html") {
            fmt.Printf("Skipping non-HTML content: %s\n", r.Request.URL)
            return
        }

        // Skip empty or very small responses
        if contentLength < 100 {
            fmt.Printf("Skipping small response (%d bytes): %s\n",
                contentLength, r.Request.URL)
            return
        }

        // Skip responses whose body looks like an error page
        bodyText := string(r.Body)
        errorIndicators := []string{
            "404 Not Found",
            "Access Denied",
            "Page Not Available",
            "Under Maintenance",
        }
        for _, indicator := range errorIndicators {
            if strings.Contains(bodyText, indicator) {
                fmt.Printf("Skipping error page: %s\n", r.Request.URL)
                return
            }
        }

        fmt.Printf("Valid content found: %s\n", r.Request.URL)
    })

    c.OnHTML("a[href]", func(e *colly.HTMLElement) {
        link := e.Attr("href")
        e.Request.Visit(link)
    })

    c.Visit("https://example.com")
}
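Returning early from OnResponse keeps the unwanted page out of your own processing, but by that point the body has already been downloaded. If your colly version provides the OnResponseHeaders hook (available in recent v2 releases), you can abort unwanted responses before the body is fetched; a minimal sketch under that assumption:

package main

import (
    "fmt"
    "strings"

    "github.com/gocolly/colly/v2"
)

func main() {
    c := colly.NewCollector()

    // Abort before the body is downloaded when the Content-Type is not HTML.
    c.OnResponseHeaders(func(r *colly.Response) {
        if !strings.Contains(r.Headers.Get("Content-Type"), "text/html") {
            fmt.Println("Aborting non-HTML response:", r.Request.URL)
            r.Request.Abort()
        }
    })

    c.OnHTML("a[href]", func(e *colly.HTMLElement) {
        e.Request.Visit(e.Attr("href"))
    })

    c.Visit("https://example.com")
}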
Performance Optimization for URL Filtering
Implement efficient filtering for large-scale scraping operations:
package main

import (
    "fmt"
    "sync"

    "github.com/gocolly/colly/v2"
)

type HighPerformanceFilter struct {
    domainWhitelist map[string]bool
    pathBlacklist   map[string]bool
    mu              sync.RWMutex
}

func NewHighPerformanceFilter() *HighPerformanceFilter {
    return &HighPerformanceFilter{
        domainWhitelist: make(map[string]bool),
        pathBlacklist:   make(map[string]bool),
    }
}

func (f *HighPerformanceFilter) AddAllowedDomain(domain string) {
    f.mu.Lock()
    defer f.mu.Unlock()
    f.domainWhitelist[domain] = true
}

func (f *HighPerformanceFilter) AddBlockedPath(path string) {
    f.mu.Lock()
    defer f.mu.Unlock()
    f.pathBlacklist[path] = true
}

func (f *HighPerformanceFilter) IsAllowed(r *colly.Request) bool {
    f.mu.RLock()
    defer f.mu.RUnlock()

    // Fast domain check (map lookup instead of regex matching)
    if !f.domainWhitelist[r.URL.Host] {
        return false
    }

    // Fast exact-path check
    if f.pathBlacklist[r.URL.Path] {
        return false
    }

    return true
}

func main() {
    c := colly.NewCollector()
    filter := NewHighPerformanceFilter()

    // Configure allowed domains
    filter.AddAllowedDomain("example.com")
    filter.AddAllowedDomain("api.example.com")

    // Configure blocked paths
    filter.AddBlockedPath("/admin")
    filter.AddBlockedPath("/private")

    // Abort requests the filter rejects before they are sent.
    c.OnRequest(func(r *colly.Request) {
        if !filter.IsAllowed(r) {
            fmt.Println("Blocked URL:", r.URL)
            r.Abort()
        }
    })

    c.OnHTML("a[href]", func(e *colly.HTMLElement) {
        link := e.Attr("href")
        e.Request.Visit(link)
    })

    c.Visit("https://example.com")
}
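For large crawls the collector is usually run asynchronously. The filter above is safe to share across worker goroutines because every read goes through the RWMutex; if the whitelist and blacklist are fully populated before the crawl starts, the mutex can even be dropped, since concurrent reads of an unchanging map are safe. A minimal async setup, omitting the filter registration shown above:

package main

import "github.com/gocolly/colly/v2"

func main() {
    // Async collectors return from Visit immediately and crawl in
    // background goroutines.
    c := colly.NewCollector(colly.Async(true))

    c.OnHTML("a[href]", func(e *colly.HTMLElement) {
        e.Request.Visit(e.Attr("href"))
    })

    c.Visit("https://example.com")
    c.Wait() // block until all queued requests have finished
}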
Error Handling and Logging
Implement comprehensive error handling for URL validation:
package main

import (
    "log"
    "os"

    "github.com/gocolly/colly/v2"
)

func main() {
    // Set up logging
    logFile, err := os.OpenFile("url_filtering.log",
        os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0666)
    if err != nil {
        log.Fatal("Failed to open log file:", err)
    }
    defer logFile.Close()

    logger := log.New(logFile, "URL_FILTER: ", log.LstdFlags)

    c := colly.NewCollector()

    c.OnRequest(func(r *colly.Request) {
        u := r.URL.String()

        // Log panics from the filtering logic instead of crashing the crawl.
        // If the panic happens before Abort is called, the request still runs.
        defer func() {
            if rec := recover(); rec != nil {
                logger.Printf("Panic during URL filtering for %s: %v", u, rec)
            }
        }()

        // Your filtering logic here
        if len(u) > 2000 {
            logger.Printf("Rejected URL (too long): %s", u)
            r.Abort()
            return
        }

        logger.Printf("Accepted URL: %s", u)
    })

    c.OnError(func(r *colly.Response, err error) {
        logger.Printf("Error visiting %s: %s", r.Request.URL, err.Error())
    })

    c.OnHTML("a[href]", func(e *colly.HTMLElement) {
        link := e.Attr("href")
        e.Request.Visit(link)
    })

    c.Visit("https://example.com")
}
Best Practices for URL Filtering and Validation
1. Start with Conservative Filtering
Begin with strict filtering rules and gradually relax them based on your needs.
2. Use Multiple Filter Layers
Combine domain restrictions, regex patterns, and custom validation for comprehensive filtering.
3. Monitor Filter Performance
Track filtering decisions and adjust rules based on actual crawling patterns.
4. Handle Edge Cases
Account for internationalized domain names, unusual URL structures, and encoding issues.
5. Implement Rate Limiting Aware Filtering
Match how aggressively you enqueue and visit filtered URLs to the target site's rate limits. Colly's LimitRule can throttle parallelism and add per-domain delays, so filtering and politeness work together; see the sketch below.
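A hedged sketch of that combination (the domain glob, delays, and parallelism are placeholders; tune them to the site's documented limits):

package main

import (
    "time"

    "github.com/gocolly/colly/v2"
)

func main() {
    c := colly.NewCollector(
        colly.AllowedDomains("example.com"),
        colly.Async(true),
    )

    // Throttle requests per domain so filtering and politeness work together.
    c.Limit(&colly.LimitRule{
        DomainGlob:  "*example.com*",
        Parallelism: 2,
        Delay:       1 * time.Second,
        RandomDelay: 500 * time.Millisecond,
    })

    c.OnHTML("a[href]", func(e *colly.HTMLElement) {
        e.Request.Visit(e.Attr("href"))
    })

    c.Visit("https://example.com")
    c.Wait()
}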
Testing URL Filters
Test your URL filtering logic thoroughly:
# Run the Go scraper with debug output
go run main.go
# Check log files for filtering decisions
tail -f url_filtering.log
# List open connections to confirm only expected hosts are contacted
netstat -an | grep -E ':(80|443)'
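The filtering logic itself is plain Go, so the most reliable checks are unit tests. A minimal table-driven sketch for the URLValidator defined earlier (the file name and test cases are illustrative):

// url_validator_test.go
package main

import "testing"

func TestValidateURL(t *testing.T) {
    v := NewURLValidator()

    cases := []struct {
        url   string
        depth int
        want  bool
    }{
        {"https://example.com/products/1", 1, true},
        {"ftp://example.com/file", 1, false},           // disallowed scheme
        {"https://example.com/admin/panel", 1, false},  // forbidden path
        {"https://example.com/", 10, false},            // exceeds max depth
    }

    for _, tc := range cases {
        if got := v.ValidateURL(tc.url, tc.depth); got != tc.want {
            t.Errorf("ValidateURL(%q, %d) = %v, want %v", tc.url, tc.depth, got, tc.want)
        }
    }
}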
Advanced JavaScript Integration
Colly only fetches and parses static HTML, so filtering decisions that depend on JavaScript-rendered content or authenticated sessions cannot be made by Colly alone. In those cases, pair it with a headless browser to render pages or manage sessions, then feed the discovered URLs back into Colly for filtered crawling.
Conclusion
Effective URL filtering and validation in Colly requires a multi-layered approach combining built-in features with custom logic. By implementing the techniques covered in this guide, you can create robust, efficient scrapers that respect website boundaries while maximizing data collection quality.
Remember to regularly review and update your filtering rules as websites evolve, and always implement proper error handling and logging to maintain scraper reliability in production environments. Start with simple domain-based filtering and gradually add complexity as your scraping requirements become more sophisticated.