Can Colly Handle WebSocket Connections for Real-Time Data?

Colly, the popular Go web scraping framework, is primarily designed for HTTP-based scraping and does not natively support WebSocket connections. Colly excels at crawling static web pages, handling forms, and managing HTTP sessions, but WebSocket connections require a different approach due to their persistent, bidirectional nature.

However, this doesn't mean you can't handle real-time data when building scrapers with Go. This article explores Colly's limitations regarding WebSockets and provides practical alternatives for real-time data collection.

Understanding Colly's HTTP-Only Architecture

Colly is built around Go's net/http package and focuses on traditional HTTP request-response cycles. Its core features include:

  • HTTP GET/POST requests
  • Cookie management
  • Rate limiting
  • Concurrent crawling
  • HTML parsing with goquery

The framework's architecture doesn't include support for the WebSocket protocol, which operates over a persistent TCP connection rather than discrete HTTP requests.

Why WebSockets Aren't Supported in Colly

WebSockets require:

  1. Protocol Upgrade: Converting from HTTP to WebSocket protocol
  2. Persistent Connections: Maintaining long-lived connections
  3. Bidirectional Communication: Handling both incoming and outgoing messages
  4. Event-Driven Architecture: Processing real-time events as they arrive

These requirements don't align with Colly's request-based model, which is optimized for crawling multiple pages efficiently.

Alternative Approaches for Real-Time Data with Go

1. Using Gorilla WebSocket with Colly

You can combine Colly for initial page discovery with the Gorilla WebSocket library for real-time connections:

package main

import (
    "log"
    "net/url"
    "os"
    "os/signal"

    "github.com/gocolly/colly/v2"
    "github.com/gorilla/websocket"
)

func main() {
    // Use Colly to discover WebSocket endpoints
    c := colly.NewCollector()

    var wsEndpoints []string

    c.OnHTML("script", func(e *colly.HTMLElement) {
        // Look for WebSocket URLs in JavaScript
        if contains(e.Text, "ws://") || contains(e.Text, "wss://") {
            // Extract WebSocket URL logic here
            wsEndpoints = append(wsEndpoints, extractWSURL(e.Text))
        }
    })

    if err := c.Visit("https://example.com"); err != nil {
        log.Fatal(err)
    }

    // Connect to discovered WebSocket endpoints
    for _, endpoint := range wsEndpoints {
        go connectWebSocket(endpoint)
    }

    // Keep program running
    interrupt := make(chan os.Signal, 1)
    signal.Notify(interrupt, os.Interrupt)
    <-interrupt
}

func connectWebSocket(wsURL string) {
    u, err := url.Parse(wsURL)
    if err != nil {
        log.Printf("invalid WebSocket URL %q: %v", wsURL, err)
        return
    }

    conn, _, err := websocket.DefaultDialer.Dial(u.String(), nil)
    if err != nil {
        // log.Fatal here would kill every goroutine; log and return instead
        log.Printf("dial %s: %v", wsURL, err)
        return
    }
    defer conn.Close()

    // Handle incoming messages
    for {
        _, message, err := conn.ReadMessage()
        if err != nil {
            log.Println("read:", err)
            return
        }
        log.Printf("recv: %s", message)

        // Process real-time data here
        processRealtimeData(message)
    }
}

func processRealtimeData(data []byte) {
    // Your data processing logic
    log.Printf("Processing: %s", string(data))
}

func extractWSURL(script string) string {
    // Implementation to extract WebSocket URL from JavaScript
    return "wss://example.com/websocket"
}

func contains(s, substr string) bool {
    // Equivalent to strings.Contains; written inline here to keep the
    // example's imports minimal.
    for i := 0; i+len(substr) <= len(s); i++ {
        if s[i:i+len(substr)] == substr {
            return true
        }
    }
    return false
}

2. Using Chrome DevTools Protocol (CDP)

For JavaScript-heavy applications, you can use Chrome DevTools Protocol to intercept WebSocket traffic:

package main

import (
    "context"
    "encoding/json"
    "log"
    "time"

    "github.com/chromedp/cdproto/network"
    "github.com/chromedp/chromedp"
)

func main() {
    // Create context
    ctx, cancel := chromedp.NewContext(context.Background())
    defer cancel()

    // Set up WebSocket message listener
    chromedp.ListenTarget(ctx, func(ev interface{}) {
        switch ev := ev.(type) {
        case *network.EventWebSocketFrameReceived:
            log.Printf("WebSocket message received: %s", ev.Response.PayloadData)
            // Process the real-time data
            processWebSocketMessage(ev.Response.PayloadData)
        case *network.EventWebSocketFrameSent:
            log.Printf("WebSocket message sent: %s", ev.Request.PayloadData)
        }
    })

    // Enable network events
    err := chromedp.Run(ctx,
        network.Enable(),
        chromedp.Navigate("https://example.com"),
        chromedp.Sleep(10*time.Second), // Wait for WebSocket connections
    )

    if err != nil {
        log.Fatal(err)
    }
}

func processWebSocketMessage(data string) {
    var message map[string]interface{}
    if err := json.Unmarshal([]byte(data), &message); err == nil {
        log.Printf("Parsed message: %+v", message)
        // Handle the real-time data according to your needs
    }
}

3. Hybrid Approach: Colly + WebSocket Client

Create a comprehensive solution that uses Colly for initial discovery and a separate WebSocket client for real-time data:

package main

import (
    "encoding/json"
    "log"
    "net/http"
    "regexp"
    "sync"

    "github.com/gocolly/colly/v2"
    "github.com/gorilla/websocket"
)

type ScrapingManager struct {
    collector     *colly.Collector
    wsConnections map[string]*websocket.Conn
    mu            sync.RWMutex
}

func NewScrapingManager() *ScrapingManager {
    c := colly.NewCollector()

    sm := &ScrapingManager{
        collector:     c,
        wsConnections: make(map[string]*websocket.Conn),
    }

    sm.setupCollyCallbacks()
    return sm
}

func (sm *ScrapingManager) setupCollyCallbacks() {
    // Extract WebSocket endpoints from HTML
    sm.collector.OnHTML("script", func(e *colly.HTMLElement) {
        wsURLs := sm.extractWebSocketURLs(e.Text)
        for _, url := range wsURLs {
            go sm.connectToWebSocket(url)
        }
    })

    // Handle AJAX endpoints that might upgrade to WebSocket
    sm.collector.OnHTML("[data-ws-endpoint]", func(e *colly.HTMLElement) {
        wsURL := e.Attr("data-ws-endpoint")
        if wsURL != "" {
            go sm.connectToWebSocket(wsURL)
        }
    })
}

func (sm *ScrapingManager) extractWebSocketURLs(script string) []string {
    wsRegex := regexp.MustCompile(`(wss?://[^\s"']+)`)
    matches := wsRegex.FindAllString(script, -1)
    return matches
}

func (sm *ScrapingManager) connectToWebSocket(wsURL string) {
    header := http.Header{}
    header.Set("User-Agent", "Mozilla/5.0 (compatible; Go-WebSocket-Client)")

    conn, _, err := websocket.DefaultDialer.Dial(wsURL, header)
    if err != nil {
        log.Printf("Failed to connect to %s: %v", wsURL, err)
        return
    }

    sm.mu.Lock()
    sm.wsConnections[wsURL] = conn
    sm.mu.Unlock()

    log.Printf("Connected to WebSocket: %s", wsURL)

    // Listen for messages
    for {
        _, message, err := conn.ReadMessage()
        if err != nil {
            log.Printf("WebSocket read error for %s: %v", wsURL, err)
            sm.mu.Lock()
            delete(sm.wsConnections, wsURL)
            sm.mu.Unlock()
            break
        }

        sm.handleRealtimeMessage(wsURL, message)
    }
}

func (sm *ScrapingManager) handleRealtimeMessage(source string, data []byte) {
    var message map[string]interface{}
    if err := json.Unmarshal(data, &message); err == nil {
        log.Printf("Real-time data from %s: %+v", source, message)
        // Process the data according to your business logic
        sm.processBusinessLogic(message)
    }
}

func (sm *ScrapingManager) processBusinessLogic(data map[string]interface{}) {
    // Implement your specific data processing logic here
    // This could include:
    // - Storing data in a database
    // - Triggering alerts
    // - Updating caches
    // - Forwarding to other services
}

func (sm *ScrapingManager) StartScraping(url string) error {
    return sm.collector.Visit(url)
}

func main() {
    manager := NewScrapingManager()

    if err := manager.StartScraping("https://example.com"); err != nil {
        log.Fatal(err)
    }

    // Keep the program running to maintain WebSocket connections
    select {}
}

When to Use WebSocket Alternatives

Consider these WebSocket-based approaches when you need:

  1. Live Financial Data: Stock prices, cryptocurrency rates
  2. Chat Applications: Real-time messaging systems
  3. Gaming Data: Live scores, player statistics
  4. IoT Monitoring: Sensor data streams
  5. Social Media Feeds: Live tweet streams, comments

Best Practices for Real-Time Scraping

1. Connection Management

type ConnectionManager struct {
    connections map[string]*websocket.Conn
    mu          sync.RWMutex
    maxRetries  int
}

// connect and listenForMessages are assumed helper methods, elided here.
func (cm *ConnectionManager) handleConnection(url string) {
    retries := 0
    for retries < cm.maxRetries {
        conn, err := cm.connect(url)
        if err != nil {
            retries++
            time.Sleep(time.Duration(retries) * time.Second)
            continue
        }

        cm.mu.Lock()
        cm.connections[url] = conn
        cm.mu.Unlock()

        cm.listenForMessages(url, conn)
        break
    }
}

2. Rate Limiting and Backoff

// handleConnection is an assumed helper that owns the read loop for a live connection.
func (sm *ScrapingManager) connectWithBackoff(wsURL string) {
    backoff := time.Second
    maxBackoff := time.Minute

    for {
        conn, _, err := websocket.DefaultDialer.Dial(wsURL, nil)
        if err != nil {
            log.Printf("Connection failed, retrying in %v: %v", backoff, err)
            time.Sleep(backoff)
            backoff *= 2
            if backoff > maxBackoff {
                backoff = maxBackoff
            }
            continue
        }

        backoff = time.Second // Reset on successful connection
        sm.handleConnection(wsURL, conn)
        break
    }
}

Integrating with Node.js WebSocket Solutions

For more complex real-time scenarios, you might consider integrating Go-based Colly scrapers with Node.js WebSocket clients:

const WebSocket = require('ws');
const axios = require('axios');

class RealtimeScrapingManager {
    constructor() {
        this.wsConnections = new Map();
        this.retryAttempts = 3;
    }

    async discoverWebSocketEndpoints(url) {
        try {
            const response = await axios.get(url);
            const html = response.data;

            // Extract WebSocket URLs from HTML
            const wsRegex = /wss?:\/\/[^\s"']+/g;
            const matches = html.match(wsRegex);

            return matches || [];
        } catch (error) {
            console.error('Failed to discover WebSocket endpoints:', error);
            return [];
        }
    }

    connectToWebSocket(wsUrl) {
        const ws = new WebSocket(wsUrl, {
            headers: {
                'User-Agent': 'Mozilla/5.0 (compatible; Node-WebSocket-Client)'
            }
        });

        ws.on('open', () => {
            console.log(`Connected to WebSocket: ${wsUrl}`);
            this.wsConnections.set(wsUrl, ws);
        });

        ws.on('message', (data) => {
            try {
                const message = JSON.parse(data);
                this.processRealtimeData(wsUrl, message);
            } catch (error) {
                console.log(`Raw message from ${wsUrl}:`, data.toString());
            }
        });

        ws.on('error', (error) => {
            console.error(`WebSocket error for ${wsUrl}:`, error);
        });

        ws.on('close', () => {
            console.log(`WebSocket closed: ${wsUrl}`);
            this.wsConnections.delete(wsUrl);

            // Implement reconnection logic
            setTimeout(() => {
                this.connectToWebSocket(wsUrl);
            }, 5000);
        });
    }

    processRealtimeData(source, data) {
        console.log(`Real-time data from ${source}:`, data);

        // Process according to your business logic
        this.handleBusinessLogic(data);
    }

    handleBusinessLogic(data) {
        // Implement your specific data processing logic here
        // - Store in database
        // - Trigger alerts
        // - Update caches
        // - Forward to other services
    }

    async start(url) {
        const endpoints = await this.discoverWebSocketEndpoints(url);

        endpoints.forEach(endpoint => {
            this.connectToWebSocket(endpoint);
        });
    }
}

// Usage
const manager = new RealtimeScrapingManager();
manager.start('https://example.com');

Monitoring and Debugging WebSocket Connections

When working with WebSocket connections alongside Colly, implement proper monitoring:

package main

import (
    "fmt"
    "sync"
    "time"
)

type WebSocketMonitor struct {
    connections map[string]*ConnectionStats
    mu          sync.RWMutex
}

type ConnectionStats struct {
    URL              string
    ConnectedAt      time.Time
    MessagesReceived int64
    LastMessageAt    time.Time
    Errors           int64
}

func NewWebSocketMonitor() *WebSocketMonitor {
    // Initialize the map up front; writing to a nil map panics.
    return &WebSocketMonitor{connections: make(map[string]*ConnectionStats)}
}

func (wsm *WebSocketMonitor) RecordConnection(url string) {
    wsm.mu.Lock()
    defer wsm.mu.Unlock()

    wsm.connections[url] = &ConnectionStats{
        URL:         url,
        ConnectedAt: time.Now(),
    }
}

func (wsm *WebSocketMonitor) RecordMessage(url string) {
    wsm.mu.Lock()
    defer wsm.mu.Unlock()

    if stats, exists := wsm.connections[url]; exists {
        stats.MessagesReceived++
        stats.LastMessageAt = time.Now()
    }
}

func (wsm *WebSocketMonitor) RecordError(url string) {
    wsm.mu.Lock()
    defer wsm.mu.Unlock()

    if stats, exists := wsm.connections[url]; exists {
        stats.Errors++
    }
}

func (wsm *WebSocketMonitor) GetStats() map[string]*ConnectionStats {
    wsm.mu.RLock()
    defer wsm.mu.RUnlock()

    result := make(map[string]*ConnectionStats)
    for k, v := range wsm.connections {
        result[k] = v
    }
    return result
}

func (wsm *WebSocketMonitor) PrintStats() {
    stats := wsm.GetStats()

    fmt.Println("WebSocket Connection Statistics:")
    fmt.Println("================================")

    for url, stat := range stats {
        fmt.Printf("URL: %s\n", url)
        fmt.Printf("  Connected: %s\n", stat.ConnectedAt.Format(time.RFC3339))
        fmt.Printf("  Messages: %d\n", stat.MessagesReceived)
        fmt.Printf("  Last Message: %s\n", stat.LastMessageAt.Format(time.RFC3339))
        fmt.Printf("  Errors: %d\n", stat.Errors)
        fmt.Println()
    }
}

Conclusion

While Colly doesn't natively support WebSocket connections, you can effectively combine it with dedicated WebSocket libraries to create powerful real-time scraping solutions. Use Colly for initial page analysis and endpoint discovery, then implement WebSocket clients using libraries like Gorilla WebSocket or Chrome DevTools Protocol for real-time data collection.

For applications that require extensive JavaScript execution and WebSocket handling, consider how to handle AJAX requests using Puppeteer or explore how to monitor network requests in Puppeteer for more comprehensive browser automation solutions.

This hybrid approach gives you the best of both worlds: Colly's efficient HTTP crawling capabilities combined with robust real-time data collection through WebSocket connections.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
