Can Colly handle JavaScript-heavy websites for scraping?

The Challenge: Colly and JavaScript

Colly is a powerful web scraping framework for Go that excels at scraping static HTML content. However, Colly cannot handle JavaScript-heavy websites out of the box because it lacks a JavaScript rendering engine. This limitation means:

  • Content generated by JavaScript won't be visible to Colly
  • Single Page Applications (SPAs) may appear empty
  • Dynamic content loaded via AJAX calls will be missed
  • Interactive elements requiring JavaScript won't be accessible
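To make the limitation concrete, here is a minimal sketch (the URL and selectors are placeholders) of what plain Colly sees when it fetches a client-rendered page: the collector receives only the pre-render shell, so a callback bound to JavaScript-generated elements never fires.

package main

import (
    "fmt"
    "log"

    "github.com/gocolly/colly/v2"
)

func main() {
    c := colly.NewCollector()

    // This selector targets content the page builds client-side; against the
    // raw server response it will simply never match.
    c.OnHTML("#dynamic-content .item", func(e *colly.HTMLElement) {
        fmt.Println("item:", e.Text)
    })

    // What Colly actually receives: the static shell, often little more than
    // an empty root <div> plus <script> tags.
    c.OnResponse(func(r *colly.Response) {
        fmt.Printf("fetched %d bytes of static HTML\n", len(r.Body))
    })

    if err := c.Visit("https://example-spa.com"); err != nil {
        log.Fatal(err)
    }
}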

Why JavaScript Matters for Modern Websites

Many modern websites rely heavily on JavaScript for:

  • Dynamic content loading: Content loaded after page initialization
  • AJAX requests: Data fetched asynchronously from APIs
  • Single Page Applications: React, Vue, and Angular applications
  • Infinite scroll: Content that loads as users scroll
  • Interactive elements: Dropdowns, modals, and dynamic forms

Solutions for JavaScript-Heavy Websites

1. Combine Colly with ChromeDP (Recommended)

ChromeDP is a Go library that controls a headless Chrome browser, making it perfect for rendering JavaScript before passing content to Colly.

package main

import (
    "context"
    "fmt"
    "log"
    "net/http"
    "net/http/httptest"
    "time"

    "github.com/chromedp/chromedp"
    "github.com/gocolly/colly/v2"
)

func main() {
    // Create headless browser context
    ctx, cancel := chromedp.NewContext(context.Background())
    defer cancel()

    // Set timeout for the entire operation
    ctx, cancel = context.WithTimeout(ctx, 30*time.Second)
    defer cancel()

    var renderedHTML string

    // Navigate and wait for JavaScript to render content
    err := chromedp.Run(ctx,
        chromedp.Navigate("https://example-spa.com"),
        chromedp.WaitVisible("#dynamic-content", chromedp.ByID),
        chromedp.Sleep(2*time.Second), // Additional wait for animations
        chromedp.OuterHTML("html", &renderedHTML),
    )
    if err != nil {
        log.Fatal("ChromeDP error: ", err)
    }

    // Colly has no API for parsing a raw HTML string, so one workaround is to
    // serve the rendered markup from an in-memory HTTP server and let the
    // collector visit it.
    srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        fmt.Fprint(w, renderedHTML)
    }))
    defer srv.Close()

    // Create Colly collector
    c := colly.NewCollector()

    // Set up scraping callbacks
    c.OnHTML(".product", func(e *colly.HTMLElement) {
        name := e.ChildText(".product-name")
        price := e.ChildText(".product-price")
        fmt.Printf("Product: %s, Price: %s\n", name, price)
    })

    c.OnError(func(r *colly.Response, err error) {
        log.Printf("Error: %s", err.Error())
    })

    // Parse the fully rendered page
    if err := c.Visit(srv.URL); err != nil {
        log.Fatal("Colly error: ", err)
    }
}

2. Advanced ChromeDP with Network Monitoring

For complex SPAs that make multiple API calls:

package main

import (
    "context"
    "fmt"
    "log"
    "net/http"
    "net/http/httptest"
    "sync/atomic"
    "time"

    "github.com/chromedp/cdproto/network"
    "github.com/chromedp/chromedp"
    "github.com/gocolly/colly/v2"
)

func scrapeWithNetworkMonitoring(url string) {
    // Create headless browser context
    ctx, cancel := chromedp.NewContext(context.Background())
    defer cancel()

    // Set a generous timeout for the whole operation
    ctx, cancel = context.WithTimeout(ctx, 60*time.Second)
    defer cancel()

    var renderedHTML string
    var requestCount int64 // updated from the event-listener goroutine, so use atomics

    // Monitor network requests
    chromedp.ListenTarget(ctx, func(ev interface{}) {
        if req, ok := ev.(*network.EventRequestWillBeSent); ok {
            n := atomic.AddInt64(&requestCount, 1)
            fmt.Printf("Request #%d: %s\n", n, req.Request.URL)
        }
    })

    err := chromedp.Run(ctx,
        network.Enable(),
        chromedp.Navigate(url),
        chromedp.WaitVisible("body", chromedp.ByQuery),
        // Wait for the network to settle (no new requests for 2 seconds)
        chromedp.ActionFunc(func(ctx context.Context) error {
            lastCount := atomic.LoadInt64(&requestCount)
            for {
                select {
                case <-ctx.Done():
                    return ctx.Err()
                case <-time.After(2 * time.Second):
                }
                current := atomic.LoadInt64(&requestCount)
                if current == lastCount {
                    return nil // no new requests, content likely loaded
                }
                lastCount = current
            }
        }),
        chromedp.OuterHTML("html", &renderedHTML),
    )
    if err != nil {
        log.Fatal(err)
    }

    // Process with Colly
    if err := processWithColly(renderedHTML); err != nil {
        log.Fatal(err)
    }
}

func processWithColly(html string) error {
    // As above, serve the rendered HTML locally so Colly can visit it.
    srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        fmt.Fprint(w, html)
    }))
    defer srv.Close()

    c := colly.NewCollector()

    c.OnHTML("[data-testid='product']", func(e *colly.HTMLElement) {
        fmt.Printf("Found product: %s\n", e.Text)
    })

    return c.Visit(srv.URL)
}

3. Error Handling and Retry Logic

func robustScraping(url string, maxRetries int) error {
    for attempt := 1; attempt <= maxRetries; attempt++ {
        // Run each attempt in its own function so the deferred cancels fire
        // per attempt instead of piling up until robustScraping returns.
        html, err := func() (string, error) {
            ctx, cancel := chromedp.NewContext(context.Background())
            defer cancel()

            ctx, cancel = context.WithTimeout(ctx, 30*time.Second)
            defer cancel()

            var html string
            err := chromedp.Run(ctx,
                chromedp.Navigate(url),
                chromedp.WaitVisible("#content", chromedp.ByID),
                chromedp.OuterHTML("html", &html),
            )
            return html, err
        }()

        if err == nil {
            return processWithColly(html)
        }

        log.Printf("Attempt %d failed: %v", attempt, err)
        if attempt < maxRetries {
            // Simple linear backoff before the next attempt
            time.Sleep(time.Duration(attempt) * time.Second)
        }
    }
    return fmt.Errorf("failed after %d attempts", maxRetries)
}

Alternative Approaches

1. Rod (Another Go Browser Automation Library)

// Rod offers a chainable "Must"-style API that some developers prefer
import "github.com/go-rod/rod"

browser := rod.New().MustConnect()
defer browser.MustClose() // shut the browser down when finished

page := browser.MustPage("https://example.com")
page.MustWaitLoad()
html := page.MustHTML()

// Hand the rendered markup to the same Colly pipeline used above
processWithColly(html)

2. API-First Approach

Before implementing browser automation, check if the website offers APIs:

// Often more efficient than scraping
resp, err := http.Get("https://api.example.com/data")
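If such an endpoint exists, the rendering step disappears entirely. Here is a sketch assuming a hypothetical JSON endpoint and response shape; decoding structured data is usually simpler and cheaper than rendering and parsing HTML.

package main

import (
    "encoding/json"
    "fmt"
    "log"
    "net/http"
    "time"
)

// Product mirrors a hypothetical API response shape.
type Product struct {
    Name  string `json:"name"`
    Price string `json:"price"`
}

func main() {
    client := &http.Client{Timeout: 10 * time.Second}

    resp, err := client.Get("https://api.example.com/data") // hypothetical endpoint
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    var products []Product
    if err := json.NewDecoder(resp.Body).Decode(&products); err != nil {
        log.Fatal(err)
    }

    for _, p := range products {
        fmt.Printf("%s: %s\n", p.Name, p.Price)
    }
}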

3. Pre-rendered Services

Some older sites still expose pre-rendered snapshots for crawlers via the now-deprecated AJAX-crawling scheme (appending ?_escaped_fragment_= to URLs). It can be worth checking, but don't build new work around it.

Best Practices

  1. Minimize browser usage: Only use headless browsers when necessary
  2. Cache rendered content: Avoid re-rendering identical pages
  3. Set appropriate timeouts: Prevent hanging operations
  4. Monitor resource usage: Headless browsers consume significant memory
  5. Implement rate limiting: Respect website resources (see the sketch after this list)
  6. Use connection pooling: Reuse browser instances when possible (the concluding sketch below shares one browser across goroutines)
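For rate limiting specifically, Colly ships with LimitRule. A minimal sketch follows; the domain glob, delays, and URLs are arbitrary examples.

package main

import (
    "fmt"
    "log"
    "time"

    "github.com/gocolly/colly/v2"
)

func main() {
    // Async collector so the Parallelism setting below actually matters.
    c := colly.NewCollector(colly.Async(true))

    // At most 2 concurrent requests per matching domain, with a delay
    // (plus jitter) between them.
    err := c.Limit(&colly.LimitRule{
        DomainGlob:  "*example.com*",
        Parallelism: 2,
        Delay:       1 * time.Second,
        RandomDelay: 500 * time.Millisecond,
    })
    if err != nil {
        log.Fatal(err)
    }

    c.OnHTML("title", func(e *colly.HTMLElement) {
        fmt.Println("title:", e.Text)
    })

    for _, url := range []string{"https://example.com/a", "https://example.com/b"} {
        if err := c.Visit(url); err != nil {
            log.Println(err)
        }
    }
    c.Wait() // wait for the async requests to finish
}

When the actual fetching happens inside a headless browser, the same idea applies, but the throttling has to wrap the chromedp navigations rather than Colly's own HTTP requests.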

When to Choose Alternatives

Consider using JavaScript-native tools if you primarily scrape JavaScript-heavy sites:

  • Puppeteer (Node.js): More mature ecosystem
  • Playwright (multiple languages): Cross-browser support
  • Selenium (multiple languages): Widely supported

For Go developers, the Colly + ChromeDP combination provides excellent performance and maintainability while leveraging Go's concurrency features.
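As a closing sketch of that combination (the URLs are placeholders and the parsing step is stubbed out), a single headless browser can be shared across goroutines: chromedp contexts derived from a parent context open new tabs in the same browser, and a buffered channel caps how many pages render at once.

package main

import (
    "context"
    "log"
    "sync"
    "time"

    "github.com/chromedp/chromedp"
)

func main() {
    // One browser for the whole run; per-URL contexts derived from it open
    // tabs instead of launching new Chrome processes.
    browserCtx, cancel := chromedp.NewContext(context.Background())
    defer cancel()
    if err := chromedp.Run(browserCtx); err != nil { // start the browser
        log.Fatal(err)
    }

    urls := []string{"https://example.com/1", "https://example.com/2", "https://example.com/3"}
    sem := make(chan struct{}, 2) // at most 2 tabs rendering at once
    var wg sync.WaitGroup

    for _, u := range urls {
        wg.Add(1)
        go func(u string) {
            defer wg.Done()
            sem <- struct{}{}
            defer func() { <-sem }()

            tabCtx, cancelTab := chromedp.NewContext(browserCtx) // new tab, same browser
            defer cancelTab()
            tabCtx, cancelTimeout := context.WithTimeout(tabCtx, 30*time.Second)
            defer cancelTimeout()

            var html string
            if err := chromedp.Run(tabCtx,
                chromedp.Navigate(u),
                chromedp.WaitReady("body", chromedp.ByQuery),
                chromedp.OuterHTML("html", &html),
            ); err != nil {
                log.Printf("%s: %v", u, err)
                return
            }
            // Hand html to the Colly pipeline (processWithColly from the
            // earlier example, or any parser of your choice).
            _ = html
        }(u)
    }

    wg.Wait()
}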
