The Challenge: Colly and JavaScript
Colly is a powerful web scraping framework for Go that excels at scraping static HTML content. However, Colly cannot handle JavaScript-heavy websites out of the box because it has no JavaScript rendering engine: it only ever sees the HTML the server initially returns. This limitation (illustrated by the sketch after this list) means:
- Content generated by JavaScript won't be visible to Colly
- Single Page Applications (SPAs) may appear empty
- Dynamic content loaded via AJAX calls will be missed
- Interactive elements requiring JavaScript won't be accessible
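Here is a minimal sketch of the problem, using a placeholder URL and selector: on a JavaScript-rendered page, the OnHTML callback below typically never fires, because the node it targets is created client-side and never appears in the raw response.

```go
package main

import (
    "fmt"
    "log"

    "github.com/gocolly/colly/v2"
)

func main() {
    c := colly.NewCollector()

    // On a JS-rendered page this callback usually never runs:
    // #dynamic-content does not exist in the initial server response.
    c.OnHTML("#dynamic-content", func(e *colly.HTMLElement) {
        fmt.Println("dynamic content:", e.Text)
    })

    // What Colly actually receives is the bare HTML shell.
    c.OnResponse(func(r *colly.Response) {
        fmt.Printf("fetched %d bytes of static HTML\n", len(r.Body))
    })

    if err := c.Visit("https://example-spa.com"); err != nil {
        log.Fatal(err)
    }
}
```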
Why JavaScript Matters for Modern Websites
Many modern websites rely heavily on JavaScript for:
- Dynamic content loading: Content loaded after page initialization
- AJAX requests: Data fetched asynchronously from APIs
- Single Page Applications: React, Vue, and Angular applications
- Infinite scroll: Content that loads as users scroll
- Interactive elements: Dropdowns, modals, and dynamic forms
Solutions for JavaScript-Heavy Websites
1. Combine Colly with ChromeDP (Recommended)
ChromeDP is a Go library that drives a headless Chrome browser, making it ideal for rendering JavaScript before handing the resulting HTML to Colly. One wrinkle: Colly has no public API for parsing a raw HTML string, so the example below writes the rendered page to a temporary file and serves it to Colly through a file:// transport.
```go
package main

import (
    "context"
    "fmt"
    "log"
    "net/http"
    "os"
    "time"

    "github.com/chromedp/chromedp"
    "github.com/gocolly/colly/v2"
)

func main() {
    // Create a headless browser context
    ctx, cancel := chromedp.NewContext(context.Background())
    defer cancel()

    // Set a timeout for the entire operation
    ctx, cancel = context.WithTimeout(ctx, 30*time.Second)
    defer cancel()

    var renderedHTML string

    // Navigate and wait for JavaScript to render the content
    err := chromedp.Run(ctx,
        chromedp.Navigate("https://example-spa.com"),
        chromedp.WaitVisible("#dynamic-content", chromedp.ByID),
        chromedp.Sleep(2*time.Second), // additional wait for animations
        chromedp.OuterHTML("html", &renderedHTML),
    )
    if err != nil {
        log.Fatalf("ChromeDP error: %v", err)
    }

    // Colly has no public method for parsing a raw HTML string, so one
    // workaround is to write the rendered page to a temporary file and
    // visit it through a file:// transport.
    tmp, err := os.CreateTemp("", "rendered-*.html")
    if err != nil {
        log.Fatal(err)
    }
    defer os.Remove(tmp.Name())
    if _, err := tmp.WriteString(renderedHTML); err != nil {
        log.Fatal(err)
    }
    tmp.Close()

    // Create a Colly collector that can fetch local files
    c := colly.NewCollector()
    t := &http.Transport{}
    t.RegisterProtocol("file", http.NewFileTransport(http.Dir("/")))
    c.WithTransport(t)

    // Set up scraping callbacks against the rendered DOM
    c.OnHTML(".product", func(e *colly.HTMLElement) {
        name := e.ChildText(".product-name")
        price := e.ChildText(".product-price")
        fmt.Printf("Product: %s, Price: %s\n", name, price)
    })

    c.OnError(func(r *colly.Response, err error) {
        log.Printf("Error: %s", err.Error())
    })

    // Parse the fully rendered HTML
    if err := c.Visit("file://" + tmp.Name()); err != nil {
        log.Fatalf("Colly error: %v", err)
    }
}
```
2. Advanced ChromeDP with Network Monitoring
For complex SPAs that make multiple API calls:
```go
package main

import (
    "context"
    "fmt"
    "log"
    "net/http"
    "os"
    "sync/atomic"
    "time"

    "github.com/chromedp/cdproto/network"
    "github.com/chromedp/chromedp"
    "github.com/gocolly/colly/v2"
)

func scrapeWithNetworkMonitoring(url string) {
    // Create a browser context and give the whole operation a deadline
    ctx, cancel := chromedp.NewContext(context.Background())
    defer cancel()
    ctx, cancel = context.WithTimeout(ctx, 60*time.Second)
    defer cancel()

    var renderedHTML string
    var requestCount int64 // written from the event goroutine, so use atomics

    // Monitor network requests
    chromedp.ListenTarget(ctx, func(ev interface{}) {
        if e, ok := ev.(*network.EventRequestWillBeSent); ok {
            n := atomic.AddInt64(&requestCount, 1)
            fmt.Printf("Request #%d: %s\n", n, e.Request.URL)
        }
    })

    err := chromedp.Run(ctx,
        network.Enable(),
        chromedp.Navigate(url),
        chromedp.WaitVisible("body", chromedp.ByQuery),
        // Wait for the network to settle (no new requests for 2 seconds)
        chromedp.ActionFunc(func(ctx context.Context) error {
            last := atomic.LoadInt64(&requestCount)
            for {
                select {
                case <-ctx.Done():
                    return ctx.Err()
                case <-time.After(2 * time.Second):
                }
                current := atomic.LoadInt64(&requestCount)
                if current == last {
                    return nil // no new requests; content likely loaded
                }
                last = current
            }
        }),
        chromedp.OuterHTML("html", &renderedHTML),
    )
    if err != nil {
        log.Fatal(err)
    }

    // Process with Colly
    if err := processWithColly(renderedHTML); err != nil {
        log.Fatal(err)
    }
}

func processWithColly(html string) error {
    // Same workaround as above: serve the rendered page to Colly
    // from a temporary file over a file:// transport.
    tmp, err := os.CreateTemp("", "rendered-*.html")
    if err != nil {
        return err
    }
    defer os.Remove(tmp.Name())
    if _, err := tmp.WriteString(html); err != nil {
        return err
    }
    tmp.Close()

    c := colly.NewCollector()
    t := &http.Transport{}
    t.RegisterProtocol("file", http.NewFileTransport(http.Dir("/")))
    c.WithTransport(t)

    c.OnHTML("[data-testid='product']", func(e *colly.HTMLElement) {
        fmt.Printf("Found product: %s\n", e.Text)
    })
    return c.Visit("file://" + tmp.Name())
}
```
3. Error Handling and Retry Logic
```go
func robustScraping(url string, maxRetries int) error {
    for attempt := 1; attempt <= maxRetries; attempt++ {
        html, err := renderPage(url)
        if err == nil {
            return processWithColly(html)
        }
        log.Printf("Attempt %d failed: %v", attempt, err)
        if attempt < maxRetries {
            // Simple linear backoff between attempts
            time.Sleep(time.Duration(attempt) * time.Second)
        }
    }
    return fmt.Errorf("failed after %d attempts", maxRetries)
}

// renderPage is a separate function so each attempt's contexts are
// cancelled when the attempt returns, rather than deferring inside a loop.
func renderPage(url string) (string, error) {
    ctx, cancel := chromedp.NewContext(context.Background())
    defer cancel()
    ctx, cancel = context.WithTimeout(ctx, 30*time.Second)
    defer cancel()

    var html string
    err := chromedp.Run(ctx,
        chromedp.Navigate(url),
        chromedp.WaitVisible("#content", chromedp.ByID),
        chromedp.OuterHTML("html", &html),
    )
    return html, err
}
```
Alternative Approaches
1. Rod (Another Go Browser Automation Library)
```go
// Rod offers a different API that some developers prefer
import "github.com/go-rod/rod"

browser := rod.New().MustConnect()
defer browser.MustClose() // shut the browser down when done

page := browser.MustPage("https://example.com")
page.MustWaitLoad()
html := page.MustHTML()
```
2. API-First Approach
Before implementing browser automation, check if the website offers APIs:
```go
// Often more efficient than scraping the rendered page
resp, err := http.Get("https://api.example.com/data")
// check err, decode the response body, and close resp.Body when done
```
3. Pre-rendered Services
Some sites serve pre-rendered snapshots for SEO via the legacy AJAX crawling scheme, which appends ?_escaped_fragment_= to URLs. Google deprecated this convention in 2015, so it only helps with the shrinking set of sites that still implement it.
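If a target does still support the scheme, the snapshot is plain HTML that Colly can scrape directly; a minimal sketch with a hypothetical URL:

```go
package main

import (
    "fmt"
    "log"

    "github.com/gocolly/colly/v2"
)

func main() {
    // Hypothetical page; the query string requests the legacy snapshot
    snapshotURL := "https://example.com/page?_escaped_fragment_="

    c := colly.NewCollector()
    c.OnHTML("title", func(e *colly.HTMLElement) {
        fmt.Println("snapshot title:", e.Text)
    })
    if err := c.Visit(snapshotURL); err != nil {
        log.Fatal(err)
    }
}
```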
Best Practices
- Minimize browser usage: Only use headless browsers when necessary
- Cache rendered content: Avoid re-rendering identical pages
- Set appropriate timeouts: Prevent hanging operations
- Monitor resource usage: Headless browsers consume significant memory
- Implement rate limiting: Respect website resources
- Use connection pooling: Reuse browser instances when possible (see the sketch below this list)
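As referenced in the pooling item above, here is a minimal sketch of browser reuse with ChromeDP, using placeholder URLs: contexts derived from one browser context open new tabs in the same Chrome process, so each page skips the browser startup cost.

```go
package main

import (
    "context"
    "fmt"
    "log"
    "time"

    "github.com/chromedp/chromedp"
)

func main() {
    urls := []string{"https://example.com/a", "https://example.com/b"} // placeholders

    // One allocator and one browser context for the whole run
    allocCtx, cancelAlloc := chromedp.NewExecAllocator(context.Background(),
        chromedp.DefaultExecAllocatorOptions[:]...)
    defer cancelAlloc()

    browserCtx, cancelBrowser := chromedp.NewContext(allocCtx)
    defer cancelBrowser()

    // The first Run starts Chrome; reusing browserCtx afterwards
    // avoids paying the startup cost again.
    if err := chromedp.Run(browserCtx); err != nil {
        log.Fatal(err)
    }

    for _, u := range urls {
        // Contexts derived from browserCtx open tabs in the same browser
        tabCtx, cancelTab := chromedp.NewContext(browserCtx)
        tabCtx, cancelTimeout := context.WithTimeout(tabCtx, 30*time.Second)

        var html string
        err := chromedp.Run(tabCtx,
            chromedp.Navigate(u),
            chromedp.OuterHTML("html", &html),
        )
        cancelTimeout()
        cancelTab()
        if err != nil {
            log.Printf("render %s: %v", u, err)
            continue
        }
        fmt.Printf("%s: %d bytes rendered\n", u, len(html))
    }
}
```

For rate limiting on the Colly side, a colly.LimitRule with DomainGlob, Delay, and Parallelism settings keeps request volume in check.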
When to Choose Alternatives
Consider using JavaScript-native tools if you primarily scrape JavaScript-heavy sites:
- Puppeteer (Node.js): More mature ecosystem
- Playwright (multiple languages): Cross-browser support
- Selenium (multiple languages): Widely supported
For Go developers, the Colly + ChromeDP combination provides excellent performance and maintainability while leveraging Go's concurrency features, as the closing sketch below illustrates.
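Here is a minimal sketch of bounded concurrency, assuming the renderPage helper from the retry example and placeholder URLs; a buffered channel caps how many headless tabs run at once.

```go
package main

import (
    "fmt"
    "log"
    "sync"
)

func main() {
    urls := []string{ // placeholder URLs
        "https://example.com/1",
        "https://example.com/2",
        "https://example.com/3",
    }

    sem := make(chan struct{}, 2) // at most two concurrent headless browsers
    var wg sync.WaitGroup

    for _, u := range urls {
        wg.Add(1)
        go func(u string) {
            defer wg.Done()
            sem <- struct{}{}        // acquire a slot
            defer func() { <-sem }() // release it

            // renderPage is the helper from the retry example; note that it
            // launches a browser per call, so pair this pattern with the
            // browser-reuse sketch above for heavier workloads.
            html, err := renderPage(u)
            if err != nil {
                log.Printf("render %s: %v", u, err)
                return
            }
            fmt.Printf("%s rendered (%d bytes)\n", u, len(html))
        }(u)
    }
    wg.Wait()
}
```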