Colly is a popular web scraping framework for Go that is designed for simplicity and efficiency. Colly is great for scraping static content, but it cannot execute JavaScript because it does not run a browser engine under the hood.
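For reference, a typical static scrape with Colly looks like the minimal sketch below (the URL and the .title selector are just placeholders). The collector only ever sees the HTML the server sends back, not anything JavaScript builds afterwards:
package main

import (
    "fmt"
    "log"

    "github.com/gocolly/colly"
)

func main() {
    // Create a collector for plain, server-rendered HTML
    c := colly.NewCollector()

    // Print the text of every element with the (illustrative) .title class
    c.OnHTML(".title", func(e *colly.HTMLElement) {
        fmt.Println("Title found:", e.Text)
    })

    // Fetch and parse the page; no JavaScript is executed here
    if err := c.Visit("https://example.com"); err != nil {
        log.Fatal(err)
    }
}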
When you need to scrape a website that relies on JavaScript to render its content, you traditionally reach for a headless browser such as Headless Chrome or PhantomJS. Headless browsers are fully fledged browsers without a user interface, which allows them to render web pages the same way a normal browser would, including executing JavaScript.
Since Colly does not support JavaScript rendering out of the box, you can follow one of two approaches to scrape JavaScript-heavy websites with it:
1. Use an external headless browser service
You can use a headless browser service like Browserless or Rendertron to render the JavaScript first, and then pass the HTML content to Colly for scraping. This approach involves making a request to the headless browser service, which will return the rendered HTML.
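For example, Browserless exposes an HTTP endpoint that accepts a target URL and responds with the rendered HTML. The sketch below assumes such a service is running locally at a placeholder address; the endpoint path and request format are illustrative, so check your service's documentation for the exact API.
package main

import (
    "bytes"
    "encoding/json"
    "fmt"
    "io"
    "log"
    "net/http"
)

// fetchRenderedHTML asks a rendering service to load targetURL in a headless
// browser and return the resulting HTML. The endpoint and JSON body shown
// here are placeholders; adapt them to the service you actually run.
func fetchRenderedHTML(renderEndpoint, targetURL string) (string, error) {
    payload, err := json.Marshal(map[string]string{"url": targetURL})
    if err != nil {
        return "", err
    }
    resp, err := http.Post(renderEndpoint, "application/json", bytes.NewReader(payload))
    if err != nil {
        return "", err
    }
    defer resp.Body.Close()
    if resp.StatusCode != http.StatusOK {
        return "", fmt.Errorf("render service returned %s", resp.Status)
    }
    body, err := io.ReadAll(resp.Body)
    if err != nil {
        return "", err
    }
    return string(body), nil
}

func main() {
    // Placeholder address for a locally running rendering service
    html, err := fetchRenderedHTML("http://localhost:3000/content", "https://example.com")
    if err != nil {
        log.Fatal(err)
    }
    fmt.Printf("got %d bytes of rendered HTML\n", len(html))
}
The returned HTML can then be fed to Colly or goquery in exactly the same way as in the local-browser example below.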
2. Integrate with a local headless browser
You can drive a local headless browser yourself, for example Headless Chrome via the chromedp package in Go (or Puppeteer if you are working in Node.js), render the page, and then pass the resulting HTML to Colly.
Here's an example that combines Headless Chrome (driven by chromedp) with Colly in Go. Chromedp renders the page and returns the resulting HTML; because Colly has no public API for parsing a raw HTML string directly, the example hands the rendered markup to the collector through a custom http.RoundTripper:
package main

import (
    "context"
    "fmt"
    "io"
    "log"
    "net/http"
    "strings"

    "github.com/chromedp/chromedp"
    "github.com/gocolly/colly"
)

// renderedTransport is a minimal http.RoundTripper that returns pre-rendered
// HTML instead of performing a real network request, so Colly can parse
// markup that was produced by Headless Chrome.
type renderedTransport struct {
    html string
}

func (t *renderedTransport) RoundTrip(req *http.Request) (*http.Response, error) {
    return &http.Response{
        StatusCode:    http.StatusOK,
        Status:        "200 OK",
        Header:        http.Header{"Content-Type": []string{"text/html; charset=utf-8"}},
        Body:          io.NopCloser(strings.NewReader(t.html)),
        ContentLength: int64(len(t.html)),
        Request:       req,
    }, nil
}

func main() {
    // Start Headless Chrome via chromedp
    ctx, cancel := chromedp.NewContext(context.Background())
    defer cancel()

    // Run tasks: navigate to the page and capture the fully rendered HTML
    var renderedHTML string
    err := chromedp.Run(ctx,
        // Visit the web page
        chromedp.Navigate(`https://example.com`),
        // Wait until the footer element is visible (i.e. the page has rendered)
        chromedp.WaitVisible(`footer`),
        // Retrieve the HTML of the rendered page
        chromedp.OuterHTML(`html`, &renderedHTML),
    )
    if err != nil {
        log.Fatal(err)
    }

    // Create a new Colly collector
    c := colly.NewCollector()

    // Serve the already rendered HTML to Colly instead of re-fetching the page
    c.WithTransport(&renderedTransport{html: renderedHTML})

    // On every HTML element with the .title class, call the callback
    c.OnHTML(".title", func(e *colly.HTMLElement) {
        fmt.Println("Title found:", e.Text)
    })

    // "Visit" the page; the custom transport returns the rendered HTML
    if err := c.Visit(`https://example.com`); err != nil {
        log.Fatal(err)
    }
}
In the example above, chromedp navigates to the page and captures the rendered HTML once the footer element becomes visible. Because Colly does not expose a method for parsing a raw HTML string, the custom renderedTransport implements http.RoundTripper and returns that HTML to the collector, so the OnHTML callbacks run against the JavaScript-rendered content rather than the bare server response.
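If you don't need Colly's crawling features (request queues, callbacks, link following), a simpler variant is to skip the custom transport and parse the chromedp output directly with goquery, the HTML parsing library Colly itself builds on. A minimal sketch, reusing the same placeholder URL and selector:
package main

import (
    "context"
    "fmt"
    "log"
    "strings"

    "github.com/PuerkitoBio/goquery"
    "github.com/chromedp/chromedp"
)

func main() {
    ctx, cancel := chromedp.NewContext(context.Background())
    defer cancel()

    // Render the page with Headless Chrome and capture the HTML
    var renderedHTML string
    if err := chromedp.Run(ctx,
        chromedp.Navigate(`https://example.com`),
        chromedp.WaitVisible(`footer`),
        chromedp.OuterHTML(`html`, &renderedHTML),
    ); err != nil {
        log.Fatal(err)
    }

    // Parse the rendered HTML directly with goquery
    doc, err := goquery.NewDocumentFromReader(strings.NewReader(renderedHTML))
    if err != nil {
        log.Fatal(err)
    }
    doc.Find(".title").Each(func(_ int, s *goquery.Selection) {
        fmt.Println("Title found:", s.Text())
    })
}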
If you're working in a language other than Go, or if you prefer a more integrated solution, you would typically turn to a scraping library or framework with built-in JavaScript execution capabilities, such as Puppeteer for Node.js or Selenium, which has bindings for many languages, including Python, Java, and C#.
Here's a simple example using Puppeteer in JavaScript for a similar scraping task:
const puppeteer = require('puppeteer');
const cheerio = require('cheerio');

(async () => {
  // Launch the headless browser
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Navigate to the page
  await page.goto('https://example.com');

  // Wait for the footer element to appear
  await page.waitForSelector('footer');

  // Get the rendered HTML of the page
  const content = await page.content();

  // Process the content with your scraping logic,
  // for example with cheerio (a jQuery-like HTML parser)
  const $ = cheerio.load(content);
  $('.title').each((index, element) => {
    console.log('Title found:', $(element).text());
  });

  // Close the browser
  await browser.close();
})();
In this JavaScript example, Puppeteer handles the rendering and cheerio handles the parsing, but everything runs in a single Node.js script, so there is no need to bridge a browser driver and a scraping framework the way the Go example combines chromedp and Colly.
Each approach has its use cases, and the choice depends on the specific requirements of your scraping project and your preferred programming language.