Colly is a popular scraping framework for Golang, designed to simplify the process of extracting data from websites. When using Colly to scrape data from a page, you have multiple methods at your disposal, which can be categorized based on the type of data you're trying to extract:
1. Extracting Text:
To get the text of an element, read the Text field of the HTMLElement inside an OnHTML callback:
// Find the element and get its text
c.OnHTML("selector", func(e *colly.HTMLElement) {
    fmt.Println(e.Text)
})
2. Extracting Attributes:
For extracting attributes like href from an anchor tag or src from an image, Colly provides the Attr method, which returns the attribute value as a string (empty if the attribute is missing):
// Find the element and get the attribute value
c.OnHTML("selector", func(e *colly.HTMLElement) {
    link := e.Attr("href")
    fmt.Println(link)
})
3. Extracting HTML:
Sometimes, you may want to get the raw HTML of an element. You can do this through the element's underlying goquery selection, whose Html method returns the markup along with an error:
// Find the element and get its HTML content
c.OnHTML("selector", func(e *colly.HTMLElement) {
    html, _ := e.DOM.Html()
    fmt.Println(html)
})
4. Extracting Child Text/Attributes:
You can also navigate to child elements to extract their text or attributes (Colly's ChildText and ChildAttr helpers cover the common cases):
// Find the parent element, then find children by selector
c.OnHTML("parentSelector", func(e *colly.HTMLElement) {
    e.ForEach("childSelector", func(_ int, el *colly.HTMLElement) {
        // Extract data from the child
        childText := el.Text
        childAttr := el.Attr("childAttribute")
        fmt.Println(childText, childAttr)
    })
})
5. Handling Multiple Elements:
When dealing with multiple elements that match a selector, you can loop through them with ForEach:
// Loop through all matching elements
c.OnHTML("selector", func(e *colly.HTMLElement) {
    e.ForEach("itemSelector", func(_ int, el *colly.HTMLElement) {
        // Extract data from each item
        itemData := el.Text // or any other extraction method
        fmt.Println(itemData)
    })
})
6. Working with Forms:
Colly can also handle form submissions, which can be useful when you need to log in or send data through a form:
// Read the form's action URL and submit the fields with a POST request
c.OnHTML("formSelector", func(e *colly.HTMLElement) {
    action := e.Request.AbsoluteURL(e.Attr("action"))
    c.Post(action, map[string]string{
        "username": "user",
        "password": "pass",
    })
})
7. Response Handling:
Apart from HTML elements, you can directly handle raw responses to extract data:
// Handle the raw response
c.OnResponse(func(r *colly.Response) {
    fmt.Println(string(r.Body))
})
8. JSON Data:
If the page contains JSON data, you can use Go's encoding/json package to unmarshal the response body:
c.OnResponse(func(r *colly.Response) {
    var data MyStruct
    if err := json.Unmarshal(r.Body, &data); err == nil {
        fmt.Println(data)
    }
})
Conclusion:
These are some of the primary methods provided by Colly for extracting data from web pages. By combining these methods, you can navigate through the DOM of a page, access any element, and scrape the data you're interested in. Colly also provides more advanced features for handling cookies, user agents, and asynchronous scraping to build robust scraping solutions.