Colly is a popular Go package for building web scraping applications. Handling cookies within a Colly session is essential for scraping websites that require authentication or maintain session state through cookies.
Here's a step-by-step guide on how to handle cookies within a Colly session:
1. Importing the Colly Package
First, ensure you have the Colly package installed and then import it in your Go code:
package main

import (
    "log"

    "github.com/gocolly/colly"
)
2. Create a Colly Collector
Create a new Colly collector, which is the main object that will perform the scraping:
func main() {
    c := colly.NewCollector(
        // Optionally set additional options, like user agent, rate limit, etc.
        colly.AllowedDomains("example.com"),
    )
    // ...
}
3. Handling Cookies
Colly automatically manages cookies for each domain. However, you can manually set or get cookies if needed.
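Under the hood, this automatic behavior is the standard net/http cookie-jar mechanism: cookies a server sets in one response are replayed on later requests to the same host. The following stdlib-only sketch demonstrates the mechanism without Colly (the test server, endpoint paths, and cookie names are made up for illustration):

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/cookiejar"
	"net/http/httptest"
)

// run starts a throwaway test server whose /login endpoint sets a cookie,
// then fetches /profile with the same client; the client's cookie jar
// replays the cookie automatically, just as Colly's collector does.
func run() string {
	mux := http.NewServeMux()
	mux.HandleFunc("/login", func(w http.ResponseWriter, r *http.Request) {
		http.SetCookie(w, &http.Cookie{Name: "session", Value: "abc123", Path: "/"})
	})
	mux.HandleFunc("/profile", func(w http.ResponseWriter, r *http.Request) {
		if c, err := r.Cookie("session"); err == nil {
			fmt.Fprint(w, "session="+c.Value)
			return
		}
		fmt.Fprint(w, "no session")
	})
	srv := httptest.NewServer(mux)
	defer srv.Close()

	jar, _ := cookiejar.New(nil)
	client := &http.Client{Jar: jar}

	client.Get(srv.URL + "/login") // the response's Set-Cookie header lands in the jar
	resp, _ := client.Get(srv.URL + "/profile")
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	return string(body)
}

func main() {
	fmt.Println(run()) // prints "session=abc123"
}
```

If the second request came from a client without a jar, the server would see no cookie at all; attaching the jar is what makes the session "stick" across requests.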
Set Cookies
To set a cookie before making a request:
func main() {
    c := colly.NewCollector(
        colly.AllowedDomains("example.com"),
    )

    // Set cookies on the collector before visiting (requires importing "net/http")
    cookies := []*http.Cookie{
        {
            Name:   "cookie_name",
            Value:  "cookie_value",
            Domain: "example.com",
        },
    }
    if err := c.SetCookies("http://example.com", cookies); err != nil {
        log.Fatal(err)
    }

    // Alternatively, set the Cookie header manually on each request
    c.OnRequest(func(r *colly.Request) {
        r.Headers.Set("Cookie", "cookie_name=cookie_value")
    })

    // Start scraping
    c.Visit("http://example.com")
}
Get Cookies
To get cookies after a response:
func main() {
    c := colly.NewCollector(
        colly.AllowedDomains("example.com"),
    )

    c.OnResponse(func(r *colly.Response) {
        // Access the cookies stored for this URL
        cookies := c.Cookies(r.Request.URL.String())
        for _, cookie := range cookies {
            log.Println("Cookie:", cookie.Name, "Value:", cookie.Value)
        }
    })

    // Start scraping
    c.Visit("http://example.com")
}
4. Handling Cookie Jars
Colly uses an http.CookieJar to manage cookies. Here's how to attach a custom cookie jar:
import "net/http/cookiejar"

func main() {
    // Create a new collector
    c := colly.NewCollector()

    // Create a cookie jar and attach it to the collector
    jar, err := cookiejar.New(nil)
    if err != nil {
        log.Fatal(err)
    }
    c.SetCookieJar(jar)

    // ...
    // Your scraping logic
}
5. Persisting Cookies
If you want to persist cookies between runs, you'll need to manually save and load the cookies from a file or database. Colly does not provide built-in support for this, but you can serialize the cookie jar yourself.
Example of saving and loading cookies from a file:
import (
    "net/http/cookiejar"
)

// Function to save cookies to a file
func saveCookies(jar *cookiejar.Jar, filename string) error {
    // Use your preferred method to write the cookie jar to a file
    return nil
}

// Function to load cookies from a file
func loadCookies(jar *cookiejar.Jar, filename string) error {
    // Use your preferred method to read the cookie jar from a file
    return nil
}

func main() {
    c := colly.NewCollector()

    jar, _ := cookiejar.New(nil)
    c.SetCookieJar(jar)

    // Load cookies if they exist
    _ = loadCookies(jar, "cookies.json")

    // ... your scraping logic ...

    // After scraping, save the cookies
    _ = saveCookies(jar, "cookies.json")
}
In the above example, you would need to implement the saveCookies and loadCookies functions to actually write and read the cookies to and from a file in a serializable format (e.g., JSON or gob).
Remember that handling cookies, especially when they involve login sessions, may be subject to the terms of service of the website you are scraping. Always ensure you are allowed to scrape the website and that you comply with its terms of service and privacy policy.