How do I handle complex form submissions with CSRF tokens in Colly?
Cross-Site Request Forgery (CSRF) tokens are a common security mechanism that web applications use to reject forged requests. When scraping websites whose forms are CSRF-protected, you need to extract the token from the page and include it in your submission. This guide covers techniques for handling such form submissions in Colly.
Understanding CSRF Tokens
CSRF tokens are unique, unpredictable values generated by web applications to verify that form submissions come from legitimate sources. These tokens are typically:
- Hidden form fields with names like _token, csrf_token, or authenticity_token
- Meta tags in the HTML head section
- Embedded in JavaScript variables
- Returned in JSON responses from initial requests
Basic CSRF Token Extraction and Form Submission
Here's a fundamental example of extracting a CSRF token from a form and submitting it:
package main

import (
	"fmt"
	"log"

	"github.com/gocolly/colly/v2"
	"github.com/gocolly/colly/v2/debug"
)

func main() {
	c := colly.NewCollector(
		colly.Debugger(&debug.LogDebugger{}),
	)

	// Guard against resubmitting when the server re-renders the login form
	// (for example after a failed login), which would trigger OnHTML again
	submitted := false

	// Extract CSRF token from the login form
	c.OnHTML("form[action='/login']", func(e *colly.HTMLElement) {
		if submitted {
			return
		}

		// Look for the hidden CSRF token field and remember which name matched
		fieldName := "_token"
		csrfToken := e.ChildAttr("input[name='_token']", "value")
		if csrfToken == "" {
			fieldName = "csrf_token"
			csrfToken = e.ChildAttr("input[name='csrf_token']", "value")
		}
		if csrfToken == "" {
			log.Println("No CSRF token found in login form")
			return
		}
		fmt.Printf("Extracted CSRF token: %s\n", csrfToken)

		// Submit the form with the token under the field name the form uses.
		// Collector.Post takes a map[string]string, not url.Values.
		formData := map[string]string{
			fieldName:  csrfToken,
			"username": "your_username",
			"password": "your_password",
		}
		submitted = true
		err := c.Post(e.Request.AbsoluteURL(e.Attr("action")), formData)
		if err != nil {
			log.Printf("Error submitting form: %v", err)
		}
	})

	// Handle form submission response
	c.OnResponse(func(r *colly.Response) {
		fmt.Printf("Status: %d\n", r.StatusCode)
		fmt.Printf("Response: %s\n", string(r.Body))
	})

	if err := c.Visit("https://example.com/login"); err != nil {
		log.Fatal(err)
	}
}
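The extract-then-submit round trip can be checked end to end without touching a real site. The following is a minimal sketch that uses only the standard library rather than Colly. The stub server, the fixed token, and the names `newStubServer` and `loginWithToken` are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/cookiejar"
	"net/http/httptest"
	"net/url"
	"regexp"
	"strings"
)

// tokenRe captures the value of a hidden input named _token. It assumes the
// name attribute comes before value, as in the stub form below.
var tokenRe = regexp.MustCompile(`<input[^>]*name="_token"[^>]*value="([^"]*)"`)

// newStubServer returns a hypothetical login endpoint that serves a form with
// a fixed token on GET and rejects POSTs that do not echo the token back.
func newStubServer() *httptest.Server {
	const token = "tok-123"
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Method == http.MethodPost {
			if r.FormValue("_token") != token {
				http.Error(w, "csrf token mismatch", 419)
				return
			}
			fmt.Fprint(w, "welcome")
			return
		}
		fmt.Fprintf(w, `<form action="/login" method="post"><input type="hidden" name="_token" value="%s"></form>`, token)
	}))
}

// loginWithToken fetches the form, extracts the token, and posts it back
// using one client so that any session cookies are shared by both requests.
func loginWithToken(baseURL string) (int, string, error) {
	jar, err := cookiejar.New(nil)
	if err != nil {
		return 0, "", err
	}
	client := &http.Client{Jar: jar}

	resp, err := client.Get(baseURL + "/login")
	if err != nil {
		return 0, "", err
	}
	page, err := io.ReadAll(resp.Body)
	resp.Body.Close()
	if err != nil {
		return 0, "", err
	}

	m := tokenRe.FindSubmatch(page)
	if m == nil {
		return 0, "", fmt.Errorf("no CSRF token found")
	}

	form := url.Values{"_token": {string(m[1])}, "username": {"alice"}}
	resp, err = client.PostForm(baseURL+"/login", form)
	if err != nil {
		return 0, "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return resp.StatusCode, strings.TrimSpace(string(body)), err
}

func main() {
	srv := newStubServer()
	defer srv.Close()
	status, body, err := loginWithToken(srv.URL)
	fmt.Println(status, body, err)
}
```

Posting without the extracted token makes the stub answer 419, which is a quick way to confirm that the token is what the server actually checks.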
Advanced CSRF Token Handling Patterns
Multiple Token Extraction Methods
Some websites use multiple methods to provide CSRF tokens. Here's how to handle various scenarios:
// extractCSRFToken tries several common token locations in turn.
// Meta tags and scripts usually sit outside the form, so they are looked up
// from the document root rather than from the element passed to the callback.
func extractCSRFToken(e *colly.HTMLElement) string {
	// Method 1: Hidden form field
	if token := e.ChildAttr("input[name='_token']", "value"); token != "" {
		return token
	}

	// Walk up to <html> so the lookups below cover the whole document
	root := e.DOM.Closest("html")

	// Method 2: Meta tag in head
	if token := root.Find("meta[name='csrf-token']").AttrOr("content", ""); token != "" {
		return token
	}

	// Method 3: Meta tag with different name
	if token := root.Find("meta[name='_token']").AttrOr("content", ""); token != "" {
		return token
	}

	// Method 4: JavaScript variable extraction (requires the "regexp" import)
	scriptContent := root.Find("script").Text()
	re := regexp.MustCompile(`window\.csrfToken\s*=\s*['"](.*?)['"]`)
	if matches := re.FindStringSubmatch(scriptContent); len(matches) > 1 {
		return matches[1]
	}

	return ""
}
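The JavaScript-variable case is the most fragile of the four methods, so it helps to exercise the pattern on its own. This is a small sketch using only the `regexp` package. The function name `tokenFromScript` and the sample script text are hypothetical, and the pattern only covers plain `window.csrfToken = "..."` assignments.

```go
package main

import (
	"fmt"
	"regexp"
)

// csrfAssignRe matches assignments such as window.csrfToken = "abc" with
// either quote style. It is compiled once at package level.
var csrfAssignRe = regexp.MustCompile(`window\.csrfToken\s*=\s*['"]([^'"]+)['"]`)

// tokenFromScript returns the token assigned to window.csrfToken in src.
// The boolean is false when no assignment is found.
func tokenFromScript(src string) (string, bool) {
	m := csrfAssignRe.FindStringSubmatch(src)
	if m == nil {
		return "", false
	}
	return m[1], true
}

func main() {
	src := `var a = 1; window.csrfToken = 'abc123'; init();`
	token, ok := tokenFromScript(src)
	fmt.Println(token, ok)
}
```

Returning a boolean alongside the token lets the caller tell "no token on this page" apart from a token that happens to be empty.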
Session-Based CSRF Token Management
For complex applications that require multiple form submissions, you need to maintain session state and handle token renewal:
package main

import (
	"fmt"
	"regexp"

	"github.com/gocolly/colly/v2"
	"github.com/gocolly/colly/v2/debug"
)

// csrfScriptRe matches object-literal entries such as csrf_token: "abc"
var csrfScriptRe = regexp.MustCompile(`csrf_token["']?\s*:\s*["']([^"']+)["']`)

type CSRFHandler struct {
	collector *colly.Collector
	token     string
	sessionID string
}

func NewCSRFHandler() *CSRFHandler {
	c := colly.NewCollector(
		colly.Debugger(&debug.LogDebugger{}),
	)
	h := &CSRFHandler{
		collector: c,
	}
	// Refresh the stored token on every page. Matching on <html> keeps
	// forms, meta tags, and scripts all within reach of the selectors below.
	c.OnHTML("html", h.extractToken)
	return h
}

func (h *CSRFHandler) extractToken(e *colly.HTMLElement) {
	// Try multiple extraction methods
	token := h.tryMultipleExtractionMethods(e)
	if token != "" {
		h.token = token
		fmt.Printf("CSRF token updated: %s\n", token)
	}
}

func (h *CSRFHandler) tryMultipleExtractionMethods(e *colly.HTMLElement) string {
	// Hidden input field
	if token := e.ChildAttr("input[name='_token']", "value"); token != "" {
		return token
	}

	// Meta tag
	if token := e.DOM.Find("meta[name='csrf-token']").AttrOr("content", ""); token != "" {
		return token
	}

	// JavaScript variable: keep the first match across all script elements.
	// A return inside the callback only leaves the callback, so the result
	// is captured in a variable declared outside it.
	var token string
	e.ForEach("script", func(_ int, s *colly.HTMLElement) {
		if token != "" {
			return
		}
		if matches := csrfScriptRe.FindStringSubmatch(s.Text); len(matches) > 1 {
			token = matches[1]
		}
	})
	return token
}

func (h *CSRFHandler) SubmitForm(actionURL string, formData map[string]string) error {
	if h.token == "" {
		return fmt.Errorf("CSRF token not available")
	}

	// Copy the form data and add the CSRF token. Collector.Post expects a
	// map[string]string, and copying leaves the caller's map unchanged.
	values := make(map[string]string, len(formData)+1)
	for key, value := range formData {
		values[key] = value
	}
	values["_token"] = h.token

	return h.collector.Post(actionURL, values)
}
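The step of merging the token into the form fields can be isolated and tested on its own. The sketch below uses only the standard library. `withToken` and `encode` are hypothetical helpers, and `encode` is included only to show what the URL-encoded body would look like on the wire.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
)

// withToken returns a copy of fields with the CSRF token added under
// tokenField. The caller's map is left untouched so it can be reused if the
// submission has to be retried with a fresh token.
func withToken(fields map[string]string, tokenField, token string) (map[string]string, error) {
	if token == "" {
		return nil, errors.New("CSRF token not available")
	}
	out := make(map[string]string, len(fields)+1)
	for k, v := range fields {
		out[k] = v
	}
	out[tokenField] = token
	return out, nil
}

// encode renders fields as an application/x-www-form-urlencoded body.
// url.Values.Encode emits keys in sorted order, so the result is stable.
func encode(fields map[string]string) string {
	values := url.Values{}
	for k, v := range fields {
		values.Set(k, v)
	}
	return values.Encode()
}

func main() {
	fields := map[string]string{"email": "a@b.c"}
	payload, err := withToken(fields, "_token", "tok")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(encode(payload))
}
```

Passing the token field name as a parameter keeps the helper usable for sites that call the field something other than `_token`.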
Handling Dynamic CSRF Tokens with AJAX
Modern web applications often refresh CSRF tokens via AJAX requests. Here's how to handle this scenario:
func handleAjaxCSRFRefresh(c *colly.Collector) {
// Intercept AJAX requests that might return new tokens
c.OnResponse(func(r *colly.Response) {
contentType := r.Headers.Get("Content-Type")
if strings.Contains(contentType, "application/json") {
var jsonResponse map[string]interface{}
if err := json.Unmarshal(r.Body, &jsonResponse); err == nil {
// Check for CSRF token in JSON response
if token, exists := jsonResponse["csrf_token"]; exists {
if tokenStr, ok := token.(string); ok {
fmt.Printf("Updated CSRF token from AJAX: %s\n", tokenStr)
// Update your token variable here
}
}
}
}
})
// Make initial AJAX request to get token
c.OnHTML("script", func(e *colly.HTMLElement) {
content := e.Text
// Look for AJAX endpoint that provides tokens
if matches := regexp.MustCompile(`/api/csrf-token`).FindString(content); matches != "" {
c.Visit(e.Request.AbsoluteURL("/api/csrf-token"))
}
})
}
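Decoding the token response into a typed struct avoids the nested type assertions on `map[string]interface{}`. This is a minimal sketch, assuming the endpoint returns an object with a string field named `csrf_token` as in the handler above. The names `tokenPayload` and `parseTokenJSON` are hypothetical.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// tokenPayload mirrors the assumed shape of the token endpoint's response.
type tokenPayload struct {
	CSRFToken string `json:"csrf_token"`
}

// parseTokenJSON extracts the csrf_token field from a JSON response body.
// It returns an error for malformed JSON and for a missing or empty token.
func parseTokenJSON(body []byte) (string, error) {
	var p tokenPayload
	if err := json.Unmarshal(body, &p); err != nil {
		return "", fmt.Errorf("decode token response: %w", err)
	}
	if p.CSRFToken == "" {
		return "", errors.New("response has no csrf_token field")
	}
	return p.CSRFToken, nil
}

func main() {
	token, err := parseTokenJSON([]byte(`{"csrf_token": "fresh-42", "expires_in": 3600}`))
	fmt.Println(token, err)
}
```

Unknown fields such as `expires_in` are ignored by `json.Unmarshal`, so the struct only needs to list what the scraper actually uses.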
Complex Multi-Step Form Submission
For applications requiring multiple form submissions with token validation at each step:
func handleMultiStepForm(c *colly.Collector) {
	var currentToken string

	// post submits the form and logs any error instead of discarding it.
	// Collector.Post takes a map[string]string, not url.Values.
	post := func(e *colly.HTMLElement, formData map[string]string) {
		action := e.Request.AbsoluteURL(e.Attr("action"))
		if err := c.Post(action, formData); err != nil {
			log.Printf("Error submitting %s: %v", action, err)
		}
	}

	// Step 1: Initial form
	c.OnHTML("form#step1", func(e *colly.HTMLElement) {
		currentToken = e.ChildAttr("input[name='_token']", "value")
		post(e, map[string]string{
			"_token":    currentToken,
			"step":      "1",
			"user_data": "initial_value",
		})
	})

	// Step 2: Intermediate form
	c.OnHTML("form#step2", func(e *colly.HTMLElement) {
		// Token might be refreshed
		if newToken := e.ChildAttr("input[name='_token']", "value"); newToken != "" {
			currentToken = newToken
		}
		post(e, map[string]string{
			"_token":          currentToken,
			"step":            "2",
			"additional_data": "step2_value",
		})
	})

	// Final step
	c.OnHTML("form#final", func(e *colly.HTMLElement) {
		if finalToken := e.ChildAttr("input[name='_token']", "value"); finalToken != "" {
			currentToken = finalToken
		}
		post(e, map[string]string{
			"_token": currentToken,
			"submit": "final",
		})
	})
}
Error Handling and Token Validation
Implement robust error handling for CSRF-related issues:
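func handleCSRFErrors(c *colly.Collector) {
	// Visit returns an error for a URL it has already seen unless revisits
	// are allowed, so enable them before re-fetching the form page
	c.AllowURLRevisit = true

	// Cap refresh attempts so a persistent rejection cannot loop forever
	const maxRefreshes = 3
	refreshes := 0

	c.OnError(func(r *colly.Response, err error) {
		// 419 is what Laravel returns for an expired token; 403 is a common
		// but less specific rejection status
		if r.StatusCode != 419 && r.StatusCode != 403 {
			return
		}
		if refreshes >= maxRefreshes {
			log.Printf("Giving up after %d token refreshes: %v", refreshes, err)
			return
		}
		refreshes++
		fmt.Printf("CSRF token error (Status: %d). Refreshing token...\n", r.StatusCode)

		// Re-visit the URL to get a fresh token. For a failed POST this is the
		// form's action URL, which is assumed to also serve the form on GET.
		if visitErr := c.Visit(r.Request.URL.String()); visitErr != nil {
			log.Printf("Token refresh failed: %v", visitErr)
		}
	})

	c.OnHTML("div.csrf-error", func(e *colly.HTMLElement) {
		fmt.Printf("CSRF validation failed: %s\n", e.Text)
		// Handle the error by re-fetching the form
	})
}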
Best Practices for CSRF Token Handling
1. Token Caching and Reuse
type TokenCache struct {
tokens map[string]string
mutex sync.RWMutex
}
func (tc *TokenCache) Set(domain, token string) {
tc.mutex.Lock()
defer tc.mutex.Unlock()
tc.tokens[domain] = token
}
func (tc *TokenCache) Get(domain string) string {
tc.mutex.RLock()
defer tc.mutex.RUnlock()
return tc.tokens[domain]
}
2. Concurrent Request Handling
When scraping multiple pages simultaneously, ensure proper token management:
func handleConcurrentRequests() {
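// Parallelism only takes effect in async mode; call c.Wait() after queuing visits
c := colly.NewCollector(colly.Async(true))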
c.Limit(&colly.LimitRule{
DomainGlob: "*",
Parallelism: 2,
Delay: 1 * time.Second,
})
tokenCache := &TokenCache{
tokens: make(map[string]string),
}
c.OnHTML("form", func(e *colly.HTMLElement) {
domain := e.Request.URL.Host
token := extractCSRFToken(e)
tokenCache.Set(domain, token)
})
}
3. Debugging CSRF Issues
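func debugCSRFHandling(c *colly.Collector) {
	c.OnRequest(func(r *colly.Request) {
		if r.Method == "POST" {
			fmt.Printf("POST Request to: %s\n", r.URL)
			// r.Body is an io.Reader, so printing it with %s does not show
			// the form data, and reading it here would consume the payload
			// before it is sent. Log the request headers instead.
			fmt.Printf("Request headers: %v\n", *r.Headers)
		}
	})

	c.OnResponse(func(r *colly.Response) {
		if r.StatusCode >= 400 {
			fmt.Printf("Error response: %d\n", r.StatusCode)
			fmt.Printf("Response body: %s\n", string(r.Body))
		}
	})
}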
Working with File Uploads and CSRF Tokens
Many forms with CSRF protection also handle file uploads. Here's how to manage both:
func handleFileUploadWithCSRF(c *colly.Collector) {
	c.OnHTML("form[enctype='multipart/form-data']", func(e *colly.HTMLElement) {
		token := e.ChildAttr("input[name='_token']", "value")
		formAction := e.Request.AbsoluteURL(e.Attr("action"))

		// Plain fields to send alongside the file, including the CSRF token
		formData := map[string]string{
			"_token": token,
			"title":  "File Upload",
		}

		// Note: Colly's PostMultipart sends a map[string][]byte of field
		// values. Uploads that need file names or per-part content types may
		// be easier to build with mime/multipart and send with net/http.
		fmt.Printf("Form action: %s, fields: %v\n", formAction, formData)
	})
}
Integration with Authentication Systems
CSRF tokens often work alongside authentication systems. A token is usually tied to the session cookie issued with the form, so the request that extracts the token and the request that submits it should go through the same collector, which keeps cookies between requests by default. For login flows that depend on browser-side session state, a tool with full browser session handling may be a better fit than Colly's static approach.
For applications that rely heavily on JavaScript to generate forms and manage tokens, a tool that renders JavaScript may be more appropriate than Colly, which only parses the HTML the server returns.
Common CSRF Token Patterns by Framework
Different web frameworks implement CSRF tokens differently:
Laravel (PHP)
// Laravel uses _token field and meta tag
token := e.ChildAttr("input[name='_token']", "value")
if token == "" {
token = e.DOM.Find("meta[name='csrf-token']").AttrOr("content", "")
}
Django (Python)
// Django uses csrfmiddlewaretoken
token := e.ChildAttr("input[name='csrfmiddlewaretoken']", "value")
Rails (Ruby)
// Rails uses authenticity_token
token := e.ChildAttr("input[name='authenticity_token']", "value")
Express.js with csurf
// Express with csurf middleware
token := e.ChildAttr("input[name='_csrf']", "value")
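The four field names above can be folded into a single lookup when the framework behind a site is not known in advance. This is a small sketch. `tokenFieldNames` and `findTokenField` are hypothetical, and the input map stands in for the hidden fields already read from a form.

```go
package main

import "fmt"

// tokenFieldNames lists the hidden-field names covered above, in the order
// Laravel, Django, Rails, Express with csurf.
var tokenFieldNames = []string{"_token", "csrfmiddlewaretoken", "authenticity_token", "_csrf"}

// findTokenField returns the first known CSRF field that has a non-empty
// value in hidden, along with that value.
func findTokenField(hidden map[string]string) (name, value string, ok bool) {
	for _, n := range tokenFieldNames {
		if v := hidden[n]; v != "" {
			return n, v, true
		}
	}
	return "", "", false
}

func main() {
	hidden := map[string]string{
		"csrfmiddlewaretoken": "dj-123",
		"next":                "/dashboard",
	}
	name, value, ok := findTokenField(hidden)
	fmt.Println(name, value, ok)
}
```

Returning the field name matters because the token has to be posted back under the same name the form used.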
Performance Optimization for CSRF Handling
When scraping multiple pages with forms, optimize your CSRF token handling:
func optimizedCSRFHandling() {
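// Parallelism only takes effect in async mode; call c.Wait() after queuing visits
c := colly.NewCollector(colly.Async(true))
// Rate-limit requests: at most 3 in parallel per matching domain, 500 ms apart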
c.Limit(&colly.LimitRule{
DomainGlob: "*",
Parallelism: 3,
Delay: 500 * time.Millisecond,
})
// Cache tokens per domain to avoid repeated extraction
tokenCache := make(map[string]string)
var cacheMutex sync.RWMutex
c.OnHTML("form", func(e *colly.HTMLElement) {
domain := e.Request.URL.Host
cacheMutex.RLock()
existingToken, exists := tokenCache[domain]
cacheMutex.RUnlock()
if !exists {
token := extractCSRFToken(e)
if token != "" {
cacheMutex.Lock()
tokenCache[domain] = token
cacheMutex.Unlock()
}
} else {
fmt.Printf("Using cached token for %s: %s\n", domain, existingToken)
}
})
}
Testing CSRF Token Implementation
When developing CSRF token handling, thorough testing is essential:
func testCSRFTokenExtraction() {
// Create a test HTML document
testHTML := `
<html>
<head>
<meta name="csrf-token" content="test-token-123">
</head>
<body>
<form action="/submit">
<input type="hidden" name="_token" value="form-token-456">
<input type="text" name="username">
<button type="submit">Submit</button>
</form>
</body>
</html>`
c := colly.NewCollector()
c.OnHTML("form", func(e *colly.HTMLElement) {
token := extractCSRFToken(e)
if token == "" {
log.Fatal("Failed to extract CSRF token")
}
fmt.Printf("Successfully extracted token: %s\n", token)
})
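// Serve the test document from a local server and visit it
// (requires the "net/http" and "net/http/httptest" imports)
srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
w.Header().Set("Content-Type", "text/html; charset=utf-8")
fmt.Fprint(w, testHTML)
}))
defer srv.Close()
if err := c.Visit(srv.URL); err != nil {
log.Fatal(err)
}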
}
Troubleshooting Common Issues
Token Expiration
func handleTokenExpiration(c *colly.Collector) {
c.OnResponse(func(r *colly.Response) {
// Check for token expiration responses
if strings.Contains(string(r.Body), "token expired") ||
strings.Contains(string(r.Body), "csrf token mismatch") {
fmt.Println("CSRF token expired, refreshing...")
// Navigate back to form page to get fresh token
baseURL := fmt.Sprintf("%s://%s", r.Request.URL.Scheme, r.Request.URL.Host)
c.Visit(baseURL + "/form")
}
})
}
Hidden Token in JavaScript
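// Requires the "regexp" and "github.com/PuerkitoBio/goquery" imports.

// jsTokenPatterns covers several common JS token layouts, compiled once
var jsTokenPatterns = []*regexp.Regexp{
	regexp.MustCompile(`window\.csrfToken\s*=\s*['"](.*?)['"]`),
	regexp.MustCompile(`_token["']?\s*:\s*["']([^"']+)["']`),
	regexp.MustCompile(`csrf_token["']?\s*:\s*["']([^"']+)["']`),
	regexp.MustCompile(`csrfToken["']?\s*:\s*["']([^"']+)["']`),
}

// extractJavaScriptToken expects e to contain the script elements, for
// example the <html> element.
func extractJavaScriptToken(e *colly.HTMLElement) string {
	var token string
	// EachWithBreak stops at the first script that yields a token, so a
	// later script cannot overwrite the result
	e.DOM.Find("script").EachWithBreak(func(i int, script *goquery.Selection) bool {
		content := script.Text()
		for _, re := range jsTokenPatterns {
			if matches := re.FindStringSubmatch(content); len(matches) > 1 {
				token = matches[1]
				return false
			}
		}
		return true
	})
	return token
}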
Conclusion
Handling CSRF tokens in Colly requires careful extraction, storage, and submission of these security tokens. The key principles include:
- Multiple extraction methods: Always try different ways to find CSRF tokens
- Framework awareness: Understand how different frameworks implement CSRF protection
- Session management: Maintain token state across multiple requests
- Error handling: Gracefully handle token validation failures and expiration
- Token refresh: Handle dynamic token updates in modern applications
- Performance optimization: Cache tokens when appropriate to reduce overhead
- Debugging and testing: Implement comprehensive logging and testing strategies
By following these patterns and best practices, you can successfully scrape websites with CSRF protection while maintaining the security expectations of the target applications. For scenarios requiring more sophisticated browser automation capabilities, consider evaluating tools that provide full JavaScript execution environments.
Remember to always respect the website's robots.txt file and terms of service when implementing these techniques, and ensure your scraping activities comply with applicable laws and regulations.