How do I handle complex form submissions with CSRF tokens in Colly?

Cross-Site Request Forgery (CSRF) tokens are a common defense that lets a web application verify that a state-changing request originated from one of its own pages. To submit a CSRF-protected form from a scraper, you first fetch the page that carries the token, extract it, and send it back with the form data in the same session. Colly's collector keeps cookies between requests by default, so the session cookie that the token is tied to is sent automatically. This guide covers techniques for handling complex form submissions with CSRF tokens in Colly.

Understanding CSRF Tokens

CSRF tokens are unique, unpredictable values that a web application generates to verify that a form submission comes from a page it served. They are typically delivered in one of these ways:

Hidden form fields with names like _token, csrf_token, or authenticity_token
Meta tags in the HTML head section
JavaScript variables embedded in inline scripts
JSON responses returned by an initial request

Basic CSRF Token Extraction and Form Submission

Here's a fundamental example of extracting a CSRF token from a form and submitting it:

package main

import (
    "fmt"
    "log"

    "github.com/gocolly/colly/v2"
    "github.com/gocolly/colly/v2/debug"
)

func main() {
    c := colly.NewCollector(
        colly.Debugger(&debug.LogDebugger{}),
    )

    // Guards against posting again if the response re-renders the login form
    submitted := false

    // Extract CSRF token from the login form
    c.OnHTML("form[action='/login']", func(e *colly.HTMLElement) {
        if submitted {
            return
        }

        // Look for a hidden CSRF token field, remembering which name matched
        tokenField := "_token"
        csrfToken := e.ChildAttr("input[name='_token']", "value")
        if csrfToken == "" {
            tokenField = "csrf_token"
            csrfToken = e.ChildAttr("input[name='csrf_token']", "value")
        }
        if csrfToken == "" {
            log.Println("No CSRF token found in login form")
            return
        }

        fmt.Printf("Extracted CSRF token: %s\n", csrfToken)

        // Collector.Post takes a map[string]string, not url.Values
        formData := map[string]string{
            tokenField: csrfToken,
            "username": "your_username",
            "password": "your_password",
        }

        submitted = true
        err := c.Post(e.Request.AbsoluteURL(e.Attr("action")), formData)
        if err != nil {
            log.Printf("Error submitting form: %v", err)
        }
    })

    // Handle form submission response
    c.OnResponse(func(r *colly.Response) {
        fmt.Printf("Status: %d\n", r.StatusCode)
        fmt.Printf("Response: %s\n", string(r.Body))
    })

    if err := c.Visit("https://example.com/login"); err != nil {
        log.Fatal(err)
    }
}

Advanced CSRF Token Handling Patterns

Multiple Token Extraction Methods

Some websites use multiple methods to provide CSRF tokens. Here's how to handle various scenarios:

// Compiled once at package level instead of on every call (requires the regexp package)
var jsTokenPattern = regexp.MustCompile(`window\.csrfToken\s*=\s*['"](.*?)['"]`)

func extractCSRFToken(e *colly.HTMLElement) string {
    // Method 1: Hidden form field inside the matched element
    if token := e.ChildAttr("input[name='_token']", "value"); token != "" {
        return token
    }

    // Meta tags and scripts usually live outside the form, so search the
    // whole document rather than only the matched element's descendants.
    doc := e.DOM.Closest("html")

    // Method 2: Meta tag in head
    if token := doc.Find("meta[name='csrf-token']").AttrOr("content", ""); token != "" {
        return token
    }

    // Method 3: Meta tag with different name
    if token := doc.Find("meta[name='_token']").AttrOr("content", ""); token != "" {
        return token
    }

    // Method 4: JavaScript variable extraction
    scriptContent := doc.Find("script").Text()
    if matches := jsTokenPattern.FindStringSubmatch(scriptContent); len(matches) > 1 {
        return matches[1]
    }

    return ""
}

Session-Based CSRF Token Management

For complex applications that require multiple form submissions, you need to maintain session state and handle token renewal:

package main

import (
    "encoding/json"
    "fmt"
    "log"
    "net/url"
    "regexp"
    "strings"
    "sync"
    "time"

    "github.com/PuerkitoBio/goquery"
    "github.com/gocolly/colly/v2"
    "github.com/gocolly/colly/v2/debug"
)

type CSRFHandler struct {
    collector *colly.Collector
    token     string
    sessionID string
}

func NewCSRFHandler() *CSRFHandler {
    c := colly.NewCollector(
        colly.Debugger(&debug.LogDebugger{}),
    )

    return &CSRFHandler{
        collector: c,
    }
}

func (h *CSRFHandler) extractToken(e *colly.HTMLElement) {
    // Try multiple extraction methods
    token := h.tryMultipleExtractionMethods(e)
    if token != "" {
        h.token = token
        fmt.Printf("CSRF token updated: %s\n", token)
    }
}

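// tryMultipleExtractionMethods searches e and its descendants, so e should be
// the document's html element (for example, a callback registered with
// OnHTML("html", ...)) for meta tags and scripts to be in scope.
func (h *CSRFHandler) tryMultipleExtractionMethods(e *colly.HTMLElement) string {
    // Hidden input field
    if token := e.ChildAttr("input[name='_token']", "value"); token != "" {
        return token
    }

    // Meta tag
    if token := e.DOM.Find("meta[name='csrf-token']").AttrOr("content", ""); token != "" {
        return token
    }

    // JavaScript variable: stop at the first script that contains a token
    pattern := regexp.MustCompile(`csrf_token["']?\s*:\s*["']([^"']+)["']`)
    var token string
    e.ForEachWithBreak("script", func(_ int, s *colly.HTMLElement) bool {
        if matches := pattern.FindStringSubmatch(s.Text); len(matches) > 1 {
            token = matches[1]
            return false // stop iterating
        }
        return true // keep looking
    })

    return token
}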

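func (h *CSRFHandler) SubmitForm(actionURL string, formData map[string]string) error {
    if h.token == "" {
        return fmt.Errorf("CSRF token not available")
    }

    // Colly needs an absolute URL to post to
    if u, err := url.Parse(actionURL); err != nil || !u.IsAbs() {
        return fmt.Errorf("action URL %q must be absolute", actionURL)
    }

    // Copy the caller's fields and add the CSRF token.
    // Collector.Post takes a map[string]string, not url.Values.
    payload := make(map[string]string, len(formData)+1)
    for key, value := range formData {
        payload[key] = value
    }
    payload["_token"] = h.token

    return h.collector.Post(actionURL, payload)
}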

Handling Dynamic CSRF Tokens with AJAX

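Modern web applications often refresh CSRF tokens via AJAX requests. Colly does not execute JavaScript, so it never sees those requests on its own; you have to request the token endpoint yourself and read the token from the JSON response. Here's how to handle this scenario: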

// onToken is called with each token found, so the caller decides where to store it
func handleAjaxCSRFRefresh(c *colly.Collector, onToken func(string)) {
    // Inspect JSON responses to requests this collector made for a new token
    c.OnResponse(func(r *colly.Response) {
        contentType := r.Headers.Get("Content-Type")
        if strings.Contains(contentType, "application/json") {
            var jsonResponse map[string]interface{}
            if err := json.Unmarshal(r.Body, &jsonResponse); err == nil {
                // Check for CSRF token in JSON response
                if token, exists := jsonResponse["csrf_token"]; exists {
                    if tokenStr, ok := token.(string); ok {
                        fmt.Printf("Updated CSRF token from AJAX: %s\n", tokenStr)
                        onToken(tokenStr)
                    }
                }
            }
        }
    })

    // Request the token endpoint when a page's script references it
    c.OnHTML("script", func(e *colly.HTMLElement) {
        // "/api/csrf-token" is an example path; use the endpoint the site actually calls
        if strings.Contains(e.Text, "/api/csrf-token") {
            if err := c.Visit(e.Request.AbsoluteURL("/api/csrf-token")); err != nil {
                fmt.Printf("Could not fetch CSRF token endpoint: %v\n", err)
            }
        }
    })
}
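
The JSON lookup can also be kept in a small helper that does not depend on Colly, which makes it easy to check against sample payloads. This is a minimal sketch: the helper name tokenFromJSON and every key in tokenKeys other than csrf_token are assumptions for illustration, so adjust them to what the target endpoint really returns.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// tokenKeys lists JSON field names to try, in order. Only "csrf_token"
// comes from the example above; the others are illustrative guesses.
var tokenKeys = []string{"csrf_token", "csrfToken", "_token"}

// tokenFromJSON returns the first non-empty string stored under one of
// tokenKeys in a JSON object. It reports false for invalid JSON, for
// non-object payloads, and when no key holds a string.
func tokenFromJSON(body []byte) (string, bool) {
	var payload map[string]interface{}
	if err := json.Unmarshal(body, &payload); err != nil {
		return "", false
	}
	for _, key := range tokenKeys {
		if s, ok := payload[key].(string); ok && s != "" {
			return s, true
		}
	}
	return "", false
}

func main() {
	token, ok := tokenFromJSON([]byte(`{"csrfToken":"abc123","ttl":3600}`))
	fmt.Println(token, ok)
}
```

Inside an OnResponse callback, the helper would be called as tokenFromJSON(r.Body).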

Complex Multi-Step Form Submission

For applications requiring multiple form submissions with token validation at each step:

func handleMultiStepForm(c *colly.Collector) {
    var currentToken string

    // submit posts the data to the form's action URL and logs any failure.
    // Collector.Post takes a map[string]string, not url.Values.
    submit := func(e *colly.HTMLElement, formData map[string]string) {
        if err := c.Post(e.Request.AbsoluteURL(e.Attr("action")), formData); err != nil {
            log.Printf("Error submitting form %q: %v", e.Attr("id"), err)
        }
    }

    // Step 1: Initial form
    c.OnHTML("form#step1", func(e *colly.HTMLElement) {
        currentToken = e.ChildAttr("input[name='_token']", "value")

        submit(e, map[string]string{
            "_token":    currentToken,
            "step":      "1",
            "user_data": "initial_value",
        })
    })

    // Step 2: Intermediate form
    c.OnHTML("form#step2", func(e *colly.HTMLElement) {
        // Token might be refreshed; keep the old one if the page has none
        if newToken := e.ChildAttr("input[name='_token']", "value"); newToken != "" {
            currentToken = newToken
        }

        submit(e, map[string]string{
            "_token":          currentToken,
            "step":            "2",
            "additional_data": "step2_value",
        })
    })

    // Final step
    c.OnHTML("form#final", func(e *colly.HTMLElement) {
        if finalToken := e.ChildAttr("input[name='_token']", "value"); finalToken != "" {
            currentToken = finalToken
        }

        submit(e, map[string]string{
            "_token": currentToken,
            "submit": "final",
        })
    })
}
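
The rule each step follows (adopt a new token if the page provides one, otherwise keep the previous one) can be factored into a small state holder. The sketch below is one possible shape, not part of Colly; the formState type and its method names are hypothetical.

```go
package main

import "fmt"

// formState tracks the most recent CSRF token across the steps of a
// multi-step form. The type is an illustrative helper, not a Colly API.
type formState struct {
	token string
}

// Update stores newToken unless it is empty, in which case the
// previously stored token is kept.
func (s *formState) Update(newToken string) {
	if newToken != "" {
		s.token = newToken
	}
}

// Payload copies fields into a new map and adds the current token
// under tokenField, leaving the caller's map untouched.
func (s *formState) Payload(tokenField string, fields map[string]string) map[string]string {
	out := make(map[string]string, len(fields)+1)
	for k, v := range fields {
		out[k] = v
	}
	out[tokenField] = s.token
	return out
}

func main() {
	var s formState
	s.Update("tok-1") // token from step 1
	s.Update("")      // step 2 page carried no new token
	fmt.Println(s.Payload("_token", map[string]string{"step": "2"}))
}
```

Because Payload returns a map[string]string, its result can be passed directly to the collector's Post method.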

Error Handling and Token Validation

Implement robust error handling for CSRF-related issues:

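func handleCSRFErrors(c *colly.Collector) {
    // Re-visiting a URL is rejected unless the collector allows revisits
    c.AllowURLRevisit = true

    // Cap refresh attempts so a permanently rejected request cannot loop forever.
    // Guard this counter with a mutex if the collector runs asynchronously.
    const maxRefreshes = 3
    refreshes := 0

    c.OnError(func(r *colly.Response, err error) {
        // 419 is Laravel's "Page Expired" status; 403 is a common generic rejection
        if r.StatusCode != 419 && r.StatusCode != 403 {
            return
        }
        if refreshes >= maxRefreshes {
            fmt.Printf("Giving up after %d token refreshes: %v\n", refreshes, err)
            return
        }
        refreshes++

        fmt.Printf("CSRF token error (Status: %d). Refreshing token...\n", r.StatusCode)
        // Re-visit the form page to get a fresh token. This assumes the form
        // is served from the same URL it posts to.
        if visitErr := c.Visit(r.Request.URL.String()); visitErr != nil {
            fmt.Printf("Could not reload form page: %v\n", visitErr)
        }
    })

    c.OnHTML("div.csrf-error", func(e *colly.HTMLElement) {
        fmt.Printf("CSRF validation failed: %s\n", e.Text)
        // Handle the error by re-fetching the form
    })
}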
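
Refreshing a token on every rejection can loop indefinitely if the server keeps refusing the request, so it helps to bound the number of attempts per URL. The following is a minimal, concurrency-safe sketch of such a limit; refreshBudget and its methods are hypothetical names, and the limit of two attempts is an arbitrary choice.

```go
package main

import (
	"fmt"
	"sync"
)

// refreshBudget limits how many token refreshes may be attempted per key
// (for example, per form URL). It is an illustrative helper, not a Colly API.
type refreshBudget struct {
	mu    sync.Mutex
	max   int
	spent map[string]int
}

func newRefreshBudget(max int) *refreshBudget {
	return &refreshBudget{max: max, spent: make(map[string]int)}
}

// Allow reports whether another refresh may be attempted for key and,
// if so, records the attempt.
func (b *refreshBudget) Allow(key string) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.spent[key] >= b.max {
		return false
	}
	b.spent[key]++
	return true
}

func main() {
	budget := newRefreshBudget(2)
	for i := 0; i < 3; i++ {
		fmt.Println(budget.Allow("https://example.com/login"))
	}
}
```

An error callback would call Allow with the failing URL before re-visiting the form page and stop retrying once it returns false.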

Best Practices for CSRF Token Handling

1. Token Caching and Reuse

type TokenCache struct {
    tokens map[string]string
    mutex  sync.RWMutex
}

func (tc *TokenCache) Set(domain, token string) {
    tc.mutex.Lock()
    defer tc.mutex.Unlock()
    tc.tokens[domain] = token
}

func (tc *TokenCache) Get(domain string) string {
    tc.mutex.RLock()
    defer tc.mutex.RUnlock()
    return tc.tokens[domain]
}
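
A cache keyed by domain works best when the key is derived the same way every time. As a sketch, the helper below normalizes a URL to a lowercase scheme-and-host string; originKey is a hypothetical name, and keying by scheme plus host is an assumption that may need adjusting if a site shares sessions across subdomains.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
	"strings"
)

// originKey reduces a URL to "scheme://host" in lower case so that
// different pages on the same site map to the same cache entry.
func originKey(rawURL string) (string, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", err
	}
	if u.Scheme == "" || u.Host == "" {
		return "", errors.New("URL must be absolute: " + rawURL)
	}
	return strings.ToLower(u.Scheme) + "://" + strings.ToLower(u.Host), nil
}

func main() {
	tokens := map[string]string{}

	// Store a token under the login page's origin.
	key, _ := originKey("https://Example.com/login")
	tokens[key] = "tok-1"

	// A different page on the same site resolves to the same key.
	key, _ = originKey("https://example.com/account/settings")
	fmt.Println(key, tokens[key])
}
```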

2. Concurrent Request Handling

When scraping multiple pages simultaneously, ensure proper token management:

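func handleConcurrentRequests() {
    // Parallelism only has an effect when requests run concurrently,
    // for example on an asynchronous collector
    c := colly.NewCollector(colly.Async(true))
    if err := c.Limit(&colly.LimitRule{
        DomainGlob:  "*",
        Parallelism: 2,
        Delay:       1 * time.Second,
    }); err != nil {
        log.Printf("Invalid limit rule: %v", err)
    }

    tokenCache := &TokenCache{
        tokens: make(map[string]string),
    }

    c.OnHTML("form", func(e *colly.HTMLElement) {
        domain := e.Request.URL.Host
        // Only cache tokens that were actually found
        if token := extractCSRFToken(e); token != "" {
            tokenCache.Set(domain, token)
        }
    })

    // Queue pages with c.Visit(...), then call c.Wait() to block until all
    // asynchronous requests have finished.
}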

3. Debugging CSRF Issues

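func debugCSRFHandling(c *colly.Collector) {
    c.OnRequest(func(r *colly.Request) {
        if r.Method == "POST" {
            fmt.Printf("POST Request to: %s\n", r.URL)
            // r.Body is an io.Reader, so printing it with %s does not show the
            // form data, and reading it here could consume the payload before
            // it is sent. Log the form data where you build it instead.
        }
    })

    // Colly reports HTTP error statuses through OnError rather than OnResponse
    // unless c.ParseHTTPErrorResponse is set to true.
    c.OnError(func(r *colly.Response, err error) {
        fmt.Printf("Error response: %d (%v)\n", r.StatusCode, err)
        fmt.Printf("Response body: %s\n", string(r.Body))
    })
}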
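
Debug output that prints full tokens can end up in shared logs, so it may be worth shortening them before printing. The helper below is a small sketch of that idea; redactToken is a hypothetical name, and the choice to keep four leading characters is arbitrary. It assumes ASCII tokens, which is typical for CSRF values.

```go
package main

import "fmt"

// redactToken shortens a token for logging: it keeps a short prefix so
// that different tokens can still be told apart, and hides the rest.
// Very short values are hidden completely.
func redactToken(token string) string {
	const keep = 4
	if len(token) <= keep*2 {
		return "[redacted]"
	}
	return fmt.Sprintf("%s...(%d chars)", token[:keep], len(token))
}

func main() {
	fmt.Println(redactToken("abcd1234efgh"))
	fmt.Println(redactToken("short"))
}
```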

Working with File Uploads and CSRF Tokens

Many forms with CSRF protection also handle file uploads. Here's how to manage both:

func handleFileUploadWithCSRF(c *colly.Collector) {
    c.OnHTML("form[enctype='multipart/form-data']", func(e *colly.HTMLElement) {
        token := e.ChildAttr("input[name='_token']", "value")
        formAction := e.Request.AbsoluteURL(e.Attr("action"))

        // Text fields to send alongside the file, including the CSRF token
        formData := map[string]string{
            "_token": token,
            "title":  "File Upload",
        }

        // Note: Colly's PostMultipart accepts a map of field names to raw bytes.
        // For uploads that need a filename or a per-part content type, consider
        // building the body with mime/multipart and sending it via net/http.
        fmt.Printf("Form action: %s, fields: %v\n", formAction, formData)
    })
}
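
For the multipart case, the standard library's mime/multipart package can build a body that carries both the CSRF token and a named file part. This is a sketch under stated assumptions: buildUpload is a hypothetical helper, and the field names _token and attachment are examples rather than names any particular site uses.

```go
package main

import (
	"bytes"
	"fmt"
	"mime/multipart"
)

// buildUpload writes the CSRF token as a regular form field and the file
// as a file part. It returns the encoded body and the Content-Type header
// value (which includes the multipart boundary).
func buildUpload(tokenField, token, fileField, fileName string, data []byte) (*bytes.Buffer, string, error) {
	body := &bytes.Buffer{}
	w := multipart.NewWriter(body)

	if err := w.WriteField(tokenField, token); err != nil {
		return nil, "", err
	}
	part, err := w.CreateFormFile(fileField, fileName)
	if err != nil {
		return nil, "", err
	}
	if _, err := part.Write(data); err != nil {
		return nil, "", err
	}
	// Close writes the trailing boundary; the body is incomplete without it.
	if err := w.Close(); err != nil {
		return nil, "", err
	}
	return body, w.FormDataContentType(), nil
}

func main() {
	body, contentType, err := buildUpload("_token", "tok-1", "attachment", "notes.txt", []byte("hello"))
	if err != nil {
		fmt.Println("build failed:", err)
		return
	}
	fmt.Println(contentType)
	fmt.Println(body.Len() > 0)
}
```

The body can then be sent with net/http, or through the collector's Request method with the returned value set as the Content-Type header so that the request shares the collector's cookies.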

Integration with Authentication Systems

CSRF tokens often work alongside authentication systems: the token is usually tied to the session cookie issued at login, so the page fetch and the form submission must go through the same collector. For login flows that depend on browser behavior beyond cookies and form posts, a browser automation tool may be a better fit than Colly.

Colly parses the HTML the server returns and does not execute JavaScript. If an application generates its forms or tokens in JavaScript after the page loads, the token will not be present in what Colly sees, and a tool that renders JavaScript is likely more appropriate.

Common CSRF Token Patterns by Framework

Different web frameworks implement CSRF tokens differently:

Laravel (PHP)

// Laravel uses the _token field and a csrf-token meta tag
token := e.ChildAttr("input[name='_token']", "value")
if token == "" {
    // The meta tag lives in <head>, so search from the document root
    token = e.DOM.Closest("html").Find("meta[name='csrf-token']").AttrOr("content", "")
}

Django (Python)

// Django uses csrfmiddlewaretoken
token := e.ChildAttr("input[name='csrfmiddlewaretoken']", "value")

Rails (Ruby)

// Rails uses authenticity_token
token := e.ChildAttr("input[name='authenticity_token']", "value")

Express.js with csurf

// Express with csurf middleware
token := e.ChildAttr("input[name='_csrf']", "value")
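
When the framework behind a site is unknown, the field names above can be tried in turn. The sketch below does this over a plain map of hidden-field names to values, so it can be exercised without Colly; pickToken and tokenFieldNames are hypothetical names, and the order of the list is an arbitrary choice.

```go
package main

import "fmt"

// tokenFieldNames collects the hidden-field names mentioned in this
// guide for Laravel, Django, Rails, and Express with csurf.
var tokenFieldNames = []string{
	"_token",
	"csrfmiddlewaretoken",
	"authenticity_token",
	"_csrf",
	"csrf_token",
}

// pickToken returns the name and value of the first known token field
// that is present in fields with a non-empty value.
func pickToken(fields map[string]string) (name, value string, ok bool) {
	for _, candidate := range tokenFieldNames {
		if v := fields[candidate]; v != "" {
			return candidate, v, true
		}
	}
	return "", "", false
}

func main() {
	hidden := map[string]string{
		"csrfmiddlewaretoken": "dj-123",
		"next":                "/dashboard",
	}
	name, value, ok := pickToken(hidden)
	fmt.Println(name, value, ok)
}
```

Returning the field name along with the value matters because the token has to be posted back under the same name it was found under.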

Performance Optimization for CSRF Handling

When scraping multiple pages with forms, optimize your CSRF token handling:

func optimizedCSRFHandling() {
    c := colly.NewCollector()

    // Rate-limit requests per domain. Parallelism only matters when requests
    // run concurrently, for example with colly.Async(true).
    c.Limit(&colly.LimitRule{
        DomainGlob:  "*",
        Parallelism: 3,
        Delay:       500 * time.Millisecond,
    })

    // Cache tokens per domain to avoid repeated extraction
    tokenCache := make(map[string]string)
    var cacheMutex sync.RWMutex

    c.OnHTML("form", func(e *colly.HTMLElement) {
        domain := e.Request.URL.Host

        cacheMutex.RLock()
        existingToken, exists := tokenCache[domain]
        cacheMutex.RUnlock()

        if !exists {
            token := extractCSRFToken(e)
            if token != "" {
                cacheMutex.Lock()
                tokenCache[domain] = token
                cacheMutex.Unlock()
            }
        } else {
            fmt.Printf("Using cached token for %s: %s\n", domain, existingToken)
        }
    })
}

Testing CSRF Token Implementation

When developing CSRF token handling, thorough testing is essential:


func testCSRFTokenExtraction() {
    // Create a test HTML document
    testHTML := `
    <html>
    <head>
        <meta name="csrf-token" content="test-token-123">
    </head>
    <body>
        <form action="/submit">
            <input type="hidden" name="_token" value="form-token-456">
            <input type="text" name="username">
            <button type="submit">Submit</button>
        </form>
    </body>
    </html>`

    // Serve the fixture from a local server so Colly has something to fetch
    // (requires net/http and net/http/httptest)
    srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
        w.Header().Set("Content-Type", "text/html; charset=utf-8")
        fmt.Fprint(w, testHTML)
    }))
    defer srv.Close()

    c := colly.NewCollector()

    found := false
    c.OnHTML("form", func(e *colly.HTMLElement) {
        if token := extractCSRFToken(e); token != "" {
            found = true
            fmt.Printf("Successfully extracted token: %s\n", token)
        }
    })

    if err := c.Visit(srv.URL); err != nil {
        log.Printf("Visit failed: %v", err)
        return
    }
    if !found {
        log.Print("Failed to extract CSRF token")
    }
}
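
It can also help to test the submission side against a server that enforces a token. The sketch below uses only the standard library: a local test server hands out a fixed token and rejects posts that do not return it. The names newCSRFTestServer, submit, and fixtureToken are hypothetical, and a fixed token is only suitable for tests.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"net/url"
)

// fixtureToken is a fixed value for tests; real tokens are unpredictable.
const fixtureToken = "fixture-token-123"

// newCSRFTestServer serves a form at /form and accepts posts at /submit
// only when they carry the expected token.
func newCSRFTestServer() *httptest.Server {
	mux := http.NewServeMux()
	mux.HandleFunc("/form", func(w http.ResponseWriter, _ *http.Request) {
		w.Header().Set("Content-Type", "text/html; charset=utf-8")
		fmt.Fprintf(w, `<form action="/submit" method="post">`+
			`<input type="hidden" name="_token" value="%s"></form>`, fixtureToken)
	})
	mux.HandleFunc("/submit", func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost || r.PostFormValue("_token") != fixtureToken {
			http.Error(w, "csrf token mismatch", http.StatusForbidden)
			return
		}
		fmt.Fprint(w, "ok")
	})
	return httptest.NewServer(mux)
}

// submit posts token to the server's /submit path and returns the status code.
func submit(baseURL, token string) (int, error) {
	resp, err := http.PostForm(baseURL+"/submit", url.Values{"_token": {token}})
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	io.Copy(io.Discard, resp.Body) // drain so the connection can be reused
	return resp.StatusCode, nil
}

func main() {
	srv := newCSRFTestServer()
	defer srv.Close()

	good, _ := submit(srv.URL, fixtureToken)
	bad, _ := submit(srv.URL, "wrong-token")
	fmt.Println(good, bad) // prints "200 403"
}
```

Pointing a collector at the server's /form URL instead of calling submit directly exercises the full extract-and-post flow without touching a real site.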

Troubleshooting Common Issues

Token Expiration

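func handleTokenExpiration(c *colly.Collector) {
    // By default Colly routes HTTP error statuses such as 403 or 419 to OnError,
    // so this callback only sees failures that arrive with a success status
    // unless c.ParseHTTPErrorResponse is set to true.
    c.OnResponse(func(r *colly.Response) {
        // Check for token expiration messages, ignoring case
        body := strings.ToLower(string(r.Body))
        if strings.Contains(body, "token expired") ||
            strings.Contains(body, "csrf token mismatch") {

            fmt.Println("CSRF token expired, refreshing...")
            // Navigate back to the form page to get a fresh token. "/form" is a
            // placeholder path, and revisiting it requires c.AllowURLRevisit = true.
            if err := c.Visit(r.Request.AbsoluteURL("/form")); err != nil {
                fmt.Printf("Could not reload form page: %v\n", err)
            }
        }
    })
}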
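
The detection logic can be pulled into a function that looks at both the status code and the body, which makes it usable from either a response or an error callback. This is a sketch: looksLikeCSRFFailure is a hypothetical name, and the marker strings are examples (two from the check above, plus the wording of Django's and Rails' standard errors) that should be adapted to the target site.

```go
package main

import (
	"fmt"
	"strings"
)

// csrfMarkers are lower-case substrings that suggest a CSRF rejection.
// The list is illustrative, not exhaustive.
var csrfMarkers = []string{
	"token expired",
	"csrf token mismatch",
	"csrf verification failed", // Django's default error page
	"invalidauthenticitytoken", // Rails exception name
}

// looksLikeCSRFFailure reports whether a response appears to be a CSRF
// rejection: either status 419 (used by Laravel for expired pages) or a
// body containing one of csrfMarkers, compared case-insensitively.
func looksLikeCSRFFailure(status int, body string) bool {
	if status == 419 {
		return true
	}
	lower := strings.ToLower(body)
	for _, marker := range csrfMarkers {
		if strings.Contains(lower, marker) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(looksLikeCSRFFailure(403, "CSRF verification failed. Request aborted."))
	fmt.Println(looksLikeCSRFFailure(403, "You do not have permission to view this page."))
}
```

A bare 403 is deliberately not treated as a CSRF failure here, since servers return it for many unrelated reasons.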

Hidden Token in JavaScript

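// Patterns for different JS implementations, compiled once at package level
var jsTokenPatterns = []*regexp.Regexp{
    regexp.MustCompile(`window\.csrfToken\s*=\s*['"](.*?)['"]`),
    regexp.MustCompile(`_token["']?\s*:\s*["']([^"']+)["']`),
    regexp.MustCompile(`csrf_token["']?\s*:\s*["']([^"']+)["']`),
    regexp.MustCompile(`csrfToken["']?\s*:\s*["']([^"']+)["']`),
}

// Requires the github.com/PuerkitoBio/goquery import for *goquery.Selection
func extractJavaScriptToken(e *colly.HTMLElement) string {
    var token string

    // EachWithBreak stops at the first script that yields a token, so a
    // later script cannot overwrite it
    e.DOM.Find("script").EachWithBreak(func(_ int, script *goquery.Selection) bool {
        content := script.Text()

        for _, pattern := range jsTokenPatterns {
            if matches := pattern.FindStringSubmatch(content); len(matches) > 1 {
                token = matches[1]
                return false // stop iterating
            }
        }
        return true // keep looking
    })

    return token
}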

Conclusion

Handling CSRF tokens in Colly requires careful extraction, storage, and submission of these security tokens. The key principles include:

Multiple extraction methods: Always try different ways to find CSRF tokens
Framework awareness: Understand how different frameworks implement CSRF protection
Session management: Maintain token state across multiple requests
Error handling: Gracefully handle token validation failures and expiration
Token refresh: Handle dynamic token updates in modern applications
Performance optimization: Cache tokens when appropriate to reduce overhead
Debugging and testing: Implement comprehensive logging and testing strategies

By following these patterns and best practices, you can submit forms on CSRF-protected sites from Colly in the same way a browser session would. For pages that build their forms or tokens with JavaScript, consider a tool that provides a full JavaScript execution environment.

Remember to always respect the website's robots.txt file and terms of service when implementing these techniques, and ensure your scraping activities comply with applicable laws and regulations.
