Can I use Colly to scrape APIs that return JSON responses?

Yes, Colly is excellent for scraping APIs that return JSON responses! While Colly is primarily known as a web scraping framework for HTML content, it's equally powerful for API scraping and JSON data extraction. Colly provides robust HTTP client capabilities, making it ideal for consuming RESTful APIs, handling authentication, and processing structured JSON data.

Why Use Colly for API Scraping?

Colly offers several advantages for API scraping:

  • Built-in HTTP client with connection pooling and rate limiting
  • Retry support for failed requests via Request.Retry()
  • Concurrent request handling for improved performance (see the sketch after this list)
  • Cookie and session management for authenticated APIs
  • Flexible response processing with custom callbacks
  • Request/response middleware for logging and debugging
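
As a quick illustration of the concurrency support, here is a minimal sketch: an asynchronous collector that fetches two endpoints in parallel while capping the number of simultaneous requests.

package main

import (
    "fmt"
    "log"

    "github.com/gocolly/colly/v2"
)

func main() {
    // Async(true) makes Visit() non-blocking, so requests run concurrently
    c := colly.NewCollector(colly.Async(true))

    // Cap concurrency so the API is not overwhelmed
    if err := c.Limit(&colly.LimitRule{DomainGlob: "*", Parallelism: 4}); err != nil {
        log.Fatal(err)
    }

    c.OnResponse(func(r *colly.Response) {
        fmt.Printf("%d bytes from %s\n", len(r.Body), r.Request.URL)
    })

    for _, url := range []string{
        "https://jsonplaceholder.typicode.com/posts",
        "https://jsonplaceholder.typicode.com/users",
    } {
        c.Visit(url)
    }

    // Wait blocks until every queued request has finished
    c.Wait()
}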

Basic API Scraping with Colly

Here's a simple example of using Colly to scrape a JSON API:

package main

import (
    "encoding/json"
    "fmt"
    "log"

    "github.com/gocolly/colly/v2"
)

type Post struct {
    ID     int    `json:"id"`
    Title  string `json:"title"`
    Body   string `json:"body"`
    UserID int    `json:"userId"`
}

func main() {
    c := colly.NewCollector()

    // Handle JSON responses
    c.OnResponse(func(r *colly.Response) {
        var posts []Post
        err := json.Unmarshal(r.Body, &posts)
        if err != nil {
            log.Printf("Error parsing JSON: %v", err)
            return
        }

        for _, post := range posts {
            fmt.Printf("Post ID: %d, Title: %s\n", post.ID, post.Title)
        }
    })

    // Handle errors
    c.OnError(func(r *colly.Response, err error) {
        log.Printf("Error: %s", err.Error())
    })

    // Visit the API endpoint
    c.Visit("https://jsonplaceholder.typicode.com/posts")
}
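
Note that OnResponse fires for every response, whatever its content type. When a collector mixes JSON endpoints with ordinary pages, it can help to guard the parse; a small sketch (assuming the strings package is imported):

c.OnResponse(func(r *colly.Response) {
    // Skip non-JSON responses before attempting to unmarshal
    if !strings.Contains(r.Headers.Get("Content-Type"), "application/json") {
        return
    }
    // ... json.Unmarshal(r.Body, ...) as shown above
})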

Advanced API Scraping Techniques

1. Handling Different JSON Structures

When working with various API endpoints, you'll encounter different JSON structures:

import "strings"

func setupAPICollector() *colly.Collector {
    c := colly.NewCollector()

    // Route responses to per-endpoint handlers based on the URL
    c.OnResponse(func(r *colly.Response) {
        url := r.Request.URL.String()
        switch {
        case strings.Contains(url, "/users"):
            handleUsersResponse(r)
        case strings.Contains(url, "/posts"):
            handlePostsResponse(r)
        }
    })

    return c
}

func handleUsersResponse(r *colly.Response) {
    type User struct {
        ID    int    `json:"id"`
        Name  string `json:"name"`
        Email string `json:"email"`
    }

    var users []User
    if err := json.Unmarshal(r.Body, &users); err != nil {
        log.Printf("Error parsing users: %v", err)
        return
    }

    for _, user := range users {
        fmt.Printf("User: %s (%s)\n", user.Name, user.Email)
    }
}
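
handlePostsResponse, referenced above, would follow the same pattern. A sketch reusing the Post struct from the basic example:

func handlePostsResponse(r *colly.Response) {
    var posts []Post // Post struct as defined in the basic example
    if err := json.Unmarshal(r.Body, &posts); err != nil {
        log.Printf("Error parsing posts: %v", err)
        return
    }

    for _, post := range posts {
        fmt.Printf("Post: %s\n", post.Title)
    }
}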

2. API Authentication

Many APIs require authentication. Here's how to handle different authentication methods:

import "encoding/base64"

// API Key Authentication
func setupAPIKeyAuth(c *colly.Collector, apiKey string) {
    c.OnRequest(func(r *colly.Request) {
        r.Headers.Set("X-API-Key", apiKey)
    })
}

// Bearer Token Authentication
func setupBearerAuth(c *colly.Collector, token string) {
    c.OnRequest(func(r *colly.Request) {
        r.Headers.Set("Authorization", "Bearer "+token)
    })
}

// Basic Authentication
func setupBasicAuth(c *colly.Collector, username, password string) {
    c.OnRequest(func(r *colly.Request) {
        r.Headers.Set("Authorization", 
            "Basic "+base64.StdEncoding.EncodeToString(
                []byte(username+":"+password)))
    })
}
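
Wiring any of these helpers into a collector is then a single call; for example, reading the token from an environment variable (one option among many):

import "os"

func main() {
    c := colly.NewCollector()
    setupBearerAuth(c, os.Getenv("API_TOKEN"))
    c.Visit("https://api.example.com/protected") // hypothetical endpoint
}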

3. Handling Paginated APIs

Many APIs use pagination. Here's how to handle paginated responses:

func scrapePaginatedAPI() {
    c := colly.NewCollector()

    type APIResponse struct {
        Data     []interface{} `json:"data"`
        NextPage *string       `json:"next_page"`
        Page     int           `json:"page"`
    }

    c.OnResponse(func(r *colly.Response) {
        var response APIResponse
        if err := json.Unmarshal(r.Body, &response); err != nil {
            log.Printf("Error parsing response: %v", err)
            return
        }

        // Process current page data
        fmt.Printf("Processing page %d with %d items\n", 
            response.Page, len(response.Data))

        // Visit next page if available
        if response.NextPage != nil {
            c.Visit(*response.NextPage)
        }
    })

    // Start with first page
    c.Visit("https://api.example.com/data?page=1")
}
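
Not every API returns a ready-made next-page URL; some only report the current page and a total. Under that assumption (the total_pages field below is hypothetical), incrementing a page parameter works just as well:

func scrapeNumberedPages() {
    c := colly.NewCollector()

    type PagedResponse struct {
        Page       int `json:"page"`
        TotalPages int `json:"total_pages"` // hypothetical field name
    }

    c.OnResponse(func(r *colly.Response) {
        var resp PagedResponse
        if err := json.Unmarshal(r.Body, &resp); err != nil {
            log.Printf("Error parsing response: %v", err)
            return
        }

        // Request the next page until the last one is reached
        if resp.Page < resp.TotalPages {
            c.Visit(fmt.Sprintf("https://api.example.com/data?page=%d", resp.Page+1))
        }
    })

    c.Visit("https://api.example.com/data?page=1")
}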

4. Rate Limiting and Delays

Implement proper rate limiting to avoid overwhelming APIs:

import (
    "log"
    "time"

    "github.com/gocolly/colly/v2/debug"
)

func setupRateLimiting() *colly.Collector {
    c := colly.NewCollector(
        colly.Debugger(&debug.LogDebugger{}),
    )

    // Limit concurrency and pause between requests
    if err := c.Limit(&colly.LimitRule{
        DomainGlob:  "*",
        Parallelism: 2,
        Delay:       1 * time.Second,
    }); err != nil {
        log.Fatal(err)
    }

    return c
}
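
One detail worth noting: Parallelism only comes into play when the collector runs asynchronously; a default synchronous collector already performs one request at a time. A typical pairing looks like this:

c := colly.NewCollector(colly.Async(true))
if err := c.Limit(&colly.LimitRule{
    DomainGlob:  "*",
    Parallelism: 2,
    Delay:       1 * time.Second,
}); err != nil {
    log.Fatal(err)
}
// ... register callbacks and queue visits with c.Visit(...) ...
c.Wait() // with Async, block until all in-flight requests finish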

POST Requests and Form Data

Colly also supports POST requests for APIs that require data submission:

func submitAPIData() {
    c := colly.NewCollector()

    // Handle response
    c.OnResponse(func(r *colly.Response) {
        type CreateResponse struct {
            ID      int    `json:"id"`
            Message string `json:"message"`
        }

        var response CreateResponse
        if err := json.Unmarshal(r.Body, &response); err != nil {
            log.Printf("Error parsing response: %v", err)
            return
        }
        fmt.Printf("Created resource with ID: %d\n", response.ID)
    })

    // Prepare POST data
    postData := map[string]interface{}{
        "title":  "New Post",
        "body":   "This is the content",
        "userId": 1,
    }

    jsonData, err := json.Marshal(postData)
    if err != nil {
        log.Fatal(err)
    }

    c.OnRequest(func(r *colly.Request) {
        r.Headers.Set("Content-Type", "application/json")
    })

    if err := c.PostRaw("https://jsonplaceholder.typicode.com/posts", jsonData); err != nil {
        log.Printf("POST request failed: %v", err)
    }
}

Error Handling and Retry Logic

Implement robust error handling for API failures:

func setupErrorHandling(c *colly.Collector) {
    // Handle HTTP errors
    c.OnError(func(r *colly.Response, err error) {
        log.Printf("Request URL: %s failed with response: %s\nError: %s", 
            r.Request.URL, r.Body, err)

        // Retry on rate limiting or server errors
        // (uncapped here; see the bounded version after this example)
        if r.StatusCode == 429 || r.StatusCode >= 500 {
            time.Sleep(5 * time.Second)
            r.Request.Retry()
        }
    })

    // Set timeout
    c.SetRequestTimeout(30 * time.Second)
}
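
Left uncapped, the retry above can loop forever against an endpoint that keeps failing. One way to bound it is a counter stored in the request context, which Colly carries across retries (a sketch; the "retries" key is arbitrary):

c.OnError(func(r *colly.Response, err error) {
    // GetAny returns nil until the key is set, so the assertion yields 0
    retries, _ := r.Request.Ctx.GetAny("retries").(int)
    if retries >= 3 {
        log.Printf("Giving up on %s after %d retries", r.Request.URL, retries)
        return
    }

    if r.StatusCode == 429 || r.StatusCode >= 500 {
        r.Request.Ctx.Put("retries", retries+1)
        time.Sleep(5 * time.Second)
        r.Request.Retry()
    }
})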

Working with Complex JSON Structures

Handle nested JSON objects and arrays effectively:

type ComplexResponse struct {
    Meta struct {
        Total int    `json:"total"`
        Page  int    `json:"page"`
        Limit int    `json:"limit"`
    } `json:"meta"`
    Data []struct {
        ID         int      `json:"id"`
        Name       string   `json:"name"`
        Attributes struct {
            Category string   `json:"category"`
            Tags     []string `json:"tags"`
        } `json:"attributes"`
    } `json:"data"`
}

func handleComplexJSON(r *colly.Response) {
    var response ComplexResponse
    if err := json.Unmarshal(r.Body, &response); err != nil {
        log.Printf("Error parsing complex JSON: %v", err)
        return
    }

    fmt.Printf("Total records: %d, Current page: %d\n", 
        response.Meta.Total, response.Meta.Page)

    for _, item := range response.Data {
        fmt.Printf("Item: %s (Category: %s)\n", 
            item.Name, item.Attributes.Category)
        fmt.Printf("Tags: %v\n", item.Attributes.Tags)
    }
}
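
When one part of a payload varies from record to record, json.RawMessage from the standard library defers parsing of just that fragment. A minimal sketch (the type and payload field names are illustrative):

type Envelope struct {
    Type    string          `json:"type"`
    Payload json.RawMessage `json:"payload"` // decoded later, per type
}

func decodePayload(e Envelope) {
    switch e.Type {
    case "user":
        var u struct {
            Name string `json:"name"`
        }
        if err := json.Unmarshal(e.Payload, &u); err != nil {
            log.Printf("Error parsing user payload: %v", err)
            return
        }
        fmt.Println("User:", u.Name)
    default:
        log.Printf("Unknown payload type %q", e.Type)
    }
}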

Comparison with Alternative Approaches

While Colly excels at API scraping, you might also consider these alternatives:

JavaScript with Axios

const axios = require('axios');

async function scrapeAPI() {
    try {
        const response = await axios.get('https://api.example.com/data', {
            headers: {
                'Authorization': 'Bearer your-token'
            }
        });

        console.log(response.data);
    } catch (error) {
        console.error('API request failed:', error);
    }
}

Python with Requests

import requests
import json

def scrape_api():
    headers = {'Authorization': 'Bearer your-token'}
    response = requests.get('https://api.example.com/data', headers=headers)

    if response.status_code == 200:
        data = response.json()
        print(json.dumps(data, indent=2))
    else:
        print(f"Request failed: {response.status_code}")

For more complex scenarios involving JavaScript-rendered content, you might need browser automation tools for handling AJAX requests.

Best Practices for API Scraping with Colly

  1. Respect Rate Limits: Always implement appropriate delays between requests
  2. Handle Authentication Properly: Store API keys securely and refresh tokens as needed
  3. Implement Retry Logic: Handle temporary failures gracefully
  4. Parse JSON Safely: Always check for parsing errors (see the sketch after this list)
  5. Log Requests: Use Colly's debugging features for troubleshooting
  6. Monitor API Changes: APIs can change their response format
  7. Use Structured Data Types: Define Go structs that match API responses
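
For points 4 and 6, a strict decoder both reports parse errors and fails loudly when the response schema drifts; a minimal sketch:

import "bytes"

func parseStrict(r *colly.Response, dst interface{}) error {
    dec := json.NewDecoder(bytes.NewReader(r.Body))
    // Error out when the JSON contains fields dst doesn't declare,
    // exposing response-format changes early
    dec.DisallowUnknownFields()
    return dec.Decode(dst)
}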

Complete Example: GitHub API Scraper

Here's a comprehensive example that demonstrates scraping GitHub's API:

package main

import (
    "encoding/json"
    "fmt"
    "log"
    "time"

    "github.com/gocolly/colly/v2"
    "github.com/gocolly/colly/v2/debug"
)

type GitHubResponse struct {
    Items []Repository `json:"items"`
}

type Repository struct {
    Name        string `json:"name"`
    FullName    string `json:"full_name"`
    Description string `json:"description"`
    Stars       int    `json:"stargazers_count"`
    Language    string `json:"language"`
    HTMLURL     string `json:"html_url"`
}

func main() {
    c := colly.NewCollector(
        colly.Debugger(&debug.LogDebugger{}),
    )

    // Rate limiting
    if err := c.Limit(&colly.LimitRule{
        DomainGlob:  "*github.com*",
        Parallelism: 1,
        Delay:       2 * time.Second,
    }); err != nil {
        log.Fatal(err)
    }

    // Set headers
    c.OnRequest(func(r *colly.Request) {
        r.Headers.Set("Accept", "application/vnd.github.v3+json")
        r.Headers.Set("User-Agent", "Colly API Scraper")
    })

    // Handle successful responses
    c.OnResponse(func(r *colly.Response) {
        var response GitHubResponse
        if err := json.Unmarshal(r.Body, &response); err != nil {
            log.Printf("JSON parsing error: %v", err)
            return
        }

        for _, repo := range response.Items {
            fmt.Printf("Repository: %s\n", repo.FullName)
            fmt.Printf("Stars: %d\n", repo.Stars)
            fmt.Printf("Language: %s\n", repo.Language)
            fmt.Printf("Description: %s\n", repo.Description)
            fmt.Printf("URL: %s\n\n", repo.HTMLURL)
        }
    })

    // Error handling
    c.OnError(func(r *colly.Response, err error) {
        log.Printf("Error scraping %s: %s", r.Request.URL, err)
    })

    // Scrape popular Go repositories
    if err := c.Visit("https://api.github.com/search/repositories?q=language:go&sort=stars&order=desc&per_page=10"); err != nil {
        log.Fatal(err)
    }
}

Monitoring and Debugging API Requests

Use Colly's built-in debugging capabilities to monitor your API requests:

func setupDebugging() *colly.Collector {
    c := colly.NewCollector(
        colly.Debugger(&debug.LogDebugger{}),
    )

    // Log all requests
    c.OnRequest(func(r *colly.Request) {
        log.Printf("Visiting: %s", r.URL.String())
    })

    // Log response status
    c.OnResponse(func(r *colly.Response) {
        log.Printf("Response status: %d for %s", 
            r.StatusCode, r.Request.URL)
    })

    return c
}

Conclusion

Colly is exceptionally well-suited for API scraping tasks involving JSON responses. Its built-in HTTP client capabilities, combined with powerful callback systems and concurrent processing, make it an excellent choice for consuming RESTful APIs. Whether you're building data pipelines, monitoring services, or conducting research, Colly provides the tools needed for robust and efficient API scraping operations.

The framework's Go-native approach typically delivers better throughput than interpreted alternatives such as Python, while its request and response hooks support sophisticated processing workflows. With proper rate limiting, error handling, and authentication management, Colly can handle even demanding API scraping requirements.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
