What is the Difference Between OnHTML and OnXML Callbacks in Colly?

Colly, the popular Go web scraping framework, provides two primary callback methods for parsing and extracting data from web pages: OnHTML and OnXML. Understanding the differences between these callbacks is crucial for choosing the right approach for your web scraping projects. This article explores their distinct characteristics, use cases, and implementation patterns.

Overview of Colly Callbacks

Both OnHTML and OnXML are event-driven callbacks that allow you to define how Colly should handle specific elements or data structures when crawling web pages. The key difference lies in their parsing mechanisms and intended use cases.
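To see the two side by side, here is a minimal sketch that registers both callbacks on a single collector. Colly decides which callbacks to run from the response's Content-Type header, and https://example.com is a placeholder URL:

package main

import (
    "fmt"

    "github.com/gocolly/colly/v2"
)

func main() {
    c := colly.NewCollector()

    // Runs for HTML responses; the first argument is a CSS selector.
    c.OnHTML("title", func(e *colly.HTMLElement) {
        fmt.Println("HTML <title>:", e.Text)
    })

    // Runs for XML (and, in current Colly versions, also HTML) responses;
    // the first argument is an XPath expression.
    c.OnXML("//title", func(e *colly.XMLElement) {
        fmt.Println("XPath //title:", e.Text)
    })

    c.Visit("https://example.com")
}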

OnHTML Callback: HTML-Specific Parsing

The OnHTML callback is specifically designed for parsing HTML documents using CSS selectors. It leverages the goquery library, which provides jQuery-like syntax for Go.

Key Characteristics of OnHTML

  • CSS Selector Support: Uses CSS selectors to target HTML elements
  • jQuery-like Syntax: Familiar syntax for developers with web development experience
  • HTML-Optimized: Designed specifically for HTML document parsing
  • Element Manipulation: Provides rich methods for element traversal and manipulation (see the sketch after this list)
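Because every HTMLElement exposes its underlying goquery selection through the DOM field, you can fall back on goquery's jQuery-style traversal when the built-in helpers are not enough. A minimal sketch, assuming a page where each span.price sits next to an h2.title inside a shared parent element:

c.OnHTML("span.price", func(e *colly.HTMLElement) {
    // e.DOM is a *goquery.Selection, so jQuery-style traversal applies.
    productName := e.DOM.Parent().Find("h2.title").Text()
    fmt.Printf("%s costs %s\n", productName, e.Text)
})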

OnHTML Syntax and Usage

package main

import (
    "fmt"
    "github.com/gocolly/colly/v2"
)

func main() {
    c := colly.NewCollector()

    // OnHTML callback with CSS selector
    c.OnHTML("div.product", func(e *colly.HTMLElement) {
        title := e.ChildText("h2.title")
        price := e.ChildText("span.price")

        fmt.Printf("Product: %s - Price: %s\n", title, price)
    })

    // Extract all links
    c.OnHTML("a[href]", func(e *colly.HTMLElement) {
        link := e.Attr("href")
        e.Request.Visit(link)
    })

    c.Visit("https://example-ecommerce.com")
}

Common OnHTML Use Cases

  1. E-commerce Product Scraping
c.OnHTML("div.product-card", func(e *colly.HTMLElement) {
    product := Product{
        Name:        e.ChildText(".product-name"),
        Price:       e.ChildText(".price"),
        Rating:      e.ChildText(".rating"),
        ImageURL:    e.ChildAttr("img", "src"),
        ProductURL:  e.ChildAttr("a", "href"),
    }
    products = append(products, product)
})
  2. Blog Post Extraction
// BlogPost and the posts slice are assumed to be defined elsewhere.
c.OnHTML("article", func(e *colly.HTMLElement) {
    post := BlogPost{
        Title:       e.ChildText("h1"),
        Content:     e.ChildText(".content"),
        Author:      e.ChildText(".author"),
        PublishDate: e.ChildText(".publish-date"),
    }
    posts = append(posts, post)
})

OnXML Callback: XML-Specific Parsing

The OnXML callback is designed for parsing XML documents and uses XPath expressions for element selection. Under the hood it relies on the antchfx xmlquery and htmlquery packages rather than Go's standard encoding/xml parser, which is also why its XPath queries can be evaluated against HTML responses.

Key Characteristics of OnXML

  • XPath Support: Uses XPath expressions for precise element targeting
  • XML-Optimized: Specifically designed for XML document parsing
  • Namespace Aware: Handles XML namespaces effectively
  • Works on HTML Too: In Colly, OnXML callbacks can also run against HTML responses, applying XPath to the parsed HTML tree

OnXML Syntax and Usage

package main

import (
    "fmt"
    "github.com/gocolly/colly/v2"
)

func main() {
    c := colly.NewCollector()

    // OnXML callback with XPath expression
    c.OnXML("//item", func(e *colly.XMLElement) {
        title := e.ChildText("title")
        description := e.ChildText("description")
        pubDate := e.ChildText("pubDate")

        fmt.Printf("RSS Item: %s - %s\n", title, pubDate)
    })

    // Extract specific XML attributes
    c.OnXML("//product[@category='electronics']", func(e *colly.XMLElement) {
        id := e.Attr("id")
        name := e.ChildText("name")
        price := e.ChildText("price")

        fmt.Printf("Product ID: %s, Name: %s, Price: %s\n", id, name, price)
    })

    c.Visit("https://example.com/rss.xml")
}

Common OnXML Use Cases

  1. RSS Feed Parsing
// RSSItem and the rssItems slice are assumed to be defined elsewhere.
c.OnXML("//rss/channel/item", func(e *colly.XMLElement) {
    item := RSSItem{
        Title:       e.ChildText("title"),
        Description: e.ChildText("description"),
        Link:        e.ChildText("link"),
        PubDate:     e.ChildText("pubDate"),
        GUID:        e.ChildText("guid"),
    }
    rssItems = append(rssItems, item)
})
  2. API Response Parsing
// APIProduct and the apiProducts slice are assumed to be defined elsewhere.
c.OnXML("//response/products/product", func(e *colly.XMLElement) {
    product := APIProduct{
        ID:          e.Attr("id"),
        Name:        e.ChildText("name"),
        Category:    e.ChildText("category"),
        Stock:       e.ChildText("stock"),
        LastUpdated: e.ChildText("last_updated"),
    }
    apiProducts = append(apiProducts, product)
})

Key Differences Between OnHTML and OnXML

1. Selector Syntax

| Aspect | OnHTML | OnXML |
|--------|--------|-------|
| Selector Type | CSS Selectors | XPath Expressions |
| Example | div.container > p | //div[@class='container']/p |
| Learning Curve | Lower (familiar to web developers) | Higher (requires XPath knowledge) |
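To make the contrast concrete, here is the same target expressed both ways, using the selectors from the table. Note one subtlety: the CSS version matches any element whose class list contains container, while the XPath version compares the class attribute as an exact string:

// CSS selector: direct <p> children of <div class="container">
c.OnHTML("div.container > p", func(e *colly.HTMLElement) {
    fmt.Println("via CSS:", e.Text)
})

// Equivalent XPath expression
c.OnXML("//div[@class='container']/p", func(e *colly.XMLElement) {
    fmt.Println("via XPath:", e.Text)
})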

2. Document Type Support

// OnHTML - Best for HTML documents
c.OnHTML("div.news-article", func(e *colly.HTMLElement) {
    // Process HTML content
})

// OnXML - Best for XML documents
c.OnXML("//article[@type='news']", func(e *colly.XMLElement) {
    // Process XML content
})

3. Performance Considerations

  • OnHTML: Generally faster for HTML parsing due to optimized HTML parser
  • OnXML: May have slight overhead for complex XPath expressions but offers more precise targeting
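As an example of that extra precision, XPath predicates and axes can express conditions that CSS selectors cannot; the element names below are assumptions for illustration:

// Select the first <span> that follows an <h2> whose text contains "Price".
// Text predicates and sibling axes have no CSS-selector equivalent.
c.OnXML("//h2[contains(text(), 'Price')]/following-sibling::span[1]", func(e *colly.XMLElement) {
    fmt.Println("price:", e.Text)
})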

4. Element Manipulation Capabilities

// OnHTML provides rich jQuery-like methods
c.OnHTML("ul.menu", func(e *colly.HTMLElement) {
    e.ForEach("li", func(i int, el *colly.HTMLElement) {
        // Process each menu item
        fmt.Printf("menu item %d: %s\n", i, el.Text)
    })
})

// OnXML focuses on standard XPath operations
c.OnXML("//menu/item", func(e *colly.XMLElement) {
    // Process the menu item
    fmt.Println("menu item:", e.Text)
})

Advanced Usage Patterns

Combining OnHTML and OnXML

In some scenarios you might need both callbacks in the same scraper. Because Colly chooses which callbacks to run based on the response's Content-Type header, the two can coexist on one collector without conflict:

func main() {
    c := colly.NewCollector()

    // Handle HTML pages
    c.OnHTML("a[href$='.xml']", func(e *colly.HTMLElement) {
        xmlURL := e.Request.AbsoluteURL(e.Attr("href"))
        c.Visit(xmlURL)
    })

    // Handle XML documents
    c.OnXML("//product", func(e *colly.XMLElement) {
        // Process XML product data
        product := extractProductFromXML(e)
        saveProduct(product)
    })

    c.Visit("https://example.com/products")
}

Error Handling and Validation

c.OnHTML("div.error", func(e *colly.HTMLElement) {
    errorMsg := e.Text
    log.Printf("HTML parsing error: %s", errorMsg)
})

c.OnXML("//error", func(e *colly.XMLElement) {
    errorCode := e.Attr("code")
    errorMsg := e.Text
    log.Printf("XML error [%s]: %s", errorCode, errorMsg)
})

Best Practices and Recommendations

When to Use OnHTML

  • Scraping traditional websites and web applications
  • Working with e-commerce sites, blogs, and news websites
  • When CSS selectors provide sufficient targeting capability
  • For developers familiar with jQuery or CSS selectors

When to Use OnXML

  • Parsing RSS/Atom feeds and sitemaps (see the sitemap sketch after this list)
  • Working with API responses in XML format
  • Handling structured data with complex hierarchies
  • When precise XPath targeting is required
  • Processing XML documents with namespaces
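For example, a sitemap crawler is a natural fit for OnXML. The sketch below assumes a standard sitemap.xml; it works without namespace prefixes because xmlquery matches elements by local name, though documents with unusual namespace setups may need extra handling:

c := colly.NewCollector()

// Each <loc> entry holds a page URL; queue it for scraping.
c.OnXML("//urlset/url/loc", func(e *colly.XMLElement) {
    fmt.Println("sitemap URL:", e.Text)
    e.Request.Visit(e.Text)
})

// example.com/sitemap.xml is a placeholder URL.
c.Visit("https://example.com/sitemap.xml")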

Performance Optimization Tips

  1. Use Specific Selectors: More specific selectors reduce processing overhead
  2. Limit Callback Scope: Only register callbacks for elements you actually need
  3. Batch Processing: Collect data in batches rather than processing items individually (see the sketch after the selector example below)
// Efficient: Specific selector
c.OnHTML("div.product-grid div.product", func(e *colly.HTMLElement) {
    // Process products
})

// Less efficient: Too broad
c.OnHTML("div", func(e *colly.HTMLElement) {
    if e.Attr("class") == "product" {
        // Process products
    }
})
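A minimal batch-processing sketch: items accumulate in a slice inside OnHTML and are flushed once per page in OnScraped, where saveBatch stands in for your storage call. With an asynchronous collector, the shared slice would also need mutex protection:

var batch []string

c.OnHTML("div.product", func(e *colly.HTMLElement) {
    batch = append(batch, e.ChildText(".product-name"))
})

// OnScraped runs after all OnHTML/OnXML callbacks for a page have finished.
c.OnScraped(func(r *colly.Response) {
    if len(batch) == 0 {
        return
    }
    saveBatch(batch) // placeholder: persist the whole batch in one operation
    batch = batch[:0]
})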

Error Handling Strategies

HTML Parsing Error Handling

c.OnHTML("div.product", func(e *colly.HTMLElement) {
    defer func() {
        if r := recover(); r != nil {
            log.Printf("Panic recovered in OnHTML: %v", r)
        }
    }()

    // Safe element access
    if title := e.ChildText(".title"); title != "" {
        // Process title
    }
})

XML Parsing Error Handling

c.OnXML("//product", func(e *colly.XMLElement) {
    // Validate required fields
    if id := e.Attr("id"); id == "" {
        log.Printf("Product missing required ID attribute")
        return
    }

    // Process the valid product (processProduct is a placeholder helper)
    processProduct(e)
})

Integration with Modern Web Applications

Handling Dynamic Content

While Colly's callbacks are excellent for static content, modern web applications often require additional tools for JavaScript-rendered content. For such scenarios, you might need to consider browser automation solutions that can handle dynamic content loading and complex user interactions.

Working with SPAs (Single Page Applications)

When dealing with single-page applications that heavily rely on JavaScript, traditional HTML parsing may not capture all the content. In these cases, you might want to explore solutions for crawling single page applications using browser automation tools.

Debugging and Troubleshooting

Common Issues with OnHTML

  1. Selector Not Matching: Verify CSS selector syntax
  2. Missing Content: Check if content is dynamically loaded
  3. Performance Issues: Optimize selector specificity
// Debug callback execution
c.OnHTML("*", func(e *colly.HTMLElement) {
    log.Printf("Processing element: %s", e.Name)
})

Common Issues with OnXML

  1. XPath Syntax Errors: Validate XPath expressions
  2. Namespace Issues: Handle XML namespaces properly
  3. Document Structure: Verify XML document structure
// Debug XPath matching
c.OnXML("//*", func(e *colly.XMLElement) {
    log.Printf("XML element: %s with text: %s", e.Name, e.Text)
})

Conclusion

The choice between OnHTML and OnXML callbacks in Colly depends primarily on your document type and parsing requirements. Use OnHTML for HTML documents when you need jQuery-like functionality and CSS selector convenience. Choose OnXML for XML documents when you require precise XPath targeting and standards-compliant XML parsing.

Understanding these differences enables you to build more efficient and maintainable web scrapers that handle diverse content types effectively. Whether you're scraping modern web applications or processing structured XML feeds, Colly's callback system provides the flexibility needed for comprehensive data extraction.

For complex scenarios involving JavaScript-heavy websites or real-time data, consider complementing Colly with browser automation tools that can handle dynamic content and complex interactions more effectively.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
