How do I use CSS selectors to parse HTML in Go?

CSS selectors are a powerful way to target specific HTML elements when parsing web pages in Go. Unlike XPath expressions, CSS selectors provide a familiar syntax for developers who work with frontend technologies, making them an excellent choice for web scraping and HTML manipulation tasks.

Popular Go Libraries for CSS Selector Parsing

1. GoQuery - jQuery-like Syntax

GoQuery is the most popular Go library for HTML parsing with CSS selectors, providing a jQuery-like API that makes DOM manipulation intuitive for web developers.

go get github.com/PuerkitoBio/goquery

2. Cascadia - CSS Selector Engine

Cascadia is the CSS selector engine underlying GoQuery; it can also be used on its own when you need finer-grained control.

go get github.com/andybalholm/cascadia

Basic CSS Selector Parsing with GoQuery

Here's how to get started with CSS selectors in Go using GoQuery:

package main

import (
    "fmt"
    "log"
    "strings"

    "github.com/PuerkitoBio/goquery"
)

func main() {
    html := `
    <html>
        <body>
            <div class="container">
                <h1 id="title">Main Title</h1>
                <p class="content">First paragraph</p>
                <p class="content highlight">Second paragraph</p>
                <ul>
                    <li data-id="1">Item 1</li>
                    <li data-id="2">Item 2</li>
                </ul>
            </div>
        </body>
    </html>`

    doc, err := goquery.NewDocumentFromReader(strings.NewReader(html))
    if err != nil {
        log.Fatal(err)
    }

    // Basic element selection
    doc.Find("h1").Each(func(i int, s *goquery.Selection) {
        fmt.Printf("H1 text: %s\n", s.Text())
    })

    // Class selector
    doc.Find(".content").Each(func(i int, s *goquery.Selection) {
        fmt.Printf("Content: %s\n", s.Text())
    })

    // ID selector
    title := doc.Find("#title").Text()
    fmt.Printf("Title: %s\n", title)

    // Attribute selector
    doc.Find("li[data-id]").Each(func(i int, s *goquery.Selection) {
        dataId, _ := s.Attr("data-id")
        fmt.Printf("Item %s: %s\n", dataId, s.Text())
    })
}
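
If you need the matched text as data rather than printed output, goquery's Map method collects one string per matched element. A minimal sketch using the same document as above:

// Collect the text of every .content paragraph into a slice
contents := doc.Find(".content").Map(func(i int, s *goquery.Selection) string {
    return strings.TrimSpace(s.Text())
})
fmt.Println(contents) // [First paragraph Second paragraph]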

Advanced CSS Selector Examples

Combining Selectors

func advancedSelectors(doc *goquery.Document) {
    // Multiple classes
    doc.Find(".content.highlight").Each(func(i int, s *goquery.Selection) {
        fmt.Printf("Highlighted content: %s\n", s.Text())
    })

    // Descendant selectors
    doc.Find("div.container p").Each(func(i int, s *goquery.Selection) {
        fmt.Printf("Paragraph in container: %s\n", s.Text())
    })

    // Child selectors
    doc.Find("ul > li").Each(func(i int, s *goquery.Selection) {
        fmt.Printf("Direct list item: %s\n", s.Text())
    })

    // Adjacent sibling selector
    doc.Find("h1 + p").Each(func(i int, s *goquery.Selection) {
        fmt.Printf("Paragraph after h1: %s\n", s.Text())
    })

    // Pseudo-selectors
    firstItem := doc.Find("li:first-child").Text()
    lastItem := doc.Find("li:last-child").Text()
    fmt.Printf("First item: %s, Last item: %s\n", firstItem, lastItem)
}
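
Selectors can also be grouped with commas to match several patterns in a single query, and positional pseudo-classes such as :nth-child are supported by the underlying Cascadia engine. A brief sketch:

func groupedSelectors(doc *goquery.Document) {
    // Grouping: match h1 elements and highlighted paragraphs in one pass
    doc.Find("h1, p.highlight").Each(func(i int, s *goquery.Selection) {
        fmt.Printf("Matched: %s\n", s.Text())
    })

    // Positional pseudo-class: the second list item
    second := doc.Find("li:nth-child(2)").Text()
    fmt.Printf("Second item: %s\n", second)
}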

Attribute Selectors

func attributeSelectors(doc *goquery.Document) {
    // Exact attribute value
    doc.Find("[data-id='1']").Each(func(i int, s *goquery.Selection) {
        fmt.Printf("Item with data-id=1: %s\n", s.Text())
    })

    // Attribute contains value
    doc.Find("[class*='content']").Each(func(i int, s *goquery.Selection) {
        fmt.Printf("Element with 'content' in class: %s\n", s.Text())
    })

    // Attribute starts with value
    doc.Find("[class^='cont']").Each(func(i int, s *goquery.Selection) {
        fmt.Printf("Element with class starting with 'cont': %s\n", s.Text())
    })

    // Attribute ends with value
    doc.Find("[class$='nt']").Each(func(i int, s *goquery.Selection) {
        fmt.Printf("Element with class ending with 'nt': %s\n", s.Text())
    })
}
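
Cascadia also supports the whitespace-separated word match [attr~='value'], which is often safer than *= for class-like attributes because it matches whole words only:

func wordMatchSelector(doc *goquery.Document) {
    // [class~='highlight'] matches elements whose class list contains the
    // whole word "highlight"; [class*='highlight'] would also match
    // substrings such as "highlighted"
    doc.Find("[class~='highlight']").Each(func(i int, s *goquery.Selection) {
        fmt.Printf("Word-matched element: %s\n", s.Text())
    })
}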

Real-World Web Scraping Example

Here's a practical example of scraping a web page using CSS selectors:

package main

import (
    "fmt"
    "log"
    "net/http"

    "github.com/PuerkitoBio/goquery"
)

type Article struct {
    Title   string
    URL     string
    Summary string
    Author  string
}

func scrapeArticles(url string) ([]Article, error) {
    resp, err := http.Get(url)
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()

    if resp.StatusCode != http.StatusOK {
        return nil, fmt.Errorf("unexpected status: %s", resp.Status)
    }

    doc, err := goquery.NewDocumentFromReader(resp.Body)
    if err != nil {
        return nil, err
    }

    var articles []Article

    // Select article containers
    doc.Find("article.post").Each(func(i int, s *goquery.Selection) {
        article := Article{}

        // Extract title
        article.Title = s.Find("h2.post-title a").Text()

        // Extract URL
        url, exists := s.Find("h2.post-title a").Attr("href")
        if exists {
            article.URL = url
        }

        // Extract summary
        article.Summary = s.Find(".post-excerpt").Text()

        // Extract author
        article.Author = s.Find(".post-meta .author").Text()

        articles = append(articles, article)
    })

    return articles, nil
}

func main() {
    articles, err := scrapeArticles("https://example-blog.com")
    if err != nil {
        log.Fatal(err)
    }

    for _, article := range articles {
        fmt.Printf("Title: %s\n", article.Title)
        fmt.Printf("Author: %s\n", article.Author)
        fmt.Printf("URL: %s\n", article.URL)
        fmt.Printf("Summary: %s\n\n", article.Summary)
    }
}

Using Cascadia Directly

For more control over CSS selector parsing, you can use Cascadia directly:

package main

import (
    "fmt"
    "log"
    "strings"

    "github.com/andybalholm/cascadia"
    "golang.org/x/net/html"
)

func main() {
    htmlContent := `<div class="content"><p>Hello World</p></div>`

    doc, err := html.Parse(strings.NewReader(htmlContent))
    if err != nil {
        log.Fatal(err)
    }

    // Compile CSS selector
    selector, err := cascadia.Parse(".content p")
    if err != nil {
        log.Fatal(err)
    }

    // Find matching nodes
    nodes := cascadia.QueryAll(doc, selector)

    for _, node := range nodes {
        if node.Type == html.ElementNode {
            fmt.Printf("Found element: %s\n", node.Data)
            // Extract text content
            if node.FirstChild != nil && node.FirstChild.Type == html.TextNode {
                fmt.Printf("Text: %s\n", node.FirstChild.Data)
            }
        }
    }
}
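
If you apply the same selector repeatedly, for example across many documents, compiling it once avoids re-parsing the selector string on every use. Cascadia provides MustCompile for selectors known at compile time; a short sketch:

// Compiled once at package level; MustCompile panics on an invalid
// selector, which is acceptable for selectors fixed at compile time
var paragraphSel = cascadia.MustCompile(".content p")

func findParagraphs(doc *html.Node) []*html.Node {
    // Reuse the precompiled selector across documents
    return paragraphSel.MatchAll(doc)
}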

Performance Considerations

Optimizing Selector Performance

func optimizedSelectors(doc *goquery.Document) {
    // More specific selectors are generally faster than broad ones
    specificResults := doc.Find("div.container > p.content")
    fmt.Printf("Specific matches: %d\n", specificResults.Length())

    // Avoid overly broad selectors such as doc.Find("*")

    // Cache frequently used selections
    container := doc.Find(".container")
    paragraphs := container.Find("p")
    links := container.Find("a")

    fmt.Printf("Found %d paragraphs and %d links\n", 
               paragraphs.Length(), links.Length())
}

Handling Large Documents

func handleLargeDocuments(doc *goquery.Document) {
    // Limit selections to specific sections instead of scanning the whole tree
    mainContent := doc.Find("#main-content")

    // EachWithBreak stops iterating when the callback returns false;
    // a plain return inside Each only skips to the next element
    mainContent.Find("article").EachWithBreak(func(i int, s *goquery.Selection) bool {
        title := s.Find("h1").First().Text()
        fmt.Printf("Processing article: %s\n", title)

        // Process only the first 10 articles
        return i < 9
    })
}

Error Handling and Best Practices

func robustParsing(doc *goquery.Document) {
    // Always check if elements exist
    titleSelection := doc.Find("h1.title")
    if titleSelection.Length() > 0 {
        title := titleSelection.Text()
        fmt.Printf("Title: %s\n", title)
    } else {
        fmt.Println("Title not found")
    }

    // Handle missing attributes gracefully
    doc.Find("a").Each(func(i int, s *goquery.Selection) {
        href, exists := s.Attr("href")
        if exists {
            fmt.Printf("Link: %s\n", href)
        }

        // Provide fallback text
        text := s.Text()
        if text == "" {
            text = "No text available"
        }
        fmt.Printf("Link text: %s\n", text)
    })
}
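
goquery also provides AttrOr, which returns a fallback value when an attribute is missing and often reads more cleanly than checking the exists flag:

func attrWithFallback(doc *goquery.Document) {
    doc.Find("a").Each(func(i int, s *goquery.Selection) {
        // AttrOr returns the second argument when the attribute is absent
        href := s.AttrOr("href", "#")
        fmt.Printf("Link: %s\n", href)
    })
}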

Integration with HTTP Clients

For comprehensive web scraping workflows, CSS selectors work seamlessly with Go's HTTP capabilities. When dealing with complex websites that require JavaScript execution, you might want to consider how to handle AJAX requests using Puppeteer for scenarios where server-side rendering isn't sufficient.

func scrapeWithCustomClient() {
    client := &http.Client{
        Timeout: 30 * time.Second,
    }

    req, err := http.NewRequest("GET", "https://example.com", nil)
    if err != nil {
        log.Fatal(err)
    }

    // Set custom headers
    req.Header.Set("User-Agent", "Mozilla/5.0 (compatible; GoScraper/1.0)")

    resp, err := client.Do(req)
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    doc, err := goquery.NewDocumentFromReader(resp.Body)
    if err != nil {
        log.Fatal(err)
    }

    // Now use CSS selectors as usual
    doc.Find("article h2").Each(func(i int, s *goquery.Selection) {
        fmt.Printf("Article title: %s\n", s.Text())
    })
}

Testing CSS Selectors

func TestCSSSelectors(t *testing.T) {
    html := `<div class="test"><p id="para1">Test paragraph</p></div>`
    doc, _ := goquery.NewDocumentFromReader(strings.NewReader(html))

    // Test element exists
    selection := doc.Find("#para1")
    if selection.Length() != 1 {
        t.Errorf("Expected 1 element, got %d", selection.Length())
    }

    // Test text content
    text := selection.Text()
    expected := "Test paragraph"
    if text != expected {
        t.Errorf("Expected '%s', got '%s'", expected, text)
    }
}

CSS selectors in Go provide a powerful and intuitive way to parse HTML documents. Whether you're building web scrapers, content extractors, or data processing pipelines, mastering CSS selectors with libraries like GoQuery and Cascadia will significantly improve your development efficiency. For more complex scenarios involving dynamic content, consider exploring how to interact with DOM elements in Puppeteer as a complementary approach to server-side HTML parsing.

Remember to always respect robots.txt files and implement appropriate rate limiting when scraping websites to ensure responsible web scraping practices.
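
A simple way to rate-limit sequential requests is a time.Ticker that gates each fetch. A minimal sketch, assuming one request per second is acceptable for the target site:

func politeScrape(urls []string) {
    // One request per second; adjust the interval to the site's tolerance
    ticker := time.NewTicker(1 * time.Second)
    defer ticker.Stop()

    for _, u := range urls {
        <-ticker.C // wait for the next tick before each request
        resp, err := http.Get(u)
        if err != nil {
            log.Printf("fetching %s: %v", u, err)
            continue
        }
        resp.Body.Close()
        fmt.Printf("Fetched %s (%s)\n", u, resp.Status)
    }
}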

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
