How do I parse CSS selectors in Go HTML parsing?
CSS selectors are a powerful way to target specific HTML elements when parsing web pages in Go. Unlike XPath expressions, CSS selectors provide a familiar syntax for developers who work with frontend technologies, making them an excellent choice for web scraping and HTML manipulation tasks.
Popular Go Libraries for CSS Selector Parsing
1. GoQuery - jQuery-like Syntax
GoQuery is the most popular Go library for HTML parsing with CSS selectors, providing a jQuery-like API that makes DOM manipulation intuitive for web developers.
```bash
go get github.com/PuerkitoBio/goquery
```
2. Cascadia - CSS Selector Engine
Cascadia is the underlying CSS selector engine used by GoQuery, but it can also be used independently for more fine-grained control.
```bash
go get github.com/andybalholm/cascadia
```
Basic CSS Selector Parsing with GoQuery
Here's how to get started with CSS selectors in Go using GoQuery:
```go
package main

import (
	"fmt"
	"log"
	"strings"

	"github.com/PuerkitoBio/goquery"
)

func main() {
	html := `
<html>
<body>
	<div class="container">
		<h1 id="title">Main Title</h1>
		<p class="content">First paragraph</p>
		<p class="content highlight">Second paragraph</p>
		<ul>
			<li data-id="1">Item 1</li>
			<li data-id="2">Item 2</li>
		</ul>
	</div>
</body>
</html>`

	doc, err := goquery.NewDocumentFromReader(strings.NewReader(html))
	if err != nil {
		log.Fatal(err)
	}

	// Basic element selection
	doc.Find("h1").Each(func(i int, s *goquery.Selection) {
		fmt.Printf("H1 text: %s\n", s.Text())
	})

	// Class selector
	doc.Find(".content").Each(func(i int, s *goquery.Selection) {
		fmt.Printf("Content: %s\n", s.Text())
	})

	// ID selector
	title := doc.Find("#title").Text()
	fmt.Printf("Title: %s\n", title)

	// Attribute selector
	doc.Find("li[data-id]").Each(func(i int, s *goquery.Selection) {
		dataID, _ := s.Attr("data-id")
		fmt.Printf("Item %s: %s\n", dataID, s.Text())
	})
}
```
Advanced CSS Selector Examples
Combining Selectors
```go
func advancedSelectors(doc *goquery.Document) {
	// Multiple classes (element must carry both)
	doc.Find(".content.highlight").Each(func(i int, s *goquery.Selection) {
		fmt.Printf("Highlighted content: %s\n", s.Text())
	})

	// Descendant selector
	doc.Find("div.container p").Each(func(i int, s *goquery.Selection) {
		fmt.Printf("Paragraph in container: %s\n", s.Text())
	})

	// Child selector
	doc.Find("ul > li").Each(func(i int, s *goquery.Selection) {
		fmt.Printf("Direct list item: %s\n", s.Text())
	})

	// Adjacent sibling selector
	doc.Find("h1 + p").Each(func(i int, s *goquery.Selection) {
		fmt.Printf("Paragraph after h1: %s\n", s.Text())
	})

	// Pseudo-classes
	firstItem := doc.Find("li:first-child").Text()
	lastItem := doc.Find("li:last-child").Text()
	fmt.Printf("First item: %s, Last item: %s\n", firstItem, lastItem)
}
```
Attribute Selectors
```go
func attributeSelectors(doc *goquery.Document) {
	// Exact attribute value
	doc.Find("[data-id='1']").Each(func(i int, s *goquery.Selection) {
		fmt.Printf("Item with data-id=1: %s\n", s.Text())
	})

	// Attribute contains value
	doc.Find("[class*='content']").Each(func(i int, s *goquery.Selection) {
		fmt.Printf("Element with 'content' in class: %s\n", s.Text())
	})

	// Attribute starts with value
	doc.Find("[class^='cont']").Each(func(i int, s *goquery.Selection) {
		fmt.Printf("Element with class starting with 'cont': %s\n", s.Text())
	})

	// Attribute ends with value
	doc.Find("[class$='nt']").Each(func(i int, s *goquery.Selection) {
		fmt.Printf("Element with class ending with 'nt': %s\n", s.Text())
	})
}
```
Real-World Web Scraping Example
Here's a practical example of scraping a web page using CSS selectors:
```go
package main

import (
	"fmt"
	"log"
	"net/http"

	"github.com/PuerkitoBio/goquery"
)

type Article struct {
	Title   string
	URL     string
	Summary string
	Author  string
}

func scrapeArticles(url string) ([]Article, error) {
	resp, err := http.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	// Bail out on non-200 responses before trying to parse the body
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status: %s", resp.Status)
	}

	doc, err := goquery.NewDocumentFromReader(resp.Body)
	if err != nil {
		return nil, err
	}

	var articles []Article
	// Select article containers
	doc.Find("article.post").Each(func(i int, s *goquery.Selection) {
		article := Article{}

		// Extract title and URL from the heading link
		link := s.Find("h2.post-title a")
		article.Title = link.Text()
		if href, exists := link.Attr("href"); exists {
			article.URL = href
		}

		// Extract summary and author
		article.Summary = s.Find(".post-excerpt").Text()
		article.Author = s.Find(".post-meta .author").Text()

		articles = append(articles, article)
	})
	return articles, nil
}

func main() {
	articles, err := scrapeArticles("https://example-blog.com")
	if err != nil {
		log.Fatal(err)
	}
	for _, article := range articles {
		fmt.Printf("Title: %s\n", article.Title)
		fmt.Printf("Author: %s\n", article.Author)
		fmt.Printf("URL: %s\n", article.URL)
		fmt.Printf("Summary: %s\n\n", article.Summary)
	}
}
```
Using Cascadia Directly
For more control over CSS selector parsing, you can use Cascadia directly:
```go
package main

import (
	"fmt"
	"log"
	"strings"

	"github.com/andybalholm/cascadia"
	"golang.org/x/net/html"
)

func main() {
	htmlContent := `<div class="content"><p>Hello World</p></div>`
	doc, err := html.Parse(strings.NewReader(htmlContent))
	if err != nil {
		log.Fatal(err)
	}

	// Compile the CSS selector
	selector, err := cascadia.Parse(".content p")
	if err != nil {
		log.Fatal(err)
	}

	// Find matching nodes
	nodes := cascadia.QueryAll(doc, selector)
	for _, node := range nodes {
		if node.Type == html.ElementNode {
			fmt.Printf("Found element: %s\n", node.Data)
			// Extract the first text child, if any
			if node.FirstChild != nil && node.FirstChild.Type == html.TextNode {
				fmt.Printf("Text: %s\n", node.FirstChild.Data)
			}
		}
	}
}
```
Performance Considerations
Optimizing Selector Performance
```go
func optimizedSelectors(doc *goquery.Document) {
	// More specific selectors are generally faster than broad ones
	// (avoid doc.Find("*"), which walks every node in the document)
	specific := doc.Find("div.container > p.content")
	fmt.Printf("Specific matches: %d\n", specific.Length())

	// Cache frequently used selections instead of re-querying the document
	container := doc.Find(".container")
	paragraphs := container.Find("p")
	links := container.Find("a")
	fmt.Printf("Found %d paragraphs and %d links\n",
		paragraphs.Length(), links.Length())
}
```
Handling Large Documents
```go
func handleLargeDocuments(doc *goquery.Document) {
	// Limit selections to specific sections
	mainContent := doc.Find("#main-content")

	// EachWithBreak stops iterating when the callback returns false;
	// a plain return inside Each would only skip to the next element
	mainContent.Find("article").EachWithBreak(func(i int, s *goquery.Selection) bool {
		// Process each article individually
		title := s.Find("h1").First().Text()
		fmt.Printf("Processing article: %s\n", title)

		// Process only the first 10 articles
		return i < 9
	})
}
```
Error Handling and Best Practices
```go
func robustParsing(doc *goquery.Document) {
	// Always check whether elements exist before using them
	titleSelection := doc.Find("h1.title")
	if titleSelection.Length() > 0 {
		fmt.Printf("Title: %s\n", titleSelection.Text())
	} else {
		fmt.Println("Title not found")
	}

	// Handle missing attributes gracefully
	doc.Find("a").Each(func(i int, s *goquery.Selection) {
		if href, exists := s.Attr("href"); exists {
			fmt.Printf("Link: %s\n", href)
		}

		// Provide fallback text
		text := s.Text()
		if text == "" {
			text = "No text available"
		}
		fmt.Printf("Link text: %s\n", text)
	})
}
```
Integration with HTTP Clients
For comprehensive web scraping workflows, CSS selectors work seamlessly with Go's HTTP capabilities. For complex websites that require JavaScript execution, where parsing static HTML isn't sufficient, you might also look at how to handle AJAX requests using Puppeteer.
```go
// Requires the fmt, log, net/http, and time standard library packages
// plus github.com/PuerkitoBio/goquery.
func scrapeWithCustomClient() {
	client := &http.Client{
		Timeout: 30 * time.Second,
	}
	req, err := http.NewRequest("GET", "https://example.com", nil)
	if err != nil {
		log.Fatal(err)
	}

	// Set custom headers
	req.Header.Set("User-Agent", "Mozilla/5.0 (compatible; GoScraper/1.0)")

	resp, err := client.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	doc, err := goquery.NewDocumentFromReader(resp.Body)
	if err != nil {
		log.Fatal(err)
	}

	// Now use CSS selectors as usual
	doc.Find("article h2").Each(func(i int, s *goquery.Selection) {
		fmt.Printf("Article title: %s\n", s.Text())
	})
}
```
Testing CSS Selectors
```go
// Place this in a _test.go file; it needs the strings, testing,
// and goquery imports.
func TestCSSSelectors(t *testing.T) {
	html := `<div class="test"><p id="para1">Test paragraph</p></div>`
	doc, err := goquery.NewDocumentFromReader(strings.NewReader(html))
	if err != nil {
		t.Fatal(err)
	}

	// The element should exist exactly once
	selection := doc.Find("#para1")
	if selection.Length() != 1 {
		t.Fatalf("expected 1 element, got %d", selection.Length())
	}

	// And carry the expected text
	if got, want := selection.Text(), "Test paragraph"; got != want {
		t.Errorf("expected %q, got %q", want, got)
	}
}
```
CSS selectors in Go provide a powerful and intuitive way to parse HTML documents. Whether you're building web scrapers, content extractors, or data processing pipelines, mastering CSS selectors with libraries like GoQuery and Cascadia will significantly improve your development efficiency. For more complex scenarios involving dynamic content, consider exploring how to interact with DOM elements in Puppeteer as a complementary approach to server-side HTML parsing.
Remember to always respect robots.txt files and implement appropriate rate limiting when scraping websites to ensure responsible web scraping practices.