How to select elements that contain specific text using XPath?

XPath (XML Path Language) is a powerful query language for navigating and selecting elements in XML and HTML documents. One of its most useful features for web scraping is the ability to select elements based on their text content using the contains() function.

Basic Syntax for Text Selection

The fundamental XPath expression for selecting elements containing specific text is:

//*[contains(text(), 'Your Specific Text')]

Syntax Breakdown

  • // - Selects nodes anywhere in the document (descendant-or-self axis)
  • * - Matches any element node (wildcard)
  • contains(haystack, needle) - XPath function that returns true if the first string contains the second
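  • text() - Selects the text nodes that are direct children of the current node (text inside child elements is excluded). In XPath 1.0, a string function such as contains() tests only the first of those text nodes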

Common XPath Text Selection Patterns

1. Basic Text Matching

# Select any element containing "Click here"
//*[contains(text(), 'Click here')]

# Select specific elements (div) containing text
//div[contains(text(), 'Welcome')]

# Case-sensitive exact match
//*[text() = 'Login']

2. Advanced Text Matching

# Normalize whitespace and match text
//*[contains(normalize-space(text()), 'Product Name')]

# Match text anywhere in element (including child elements)
//*[contains(., 'Search results')]

# Combine with attribute selection
//button[contains(text(), 'Submit') and @type='submit']

Python Examples with lxml

Here's a comprehensive example using Python's lxml library:

from lxml import html

# Sample HTML content with various scenarios
html_content = """
<html>
<body>
  <div class="header">
    <h1>Welcome to Our Website</h1>
    <nav>
      <a href="/home">Home</a>
      <a href="/products">Products</a>
      <a href="/contact">Contact Us</a>
    </nav>
  </div>

  <main>
    <p>First paragraph with some text.</p>
    <p>Second paragraph with <span>specific text</span> inside.</p>
    <div>
      <button type="submit">Click here to submit</button>
      <button type="button">Cancel</button>
    </div>
    <ul>
      <li data-product="laptop">Gaming Laptop - $999</li>
      <li data-product="mouse">Wireless Mouse - $29</li>
    </ul>
  </main>
</body>
</html>
"""

# Parse the HTML content
tree = html.fromstring(html_content)

# Example 1: Find elements with exact text match
exact_match = tree.xpath("//*[text() = 'Cancel']")
print("Exact match:", [el.text for el in exact_match])

# Example 2: Find elements containing specific text
contains_match = tree.xpath("//*[contains(text(), 'specific text')]")
print("Contains match:", [el.text for el in contains_match])

# Example 3: Find elements with text anywhere (including children)
# Note: every ancestor of the matching <li> (html, body, main, ul) matches as well
descendant_text = tree.xpath("//*[contains(., 'Gaming Laptop')]")
print("Descendant text:", [el.tag for el in descendant_text])

# Example 4: Case-insensitive matching using translate()
case_insensitive = tree.xpath("//*[contains(translate(text(), 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'welcome')]")
print("Case insensitive:", [el.text for el in case_insensitive])

# Example 5: Combine text and attribute conditions
combined = tree.xpath("//button[contains(text(), 'Click') and @type='submit']")
print("Combined conditions:", [el.text for el in combined])

# Example 6: Get parent element of text-containing element
# Use "." here: 'specific text' sits inside a <span>, so text() on the <p> would not see it
parent_elements = tree.xpath("//p[contains(., 'specific text')]/parent::*")
print("Parent elements:", [el.tag for el in parent_elements])

JavaScript Examples with document.evaluate

In browser environments, you can use document.evaluate() to execute XPath expressions:

// Helper function to execute XPath and return results as array
function getElementsByXPath(xpath, contextNode = document) {
    let results = [];
    let query = document.evaluate(xpath, contextNode, null, XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);

    for (let i = 0; i < query.snapshotLength; i++) {
        results.push(query.snapshotItem(i));
    }

    return results;
}

// Example 1: Find buttons containing "Submit"
const submitButtons = getElementsByXPath("//button[contains(text(), 'Submit')]");
console.log("Submit buttons:", submitButtons.map(btn => btn.textContent));

// Example 2: Find links with specific text
const contactLinks = getElementsByXPath("//a[contains(text(), 'Contact')]");
contactLinks.forEach(link => {
    console.log(`Link: ${link.textContent} -> ${link.href}`);
});

// Example 3: Case-insensitive search using translate()
const caseInsensitive = getElementsByXPath(
    "//*[contains(translate(text(), 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'login')]"
);

// Example 4: Find elements by partial text match with additional conditions
const specificDivs = getElementsByXPath("//div[contains(text(), 'Error') and contains(@class, 'message')]");

// Example 5: Get elements containing text in any descendant
const anyDescendant = getElementsByXPath("//*[contains(., 'Total: $')]");

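// Example 6: Poll until matching elements appear, using async/await
// Note: searchText is interpolated into a single-quoted XPath literal, so it must not contain a single quote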
async function findElementsByText(searchText, timeout = 5000) {
    const startTime = Date.now();

    while (Date.now() - startTime < timeout) {
        const elements = getElementsByXPath(`//*[contains(text(), '${searchText}')]`);
        if (elements.length > 0) {
            return elements;
        }
        await new Promise(resolve => setTimeout(resolve, 100));
    }

    throw new Error(`Elements containing '${searchText}' not found within ${timeout}ms`);
}

// Usage with error handling (top-level await requires an ES module or an enclosing async function)
try {
    const elements = await findElementsByText('Loading...');
    console.log('Found loading elements:', elements);
} catch (error) {
    console.error('Search failed:', error.message);
}

Working with Attributes and Text

Selecting Elements by Attribute Content

You can also search for text within element attributes using the contains() function:

# Find elements where data-title attribute contains specific text
//*[contains(@data-title, 'specific text')]

# Find elements where class attribute contains specific text
//*[contains(@class, 'error-message')]

# Find elements where any attribute contains specific text
//*[@*[contains(., 'search-term')]]
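
One caveat with contains(@class, ...) is that it is a plain substring test, so 'error' also matches a class such as 'error-message-hidden'. A minimal sketch of the common token-matching idiom follows, using lxml; the sample markup and variable names are made up for illustration.

```python
from lxml import html

# Hypothetical markup: three paragraphs whose class attributes all contain "error"
doc = html.fromstring("""
<div>
  <p class="error">Name is required</p>
  <p class="error-message-hidden">Ignored</p>
  <p class="note  error  big">Email is required</p>
</div>
""")

# Substring match: also catches "error-message-hidden"
loose = doc.xpath("//p[contains(@class, 'error')]")

# Token match: pad the normalized class list with spaces, then look for " error "
strict = doc.xpath(
    "//p[contains(concat(' ', normalize-space(@class), ' '), ' error ')]"
)

print(len(loose))
print([p.text for p in strict])
```

The padded form only matches when 'error' appears as a whole class token, which is usually what a scraper wants.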

Python Example with Attributes

from lxml import html

html_content = """
<div>
  <button data-action="submit" data-label="Submit Form">Click me</button>
  <img src="image.jpg" alt="Product image - Gaming Laptop" title="High-end gaming laptop">
  <a href="/products" aria-label="Browse all products">Products</a>
</div>
"""

tree = html.fromstring(html_content)

# Find elements by attribute content
action_buttons = tree.xpath("//*[contains(@data-action, 'submit')]")
gaming_images = tree.xpath("//img[contains(@alt, 'Gaming')]")
product_links = tree.xpath("//a[contains(@aria-label, 'products')]")

print("Action buttons:", [btn.get('data-label') for btn in action_buttons])
print("Gaming images:", [img.get('alt') for img in gaming_images])
print("Product links:", [link.get('href') for link in product_links])

Advanced Text Matching Techniques

1. Handling Whitespace and Normalization

# Normalize whitespace before matching
//*[contains(normalize-space(text()), 'search term')]

# Match text with flexible whitespace
//*[contains(translate(normalize-space(text()), ' ', ''), 'searchterm')]
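
To see what these two expressions buy you, here is a small lxml sketch run against a button whose label is split across lines. The markup is an invented example, not taken from a real page.

```python
from lxml import html

# Hypothetical button whose label contains line breaks and repeated spaces
doc = html.fromstring("""
<div>
  <button>
      Add   to
      cart
  </button>
</div>
""")

# Raw text still contains the newlines and extra spaces, so this finds nothing
raw = doc.xpath("//button[contains(text(), 'Add to cart')]")

# normalize-space() trims the ends and collapses inner runs of whitespace
normalized = doc.xpath("//button[contains(normalize-space(text()), 'Add to cart')]")

# translate(..., ' ', '') then removes the remaining single spaces
squeezed = doc.xpath(
    "//button[contains(translate(normalize-space(text()), ' ', ''), 'Addtocart')]"
)

print(len(raw), len(normalized), len(squeezed))
```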

2. Multiple Text Conditions

# Element must contain both text strings
//*[contains(text(), 'Price') and contains(text(), '$')]

# Element contains any of the text strings (OR condition)
//*[contains(text(), 'Buy Now') or contains(text(), 'Purchase')]

# Element contains text but not another text
//*[contains(text(), 'Product') and not(contains(text(), 'Sold Out'))]
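
The three boolean patterns above can be tried out with lxml as follows. This is only a sketch; the list items are invented sample data.

```python
from lxml import html

# Hypothetical product list used to exercise and / or / not()
doc = html.fromstring("""
<ul>
  <li>Product A - Price: $10</li>
  <li>Product B - Sold Out</li>
  <li>Buy Now</li>
  <li>Purchase today</li>
</ul>
""")

# Both substrings must be present
both = doc.xpath("//li[contains(text(), 'Price') and contains(text(), '$')]")

# Either substring is enough
either = doc.xpath("//li[contains(text(), 'Buy Now') or contains(text(), 'Purchase')]")

# Contains one substring but not the other
available = doc.xpath(
    "//li[contains(text(), 'Product') and not(contains(text(), 'Sold Out'))]"
)

print([li.text for li in both])
print([li.text for li in either])
print([li.text for li in available])
```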

3. Position-Based Selection

# First element containing specific text
(//*[contains(text(), 'Click here')])[1]

# Last element containing specific text
(//*[contains(text(), 'More info')])[last()]

# Second paragraph containing specific text
(//p[contains(text(), 'Description')])[2]
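
The parentheses in these expressions matter. Without them, [1] is applied per parent element rather than to the whole result set. The following lxml sketch, using invented markup, shows the difference.

```python
from lxml import html

# Hypothetical layout: matching paragraphs spread over two parent <div>s
doc = html.fromstring("""
<div id="page">
  <div><p>Description one</p><p>Description two</p></div>
  <div><p>Description three</p></div>
</div>
""")

# Parenthesized: index into the complete, document-ordered result set
first_overall = doc.xpath("(//p[contains(text(), 'Description')])[1]")
second_overall = doc.xpath("(//p[contains(text(), 'Description')])[2]")
last_overall = doc.xpath("(//p[contains(text(), 'Description')])[last()]")

# Unparenthesized: the first matching <p> under EACH parent
first_per_parent = doc.xpath("//p[contains(text(), 'Description')][1]")

print([p.text for p in first_overall])
print([p.text for p in first_per_parent])
```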

Common Pitfalls and Solutions

Issue 1: Text in Child Elements

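text() selects only an element's own text nodes, not text nested in child elements. In XPath 1.0, contains(text(), ...) also tests just the first of those text nodes, so a phrase that spans a child element never matches: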

<div>Welcome <span>to our</span> website</div>

# This WON'T match the div above
//div[contains(text(), 'Welcome to our website')]

# This WILL match (searches all descendant text)
//div[contains(., 'Welcome to our website')]
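
A short lxml sketch makes the behavior concrete for the same div. lxml evaluates XPath 1.0, so the first-text-node rule applies; the variable names are illustrative only.

```python
from lxml import html

doc = html.fromstring("<div>Welcome <span>to our</span> website</div>")

# The div has two text nodes of its own: "Welcome " and " website".
# contains(text(), ...) converts the node-set to a string using only the first one.
first_node = doc.xpath("//div[contains(text(), 'Welcome')]")   # matches
later_node = doc.xpath("//div[contains(text(), 'website')]")   # does not match

# Moving contains() into a predicate on text() tests every direct text node
any_node = doc.xpath("//div[text()[contains(., 'website')]]")  # matches

# "." is the full string value of the element, descendants included
full_value = doc.xpath("//div[contains(., 'Welcome to our website')]")  # matches

print(len(first_node), len(later_node), len(any_node), len(full_value))
```

The text()[contains(., ...)] form is a useful middle ground: it checks all of an element's own text nodes without also matching every ancestor the way contains(., ...) on //* does.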

Issue 2: Case Sensitivity

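XPath text matching is case-sensitive by default. XPath 1.0 has no lower-case() function, so use translate() to map uppercase letters to lowercase before comparing. Only the characters you list are converted, so the expression below covers ASCII letters only: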

# Case-insensitive search
//*[contains(translate(text(), 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'login')]

# More readable approach using variables (in supporting languages)
# translate(text(), $uppercase, $lowercase)
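
lxml is one library that supports XPath variables: keyword arguments passed to xpath() become $name variables in the expression. A minimal sketch follows; the helper name find_by_text_ci and the sample markup are assumptions made for this example.

```python
from lxml import html

UPPER = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
LOWER = "abcdefghijklmnopqrstuvwxyz"

# Hypothetical markup with mixed-case link text
doc = html.fromstring("<div><a>LOGIN</a><a>Log out</a><a>Login now</a></div>")

def find_by_text_ci(root, needle):
    """Case-insensitive (ASCII only) substring match on an element's first text node."""
    return root.xpath(
        "//*[contains(translate(text(), $upper, $lower), $needle)]",
        upper=UPPER,
        lower=LOWER,
        needle=needle.lower(),  # lowercase the search term to match the translated text
    )

matches = find_by_text_ci(doc, "Login")
print([el.text for el in matches])
```

Passing the search term as a variable also avoids quoting problems when the term itself contains quote characters.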

Issue 3: Leading/Trailing Whitespace

Use normalize-space() to handle whitespace issues:

# Handles extra whitespace
//*[normalize-space(text()) = 'Submit']

# Partial match with whitespace normalization
//*[contains(normalize-space(text()), 'Click here')]

Selenium WebDriver Examples

XPath text selection is commonly used with Selenium for web automation:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://example.com")

# Wait for and find element containing specific text
submit_button = WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable((By.XPATH, "//button[contains(text(), 'Submit')]"))
)

# Find multiple elements with text
error_messages = driver.find_elements(By.XPATH, "//*[contains(@class, 'error') and contains(text(), 'required')]")

# Find element by partial text match
product_link = driver.find_element(By.XPATH, "//a[contains(text(), 'Gaming Laptop')]")

driver.quit()
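
When the search text comes from a variable, quotes inside it can break the expression, because XPath 1.0 string literals have no escape syntax. One common workaround is to build the literal with concat(). The helper below is a sketch of that approach; xpath_literal is a hypothetical name, not part of Selenium or lxml.

```python
def xpath_literal(value: str) -> str:
    """Return value as an XPath 1.0 string literal, whatever quotes it contains."""
    if "'" not in value:
        return f"'{value}'"
    if '"' not in value:
        return f'"{value}"'
    # Both quote types present: split on single quotes and glue the
    # pieces back together with concat(), quoting each "'" separately
    parts = value.split("'")
    return "concat(" + ", \"'\", ".join(f"'{part}'" for part in parts) + ")"


print(xpath_literal("Submit"))
print(xpath_literal("Don't show again"))
print(xpath_literal("Say \"it's\" fine"))

# Possible use with Selenium (driver and By as in the example above):
# driver.find_element(By.XPATH, f"//a[contains(text(), {xpath_literal(name)})]")
```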

Best Practices

  1. Use specific selectors: Instead of //*, use specific tags like //button or //div for better performance
  2. Combine conditions: Use multiple conditions to create more precise selectors
  3. Handle edge cases: Always consider whitespace, case sensitivity, and nested text
  4. Test thoroughly: XPath expressions can be fragile; test with various content scenarios
  5. Use normalize-space(): Apply it when content may have inconsistent whitespace, such as user-generated text

These techniques provide a solid foundation for selecting elements based on text content using XPath in various web scraping and automation scenarios.
