How can I use XPath in n8n for web scraping?
XPath (XML Path Language) is a powerful query language for selecting nodes from HTML and XML documents. In n8n, XPath provides precise control over data extraction, making it an essential tool for web scraping workflows. This guide covers everything you need to know about using XPath in n8n for efficient data extraction.
Understanding XPath in n8n
XPath allows you to navigate through elements and attributes in HTML documents using path expressions. Unlike CSS selectors, XPath can traverse both forward and backward in the document tree, making it more flexible for complex scraping scenarios.
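For example, XPath can step from a matched element back up to its parent or to an enclosing container, something plain CSS selectors cannot express. The class names below are placeholders:
//span[@class='price']/..                               # Direct parent of the price span
//span[@class='price']/ancestor::div[@class='card'][1]  # Nearest enclosing card container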
In n8n, you can use XPath in several nodes:
- HTML Extract node: Directly supports XPath selectors
- Code node: Use JavaScript libraries such as xpath, or cheerio with an XPath plugin
- HTTP Request node: Fetches the raw HTML for subsequent processing nodes
Basic XPath Syntax
Before diving into n8n implementations, here are essential XPath expressions:
// - Selects nodes anywhere in the document
/ - Selects from the root node
. - Selects the current node
.. - Selects the parent of the current node
@ - Selects attributes
Common XPath Examples
//div[@class='product'] # All divs with class 'product'
//h1/text() # Text content of h1 elements
//a/@href # href attributes from all links
//div[@id='content']//p # All p elements inside div with id 'content'
//span[contains(text(), 'Price')] # Spans containing 'Price' text
//table/tbody/tr[1]/td # First row cells in a table
Using XPath in n8n HTML Extract Node
The HTML Extract node is the most straightforward way to use XPath in n8n workflows.
Step 1: Set Up HTTP Request Node
First, fetch the HTML content:
{
"method": "GET",
"url": "https://example.com/products",
"options": {
"headers": {
"User-Agent": "n8n-workflow"
}
}
}
Step 2: Configure HTML Extract Node
Add an HTML Extract node and configure it:
- Extraction Type: Select "XPath"
- XPath Expression: Enter your selector
- Output: Choose between "Text", "HTML", or "Attribute"
Example configuration for extracting product names:
{
"extractionType": "xpath",
"xpath": "//div[@class='product-card']//h2[@class='product-title']/text()",
"returnArray": true
}
Step 3: Extract Multiple Fields
To extract multiple pieces of data, add multiple extraction rules:
Product Names:
//div[@class='product-card']//h2[@class='product-title']/text()
Prices:
//span[@class='price']/text()
Product URLs:
//a[@class='product-link']/@href
Images:
//img[@class='product-image']/@src
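Combined, the extraction rules might look like the sketch below. The exact parameter names vary between n8n versions, and the selectors are placeholders for the target site's markup:
{
  "extractionValues": [
    { "key": "name",  "xpath": "//div[@class='product-card']//h2[@class='product-title']/text()" },
    { "key": "price", "xpath": "//span[@class='price']/text()" },
    { "key": "url",   "xpath": "//a[@class='product-link']/@href" },
    { "key": "image", "xpath": "//img[@class='product-image']/@src" }
  ],
  "returnArray": true
}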
Using XPath in n8n Code Node
For more complex scenarios, use the Code node with JavaScript XPath libraries. This approach gives you full control over data extraction and transformation.
Using the xpath Library
First, parse the HTML and apply XPath queries. Note that require() for external packages such as xpath and xmldom works only on self-hosted n8n with the NODE_FUNCTION_ALLOW_EXTERNAL environment variable set to allow them (@xmldom/xmldom is the maintained fork of xmldom):
// In n8n Code node
const xpath = require('xpath');
const dom = require('xmldom').DOMParser;
// Get HTML from previous node
const html = $input.first().json.html;
// Parse HTML
const doc = new dom().parseFromString(html, 'text/html');
// Apply XPath queries
const productNodes = xpath.select("//div[@class='product-card']", doc);
const products = [];
productNodes.forEach(node => {
const product = {
name: xpath.select("string(.//h2[@class='product-title'])", node),
price: xpath.select("string(.//span[@class='price'])", node),
url: xpath.select("string(.//a[@class='product-link']/@href)", node),
image: xpath.select("string(.//img/@src)", node)
};
products.push(product);
});
return products.map(product => ({ json: product }));
Advanced XPath Patterns in Code Node
Handle more complex extraction scenarios:
const xpath = require('xpath');
const dom = require('xmldom').DOMParser;
const html = $input.first().json.html;
const doc = new dom().parseFromString(html, 'text/html');
// Extract nested data with conditional logic
const items = [];
// Find all product containers
const containers = xpath.select("//div[contains(@class, 'product')]", doc);
containers.forEach(container => {
// Extract with fallback selectors
let price = xpath.select("string(.//span[@class='sale-price'])", container);
if (!price) {
price = xpath.select("string(.//span[@class='regular-price'])", container);
}
// Extract and clean data
const item = {
title: xpath.select("string(.//h3)", container).trim(),
price: price.replace(/[^0-9.]/g, ''),
inStock: xpath.select("boolean(.//span[@class='in-stock'])", container),
rating: xpath.select("count(.//span[@class='star-filled'])", container),
reviews: xpath.select("string(.//span[@class='review-count'])", container)
};
items.push(item);
});
return items.map(item => ({ json: item }));
XPath with Puppeteer in n8n
When scraping JavaScript-heavy websites, combine Puppeteer with XPath for dynamic content extraction. Although n8n's Puppeteer node doesn't expose XPath selectors directly, you can evaluate XPath in the browser context with document.evaluate:
// In Puppeteer node's JavaScript code
const results = await page.evaluate(() => {
// Helper function to evaluate XPath
function getElementByXPath(xpath) {
return document.evaluate(
xpath,
document,
null,
XPathResult.FIRST_ORDERED_NODE_TYPE,
null
).singleNodeValue;
}
function getElementsByXPath(xpath) {
const iterator = document.evaluate(
xpath,
document,
null,
XPathResult.ORDERED_NODE_ITERATOR_TYPE,
null
);
const results = [];
let node = iterator.iterateNext();
while (node) {
results.push(node);
node = iterator.iterateNext();
}
return results;
}
// Extract data using XPath
const products = getElementsByXPath("//div[@class='product-item']");
return products.map(product => {
const titleNode = document.evaluate(
".//h2[@class='title']/text()",
product,
null,
XPathResult.STRING_TYPE,
null
);
const priceNode = document.evaluate(
".//span[@class='price']/text()",
product,
null,
XPathResult.STRING_TYPE,
null
);
return {
title: titleNode.stringValue,
price: priceNode.stringValue
};
});
});
return results;
Advanced XPath Techniques
Handling Dynamic Attributes
Extract data from elements with dynamic class names or IDs:
//div[starts-with(@class, 'product-')]
//div[contains(@id, 'item-')]
//button[contains(@class, 'btn-') and contains(@class, 'primary')]
Using XPath Axes
Navigate complex document structures:
//div[@class='product']/following-sibling::div[1] # Next sibling
//span[@class='price']/ancestor::div[@class='card'] # Parent container
//h2[@class='title']/parent::div # Direct parent
//table/descendant::td # All td descendants
Text Matching and Functions
Extract based on text content:
//li[contains(text(), 'Category:')]/following-sibling::li[1]/text()
//div[normalize-space(text())='Featured']
//span[string-length(text()) > 10]
//a[starts-with(@href, 'https://')]
Positional Selectors
Target specific elements by position:
//div[@class='product'][1] # First product
//div[@class='product'][last()] # Last product
//div[@class='product'][position() < 4] # First three products
//tr[position() mod 2 = 0] # Even rows
Error Handling and Best Practices
When working with XPath in n8n, implement robust error handling:
const xpath = require('xpath');
const dom = require('xmldom').DOMParser;
try {
const html = $input.first().json.html;
if (!html) {
throw new Error('No HTML content received');
}
const doc = new dom({
errorHandler: {
warning: () => {},
error: () => {},
fatalError: (error) => console.error(error)
}
}).parseFromString(html, 'text/html');
// Try primary selector
let results = xpath.select("//div[@class='product-new']", doc);
// Fallback to alternative selector
if (results.length === 0) {
results = xpath.select("//div[@class='product-item']", doc);
}
if (results.length === 0) {
throw new Error('No products found with any selector');
}
// Process results
const products = results.map(node => ({
name: xpath.select("string(.//h2)", node) || 'N/A',
price: xpath.select("string(.//span[@class='price'])", node) || '0'
}));
return products.map(p => ({ json: p }));
} catch (error) {
return [{
json: {
error: error.message,
timestamp: new Date().toISOString()
}
}];
}
Best Practices
- Use Specific Paths: Avoid overly broad selectors like //div//div//span
- Leverage Attributes: Use @id, @class, and data attributes for precision
- Test Incrementally: Verify XPath expressions in browser DevTools first
- Handle Missing Elements: Always provide fallbacks and default values
- Optimize Performance: Use direct paths when structure is consistent
- Document Selectors: Add comments explaining complex XPath expressions
- Validate Data: Check extracted data format and completeness (a small validation sketch follows this list)
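For the last point, a validation pass in a Code node might look like this; the field names follow the product example used earlier and are otherwise illustrative:
// Code node: validate extracted products before passing them downstream
const items = $input.all().map(item => item.json);
// Keep only items with a non-empty name and a purely numeric price
const valid = items.filter(p =>
  typeof p.name === 'string' && p.name.trim().length > 0 &&
  /^\d+(\.\d+)?$/.test(String(p.price))
);
if (valid.length === 0) {
  throw new Error(`Validation failed: 0 of ${items.length} extracted items were usable`);
}
return valid.map(p => ({ json: p }));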
Comparing XPath vs CSS Selectors
While CSS selectors are more common, XPath offers advantages:
When to use XPath:
- Need to select parent elements
- Require text-based matching
- Working with XML documents
- Need advanced functions (count, contains, etc.)

When to use CSS selectors:
- Simpler element selection
- Better performance for basic queries
- More readable for simple cases
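A quick side-by-side makes the trade-off concrete (class names are illustrative); the last two selections are awkward or impossible with most CSS selector engines:
div.product > h2.title                          # CSS selector
//div[@class='product']/h2[@class='title']      # Equivalent XPath
//h2[@class='title']/parent::div                # Select the parent element (XPath)
//a[contains(text(), 'Download')]               # Match on link text (XPath)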
Debugging XPath in n8n
To debug XPath expressions in your n8n workflow:
- Use Browser Console: Test XPath with $x("//your/xpath") in Chrome DevTools
- Add Logging: Insert Code nodes to log intermediate results
- Start Simple: Begin with basic expressions and add complexity gradually
- Check HTML Structure: Verify the actual HTML structure matches your expectations
- Use n8n's Debug Mode: Enable execution data to see what's returned at each step
Example debug node:
const xpath = require('xpath');
const dom = require('xmldom').DOMParser;
const html = $input.first().json.html;
const doc = new dom().parseFromString(html, 'text/html');
// Test multiple XPath expressions
const tests = {
'products_class': xpath.select("//div[@class='product']", doc).length,
'products_data_attr': xpath.select("//div[@data-type='product']", doc).length,
'all_divs': xpath.select("//div", doc).length,
'sample_text': xpath.select("string(//body)", doc).substring(0, 200)
};
return [{ json: tests }];
Real-World Example: Complete Scraping Workflow
Here's a complete n8n workflow for scraping an e-commerce site using XPath:
// Code node in n8n workflow
const xpath = require('xpath');
const dom = require('xmldom').DOMParser;
const html = $input.first().json.html;
const doc = new dom().parseFromString(html, 'text/html');
// Extract pagination info
const totalPages = parseInt(
xpath.select("string(//span[@class='total-pages'])", doc)
) || 1;
// Extract all products on current page
const productNodes = xpath.select(
"//article[@class='product-card']",
doc
);
const products = productNodes.map((node, index) => {
// Extract nested information
const specifications = {};
const specNodes = xpath.select(".//dl[@class='specs']/dt", node);
specNodes.forEach(dt => {
const key = xpath.select("string(.)", dt);
const value = xpath.select("string(./following-sibling::dd[1])", dt);
specifications[key] = value;
});
return {
id: xpath.select("string(./@data-product-id)", node),
name: xpath.select("string(.//h3[@class='product-name'])", node),
brand: xpath.select("string(.//span[@class='brand'])", node),
price: {
current: xpath.select("string(.//span[@class='current-price'])", node),
original: xpath.select("string(.//span[@class='original-price'])", node),
currency: xpath.select("string(.//meta[@itemprop='priceCurrency']/@content)", node)
},
rating: parseFloat(xpath.select("string(.//meta[@itemprop='ratingValue']/@content)", node)) || 0,
reviewCount: parseInt(xpath.select("string(.//span[@class='review-count'])", node)) || 0,
availability: xpath.select("boolean(.//span[@class='in-stock'])", node),
image: xpath.select("string(.//img[@class='product-image']/@src)", node),
url: xpath.select("string(.//a[@class='product-link']/@href)", node),
specifications: specifications,
position: index + 1
};
});
return [{
json: {
currentPage: parseInt($input.first().json.page) || 1,
totalPages: totalPages,
productCount: products.length,
products: products
}
}];
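To walk the remaining pages, a follow-up Code node can compute the next page URL from this output and feed it back to the HTTP Request node through an IF node. A minimal sketch, assuming the site paginates with a ?page= query parameter (the URL scheme is hypothetical):
// Follow-up Code node: decide whether another page should be fetched
const { currentPage, totalPages } = $input.first().json;
if (currentPage < totalPages) {
  return [{
    json: {
      hasMore: true,
      // Hypothetical pagination scheme; adjust to the target site
      nextPageUrl: `https://example.com/products?page=${currentPage + 1}`
    }
  }];
}
return [{ json: { hasMore: false } }];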
Conclusion
XPath is a powerful tool for web scraping in n8n workflows, offering precision and flexibility that CSS selectors can't match. By combining XPath with n8n's HTML Extract node or Code node, you can build robust scraping workflows that handle complex DOM structures and extract exactly the data you need.
Start with simple XPath expressions and gradually increase complexity as needed. Always test your selectors thoroughly and implement proper error handling to ensure your n8n workflows run reliably. Whether you're scraping product data, extracting article content, or monitoring website changes, XPath in n8n provides the control and precision necessary for professional web scraping automation.