How can I use XPath in n8n for web scraping?
XPath (XML Path Language) is a powerful query language for selecting nodes from HTML and XML documents. In n8n, XPath provides precise control over data extraction, making it an essential tool for web scraping workflows. This guide covers everything you need to know about using XPath in n8n for efficient data extraction.
Understanding XPath in n8n
XPath allows you to navigate through elements and attributes in HTML documents using path expressions. Unlike CSS selectors, XPath can traverse both forward and backward in the document tree, making it more flexible for complex scraping scenarios.
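For example, XPath can step from a matched element back up to its parent or to an enclosing container, something plain CSS selectors cannot express. The class names below are placeholders:
//span[@class='price']/..                               # Direct parent of the price span
//span[@class='price']/ancestor::div[@class='card'][1]  # Nearest enclosing card container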
In n8n, you can use XPath in several nodes:
- HTML Extract node: Directly supports XPath selectors
- Code node: Use JavaScript libraries such as xpath, or cheerio with an XPath plugin
- HTTP Request node: Fetches the raw HTML for subsequent processing nodes
Basic XPath Syntax
Before diving into n8n implementations, here are essential XPath expressions:
// - Selects nodes anywhere in the document
/ - Selects from the root node
. - Selects the current node
.. - Selects the parent of the current node
@ - Selects attributes
Common XPath Examples
//div[@class='product'] # All divs with class 'product'
//h1/text() # Text content of h1 elements
//a/@href # href attributes from all links
//div[@id='content']//p # All p elements inside div with id 'content'
//span[contains(text(), 'Price')] # Spans containing 'Price' text
//table/tbody/tr[1]/td # First row cells in a table
Using XPath in n8n HTML Extract Node
The HTML Extract node is the most straightforward way to use XPath in n8n workflows.
Step 1: Set Up HTTP Request Node
First, fetch the HTML content:
{
"method": "GET",
"url": "https://example.com/products",
"options": {
"headers": {
"User-Agent": "n8n-workflow"
}
}
}
Step 2: Configure HTML Extract Node
Add an HTML Extract node and configure it:
- Extraction Type: Select "XPath"
- XPath Expression: Enter your selector
- Output: Choose between "Text", "HTML", or "Attribute"
Example configuration for extracting product names:
{
"extractionType": "xpath",
"xpath": "//div[@class='product-card']//h2[@class='product-title']/text()",
"returnArray": true
}
Step 3: Extract Multiple Fields
To extract multiple pieces of data, add multiple extraction rules:
Product Names:
//div[@class='product-card']//h2[@class='product-title']/text()
Prices:
//span[@class='price']/text()
Product URLs:
//a[@class='product-link']/@href
Images:
//img[@class='product-image']/@src
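Combined, the extraction rules might look like the sketch below. The exact parameter names vary between n8n versions, and the selectors are placeholders for the target site's markup:
{
  "extractionValues": [
    { "key": "name",  "xpath": "//div[@class='product-card']//h2[@class='product-title']/text()" },
    { "key": "price", "xpath": "//span[@class='price']/text()" },
    { "key": "url",   "xpath": "//a[@class='product-link']/@href" },
    { "key": "image", "xpath": "//img[@class='product-image']/@src" }
  ],
  "returnArray": true
}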
Using XPath in n8n Code Node
For more complex scenarios, use the Code node with JavaScript XPath libraries. This approach gives you full control over data extraction and transformation.
Using the xpath Library
First, parse the HTML and apply XPath queries. Note that require() for external packages such as xpath and xmldom works only on self-hosted n8n with the NODE_FUNCTION_ALLOW_EXTERNAL environment variable set to allow them (@xmldom/xmldom is the maintained fork of xmldom):
// In n8n Code node
const xpath = require('xpath');
const dom = require('xmldom').DOMParser;
// Get HTML from previous node
const html = $input.first().json.html;
// Parse HTML
const doc = new dom().parseFromString(html, 'text/html');
// Apply XPath queries
const productNodes = xpath.select("//div[@class='product-card']", doc);
const products = [];
productNodes.forEach(node => {
const product = {
name: xpath.select("string(.//h2[@class='product-title'])", node),
price: xpath.select("string(.//span[@class='price'])", node),
url: xpath.select("string(.//a[@class='product-link']/@href)", node),
image: xpath.select("string(.//img/@src)", node)
};
products.push(product);
});
return products.map(product => ({ json: product }));
Advanced XPath Patterns in Code Node
Handle more complex extraction scenarios:
const xpath = require('xpath');
const dom = require('xmldom').DOMParser;
const html = $input.first().json.html;
const doc = new dom().parseFromString(html, 'text/html');
// Extract nested data with conditional logic
const items = [];
// Find all product containers
const containers = xpath.select("//div[contains(@class, 'product')]", doc);
containers.forEach(container => {
// Extract with fallback selectors
let price = xpath.select("string(.//span[@class='sale-price'])", container);
if (!price) {
price = xpath.select("string(.//span[@class='regular-price'])", container);
}
// Extract and clean data
const item = {
title: xpath.select("string(.//h3)", container).trim(),
price: price.replace(/[^0-9.]/g, ''),
inStock: xpath.select("boolean(.//span[@class='in-stock'])", container),
rating: xpath.select("count(.//span[@class='star-filled'])", container),
reviews: xpath.select("string(.//span[@class='review-count'])", container)
};
items.push(item);
});
return items.map(item => ({ json: item }));
XPath with Puppeteer in n8n
When scraping JavaScript-heavy websites, combine Puppeteer with XPath for dynamic content extraction. Although n8n's Puppeteer node doesn't expose XPath selectors directly, you can evaluate XPath in the browser context with document.evaluate:
// In Puppeteer node's JavaScript code
const results = await page.evaluate(() => {
// Helper function to evaluate XPath
function getElementByXPath(xpath) {
return document.evaluate(
xpath,
document,
null,
XPathResult.FIRST_ORDERED_NODE_TYPE,
null
).singleNodeValue;
}
function getElementsByXPath(xpath) {
const iterator = document.evaluate(
xpath,
document,
null,
XPathResult.ORDERED_NODE_ITERATOR_TYPE,
null
);
const results = [];
let node = iterator.iterateNext();
while (node) {
results.push(node);
node = iterator.iterateNext();
}
return results;
}
// Extract data using XPath
const products = getElementsByXPath("//div[@class='product-item']");
return products.map(product => {
const titleNode = document.evaluate(
".//h2[@class='title']/text()",
product,
null,
XPathResult.STRING_TYPE,
null
);
const priceNode = document.evaluate(
".//span[@class='price']/text()",
product,
null,
XPathResult.STRING_TYPE,
null
);
return {
title: titleNode.stringValue,
price: priceNode.stringValue
};
});
});
return results;
Advanced XPath Techniques
Handling Dynamic Attributes
Extract data from elements with dynamic class names or IDs:
//div[starts-with(@class, 'product-')]
//div[contains(@id, 'item-')]
//button[contains(@class, 'btn-') and contains(@class, 'primary')]
Using XPath Axes
Navigate complex document structures:
//div[@class='product']/following-sibling::div[1] # Next sibling
//span[@class='price']/ancestor::div[@class='card'] # Parent container
//h2[@class='title']/parent::div # Direct parent
//table/descendant::td # All td descendants
Text Matching and Functions
Extract based on text content:
//li[contains(text(), 'Category:')]/following-sibling::li[1]/text()
//div[normalize-space(text())='Featured']
//span[string-length(text()) > 10]
//a[starts-with(@href, 'https://')]
Positional Selectors
Target specific elements by position:
//div[@class='product'][1] # First product
//div[@class='product'][last()] # Last product
//div[@class='product'][position() < 4] # First three products
//tr[position() mod 2 = 0] # Even rows
Error Handling and Best Practices
When working with XPath in n8n, implement robust error handling:
const xpath = require('xpath');
const dom = require('xmldom').DOMParser;
try {
const html = $input.first().json.html;
if (!html) {
throw new Error('No HTML content received');
}
const doc = new dom({
errorHandler: {
warning: () => {},
error: () => {},
fatalError: (error) => console.error(error)
}
}).parseFromString(html, 'text/html');
// Try primary selector
let results = xpath.select("//div[@class='product-new']", doc);
// Fallback to alternative selector
if (results.length === 0) {
results = xpath.select("//div[@class='product-item']", doc);
}
if (results.length === 0) {
throw new Error('No products found with any selector');
}
// Process results
const products = results.map(node => ({
name: xpath.select("string(.//h2)", node) || 'N/A',
price: xpath.select("string(.//span[@class='price'])", node) || '0'
}));
return products.map(p => ({ json: p }));
} catch (error) {
return [{
json: {
error: error.message,
timestamp: new Date().toISOString()
}
}];
}
Best Practices
- Use Specific Paths: Avoid overly broad selectors like //div//div//span
- Leverage Attributes: Use @id, @class, and data attributes for precision
- Test Incrementally: Verify XPath expressions in browser DevTools first
- Handle Missing Elements: Always provide fallbacks and default values
- Optimize Performance: Use direct paths when structure is consistent
- Document Selectors: Add comments explaining complex XPath expressions
- Validate Data: Check extracted data format and completeness (a small validation sketch follows this list)
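For the last point, a validation pass in a Code node might look like this; the field names follow the product example used earlier and are otherwise illustrative:
// Code node: validate extracted products before passing them downstream
const items = $input.all().map(item => item.json);
// Keep only items with a non-empty name and a purely numeric price
const valid = items.filter(p =>
  typeof p.name === 'string' && p.name.trim().length > 0 &&
  /^\d+(\.\d+)?$/.test(String(p.price))
);
if (valid.length === 0) {
  throw new Error(`Validation failed: 0 of ${items.length} extracted items were usable`);
}
return valid.map(p => ({ json: p }));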
Comparing XPath vs CSS Selectors
While CSS selectors are more common, XPath offers advantages:
When to use XPath:
- Need to select parent elements
- Require text-based matching
- Working with XML documents
- Need advanced functions (count, contains, etc.)

When to use CSS selectors:
- Simpler element selection
- Better performance for basic queries
- More readable for simple cases
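A quick side-by-side makes the trade-off concrete (class names are illustrative); the last two selections are awkward or impossible with most CSS selector engines:
div.product > h2.title                          # CSS selector
//div[@class='product']/h2[@class='title']      # Equivalent XPath
//h2[@class='title']/parent::div                # Select the parent element (XPath)
//a[contains(text(), 'Download')]               # Match on link text (XPath)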
Debugging XPath in n8n
To debug XPath expressions in your n8n workflow:
- Use Browser Console: Test XPath with $x("//your/xpath") in Chrome DevTools
- Add Logging: Insert Code nodes to log intermediate results
- Start Simple: Begin with basic expressions and add complexity gradually
- Check HTML Structure: Verify the actual HTML structure matches your expectations
- Use n8n's Debug Mode: Enable execution data to see what's returned at each step
Example debug node:
const xpath = require('xpath');
const dom = require('xmldom').DOMParser;
const html = $input.first().json.html;
const doc = new dom().parseFromString(html, 'text/html');
// Test multiple XPath expressions
const tests = {
'products_class': xpath.select("//div[@class='product']", doc).length,
'products_data_attr': xpath.select("//div[@data-type='product']", doc).length,
'all_divs': xpath.select("//div", doc).length,
'sample_text': xpath.select("string(//body)", doc).substring(0, 200)
};
return [{ json: tests }];
Real-World Example: Complete Scraping Workflow
Here's a complete n8n workflow for scraping an e-commerce site using XPath:
// Code node in n8n workflow
const xpath = require('xpath');
const dom = require('xmldom').DOMParser;
const html = $input.first().json.html;
const doc = new dom().parseFromString(html, 'text/html');
// Extract pagination info
const totalPages = parseInt(
xpath.select("string(//span[@class='total-pages'])", doc)
) || 1;
// Extract all products on current page
const productNodes = xpath.select(
"//article[@class='product-card']",
doc
);
const products = productNodes.map((node, index) => {
// Extract nested information
const specifications = {};
const specNodes = xpath.select(".//dl[@class='specs']/dt", node);
specNodes.forEach(dt => {
const key = xpath.select("string(.)", dt);
const value = xpath.select("string(./following-sibling::dd[1])", dt);
specifications[key] = value;
});
return {
id: xpath.select("string(./@data-product-id)", node),
name: xpath.select("string(.//h3[@class='product-name'])", node),
brand: xpath.select("string(.//span[@class='brand'])", node),
price: {
current: xpath.select("string(.//span[@class='current-price'])", node),
original: xpath.select("string(.//span[@class='original-price'])", node),
currency: xpath.select("string(.//meta[@itemprop='priceCurrency']/@content)", node)
},
rating: parseFloat(xpath.select("string(.//meta[@itemprop='ratingValue']/@content)", node)) || 0,
reviewCount: parseInt(xpath.select("string(.//span[@class='review-count'])", node)) || 0,
availability: xpath.select("boolean(.//span[@class='in-stock'])", node),
image: xpath.select("string(.//img[@class='product-image']/@src)", node),
url: xpath.select("string(.//a[@class='product-link']/@href)", node),
specifications: specifications,
position: index + 1
};
});
return [{
json: {
currentPage: parseInt($input.first().json.page) || 1,
totalPages: totalPages,
productCount: products.length,
products: products
}
}];
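To walk the remaining pages, a follow-up Code node can compute the next page URL from this output and feed it back to the HTTP Request node through an IF node. A minimal sketch, assuming the site paginates with a ?page= query parameter (the URL scheme is hypothetical):
// Follow-up Code node: decide whether another page should be fetched
const { currentPage, totalPages } = $input.first().json;
if (currentPage < totalPages) {
  return [{
    json: {
      hasMore: true,
      // Hypothetical pagination scheme; adjust to the target site
      nextPageUrl: `https://example.com/products?page=${currentPage + 1}`
    }
  }];
}
return [{ json: { hasMore: false } }];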
Conclusion
XPath is a powerful tool for web scraping in n8n workflows, offering precision and flexibility that CSS selectors can't match. By combining XPath with n8n's HTML Extract node or Code node, you can build robust scraping workflows that handle complex DOM structures and extract exactly the data you need.
Start with simple XPath expressions and gradually increase complexity as needed. Always test your selectors thoroughly and implement proper error handling to ensure your n8n workflows run reliably. Whether you're scraping product data, extracting article content, or monitoring website changes, XPath in n8n provides the control and precision necessary for professional web scraping automation.