How Do I Use Regex in n8n to Extract Specific Data from Websites?

Regular expressions (regex) are powerful pattern-matching tools that allow you to extract specific data from web scraping results in n8n workflows. When combined with n8n's automation capabilities, regex becomes invaluable for parsing HTML content, extracting emails, phone numbers, prices, dates, and other structured data patterns.

Understanding Regex in n8n Context

n8n provides multiple ways to use regex for data extraction:

  1. Code Node: Full JavaScript/Python regex capabilities
  2. Extract from Text Node: Built-in regex extraction
  3. Function Items Node: Apply regex to multiple items
  4. Set Node: Use regex with expressions

The most common approaches are the Code Node and the Extract from Text node, both of which give you complete control over regex patterns and extraction logic.
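For quick one-off extractions, option 4 is often enough: a Set node field can hold an expression such as `{{ $json.text.match(/\d+/)?.[0] }}`. The plain JavaScript below shows what that expression evaluates to on a sample item (the `text` field and its value are made up for the example):

```javascript
// Simulates the n8n expression {{ $json.text.match(/\d+/)?.[0] }}
// on one sample item; in a real workflow $json is the incoming item's data
const item = { text: 'Order #4821 shipped on time' };

// First run of digits in the text, or null if there is none
const firstNumber = item.text.match(/\d+/)?.[0] ?? null;
```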

Basic Regex Extraction with Code Node

The Code Node in n8n allows you to write JavaScript or Python code with full regex support. Here's how to extract data from scraped HTML:

JavaScript Example

// Extract all email addresses from scraped content
const html = $input.first().json.html;

// Define regex pattern for emails
const emailRegex = /[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/g;

// Extract all matches
const emails = html.match(emailRegex) || [];

// Return structured data
return emails.map(email => ({
  json: { email: email }
}));

Python Example

import re

# Get HTML content from previous node
html = _input.first().json['html']

# Extract all phone numbers (US format)
phone_pattern = r'\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}'
phones = re.findall(phone_pattern, html)

# Return as structured data
return [{'json': {'phone': phone}} for phone in phones]

Common Regex Patterns for Web Scraping

Extracting Prices

// Extract prices in various formats ($99.99, $99, 99.99)
const priceRegex = /\$?\d{1,3}(?:,?\d{3})*(?:\.\d{2})?/g;
const prices = text.match(priceRegex);
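The matches come back as strings such as "$1,299.99". A small follow-up step (a sketch over a made-up `text` sample) strips the currency symbol and thousands separators and converts them to numbers:

```javascript
// Normalize matched price strings to numeric values
const text = 'Was $1,299.99, now $999 (you save 99.99)';
const priceRegex = /\$?\d{1,3}(?:,?\d{3})*(?:\.\d{2})?/g;

const prices = (text.match(priceRegex) || [])
  .map(p => parseFloat(p.replace(/[$,]/g, '')));
```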

Extracting URLs

// Extract all URLs from text
const urlRegex = /https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)/g;
const urls = html.match(urlRegex);

Extracting Dates

// Extract dates in MM/DD/YYYY or MM-DD-YYYY format
const dateRegex = /\d{1,2}[\/\-]\d{1,2}[\/\-]\d{4}/g;
const dates = text.match(dateRegex);
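A variant of the same pattern with capturing groups lets you normalize the matches to ISO 8601 in one step (the sample `text` is made up, and MM/DD ordering is assumed as above):

```javascript
// Capture month, day, and year separately, then rebuild as YYYY-MM-DD
const text = 'Published 3/7/2024, last updated 12-01-2024';
const dateRegex = /(\d{1,2})[\/\-](\d{1,2})[\/\-](\d{4})/g;

const isoDates = [...text.matchAll(dateRegex)].map(([, month, day, year]) =>
  `${year}-${month.padStart(2, '0')}-${day.padStart(2, '0')}`
);
```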

Extracting Product SKUs or IDs

// Extract product IDs (example: PRD-12345)
const skuRegex = /PRD-\d{5}/g;
const productIds = html.match(skuRegex);

Advanced Extraction Techniques

Capturing Groups

Use capturing groups to extract specific parts of matched patterns:

// Extract product name and price from a specific HTML structure
const html = $input.first().json.html;

// Pattern with capturing groups
const productRegex = /<div class="product">.*?<h3>(.*?)<\/h3>.*?<span class="price">\$([\d.]+)<\/span>/gs;

const products = [];
let match;

while ((match = productRegex.exec(html)) !== null) {
  products.push({
    json: {
      name: match[1].trim(),
      price: parseFloat(match[2])
    }
  });
}

return products;

Named Capture Groups

Modern JavaScript supports named capture groups for better readability:

// Extract email components with named groups
const emailRegex = /(?<username>[a-zA-Z0-9._%+-]+)@(?<domain>[a-zA-Z0-9.-]+\.[a-zA-Z]{2,})/g;

const emails = [];
let match;

while ((match = emailRegex.exec(html)) !== null) {
  emails.push({
    json: {
      email: match[0],
      username: match.groups.username,
      domain: match.groups.domain
    }
  });
}

return emails;
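Current n8n releases run on a Node.js version that supports `String.prototype.matchAll`, which expresses the same extraction without the stateful `exec` loop (the sample `html` string is made up; in a workflow it would come from `$input`):

```javascript
// Same named-group extraction via matchAll — no lastIndex bookkeeping
const html = 'Contact alice@example.com or bob@test.org for details';
const emailRegex = /(?<username>[a-zA-Z0-9._%+-]+)@(?<domain>[a-zA-Z0-9.-]+\.[a-zA-Z]{2,})/g;

const emails = [...html.matchAll(emailRegex)].map(m => ({
  json: {
    email: m[0],
    username: m.groups.username,
    domain: m.groups.domain
  }
}));
```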

Using Extract from Text Node

n8n's built-in "Extract from Text" node provides a user-friendly interface for regex extraction without coding:

  1. Add Extract from Text node to your workflow
  2. Select Regex as extraction mode
  3. Enter your regex pattern
  4. Specify the text field to search
  5. Choose whether to extract first match or all matches

Example configuration:

  • Input Field: html
  • Pattern: \d{3}-\d{3}-\d{4}
  • Extract: All matches
  • Output Field: phone_numbers

Combining Regex with HTML Parsing

For complex scraping tasks, combine regex with HTML parsing tools. After handling AJAX requests using Puppeteer or other browser automation, you can use regex to extract specific patterns from the retrieved HTML.

Hybrid Approach

// First, extract content from specific HTML elements
// (cheerio must be available to the Code Node — on self-hosted n8n,
// external modules are enabled via NODE_FUNCTION_ALLOW_EXTERNAL)
const cheerio = require('cheerio');
const $ = cheerio.load(html);

// Get text from specific elements
const descriptions = [];

$('.product-description').each((i, elem) => {
  const text = $(elem).text();

  // Use regex to extract specific patterns from cleaned text
  const modelRegex = /Model:\s*([A-Z0-9-]+)/i;
  const match = text.match(modelRegex);

  if (match) {
    descriptions.push({
      json: {
        fullText: text,
        modelNumber: match[1]
      }
    });
  }
});

return descriptions;

Regex Best Practices for n8n

1. Test Your Patterns

Always test regex patterns before deploying:

// Add validation and error handling
const text = $input.first().json.text || '';

try {
  const pattern = /your-pattern-here/g;
  const matches = text.match(pattern);

  if (!matches || matches.length === 0) {
    return [{
      json: {
        error: 'No matches found',
        input: text
      }
    }];
  }

  return matches.map(match => ({ json: { value: match } }));
} catch (error) {
  return [{
    json: {
      error: error.message,
      input: text
    }
  }];
}

2. Use Non-Greedy Matching

When extracting from HTML, use non-greedy quantifiers (*?, +?) to avoid matching too much:

// Greedy - might match too much
const badRegex = /<div>(.*)<\/div>/;

// Non-greedy - stops at first closing tag
const goodRegex = /<div>(.*?)<\/div>/;
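The difference is concrete on a string with repeated tags (sample input made up for the illustration):

```javascript
const html = '<div>first</div><div>second</div>';

// Greedy: .* runs to the LAST </div>, swallowing the tag boundary
const greedy = html.match(/<div>(.*)<\/div>/)[1];   // 'first</div><div>second'

// Non-greedy: .*? stops at the FIRST </div>
const lazy = html.match(/<div>(.*?)<\/div>/)[1];    // 'first'
```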

3. Handle Multi-line Content

Use appropriate flags for multi-line content:

// 's' flag makes . match newlines
// 'g' flag for global matching
// 'i' flag for case-insensitive matching
const regex = /<article>(.*?)<\/article>/gis;

4. Clean Data After Extraction

// Extract and clean data in one step
const emailRegex = /[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/g;
const emails = html.match(emailRegex) || [];

return emails
  .map(email => email.toLowerCase().trim())
  .filter(email => email.length > 0)
  .map(email => ({ json: { email } }));

Real-World Example: Extracting Product Information

Here's a complete n8n workflow example that scrapes product data and uses regex for extraction:

// Code Node: Extract product details from scraped HTML
const html = $input.first().json.html;

// Extract multiple data points with different regex patterns
const products = [];

// Pattern to match product blocks (non-greedy, so it stops at the FIRST
// </div> — this assumes no nested <div>s inside each product block)
const productBlockRegex = /<div class="product-item">(.*?)<\/div>/gs;
let blockMatch;

while ((blockMatch = productBlockRegex.exec(html)) !== null) {
  const block = blockMatch[1];

  // Extract name
  const nameMatch = /<h3.*?>(.*?)<\/h3>/.exec(block);
  const name = nameMatch ? nameMatch[1].trim() : null;

  // Extract price
  const priceMatch = /\$(\d+\.?\d*)/.exec(block);
  const price = priceMatch ? parseFloat(priceMatch[1]) : null;

  // Extract SKU
  const skuMatch = /SKU:\s*([A-Z0-9-]+)/i.exec(block);
  const sku = skuMatch ? skuMatch[1] : null;

  // Extract rating
  const ratingMatch = /(\d\.\d)\s*stars?/i.exec(block);
  const rating = ratingMatch ? parseFloat(ratingMatch[1]) : null;

  if (name && price) {
    products.push({
      json: {
        name,
        price,
        sku,
        rating,
        extractedAt: new Date().toISOString()
      }
    });
  }
}

return products;

Debugging Regex in n8n

When your regex doesn't work as expected:

  1. Log intermediate results: Use console.log() in Code Node to see what's being matched
  2. Test patterns externally: Use regex testers like regex101.com
  3. Start simple: Build complex patterns incrementally
  4. Check your input: Verify the HTML structure matches your expectations

// Debug helper
const html = $input.first().json.html;
const pattern = /your-pattern/g;

console.log('Input length:', html.length);
console.log('Pattern:', pattern.toString());

const matches = html.match(pattern);
console.log('Number of matches:', matches ? matches.length : 0);
console.log('First match:', matches ? matches[0] : 'none');

return matches ? matches.map(m => ({ json: { value: m } })) : [];

Integration with n8n Workflow Automation

Regex extraction becomes especially powerful when integrated into larger n8n workflows. After handling browser sessions in Puppeteer to scrape authenticated content, you can use regex to extract specific data points and route them through conditional logic or data transformation nodes.

Complete Workflow Pattern

  1. HTTP Request/Puppeteer Node: Fetch webpage
  2. Code Node: Apply regex extraction
  3. Function Items Node: Transform extracted data
  4. Filter Node: Remove invalid matches
  5. Set Node: Format output
  6. Database/API Node: Store or send results
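Steps 2 and 4 can also be combined in a single Code Node. A sketch (with hard-coded sample items standing in for the extraction output) that validates and dedupes emails before they reach storage:

```javascript
// Validate and dedupe extracted matches before passing them downstream
const extracted = [
  { json: { email: 'A@Example.com' } },
  { json: { email: 'a@example.com' } },  // duplicate after normalization
  { json: { email: 'not-an-email' } }    // fails validation
];

const seen = new Set();
const valid = extracted
  .map(item => item.json.email.toLowerCase().trim())
  .filter(email => /^[^\s@]+@[^\s@]+\.[^\s@]+$/.test(email))
  .filter(email => !seen.has(email) && !!seen.add(email))
  .map(email => ({ json: { email } }));
```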

Performance Considerations

When using regex in n8n for large-scale scraping:

  • Compile patterns once: Store regex patterns as constants
  • Limit backtracking: Avoid nested quantifiers
  • Use specific patterns: More specific patterns are faster
  • Consider alternatives: For complex HTML parsing, use Cheerio or similar libraries

// Efficient: compile pattern once
const EMAIL_PATTERN = /[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/g;

const items = $input.all();
const results = [];

for (const item of items) {
  const matches = item.json.text.match(EMAIL_PATTERN);
  if (matches) {
    results.push(...matches.map(m => ({ json: { email: m } })));
  }
}

return results;

Conclusion

Regular expressions in n8n provide powerful capabilities for extracting specific data from scraped websites. By combining regex with n8n's Code Node, Extract from Text node, and other automation features, you can build sophisticated data extraction workflows that parse and transform web content into structured, usable data. Master these patterns and techniques to enhance your web scraping automation and extract exactly the information you need from any website.

For more advanced scenarios, consider combining regex extraction with monitoring network requests in Puppeteer to capture data from API calls alongside scraped HTML content.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
