How to handle captchas when using Puppeteer?

CAPTCHAs (Completely Automated Public Turing tests to tell Computers and Humans Apart) are designed to prevent automated access to websites. When you use Puppeteer for web scraping or automation, you will often run into them. This guide covers strategies for detecting, avoiding, and solving CAPTCHAs while keeping your scraping ethical.

Understanding CAPTCHA Types

Before diving into solutions, it's important to understand the different types of CAPTCHAs you might encounter:

  • Text-based CAPTCHAs: Distorted text that needs to be typed
  • Image-based CAPTCHAs: Selecting specific images or objects
  • reCAPTCHA v2: Google's "I'm not a robot" checkbox
  • reCAPTCHA v3: Invisible scoring system
  • hCaptcha: Privacy-focused alternative to reCAPTCHA
  • Custom CAPTCHAs: Site-specific challenges

Detection Strategies

1. CAPTCHA Element Detection

First, you need to detect when a CAPTCHA appears on the page:

const puppeteer = require('puppeteer');

async function detectCaptcha(page) {
  // Common CAPTCHA selectors
  const captchaSelectors = [
    '.g-recaptcha',
    '#recaptcha',
    '.h-captcha',
    '.captcha',
    '[data-captcha]',
    'iframe[src*="recaptcha"]',
    'iframe[src*="hcaptcha"]'
  ];

  for (const selector of captchaSelectors) {
    const element = await page.$(selector);
    if (element) {
      console.log(`CAPTCHA detected: ${selector}`);
      return { found: true, type: selector };
    }
  }

  return { found: false };
}

// Usage example (wrapped in an async function because CommonJS has no top-level await)
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');

  const captchaResult = await detectCaptcha(page);
  if (captchaResult.found) {
    console.log('CAPTCHA detected, handling required');
  }

  await browser.close();
})();
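
Selector checks only look at the document they run in, so a widget nested inside another frame can be missed. As a complementary check, the sketch below scans the URLs of every frame on the page. It assumes that provider iframes carry the substring `recaptcha` or `hcaptcha` in their URL (the same assumption the iframe selectors above make); `classifyFrameUrl` and `detectCaptchaFrames` are hypothetical helper names.

```javascript
// Map a frame URL to a CAPTCHA provider name, or null if it looks unrelated.
// Assumption: provider iframes carry "recaptcha" or "hcaptcha" in their URL.
function classifyFrameUrl(url) {
  const lower = String(url).toLowerCase();
  if (lower.includes('recaptcha')) return 'recaptcha';
  if (lower.includes('hcaptcha')) return 'hcaptcha';
  return null;
}

// Scan every frame on the page (including nested ones) for CAPTCHA providers.
// `page` is a Puppeteer Page; page.frames() and frame.url() are synchronous.
function detectCaptchaFrames(page) {
  const providers = new Set();
  for (const frame of page.frames()) {
    const provider = classifyFrameUrl(frame.url());
    if (provider) providers.add(provider);
  }
  return { found: providers.size > 0, providers: [...providers] };
}
```

Because the function only reads frame URLs, it can run alongside `detectCaptcha(page)` and the two results can be combined.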

2. Dynamic CAPTCHA Detection

Some CAPTCHAs appear dynamically after certain actions:

async function waitForCaptchaOrSuccess(page, timeout = 10000) {
  try {
    await Promise.race([
      page.waitForSelector('.success-message', { timeout }),
      page.waitForSelector('.g-recaptcha', { timeout }),
      page.waitForSelector('.h-captcha', { timeout })
    ]);

    const captcha = await detectCaptcha(page);
    return captcha.found ? 'captcha' : 'success';
  } catch (error) {
    return 'timeout';
  }
}

Avoidance Techniques

1. Stealth Configuration

Configure Puppeteer to appear more human-like:

const puppeteer = require('puppeteer');

async function createStealthBrowser() {
  const browser = await puppeteer.launch({
    headless: false, // Sometimes headless mode triggers CAPTCHAs
    args: [
      '--no-sandbox',
      '--disable-setuid-sandbox',
      '--disable-dev-shm-usage',
      '--disable-accelerated-2d-canvas',
      '--no-first-run',
      '--no-zygote',
      '--disable-gpu',
      '--disable-extensions',
      '--disable-plugins',
      '--disable-default-apps',
      '--disable-background-timer-throttling',
      '--disable-backgrounding-occluded-windows',
      '--disable-renderer-backgrounding'
    ]
  });

  const page = await browser.newPage();

  // Set realistic viewport
  await page.setViewport({ width: 1366, height: 768 });

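  // Set user agent; keep the Chrome version in line with the browser build you actually run,
  // because a stale or mismatched version string can itself look suspicious
  await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36');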

  return { browser, page };
}

2. Human-like Behavior Simulation

Implement delays and natural mouse movements:

async function humanLikeInteraction(page) {
  // Random integer delay between min and max (inclusive), in milliseconds
  const randomDelay = (min, max) => Math.floor(Math.random() * (max - min + 1)) + min;

  // page.waitForTimeout() was removed in Puppeteer v22, so use a plain timer instead
  const sleep = (ms) => new Promise(resolve => setTimeout(resolve, ms));

  // Human-like typing
  async function typeHuman(selector, text) {
    await page.click(selector);
    await sleep(randomDelay(100, 300));

    for (const char of text) {
      await page.keyboard.type(char);
      await sleep(randomDelay(50, 150));
    }
  }

  // Gradual mouse movement
  async function moveMouseGradually(startX, startY, endX, endY) {
    const steps = 10;
    for (let i = 0; i <= steps; i++) {
      const x = startX + (endX - startX) * (i / steps);
      const y = startY + (endY - startY) * (i / steps);
      await page.mouse.move(x, y);
      await sleep(randomDelay(10, 50));
    }
  }

  return { typeHuman, moveMouseGradually, randomDelay };
}
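
The `moveMouseGradually` helper above moves in equal steps along a straight line. One possible refinement, sketched here, is to ease the movement so the pointer starts slowly, speeds up, and slows down again. The `easedMousePath` name and the smoothstep easing curve are illustrative choices, not something any particular detection system is known to require.

```javascript
// Smoothstep easing: maps t in [0, 1] to [0, 1], slow at both ends.
function easeInOut(t) {
  return t * t * (3 - 2 * t);
}

// Build a list of {x, y} points from start to end with eased spacing.
// Hypothetical helper: feed each point to page.mouse.move(x, y).
function easedMousePath(startX, startY, endX, endY, steps = 10) {
  const points = [];
  for (let i = 0; i <= steps; i++) {
    const t = easeInOut(i / steps);
    points.push({
      x: startX + (endX - startX) * t,
      y: startY + (endY - startY) * t
    });
  }
  return points;
}
```

Each point can then be passed to `page.mouse.move(x, y)` with a short random pause between moves, in the same way as the linear version.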

3. Request Throttling

Implement request throttling to avoid triggering rate limits:

class RequestThrottler {
  constructor(delayMs = 1000) {
    this.delayMs = delayMs;
    this.lastRequestTime = 0;
  }

  async throttle() {
    const now = Date.now();
    const timeSinceLastRequest = now - this.lastRequestTime;

    if (timeSinceLastRequest < this.delayMs) {
      const waitTime = this.delayMs - timeSinceLastRequest;
      await new Promise(resolve => setTimeout(resolve, waitTime));
    }

    this.lastRequestTime = Date.now();
  }
}

// Usage
const throttler = new RequestThrottler(2000); // 2 second delay

async function navigateWithThrottling(page, url) {
  await throttler.throttle();
  await page.goto(url);
}
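
A fixed delay produces perfectly regular request intervals, which can look mechanical. A common refinement is to add random jitter to each wait. The sketch below shows one way to compute such a delay; `jitteredDelay` and its `jitterRatio` parameter are hypothetical names, and the random source is injectable only so that the function is easy to test.

```javascript
// Return a delay in ms drawn uniformly from
// [baseMs * (1 - jitterRatio), baseMs * (1 + jitterRatio)).
// `random` defaults to Math.random and must return a number in [0, 1).
function jitteredDelay(baseMs, jitterRatio = 0.25, random = Math.random) {
  const spread = baseMs * jitterRatio;
  const offset = (random() * 2 - 1) * spread; // in [-spread, +spread)
  return Math.max(0, Math.round(baseMs + offset));
}
```

Inside `throttle()`, comparing against `jitteredDelay(this.delayMs)` instead of `this.delayMs` would make each wait slightly different.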

Solving Techniques

1. Manual Intervention

For development and testing purposes, you can pause execution for manual CAPTCHA solving:

async function handleCaptchaManually(page) {
  const captcha = await detectCaptcha(page);

  if (captcha.found) {
    console.log('CAPTCHA detected. Please solve it manually.');
    console.log('Press Enter in the console when done...');

    // Wait for user input
    await new Promise(resolve => {
      process.stdin.once('data', () => resolve());
    });

    // Verify CAPTCHA was solved (plain timer; page.waitForTimeout() was removed in Puppeteer v22)
    await new Promise(resolve => setTimeout(resolve, 2000));
    const stillPresent = await detectCaptcha(page);

    if (!stillPresent.found) {
      console.log('CAPTCHA solved successfully!');
      return true;
    } else {
      console.log('CAPTCHA still present. Please try again.');
      return false;
    }
  }

  return true;
}

2. Third-Party CAPTCHA Solving Services

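Integrate with a solving service such as 2captcha or Anti-Captcha. The example below talks to 2captcha-style in.php and res.php endpoints; Anti-Captcha exposes a different JSON API and would need its own client: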

const axios = require('axios');

class CaptchaSolver {
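  constructor(apiKey, service = '2captcha') {
    this.apiKey = apiKey;
    this.service = service;
    // Only the 2captcha in.php/res.php protocol is implemented in this class
    if (service !== '2captcha') {
      throw new Error(`Unsupported service: ${service}`);
    }
    this.baseUrl = 'https://2captcha.com';
  }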

  async solveCaptcha(captchaData) {
    try {
      // Submit CAPTCHA for solving
      // in.php expects form-encoded fields rather than a JSON body
      const submitResponse = await axios.post(
        `${this.baseUrl}/in.php`,
        new URLSearchParams({ key: this.apiKey, method: 'base64', body: captchaData })
      );

      if (submitResponse.data.includes('OK|')) {
        const captchaId = submitResponse.data.split('|')[1];

        // Poll for result
        return await this.pollForResult(captchaId);
      }

      throw new Error('Failed to submit CAPTCHA');
    } catch (error) {
      console.error('CAPTCHA solving error:', error.message);
      return null;
    }
  }

  async pollForResult(captchaId, maxAttempts = 30) {
    for (let i = 0; i < maxAttempts; i++) {
      await new Promise(resolve => setTimeout(resolve, 5000));

      let data;
      try {
        const response = await axios.get(`${this.baseUrl}/res.php`, {
          params: {
            key: this.apiKey,
            action: 'get',
            id: captchaId
          }
        });
        data = String(response.data);
      } catch (error) {
        // Network errors are treated as transient: log and keep polling
        console.error('Polling error:', error.message);
        continue;
      }

      if (data.startsWith('OK|')) {
        return data.split('|')[1];
      }

      // 'CAPCHA_NOT_READY' is the literal status string returned by the API
      if (data !== 'CAPCHA_NOT_READY') {
        // Any other response is a hard failure, so stop polling
        throw new Error(`CAPTCHA solving failed: ${data}`);
      }
    }

    throw new Error('CAPTCHA solving timeout');
  }
}
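
The `solveCaptcha` method expects the CAPTCHA image as a base64 string, but the class does not show where that string comes from. For a text-based image CAPTCHA, one option is to screenshot the image element and type the returned answer into the form. The sketch below assumes a `solver` object with the `solveCaptcha(base64)` method defined above; the function name and both selectors are hypothetical and depend on the target page.

```javascript
// Screenshot an image CAPTCHA, send it to the solver, and type the answer.
// `page` is a Puppeteer Page; `solver` exposes solveCaptcha(base64) -> string|null.
async function solveImageCaptcha(page, solver, imageSelector, inputSelector) {
  const image = await page.$(imageSelector);
  if (!image) {
    return false; // No CAPTCHA image on the page
  }

  // ElementHandle.screenshot can return the image as a base64 string
  const base64 = await image.screenshot({ encoding: 'base64' });

  const answer = await solver.solveCaptcha(base64);
  if (!answer) {
    return false; // The service failed or timed out
  }

  await page.type(inputSelector, answer);
  return true;
}
```

The function returns `false` rather than throwing, so the caller can fall back to manual solving or a retry.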

3. Browser Extension Integration

Use browser extensions for automatic CAPTCHA solving:

async function launchWithCaptchaExtension() {
  const browser = await puppeteer.launch({
    headless: false,
    args: [
      '--disable-extensions-except=/path/to/captcha-extension',
      '--load-extension=/path/to/captcha-extension'
    ]
  });

  const page = await browser.newPage();

  // Wait for extension to load (plain timer; page.waitForTimeout() was removed in Puppeteer v22)
  await new Promise(resolve => setTimeout(resolve, 3000));

  return { browser, page };
}

Advanced Handling Strategies

1. Retry Logic with Exponential Backoff

class CaptchaHandler {
  constructor(maxRetries = 3) {
    this.maxRetries = maxRetries;
  }

  async handleWithRetry(page, actionFunction) {
    for (let attempt = 1; attempt <= this.maxRetries; attempt++) {
      try {
        await actionFunction(page);

        const captcha = await detectCaptcha(page);
        if (!captcha.found) {
          return { success: true, attempts: attempt };
        }

        console.log(`CAPTCHA encountered on attempt ${attempt}`);

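        if (attempt < this.maxRetries) {
          const delay = 2 ** attempt * 1000; // Exponential backoff: 2s, 4s, 8s, ...
          console.log(`Waiting ${delay}ms before retry...`);
          // Plain timer; page.waitForTimeout() was removed in Puppeteer v22
          await new Promise(resolve => setTimeout(resolve, delay));

          // Reload the page so the next attempt starts from a fresh state
          await page.reload();
        }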

      } catch (error) {
        console.error(`Attempt ${attempt} failed:`, error.message);

        if (attempt === this.maxRetries) {
          throw error;
        }
      }
    }

    return { success: false, attempts: this.maxRetries };
  }
}
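
The delay in `handleWithRetry` doubles with each attempt: 2 seconds after the first attempt, 4 seconds after the second, and so on. Unbounded doubling becomes impractical for larger retry counts, so a cap is often added. A minimal sketch of the schedule, with hypothetical names `backoffDelay`, `backoffSchedule`, and `capMs`:

```javascript
// Delay before the retry that follows attempt number `attempt` (1-based).
// Uses the same 2^attempt * 1000 formula as above, limited to capMs.
function backoffDelay(attempt, baseMs = 1000, capMs = 30000) {
  return Math.min(capMs, 2 ** attempt * baseMs);
}

// Full list of waits for a given number of attempts
function backoffSchedule(maxRetries, baseMs = 1000, capMs = 30000) {
  const delays = [];
  // No wait follows the final attempt, so stop at maxRetries - 1
  for (let attempt = 1; attempt < maxRetries; attempt++) {
    delays.push(backoffDelay(attempt, baseMs, capMs));
  }
  return delays;
}

console.log(backoffSchedule(3)); // [ 2000, 4000 ]
console.log(backoffSchedule(7)); // [ 2000, 4000, 8000, 16000, 30000, 30000 ]
```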

2. Context Switching

Use multiple browser contexts to isolate sessions:

async function handleMultipleContexts() {
  const browser = await puppeteer.launch();
  const contexts = [];

  // Puppeteer v22+ renamed createIncognitoBrowserContext() to createBrowserContext();
  // support both so the example runs on older and newer versions
  const createContext = () =>
    typeof browser.createBrowserContext === 'function'
      ? browser.createBrowserContext()
      : browser.createIncognitoBrowserContext();

  // Create multiple contexts
  for (let i = 0; i < 3; i++) {
    contexts.push(await createContext());
  }

  // Function to get a clean context
  async function getCleanContext() {
    // Reuse a pre-created context, or create a new one if none are left
    const context = contexts.shift() ?? (await createContext());
    const page = await context.newPage();
    return { context, page };
  }

  return { browser, getCleanContext };
}

Best Practices and Ethical Considerations

1. Rate Limiting and Respectful Scraping

class RespectfulScraper {
  constructor(options = {}) {
    // `??` keeps explicit values such as 0 or false; `|| true` would always yield true
    this.requestDelay = options.requestDelay ?? 1000;
    this.maxConcurrency = options.maxConcurrency ?? 1;
    this.respectRobotsTxt = options.respectRobotsTxt ?? true;
  }

  async scrapeWithRespect(urls) {
    const results = [];

    for (const url of urls) {
      try {
        // Check robots.txt if enabled
        if (this.respectRobotsTxt) {
          const allowed = await this.checkRobotsTxt(url);
          if (!allowed) {
            console.log(`Skipping ${url} due to robots.txt restrictions`);
            continue;
          }
        }

        // Implement delay
        await new Promise(resolve => setTimeout(resolve, this.requestDelay));

        const result = await this.scrapePage(url);
        results.push(result);

      } catch (error) {
        console.error(`Error scraping ${url}:`, error.message);
      }
    }

    return results;
  }

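  async checkRobotsTxt(url) {
    // Placeholder: always allows the URL. A real implementation would fetch
    // and parse the site's robots.txt before answering.
    return true;
  }

  async scrapePage(url) {
    // Placeholder: scrapeWithRespect() calls this method, so supply your own scraping logic here
    throw new Error('scrapePage() is not implemented');
  }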
}
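
`checkRobotsTxt` above is a stub. As a rough illustration of what a real check involves, the sketch below parses robots.txt text and decides whether a path is allowed for the wildcard user agent. It is deliberately simplified: it only reads `User-agent: *` groups, treats rules as plain path prefixes (no `*` or `$` patterns), and resolves conflicts by the longest matching rule. `isPathAllowed` is a hypothetical helper; a maintained parser library is likely a better choice in production.

```javascript
// Decide whether `path` is allowed by the "User-agent: *" rules in robotsTxt.
function isPathAllowed(robotsTxt, path) {
  const rules = []; // { allow: boolean, prefix: string } for the "*" group
  let inWildcardGroup = false;
  let lastWasUserAgent = false;

  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, '').trim(); // strip comments
    const colon = line.indexOf(':');
    if (colon === -1) continue;

    const field = line.slice(0, colon).trim().toLowerCase();
    const value = line.slice(colon + 1).trim();

    if (field === 'user-agent') {
      // Consecutive User-agent lines belong to the same group
      if (!lastWasUserAgent) inWildcardGroup = false;
      if (value === '*') inWildcardGroup = true;
      lastWasUserAgent = true;
      continue;
    }
    lastWasUserAgent = false;

    if (inWildcardGroup && (field === 'allow' || field === 'disallow')) {
      // An empty value carries no restriction, so it is skipped
      if (value !== '') rules.push({ allow: field === 'allow', prefix: value });
    }
  }

  // Longest matching prefix wins; Allow wins a tie; no match means allowed
  let best = null;
  for (const rule of rules) {
    if (!path.startsWith(rule.prefix)) continue;
    if (!best || rule.prefix.length > best.prefix.length ||
        (rule.prefix.length === best.prefix.length && rule.allow)) {
      best = rule;
    }
  }
  return best ? best.allow : true;
}
```

A fuller `checkRobotsTxt(url)` could download `/robots.txt` from the URL's origin, cache it per host, and call this function with `new URL(url).pathname`.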

2. Monitoring and Logging

class CaptchaMonitor {
  constructor() {
    this.captchaEncounters = [];
    this.successRate = 0;
  }

  logCaptchaEncounter(url, captchaType, resolved) {
    const encounter = {
      timestamp: new Date(),
      url,
      captchaType,
      resolved,
      userAgent: 'current-user-agent'
    };

    this.captchaEncounters.push(encounter);
    this.updateSuccessRate();
  }

  updateSuccessRate() {
    const total = this.captchaEncounters.length;
    const resolved = this.captchaEncounters.filter(e => e.resolved).length;
    this.successRate = total > 0 ? (resolved / total) * 100 : 0;
  }

  getStatistics() {
    return {
      totalEncounters: this.captchaEncounters.length,
      successRate: this.successRate,
      mostCommonTypes: this.getMostCommonTypes()
    };
  }

  getMostCommonTypes() {
    const typeCounts = {};
    this.captchaEncounters.forEach(e => {
      typeCounts[e.captchaType] = (typeCounts[e.captchaType] || 0) + 1;
    });

    return Object.entries(typeCounts)
      .sort(([,a], [,b]) => b - a)
      .slice(0, 5);
  }
}

Alternative Approaches

When CAPTCHAs become too challenging to handle programmatically, consider these alternatives:

1. API-First Approach

Many websites offer APIs that provide the same data without CAPTCHAs. Research whether the target site has an official API.

2. Different Data Sources

Look for alternative sources that provide similar data without CAPTCHA protection.

3. Browser Automation Tools

Consider other browser automation tools such as Playwright. Their browser fingerprints differ from Puppeteer's, so some detection rules may not match them, although this is not guaranteed and varies by site.

Conclusion

Handling CAPTCHAs in Puppeteer requires a multi-faceted approach combining detection, avoidance, and solving strategies. The key is to:

  1. Minimize CAPTCHA encounters through stealth techniques and respectful scraping practices
  2. Implement robust detection to identify when CAPTCHAs appear
  3. Have fallback strategies for when CAPTCHAs cannot be avoided
  4. Monitor and adapt your approach based on success rates and patterns

Remember that CAPTCHAs exist to protect websites from abuse. Always ensure your scraping activities are ethical, legal, and respectful of the target website's terms of service. When possible, consider reaching out to website owners to discuss your use case and potentially gain legitimate access to their data.

For more advanced automation scenarios, you might also want to explore handling complex user interactions and managing browser sessions effectively to create more robust web scraping solutions.
