How to Scrape Data from Websites that Require Two-Factor Authentication

Two-factor authentication (2FA) presents a significant challenge for web scraping automation. While 2FA is designed to prevent unauthorized access, there are legitimate scenarios where developers need to automate data extraction from 2FA-protected sites they own or have permission to access. This guide explores various approaches and best practices for handling 2FA in web scraping scenarios.

Understanding Two-Factor Authentication in Web Scraping

Two-factor authentication requires users to provide two different forms of identification: something they know (password) and something they have (phone, authenticator app, or hardware token). This creates complications for automated scraping since the second factor typically requires human intervention.

Common 2FA methods include:

  • SMS verification codes
  • Time-based One-Time Passwords (TOTP) from authenticator apps
  • Email verification codes
  • Hardware security keys
  • Push notifications

Approach 1: Using Application-Specific Passwords and API Tokens

The most reliable method is to use application-specific passwords or API tokens when available:

JavaScript Example with API Tokens

const axios = require('axios');

async function scrapeWithApiToken() {
    const config = {
        headers: {
            'Authorization': `Bearer ${process.env.API_TOKEN}`,
            'User-Agent': 'MyApp/1.0'
        }
    };

    try {
        const response = await axios.get('https://api.example.com/data', config);
        return response.data;
    } catch (error) {
        console.error('API request failed:', error.response?.data);
        throw error;
    }
}

// Usage
scrapeWithApiToken()
    .then(data => console.log('Scraped data:', data))
    .catch(error => console.error('Error:', error));

Python Example with Application Passwords

import requests
import os
from requests.auth import HTTPBasicAuth

def scrape_with_app_password():
    """Use application-specific password for authentication"""

    url = "https://example.com/api/data"
    username = os.getenv('USERNAME')
    app_password = os.getenv('APP_PASSWORD')

    auth = HTTPBasicAuth(username, app_password)
    headers = {
        'User-Agent': 'MyApp/1.0',
        'Accept': 'application/json'
    }

    try:
        response = requests.get(url, auth=auth, headers=headers)
        response.raise_for_status()
        return response.json()
    except requests.RequestException as e:
        print(f"Request failed: {e}")
        raise

# Usage
try:
    data = scrape_with_app_password()
    print("Scraped data:", data)
except Exception as e:
    print(f"Error: {e}")

Approach 2: Session Management with Manual 2FA

When API access isn't available, you can use browser session management in Puppeteer to maintain authentication across scraping sessions:

Puppeteer Session Management

const puppeteer = require('puppeteer');
const fs = require('fs').promises;

class TwoFactorScraper {
    constructor() {
        this.browser = null;
        this.page = null;
        this.sessionFile = './session-cookies.json';
    }

    async initialize() {
        this.browser = await puppeteer.launch({
            headless: false, // Keep visible for manual 2FA
            userDataDir: './user-data', // Persist browser data
        });
        this.page = await this.browser.newPage();

        // Load saved session if available
        await this.loadSession();
    }

    async loadSession() {
        try {
            const cookiesString = await fs.readFile(this.sessionFile);
            const cookies = JSON.parse(cookiesString);
            await this.page.setCookie(...cookies);
            console.log('Session loaded successfully');
        } catch (error) {
            console.log('No existing session found, will need to authenticate');
        }
    }

    async saveSession() {
        const cookies = await this.page.cookies();
        await fs.writeFile(this.sessionFile, JSON.stringify(cookies, null, 2));
        console.log('Session saved successfully');
    }

    async login(username, password) {
        await this.page.goto('https://example.com/login');

        // Fill in credentials
        await this.page.type('#username', username);
        await this.page.type('#password', password);
        await this.page.click('#login-button');

        // Wait for the 2FA prompt; a timeout here means no 2FA was requested
        const twoFactorPrompt = await this.page
            .waitForSelector('#two-factor-code', { timeout: 5000 })
            .catch(() => null);

        if (twoFactorPrompt) {
            console.log('2FA required. Please complete authentication manually...');

            // Wait for successful login (redirect or specific element)
            await this.page.waitForSelector('#dashboard', { timeout: 60000 });
            console.log('Authentication successful!');
        } else {
            console.log('No 2FA prompt detected, continuing...');
        }

        // Save session for future use
        await this.saveSession();
    }

    async scrapeData() {
        // Navigate to data page
        await this.page.goto('https://example.com/protected-data');

        // Check if still authenticated
        const isLoggedIn = await this.page.$('#dashboard') !== null;
        if (!isLoggedIn) {
            throw new Error('Authentication expired, please re-authenticate');
        }

        // Extract data
        const data = await this.page.evaluate(() => {
            const elements = document.querySelectorAll('.data-item');
            return Array.from(elements).map(el => ({
                title: el.querySelector('.title')?.textContent?.trim(),
                value: el.querySelector('.value')?.textContent?.trim(),
                date: el.querySelector('.date')?.textContent?.trim()
            }));
        });

        return data;
    }

    async close() {
        if (this.browser) {
            await this.browser.close();
        }
    }
}

// Usage
async function main() {
    const scraper = new TwoFactorScraper();

    try {
        await scraper.initialize();

        // Attempt to scrape (will prompt for login if needed)
        try {
            const data = await scraper.scrapeData();
            console.log('Scraped data:', data);
        } catch (error) {
            // If scraping fails, try logging in
            console.log('Authentication required...');
            await scraper.login('your-username', 'your-password');

            // Retry scraping
            const data = await scraper.scrapeData();
            console.log('Scraped data:', data);
        }

    } catch (error) {
        console.error('Scraping failed:', error);
    } finally {
        await scraper.close();
    }
}

main();
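One failure mode with saved sessions is restoring cookies that have already expired, which can leave the site in a half-authenticated state. A small helper (assuming Puppeteer's cookie shape, where `expires` is a Unix timestamp in seconds and `-1` marks session cookies) to prune stale entries before calling `setCookie`:

```javascript
// Drop cookies whose expiry time has already passed.
// Session cookies (expires === -1 or missing) are always kept.
function pruneExpiredCookies(cookies, nowSeconds = Date.now() / 1000) {
    return cookies.filter(
        (c) => c.expires === undefined || c.expires === -1 || c.expires > nowSeconds
    );
}
```

You could call this inside `loadSession()` on the parsed cookie array before passing it to `this.page.setCookie(...)`.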

Approach 3: TOTP Automation (For Owned Applications)

If you control the 2FA setup, you can automate TOTP generation:

JavaScript TOTP Implementation

const speakeasy = require('speakeasy');
const puppeteer = require('puppeteer');

class TOTPScraper {
    constructor(totpSecret) {
        this.totpSecret = totpSecret;
        this.browser = null;
        this.page = null;
    }

    generateTOTP() {
        return speakeasy.totp({
            secret: this.totpSecret,
            encoding: 'base32'
        });
    }

    async initialize() {
        this.browser = await puppeteer.launch({ headless: true });
        this.page = await this.browser.newPage();
    }

    async loginWithTOTP(username, password) {
        await this.page.goto('https://example.com/login');

        // Standard login
        await this.page.type('#username', username);
        await this.page.type('#password', password);
        await this.page.click('#login-button');

        // Wait for 2FA prompt
        await this.page.waitForSelector('#totp-code');

        // Generate and enter TOTP
        const totpCode = this.generateTOTP();
        await this.page.type('#totp-code', totpCode);
        await this.page.click('#verify-button');

        // Wait for successful authentication
        await this.page.waitForSelector('#dashboard');
        console.log('Successfully authenticated with TOTP');
    }

    async scrapeProtectedData() {
        await this.page.goto('https://example.com/protected-data');

        const data = await this.page.evaluate(() => {
            return {
                title: document.querySelector('h1')?.textContent,
                content: document.querySelector('.content')?.textContent,
                timestamp: new Date().toISOString()
            };
        });

        return data;
    }

    async close() {
        if (this.browser) {
            await this.browser.close();
        }
    }
}

// Usage
async function automatedTOTPScraping() {
    // Note: Only use this for applications you own
    const totpSecret = process.env.TOTP_SECRET; // Your TOTP secret
    const scraper = new TOTPScraper(totpSecret);

    try {
        await scraper.initialize();
        await scraper.loginWithTOTP('username', 'password');
        const data = await scraper.scrapeProtectedData();
        console.log('Scraped data:', data);
        return data;
    } catch (error) {
        console.error('TOTP scraping failed:', error);
        throw error;
    } finally {
        await scraper.close();
    }
}
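A subtle failure mode with automated TOTP: a code generated one second before the 30-second window rolls over may have expired by the time the form is submitted. A small guard that waits for a fresh window when too little validity remains (function names here are illustrative):

```javascript
// Seconds left in the current TOTP window
function secondsRemaining(unixSeconds = Math.floor(Date.now() / 1000), step = 30) {
    return step - (Math.floor(unixSeconds) % step);
}

// If the current code is about to expire, sleep into the next window
async function waitForFreshTotpWindow(minValiditySeconds = 3, step = 30) {
    const remaining = secondsRemaining(undefined, step);
    if (remaining < minValiditySeconds) {
        await new Promise((resolve) => setTimeout(resolve, remaining * 1000));
    }
}
```

Calling `await waitForFreshTotpWindow()` just before `generateTOTP()` makes the login step noticeably more reliable.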

Python TOTP Example

import pyotp
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

class TOTPScraper:
    def __init__(self, totp_secret):
        self.totp_secret = totp_secret
        self.driver = None

    def generate_totp(self):
        """Generate current TOTP code"""
        totp = pyotp.TOTP(self.totp_secret)
        return totp.now()

    def initialize_driver(self):
        """Initialize Chrome driver"""
        options = webdriver.ChromeOptions()
        options.add_argument('--headless')
        self.driver = webdriver.Chrome(options=options)

    def login_with_totp(self, username, password):
        """Login using username, password, and TOTP"""
        self.driver.get('https://example.com/login')

        # Enter credentials
        self.driver.find_element(By.ID, 'username').send_keys(username)
        self.driver.find_element(By.ID, 'password').send_keys(password)
        self.driver.find_element(By.ID, 'login-button').click()

        # Wait for 2FA prompt
        WebDriverWait(self.driver, 10).until(
            EC.presence_of_element_located((By.ID, 'totp-code'))
        )

        # Generate and enter TOTP
        totp_code = self.generate_totp()
        self.driver.find_element(By.ID, 'totp-code').send_keys(totp_code)
        self.driver.find_element(By.ID, 'verify-button').click()

        # Wait for successful login
        WebDriverWait(self.driver, 10).until(
            EC.presence_of_element_located((By.ID, 'dashboard'))
        )
        print("Successfully authenticated with TOTP")

    def scrape_data(self):
        """Scrape protected data"""
        self.driver.get('https://example.com/protected-data')

        # Extract data
        data = {
            'title': self.driver.find_element(By.TAG_NAME, 'h1').text,
            'content': self.driver.find_element(By.CLASS_NAME, 'content').text,
            'timestamp': time.time()
        }

        return data

    def close(self):
        """Clean up resources"""
        if self.driver:
            self.driver.quit()

# Usage
def main():
    totp_secret = 'YOUR_TOTP_SECRET'  # Base32 encoded secret
    scraper = TOTPScraper(totp_secret)

    try:
        scraper.initialize_driver()
        scraper.login_with_totp('username', 'password')
        data = scraper.scrape_data()
        print(f"Scraped data: {data}")
    except Exception as e:
        print(f"Error: {e}")
    finally:
        scraper.close()

if __name__ == "__main__":
    main()

Approach 4: Email-Based 2FA Automation

For email-based 2FA, you can automate email reading:

const Imap = require('imap');
const { simpleParser } = require('mailparser');
const puppeteer = require('puppeteer'); // used in the login workflow below

class EmailTwoFactorScraper {
    constructor(emailConfig) {
        this.emailConfig = emailConfig;
        this.imap = new Imap(emailConfig);
    }

    async getLatestVerificationCode() {
        return new Promise((resolve, reject) => {
            this.imap.once('ready', () => {
                this.imap.openBox('INBOX', false, (err, box) => {
                    if (err) return reject(err);

                    // Search for recent emails from the service
                    const searchCriteria = [
                        'UNSEEN',
                        ['FROM', 'noreply@example.com'],
                        ['SUBJECT', 'verification code']
                    ];

                    this.imap.search(searchCriteria, (err, results) => {
                        if (err) return reject(err);
                        if (!results || !results.length) {
                            return reject(new Error('No verification emails found'));
                        }

                        const fetch = this.imap.fetch(results.slice(-1), {
                            bodies: ''
                        });

                        fetch.on('message', (msg) => {
                            msg.on('body', (stream) => {
                                simpleParser(stream, (err, parsed) => {
                                    if (err) return reject(err);

                                    // Extract verification code from email
                                    const codeMatch = (parsed.text || '').match(/verification code:?\s*(\d{6})/i);
                                    if (codeMatch) {
                                        resolve(codeMatch[1]);
                                    } else {
                                        reject(new Error('Verification code not found in email'));
                                    }
                                });
                            });
                        });

                        fetch.once('end', () => {
                            this.imap.end();
                        });
                    });
                });
            });

            this.imap.once('error', reject);
            this.imap.connect();
        });
    }
}

// Usage in scraping workflow
async function scrapeWithEmailTwoFactor() {
    const emailConfig = {
        user: 'your-email@gmail.com',
        password: 'your-app-password',
        host: 'imap.gmail.com',
        port: 993,
        tls: true
    };

    const browser = await puppeteer.launch({ headless: false });
    const page = await browser.newPage();

    try {
        // Login process
        await page.goto('https://example.com/login');
        await page.type('#username', 'your-username');
        await page.type('#password', 'your-password');
        await page.click('#login-button');

        // Wait for 2FA prompt
        await page.waitForSelector('#email-code-input');

        // Get verification code from email
        const emailScraper = new EmailTwoFactorScraper(emailConfig);
        const verificationCode = await emailScraper.getLatestVerificationCode();

        // Enter verification code
        await page.type('#email-code-input', verificationCode);
        await page.click('#verify-email-button');

        // Wait for successful login
        await page.waitForSelector('#dashboard');

        // Now proceed with scraping
        const data = await page.evaluate(() => {
            // Extract your data here
            return document.querySelector('.data-container').textContent;
        });

        console.log('Successfully scraped data:', data);
        return data;

    } catch (error) {
        console.error('Email 2FA scraping failed:', error);
        throw error;
    } finally {
        await browser.close();
    }
}
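The single regex in `getLatestVerificationCode` assumes one exact phrasing, but services word these emails differently. A more tolerant extractor that tries several common patterns (the patterns are illustrative, not exhaustive, so check them against your target service's actual emails):

```javascript
// Try several common phrasings for a numeric verification code;
// returns the code string, or null if nothing matched.
function extractVerificationCode(emailText) {
    const patterns = [
        /verification code:?\s*(\d{4,8})/i,
        /your code is:?\s*(\d{4,8})/i,
        // bare 6-digit number followed (eventually) by "expires" or "valid"
        /\b(\d{6})\b(?=[^\d]*(?:expires|valid))/i,
    ];
    for (const pattern of patterns) {
        const match = emailText.match(pattern);
        if (match) return match[1];
    }
    return null;
}
```

Returning `null` instead of throwing also lets the caller decide whether to poll the inbox again.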

Best Practices and Considerations

Security and Legal Considerations

  1. Only scrape sites you own or have explicit permission to access
  2. Store credentials securely using environment variables
  3. Use encrypted storage for session data
  4. Implement proper error handling and logging
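On point 2, failing fast when a credential is absent beats discovering it mid-run as a confusing login error. A minimal guard (the helper name is hypothetical):

```javascript
// Throw immediately if any required environment variable is unset,
// and return the values in the order requested.
function requireEnv(...names) {
    const missing = names.filter((name) => !process.env[name]);
    if (missing.length > 0) {
        throw new Error(`Missing required environment variables: ${missing.join(', ')}`);
    }
    return names.map((name) => process.env[name]);
}
```

For example, `const [user, appPassword] = requireEnv('USERNAME', 'APP_PASSWORD');` at startup surfaces configuration problems before any browser is launched.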

Rate Limiting and Politeness

// page.waitForTimeout was removed in recent Puppeteer versions;
// a plain Promise-based sleep works everywhere
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Implement delays between requests
await sleep(2000); // Wait 2 seconds

// Use random delays to appear more human-like
const randomDelay = Math.floor(Math.random() * 3000) + 1000; // 1000-3999 ms
await sleep(randomDelay);

Error Handling and Recovery

async function robustScrapeWithRetry(maxRetries = 3) {
    for (let attempt = 1; attempt <= maxRetries; attempt++) {
        try {
            return await scrapeWithTwoFactor();
        } catch (error) {
            console.log(`Attempt ${attempt} failed: ${error.message}`);

            if (attempt === maxRetries) {
                throw new Error(`Scraping failed after ${maxRetries} attempts`);
            }

            // Wait before retry
            await new Promise(resolve => setTimeout(resolve, 5000 * attempt));
        }
    }
}

Session Persistence

Implement proper session management to avoid repeated 2FA challenges:

// Save session state
const cookies = await page.cookies();
const localStorageData = await page.evaluate(() => JSON.stringify(localStorage));
const sessionStorageData = await page.evaluate(() => JSON.stringify(sessionStorage));

// Restore session state (using the values captured above)
await page.setCookie(...cookies);
await page.evaluate((data) => {
    for (const [key, value] of Object.entries(JSON.parse(data))) {
        localStorage.setItem(key, value);
    }
}, localStorageData);
await page.evaluate((data) => {
    for (const [key, value] of Object.entries(JSON.parse(data))) {
        sessionStorage.setItem(key, value);
    }
}, sessionStorageData);

Alternative Solutions

Using Dedicated Scraping Services

Consider using specialized web scraping APIs that handle authentication complexities for you. Many services provide managed solutions for scraping protected content without dealing with 2FA directly.

API Integration

Always check if the target website offers an API with proper authentication mechanisms:

# Example API call with OAuth
curl -H "Authorization: Bearer YOUR_ACCESS_TOKEN" \
     -H "Content-Type: application/json" \
     https://api.example.com/v1/data

Monitoring and Maintenance

Implement monitoring to detect when authentication expires:

async function monitorAuthenticationStatus(page) {
    try {
        await page.goto('https://example.com/protected-endpoint');
        const isAuthenticated = await page.$('#user-dashboard') !== null;

        if (!isAuthenticated) {
            console.log('Authentication expired, re-authentication required');
            return false;
        }

        return true;
    } catch (error) {
        console.error('Authentication check failed:', error);
        return false;
    }
}

Conclusion

Scraping websites with two-factor authentication requires careful planning and implementation. The approaches outlined here range from using official APIs and application passwords (recommended) to automating TOTP generation and email verification (for owned applications only).

Remember to always respect website terms of service, implement proper error handling, and consider the legal and ethical implications of your scraping activities. For complex authentication flows, handling authentication in Puppeteer provides additional strategies for managing login processes effectively.

When possible, prefer official APIs and legitimate authentication methods over attempting to bypass security measures. This approach ensures better reliability, compliance, and long-term maintainability of your scraping solutions.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
