How do I handle browser crashes and timeouts in Selenium scraping?

Browser crashes and timeouts are common challenges in web scraping with Selenium. These issues can disrupt your scraping workflow and lead to data loss. This comprehensive guide covers proven strategies to handle these problems effectively, ensuring your scraping operations remain robust and reliable.

Understanding Browser Crashes and Timeouts

Browser crashes occur when the browser process terminates unexpectedly due to memory issues, JavaScript errors, or system resource constraints. Timeouts happen when operations take longer than expected, often due to slow network connections, heavy page loads, or unresponsive web elements.
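
In Selenium, these two failure modes usually surface as different exceptions. Below is a minimal sketch of how to tell them apart; the exact crash message varies by browser and driver version, but strings such as "chrome not reachable" are a common sign that the browser process died:

from selenium import webdriver
from selenium.common.exceptions import TimeoutException, WebDriverException

driver = webdriver.Chrome()
driver.set_page_load_timeout(30)

try:
    driver.get("https://example.com/slow-page")
except TimeoutException:
    # The operation exceeded its timeout; the session is usually still usable
    print("Page load timed out")
except WebDriverException as e:
    # Messages like "chrome not reachable" usually mean the browser process died
    print(f"Possible browser crash: {e}")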

Setting Up Proper Timeout Configuration

Page Load Timeouts

Configure appropriate timeout values to prevent indefinite waiting:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, WebDriverException

# Python example
driver = webdriver.Chrome()

# Set page load timeout (30 seconds)
driver.set_page_load_timeout(30)

# Set implicit wait (10 seconds)
driver.implicitly_wait(10)

# Set script timeout (15 seconds)
driver.set_script_timeout(15)

// JavaScript example
const { Builder, By, until } = require('selenium-webdriver');

async function setupTimeouts() {
    const driver = await new Builder().forBrowser('chrome').build();

    // Set page load timeout (30 seconds)
    await driver.manage().setTimeouts({
        pageLoad: 30000,
        implicit: 10000,
        script: 15000
    });

    return driver;
}

Element Wait Strategies

Use explicit waits instead of implicit waits for better control:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def wait_for_element(driver, locator, timeout=10):
    """Wait for element to be present and visible"""
    try:
        element = WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located(locator)
        )
        return element
    except TimeoutException:
        print(f"Element not found within {timeout} seconds")
        return None

# Usage
element = wait_for_element(driver, (By.ID, "dynamic-content"))

Implementing Crash Recovery Mechanisms

Driver Recovery with Retry Logic

Create a robust driver management system with automatic recovery:

import time
from selenium import webdriver
from selenium.common.exceptions import WebDriverException, TimeoutException

class RobustWebDriver:
    def __init__(self, max_retries=3):
        self.max_retries = max_retries
        self.driver = None
        self.init_driver()

    def init_driver(self):
        """Initialize WebDriver with proper configuration"""
        options = webdriver.ChromeOptions()
        options.add_argument('--no-sandbox')
        options.add_argument('--disable-dev-shm-usage')
        options.add_argument('--disable-gpu')
        options.add_argument('--remote-debugging-port=9222')

        self.driver = webdriver.Chrome(options=options)
        self.driver.set_page_load_timeout(30)
        self.driver.implicitly_wait(10)

    def safe_get(self, url, retries=0):
        """Navigate to URL with crash recovery"""
        try:
            self.driver.get(url)
            return True
        except (WebDriverException, TimeoutException) as e:
            print(f"Error loading {url}: {e}")

            if retries < self.max_retries:
                print(f"Retrying... ({retries + 1}/{self.max_retries})")
                self.recover_driver()
                time.sleep(2)
                return self.safe_get(url, retries + 1)
            else:
                print(f"Failed to load {url} after {self.max_retries} retries")
                return False

    def recover_driver(self):
        """Recover from driver crash"""
        try:
            self.driver.quit()
        except Exception:
            pass  # driver may already be gone

        time.sleep(5)  # Wait before reinitializing
        self.init_driver()

    def quit(self):
        """Safely quit driver"""
        try:
            self.driver.quit()
        except Exception:
            pass

# Usage
robust_driver = RobustWebDriver()
success = robust_driver.safe_get("https://example.com")

JavaScript Error Handling

Monitor and handle JavaScript errors that might cause crashes:

def check_browser_logs(driver):
    """Check for JavaScript errors in browser console"""
    try:
        logs = driver.get_log('browser')
        for log in logs:
            if log['level'] == 'SEVERE':
                print(f"JavaScript error: {log['message']}")
        return len([log for log in logs if log['level'] == 'SEVERE']) == 0
    except Exception:
        return True  # Log retrieval not supported or failed; assume no errors
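
Note that get_log('browser') is Chrome-specific, and with Selenium 4 it generally only returns console entries if the logging capability was enabled when the driver was created. A minimal sketch, assuming Chrome:

from selenium import webdriver

options = webdriver.ChromeOptions()
# Expose browser console messages so get_log('browser') has data to return
options.set_capability('goog:loggingPrefs', {'browser': 'ALL'})

driver = webdriver.Chrome(options=options)
driver.get("https://example.com")
print(check_browser_logs(driver))  # uses the helper defined above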

Memory Management and Resource Optimization

Browser Options for Stability

Configure Chrome/Firefox options to prevent crashes:

def get_stable_chrome_options():
    """Get Chrome options optimized for stability"""
    options = webdriver.ChromeOptions()

    # Memory management
    options.add_argument('--js-flags=--max-old-space-size=4096')  # Raise V8 heap limit
    options.add_argument('--memory-pressure-off')
    options.add_argument('--disable-background-timer-throttling')

    # Disable problematic features
    options.add_argument('--disable-extensions')
    options.add_argument('--disable-plugins')
    options.add_argument('--blink-settings=imagesEnabled=false')  # Optional: skip images for faster loading
    # JavaScript can be disabled via prefs if it is genuinely not needed:
    # options.add_experimental_option('prefs', {'profile.managed_default_content_settings.javascript': 2})

    # Stability improvements
    options.add_argument('--no-sandbox')
    options.add_argument('--disable-dev-shm-usage')
    options.add_argument('--disable-gpu')
    options.add_argument('--single-process')  # Use with caution

    return options

Periodic Driver Restart

Implement periodic driver restarts to prevent memory leaks:

class PeriodicRestartDriver:
    def __init__(self, restart_interval=100):
        self.restart_interval = restart_interval
        self.request_count = 0
        self.driver = None
        self.init_driver()

    def init_driver(self):
        options = get_stable_chrome_options()
        self.driver = webdriver.Chrome(options=options)
        self.request_count = 0

    def get_with_restart(self, url):
        """Get URL with periodic restart"""
        if self.request_count >= self.restart_interval:
            print("Restarting driver for maintenance...")
            self.driver.quit()
            time.sleep(5)
            self.init_driver()

        self.driver.get(url)
        self.request_count += 1

Advanced Error Handling Patterns

Comprehensive Exception Handling

Create a robust exception handling system:

import time
from functools import wraps

from selenium.common.exceptions import (
    WebDriverException, TimeoutException, NoSuchElementException,
    StaleElementReferenceException, ElementNotInteractableException
)

def handle_selenium_exceptions(func):
    """Decorator for handling Selenium exceptions"""
    @wraps(func)
    def wrapper(*args, **kwargs):
        max_retries = 3
        for attempt in range(max_retries):
            try:
                return func(*args, **kwargs)
            except TimeoutException:
                print(f"Timeout on attempt {attempt + 1}")
                if attempt == max_retries - 1:
                    raise
                time.sleep(2)
            except StaleElementReferenceException:
                print("Stale element reference, retrying...")
                if attempt == max_retries - 1:
                    raise
                time.sleep(1)
            except WebDriverException as e:
                print(f"WebDriver error: {e}")
                if "chrome not reachable" in str(e).lower():
                    # Browser crashed, need to restart
                    raise
                if attempt == max_retries - 1:
                    raise
                time.sleep(2)
    return wrapper

@handle_selenium_exceptions
def scrape_element(driver, selector):
    """Scrape element with error handling"""
    element = driver.find_element(By.CSS_SELECTOR, selector)
    return element.text

Monitoring and Logging

Health Check Implementation

Implement health checks to detect potential issues:

import psutil
import logging

class DriverHealthMonitor:
    def __init__(self, driver):
        self.driver = driver
        self.logger = logging.getLogger(__name__)

    def check_driver_health(self):
        """Check if driver is healthy"""
        try:
            # Check if driver is responsive
            self.driver.current_url

            # Check memory usage
            memory_usage = psutil.virtual_memory().percent
            if memory_usage > 90:
                self.logger.warning(f"High memory usage: {memory_usage}%")
                return False

            # Check for zombie processes
            chrome_processes = [p for p in psutil.process_iter(['name'])
                                if p.info['name'] and 'chrome' in p.info['name'].lower()]
            if len(chrome_processes) > 10:
                self.logger.warning(f"Too many Chrome processes: {len(chrome_processes)}")
                return False

            return True
        except Exception:
            return False
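
One way to wire the monitor into the RobustWebDriver class from earlier is to run a health check before each request and reinitialize the driver whenever the check fails. The URLs below are placeholders:

robust_driver = RobustWebDriver()
monitor = DriverHealthMonitor(robust_driver.driver)

for url in ["https://example.com/page1", "https://example.com/page2"]:
    if not monitor.check_driver_health():
        robust_driver.recover_driver()
        monitor = DriverHealthMonitor(robust_driver.driver)  # re-attach to the new driver instance
    robust_driver.safe_get(url)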

Production-Ready Implementation

Complete Scraping Framework

Here's a production-ready framework combining all strategies:

import logging
import time
from contextlib import contextmanager
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

class ProductionSeleniumScraper:
    def __init__(self, headless=True, max_retries=3):
        self.headless = headless
        self.max_retries = max_retries
        self.driver = None
        self.logger = logging.getLogger(__name__)

    @contextmanager
    def managed_driver(self):
        """Context manager for safe driver usage"""
        try:
            self.init_driver()
            yield self.driver
        finally:
            self.cleanup()

    def init_driver(self):
        """Initialize driver with optimal settings"""
        options = webdriver.ChromeOptions()
        if self.headless:
            options.add_argument('--headless')

        options.add_argument('--no-sandbox')
        options.add_argument('--disable-dev-shm-usage')
        options.add_argument('--disable-gpu')
        options.add_argument('--window-size=1920,1080')

        self.driver = webdriver.Chrome(options=options)
        self.driver.set_page_load_timeout(30)

    def safe_scrape(self, url, scrape_function):
        """Safely execute scraping function with retry logic"""
        for attempt in range(self.max_retries):
            try:
                with self.managed_driver() as driver:
                    driver.get(url)
                    return scrape_function(driver)
            except Exception as e:
                self.logger.error(f"Scraping attempt {attempt + 1} failed: {e}")
                if attempt == self.max_retries - 1:
                    raise
                time.sleep(5)  # Wait before retry

    def cleanup(self):
        """Clean up resources"""
        if self.driver:
            try:
                self.driver.quit()
            except Exception:
                pass
            self.driver = None

# Usage example
def scrape_page_content(driver):
    """Example scraping function"""
    wait = WebDriverWait(driver, 10)
    content = wait.until(EC.presence_of_element_located((By.TAG_NAME, "body")))
    return content.text

scraper = ProductionSeleniumScraper()
result = scraper.safe_scrape("https://example.com", scrape_page_content)

Command Line Tools for Debugging

Use these commands to monitor and debug Selenium processes:

# Check Chrome processes
ps aux | grep chrome

# Monitor memory usage
top -p $(pgrep -d',' chrome)

# Kill zombie Chrome processes
pkill -f chrome

# Check available memory
free -h
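
If you prefer to do this cleanup from Python, here is a rough equivalent of pkill -f chrome using psutil (already used above). The helper name is illustrative, and killing by process name is blunt, so use it carefully on shared machines:

import psutil

def kill_orphaned_browsers(name_fragment="chrome"):
    """Terminate leftover browser/driver processes whose name contains the fragment."""
    for proc in psutil.process_iter(['name']):
        try:
            if proc.info['name'] and name_fragment in proc.info['name'].lower():
                proc.kill()
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            pass  # process already exited or is not ours to kill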

Best Practices for Timeout Management

  1. Use appropriate timeout values: Set realistic timeouts based on your target websites
  2. Implement exponential backoff: Increase wait times between retries (see the sketch after this list)
  3. Monitor resource usage: Keep track of memory and CPU usage
  4. Use headless mode: Reduce resource consumption when possible
  5. Clean up resources: Always properly close drivers and browsers
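
For item 2, a minimal exponential backoff sketch; the helper name and delay values are illustrative:

import random
import time

from selenium.common.exceptions import TimeoutException, WebDriverException

def get_with_backoff(driver, url, max_retries=3, base_delay=2):
    """Retry driver.get, roughly doubling the wait between attempts (plus jitter)."""
    for attempt in range(max_retries):
        try:
            driver.get(url)
            return True
        except (TimeoutException, WebDriverException) as exc:
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            print(f"Attempt {attempt + 1} failed ({exc}); retrying in {delay:.1f}s")
            time.sleep(delay)
    return False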

Alternatives and Complementary Tools

While Selenium is powerful, consider these alternatives for specific use cases:

  • Puppeteer: For JavaScript-heavy sites with better timeout handling capabilities
  • Playwright: Modern alternative with built-in auto-waiting and timeout handling (see the sketch after this list)
  • Requests + BeautifulSoup: For simple scraping without browser automation
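
For comparison, here is a minimal sketch of the same timeout handling in Playwright's Python API (install with pip install playwright, then playwright install; the values are illustrative):

from playwright.sync_api import sync_playwright, TimeoutError as PlaywrightTimeoutError

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.set_default_timeout(10_000)             # default for element operations (ms)
    page.set_default_navigation_timeout(30_000)  # default for page loads (ms)
    try:
        page.goto("https://example.com")
        print(page.title())
    except PlaywrightTimeoutError:
        print("Navigation timed out")
    finally:
        browser.close()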

When dealing with complex authentication flows, you might also want to learn about handling authentication in Puppeteer for comparison.

Conclusion

Handling browser crashes and timeouts in Selenium requires a multi-layered approach combining proper configuration, error handling, resource management, and monitoring. By implementing the strategies outlined in this guide, you can build robust scraping systems that gracefully handle failures and maintain high reliability.

Remember to always test your error handling mechanisms and monitor your scraping operations in production to identify and address potential issues before they impact your data collection efforts.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
