How do I configure timeouts for page loads in Headless Chromium?
Configuring page-load timeouts in Headless Chromium is essential for robust web scraping and automation. With timeouts that are missing, disabled, or badly chosen, an application can hang on a slow or unresponsive page or fail in ways that are hard to diagnose. This guide covers timeout configuration in Puppeteer, Playwright, the Chrome DevTools Protocol, and Pyppeteer, along with retry and error-handling patterns.
Understanding Timeout Types
Before diving into implementation, it's important to understand the different types of timeouts available in Headless Chromium:
- Navigation Timeout: Maximum time to wait for page navigation to complete
- Load Timeout: Time to wait for the load event to fire
- Wait Timeout: Custom timeout for specific elements or conditions
- Request Timeout: Timeout for individual network requests
- Script Timeout: Maximum execution time for JavaScript evaluation
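Most of these timeout types come down to the same pattern: race an operation against a timer. The sketch below is a generic helper in plain Node.js, not part of Puppeteer or Playwright; the name `withTimeout` and its `label` parameter are illustrative. It can be handy for calls that take no timeout option of their own, such as script evaluation with `page.evaluate()`.

```javascript
// Reject with a TimeoutError-style error if `promise` does not settle within `ms`.
// Note: this only stops waiting; it does not cancel the underlying operation.
function withTimeout(promise, ms, label = 'operation') {
  let timer;
  const deadline = new Promise((_, reject) => {
    timer = setTimeout(() => {
      const error = new Error(`${label} timed out after ${ms} ms`);
      error.name = 'TimeoutError';
      reject(error);
    }, ms);
  });
  // Whichever settles first wins; always clear the timer afterwards.
  return Promise.race([promise, deadline]).finally(() => clearTimeout(timer));
}

// Demo: a task that needs 200 ms is given only 50 ms.
const slowTask = new Promise((resolve) => setTimeout(resolve, 200, 'done'));
withTimeout(slowTask, 50, 'slow task')
  .catch((error) => console.error(`${error.name}: ${error.message}`));
// prints "TimeoutError: slow task timed out after 50 ms"

// Hypothetical usage with a Puppeteer page (not executed here):
//   const title = await withTimeout(page.evaluate(() => document.title), 5000, 'evaluate');
```

Because the helper cannot cancel the wrapped operation, it is best combined with closing the page or browser once the timeout fires.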
Configuring Timeouts with Puppeteer
Puppeteer is one of the most popular libraries for controlling Headless Chromium. Here's how to configure various timeouts:
Setting Default Navigation Timeout
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({
    headless: true,
    args: ['--no-sandbox', '--disable-setuid-sandbox']
  });
  const page = await browser.newPage();

  // Default for navigation methods (goto, reload, waitForNavigation): 30 seconds
  page.setDefaultNavigationTimeout(30000);
  // Default for all other waiting operations (waitForSelector, waitForFunction): 60 seconds
  page.setDefaultTimeout(60000);

  try {
    await page.goto('https://example.com');
    console.log('Page loaded successfully');
  } catch (error) {
    console.error('Navigation failed:', error.message);
  } finally {
    await browser.close();
  }
})();
Per-Navigation Timeout Configuration
// Configure timeout for a specific navigation
await page.goto('https://slow-loading-site.com', {
  waitUntil: 'networkidle0',
  timeout: 45000 // 45 seconds
});

// Different wait conditions with timeouts
await page.goto('https://example.com', {
  waitUntil: 'domcontentloaded',
  timeout: 15000
});
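When several of these settings are combined, the most specific one wins: a per-call `timeout` overrides `setDefaultNavigationTimeout()`, which in turn takes priority over `setDefaultTimeout()`, and Puppeteer falls back to 30 seconds when nothing is set. The following sketch models that precedence as a plain function. It is an illustration only, not Puppeteer's internal code, and the option names `perCall`, `defaultNavigation` and `defaultGeneral` are made up for the example.

```javascript
const BUILT_IN_DEFAULT_MS = 30000; // Puppeteer's documented default

// Return the timeout that would apply to a navigation, most specific first.
function resolveNavigationTimeout({ perCall, defaultNavigation, defaultGeneral } = {}) {
  for (const candidate of [perCall, defaultNavigation, defaultGeneral]) {
    // 0 is a valid value meaning "no timeout", so test for undefined, not falsiness.
    if (candidate !== undefined) return candidate;
  }
  return BUILT_IN_DEFAULT_MS;
}

console.log(resolveNavigationTimeout({ perCall: 45000, defaultNavigation: 30000 }));        // 45000
console.log(resolveNavigationTimeout({ defaultNavigation: 30000, defaultGeneral: 60000 })); // 30000
console.log(resolveNavigationTimeout({ perCall: 0, defaultGeneral: 60000 }));               // 0
console.log(resolveNavigationTimeout());                                                    // 30000
```

The third call shows why a timeout of 0 deserves care: it disables the timeout entirely rather than failing immediately.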
Advanced Timeout Configurations
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Configure multiple timeout types
  page.setDefaultNavigationTimeout(30000);
  page.setDefaultTimeout(60000);

  // Puppeteer has no per-request timeout, and an intercepted request cannot be
  // aborted once request.continue() has been called, so log slow requests instead
  const slowRequestTimers = new Map();
  page.on('request', (request) => {
    slowRequestTimers.set(request, setTimeout(() => {
      console.warn('Request still pending after 10 s:', request.url());
    }, 10000));
  });
  const clearSlowRequestTimer = (request) => {
    clearTimeout(slowRequestTimers.get(request));
    slowRequestTimers.delete(request);
  };
  page.on('requestfinished', clearSlowRequestTimer);
  page.on('requestfailed', clearSlowRequestTimer);

  try {
    await page.goto('https://example.com');

    // Wait for element with custom timeout
    await page.waitForSelector('#content', {
      timeout: 20000,
      visible: true
    });

    // Wait for function with timeout
    await page.waitForFunction(
      () => document.readyState === 'complete',
      { timeout: 15000 }
    );
  } catch (error) {
    console.error('Timeout error:', error.message);
  } finally {
    // Cancel any timers still pending so the process can exit promptly
    slowRequestTimers.forEach((timer) => clearTimeout(timer));
    await browser.close();
  }
})();
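Per-step timeouts add up: in the example above, the navigation (30 s default), the selector wait (20 s) and the function wait (15 s) can take up to 65 seconds together in the worst case. If the whole sequence should fit within one budget, one option is to track a shared deadline and hand each step only the time that is left. The `Deadline` class below is a minimal sketch of that idea in plain Node.js; the class and its method names are hypothetical, not a library API.

```javascript
// Tracks one overall time budget across several sequential waits.
// `now` is injectable so the class can be exercised without real delays.
class Deadline {
  constructor(budgetMs, now = Date.now) {
    this.now = now;
    this.expiresAt = now() + budgetMs;
  }

  // Milliseconds left, capped at `stepMaxMs`. Throws once the budget is spent,
  // because passing 0 to Puppeteer would mean "wait forever".
  remaining(stepMaxMs = Infinity) {
    const left = this.expiresAt - this.now();
    if (left <= 0) {
      const error = new Error('Overall deadline exceeded');
      error.name = 'TimeoutError';
      throw error;
    }
    return Math.min(left, stepMaxMs);
  }
}

// Demo with a fake clock: a 40 s budget shared by two steps.
let fakeTime = 0;
const demoDeadline = new Deadline(40000, () => fakeTime);
console.log(demoDeadline.remaining(30000)); // 30000
fakeTime = 25000; // pretend the first step took 25 s
console.log(demoDeadline.remaining(20000)); // 15000

// Hypothetical usage with a Puppeteer page (not executed here):
//   const deadline = new Deadline(40000);
//   await page.goto(url, { timeout: deadline.remaining(30000) });
//   await page.waitForSelector('#content', { timeout: deadline.remaining(20000) });
```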
Configuring Timeouts with Playwright
Playwright offers similar timeout configuration options with some additional features:
Basic Timeout Configuration
const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch({ headless: true });
  const context = await browser.newContext();
  const page = await context.newPage();

  // Set page-level defaults
  page.setDefaultTimeout(30000);
  page.setDefaultNavigationTimeout(45000);

  try {
    await page.goto('https://example.com', {
      timeout: 60000,
      waitUntil: 'networkidle'
    });
    console.log('Page loaded successfully');
  } catch (error) {
    if (error.name === 'TimeoutError') {
      console.error('Page load timed out');
    } else {
      // Do not silently swallow non-timeout failures
      console.error('Navigation failed:', error.message);
    }
  }

  await browser.close();
})();
Context-Level Timeout Configuration
const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch();
  const context = await browser.newContext();

  // Context-level defaults apply to every page created in this context.
  // newContext() has no timeout option, so use the setters instead.
  context.setDefaultTimeout(30000);
  context.setDefaultNavigationTimeout(30000);

  const page = await context.newPage();

  // Override timeout for specific operations
  await page.goto('https://example.com', { timeout: 60000 });

  // Wait for load state with timeout
  await page.waitForLoadState('networkidle', { timeout: 20000 });

  await browser.close();
})();
Using Chrome DevTools Protocol Directly
For more granular control, you can use the Chrome DevTools Protocol directly:
const CDP = require('chrome-remote-interface');

(async () => {
  let client;
  let timeoutId;
  try {
    // Connects to a Chrome instance started with --remote-debugging-port=9222
    client = await CDP();
    const { Network, Page } = client;

    // Enable domains
    await Network.enable();
    await Page.enable();

    // Page.navigate has no timeout parameter, so race the load event against a timer.
    // Register the listener before navigating so a fast load is not missed.
    const loadPromise = new Promise((resolve, reject) => {
      timeoutId = setTimeout(() => {
        reject(new Error('Page load timeout'));
      }, 30000);
      Page.loadEventFired(() => {
        resolve();
      });
    });

    await Page.navigate({ url: 'https://example.com' });
    await loadPromise;
    console.log('Page loaded successfully');
  } catch (error) {
    console.error('Error:', error.message);
  } finally {
    clearTimeout(timeoutId);
    if (client) {
      await client.close();
    }
  }
})();
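The wait-for-event-or-timeout logic above can be factored into a reusable helper. The sketch below works with any Node.js `EventEmitter` and removes its listener when the timeout fires, so repeated waits do not leak listeners. The name `onceWithTimeout` is hypothetical, and using it with a chrome-remote-interface client assumes the client emits events under names such as `'Page.loadEventFired'`.

```javascript
const { EventEmitter } = require('events');

// Resolve with the event arguments, or reject if `eventName` does not fire within `ms`.
function onceWithTimeout(emitter, eventName, ms) {
  return new Promise((resolve, reject) => {
    const onEvent = (...args) => {
      clearTimeout(timer);
      resolve(args);
    };
    const timer = setTimeout(() => {
      // Remove the listener so a late event cannot trigger stale handlers.
      emitter.removeListener(eventName, onEvent);
      reject(new Error(`Timed out after ${ms} ms waiting for "${eventName}"`));
    }, ms);
    emitter.once(eventName, onEvent);
  });
}

// Demo with a local emitter standing in for a CDP client.
const demoEmitter = new EventEmitter();
onceWithTimeout(demoEmitter, 'Page.loadEventFired', 1000)
  .then(() => console.log('load event received'));
demoEmitter.emit('Page.loadEventFired', { timestamp: 0 });

// Hypothetical usage with chrome-remote-interface (not executed here):
//   const loaded = onceWithTimeout(client, 'Page.loadEventFired', 30000);
//   await Page.navigate({ url: 'https://example.com' });
//   await loaded;
```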
Python Implementation with Pyppeteer
For Python developers, here's how to configure timeouts using Pyppeteer:
import asyncio
from pyppeteer import launch
from pyppeteer.errors import TimeoutError


async def main():
    browser = await launch(headless=True)
    page = await browser.newPage()

    # Set the default navigation timeout. Pyppeteer follows an older Puppeteer
    # API and does not appear to provide page.setDefaultTimeout(), so pass
    # timeouts for other waits explicitly.
    page.setDefaultNavigationTimeout(30000)

    try:
        # Navigate with specific timeout
        await page.goto('https://example.com', {
            'timeout': 45000,
            'waitUntil': 'networkidle0'
        })

        # Wait for selector with timeout
        await page.waitForSelector('#content', {
            'timeout': 20000,
            'visible': True
        })
        print("Page loaded successfully")
    except TimeoutError as e:
        print(f"Timeout error: {e}")
    except Exception as e:
        print(f"Error: {e}")
    finally:
        await browser.close()


if __name__ == '__main__':
    asyncio.run(main())
Best Practices for Timeout Configuration
1. Use Appropriate Timeout Values
// Different timeout strategies for different scenarios
const timeoutConfigs = {
  fast: {
    navigation: 15000,
    default: 30000
  },
  normal: {
    navigation: 30000,
    default: 60000
  },
  slow: {
    navigation: 60000,
    default: 120000
  }
};

// Apply based on target website characteristics
const config = timeoutConfigs.normal;
page.setDefaultNavigationTimeout(config.navigation);
page.setDefaultTimeout(config.default);
2. Implement Retry Logic
async function navigateWithRetry(page, url, maxRetries = 3) {
  for (let i = 0; i < maxRetries; i++) {
    try {
      await page.goto(url, {
        timeout: 30000,
        waitUntil: 'networkidle2'
      });
      return; // Success
    } catch (error) {
      if (error.name === 'TimeoutError' && i < maxRetries - 1) {
        console.log(`Retry ${i + 1} for ${url}`);
        await new Promise(resolve => setTimeout(resolve, 2000));
        continue;
      }
      throw error; // Re-throw if not timeout or max retries reached
    }
  }
}
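The retry loop above waits a fixed 2 seconds between attempts. A common refinement is exponential backoff with jitter, which spaces retries further apart each time and avoids many workers retrying in lockstep. The function below is one possible sketch; the name `backoffDelay` and its defaults are illustrative, and the random source is injectable so the result can be checked deterministically.

```javascript
// Delay before retry number `attempt` (0-based): baseMs * 2^attempt, capped at maxMs.
// With "full jitter", the actual delay is chosen uniformly between 0 and that cap.
function backoffDelay(attempt, { baseMs = 2000, maxMs = 30000, random = Math.random } = {}) {
  const capped = Math.min(maxMs, baseMs * 2 ** attempt);
  return Math.round(random() * capped);
}

// Upper bounds per attempt (random fixed at 1 to show the cap being applied).
for (let attempt = 0; attempt < 5; attempt++) {
  console.log(attempt, backoffDelay(attempt, { random: () => 1 }));
}
// prints: 0 2000, 1 4000, 2 8000, 3 16000, 4 30000

// Hypothetical use inside the retry loop, replacing the fixed 2000 ms wait:
//   await new Promise(resolve => setTimeout(resolve, backoffDelay(i)));
```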
3. Monitor and Adjust Timeouts
async function adaptiveTimeout(page, url, baseTimeout = 30000) {
  const start = Date.now();
  try {
    await page.goto(url, { timeout: baseTimeout });
    const loadTime = Date.now() - start;

    // Adjust future timeouts based on performance
    if (loadTime > baseTimeout * 0.8) {
      console.log('Slow site detected, increasing timeout');
      page.setDefaultNavigationTimeout(baseTimeout * 1.5);
    }
  } catch (error) {
    if (error.name === 'TimeoutError') {
      console.log('Timeout occurred, consider increasing timeout value');
    }
    throw error;
  }
}
Handling Different Wait Conditions
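Choosing the right wait condition matters as much as the timeout value itself. The example below uses Playwright's waitForLoadState(); Puppeteer has no equivalent method, so there the condition is passed as the waitUntil option of page.goto() instead:
// Return as soon as the navigation is committed, then wait for each load state explicitly
await page.goto('https://example.com', { waitUntil: 'commit' });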
// Wait for DOM content loaded
await page.waitForLoadState('domcontentloaded', { timeout: 15000 });
// Wait for all resources including images
await page.waitForLoadState('load', { timeout: 30000 });
// Wait for network to be idle
await page.waitForLoadState('networkidle', { timeout: 45000 });
Command Line Configuration
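Chrome's command-line flags do not set Puppeteer- or Playwright-style timeouts, but in headless mode --virtual-time-budget bounds how long, in virtual milliseconds, Chrome waits before treating the page as ready, for example when dumping the DOM:
# Dump the DOM of a page using a 30-second virtual time budget
google-chrome-stable \
  --headless \
  --no-sandbox \
  --disable-setuid-sandbox \
  --virtual-time-budget=30000 \
  --run-all-compositor-stages-before-draw \
  --disable-background-timer-throttling \
  --disable-renderer-backgrounding \
  --disable-backgrounding-occluded-windows \
  --dump-dom https://example.com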
Error Handling and Debugging
Proper error handling helps identify timeout-related issues:
async function handlePageLoad(page, url) {
  try {
    await page.goto(url, {
      timeout: 30000,
      waitUntil: 'networkidle0'
    });
  } catch (error) {
    if (error.name === 'TimeoutError') {
      console.error(`Timeout loading ${url}:`);
      console.error(`- Check if the site is responsive`);
      console.error(`- Consider increasing timeout value`);
      console.error(`- Verify network connectivity`);
    }

    // Log additional debugging information
    const metrics = await page.metrics();
    console.log('Page metrics:', metrics);
    throw error;
  }
}
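Not every navigation failure is a timeout. Chromium network failures typically surface as errors whose message contains a `net::ERR_...` code, and raising the timeout does nothing for those. The sketch below sorts errors into coarse categories so calling code can decide whether a retry is worthwhile. The categories, the list of transient codes and the function name are assumptions made for illustration, not part of Puppeteer.

```javascript
// Codes treated as transient here; adjust to your own experience with the target sites.
const TRANSIENT_NET_ERRORS = [
  'ERR_CONNECTION_RESET',
  'ERR_CONNECTION_TIMED_OUT',
  'ERR_TIMED_OUT',
  'ERR_NETWORK_CHANGED',
];

// Map a navigation error to a coarse category plus a retry hint.
function classifyNavigationError(error) {
  if (error?.name === 'TimeoutError') {
    return { kind: 'timeout', retry: true };
  }
  const match = String(error?.message ?? '').match(/net::(ERR_[A-Z0-9_]+)/);
  if (match) {
    return { kind: 'network', code: match[1], retry: TRANSIENT_NET_ERRORS.includes(match[1]) };
  }
  return { kind: 'other', retry: false };
}

console.log(classifyNavigationError(new Error('net::ERR_NAME_NOT_RESOLVED at https://example.invalid')));
// prints { kind: 'network', code: 'ERR_NAME_NOT_RESOLVED', retry: false }
```

A DNS failure such as the one in the demo is unlikely to succeed on retry, whereas a timeout or a reset connection often will.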
Integration with Web Scraping APIs
When working with web scraping services, you can often configure timeouts at the API level. For example, with the WebScraping.AI API:
# Configure timeout via API parameter
curl -X GET "https://api.webscraping.ai/html" \
  -H "Api-Key: YOUR_API_KEY" \
  -G \
  -d "url=https://example.com" \
  -d "timeout=30000"

// Using fetch with a client-side timeout
const controller = new AbortController();
const timeoutId = setTimeout(() => controller.abort(), 30000);

// Pass the target URL as a query parameter, as in the curl example
const params = new URLSearchParams({ url: 'https://example.com' });

try {
  const response = await fetch(`https://api.webscraping.ai/html?${params}`, {
    method: 'GET',
    headers: {
      'Api-Key': 'YOUR_API_KEY'
    },
    signal: controller.signal
  });
  // Read the body as text: the /html endpoint returns page HTML, not JSON
  const html = await response.text();
} catch (error) {
  if (error.name === 'AbortError') {
    console.error('Request timed out');
  } else {
    throw error;
  }
} finally {
  // Clear the timer on both success and failure
  clearTimeout(timeoutId);
}
Advanced Timeout Scenarios
Handling Slow Networks
// Configure for slow network conditions
const page = await browser.newPage();

// Throttle the connection to about 500 kbit/s in each direction.
// Puppeteer expects download/upload in bytes per second and latency in milliseconds.
await page.emulateNetworkConditions({
  download: 500 * 1024 / 8,
  upload: 500 * 1024 / 8,
  latency: 20
});

// Increase timeouts accordingly
page.setDefaultNavigationTimeout(120000);
page.setDefaultTimeout(180000);

await page.goto('https://example.com', {
  waitUntil: 'networkidle2',
  timeout: 150000
});
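A quick estimate helps when choosing how far to raise timeouts for a throttled connection. The sketch below computes a rough lower bound on load time from the page weight and the throughput used in the throttling settings above. The 2 MiB page weight and the 10 sequential round trips are assumed figures for illustration, and the model ignores parsing, rendering and parallel connections.

```javascript
// Rough lower bound on load time: payload transfer plus one latency hit
// per sequential round trip.
function estimateLoadMs({ pageBytes, bytesPerSecond, latencyMs = 0, roundTrips = 1 }) {
  return Math.ceil((pageBytes * 1000) / bytesPerSecond + latencyMs * roundTrips);
}

const throughput = 500 * 1024 / 8; // 64000 bytes per second
const estimate = estimateLoadMs({
  pageBytes: 2 * 1024 * 1024, // assumed page weight: 2 MiB
  bytesPerSecond: throughput,
  latencyMs: 20,
  roundTrips: 10,             // assumed number of sequential round trips
});

console.log(estimate);     // 32968
console.log(estimate * 3); // 98904, with a safety factor of 3
```

Under these assumptions, transfer alone takes about 33 seconds, so a navigation timeout in the region of 100 to 150 seconds leaves reasonable headroom.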
Dynamic Timeout Adjustment
class AdaptiveTimeout {
  constructor(baseTimeout = 30000) {
    this.baseTimeout = baseTimeout;
    this.loadTimes = [];
  }

  async navigate(page, url) {
    const start = Date.now();
    const timeout = this.calculateTimeout();
    try {
      await page.goto(url, { timeout });
      const loadTime = Date.now() - start;
      this.recordLoadTime(loadTime);
      return loadTime;
    } catch (error) {
      if (error.name === 'TimeoutError') {
        // Increase timeout for future requests
        this.adjustTimeout(1.2);
      }
      throw error;
    }
  }

  calculateTimeout() {
    if (this.loadTimes.length === 0) return this.baseTimeout;
    const avgLoadTime = this.loadTimes.reduce((a, b) => a + b) / this.loadTimes.length;
    return Math.max(this.baseTimeout, avgLoadTime * 1.5);
  }

  recordLoadTime(time) {
    this.loadTimes.push(time);
    if (this.loadTimes.length > 10) {
      this.loadTimes.shift(); // Keep only last 10 measurements
    }
  }

  adjustTimeout(factor) {
    this.baseTimeout *= factor;
  }
}
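The class above sizes the timeout from the average load time and lets the base value grow by 20% on every timeout with no upper limit. An alternative worth considering is to size the timeout from a high percentile of recent load times and clamp it between a floor and a ceiling, so that one slow outlier or a run of failures cannot push it arbitrarily high. The sketch below shows that calculation; the function names, the 95th percentile and the default bounds are illustrative choices, not a recommendation from any library.

```javascript
// Nearest-rank percentile of a non-empty array of numbers (p between 0 and 100).
function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// Timeout = 95th percentile load time * headroom, clamped to [minMs, maxMs].
function timeoutFromHistory(loadTimes, { minMs = 30000, maxMs = 120000, headroom = 1.5 } = {}) {
  if (loadTimes.length === 0) return minMs;
  const suggested = Math.round(percentile(loadTimes, 95) * headroom);
  return Math.min(maxMs, Math.max(minMs, suggested));
}

console.log(timeoutFromHistory([2000, 3000, 50000])); // 75000
console.log(timeoutFromHistory([100000]));            // 120000 (capped)
console.log(timeoutFromHistory([1000]));              // 30000 (floor)
```

This function could stand in for calculateTimeout() in the class above, called with this.loadTimes.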
Conclusion
Configuring timeouts properly is essential for reliable web scraping and browser automation. The key is to balance between allowing enough time for pages to load completely while preventing your application from hanging on unresponsive sites. Start with reasonable default values (30-60 seconds) and adjust based on your specific use case and target websites.
Remember to implement proper error handling, consider retry mechanisms for timeout failures, and monitor your timeout configurations in production. For complex scenarios involving dynamic content, you might also want to explore how to handle timeouts in Puppeteer for more specific timeout handling strategies.
When dealing with single-page applications that load content dynamically, proper timeout configuration becomes even more critical. You can learn more about this in our guide on how to crawl a single page application (SPA) using Puppeteer.