How do I configure browser cache settings in Headless Chromium?

Configuring browser cache settings in Headless Chromium is essential for web scraping scenarios where you need to control how resources are cached, simulate fresh page loads, or optimize performance by leveraging cached content. This guide covers various methods to configure cache behavior using different automation tools and direct Chrome DevTools Protocol commands.

Understanding Browser Cache in Headless Chromium

The browser cache stores web resources such as HTML, CSS, JavaScript, and images so that repeat visits load faster. In headless environments, controlling cache behavior helps you:

  • Test fresh content: Disable cache to ensure you're getting the latest version of dynamic content
  • Simulate real user behavior: Use cache to replicate how actual users experience page loading
  • Optimize scraping performance: Leverage cache for repeated visits to similar pages
  • Debug caching issues: Control cache to isolate problems related to stale content

Cache Configuration with Puppeteer

Disabling Cache Entirely

The most straightforward approach is to disable caching completely:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({
    headless: true,
    args: ['--disable-dev-shm-usage', '--no-sandbox']
  });

  const page = await browser.newPage();

  // Disable cache for this page
  await page.setCacheEnabled(false);

  await page.goto('https://example.com');

  // All resources will be fetched fresh
  console.log('Page loaded without cache');

  await browser.close();
})();
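
One way to confirm that the setting took effect is to count how many responses report being served from cache. The helper below is a minimal sketch, not part of Puppeteer: `countCachedResponses` is a hypothetical name, and it assumes only that `page` emits `response` events whose payload has a `fromCache()` method, as Puppeteer pages do.

```javascript
// Hypothetical helper: navigates to `url` and counts how many responses
// were served from the browser cache during that navigation.
async function countCachedResponses(page, url) {
  let cached = 0;
  let total = 0;

  const onResponse = (response) => {
    total += 1;
    if (response.fromCache()) {
      cached += 1;
    }
  };

  page.on('response', onResponse);
  try {
    await page.goto(url);
  } finally {
    // Always detach the listener so repeated calls do not double count
    page.off('response', onResponse);
  }

  return { cached, total };
}
```

With the cache disabled, `cached` should stay at zero even on a second visit to the same URL; with the cache enabled, a repeat visit would normally report some cached responses.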

Selective Cache Control

For more granular control, use Chrome DevTools Protocol to configure specific cache behaviors:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  // Enable network domain for cache control
  const client = await page.target().createCDPSession();
  await client.send('Network.enable');

  // Clear browser cache
  await client.send('Network.clearBrowserCache');

  // Disable cache for network requests
  await client.send('Network.setCacheDisabled', { 
    cacheDisabled: true 
  });

  await page.goto('https://example.com');

  await browser.close();
})();
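
If the cache should be off only for part of a session, the same CDP command can be wrapped so the setting is restored afterwards. This is a sketch under stated assumptions: `withCacheDisabled` is a hypothetical helper, `client` is any object with a `send(method, params)` method (such as the CDP session created above), and `task` is whatever async work you want to run without the cache.

```javascript
// Hypothetical helper: disables the HTTP cache while `task` runs, then
// re-enables it even if `task` throws.
async function withCacheDisabled(client, task) {
  await client.send('Network.enable');
  await client.send('Network.setCacheDisabled', { cacheDisabled: true });
  try {
    return await task();
  } finally {
    await client.send('Network.setCacheDisabled', { cacheDisabled: false });
  }
}
```

A call might look like `await withCacheDisabled(client, () => page.goto('https://example.com'))`.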

Cache with Custom Policies

Request interception lets you set different request headers per resource type. Keep in mind that enabling request interception in Puppeteer disables page caching, so this pattern affects how servers and intermediate caches (CDNs, proxies) answer rather than what the local browser cache stores:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  // Intercept requests to apply custom cache logic
  // (enabling interception disables Puppeteer's page cache)
  await page.setRequestInterception(true);

  page.on('request', (request) => {
    const resourceType = request.resourceType();

    if (resourceType === 'image' || resourceType === 'stylesheet') {
      // Send image and CSS requests unchanged
      request.continue();
    } else {
      // Ask for a revalidated copy of HTML, JavaScript, and all other types.
      // request.headers() uses lower-case names, so the key is lower-case too.
      request.continue({
        headers: {
          ...request.headers(),
          'cache-control': 'no-cache'
        }
      });
    }
  });

  await page.goto('https://example.com');

  await browser.close();
})();
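
The per-type decision in the handler above can be pulled into a small pure function, which makes the policy easier to test and extend. This is an illustrative sketch: `CACHEABLE_TYPES` and `headersFor` are hypothetical names, and the set simply mirrors the two resource types used in the example.

```javascript
// Hypothetical policy table: requests for these resource types are sent
// unchanged; every other type gets a revalidation header.
const CACHEABLE_TYPES = new Set(['image', 'stylesheet']);

// Returns the headers to send for a request of the given resource type.
// The original headers object is never mutated.
function headersFor(resourceType, originalHeaders) {
  if (CACHEABLE_TYPES.has(resourceType)) {
    return originalHeaders;
  }
  // Lower-case key, matching the lower-case names Puppeteer reports
  return { ...originalHeaders, 'cache-control': 'no-cache' };
}
```

Inside the request handler this would be used as `request.continue({ headers: headersFor(request.resourceType(), request.headers()) })`.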

Cache Configuration with Selenium

Python with Chrome WebDriver

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Configure Chrome options for cache control.
# Chromium ignores switches it does not recognize, and several of these are
# legacy (AppCache, for example, has been removed from current Chrome), so
# treat them as best-effort and prefer the CDP commands in the next section.
chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-cache')
chrome_options.add_argument('--disable-application-cache')
chrome_options.add_argument('--disable-offline-load-stale-cache')
chrome_options.add_argument('--disk-cache-size=0')
chrome_options.add_argument('--media-cache-size=0')

# Create the driver with the cache-related switches applied
driver = webdriver.Chrome(options=chrome_options)

try:
    driver.get('https://example.com')
    print("Page loaded with cache switches applied")

    # Delete Cache Storage entries (the Cache API used by service workers) for
    # this origin. This does not clear the HTTP cache. Returning the promise
    # makes the WebDriver call wait until the deletions finish.
    driver.execute_script('return caches.keys().then(names => Promise.all(names.map(name => caches.delete(name))));')

finally:
    driver.quit()

Advanced Cache Management with CDP

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument('--headless')

driver = webdriver.Chrome(options=chrome_options)

try:
    # Enable Network domain
    driver.execute_cdp_cmd('Network.enable', {})

    # Clear browser cache
    driver.execute_cdp_cmd('Network.clearBrowserCache', {})

    # Set cache disabled
    driver.execute_cdp_cmd('Network.setCacheDisabled', {'cacheDisabled': True})

    driver.get('https://example.com')

    # List the Cache Storage caches (Cache API) that exist for the origin
    cache_info = driver.execute_cdp_cmd('CacheStorage.requestCacheNames', {
        'securityOrigin': 'https://example.com'
    })
    print(f"Cache storages: {cache_info['caches']}")

finally:
    driver.quit()

Direct Chrome DevTools Protocol Usage

Node.js with CDP

const CDP = require('chrome-remote-interface');
const { spawn } = require('child_process');

// Launch Chrome with remote debugging
const chrome = spawn('google-chrome', [
  '--headless',
  '--remote-debugging-port=9222',
  '--disable-gpu',
  '--no-sandbox'
]);

setTimeout(async () => {
  try {
    const client = await CDP();
    const { Network, Page, Runtime } = client;

    // Enable necessary domains
    await Network.enable();
    await Page.enable();
    await Runtime.enable();

    // Configure cache settings
    await Network.setCacheDisabled({ cacheDisabled: true });

    // Clear existing cache
    await Network.clearBrowserCache();

    // Navigate to page
    await Page.navigate({ url: 'https://example.com' });

    // Wait for load
    await Page.loadEventFired();

    console.log('Page loaded with cache disabled');

    await client.close();
  } catch (error) {
    console.error('Error:', error);
  }

  chrome.kill();
}, 1000);
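
The fixed one-second delay above can be too short on a slow machine. A sturdier option is to poll the debugging port until Chrome answers. The sketch below assumes Chrome serves its `/json/version` endpoint on the chosen remote debugging port; `fetchJson` and `waitForDebugger` are hypothetical helper names, and the retry defaults are arbitrary.

```javascript
const http = require('http');

// Resolves with the parsed JSON body of `url`, or rejects on any error.
function fetchJson(url) {
  return new Promise((resolve, reject) => {
    http.get(url, (res) => {
      let body = '';
      res.setEncoding('utf8');
      res.on('data', (chunk) => { body += chunk; });
      res.on('end', () => {
        try {
          resolve(JSON.parse(body));
        } catch (err) {
          reject(err);
        }
      });
    }).on('error', reject);
  });
}

// Hypothetical helper: polls the debugging endpoint until it responds or
// the attempts run out.
async function waitForDebugger(port, { attempts = 20, delayMs = 250 } = {}) {
  const url = `http://127.0.0.1:${port}/json/version`;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fetchJson(url);
    } catch (err) {
      // Not ready yet: wait a little and try again
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  throw new Error(`Chrome debugger not reachable on port ${port}`);
}
```

After spawning Chrome, `await waitForDebugger(9222)` could replace the `setTimeout` wrapper before calling `CDP()`.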

Managing Cache Storage and Service Workers

Clearing Service Worker Cache

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

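  // Service workers and Cache Storage belong to an origin, so load the target
  // page first; on the initial blank page there is nothing to clear
  await page.goto('https://example.com');

  // Unregister service workers and delete every Cache Storage entry,
  // waiting for each operation to finish
  await page.evaluate(async () => {
    if ('serviceWorker' in navigator) {
      const registrations = await navigator.serviceWorker.getRegistrations();
      await Promise.all(registrations.map(registration => registration.unregister()));
    }
    if ('caches' in window) {
      const names = await caches.keys();
      await Promise.all(names.map(name => caches.delete(name)));
    }
  });

  // Reload so the page is fetched without the old service worker or caches
  await page.reload();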

  await browser.close();
})();
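
The same cleanup can also be requested through CDP, which does not depend on page script running in the right origin. This sketch assumes a CDP session object with a `send(method, params)` method; `clearOriginStorage` is a hypothetical name, and it relies on the `Storage.clearDataForOrigin` command with a comma-separated `storageTypes` string.

```javascript
// Hypothetical helper: clears Cache Storage and service worker registrations
// for the origin of `url` through the CDP Storage domain.
async function clearOriginStorage(client, url) {
  const origin = new URL(url).origin;
  await client.send('Storage.clearDataForOrigin', {
    origin,
    storageTypes: 'cache_storage,service_workers'
  });
  return origin;
}
```

With Puppeteer, `client` could come from `page.target().createCDPSession()`, and the call would be `await clearOriginStorage(client, 'https://example.com/some/page')`.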

Cache Configuration via Command Line Arguments

Chrome Launch Arguments for Cache Control

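const puppeteer = require('puppeteer');

(async () => {
  // Chromium switches change between releases and unrecognized switches are
  // ignored, so check each one against the browser version you run
  const browser = await puppeteer.launch({
    headless: true,
    args: [
      // Legacy AppCache switch (AppCache has been removed from current Chrome)
      '--disable-application-cache',

      // Background throttling switches (not cache-related)
      '--disable-background-timer-throttling',
      '--disable-backgrounding-occluded-windows',
      '--disable-renderer-backgrounding',

      // Cache size control
      '--disk-cache-size=0',
      '--media-cache-size=0',
      '--aggressive-cache-discard',

      // Back/forward cache
      '--disable-back-forward-cache',

      // Feature toggles (not cache-related)
      '--disable-features=TranslateUI,BlinkGenPropertyTrees',

      // Service worker storage
      '--disable-service-worker-database'
    ]
  });

  await browser.close();
})();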
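
Launch options also decide whether the disk cache survives between runs. By default Puppeteer starts Chromium with a temporary profile, so cached resources are lost when the browser closes; pointing `userDataDir` at a fixed directory keeps them. The helper below is a sketch: `buildLaunchOptions`, `persistCache`, and the `./chrome-profile` default are hypothetical names chosen for illustration.

```javascript
const path = require('path');

// Hypothetical helper: builds Puppeteer launch options for either a
// persistent profile (disk cache reused across runs) or a temporary one.
function buildLaunchOptions({ persistCache = false, profileDir = './chrome-profile' } = {}) {
  const options = { headless: true, args: [] };
  if (persistCache) {
    // A fixed user data directory lets later runs reuse the disk cache
    options.userDataDir = path.resolve(profileDir);
  }
  return options;
}
```

The result is passed straight to the launcher, for example `puppeteer.launch(buildLaunchOptions({ persistCache: true }))`.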

Monitoring Cache Behavior

Tracking Cache Hits and Misses

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  const client = await page.target().createCDPSession();
  await client.send('Network.enable');

  // Track cache behavior
  const cacheStats = {
    hits: 0,
    misses: 0,
    resources: []
  };

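  client.on('Network.responseReceived', ({ response }) => {
    // fromDiskCache covers the HTTP disk cache; fromServiceWorker means a
    // service worker answered the request. Memory-cache hits are reported
    // separately through the Network.requestServedFromCache event.
    const fromCache = response.fromDiskCache || response.fromServiceWorker;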

    if (fromCache) {
      cacheStats.hits++;
    } else {
      cacheStats.misses++;
    }

    cacheStats.resources.push({
      url: response.url,
      fromCache,
      status: response.status,
      mimeType: response.mimeType
    });
  });

  await page.goto('https://example.com');

  console.log('Cache Statistics:', cacheStats);

  await browser.close();
})();
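
Raw counters are easier to compare across runs once they are reduced to a hit rate. The function below is a small sketch that works on the `resources` array collected above; `summarizeCache` is a hypothetical name, and it only assumes each entry has a truthy or falsy `fromCache` field.

```javascript
// Hypothetical helper: reduces collected resource entries to a summary
// with totals and a hit rate between 0 and 1.
function summarizeCache(resources) {
  const total = resources.length;
  const hits = resources.filter((resource) => resource.fromCache).length;
  return {
    total,
    hits,
    misses: total - hits,
    // Avoid dividing by zero when nothing was recorded
    hitRate: total === 0 ? 0 : hits / total
  };
}
```

A first visit in a fresh browser will usually show a hit rate near zero; loading the same URL a second time in the same session is what makes the numbers meaningful.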

Best Practices for Cache Management

Production Scraping Considerations

  1. Performance vs. Freshness: Balance cache usage with data freshness requirements
  2. Resource Optimization: Cache static assets while forcing fresh data fetches
  3. Memory Management: Clear cache periodically for long-running scrapers
  4. Debugging: Disable cache during development and debugging
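
For the periodic clearing mentioned in item 3, one possible shape is a counter that clears the cache every N navigations. This is a sketch, not a recommendation of a specific interval: `createCacheClearer` and the default of 50 are hypothetical, and `client` is assumed to be a CDP session with a `send(method)` method.

```javascript
// Hypothetical helper: returns a function to call after each navigation.
// Every `everyN` calls it clears the browser cache through CDP and
// resolves to true; otherwise it resolves to false.
function createCacheClearer(client, everyN = 50) {
  let count = 0;
  return async function afterNavigation() {
    count += 1;
    if (count % everyN !== 0) {
      return false;
    }
    await client.send('Network.clearBrowserCache');
    return true;
  };
}
```

In a scraping loop this would be called once per page, for example `await afterNavigation()` right after each `page.goto(...)`.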

Implementation Tips

  • Use page.setCacheEnabled(false) for simple cache disabling in Puppeteer
  • Leverage Chrome DevTools Protocol for fine-grained cache control
  • Consider handling browser sessions when managing cache across multiple pages
  • Monitor network requests to understand cache behavior impact on performance

Proper cache configuration in Headless Chromium ensures your web scraping operations behave predictably and efficiently, whether you need fresh data or want to optimize performance through intelligent caching strategies.
