How do I prevent detection of Headless Chromium by websites?

Websites commonly use detection techniques to block automated browsers or to serve them different content. Headless Chromium, like any headless browser, exposes several tell-tale signals when used for web scraping. You can reduce those signals by making the browser look and behave more like a regular user's browser.

Here are some strategies to prevent detection of headless Chromium:

1. Set a Realistic User-Agent String

Replace the default user-agent string with one from a popular, current browser. By default, headless Chrome identifies itself with a "HeadlessChrome" token in the user-agent string, and some sites check for exactly that token.

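# Python example with Selenium
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
# options.headless was deprecated and later removed in Selenium 4; pass the flag instead.
# "--headless=new" needs Chrome 109 or later; use "--headless" on older builds.
options.add_argument("--headless=new")
options.add_argument("--window-size=1920,1080")
# Keep the Chrome version in this string close to the version you actually run.
options.add_argument("--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36")

driver = webdriver.Chrome(options=options)
driver.get("http://example.com")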
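Instead of hard-coding a user-agent that may drift away from the installed browser version, one option is to take the browser's own string and replace only the headless token. The following is a minimal sketch of that idea; the function name and the sample user-agent string are illustrative assumptions, not part of any library.

```python
import re


def normalize_user_agent(default_ua: str) -> str:
    """Return the UA with the 'HeadlessChrome' product token replaced by 'Chrome'."""
    return re.sub(r"\bHeadlessChrome/", "Chrome/", default_ua)


# Illustrative sample of what a headless build might report (assumed value).
headless_ua = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) HeadlessChrome/120.0.0.0 Safari/537.36"
)

print(normalize_user_agent(headless_ua))
```

With Selenium, the real default can be read once with driver.execute_script("return navigator.userAgent"), and the normalized result can then be passed as the --user-agent argument of a fresh session.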

2. Modify Webdriver Properties

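Websites can check properties that reveal automation, such as navigator.webdriver, which is true when the browser is driven by WebDriver or similar tooling. You can override such properties with JavaScript, but the override has to be installed before the page's own scripts run; otherwise a detection script can read the original value first.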

// JavaScript to inject using Selenium or Puppeteer to modify navigator.webdriver
Object.defineProperty(navigator, 'webdriver', {
    get: () => false,
});
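With Selenium and a Chromium-based driver, one way to run such a script before any page script is the DevTools command Page.addScriptToEvaluateOnNewDocument, sent through execute_cdp_cmd. The sketch below assumes a Selenium 4 Chrome driver; the helper name install_webdriver_patch is hypothetical.

```python
# JavaScript that will run at the start of every new document.
WEBDRIVER_PATCH = """
Object.defineProperty(navigator, 'webdriver', {
    get: () => false,
});
"""


def install_webdriver_patch(driver, script: str = WEBDRIVER_PATCH):
    """Register `script` to be evaluated in each new document before page scripts.

    `driver` is assumed to be a Chromium-based Selenium driver, which exposes
    execute_cdp_cmd(command_name, parameters).
    """
    return driver.execute_cdp_cmd(
        "Page.addScriptToEvaluateOnNewDocument", {"source": script}
    )


# Usage (requires a running Chrome session):
#   install_webdriver_patch(driver)
#   driver.get("http://example.com")
```

The patch has to be installed before driver.get() is called, because it only affects documents created after registration.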

3. Use Browser Extensions

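Some browser extensions can mask signs of automation by modifying the browser's behavior and fingerprint. Note that Chrome's original headless mode does not load extensions, so this approach needs the newer headless mode (--headless=new) or a headed browser.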

# Python example with Selenium
options.add_extension('path/to/extension.crx')

4. Mimic Human Interaction

Websites may look for typical human interactions such as mouse movements and clicks. You can use browser automation tools to simulate these actions.

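# Python example with Selenium
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By

# Assumes `driver` from the first example; the "a" selector is only a placeholder.
some_element = driver.find_element(By.CSS_SELECTOR, "a")

actions = ActionChains(driver)
actions.move_to_element(some_element).perform()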
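A single jump straight to the element is itself a fairly mechanical pattern. One possible refinement, sketched here under the assumption that small random deviations look more natural, is to split a movement into several short steps. The function name and default values are illustrative.

```python
import random


def jittered_offsets(dx: int, dy: int, steps: int = 10, jitter: int = 3, rng=None):
    """Split a move of (dx, dy) pixels into `steps` small offsets with random wobble.

    The offsets always sum to exactly (dx, dy), so the pointer ends on target.
    """
    if steps < 1:
        raise ValueError("steps must be at least 1")
    rng = rng or random.Random()
    offsets = []
    moved_x = moved_y = 0
    for i in range(1, steps + 1):
        # Ideal position on the straight line after step i.
        target_x = round(dx * i / steps)
        target_y = round(dy * i / steps)
        if i < steps:
            # Wobble every intermediate point; the final point stays exact.
            target_x += rng.randint(-jitter, jitter)
            target_y += rng.randint(-jitter, jitter)
        offsets.append((target_x - moved_x, target_y - moved_y))
        moved_x, moved_y = target_x, target_y
    return offsets
```

Each pair can then be fed to ActionChains.move_by_offset(x, y), optionally followed by a short pause(), before calling perform(). Selenium raises an error if an offset would take the pointer outside the viewport, so the starting position needs to leave room for the jitter.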

5. Avoid Rapid, Regular Request Intervals

Automated scripts often request pages at fixed intervals with no pauses, which is a strong signal of bot activity. Introduce random delays between requests so the timing is less predictable.

# Python example
import random
import time

# Pause for a random 1 to 5 seconds before the next request
time.sleep(random.uniform(1, 5))
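The one-line delay above can be wrapped into a small loop so that every request after the first is preceded by a random pause. This is a sketch only; random_delay, crawl, and the fetch callback are hypothetical names, and the sleep function is injectable so the logic can be exercised without actually waiting.

```python
import random
import time


def random_delay(min_s: float = 1.0, max_s: float = 5.0, sleep=time.sleep, rng=random):
    """Sleep for a random duration in [min_s, max_s] and return that duration."""
    if min_s < 0 or max_s < min_s:
        raise ValueError("expected 0 <= min_s <= max_s")
    delay = rng.uniform(min_s, max_s)
    sleep(delay)
    return delay


def crawl(urls, fetch, **delay_kwargs):
    """Call fetch(url) for each URL, pausing a random interval between calls."""
    results = []
    for index, url in enumerate(urls):
        if index:  # no need to wait before the first request
            random_delay(**delay_kwargs)
        results.append(fetch(url))
    return results
```

With Selenium, fetch could be a small function that calls driver.get(url) and returns driver.page_source.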

6. Rotate IPs and Use Proxies

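Sending a large number of requests from a single IP address can get that address flagged or rate-limited. Use a pool of proxies and rotate through them to distribute the requests. The example below uses the requests library; Chromium itself accepts a proxy through the --proxy-server command-line switch.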

# Python example with requests and proxies
import requests

proxies = {
    'http': 'http://your.proxy:port',
    'https': 'http://your.proxy:port',
}
response = requests.get("http://example.com", proxies=proxies)
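The example above uses a single proxy. Rotation can be sketched as a round-robin over a pool, as below. The ProxyRotator class and the proxy addresses are hypothetical placeholders; only the shape of the returned mapping follows the proxies argument used above.

```python
import itertools


class ProxyRotator:
    """Hand out proxies from a fixed pool in round-robin order."""

    def __init__(self, proxy_urls):
        pool = list(proxy_urls)
        if not pool:
            raise ValueError("proxy pool is empty")
        self._cycle = itertools.cycle(pool)

    def next_proxies(self):
        """Return a mapping in the shape the `proxies=` argument of requests expects."""
        url = next(self._cycle)
        return {"http": url, "https": url}


# Hypothetical proxy addresses; replace them with proxies you control.
PROXY_POOL = [
    "http://proxy-a.example:8080",
    "http://proxy-b.example:8080",
]
```

Each call such as requests.get(url, proxies=rotator.next_proxies()) would then go out through the next proxy in the pool. A production setup would probably also drop proxies that fail repeatedly, which this sketch does not attempt.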

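7. Use --disable-blink-features=AutomationControlled

Passing --disable-blink-features=AutomationControlled turns off the Blink feature that makes navigator.webdriver report true under automation, which defeats detection methods that rely on that property.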

chromium-browser --headless --disable-blink-features=AutomationControlled
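When the browser is launched from Python rather than from a shell, the same switch can be collected together with the other arguments discussed here. The helper below is a sketch with a hypothetical name; it only builds the argument list and does not start a browser.

```python
def chromium_arguments(window_size=(1920, 1080), user_agent=None):
    """Build a list of command-line switches for a less conspicuous headless session."""
    width, height = window_size
    args = [
        "--headless=new",  # assumes Chrome/Chromium 109 or later
        "--disable-blink-features=AutomationControlled",
        f"--window-size={width},{height}",
    ]
    if user_agent:
        args.append(f"--user-agent={user_agent}")
    return args


print(chromium_arguments())
```

With Selenium, each entry can be passed to options.add_argument(). Some setups additionally call options.add_experimental_option("excludeSwitches", ["enable-automation"]) so that ChromeDriver does not add its own automation switch.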

8. Use Stealth Plugins

For Puppeteer (a Node library that provides a high-level API for controlling headless Chrome), the puppeteer-extra-plugin-stealth plugin, loaded through the puppeteer-extra wrapper, applies a collection of evasion techniques.

// JavaScript example with Puppeteer
const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');

puppeteer.use(StealthPlugin());

(async () => {
    const browser = await puppeteer.launch({ headless: true });
    const page = await browser.newPage();
    await page.goto('https://example.com');
    // ... your scraping code ...
    await browser.close();
})();

9. Randomize Browser Window Size

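Headless Chrome defaults to an 800x600 window, a size that is rare among real desktop users, so some websites use the window size as a headless signal. Set a common desktop resolution explicitly, and vary it between sessions rather than reusing a single value.

# Python example with Selenium
options.add_argument("--window-size=1280,720")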
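To actually randomize the size rather than fix it, a resolution can be drawn from a list of plausible desktop sizes at the start of each session. This is a minimal sketch; the list of sizes and the function name are assumptions chosen for illustration.

```python
import random

# Illustrative set of desktop resolutions (assumed, not exhaustive).
COMMON_SIZES = [
    (1920, 1080),
    (1536, 864),
    (1440, 900),
    (1366, 768),
    (1280, 720),
]


def random_window_size_argument(rng=None, sizes=COMMON_SIZES):
    """Pick one size at random and format it as a Chrome command-line switch."""
    rng = rng or random.Random()
    width, height = rng.choice(sizes)
    return f"--window-size={width},{height}"


print(random_window_size_argument())
```

The returned string can be passed directly to options.add_argument() before the driver is created.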

Conclusion

Even with these techniques, there's no guarantee that you can avoid detection indefinitely as websites continuously improve their bot detection mechanisms. Always ensure that your web scraping activities comply with the website's terms of service and legal regulations. If you're scraping at a large scale, it may be more practical to use a web scraping service that handles these concerns for you.
