When you use headless Chromium (or any headless browser) for web scraping, websites commonly employ detection techniques to block automated scripts or serve them different content. While detection can never be ruled out entirely, you can take several steps to make your headless browser look more like a regular user's browser.
Here are some strategies to make headless Chromium harder to detect:
1. Use a Realistic User-Agent String
Replace the default user-agent string with one from a popular mainstream browser. Some sites check the user-agent for telltale values such as "HeadlessChrome" to determine whether the visitor is a headless browser.
# Python example with Selenium
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
options = Options()
# "--headless=new" is the modern flag (Chrome 109+); plain "--headless" works on older versions.
# (options.headless = True was removed in recent Selenium releases.)
options.add_argument("--headless=new")
options.add_argument("window-size=1920,1080")
# Keep the user-agent consistent with the Chrome version you actually run.
options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36")
driver = webdriver.Chrome(options=options)
driver.get("http://example.com")
2. Modify Webdriver Properties
Websites can check for certain properties that give headless browsers away (e.g., navigator.webdriver being true). You can use JavaScript to override these properties.
// JavaScript to inject using Selenium or Puppeteer to modify navigator.webdriver
Object.defineProperty(navigator, 'webdriver', {
get: () => false,
});
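With Selenium, one way to run this override before any page script executes is through the Chrome DevTools Protocol. A minimal sketch, assuming the driver created in step 1:
# Python sketch with Selenium: apply the override before each page's own scripts run
driver.execute_cdp_cmd(
    "Page.addScriptToEvaluateOnNewDocument",
    {"source": "Object.defineProperty(navigator, 'webdriver', {get: () => false});"},
)
driver.get("http://example.com")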
3. Use Browser Extensions
Some browser extensions can help mask the fact that you're running a headless browser by modifying its behavior and fingerprint.
# Python example with Selenium (reusing the options from step 1)
options.add_extension('path/to/extension.crx')
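Note that Chromium's classic headless mode does not load extensions, so this generally requires the newer headless mode (--headless=new) used in step 1.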
4. Mimic Human Interaction
Websites may look for typical human interactions such as mouse movements and clicks. You can use browser automation tools to simulate these actions.
# Python example with Selenium
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By
element = driver.find_element(By.TAG_NAME, "h1")  # any visible element on the page
actions = ActionChains(driver)
actions.move_to_element(element).perform()
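To make the interaction less mechanical, you can chain movements with randomized pauses and some scrolling. A rough sketch; the selector below is a placeholder:
# Python sketch with Selenium: randomized pauses between actions
import random
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By
link = driver.find_element(By.CSS_SELECTOR, "a")  # placeholder target element
actions = ActionChains(driver)
actions.move_to_element(link)
actions.pause(random.uniform(0.3, 1.2))  # hesitate the way a human would
actions.click()
actions.perform()
# Scroll a random amount, as a user skimming the page might.
driver.execute_script("window.scrollBy(0, arguments[0]);", random.randint(200, 800))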
5. Avoid Quick and Regular Intervals
Automated scripts often access pages at regular intervals and without delays, which is a strong signal for bot activity. Introduce randomness in your scraping patterns.
# Python example: wait a random 1 to 5 seconds between requests
import time
import random
time.sleep(random.uniform(1, 5))
6. Rotate IPs and Use Proxies
Using a single IP address for a large number of requests can flag your activity as suspicious. Use a pool of proxies and rotate them to distribute the requests.
# Python example with requests and proxies
import requests
proxies = {
'http': 'http://your.proxy:port',
'https': 'http://your.proxy:port',
}
response = requests.get("http://example.com", proxies=proxies)
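The snippet above routes a single requests call through one fixed proxy; rotation means picking a different proxy per request. A sketch with placeholder proxy addresses, plus the Chrome flag for routing the headless browser itself through a proxy:
# Python sketch: rotate through a pool of (placeholder) proxies
import random
import requests
proxy_pool = [
    'http://proxy1.example:8080',
    'http://proxy2.example:8080',
    'http://proxy3.example:8080',
]
proxy = random.choice(proxy_pool)
response = requests.get("http://example.com", proxies={'http': proxy, 'https': proxy})
# For headless Chrome with Selenium, the whole session can use one proxy:
# options.add_argument("--proxy-server=" + proxy)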
7. Use --disable-blink-features=AutomationControlled
This flag stops Blink from marking the browser as automation-controlled, which defeats some JavaScript-based detection checks (such as reading navigator.webdriver).
chromium-browser --headless --disable-blink-features=AutomationControlled
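The same flag can be passed through Selenium's Chrome options; dropping the --enable-automation switch that ChromeDriver adds by default is a commonly paired tweak:
# Python sketch with Selenium (reusing the options from step 1)
options.add_argument("--disable-blink-features=AutomationControlled")
options.add_experimental_option("excludeSwitches", ["enable-automation"])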
8. Use Stealth Plugins
For Puppeteer (a Node.js library that provides a high-level API to control headless Chrome), the puppeteer-extra stealth plugin applies various evasion techniques.
// JavaScript example with Puppeteer
const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');
puppeteer.use(StealthPlugin());
(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto('https://example.com');
  // ... your scraping code ...
  await browser.close();
})();
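If your stack is Python/Selenium rather than Node.js, the community-maintained selenium-stealth package applies a similar set of evasions.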
9. Randomize Browser Window Size
Some websites inspect the window and screen dimensions; headless browsers often run at telltale defaults (headless Chrome historically defaulted to 800x600).
# Python example with Selenium: pick a plausible size at random per session
import random
width, height = random.choice([(1920, 1080), (1366, 768), (1280, 720)])
options.add_argument(f"window-size={width},{height}")
Conclusion
Even with these techniques, there is no guarantee that you can avoid detection indefinitely, as websites continuously improve their bot-detection mechanisms. Always ensure that your web scraping activities comply with the website's terms of service and applicable laws. If you're scraping at a large scale, it may be more practical to use a web scraping service that handles these concerns for you.