Scraping websites like SeLoger can be challenging due to various anti-scraping measures they employ to protect their data from being harvested by bots. To avoid detection while scraping, you can employ several methods to make your scraper behave more like a human user. However, please be aware that web scraping can violate the terms of service of a website, and acting in bad faith can have legal consequences. Always ensure that your scraping activities comply with the website's terms of service and local laws.
Here are some strategies you can use to avoid detection:
User-Agent Rotation: Websites often check the User-Agent string to determine whether a request comes from a browser or a bot. Rotate through a pool of different user-agent strings to mimic different browsers.
Request Delays: Making requests at human-like intervals helps you avoid triggering rate limits or detection mechanisms that look for rapid-fire automated requests.
Proxy Usage: Utilize a pool of proxies to distribute your requests over different IP addresses. This helps to avoid IP bans and can reduce the chance of being flagged as suspicious.
Referrer Headers: Some websites check the Referer header to see whether the request comes from a legitimate page within their site. Set your HTTP Referer header to a reasonable URL from the target website.
Cookies: Maintain cookies between requests to simulate a real user session. If a site uses cookies to track user sessions, missing cookies can be a giveaway that you're scraping (see the session sketch after this list).
Headless Browsers: Tools like Puppeteer (for Node.js) or Selenium (for Python and other languages) let you control a real browser, which produces far more realistic traffic than raw HTTP requests and handles JavaScript-rendered pages.
CAPTCHA Solving Services: If CAPTCHAs are encountered, you can use CAPTCHA-solving services (either human-powered or AI-based) to bypass them, although this can be ethically and legally questionable.
Avoid Scraping Too Much Too Fast: Spread your scraping tasks over a longer period and limit the number of pages you scrape per session; the session sketch below applies such a cap.
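To illustrate the cookie handling and pacing points above, here is a minimal sketch using requests.Session, which automatically stores and resends cookies between requests. The pagination parameter, delay range, and page cap are illustrative assumptions, not values specific to SeLoger:

import random
import time

import requests

session = requests.Session()  # a Session keeps cookies across requests automatically
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3',
    'Referer': 'https://www.seloger.com/',
})

MAX_PAGES_PER_SESSION = 20  # assumed cap; tune to your needs
urls = [f'https://www.seloger.com/list.htm?page={i}' for i in range(1, 51)]  # hypothetical pagination parameter

for url in urls[:MAX_PAGES_PER_SESSION]:
    response = session.get(url)  # cookies set by earlier responses are sent back automatically
    if response.ok:
        html = response.text
        # ... parse the page here ...
    time.sleep(random.uniform(2, 6))  # human-like pause between requests

Because the headers and cookies live on the Session, every request in the run presents a consistent identity, which looks more like a normal browsing session than a series of unrelated requests.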
Here's a basic example in Python using the requests library with some of the techniques mentioned above:
import random
from itertools import cycle
from time import sleep

import requests

user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Safari/605.1.15',
    # Add more user agents here
]

proxies = [
    'http://proxy1.example.com:8080',
    'http://proxy2.example.com:8080',
    # Add more proxies here
]

proxy_cycle = cycle(proxies)
user_agent_cycle = cycle(user_agents)

def get_page(url):
    # Rotate the proxy and user agent on every request
    proxy = next(proxy_cycle)
    user_agent = next(user_agent_cycle)
    headers = {
        'User-Agent': user_agent,
        'Referer': 'https://www.seloger.com/'
    }
    response = requests.get(url, headers=headers, proxies={"http": proxy, "https": proxy})

    # Use a human-like delay between requests
    sleep(random.uniform(1, 5))

    if response.status_code == 200:
        return response.text
    # Handle request errors as needed (retry, log, switch proxy, ...)
    return None

# Example usage
html_content = get_page('https://www.seloger.com/list.htm')
And here is a basic example using Puppeteer in JavaScript:
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  // Set a random user agent
  const userAgents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Safari/605.1.15',
    // Add more user agents here
  ];
  await page.setUserAgent(userAgents[Math.floor(Math.random() * userAgents.length)]);

  await page.goto('https://www.seloger.com');

  // Add your code to interact with the page here, e.g., page.click(), page.type()

  await browser.close();
})();
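For Python users, Selenium (mentioned above) offers similar control over a real browser. Below is a minimal sketch assuming Selenium 4 and a local Chrome installation; the headless flag and user-agent string are illustrative choices rather than requirements:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless=new')  # run Chrome without a visible window
options.add_argument(
    'user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
)

driver = webdriver.Chrome(options=options)  # assumes Chrome and a matching driver are available
try:
    driver.get('https://www.seloger.com')
    html = driver.page_source  # rendered HTML, including JavaScript-generated content
    # ... parse html here ...
finally:
    driver.quit()

As with Puppeteer, the page source you read back includes JavaScript-rendered content that plain requests calls cannot see.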
Remember to use these techniques responsibly and ethically. If a website provides an API, it's always better to use that for data extraction purposes as it's a legitimate way to access the data without violating the website's terms of use.