Scraping websites like Nordstrom can be quite challenging because many e-commerce sites have robust measures in place to detect and block web scrapers. While I can provide you with some general guidelines and strategies to avoid being blocked, it's important to note that you must comply with Nordstrom's Terms of Service and use scraping practices ethically and legally.
Here are some strategies to minimize the risk of being blocked while scraping:
Adhere to robots.txt: Check Nordstrom's robots.txt file (usually found at https://www.nordstrom.com/robots.txt) to see what their policy is regarding scraping, and respect the rules outlined there (see the robotparser sketch after this list).
User-Agent Rotation: Websites often identify bots by looking for non-browser or unusual User-Agent strings. Rotate through a list of legitimate user-agents to mimic real users.
Request Throttling: Space out your requests so you don't hit the server too rapidly; implement delays or random sleep intervals between requests (see the throttling sketch after this list).
IP Rotation: Use a pool of IP addresses to distribute requests. This can be achieved through proxies or VPN services.
Referrer Strings: Some websites check the Referer header to ensure that navigation is occurring naturally within the site. Set an appropriate referrer when making requests.
Headers and Cookies: Make sure to use headers and cookies just like a regular browser would. This includes accepting and sending cookies, as well as using correct Accept-Language and Accept-Encoding headers.
Scrape During Off-Peak Hours: Try to scrape during the website's off-peak hours when there is less traffic.
Session Maintenance: Maintain sessions by reusing the same cookies and session data across requests (a requests.Session sketch follows this list).
JavaScript Rendering: Some sites only display data after JavaScript runs. Use tools like Selenium or Puppeteer to drive a headless browser that can execute it (see the headless-browser sketch after this list).
CAPTCHA Solutions: If you encounter CAPTCHAs, you may need to use CAPTCHA-solving services, though this is a grey area and is often against the site's terms of service.
Be Unpredictable: Randomize the order in which you scrape pages and avoid easily detectable scraping patterns; the throttling sketch below shuffles URLs for exactly this reason.
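For the robots.txt check, Python's standard library can evaluate the rules programmatically. This is a minimal sketch using urllib.robotparser; the bot name is a placeholder for whatever identifies your scraper.

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://www.nordstrom.com/robots.txt")
rp.read()  # fetch and parse the robots.txt file

# Check whether a given path may be fetched by your user agent
allowed = rp.can_fetch("MyScraperBot/1.0", "https://www.nordstrom.com/s/some-product-id")
print(allowed)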
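To illustrate throttling and unpredictable ordering together, the sketch below shuffles a list of hypothetical product URLs (placeholders, not real Nordstrom paths) and sleeps a random interval between requests.

import random
import time
import requests

# Placeholder product pages -- substitute the URLs you actually need
urls = [
    "https://www.nordstrom.com/s/product-a",
    "https://www.nordstrom.com/s/product-b",
    "https://www.nordstrom.com/s/product-c",
]

random.shuffle(urls)  # visit pages in an unpredictable order

for url in urls:
    response = requests.get(url, timeout=10)
    print(url, response.status_code)
    time.sleep(random.uniform(3, 8))  # random pause between requests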
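For headers, cookies, and session maintenance, requests.Session handles all three: it persists cookies across requests and re-sends default headers automatically. A minimal sketch, with header values mirroring what a typical browser sends:

import requests

session = requests.Session()
# Browser-like defaults sent with every request in this session
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate",
    "Referer": "https://www.nordstrom.com/",
})

# Cookies set by the first response are re-sent on later requests,
# just as a real browser would do
home = session.get("https://www.nordstrom.com/", timeout=10)
print(home.status_code, session.cookies.get_dict())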
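When content only appears after JavaScript runs, a headless browser is needed. Here is a minimal Selenium sketch, assuming Selenium 4+ and a locally available Chrome; the product URL is a placeholder.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")  # run Chrome without a visible window
driver = webdriver.Chrome(options=options)

try:
    driver.get("https://www.nordstrom.com/s/some-product-id")
    # page_source now contains the JavaScript-rendered HTML
    html = driver.page_source
    print(len(html))
finally:
    driver.quit()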
Putting several of these strategies together, here is a simple example using Python with the requests library:
import requests
import time
from fake_useragent import UserAgent
from itertools import cycle
# Proxy setup (if you have a list of proxies)
proxies = ["http://proxy1.example.com:8000", "http://proxy2.example.com:8000"]
proxy_pool = cycle(proxies)
# User-Agent setup
ua = UserAgent()
headers = {
    'User-Agent': ua.random,
    'Accept-Language': 'en-US,en;q=0.5',
    'Referer': 'https://www.nordstrom.com/'
}
# Target URL
url = 'https://www.nordstrom.com/s/some-product-id'
# Loop to make requests, rotating the proxy and User-Agent each time
for _ in range(10):
    proxy = next(proxy_pool)
    headers['User-Agent'] = ua.random  # rotate the User-Agent per request
    try:
        response = requests.get(
            url,
            headers=headers,
            proxies={"http": proxy, "https": proxy},
            timeout=10,
        )
        if response.status_code == 200:
            # Process the page
            pass
        else:
            print(f"Blocked or error occurred. Status code: {response.status_code}")
    except requests.exceptions.RequestException as e:
        print(e)
    time.sleep(10)  # Wait 10 seconds before the next request
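Note that this example depends on the fake-useragent package (pip install fake-useragent), and the proxy URLs above are placeholders you would replace with working proxies.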
Remember, even with these measures there's no guarantee you won't be blocked. Always be ready to adapt your strategy and respect the site's rules. If you're scraping at scale, it may be more appropriate to contact Nordstrom directly and ask whether they offer an API or another approved way to access their data.