Handling errors and retries when scraping websites like Realtor.com is crucial, as it helps in dealing with network issues, server errors, or changes in the website's structure that may cause your scraper to fail. Here’s how you can handle errors and implement retries in both Python and JavaScript:
## Python (with `requests` and `BeautifulSoup`)

Python is a popular language for web scraping, and you can use libraries like `requests` for HTTP requests and `BeautifulSoup` for parsing HTML. To handle errors and retries, you can use the `requests` library's built-in `Session` object, which can be configured with a `Retry` policy from `urllib3`:
```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from bs4 import BeautifulSoup

def get_html(url):
    session = requests.Session()
    retries = Retry(
        total=5,                                # Total number of retries
        backoff_factor=1,                       # Exponential backoff factor between retries
        status_forcelist=[500, 502, 503, 504],  # HTTP status codes to retry on
    )
    session.mount('http://', HTTPAdapter(max_retries=retries))
    session.mount('https://', HTTPAdapter(max_retries=retries))

    try:
        response = session.get(url, timeout=(5, 14))  # (connect, read) timeouts in seconds
        response.raise_for_status()  # Raises an HTTPError for 4xx/5xx status codes
    except requests.exceptions.HTTPError as errh:
        print(f"HTTP Error: {errh}")
    except requests.exceptions.ConnectionError as errc:
        print(f"Error Connecting: {errc}")
    except requests.exceptions.Timeout as errt:
        print(f"Timeout Error: {errt}")
    except requests.exceptions.RequestException as err:
        print(f"Oops: Something Else: {err}")
    else:
        return response.text
    return None

url = 'https://www.realtor.com/'
html = get_html(url)
if html:
    soup = BeautifulSoup(html, 'html.parser')
    # Continue with your scraping logic here...
```
## JavaScript (with `axios` and `cheerio`)

In JavaScript, you can use libraries like `axios` for HTTP requests and `cheerio` for parsing HTML. `axios` supports automatic retries via the `axios-retry` library:
```javascript
const axios = require('axios');
const axiosRetry = require('axios-retry');
const cheerio = require('cheerio');

axiosRetry(axios, {
  retries: 3,
  retryDelay: (retryCount) => {
    return retryCount * 1000; // Delay between retries grows with each attempt
  },
  retryCondition: (error) => {
    // Retry only when the server actually responded with 503 or 504;
    // network errors have no `response`, so guard against that first
    return !!error.response &&
      (error.response.status === 503 || error.response.status === 504);
  },
});

async function getHtml(url) {
  try {
    const response = await axios.get(url);
    return response.data;
  } catch (error) {
    console.error(error);
    return null;
  }
}

const url = 'https://www.realtor.com/';
getHtml(url)
  .then(html => {
    if (html) {
      const $ = cheerio.load(html);
      // Continue with your scraping logic here...
    }
  })
  .catch(error => {
    // Handle any other errors here
    console.error(error);
  });
```
## General Tips for Scraping Realtor.com
- Respect `robots.txt`: Before scraping, always check Realtor.com's `robots.txt` file to understand and comply with their scraping policies.
- User-Agent: Set a realistic `User-Agent` header to avoid being identified as a bot (see the headers sketch after this list).
- Headers: Sometimes, adding headers (like `Accept-Language`, `Accept-Encoding`, etc.) that mimic a real web browser can help avoid detection (same sketch below).
- JavaScript Rendering: If Realtor.com renders content with JavaScript, you might need a library like `puppeteer` in JavaScript or `selenium` in Python to handle JS execution (sketched below).
- IP Rotation: If you're making many requests, consider using proxy services to rotate your IP address and avoid IP bans (sketched below).
- Rate Limiting: Implement rate limiting in your scraper to avoid overwhelming the server with too many requests in a short period (sketched below).
- Error Logging: Log errors to a file or a database. This can help you debug and improve your scraper over time (sketched below).
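For the User-Agent and Headers tips, here is a minimal sketch that attaches browser-like headers to the `requests` session from the Python example above. The specific header values are illustrative, not values Realtor.com is known to require:

```python
import requests

# Browser-like headers; the exact values below are just an example
BROWSER_HEADERS = {
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/120.0.0.0 Safari/537.36'),
    'Accept-Language': 'en-US,en;q=0.9',
    'Accept-Encoding': 'gzip, deflate, br',
}

session = requests.Session()
session.headers.update(BROWSER_HEADERS)  # Applied to every request made on this session
```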
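For JavaScript rendering, a minimal `selenium` sketch in Python might look like the following. It assumes Chrome is installed locally; recent Selenium versions can locate a matching driver automatically:

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

options = Options()
options.add_argument('--headless')  # Run Chrome without a visible window

driver = webdriver.Chrome(options=options)
try:
    driver.get('https://www.realtor.com/')
    html = driver.page_source  # HTML after JavaScript has executed
finally:
    driver.quit()

soup = BeautifulSoup(html, 'html.parser')
```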
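For IP rotation, `requests` accepts a `proxies` mapping per request. One simple approach is to pick a proxy at random for each call; the proxy URLs below are placeholders for whichever proxy service you use:

```python
import random
import requests

# Placeholder proxy endpoints; substitute your provider's addresses
PROXIES = [
    'http://proxy1.example.com:8080',
    'http://proxy2.example.com:8080',
]

def get_with_proxy(url):
    proxy = random.choice(PROXIES)  # Use a different exit IP per request
    return requests.get(url, proxies={'http': proxy, 'https': proxy}, timeout=(5, 14))
```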
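For rate limiting, one simple approach is to enforce a minimum interval between requests. The two-second figure here is an arbitrary example, not a documented Realtor.com limit:

```python
import time

MIN_INTERVAL = 2.0  # Seconds between requests; tune this to stay polite
_last_request = 0.0

def polite_get(session, url):
    global _last_request
    elapsed = time.monotonic() - _last_request
    if elapsed < MIN_INTERVAL:
        time.sleep(MIN_INTERVAL - elapsed)  # Wait out the rest of the interval
    _last_request = time.monotonic()
    return session.get(url, timeout=(5, 14))
```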
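For error logging, Python's standard `logging` module can replace the `print` calls in `get_html` above. A minimal file-based setup might look like this:

```python
import logging

logging.basicConfig(
    filename='scraper_errors.log',  # Errors accumulate here for later review
    level=logging.WARNING,
    format='%(asctime)s %(levelname)s %(message)s',
)

# Inside the except blocks of get_html, for example:
# logging.error("HTTP error fetching %s: %s", url, errh)
```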
Always remember that web scraping can have legal and ethical implications. Ensure you are allowed to scrape Realtor.com and that your activities comply with their terms of service and any relevant laws and regulations.