When scraping websites such as Immobilien Scout24, it is important to respect the platform's terms of service and any robots.txt rules they have in place. In some cases, scraping might be disallowed or restricted, and proceeding without permission can lead to your IP address being blocked or legal action being taken against you.
If you have determined that scraping Immobilien Scout24 is permissible for your purposes, choosing the right user-agent is crucial to mimic a legitimate browser request. A user-agent is a string that a browser or other client sends to a web server to identify itself.
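To see why this matters, note that HTTP client libraries send an identifying user-agent by default. A quick check with Python's `requests` (a minimal sketch; the exact version in the output depends on your installation):

```python
import requests

# requests sends these headers by default; the User-Agent identifies
# the client as a script (e.g. 'python-requests/2.31.0'), which is
# easy for a server to detect and block.
print(requests.utils.default_headers())
```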
It's generally recommended to use a common, unremarkable user-agent so that your requests blend in with regular traffic and don't raise any red flags. A good approach is to use the user-agent of a popular browser, for example the latest version of Chrome, Firefox, or Safari.
Here's how you can find a user-agent string:

1. Visit a website that lists user-agent strings, like http://useragentstring.com/.
2. Use your own browser's developer tools to find your user-agent string.
To find your browser's user-agent string using developer tools:

- In Chrome or Firefox, press F12 to open the developer tools.
- Click on the "Network" tab.
- Visit any website (or reload the current page).
- Click on any request in the "Name" column.
- Look for the "User-Agent" header in the request headers section.

Alternatively, open the browser's JavaScript console and type `navigator.userAgent` to print the string directly.
Here's an example of a user-agent string for Chrome (this string may be outdated by the time you read this, so it’s advisable to find a current one):
```
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36
```
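If you plan to make repeated requests, you may also want to vary the user-agent rather than sending the identical string every time. A minimal sketch, assuming you maintain a small pool of strings yourself (the values below are illustrative and will likely be outdated; refresh them from a source like useragentstring.com):

```python
import random

# Illustrative user-agent strings for common browsers; these age quickly,
# so replace them with current values before use.
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:97.0) '
    'Gecko/20100101 Firefox/97.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 '
    '(KHTML, like Gecko) Version/15.3 Safari/605.1.15',
]

def random_headers():
    """Return request headers with a randomly chosen user-agent."""
    return {'User-Agent': random.choice(USER_AGENTS)}
```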
Once you have a user-agent string, you can set it in your HTTP request headers when scraping. Here's how to set the user-agent in Python using the `requests` library:
```python
import requests

url = 'https://www.immobilienscout24.de'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36'
}

response = requests.get(url, headers=headers)
print(response.text)  # prints the HTML content of the page
```
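In practice you may want something a little more defensive than the snippet above: a timeout so the request can't hang forever, a status check, and a pause between requests so you don't hammer the server. A sketch along those lines (the one-second delay is an arbitrary choice, not a documented requirement):

```python
import time

import requests

url = 'https://www.immobilienscout24.de'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/98.0.4758.102 Safari/537.36'
}

# A Session reuses the underlying connection and applies
# the headers to every request made through it.
session = requests.Session()
session.headers.update(headers)

response = session.get(url, timeout=10)  # fail fast instead of hanging
response.raise_for_status()              # raise on 4xx/5xx responses
print(response.text[:500])               # first 500 characters of the HTML

time.sleep(1)  # be polite: pause before any follow-up request
```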
And here's an example in JavaScript using `node-fetch`:
```javascript
const fetch = require('node-fetch');

const url = 'https://www.immobilienscout24.de';
const options = {
  headers: {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36'
  }
};

fetch(url, options)
  .then(response => response.text())
  .then(body => {
    console.log(body); // logs the HTML content of the page
  })
  .catch(err => console.error(err)); // surface network errors instead of failing silently
```
Remember to install the `node-fetch` module before running the JavaScript code by executing `npm install node-fetch`. Note that node-fetch version 3 is ESM-only; if you use `require()` as above, install version 2 with `npm install node-fetch@2` (or switch to an `import` statement).
Finally, it's worth repeating that you should always check Immobilien Scout24's terms of service and the `robots.txt` file (usually found at https://www.immobilienscout24.de/robots.txt) to understand their scraping policy and ensure you are not violating any rules.
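Python's standard library can read that file for you; here's a minimal sketch using `urllib.robotparser` (the page path below is a hypothetical example, and the result depends on the site's current rules):

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.set_url('https://www.immobilienscout24.de/robots.txt')
parser.read()  # fetches and parses the robots.txt file

# Check whether a given user-agent may fetch a given URL.
# Match the token against the user-agent you actually send;
# the path below is a hypothetical example.
user_agent = 'Mozilla/5.0'
page = 'https://www.immobilienscout24.de/Suche/'
print(parser.can_fetch(user_agent, page))  # True if allowed, False if not
```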