The history and why it is difficult
Making an Instagram scraper used to be easy and straightforward. There was a powerful and easy-to-use API, and you could just load a URL like https://www.instagram.com/nike/?__a=1 and get all the data.
The URL method still works, but there are a few caveats explained below.
In recent years, Instagram has made a lot of changes to its site to make scraping harder.
Here are some of those changes:
- The old API was shut down. The new one is very restrictive and tied to the Facebook API.
- Authentication is required to access the site from datacenter IPs.
- Authentication is required after a few visits from residential IPs.
You can see a history of these changes by reading these Stack Overflow questions and answers:
Working ways to do it
All of the current ways of accessing Instagram data revolve around using the ?__a=1 parameter and Instagram's internal GraphQL API.
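As a minimal sketch of the ?__a=1 approach, the snippet below only builds the URL for a given profile, following the https://www.instagram.com/nike/?__a=1 pattern mentioned earlier. The helper name profileJsonUrl is hypothetical, and the snippet does not send any request, so it says nothing about what Instagram will actually return.

```javascript
// Hypothetical helper: builds the ?__a=1 URL for a profile.
// It only constructs the URL; it does not contact Instagram.
function profileJsonUrl(username) {
  const url = new URL(`/${encodeURIComponent(username)}/`, 'https://www.instagram.com');
  url.searchParams.set('__a', '1');
  return url.toString();
}

console.log(profileJsonUrl('nike'));
// → https://www.instagram.com/nike/?__a=1
```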
Here are some open-source projects doing it:
Another way to do it is to send a sessionid cookie with your requests, but this method violates Instagram's TOS and will get your account banned.
How to do it on WebScraping.AI
To scrape Instagram data, you need to use the proxy=residential parameter in your request. We rotate proxies on every request, so Instagram won't recognise your request as a bot and won't require authentication. The only downside of residential proxies is the price: datacenter proxies are much cheaper.
An example of such a request:
const request = require('request');
const requestPromise = require('request-promise');
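For a fuller picture, here is a rough sketch of what such a request could look like. Only the proxy=residential parameter comes from the text above. The endpoint https://api.webscraping.ai/html and the api_key and url parameter names are assumptions, so check them against the API documentation. The helper names are hypothetical, and the sketch uses the fetch function built into Node.js 18+ in place of the request-promise library.

```javascript
// ASSUMPTION: endpoint and the api_key / url parameter names are illustrative.
// Only proxy=residential is taken from the article.
const API_ENDPOINT = 'https://api.webscraping.ai/html';

// Hypothetical helper: builds the API request URL for a target page.
function buildScrapeUrl(apiKey, targetUrl) {
  const params = new URLSearchParams({
    api_key: apiKey,
    url: targetUrl,
    proxy: 'residential', // ask for a residential proxy
  });
  return `${API_ENDPOINT}?${params.toString()}`;
}

// Hypothetical helper: fetches a profile's ?__a=1 URL through the API.
// Requires Node.js 18+ for the global fetch function.
async function scrapeInstagramProfile(apiKey, username) {
  const target = `https://www.instagram.com/${encodeURIComponent(username)}/?__a=1`;
  const response = await fetch(buildScrapeUrl(apiKey, target));
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return response.text();
}

// Usage (not executed here; YOUR_API_KEY is a placeholder):
// scrapeInstagramProfile('YOUR_API_KEY', 'nike').then(console.log).catch(console.error);
```

Keeping the URL construction in its own function makes it easy to inspect the exact request before spending credits on a residential proxy call.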