How to Scrape Instagram in 2023

Posted by Vlad Mishkin | February 5, 2023

Updated: June 2023

Welcome to the world of Instagram scraping, a powerful tool that can revolutionize the way you access, analyze, and utilize data from one of the most popular social media platforms today. As a developer, you might have come across the challenges that Instagram’s API restrictions present, making it difficult to extract the data you need efficiently. In this article, we will delve into the ever-growing world of Instagram data scraping and explore how you can overcome the limitations imposed by the platform's API.

Instagram boasts over a billion monthly active users, making it an invaluable source of information for businesses, marketers, and researchers alike. Unfortunately, these massive amounts of data are not readily accessible due to the platform's stringent API restrictions, which limit the number of requests made per user and the types of data that can be requested. These limitations can hinder the development of innovative applications, impede market research, and slow down the overall data extraction process.

Fear not, for there is a solution! By using specialized Instagram scraping tools and techniques, you can bypass these API restrictions and unlock the full potential of Instagram’s wealth of data. In this comprehensive guide, we will explore the intricacies of Instagram scraping.

Kinds of Instagram data that can be scraped

With the right tools and techniques, you can extract a variety of valuable information from Instagram. Here are some of the most common types of data that can be obtained through Instagram scraping:

Profile information

This includes basic details such as username, user ID, full name, biography, profile picture URL, follower and following counts, and the total number of posts. You can also access additional information like the user's external website link, contact information and phone numbers (if available), and profile privacy status.

Instagram posts data

For each post, you can extract information such as post ID, shortcode, caption, media type (photo, video, or carousel), image or video URLs, date and time of posting, number of likes, and comments. Additionally, you can obtain the location where the post was made (if available) and the tagged users in the post.

Instagram comments

You can extract comments from posts, which includes the comment text, commenter's username, user ID, and the date and time the comment was made.

Instagram hashtags

Extracting hashtags from posts and their related metadata allows you to analyze popular and trending hashtags, as well as their usage patterns.

Instagram stories and highlights

You can also access information related to Instagram Stories and Highlights, such as the media URLs and metadata, the date and time of posting, and the list of users who have viewed the story (if the data is publicly available).

Instagram followers and following lists

You can extract lists of followers and the users that an account is following, which includes their usernames, user IDs, and basic profile information.

Geolocation data

If a user has tagged their location in a post, you can extract the related geolocation data, including location name, coordinates (latitude and longitude), and location ID. This information can be useful for analyzing user activities and preferences across different geographical areas.

Engagement metrics

By aggregating data on likes, comments, and followers, you can calculate engagement metrics such as the engagement rate, average likes, and comments per post for specific Instagram users.
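As a sketch of how these aggregations work, here is a small helper that computes average likes, average comments, and the engagement rate from a list of scraped posts. The field names (`like_count`, `comment_count`) mirror the scraper output later in this article; the follower count and sample numbers are illustrative.

```python
def engagement_metrics(posts, follower_count):
    """Return average likes/comments per post and engagement rate (%)."""
    if not posts or not follower_count:
        return {'avg_likes': 0, 'avg_comments': 0, 'engagement_rate': 0.0}
    avg_likes = sum(p['like_count'] for p in posts) / len(posts)
    avg_comments = sum(p['comment_count'] for p in posts) / len(posts)
    # engagement rate: average interactions per post relative to audience size
    rate = (avg_likes + avg_comments) / follower_count * 100
    return {'avg_likes': avg_likes, 'avg_comments': avg_comments,
            'engagement_rate': round(rate, 2)}

posts = [{'like_count': 120, 'comment_count': 10},
         {'like_count': 80, 'comment_count': 6}]
print(engagement_metrics(posts, follower_count=5000))
# → {'avg_likes': 100.0, 'avg_comments': 8.0, 'engagement_rate': 2.16}
```

Definitions of "engagement rate" vary; this one divides average interactions per post by follower count, which is the most common convention.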

Ads and sponsored posts

With advanced scraping techniques, you can identify and extract information related to ads and sponsored posts, such as the advertiser's details, ad creative, and ad targeting information.


Does Instagram allow scraping?

Instagram does not generally allow web scraping. According to Instagram's terms of service and platform policy, users are prohibited from accessing or using Instagram's private API, scraping, caching, or storing any Instagram content (including user information, photos, and videos).

Using automated means, such as bots or scrapers, to access, collect, or use Instagram's data is a violation of their terms and could result in legal consequences, or in the suspension or termination of your account. For that reason, this article covers only methods that don't require login information.

How to scrape Instagram without getting banned?

There are two main types of bans Instagram can apply: an account ban and an IP address ban.

  • Account ban: your Instagram account can be banned if its login credentials are used in a scraping script. To avoid this ban, none of the methods in this article require login.

  • IP ban: if you send too many requests to Instagram from the same IP address, Instagram will temporarily block it. To avoid this ban, the WebScraping.AI API automatically rotates the IP address on every request.

The ?__a=1 parameter in an Instagram link is a query parameter used to access the JSON data of a specific Instagram page, typically a user's profile or a single post. By appending ?__a=1 to the end of an Instagram URL, you can retrieve the page's data in structured JSON format, making it easier to extract and parse information programmatically. It is part of Instagram's unofficial API and is extremely useful if you want to scrape Instagram and extract data.

For example, if you wanted to access the JSON data of a user's Instagram account, you would add ?__a=1 to the end of the user's profile URL:

Similarly, to access the JSON data of a specific post, you would add ?__a=1 to the end of the post's URL:

Using the ?__a=1 parameter can be helpful for developers and data scrapers looking to extract specific information from Instagram, such as user details, post metadata, or comments.
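A minimal sketch of what such URLs look like, assuming the URL patterns described above; the username "nike" and the post shortcode are illustrative placeholders, not endpoints verified here:

```python
# Building ?__a=1 URLs for a profile and a post.
username = "nike"            # illustrative username
shortcode = "CxYzAbC1234"    # hypothetical post shortcode

profile_json_url = f"https://www.instagram.com/{username}/?__a=1"
post_json_url = f"https://www.instagram.com/p/{shortcode}/?__a=1"

print(profile_json_url)  # → https://www.instagram.com/nike/?__a=1
print(post_json_url)
```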

Why are ?__a=1 links not working anymore?

(Or why they render HTML pages instead of JSON.) After a recent update, you need to pass an additional ...&__d=dis parameter:

Also see the examples below.
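A quick sketch of building the updated URL with both parameters, again using a placeholder username:

```python
from urllib.parse import urlencode

# Appending both ?__a=1 and the newer __d=dis parameter to a profile URL.
username = "nike"  # illustrative placeholder
params = urlencode({'__a': 1, '__d': 'dis'})
url = f"https://www.instagram.com/{username}/?{params}"
print(url)  # → https://www.instagram.com/nike/?__a=1&__d=dis
```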

Via our API, you need to use the proxy=residential parameter (to use residential proxies) and js=false (to use cheaper requests without JS rendering).

Instagram detects datacenter IPs and requires login for them. It will also require login after a few calls from the same IP.

To get around it, you can use our API with proxy=residential parameter and we will rotate the IP on every request to avoid blocks.

Why is window._sharedData no longer available on raw HTML pages?

Instagram has updated its HTML pages and removed that data. You can get the same data using ?__a=1 links.

In June 2023, Instagram started to limit the number of requests that a single IP address can make to its private APIs without login. Once that limit is reached, the API returns this error: {"message":"Please wait a few minutes before you try again.","require_login":true,"status":"fail"}
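A scraper can watch for this specific error body to know when to back off or rotate IPs. Here is a minimal detector based on the JSON shown above:

```python
import json

def is_rate_limited(response_text):
    """Detect Instagram's 'wait a few minutes' rate-limit error."""
    try:
        data = json.loads(response_text)
    except json.JSONDecodeError:
        return False  # HTML page or other non-JSON response
    return data.get('status') == 'fail' and data.get('require_login') is True

body = '{"message":"Please wait a few minutes before you try again.","require_login":true,"status":"fail"}'
print(is_rate_limited(body))  # → True
```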

As a workaround, you can switch to these endpoints to replace ?__a=1 URLs:

  • To get account information:
  • To get post information:
    (replace CpuD9unJ7kn with your post shortcode)
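As a sketch, these are the unofficial endpoints commonly used for this purpose at the time of writing; they are undocumented by Instagram, may change without notice, and should be treated as assumptions rather than a stable API:

```python
from urllib.parse import urlencode

# Unofficial endpoints (assumptions; undocumented and subject to change).
def account_info_url(username):
    return ("https://i.instagram.com/api/v1/users/web_profile_info/?"
            + urlencode({'username': username}))

def post_info_url(shortcode):
    return f"https://www.instagram.com/p/{shortcode}/?__a=1&__d=dis"

print(account_info_url("instagram"))
print(post_info_url("CpuD9unJ7kn"))  # shortcode from the article
```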

Why is it difficult to scrape Instagram?

Making an Instagram scraper used to be straightforward. There was a powerful and easy-to-use API, and you could just load a URL and get all the data. The URL method still works, but there are a few caveats explained below.

In recent years, Instagram has made many changes to its site to make scraping harder.

Here are some of those changes:

  • Their old API was shut down. The new one is very restrictive and linked with Facebook API.

  • Authentication is required to access their site from datacenter IPs

  • Authentication is required after a few visits from residential IPs

You can see a history of these changes by reading these StackOverflow questions and answers about Instagram scraping:

Working ways of scraping Instagram

All current ways of accessing Instagram data revolve around ?__a=1 URLs and Instagram's internal GraphQL API.

Here are some open-source projects doing it:

Another way to do it is to use a sessionid cookie in your requests, but this method violates Instagram's TOS and will get your account banned.

How to do it on WebScraping.AI

To scrape Instagram data, you need to use the proxy=residential parameter on your request. We rotate proxies on every request, so Instagram won't recognise your requests as bot traffic and won't require authentication. The only downside of residential proxies is the price: datacenter proxies are much cheaper.

An example of such a request to get the profile data:
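A minimal sketch of constructing such a request; 'YOUR_API_KEY' and the target profile URL are placeholders, and the parameter names follow the article's description of the API:

```python
from urllib.parse import urlencode

# Building a WebScraping.AI API request that fetches an Instagram
# profile as JSON through a rotating residential proxy.
target = "https://www.instagram.com/nike/?__a=1&__d=dis"
params = {
    'api_key': 'YOUR_API_KEY',     # placeholder
    'url': target,
    'proxy': 'residential',        # residential IPs, rotated on every request
    'js': 'false',                 # cheaper requests without JS rendering
}
request_url = f"https://api.webscraping.ai/html?{urlencode(params)}"
print(request_url)

# To actually fetch it (requires a valid API key):
# import urllib.request
# body = urllib.request.urlopen(request_url).read().decode()
```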

Instagram profile and posts scraper in Python

Here is an example of how to build an Instagram web scraper. Let's use Scrapy to scrape all Instagram posts from a user profile.

Scrapy is an open-source web crawling and data extraction framework built in Python, widely used for web scraping tasks and collecting information. It provides a simple and efficient way to navigate websites, follow links, and extract structured data from web pages. Scrapy is highly customizable, allowing you to create powerful web spiders tailored to your specific needs.

We will use two main types of methods: request methods and parser methods.

Base code

First, let's import the libraries we need and create a spider class. A spider is a class in Scrapy that defines how to follow URLs and extract data from web pages.

import scrapy
import urllib
import json
from datetime import datetime

class InstagramAccountSpider(scrapy.Spider):
    name = 'InstagramAccount'
    allowed_domains = ['api.webscraping.ai']

Request methods

Then we need the request methods:

  • start_requests - this is where the scraping starts; here we request the profile information page and parse it with self.parse_account_page.

  • api_request - this method sends requests to Instagram through the WebScraping.AI API endpoint. We need to specify residential proxies, as Instagram requires login on normal datacenter proxies.

  • graphql_posts_request - this method sends requests to Instagram's GraphQL API endpoint. The account page requested in start_requests contains only the first page of posts, so we request the rest of the posts using GraphQL.

# starting with the profile page with first page of posts data
def start_requests(self):
    for username in self.usernames.split(","):
        profile_url = f"https://www.instagram.com/{username}/?__a=1"
        yield self.api_request(profile_url, self.parse_account_page)

# wrapping the URL in a WebScraping.AI API request to avoid login
def api_request(self, target_url, parse_callback, meta=None):
    self.logger.info('Requesting: %s', target_url)
    api_params = {'api_key': self.api_key, 'proxy': 'residential', 'timeout': 20000, 'url': target_url}
    api_url = f"https://api.webscraping.ai/html?{urllib.parse.urlencode(api_params)}"
    return scrapy.Request(api_url, callback=parse_callback, meta=meta)

# posts GraphQL pagination requests
def graphql_posts_request(self, user_id, end_cursor):
    graphql_variables = {'id': user_id, 'first': 12, 'after': end_cursor}
    # query_hash is a constant for this type of query
    graphql_params = {'query_hash': 'e769aa130647d2354c40ea6a439bfc08', 'variables': json.dumps(graphql_variables)}
    url = f"https://www.instagram.com/graphql/query/?{urllib.parse.urlencode(graphql_params)}"
    return self.api_request(url, self.parse_graphql_posts, meta={'user_id': user_id})

Parser methods

Now we need the parser methods:

  • parse_account_page - this method will parse the profile page, yield the posts from the first page and start GraphQL requests to get more pages.

  • parse_graphql_posts - this method will parse the GraphQL response, yield the posts and continue to the next page.

# parsing the initial profile page
def parse_account_page(self, response):
    self.logger.info('Parsing account page...')
    all_data = json.loads(response.text)
    # self.logger.info('Parsing account data: %s', all_data)
    user_data = all_data['graphql']['user']

    for post_data in user_data['edge_owner_to_timeline_media']['edges']:
        # multiple media will be returned in case of a carousel
        for parsed_post in self.parse_post(post_data):
            yield parsed_post

    if user_data['edge_owner_to_timeline_media']['page_info']['has_next_page']:
        end_cursor = user_data['edge_owner_to_timeline_media']['page_info']['end_cursor']
        user_id = user_data['id']
        yield self.graphql_posts_request(user_id, end_cursor)

# parsing the paginated posts
def parse_graphql_posts(self, response):
    self.logger.info('Parsing GraphQL response...')
    posts_data = json.loads(response.text)
    self.logger.info('Parsing GraphQL data: %s', posts_data)
    timeline_media = posts_data['data']['user']['edge_owner_to_timeline_media']

    for post in timeline_media['edges']:
        # multiple media will be returned in case of a carousel
        for parsed_post in self.parse_post(post):
            yield parsed_post

    if timeline_media['page_info']['has_next_page']:
        user_id = response.meta['user_id']
        end_cursor = timeline_media['page_info']['end_cursor']
        yield self.graphql_posts_request(user_id, end_cursor)

And finally the parse_post method to parse posts from both types of pages. In case of a carousel post, we will return each media separately.

# extracting the post information from JSON
def parse_post(self, post_data):
    # self.logger.info('Parsing post data: %s', post_data)
    post_data = post_data['node']

    base_post = {
        'username': post_data['owner']['username'],
        'user_id': post_data['owner']['id'],
        'post_id': post_data['id'],
        'is_video': post_data['is_video'],
        'media_url': post_data['video_url'] if post_data['is_video'] else post_data['display_url'],
        'like_count': post_data['edge_media_preview_like']['count'],
        'comment_count': post_data['edge_media_to_comment']['count'],
        'caption': post_data['edge_media_to_caption']['edges'][0]['node']['text'] if post_data['edge_media_to_caption']['edges'] else None,
        'location': post_data['location']['name'] if post_data['location'] else None,
        'timestamp': post_data['taken_at_timestamp'],
        'date_posted': datetime.fromtimestamp(post_data['taken_at_timestamp']).strftime("%d-%m-%Y %H:%M:%S"),
        'post_url': f"https://www.instagram.com/p/{post_data['shortcode']}/",
        'thumbnail_url': post_data['thumbnail_resources'][-1]['src'],
    }

    posts = [base_post]

    # adding secondary media for carousels with multiple photos
    if "edge_sidecar_to_children" in post_data:
        for carousel_item in post_data["edge_sidecar_to_children"]["edges"]:
            carousel_post = {
                **base_post,  # inherit the shared fields from the main post
                'post_id': carousel_item['node']['id'],
                'thumbnail_url': carousel_item['node']['display_url'],
                'media_url': carousel_item['node']['display_url'],
            }
            posts.append(carousel_post)

    return posts

How to run this code

You can clone the repository and run the crawler code with the following command:

$ scrapy crawl InstagramAccount -o output.csv -a usernames=nike,microsoft -a api_key=test-api-key

It will take Instagram usernames as input and extract all publicly available data and posts from them.

The resulting CSV file with the scraped data will look like this:

Resulting CSV example


You might also enjoy

Twitter Scraping in 2023


Twitter is one of the most popular social media platforms, with millions of users tweeting and sharing their thoughts and opinions every day. Here is how to scrape it.

Posted by Vlad Mishkin | March 8, 2023
Web Scraping with Python


A tutorial about web scraping in Python with examples. We will take a look at the most popular Python tools for web scraping: Requests, BeautifulSoup, lxml and others.

Posted by Vlad Mishkin | February 5, 2023