Instagram Scraping in 2022

Posted by Vlad Mishkin | February 5, 2023

FAQ (Updated in October 2022)

Why do ?__a=1 links return an error (or render HTML pages instead of JSON)?

After the recent update, you need to pass an additional __d=1 parameter.

Also see the examples below.
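For instance, a profile request with both parameters might look like this (the username is a placeholder; Instagram's exact response for such URLs can change at any time):

```python
username = "nike"  # placeholder account name
url = f"https://www.instagram.com/{username}/?__a=1&__d=1"
print(url)  # https://www.instagram.com/nike/?__a=1&__d=1
```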

Via our API, you need to use the proxy=residential parameter (to use residential proxies) and js=false (to use cheaper requests without JS rendering).

Instagram detects datacenter IPs and requires login for them. It will also require login after a few calls from the same IP.

To get around it, you can use our API with the proxy=residential parameter, and we will rotate the IP on every request to avoid blocks.

Why is window._sharedData no longer available on HTML pages?

Instagram has updated their HTML pages and got rid of that data. You can get the same data using ?__a=1 links.

Why Instagram scraping is difficult

Making an Instagram scraper used to be easy and straightforward. There was a powerful and easy-to-use API, and you could just load a URL and get all the data. The URL method still works, but there are a few caveats, explained below.

Over the recent years, Instagram has made a lot of changes to their site to make scraping harder.

Here are some of those changes:

  • Their old API was shut down; the new one is very restrictive and tied to the Facebook API.
  • Authentication is required to access their site from datacenter IPs.
  • Authentication is required after a few visits from residential IPs.

You can trace the history of these changes through the related StackOverflow questions and answers.

Working ways to do it

All the current ways of accessing Instagram data revolve around the ?__a=1 parameter and their internal GraphQL API.
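As a sketch, the two URL shapes look like this (the query_hash value is the timeline-media hash used in the spider later in this article; the user id is a placeholder):

```python
import json
import urllib.parse

# Profile endpoint: returns account data plus the first page of posts
profile_url = "https://www.instagram.com/nike/?__a=1&__d=1"

# Internal GraphQL endpoint: paginates through the remaining posts
variables = {"id": "13460080", "first": 12, "after": None}  # placeholder user id
graphql_url = "https://www.instagram.com/graphql/query/?" + urllib.parse.urlencode({
    "query_hash": "e769aa130647d2354c40ea6a439bfc08",  # timeline media query
    "variables": json.dumps(variables),
})
print(graphql_url)
```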

Several open-source projects take this approach.

Another way to do it is to use a sessionid token cookie in your requests, but this method violates Instagram's TOS and will get your account banned.

How to do it on WebScraping.AI

To scrape Instagram data you need to use the proxy=residential parameter on your request. We rotate proxies on every request, so Instagram won't recognise your request as a bot and won't require auth. The only downside of using residential proxies is the price: datacenter proxies are much cheaper.

An example of such a request to get the account data:
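A minimal sketch of that request, assuming WebScraping.AI's html endpoint and a placeholder API key:

```python
import urllib.parse

params = {
    "api_key": "test-api-key",  # placeholder key
    "proxy": "residential",     # rotate residential IPs to avoid the login wall
    "url": "https://www.instagram.com/nike/?__a=1&__d=1",
}
request_url = "https://api.webscraping.ai/html?" + urllib.parse.urlencode(params)
# response = requests.get(request_url)  # would return the account JSON
print(request_url)
```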

Account posts scraper

Let's use Scrapy to scrape all Instagram posts from a user account.

There are 2 main types of methods: requests and parsers.

Base code

First, let's import the libraries we need and create a spider class:

import scrapy
import urllib.parse
import json
from datetime import datetime

class InstagramAccountSpider(scrapy.Spider):
    name = 'InstagramAccount'
    # all requests go through the scraping API, so that's the only allowed domain
    allowed_domains = ['api.webscraping.ai']
Request methods

Then we need the request methods:

  • start_requests - it's where the scraping starts; here we'll request the profile information page and parse it with self.parse_account_page.
  • api_request - this method will send requests to Instagram via the WebScraping.AI API endpoint. We need to specify residential proxies as Instagram requires login on normal datacenter proxies.
  • graphql_posts_request - this method will send requests to Instagram via the GraphQL API endpoint. The account page on start_requests contains only the first page of posts, so we need to request the rest of the posts using GraphQL.
    # starting with the profile page with first page of posts data
    def start_requests(self):
        for username in self.usernames.split(","):
            profile_url = f"https://www.instagram.com/{username}/?__a=1&__d=1"
            yield self.api_request(profile_url, self.parse_account_page)

    # wrapping the URL in an API request to avoid login
    def api_request(self, target_url, parse_callback, meta=None):
        self.logger.info('Requesting: %s', target_url)
        api_params = {'api_key': self.api_key, 'proxy': 'residential', 'timeout': 20000, 'url': target_url}
        api_url = f"https://api.webscraping.ai/html?{urllib.parse.urlencode(api_params)}"
        return scrapy.Request(api_url, callback=parse_callback, meta=meta)

    # posts GraphQL pagination requests
    def graphql_posts_request(self, user_id, end_cursor):
        graphql_variables = {'id': user_id, 'first': 12, 'after': end_cursor}
        # query_hash is a constant for this type of query
        graphql_params = {'query_hash': 'e769aa130647d2354c40ea6a439bfc08', 'variables': json.dumps(graphql_variables)}
        url = f"https://www.instagram.com/graphql/query/?{urllib.parse.urlencode(graphql_params)}"
        return self.api_request(url, self.parse_graphql_posts, meta={'user_id': user_id})
Parser methods

Now we need the parser methods:

  • parse_account_page - this method will parse the profile page, yield the posts from the first page and start GraphQL requests to get more pages.
  • parse_graphql_posts - this method will parse the GraphQL response, yield the posts and continue to the next page.
    # parsing the initial profile page
    def parse_account_page(self, response):
        self.logger.info('Parsing account page...')
        all_data = json.loads(response.text)
        # self.logger.info('Parsing account data: %s', all_data)
        user_data = all_data['graphql']['user']
        for post_data in user_data['edge_owner_to_timeline_media']['edges']:
            # multiple media will be returned in case of a carousel
            for parsed_post in self.parse_post(post_data):
                yield parsed_post
        if user_data['edge_owner_to_timeline_media']['page_info']['has_next_page']:
            end_cursor = user_data['edge_owner_to_timeline_media']['page_info']['end_cursor']
            user_id = user_data['id']
            yield self.graphql_posts_request(user_id, end_cursor)

    # parsing the paginated posts
    def parse_graphql_posts(self, response):
        self.logger.info('Parsing GraphQL response...')
        posts_data = json.loads(response.text)
        self.logger.info('Parsing GraphQL data: %s', posts_data)
        timeline_media = posts_data['data']['user']['edge_owner_to_timeline_media']
        for post in timeline_media['edges']:
            # multiple media will be returned in case of a carousel
            for parsed_post in self.parse_post(post):
                yield parsed_post
        if timeline_media['page_info']['has_next_page']:
            user_id = response.meta['user_id']
            end_cursor = timeline_media['page_info']['end_cursor']
            yield self.graphql_posts_request(user_id, end_cursor)

And finally, the parse_post method parses posts from both types of pages. In the case of a carousel post, we return each media item separately.

    # extracting the post information from JSON
    def parse_post(self, post_data):
        # self.logger.info('Parsing post data: %s', post_data)
        post_data = post_data['node']
        base_post = {
            'username': post_data['owner']['username'],
            'user_id': post_data['owner']['id'],
            'post_id': post_data['id'],
            'is_video': post_data['is_video'],
            'media_url': post_data['video_url'] if post_data['is_video'] else post_data['display_url'],
            'like_count': post_data['edge_media_preview_like']['count'],
            'comment_count': post_data['edge_media_to_comment']['count'],
            'caption': post_data['edge_media_to_caption']['edges'][0]['node']['text'] if post_data['edge_media_to_caption']['edges'] else None,
            'location': post_data['location']['name'] if post_data['location'] else None,
            'timestamp': post_data['taken_at_timestamp'],
            'date_posted': datetime.fromtimestamp(post_data['taken_at_timestamp']).strftime("%d-%m-%Y %H:%M:%S"),
            'post_url': f"https://www.instagram.com/p/{post_data['shortcode']}/",
            'thumbnail_url': post_data['thumbnail_resources'][-1]['src'],
        }
        posts = [base_post]
        # adding secondary media for carousels with multiple photos
        if "edge_sidecar_to_children" in post_data:
            for carousel_item in post_data["edge_sidecar_to_children"]["edges"]:
                carousel_post = {
                    # inherit the shared fields from the base post
                    **base_post,
                    'post_id': carousel_item['node']['id'],
                    'thumbnail_url': carousel_item['node']['display_url'],
                    'media_url': carousel_item['node']['display_url'],
                }
                posts.append(carousel_post)
        return posts
How to run it

You can clone the repository and run the crawler with the following command:

$ scrapy crawl InstagramAccount -o output.csv -a usernames=nike,microsoft -a api_key=test-api-key
