
Instagram Scraping in 2021

Vlad Mishkin
Why is it difficult

Making an Instagram scraper used to be easy and straightforward. There was a powerful and easy-to-use API, and you could simply load a URL like https://www.instagram.com/nike/?__a=1 to get all the data. The URL method still works, but there are a few caveats explained below.
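For reference, the ?__a=1 trick is just a query parameter appended to a profile URL. A minimal sketch of building such a URL (the username is an example; fetching it from a datacenter IP will typically hit a login wall, as discussed below):

```python
from urllib.parse import urlencode

def profile_json_url(username: str) -> str:
    """Build the ?__a=1 URL that returns a profile's JSON payload."""
    return f"https://www.instagram.com/{username}/?{urlencode({'__a': 1})}"

print(profile_json_url("nike"))
# https://www.instagram.com/nike/?__a=1
```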

In recent years, Instagram has made a lot of changes to its site to make scraping harder.

Here are some of those changes:

  • Their old API was shut down. The new one is very restrictive and linked with Facebook API.
  • Authentication is required to access their site from datacenter IPs
  • Authentication is required after a few visits from residential IPs

You can trace the history of these changes through the relevant StackOverflow questions and answers.

Working ways to do it

All the current ways of accessing Instagram data revolve around the ?__a=1 parameter and Instagram's internal GraphQL API.

Several open-source projects take this approach.

Another way to do it is to pass a sessionid cookie with your requests, but this method violates Instagram's TOS and will get your account banned.

How to do it on WebScraping.AI

To scrape Instagram data, you need to use the proxy=residential parameter on your request. We rotate proxies on every request, so Instagram won't recognise your requests as bot traffic and won't require authentication. The only downside of residential proxies is the price: datacenter proxies are much cheaper.

An example of such a request to get the account data:
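A sketch of such a request using only the standard library (YOUR_API_KEY is a placeholder; the parameters mirror the ones used in the spider below):

```python
import json
import urllib.parse
import urllib.request

API_KEY = 'YOUR_API_KEY'  # placeholder: substitute your own key

params = {
    'api_key': API_KEY,
    'proxy': 'residential',  # residential proxies avoid the login wall
    'url': 'https://www.instagram.com/nike/?__a=1',
}
api_url = f"https://api.webscraping.ai/html?{urllib.parse.urlencode(params)}"
print(api_url)

# Uncomment to execute (requires a valid API key):
# with urllib.request.urlopen(api_url) as resp:
#     account = json.loads(resp.read())
#     print(account['graphql']['user']['username'])
```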

Account posts scraper

Let's use Scrapy to scrape all Instagram posts from a user account.

The spider will have two main types of methods: request methods and parser methods.

Base code

First, let's import the libraries we need and create a spider class:

import scrapy
import urllib.parse
import json
from datetime import datetime

class InstagramAccountSpider(scrapy.Spider):
    name = 'InstagramAccount'
    allowed_domains = ['api.webscraping.ai']
    # self.usernames and self.api_key are set by Scrapy from -a command-line arguments
Request methods

Then we need the request methods:

  • start_requests - it's where the scraping starts; here we request the profile information page and parse it with self.parse_account_page.
  • api_request - this method sends requests to Instagram via the api.webscraping.ai/html API endpoint. We need to specify residential proxies because Instagram requires a login when accessed from normal datacenter proxies.
  • graphql_posts_request - this method sends requests to Instagram's GraphQL API endpoint. The account page fetched in start_requests contains only the first page of posts, so we request the rest of the posts using GraphQL.
# starting with the profile page with first page of posts data
def start_requests(self):
    for username in self.usernames.split(","):
        profile_url = f"https://www.instagram.com/{username}/?__a=1"
        yield self.api_request(profile_url, self.parse_account_page)

# wrapping the URL in a api.webscraping.ai API request to avoid login
def api_request(self, target_url, parse_callback, meta=None):
    self.logger.info('Requesting: %s', target_url)
    api_params = {'api_key': self.api_key, 'proxy': 'residential', 'timeout': 20000, 'url': target_url}
    api_url = f"https://api.webscraping.ai/html?{urllib.parse.urlencode(api_params)}"
    return scrapy.Request(api_url, callback=parse_callback, meta=meta)

# posts GraphQL pagination requests
def graphql_posts_request(self, user_id, end_cursor):
    graphql_variables = {'id': user_id, 'first': 12, 'after': end_cursor}
    # query_hash is a constant for this type of query
    graphql_params = {'query_hash': 'e769aa130647d2354c40ea6a439bfc08', 'variables': json.dumps(graphql_variables)}
    url = f"https://www.instagram.com/graphql/query/?{urllib.parse.urlencode(graphql_params)}"
    return self.api_request(url, self.parse_graphql_posts, meta={'user_id': user_id})
Parser methods

Now we need the parser methods:

  • parse_account_page - this method will parse the profile page, yield the posts from the first page and start GraphQL requests to get more pages.
  • parse_graphql_posts - this method will parse the GraphQL response, yield the posts and continue to the next page.
# parsing the initial profile page
def parse_account_page(self, response):
    self.logger.info('Parsing account page...')
    all_data = json.loads(response.text)
    # self.logger.info('Parsing account data: %s', all_data)
    user_data = all_data['graphql']['user']

    for post_data in user_data['edge_owner_to_timeline_media']['edges']:
        # multiple media will be returned in case of a carousel
        for parsed_post in self.parse_post(post_data):
            yield parsed_post

    if user_data['edge_owner_to_timeline_media']['page_info']['has_next_page']:
        end_cursor = user_data['edge_owner_to_timeline_media']['page_info']['end_cursor']
        user_id = user_data['id']
        yield self.graphql_posts_request(user_id, end_cursor)

# parsing the paginated posts
def parse_graphql_posts(self, response):
    self.logger.info('Parsing GraphQL response...')
    posts_data = json.loads(response.text)
    self.logger.info('Parsing GraphQL data: %s', posts_data)
    timeline_media = posts_data['data']['user']['edge_owner_to_timeline_media']

    for post in timeline_media['edges']:
        # multiple media will be returned in case of a carousel
        for parsed_post in self.parse_post(post):
            yield parsed_post

    if timeline_media['page_info']['has_next_page']:
        user_id = response.meta['user_id']
        end_cursor = timeline_media['page_info']['end_cursor']
        yield self.graphql_posts_request(user_id, end_cursor)

And finally, the parse_post method parses posts from both types of pages. In the case of a carousel post, each media item is returned separately.

# extracting the post information from JSON
def parse_post(self, post_data):
    # self.logger.info('Parsing post data: %s', post_data)
    post_data = post_data['node']

    base_post = {
        'username': post_data['owner']['username'],
        'user_id': post_data['owner']['id'],
        'post_id': post_data['id'],
        'is_video': post_data['is_video'],
        'media_url': post_data['video_url'] if post_data['is_video'] else post_data['display_url'],
        'like_count': post_data['edge_media_preview_like']['count'],
        'comment_count': post_data['edge_media_to_comment']['count'],
        'caption': post_data['edge_media_to_caption']['edges'][0]['node']['text'] if post_data['edge_media_to_caption']['edges'] else None,
        'location': post_data['location']['name'] if post_data['location'] else None,
        'timestamp': post_data['taken_at_timestamp'],
        'date_posted': datetime.fromtimestamp(post_data['taken_at_timestamp']).strftime("%d-%m-%Y %H:%M:%S"),
        'post_url': f"https://www.instagram.com/p/{post_data['shortcode']}/",
        'thumbnail_url': post_data['thumbnail_resources'][-1]['src'],
    }

    posts = [base_post]

    # adding secondary media for carousels with multiple photos
    if "edge_sidecar_to_children" in post_data:
        for carousel_item in post_data["edge_sidecar_to_children"]["edges"]:
            carousel_post = {
                **base_post,
                'post_id': carousel_item['node']['id'],
                'thumbnail_url': carousel_item['node']['display_url'],
                'media_url': carousel_item['node']['display_url'],
            }
            posts.append(carousel_post)

    return posts
How to run it

You can clone the repository from https://github.com/webscraping-ai/instagram-scraper-python and run the crawler with the following command:

$ scrapy crawl InstagramAccount -o output.csv -a usernames=nike,microsoft -a api_key=test-api-key
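The -o flag writes the scraped items to output.csv. A quick sanity check of such a file, sketched with the standard csv module (the column names match the keys produced by parse_post; the sample rows are illustrative, not real posts):

```python
import csv
import io

# Stand-in for open('output.csv') with illustrative data
sample = io.StringIO(
    "username,post_id,is_video,like_count,post_url\n"
    "nike,123,False,1000,https://www.instagram.com/p/abc/\n"
)

for row in csv.DictReader(sample):
    print(row['username'], row['like_count'], row['post_url'])
```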


WebScraping.AI provides rotating proxies, Chrome rendering, and a built-in HTML parser for web scraping.