How Do I Handle Websites That Use Geolocation Restrictions?

Geolocation restrictions are a common challenge when scraping websites that serve different content based on users' geographic locations. These restrictions can block access entirely or serve limited content to users from certain regions. This guide covers various strategies to handle geolocation-restricted websites effectively.

Understanding Geolocation Restrictions

Websites implement geolocation restrictions through several methods:

  • IP-based geolocation: Determining location from the user's IP address
  • DNS geoblocking: Redirecting users to region-specific servers
  • Browser geolocation API: Using HTML5 geolocation features
  • Regional content delivery networks (CDNs): Serving content from location-specific servers
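
Before picking a workaround, it helps to identify which of these mechanisms a site is using. The sketch below is an illustrative heuristic only (the function name, signal names, and regexes are our own, not from any library): it inspects a page's HTML and its final post-redirect URL for common geolocation signals.

```javascript
// Illustrative heuristic: given a page's HTML and the final URL after
// redirects, flag which geo-restriction mechanisms appear to be in play.
function detectGeoSignals(html, finalUrl) {
  const url = new URL(finalUrl);
  return {
    // The page's scripts call the HTML5 Geolocation API
    usesGeolocationApi: /navigator\.geolocation/.test(html),
    // Redirected to a country-coded subdomain or path (e.g. uk.example.com, /de/)
    regionalRedirect:
      /^[a-z]{2}\./.test(url.hostname) || /^\/[a-z]{2}(\/|$)/.test(url.pathname),
    // Common block-page phrasing
    blockedMessage: /not available in your (country|region)/i.test(html)
  };
}
```

Real sites vary widely, so treat the output as a starting point for manual inspection rather than a definitive classification.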

Method 1: Using Proxy Servers

Proxy servers are the most common solution for bypassing geolocation restrictions. They route your requests through servers in different geographic locations.
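
One practical wrinkle: Chromium's `--proxy-server` flag accepts only `host:port`, while the credentials must go through `page.authenticate()`. A small helper (our own sketch, using Node's built-in URL parser; the proxy URL format shown is the conventional `http://user:pass@host:port`) can split a full proxy URL into those two pieces:

```javascript
// Split "http://user:pass@host:port" into the pieces Puppeteer needs:
// --proxy-server takes host:port, page.authenticate() takes the credentials.
function splitProxyUrl(proxyUrl) {
  const { hostname, port, username, password } = new URL(proxyUrl);
  return {
    server: `${hostname}:${port}`,
    credentials: username
      ? { username: decodeURIComponent(username), password: decodeURIComponent(password) }
      : null
  };
}
```

Launch with `args: ['--proxy-server=' + server]` and, when `credentials` is non-null, pass it to `page.authenticate()` before navigating.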

Residential Proxies with Puppeteer

const puppeteer = require('puppeteer');

async function scrapeWithProxy() {
  const browser = await puppeteer.launch({
    args: [
      '--proxy-server=proxy-server:port',
      '--no-sandbox',
      '--disable-setuid-sandbox'
    ]
  });

  const page = await browser.newPage();

  // Authenticate with proxy if required
  await page.authenticate({
    username: 'proxy-username',
    password: 'proxy-password'
  });

  // Grant the geolocation permission first so the page doesn't prompt for it
  await browser.defaultBrowserContext()
    .overridePermissions('https://geo-restricted-site.com', ['geolocation']);

  // Set geolocation manually
  await page.setGeolocation({
    latitude: 40.7128,  // New York coordinates
    longitude: -74.0060
  });

  try {
    await page.goto('https://geo-restricted-site.com');
    const content = await page.content();
    console.log('Content retrieved successfully');
    return content;
  } catch (error) {
    console.error('Failed to access site:', error);
  } finally {
    await browser.close();
  }
}

Using HTTP Proxies with Axios

const axios = require('axios');
const { HttpsProxyAgent } = require('https-proxy-agent'); // v7+ named export; v5 and earlier exported the class directly

async function fetchWithProxy(url, proxyUrl) {
  const agent = new HttpsProxyAgent(proxyUrl);

  try {
    const response = await axios.get(url, {
      httpsAgent: agent,
      headers: {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
        'Accept-Language': 'en-US,en;q=0.9',
        'Accept-Encoding': 'gzip, deflate, br'
      },
      timeout: 30000
    });

    return response.data;
  } catch (error) {
    console.error('Proxy request failed:', error.message);
    throw error;
  }
}

// Usage example
const proxyUrl = 'http://username:password@proxy-server:port';
fetchWithProxy('https://geo-restricted-site.com', proxyUrl)
  .then(data => console.log('Success:', data.length))
  .catch(err => console.error('Error:', err));

Method 2: Overriding Browser Geolocation

When dealing with websites that use the HTML5 Geolocation API, you can override the browser's geolocation settings.

Puppeteer Geolocation Override

const puppeteer = require('puppeteer');

async function overrideGeolocation() {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();

  // Grant geolocation permission
  const context = browser.defaultBrowserContext();
  await context.overridePermissions('https://target-website.com', ['geolocation']);

  // Set fake geolocation
  await page.setGeolocation({
    latitude: 51.5074,  // London coordinates
    longitude: -0.1278,
    accuracy: 100
  });

  // Override navigator.geolocation
  await page.evaluateOnNewDocument(() => {
    navigator.geolocation.getCurrentPosition = function(success, error, options) {
      success({
        coords: {
          latitude: 51.5074,
          longitude: -0.1278,
          accuracy: 100,
          altitude: null,
          altitudeAccuracy: null,
          heading: null,
          speed: null
        },
        timestamp: Date.now()
      });
    };
  });

  await page.goto('https://target-website.com');
  await browser.close();
}

JavaScript Geolocation Spoofing

// Inject this script to spoof geolocation
function spoofGeolocation(latitude, longitude) {
  Object.defineProperty(navigator.geolocation, 'getCurrentPosition', {
    value: function(success, error, options) {
      success({
        coords: {
          latitude: latitude,
          longitude: longitude,
          accuracy: 100,
          altitude: null,
          altitudeAccuracy: null,
          heading: null,
          speed: null
        },
        timestamp: Date.now()
      });
    },
    writable: false,
    configurable: false
  });

  Object.defineProperty(navigator.geolocation, 'watchPosition', {
    value: function(success, error, options) {
      return setInterval(() => {
        success({
          coords: {
            latitude: latitude,
            longitude: longitude,
            accuracy: 100,
            altitude: null,
            altitudeAccuracy: null,
            heading: null,
            speed: null
          },
          timestamp: Date.now()
        });
      }, 1000);
    },
    writable: false,
    configurable: false
  });

  // watchPosition above returns an interval id, so clearWatch must clear it
  Object.defineProperty(navigator.geolocation, 'clearWatch', {
    value: function(id) { clearInterval(id); },
    writable: false,
    configurable: false
  });
}

// Usage in browser context
spoofGeolocation(37.7749, -122.4194); // San Francisco coordinates

Method 3: Using WebScraping.AI for Geolocation Handling

WebScraping.AI provides built-in geolocation handling through proxy rotation and regional endpoints.

Python Example with WebScraping.AI

import requests
import json

def scrape_geo_restricted_site(url, target_country='US'):
    api_key = 'your-webscraping-ai-api-key'

    params = {
        'api_key': api_key,
        'url': url,
        'country': target_country.lower(),  # us, gb, de, fr, etc.
        'proxy': 'residential',
        'js': 'true',  # send as a string; requests would serialize Python's True as "True"
        'timeout': 15000
    }

    try:
        response = requests.get(
            'https://api.webscraping.ai/html',
            params=params,
            timeout=30
        )

        if response.status_code == 200:
            return response.text
        else:
            print(f"API Error: {response.status_code}")
            return None

    except requests.exceptions.RequestException as e:
        print(f"Request failed: {e}")
        return None

# Usage example
html_content = scrape_geo_restricted_site(
    'https://geo-restricted-site.com',
    target_country='GB'
)

if html_content:
    print(f"Successfully retrieved {len(html_content)} characters")

JavaScript Example with WebScraping.AI

const axios = require('axios');

async function scrapeGeoRestricted(url, country = 'US') {
  const apiKey = 'your-webscraping-ai-api-key';

  const params = {
    api_key: apiKey,
    url: url,
    country: country.toLowerCase(),
    proxy: 'residential',
    js: true,
    device: 'desktop',
    timeout: 15000
  };

  try {
    const response = await axios.get('https://api.webscraping.ai/html', {
      params: params,
      timeout: 30000
    });

    return response.data;
  } catch (error) {
    console.error('Scraping failed:', error.message);
    if (error.response) {
      console.error('Response status:', error.response.status);
      console.error('Response data:', error.response.data);
    }
    throw error;
  }
}

// Usage with different countries
async function testMultipleRegions() {
  const url = 'https://geo-restricted-site.com';
  const countries = ['US', 'GB', 'DE', 'FR', 'CA'];

  for (const country of countries) {
    try {
      console.log(`\nTesting from ${country}...`);
      const content = await scrapeGeoRestricted(url, country);
      console.log(`Success: Retrieved ${content.length} characters`);
    } catch (error) {
      console.log(`Failed for ${country}: ${error.message}`);
    }
  }
}

Method 4: Advanced Techniques

Rotating Through Multiple Proxy Locations

const puppeteer = require('puppeteer');

class GeoRotator {
  constructor() {
    this.proxies = [
      { country: 'US', proxy: 'us-proxy:port', coords: { lat: 40.7128, lng: -74.0060 } },
      { country: 'GB', proxy: 'uk-proxy:port', coords: { lat: 51.5074, lng: -0.1278 } },
      { country: 'DE', proxy: 'de-proxy:port', coords: { lat: 52.5200, lng: 13.4050 } },
      { country: 'FR', proxy: 'fr-proxy:port', coords: { lat: 48.8566, lng: 2.3522 } }
    ];
    this.currentIndex = 0;
  }

  getNextProxy() {
    const proxy = this.proxies[this.currentIndex];
    this.currentIndex = (this.currentIndex + 1) % this.proxies.length;
    return proxy;
  }

  async scrapeWithRotation(url, maxAttempts = 3) {
    for (let attempt = 0; attempt < maxAttempts; attempt++) {
      const proxyConfig = this.getNextProxy();

      try {
        const browser = await puppeteer.launch({
          args: [`--proxy-server=${proxyConfig.proxy}`]
        });

        const page = await browser.newPage();

        await page.setGeolocation({
          latitude: proxyConfig.coords.lat,
          longitude: proxyConfig.coords.lng
        });

        await page.goto(url, { waitUntil: 'networkidle0' });
        const content = await page.content();

        await browser.close();
        return { content, country: proxyConfig.country };

      } catch (error) {
        console.log(`Attempt ${attempt + 1} failed with ${proxyConfig.country}:`, error.message);
        if (attempt === maxAttempts - 1) throw error;
      }
    }
  }
}

// Usage
const rotator = new GeoRotator();
rotator.scrapeWithRotation('https://geo-restricted-site.com')
  .then(result => console.log(`Success from ${result.country}`))
  .catch(error => console.error('All attempts failed:', error));

Handling Region-Specific Headers and Cookies

async function scrapeWithRegionalSettings(url, region = 'US') {
  const regionalSettings = {
    'US': {
      headers: {
        'Accept-Language': 'en-US,en;q=0.9',
        'CF-IPCountry': 'US',
        'X-Forwarded-For': '203.0.113.1' // Documentation-range placeholder; only honored by servers that trust this header
      },
      timezone: 'America/New_York'
    },
    'GB': {
      headers: {
        'Accept-Language': 'en-GB,en;q=0.9',
        'CF-IPCountry': 'GB',
        'X-Forwarded-For': '203.0.113.2' // Documentation-range placeholder
      },
      timezone: 'Europe/London'
    }
  };

  const settings = regionalSettings[region] || regionalSettings['US'];
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Set timezone
  await page.emulateTimezone(settings.timezone);

  // Set regional headers
  await page.setExtraHTTPHeaders(settings.headers);

  // Override the timezone reported to page JavaScript (emulateTimezone usually
  // covers this; kept as a fallback that preserves the other resolved options)
  await page.evaluateOnNewDocument((timezone) => {
    const original = Intl.DateTimeFormat.prototype.resolvedOptions;
    Object.defineProperty(Intl.DateTimeFormat.prototype, 'resolvedOptions', {
      value: function() {
        return { ...original.call(this), timeZone: timezone };
      }
    });
  }, settings.timezone);

  await page.goto(url);
  const content = await page.content();

  await browser.close();
  return content;
}

Best Practices and Considerations

Legal and Ethical Guidelines

  1. Respect robots.txt: Always check the website's robots.txt file
  2. Rate limiting: Implement delays between requests to avoid overwhelming servers
  3. Terms of service: Review and comply with website terms of service
  4. Data privacy: Handle scraped data according to applicable privacy laws
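
The rate-limiting point above can be sketched as a minimal throttle that enforces a minimum gap between requests to the same host (the interval value is arbitrary; tune it to the target site):

```javascript
// Minimal throttle: each call resolves only after at least `minIntervalMs`
// has passed since the previous call, spacing out requests to a host.
function createThrottle(minIntervalMs) {
  let last = 0;
  return async function throttle() {
    const now = Date.now();
    const wait = Math.max(0, last + minIntervalMs - now);
    last = now + wait;
    if (wait > 0) await new Promise(resolve => setTimeout(resolve, wait));
  };
}
```

Usage: create one throttle per target host (e.g. `const throttle = createThrottle(2000);`) and `await throttle();` before each request.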

Performance Optimization

// Implement connection pooling for better performance
const puppeteer = require('puppeteer');

class BrowserPool {
  constructor(size = 3) {
    this.browsers = [];
    this.size = size;
    this.currentIndex = 0;
  }

  async initialize() {
    for (let i = 0; i < this.size; i++) {
      const browser = await puppeteer.launch({
        args: ['--no-sandbox', '--disable-setuid-sandbox']
      });
      this.browsers.push(browser);
    }
  }

  getBrowser() {
    const browser = this.browsers[this.currentIndex];
    this.currentIndex = (this.currentIndex + 1) % this.size;
    return browser;
  }

  async close() {
    await Promise.all(this.browsers.map(browser => browser.close()));
  }
}

Error Handling and Retry Logic

async function robustGeoScraping(url, options = {}) {
  const {
    maxRetries = 3,
    retryDelay = 2000,
    proxies = [],
    fallbackToDirectAccess = true
  } = options;

  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      // Use a proxy when any are configured, rotating through the list
      if (proxies.length > 0) {
        const proxy = proxies[attempt % proxies.length];
        // Assumes a scrapeWithProxy(url, proxy) helper, e.g. an adapted
        // version of the Puppeteer example from Method 1
        return await scrapeWithProxy(url, proxy);
      }

      // No proxies configured: optionally fall back to a direct request
      if (fallbackToDirectAccess) {
        return await scrapeDirectly(url); // assumed plain-fetch helper
      }

    } catch (error) {
      console.log(`Attempt ${attempt + 1} failed:`, error.message);

      if (attempt < maxRetries - 1) {
        await new Promise(resolve => setTimeout(resolve, retryDelay));
      }
    }
  }

  throw new Error(`Failed to scrape ${url} after ${maxRetries} attempts`);
}
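
The fixed `retryDelay` above can be replaced with exponential backoff plus jitter, which spreads retries out instead of hammering a server that just rejected you. A small helper (our own sketch; the base and cap values are arbitrary defaults):

```javascript
// Exponential backoff with full jitter: the ceiling grows as base * 2^attempt,
// capped at `capMs`, and a random delay in [0, ceiling] is returned.
function backoffDelay(attempt, baseMs = 1000, capMs = 30000) {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(Math.random() * (ceiling + 1));
}
```

In `robustGeoScraping`, swap the fixed delay for `await new Promise(r => setTimeout(r, backoffDelay(attempt)));`.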

Monitoring and Debugging

When dealing with geolocation restrictions, it's important to monitor network requests in Puppeteer to understand how the website detects and handles different geographic locations.

async function debugGeolocationHandling(url) {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();

  // Monitor all requests
  page.on('request', request => {
    console.log('Request:', request.url());
    console.log('Headers:', request.headers());
  });

  page.on('response', response => {
    console.log('Response:', response.url(), response.status());
  });

  await page.goto(url);

  // Check if geolocation is being used
  const geolocationSupported = await page.evaluate(() => {
    return 'geolocation' in navigator;
  });

  console.log('Geolocation supported:', geolocationSupported);
  await browser.close();
}

Browser Configuration for Different Regions

Different regions may require specific browser configurations to appear authentic.

const regionalConfigs = {
  US: {
    userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    acceptLanguage: 'en-US,en;q=0.9',
    timezone: 'America/New_York',
    locale: 'en-US'
  },
  GB: {
    userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    acceptLanguage: 'en-GB,en;q=0.9',
    timezone: 'Europe/London', 
    locale: 'en-GB'
  },
  DE: {
    userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    acceptLanguage: 'de-DE,de;q=0.9,en;q=0.8',
    timezone: 'Europe/Berlin',
    locale: 'de-DE'
  }
};

async function configureRegionalBrowser(region = 'US') {
  const config = regionalConfigs[region] || regionalConfigs.US; // fall back to US for unknown regions
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Set user agent
  await page.setUserAgent(config.userAgent);

  // Set language
  await page.setExtraHTTPHeaders({
    'Accept-Language': config.acceptLanguage
  });

  // Set timezone
  await page.emulateTimezone(config.timezone);

  // Override locale
  await page.evaluateOnNewDocument((locale) => {
    Object.defineProperty(navigator, 'language', {
      get: function() { return locale; }
    });
    Object.defineProperty(navigator, 'languages', {
      get: function() { return [locale]; }
    });
  }, config.locale);

  return { browser, page }; // return the browser too so the caller can close it
}

Handling Common Geolocation Challenges

Bypassing IP Detection

Some backends trust client-supplied forwarding headers such as CF-Connecting-IP, X-Forwarded-For, and X-Real-IP. Properly configured sites ignore these, so treat header spoofing as a quick experiment rather than a reliable bypass. The IP below is a documentation-range placeholder.

# Spoofing common forwarded-IP headers with curl
curl -H "CF-Connecting-IP: 203.0.113.1" \
     -H "X-Forwarded-For: 203.0.113.1" \
     -H "X-Real-IP: 203.0.113.1" \
     https://target-website.com

Testing Geographic Restrictions

async function testGeoRestrictions(url) {
  const testLocations = [
    { country: 'US', lat: 40.7128, lng: -74.0060 },
    { country: 'GB', lat: 51.5074, lng: -0.1278 },
    { country: 'DE', lat: 52.5200, lng: 13.4050 },
    { country: 'JP', lat: 35.6762, lng: 139.6503 }
  ];

  const results = [];

  for (const location of testLocations) {
    try {
      const browser = await puppeteer.launch();
      const page = await browser.newPage();

      await page.setGeolocation({
        latitude: location.lat,
        longitude: location.lng
      });

      const response = await page.goto(url);
      const accessible = response.status() === 200;

      results.push({
        country: location.country,
        accessible: accessible,
        status: response.status()
      });

      await browser.close();
    } catch (error) {
      results.push({
        country: location.country,
        accessible: false,
        error: error.message
      });
    }
  }

  return results;
}

Conclusion

Handling geolocation restrictions requires a combination of techniques including proxy servers, geolocation API manipulation, and proper header configuration. The choice of method depends on how the target website implements its restrictions. For reliable, large-scale scraping operations, consider using specialized services like WebScraping.AI that handle geolocation complexities automatically.

Remember to always respect website terms of service and implement appropriate rate limiting. Geolocation controls are often paired with session-based restrictions, so handling authentication in Puppeteer and navigating between pages while preserving session state go hand in hand with the techniques above.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
