How Do I Scrape Multilingual Websites Using LLMs?
Scraping multilingual websites presents unique challenges that traditional web scraping methods struggle to address—different character encodings, varied HTML structures across language versions, and the complexity of extracting and translating content simultaneously. Large Language Models (LLMs) revolutionize this process by understanding content semantically across languages, enabling intelligent extraction without language-specific parsing rules.
LLMs can process websites in any language, extract structured data, translate content on-the-fly, and even understand context-dependent information that varies by locale. This makes them particularly powerful for international e-commerce monitoring, global news aggregation, and multi-regional competitive analysis.
Understanding Multilingual Scraping Challenges
Traditional Scraping Limitations
Traditional web scraping relies on selectors that often break across language versions of websites. Even when selectors remain consistent, handling different languages requires extensive preprocessing, translation APIs, and language-specific parsing logic.
from bs4 import BeautifulSoup
import requests

# Traditional approach - needs separate logic for each language
def scrape_traditional(url, language):
    response = requests.get(url)
    response.encoding = 'utf-8'  # Must handle encoding manually
    soup = BeautifulSoup(response.text, 'html.parser')
    # Different selectors for different language versions
    if language == 'en':
        title = soup.select_one('.product-title-en').text
    elif language == 'ja':
        title = soup.select_one('.product-title-ja').text
    elif language == 'zh':
        title = soup.select_one('.product-title-zh').text
    else:
        raise ValueError(f"No parsing logic for language: {language}")
    # Needs external translation API
    return title  # Still in original language
LLM-Based Solution
LLMs handle multilingual content natively, understanding context across languages without language-specific rules.
from webscraping_ai import WebScrapingAI

client = WebScrapingAI(api_key='YOUR_API_KEY')

# Works for any language - extracts and understands contextually
result = client.get_fields(
    url='https://example.jp/product/12345',
    fields={
        'product_name': 'The product name',
        'price': 'Current price with currency',
        'specifications': 'Technical specifications list',
        'availability': 'Stock availability status'
    }
)

# Data extracted accurately regardless of source language
print(result)
Extracting Data from Multilingual Websites
Basic Multilingual Data Extraction
LLMs can extract structured data from pages in any language without requiring language detection or translation preprocessing.
from webscraping_ai import WebScrapingAI

client = WebScrapingAI(api_key='YOUR_API_KEY')

def scrape_multilingual_product(url):
    """Extract product data regardless of page language"""
    data = client.get_fields(
        url=url,
        fields={
            'product_name': 'Full product name',
            'brand': 'Brand or manufacturer name',
            'price': 'Current selling price with currency symbol',
            'original_price': 'Original price before discount if available',
            'description': 'Product description',
            'features': 'List of key product features',
            'specifications': 'Technical specifications as key-value pairs',
            'availability': 'In stock status',
            'rating': 'Average customer rating',
            'review_count': 'Number of customer reviews',
            'shipping_info': 'Shipping time and cost details'
        },
        js=True  # Enable JavaScript rendering for dynamic content
    )
    return data

# Works on Japanese site
jp_data = scrape_multilingual_product('https://example.jp/製品/12345')

# Works on German site with same code
de_data = scrape_multilingual_product('https://example.de/produkt/12345')

# Works on Arabic site (right-to-left) with same code
ar_data = scrape_multilingual_product('https://example.ae/منتج/12345')

print(f"Japanese product: {jp_data['product_name']}")
print(f"German product: {de_data['product_name']}")
print(f"Arabic product: {ar_data['product_name']}")
Using Natural Language Questions Across Languages
The question-based extraction approach is particularly powerful for multilingual sites because you can ask questions in your language about content in any language.
from webscraping_ai import WebScrapingAI

client = WebScrapingAI(api_key='YOUR_API_KEY')

# Ask questions in English about pages in different languages
def extract_with_questions(url):
    questions = [
        'What is the product warranty period?',
        'What payment methods are accepted?',
        'What is the return policy?',
        'Is international shipping available?',
        'What are the delivery timeframes?'
    ]
    results = {}
    for question in questions:
        answer = client.get_question(
            url=url,
            question=question
        )
        results[question] = answer
    return results

# Query Chinese website in English
chinese_site_info = extract_with_questions('https://example.cn/产品')

# Query French website in English
french_site_info = extract_with_questions('https://example.fr/produit')

# Query Russian website in English
russian_site_info = extract_with_questions('https://example.ru/товар')

for question, answer in chinese_site_info.items():
    print(f"Q: {question}")
    print(f"A: {answer}\n")
JavaScript Implementation for Multilingual Scraping
const WebScrapingAI = require('webscraping.ai');
const client = new WebScrapingAI('YOUR_API_KEY');

async function scrapeMultilingualNews(urls) {
  const articles = [];
  for (const url of urls) {
    try {
      const data = await client.getFields(url, {
        'headline': 'Article headline or title',
        'author': 'Article author name',
        'publish_date': 'Publication date',
        'category': 'Article category or section',
        'summary': 'Brief article summary (2-3 sentences)',
        'main_content': 'Full article text content',
        'tags': 'Article tags or keywords',
        'source_language': 'What language is this article written in?'
      }, {
        js: true,
        device: 'desktop'
      });
      articles.push({
        url: url,
        ...data,
        scraped_at: new Date().toISOString()
      });
    } catch (error) {
      console.error(`Error scraping ${url}:`, error.message);
    }
  }
  return articles;
}

// Scrape news from multiple countries and languages
const newsUrls = [
  'https://example.jp/ニュース/article1',     // Japanese
  'https://example.kr/뉴스/article2',         // Korean
  'https://example.es/noticias/article3',     // Spanish
  'https://example.de/nachrichten/article4',  // German
  'https://example.in/समाचार/article5'         // Hindi
];

scrapeMultilingualNews(newsUrls)
  .then(articles => {
    articles.forEach(article => {
      console.log(`Title: ${article.headline}`);
      console.log(`Language: ${article.source_language}`);
      console.log(`Summary: ${article.summary}\n`);
    });
  });
Handling Region-Specific Website Versions
Many websites maintain separate versions for different regions, each with its own domain or subdirectory. LLMs excel at extracting comparable data across these variations.
Scraping Regional E-commerce Sites
from webscraping_ai import WebScrapingAI

client = WebScrapingAI(api_key='YOUR_API_KEY')

def scrape_regional_pricing(product_id, regions):
    """Compare product information across regional websites"""
    regional_data = {}

    # Regional domain mappings
    domains = {
        'us': 'example.com',
        'uk': 'example.co.uk',
        'de': 'example.de',
        'jp': 'example.jp',
        'fr': 'example.fr',
        'es': 'example.es',
        'cn': 'example.cn',
        'br': 'example.com.br'
    }

    fields_template = {
        'product_name': 'Product name in local language',
        'price': 'Current price with local currency',
        'price_numeric': 'Numeric price value only',
        'currency': 'Currency code (USD, EUR, GBP, etc.)',
        'availability': 'Stock status',
        'shipping_cost': 'Shipping cost to local addresses',
        'shipping_time': 'Estimated delivery time',
        'local_warranty': 'Warranty period and terms',
        'special_offers': 'Any active promotions or discounts'
    }

    for region in regions:
        if region not in domains:
            continue
        url = f"https://{domains[region]}/product/{product_id}"
        try:
            # Set country for proper proxy routing
            data = client.get_fields(
                url=url,
                fields=fields_template,
                country=region,
                js=True
            )
            data['region'] = region
            data['url'] = url
            regional_data[region] = data
        except Exception as e:
            print(f"Error scraping {region}: {e}")
            regional_data[region] = {'error': str(e)}

    return regional_data

# Compare product across regions
regions = ['us', 'uk', 'de', 'jp', 'fr']
pricing_data = scrape_regional_pricing('ABC123', regions)

# Analyze regional differences
for region, data in pricing_data.items():
    if 'error' not in data:
        print(f"\n{region.upper()}:")
        print(f"  Name: {data['product_name']}")
        print(f"  Price: {data['price']}")
        print(f"  Availability: {data['availability']}")
        print(f"  Shipping: {data['shipping_time']}")
Handling Multiple Language Versions with Navigation
When scraping sites with language selectors, you can combine browser automation for navigation with LLM-based extraction.
const puppeteer = require('puppeteer');
const WebScrapingAI = require('webscraping.ai');
const aiClient = new WebScrapingAI('YOUR_API_KEY');

async function scrapeAllLanguageVersions(baseUrl, languageSelectors) {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  const allVersions = [];

  for (const [langCode, selector] of Object.entries(languageSelectors)) {
    try {
      // Navigate to base page
      await page.goto(baseUrl, { waitUntil: 'networkidle2' });

      // Click language selector and wait for the switch to complete
      await page.click(selector);
      await new Promise(resolve => setTimeout(resolve, 2000));

      const currentUrl = page.url();

      // Use AI to extract data in current language
      const data = await aiClient.getFields(currentUrl, {
        'page_title': 'Main page title',
        'main_content': 'Primary content text',
        'menu_items': 'List of navigation menu items',
        'contact_info': 'Contact information including email and phone',
        'language_code': 'What language is this page in?'
      });

      allVersions.push({
        language: langCode,
        url: currentUrl,
        ...data
      });
    } catch (error) {
      console.error(`Error with language ${langCode}:`, error.message);
    }
  }

  await browser.close();
  return allVersions;
}

// Language selector mappings
const languageSelectors = {
  'en': '#lang-selector-en',
  'es': '#lang-selector-es',
  'fr': '#lang-selector-fr',
  'de': '#lang-selector-de',
  'zh': '#lang-selector-zh'
};

scrapeAllLanguageVersions('https://example.com', languageSelectors)
  .then(versions => {
    console.log(JSON.stringify(versions, null, 2));
  });
Extracting and Translating Content Simultaneously
One of the most powerful features of LLM-based scraping is the ability to extract and translate content in a single operation.
Direct Translation During Extraction
from webscraping_ai import WebScrapingAI

client = WebScrapingAI(api_key='YOUR_API_KEY')

def extract_and_translate(url, target_language='English'):
    """Extract content and request it in a specific language"""
    # Add language instruction to field descriptions
    fields = {
        'product_name': f'Product name translated to {target_language}',
        'description': f'Product description translated to {target_language}',
        'features': f'List of product features translated to {target_language}',
        'specifications': f'Technical specifications with labels translated to {target_language}',
        'original_language': 'What language was the original page in?'
    }
    result = client.get_fields(url=url, fields=fields, js=True)
    return result

# Scrape Japanese site, get results in English
jp_product_en = extract_and_translate(
    'https://example.jp/製品/laptop-xyz',
    target_language='English'
)
print(f"Original language: {jp_product_en['original_language']}")
print(f"Product name (EN): {jp_product_en['product_name']}")
print(f"Description (EN): {jp_product_en['description']}")

# Scrape same Japanese site, get results in Spanish
jp_product_es = extract_and_translate(
    'https://example.jp/製品/laptop-xyz',
    target_language='Spanish'
)
print(f"\nProduct name (ES): {jp_product_es['product_name']}")
print(f"Description (ES): {jp_product_es['description']}")
Question-Based Translation
from webscraping_ai import WebScrapingAI

client = WebScrapingAI(api_key='YOUR_API_KEY')

def get_translated_answers(url, questions, output_language='English'):
    """Ask questions and get answers in specified language"""
    answers = {}
    for question in questions:
        # Add language instruction to question
        localized_question = f"{question} (Answer in {output_language})"
        answer = client.get_question(
            url=url,
            question=localized_question
        )
        answers[question] = answer
    return answers

# Query German website, get English answers
questions = [
    'What are the main product features?',
    'What is the price and what does it include?',
    'What are the shipping options?',
    'What is the warranty coverage?'
]
german_site_answers = get_translated_answers(
    'https://example.de/produkt/smartphone',
    questions,
    output_language='English'
)
for q, a in german_site_answers.items():
    print(f"Q: {q}")
    print(f"A: {a}\n")

# Same questions, Spanish answers from Chinese website
chinese_site_answers = get_translated_answers(
    'https://example.cn/产品/智能手机',
    questions,
    output_language='Spanish'
)
Handling Character Encodings and Special Characters
LLMs naturally handle various character encodings, including complex scripts like Chinese, Japanese, Korean, Arabic, Hebrew, and Cyrillic.
Working with Non-Latin Scripts
from webscraping_ai import WebScrapingAI

client = WebScrapingAI(api_key='YOUR_API_KEY')

def scrape_complex_scripts(urls):
    """Handle websites with various character systems"""
    results = []
    for url in urls:
        data = client.get_fields(
            url=url,
            fields={
                'title': 'Page title in original script',
                'title_romanized': 'Page title romanized/transliterated to Latin alphabet',
                'content': 'Main content in original script',
                'language': 'Language name in English',
                'script': 'Writing system (Latin, Cyrillic, Arabic, Chinese, etc.)',
                'reading_direction': 'Is text left-to-right or right-to-left?'
            }
        )
        data['url'] = url
        results.append(data)
    return results

# Test with various scripts
test_urls = [
    'https://example.jp/記事',      # Japanese (Hiragana, Katakana, Kanji)
    'https://example.kr/기사',      # Korean (Hangul)
    'https://example.cn/文章',      # Chinese (Simplified)
    'https://example.tw/文章',      # Chinese (Traditional)
    'https://example.ae/مقالة',     # Arabic
    'https://example.il/מאמר',      # Hebrew
    'https://example.ru/статья',    # Russian (Cyrillic)
    'https://example.gr/άρθρο',     # Greek
    'https://example.th/บทความ'     # Thai
]
script_results = scrape_complex_scripts(test_urls)

for result in script_results:
    print(f"\nLanguage: {result['language']}")
    print(f"Script: {result['script']}")
    print(f"Direction: {result['reading_direction']}")
    print(f"Original: {result['title']}")
    print(f"Romanized: {result['title_romanized']}")
Best Practices for Multilingual Scraping with LLMs
1. Be Explicit About Language Requirements
When extracting data, clearly specify whether you want content in the original language or translated.
# Unclear - may return mixed languages
fields = {'description': 'Product description'}

# Clear - specifies desired language
fields = {
    'description_original': 'Product description in original language',
    'description_english': 'Product description translated to English',
    'source_language': 'What language is the original description in?'
}
2. Handle Regional Number and Date Formats
Different regions use different formats for numbers, currencies, and dates. LLMs can normalize these.
from webscraping_ai import WebScrapingAI

client = WebScrapingAI(api_key='YOUR_API_KEY')

def extract_normalized_data(url):
    """Extract data with normalized formats"""
    return client.get_fields(
        url=url,
        fields={
            'price_local': 'Price as displayed on page',
            'price_usd': 'Price converted to USD (provide conversion if needed)',
            'date_local': 'Date as shown on page',
            'date_iso': 'Date in ISO 8601 format (YYYY-MM-DD)',
            'number_format': 'Does this region use period or comma for decimals?',
            'currency_code': 'Three-letter currency code (USD, EUR, JPY, etc.)'
        }
    )

# European site (comma for decimal)
eu_data = extract_normalized_data('https://example.de/produkt')
print(f"Local price: {eu_data['price_local']}")  # "1.299,99 €"
print(f"USD price: {eu_data['price_usd']}")      # "$1,450.00"
print(f"Local date: {eu_data['date_local']}")    # "15.03.2024"
print(f"ISO date: {eu_data['date_iso']}")        # "2024-03-15"
3. Implement Error Handling for Language Detection
import logging
import time
from webscraping_ai import WebScrapingAI, WebScrapingAIError

client = WebScrapingAI(api_key='YOUR_API_KEY')
logging.basicConfig(level=logging.INFO)

def safe_multilingual_scrape(url, fields, retry_count=3):
    """Robust scraping with error handling for multilingual sites"""
    # Add language detection fields
    enhanced_fields = {
        **fields,
        '_detected_language': 'What language is this page primarily in?',
        '_has_mixed_languages': 'Does this page contain multiple languages?'
    }
    for attempt in range(retry_count):
        try:
            result = client.get_fields(
                url=url,
                fields=enhanced_fields,
                js=True
            )
            logging.info(f"Successfully scraped {url}")
            logging.info(f"Detected language: {result.get('_detected_language')}")
            return result
        except WebScrapingAIError as e:
            logging.error(f"Attempt {attempt + 1} failed for {url}: {e}")
            if attempt == retry_count - 1:
                raise
            # Exponential backoff before retry
            time.sleep(2 ** attempt)
    return None

# Use with error handling
try:
    data = safe_multilingual_scrape(
        'https://example.cn/产品/123',
        {
            'product_name': 'Product name',
            'price': 'Price with currency',
            'description': 'Product description'
        }
    )
    print(f"Extracted data: {data}")
except Exception as e:
    print(f"Failed to scrape: {e}")
4. Batch Process Multiple Languages Efficiently
const WebScrapingAI = require('webscraping.ai');
const client = new WebScrapingAI('YOUR_API_KEY');

async function batchMultilingualScrape(urlsByLanguage) {
  const promises = [];

  for (const [language, urls] of Object.entries(urlsByLanguage)) {
    for (const url of urls) {
      // Create promise for each scraping task
      const promise = client.getFields(url, {
        'title': 'Page title',
        'content': 'Main content',
        'metadata': 'Meta description',
        'language_detected': 'Detected page language'
      }, {
        js: true
      }).then(data => ({
        language: language,
        url: url,
        ...data,
        scraped_at: new Date().toISOString()
      })).catch(error => ({
        language: language,
        url: url,
        error: error.message
      }));
      promises.push(promise);
    }
  }

  // Execute all requests concurrently
  const results = await Promise.all(promises);

  // Group results by language
  const grouped = {};
  for (const result of results) {
    if (!grouped[result.language]) {
      grouped[result.language] = [];
    }
    grouped[result.language].push(result);
  }
  return grouped;
}

// Batch scrape URLs from different languages
const urlsByLanguage = {
  'japanese': [
    'https://example.jp/page1',
    'https://example.jp/page2'
  ],
  'german': [
    'https://example.de/seite1',
    'https://example.de/seite2'
  ],
  'spanish': [
    'https://example.es/pagina1',
    'https://example.es/pagina2'
  ]
};

batchMultilingualScrape(urlsByLanguage)
  .then(results => {
    for (const [lang, data] of Object.entries(results)) {
      console.log(`\n${lang.toUpperCase()} Results:`);
      data.forEach(item => {
        if (item.error) {
          console.log(`  ❌ ${item.url}: ${item.error}`);
        } else {
          console.log(`  ✅ ${item.url}: ${item.title}`);
        }
      });
    }
  });
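Note that Promise.all launches every request at once, which can trip API rate limits on large URL lists. A sketch of the same batching pattern in Python with a bounded worker pool (the field names mirror the example above; adjust max_workers to your plan's limits):

from concurrent.futures import ThreadPoolExecutor, as_completed
from webscraping_ai import WebScrapingAI

client = WebScrapingAI(api_key='YOUR_API_KEY')

def scrape_one(language, url):
    """Scrape a single URL, returning an error record instead of raising."""
    try:
        data = client.get_fields(url=url, fields={
            'title': 'Page title',
            'content': 'Main content'
        }, js=True)
        return {'language': language, 'url': url, **data}
    except Exception as e:
        return {'language': language, 'url': url, 'error': str(e)}

def batch_scrape(urls_by_language, max_workers=5):
    # Flatten (language, url) pairs, then cap concurrency with a worker pool
    tasks = [(lang, url) for lang, urls in urls_by_language.items() for url in urls]
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(scrape_one, lang, url) for lang, url in tasks]
        for future in as_completed(futures):
            results.append(future.result())
    return results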
Real-World Use Cases
International Price Comparison
from datetime import datetime
from webscraping_ai import WebScrapingAI

client = WebScrapingAI(api_key='YOUR_API_KEY')

def compare_international_prices(product_urls):
    """Compare product prices across different countries"""
    comparison = []
    for url in product_urls:
        data = client.get_fields(
            url=url,
            fields={
                'product_name': 'Product name in English',
                'local_price': 'Price as displayed (with currency symbol)',
                'numeric_price': 'Numeric price value only',
                'currency': 'Currency code',
                'country': 'What country is this website for?',
                'in_stock': 'Is the product in stock?',
                'shipping_to_us': 'Can this ship to United States? If yes, what is the cost?',
                'total_cost_usd': 'Estimated total cost in USD including shipping (provide estimation)',
                'taxes_included': 'Are taxes included in the displayed price?'
            }
        )
        data['url'] = url
        data['checked_at'] = datetime.now().isoformat()
        comparison.append(data)

    # Sort by local numeric price (currencies differ; compare total_cost_usd for a true ranking)
    def price_key(item):
        try:
            return float(str(item.get('numeric_price', '')).replace(',', ''))
        except ValueError:
            return float('inf')

    comparison.sort(key=price_key)
    return comparison

# Compare prices across regions
urls = [
    'https://example.com/product/camera',    # USA
    'https://example.co.uk/product/camera',  # UK
    'https://example.de/produkt/kamera',     # Germany
    'https://example.jp/製品/カメラ',         # Japan
    'https://example.com.au/product/camera'  # Australia
]
price_comparison = compare_international_prices(urls)

print("Price Comparison Results:\n")
for item in price_comparison:
    print(f"{item['country']}:")
    print(f"  Price: {item['local_price']} ({item['currency']})")
    print(f"  Est. USD: ${item['total_cost_usd']}")
    print(f"  Stock: {item['in_stock']}")
    print(f"  URL: {item['url']}\n")
Multilingual Content Aggregation
from webscraping_ai import WebScrapingAI

client = WebScrapingAI(api_key='YOUR_API_KEY')

def aggregate_global_news(news_sources):
    """Aggregate news from international sources in multiple languages"""
    articles = []
    for source in news_sources:
        try:
            data = client.get_fields(
                url=source['url'],
                fields={
                    'headline_original': 'Article headline in original language',
                    'headline_english': 'Article headline translated to English',
                    'summary_english': 'Brief article summary in English (2-3 sentences)',
                    'author': 'Article author',
                    'publish_date': 'Publication date in YYYY-MM-DD format',
                    'category': 'Article category or topic',
                    'sentiment': 'Is the article tone positive, negative, or neutral?',
                    'key_entities': 'Main people, organizations, or locations mentioned',
                    'source_language': 'Source language of the article'
                }
            )
            articles.append({
                'source': source['name'],
                'country': source['country'],
                **data
            })
        except Exception as e:
            print(f"Error scraping {source['name']}: {e}")
    return articles

# International news sources
sources = [
    {'name': 'Le Monde', 'url': 'https://lemonde.fr/article/xyz', 'country': 'France'},
    {'name': 'Der Spiegel', 'url': 'https://spiegel.de/artikel/xyz', 'country': 'Germany'},
    {'name': 'Asahi Shimbun', 'url': 'https://asahi.com/記事/xyz', 'country': 'Japan'},
    {'name': 'El País', 'url': 'https://elpais.com/articulo/xyz', 'country': 'Spain'}
]
global_news = aggregate_global_news(sources)

# Display aggregated news
for article in global_news:
    print(f"\n{'='*60}")
    print(f"Source: {article['source']} ({article['country']})")
    print(f"Language: {article['source_language']}")
    print(f"Original: {article['headline_original']}")
    print(f"English: {article['headline_english']}")
    print(f"Summary: {article['summary_english']}")
    print(f"Sentiment: {article['sentiment']}")
Monitoring International Competitor Websites
When monitoring competitor websites across markets, LLMs can help extract and compare features across language versions.
from webscraping_ai import WebScrapingAI

client = WebScrapingAI(api_key='YOUR_API_KEY')

def monitor_competitor_features(competitor_sites):
    """Track competitor features across different regional markets"""
    all_features = {}
    for site in competitor_sites:
        features = client.get_fields(
            url=site['url'],
            fields={
                'product_features': 'List of all product features and capabilities',
                'pricing_tiers': 'Available pricing plans with features for each tier',
                'unique_selling_points': 'Main competitive advantages highlighted',
                'target_audience': 'Who is this product targeted at?',
                'new_announcements': 'Any recent product updates or announcements',
                'region_specific_features': 'Features that seem specific to this region',
                'page_language': 'Page language'
            }
        )
        all_features[site['region']] = {
            'competitor': site['competitor'],
            'url': site['url'],
            **features
        }
    return all_features

# Monitor competitors across regions
competitors = [
    {
        'competitor': 'Competitor A',
        'region': 'USA',
        'url': 'https://competitora.com/product'
    },
    {
        'competitor': 'Competitor A',
        'region': 'Germany',
        'url': 'https://competitora.de/produkt'
    },
    {
        'competitor': 'Competitor A',
        'region': 'Japan',
        'url': 'https://competitora.jp/製品'
    }
]
feature_comparison = monitor_competitor_features(competitors)

# Analyze regional differences
for region, data in feature_comparison.items():
    print(f"\n{region}:")
    print(f"  Language: {data['page_language']}")
    print(f"  Features: {data['product_features']}")
    print(f"  Region-specific: {data['region_specific_features']}")
Conclusion
LLMs transform multilingual web scraping from a complex, maintenance-intensive challenge into a straightforward, scalable process. By understanding content semantically rather than relying on brittle selectors and language-specific rules, LLM-based scraping enables developers to extract, translate, and structure data from websites in any language using a single, unified approach.
The ability to ask questions in one language about content in another, automatically normalize regional formats, and handle complex character encodings makes LLMs particularly valuable for international business intelligence, global e-commerce monitoring, and multilingual content aggregation. As businesses expand globally, LLM-powered scraping tools provide the flexibility and intelligence needed to gather insights from the entire international web, not just English-language sources.
Whether you're monitoring competitor prices across regions, aggregating news from international sources, or building a global product catalog, LLMs provide a powerful, maintainable solution for multilingual web scraping that adapts to change and scales across languages effortlessly.