When working with the Python requests library, charset handling occurs at two levels: telling the server which charsets you can accept (via request headers) and declaring the charset of the data you send (via the request body and its Content-Type header). The charset is typically handled automatically, but you can override it when needed.
Understanding Charset in HTTP Requests
The charset determines how text is encoded in HTTP requests and responses. By default, requests automatically detects and handles the charset based on server responses, but manual control is sometimes necessary for:
- Servers that don't specify charset correctly
- Sending data in specific encodings
- Working with international content
- Handling legacy systems with non-UTF-8 encodings
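Before overriding anything, it helps to see what the server declared and what requests decided to use. A minimal check (example.com stands in for any URL):

import requests

response = requests.get('https://example.com/')
# What the server declared in its headers
print(response.headers.get('Content-Type'))  # e.g. 'text/html; charset=UTF-8'
# What requests will use to decode response.text
print(response.encoding)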
1. Requesting Specific Charset (GET Requests)
Use the Accept-Charset header to tell the server which charsets your client can handle (most modern servers ignore this header, but it can still matter with legacy systems):
import requests
# Request UTF-8 encoding from server
url = 'https://example.com/'
headers = {
    'Accept-Charset': 'utf-8'
}
response = requests.get(url, headers=headers)
print(f"Response encoding: {response.encoding}")
print(response.text)
You can also specify multiple acceptable charsets:
import requests
headers = {
    'Accept-Charset': 'utf-8, iso-8859-1;q=0.8, *;q=0.1'
}
response = requests.get('https://example.com/', headers=headers)
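Servers are free to ignore Accept-Charset, so it is worth verifying what actually came back; a quick sketch:

import requests

headers = {'Accept-Charset': 'utf-8, iso-8859-1;q=0.8'}
response = requests.get('https://example.com/', headers=headers)

# The charset the server actually used, regardless of what we asked for
print(response.headers.get('Content-Type'))
print(response.encoding)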
2. Setting Charset for POST Data
Form Data (application/x-www-form-urlencoded)
When sending form data, specify the charset in the Content-Type header:
import requests
from urllib.parse import urlencode

url = 'https://httpbin.org/post'
headers = {
    'Content-Type': 'application/x-www-form-urlencoded; charset=utf-8'
}

# Using string data (percent-encode the values so the body is valid form data)
data = urlencode({'name': 'José', 'city': 'São Paulo'}, encoding='utf-8')
response = requests.post(url, headers=headers, data=data)
# Or using dictionary (requests handles encoding automatically)
form_data = {'name': 'José', 'city': 'São Paulo'}
response = requests.post(url, data=form_data) # charset handled automatically
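To see exactly what requests produces from a Unicode form dict, you can inspect a PreparedRequest; this sketch shows that non-ASCII values are percent-encoded as UTF-8:

import requests

# Build the request without sending it, then inspect the encoded body
req = requests.Request('POST', 'https://httpbin.org/post',
                       data={'name': 'José'}).prepare()
print(req.headers['Content-Type'])  # application/x-www-form-urlencoded
print(req.body)                     # name=Jos%C3%A9 (UTF-8 percent-encoding)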
JSON Data
For JSON payloads, specify UTF-8 charset explicitly:
import requests
import json
url = 'https://httpbin.org/post'
headers = {
    'Content-Type': 'application/json; charset=utf-8'
}

data = {
    'name': 'José María',
    'description': 'Специальные символы',
    'emoji': '🌟'
}
# Method 1: Manual encoding
json_data = json.dumps(data, ensure_ascii=False).encode('utf-8')
response = requests.post(url, headers=headers, data=json_data)
# Method 2: Let requests handle it (recommended)
response = requests.post(url, json=data)  # encodes the body as UTF-8 and sets Content-Type: application/json
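Since httpbin.org echoes the request back, a quick round trip confirms the payload survived intact:

import requests

data = {'name': 'José María', 'emoji': '🌟'}
response = requests.post('https://httpbin.org/post', json=data)

# httpbin returns the parsed body under the 'json' key
echoed = response.json()['json']
print(echoed == data)  # True if the UTF-8 round trip worked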
Plain Text Data
For plain text content:
import requests
url = 'https://httpbin.org/post'
headers = {
    'Content-Type': 'text/plain; charset=utf-8'
}
text_data = "Hello, 世界! Привет мир!"
response = requests.post(url, headers=headers, data=text_data.encode('utf-8'))
3. Handling Response Charset
Automatic Detection
requests automatically detects the charset from the Content-Type header:
import requests
response = requests.get('https://example.com/')
print(f"Detected encoding: {response.encoding}")
print(f"Apparent encoding: {response.apparent_encoding}") # chardet-based detection
Manual Override
Override the detected encoding when servers provide incorrect charset information:
import requests
response = requests.get('https://example.com/')
# Check what was detected
print(f"Original encoding: {response.encoding}")
# Override if needed
response.encoding = 'utf-8'
content = response.text
# Or use apparent encoding (usually more accurate)
response.encoding = response.apparent_encoding
content = response.text
Binary Content
For binary data or when you want full control:
import requests
response = requests.get('https://example.com/')
# Get raw bytes
raw_bytes = response.content
# Decode manually
try:
    text = raw_bytes.decode('utf-8')
except UnicodeDecodeError:
    # Fallback to apparent encoding
    text = raw_bytes.decode(response.apparent_encoding)
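As a last resort, errors='replace' decodes without ever raising by substituting undecodable bytes; useful for logging, though it loses data:

import requests

response = requests.get('https://example.com/')

# Never raises: undecodable bytes become the U+FFFD replacement character
text = response.content.decode('utf-8', errors='replace')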
4. Common Use Cases
Scraping Non-English Websites
import requests
# For Chinese websites
response = requests.get('https://example.cn/')
if response.encoding in ['ISO-8859-1', 'ascii']:
    # Server didn't specify encoding properly
    response.encoding = response.apparent_encoding
chinese_content = response.text
Sending International Form Data
import requests
url = 'https://example.com/submit'
form_data = {
    'name': 'François',
    'city': 'Москва',
    'comment': '这是中文评论'
}
# requests automatically handles Unicode in form data
response = requests.post(url, data=form_data)
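If a legacy endpoint expects a non-UTF-8 charset, you can percent-encode the values yourself. This sketch assumes a hypothetical endpoint expecting windows-1251:

import requests
from urllib.parse import urlencode

# Hypothetical legacy endpoint that expects windows-1251 form data
url = 'https://example.com/legacy-submit'
headers = {'Content-Type': 'application/x-www-form-urlencoded; charset=windows-1251'}
body = urlencode({'city': 'Москва'}, encoding='cp1251')
response = requests.post(url, headers=headers, data=body)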
Working with CSV Data
import requests
import csv
from io import StringIO
response = requests.get('https://example.com/data.csv')
response.encoding = 'utf-8' # Ensure proper encoding
# Parse CSV with correct encoding
csv_data = csv.reader(StringIO(response.text))
for row in csv_data:
    print(row)
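Not every CSV export is UTF-8; legacy systems often use a local code page. The same pattern works with any codec Python knows (cp1252 here is just an assumed example):

import requests
import csv
from io import StringIO

response = requests.get('https://example.com/legacy.csv')
response.encoding = 'cp1252'  # assumed encoding for this hypothetical export

for row in csv.reader(StringIO(response.text)):
    print(row)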
5. Best Practices
- Let requests handle it: Use the json= parameter for JSON data instead of manual encoding
- Check apparent_encoding: Use response.apparent_encoding for better charset detection
- Handle errors: Wrap decoding operations in try/except blocks
- Test with international content: Verify your code works with non-ASCII characters
- Use UTF-8 by default: UTF-8 is the most widely supported encoding
A small helper that applies these practices:

import requests

def safe_request(url, **kwargs):
    """Make a GET request with charset-aware decoding."""
    response = requests.get(url, **kwargs)
    # Use apparent encoding if detection seems wrong
    if response.encoding in ['ISO-8859-1', 'ascii'] and response.apparent_encoding:
        response.encoding = response.apparent_encoding
    return response
# Usage
response = safe_request('https://international-site.com/')
print(response.text)
By understanding these charset handling techniques, you can ensure your web scraping and API integration code works correctly with international content and various server configurations.