How can I scrape websites that use GraphQL APIs with Rust?
Scraping websites that use GraphQL APIs with Rust requires a different approach than traditional HTML scraping. GraphQL APIs provide structured data through queries, making them more efficient for data extraction when you know the exact data structure you need. Rust's strong type system and excellent HTTP libraries make it an ideal choice for GraphQL API scraping.
Understanding GraphQL APIs
GraphQL is a query language and runtime for APIs that allows clients to request exactly the data they need. Unlike REST APIs with multiple endpoints, GraphQL typically uses a single endpoint with POST requests containing query strings that specify the required data structure.
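For example, a request is typically a single JSON document POSTed to that endpoint; the field names below are placeholders:
{
  "query": "query($id: ID!) { user(id: $id) { name email } }",
  "variables": { "id": "42" }
}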
Essential Rust Dependencies
To scrape GraphQL APIs effectively in Rust, you'll need these key dependencies in your Cargo.toml:
[dependencies]
reqwest = { version = "0.11", features = ["json"] }
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"
tokio = { version = "1.0", features = ["full"] }
graphql_client = "0.13"
anyhow = "1.0"
Basic GraphQL Query with Reqwest
Here's a fundamental example of making GraphQL requests using the reqwest library:
use reqwest::Client;
use serde::{Deserialize, Serialize};
use serde_json::json;
use std::collections::HashMap;
#[derive(Serialize, Deserialize, Debug)]
struct GraphQLResponse<T> {
data: Option<T>,
errors: Option<Vec<GraphQLError>>,
}
#[derive(Serialize, Deserialize, Debug)]
struct GraphQLError {
message: String,
locations: Option<Vec<HashMap<String, i32>>>,
path: Option<Vec<String>>,
}
#[derive(Serialize)]
struct GraphQLRequest {
query: String,
variables: Option<serde_json::Value>,
}
async fn execute_graphql_query<T>(
client: &Client,
url: &str,
query: &str,
variables: Option<serde_json::Value>,
) -> Result<T, Box<dyn std::error::Error>>
where
T: for<'de> Deserialize<'de>,
{
let request = GraphQLRequest {
query: query.to_string(),
variables,
};
let response = client
.post(url)
.header("Content-Type", "application/json")
.json(&request)
.send()
.await?;
let graphql_response: GraphQLResponse<T> = response.json().await?;
if let Some(errors) = graphql_response.errors {
return Err(format!("GraphQL errors: {:?}", errors).into());
}
graphql_response.data.ok_or("No data in response".into())
}
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
let client = Client::new();
let url = "https://api.example.com/graphql";
let query = r#"
query {
users {
id
name
email
}
}
"#;
#[derive(Deserialize, Debug)]
struct User {
id: String,
name: String,
email: String,
}
#[derive(Deserialize, Debug)]
struct UsersResponse {
users: Vec<User>,
}
let result: UsersResponse = execute_graphql_query(&client, url, query, None).await?;
println!("Users: {:?}", result.users);
Ok(())
}
Using GraphQL Client for Type Safety
The graphql_client crate provides compile-time type safety by generating Rust types from GraphQL schemas:
use graphql_client::{GraphQLQuery, Response};
use reqwest::Client;
#[derive(GraphQLQuery)]
#[graphql(
schema_path = "schema.graphql",
query_path = "queries/users.graphql",
response_derives = "Debug"
)]
struct GetUsers;
async fn fetch_users_with_client() -> Result<(), Box<dyn std::error::Error>> {
let client = Client::new();
let variables = get_users::Variables {};
let request_body = GetUsers::build_query(variables);
let response = client
.post("https://api.example.com/graphql")
.header("User-Agent", "rust-graphql-scraper/1.0")
.json(&request_body)
.send()
.await?;
let response_body: Response<get_users::ResponseData> = response.json().await?;
if let Some(errors) = response_body.errors {
eprintln!("GraphQL errors: {:?}", errors);
}
if let Some(data) = response_body.data {
println!("Fetched {} users", data.users.len());
for user in data.users {
println!("User: {} ({})", user.name, user.email);
}
}
Ok(())
}
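The derive macro reads the query from the file named in query_path, and the operation name in that file must match the struct name. For this example, queries/users.graphql might contain something like the following (assuming the schema exposes a users field):
query GetUsers {
  users {
    id
    name
    email
  }
}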
Handling Authentication
Many GraphQL APIs require authentication. Here's how to handle different authentication methods:
Bearer Token Authentication
use reqwest::header::{HeaderMap, HeaderValue, AUTHORIZATION};
async fn authenticated_graphql_request() -> Result<(), Box<dyn std::error::Error>> {
let mut headers = HeaderMap::new();
headers.insert(
AUTHORIZATION,
HeaderValue::from_str("Bearer your_access_token_here")?,
);
let client = Client::builder()
.default_headers(headers)
.build()?;
let query = r#"
query {
me {
id
profile {
displayName
}
}
}
"#;
// Execute query with authenticated client
let response = execute_graphql_query::<serde_json::Value>(
&client,
"https://api.example.com/graphql",
query,
None,
).await?;
println!("User profile: {:?}", response);
Ok(())
}
API Key Authentication
async fn api_key_authentication() -> Result<(), Box<dyn std::error::Error>> {
let client = Client::new();
let url = "https://api.example.com/graphql";
let query = r#"
query($apiKey: String!) {
data(apiKey: $apiKey) {
items {
id
title
}
}
}
"#;
let variables = json!({
"apiKey": "your_api_key_here"
});
let result: serde_json::Value = execute_graphql_query(
&client,
url,
query,
Some(variables),
).await?;
println!("API response: {:?}", result);
Ok(())
}
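Many APIs expect the key in a request header rather than a query variable. The header name varies by provider; "x-api-key" is a common convention and is assumed here:
use reqwest::Client;
use serde_json::json;

async fn api_key_header_auth() -> Result<(), Box<dyn std::error::Error>> {
    let client = Client::new();
    // "x-api-key" is a common convention, but the header name varies by provider
    let response = client
        .post("https://api.example.com/graphql")
        .header("x-api-key", "your_api_key_here")
        .json(&json!({
            "query": "query { items { id title } }"
        }))
        .send()
        .await?;

    let body: serde_json::Value = response.json().await?;
    println!("API response: {:?}", body);
    Ok(())
}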
Advanced GraphQL Scraping Techniques
Pagination Handling
Many GraphQL APIs implement cursor-based pagination. Here's how to handle it:
use serde_json::json;
// GraphQL responses use camelCase field names, so map them accordingly
#[derive(Deserialize, Debug)]
#[serde(rename_all = "camelCase")]
struct PageInfo {
    has_next_page: bool,
    end_cursor: Option<String>,
}
async fn scrape_all_pages(
client: &Client,
url: &str,
) -> Result<Vec<serde_json::Value>, Box<dyn std::error::Error>> {
let mut all_items = Vec::new();
let mut cursor: Option<String> = None;
loop {
let query = r#"
query($cursor: String) {
items(first: 100, after: $cursor) {
nodes {
id
title
createdAt
}
pageInfo {
hasNextPage
endCursor
}
}
}
"#;
let variables = if let Some(ref cursor_val) = cursor {
json!({ "cursor": cursor_val })
} else {
json!({})
};
#[derive(Deserialize, Debug)]
struct ItemsResponse {
items: PaginatedItems,
}
#[derive(Deserialize, Debug)]
struct PaginatedItems {
nodes: Vec<serde_json::Value>,
#[serde(rename = "pageInfo")]
page_info: PageInfo,
}
let response: ItemsResponse = execute_graphql_query(
client,
url,
query,
Some(variables),
).await?;
all_items.extend(response.items.nodes);
if !response.items.page_info.has_next_page {
break;
}
cursor = response.items.page_info.end_cursor;
// Add delay to respect rate limits
tokio::time::sleep(tokio::time::Duration::from_millis(100)).await;
}
Ok(all_items)
}
Error Handling and Retry Logic
Implement robust error handling with exponential backoff:
use tokio::time::{sleep, Duration};
async fn execute_with_retry<T>(
client: &Client,
url: &str,
query: &str,
variables: Option<serde_json::Value>,
max_retries: u32,
) -> Result<T, Box<dyn std::error::Error>>
where
T: for<'de> Deserialize<'de>,
{
let mut attempt = 0;
loop {
match execute_graphql_query(client, url, query, variables.clone()).await {
Ok(result) => return Ok(result),
Err(e) => {
attempt += 1;
if attempt >= max_retries {
return Err(e);
}
let delay = Duration::from_millis(2_u64.pow(attempt) * 1000);
eprintln!("Request failed (attempt {}), retrying in {:?}: {}", attempt, delay, e);
sleep(delay).await;
}
}
}
}
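Calling the helper looks like this; the endpoint, query, and retry budget are placeholders:
async fn fetch_with_retries(client: &Client) -> Result<(), Box<dyn std::error::Error>> {
    // Up to 3 attempts, backing off 2s after the first failure and 4s after the second
    let items: serde_json::Value = execute_with_retry(
        client,
        "https://api.example.com/graphql",
        "query { items { id } }",
        None,
        3,
    )
    .await?;
    println!("Items: {:?}", items);
    Ok(())
}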
Rate Limiting and Ethical Scraping
When scraping GraphQL APIs, it's crucial to throttle your requests so you don't overwhelm the server. A simple approach bounds the number of in-flight requests with a semaphore and spaces them out with a short delay:
use std::sync::Arc;
use tokio::sync::Semaphore;
use tokio::time::{sleep, Duration};
struct RateLimitedScraper {
client: Client,
semaphore: Arc<Semaphore>,
}
impl RateLimitedScraper {
    // The semaphore caps how many requests can be in flight at once
    fn new(max_concurrent: usize) -> Self {
        Self {
            client: Client::new(),
            semaphore: Arc::new(Semaphore::new(max_concurrent)),
        }
    }
async fn execute_query<T>(
&self,
url: &str,
query: &str,
variables: Option<serde_json::Value>,
) -> Result<T, Box<dyn std::error::Error>>
where
T: for<'de> Deserialize<'de>,
{
        // Hold a permit so at most `max_concurrent` requests run at once
        let _permit = self.semaphore.acquire().await?;
        let result = execute_graphql_query(&self.client, url, query, variables).await?;
        // Keep the permit briefly so requests are spaced out, not just bounded
        sleep(Duration::from_millis(100)).await;
Ok(result)
}
}
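Usage is a one-liner per query; here the scraper allows at most five concurrent requests against a placeholder endpoint:
async fn run_rate_limited() -> Result<(), Box<dyn std::error::Error>> {
    let scraper = RateLimitedScraper::new(5);
    let data: serde_json::Value = scraper
        .execute_query(
            "https://api.example.com/graphql",
            "query { items { id title } }",
            None,
        )
        .await?;
    println!("{:?}", data);
    Ok(())
}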
Integration with Browser Automation
For GraphQL APIs that require complex authentication or are embedded in web applications, you may need to combine Rust with browser automation. Rust's browser automation ecosystem (crates such as headless_chrome and fantoccini) is less mature than Puppeteer's, so a common workflow is to monitor network traffic in the browser's DevTools or in Puppeteer to capture the GraphQL requests, then replicate them in Rust.
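Once you have captured a request in the browser (URL, headers, cookies, and the exact query text), replaying it from Rust is straightforward. Everything below, from the endpoint to the header values, is a placeholder for whatever you captured:
use reqwest::Client;
use serde_json::json;

async fn replay_captured_request() -> Result<(), Box<dyn std::error::Error>> {
    let client = Client::new();
    let response = client
        .post("https://app.example.com/graphql")
        // Copied from the browser's network tab; placeholder values here
        .header("Cookie", "session=your_captured_session_cookie")
        .header("User-Agent", "Mozilla/5.0 (X11; Linux x86_64)")
        .json(&json!({
            "operationName": "Feed",
            "query": "query Feed { feed { id title } }",
            "variables": {}
        }))
        .send()
        .await?;

    let body: serde_json::Value = response.json().await?;
    println!("{:#}", body);
    Ok(())
}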
Best Practices
- Schema Introspection: Use GraphQL's introspection capabilities to understand the API structure (note that many production APIs disable introspection):
async fn introspect_schema(client: &Client, url: &str) -> Result<(), Box<dyn std::error::Error>> {
let introspection_query = r#"
query IntrospectionQuery {
__schema {
types {
name
kind
description
}
}
}
"#;
let result: serde_json::Value = execute_graphql_query(
client,
url,
introspection_query,
None,
).await?;
println!("Schema: {:#}", result);
Ok(())
}
- Query Optimization: Request only the fields you need to minimize bandwidth and response time.
- Caching: Implement response caching for frequently accessed data to reduce API calls; see the sketch after this list.
- Error Recovery: Handle partial failures gracefully and implement appropriate fallback mechanisms.
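As a minimal sketch of the caching point above, the struct below keys an in-memory map by the query text and its variables; real code would add expiry and a size bound:
use std::collections::HashMap;
use std::sync::Mutex;

// Cache key: the query text plus its serialized variables
struct QueryCache {
    entries: Mutex<HashMap<String, serde_json::Value>>,
}

impl QueryCache {
    fn new() -> Self {
        Self { entries: Mutex::new(HashMap::new()) }
    }

    fn get(&self, query: &str, variables: &Option<serde_json::Value>) -> Option<serde_json::Value> {
        let key = format!("{}|{:?}", query, variables);
        self.entries.lock().unwrap().get(&key).cloned()
    }

    fn put(&self, query: &str, variables: &Option<serde_json::Value>, value: serde_json::Value) {
        let key = format!("{}|{:?}", query, variables);
        self.entries.lock().unwrap().insert(key, value);
    }
}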
Conclusion
Scraping GraphQL APIs with Rust provides excellent performance and type safety. The combination of reqwest for HTTP requests, serde for JSON handling, and graphql_client for type-safe queries creates a robust foundation for GraphQL scraping projects. Remember to always respect the API's terms of service, implement proper rate limiting, and handle errors gracefully to create reliable scraping applications.
For complex scenarios involving JavaScript-heavy applications, consider capturing the AJAX and GraphQL traffic with a browser automation tool such as Puppeteer first, then reimplementing those queries in Rust for optimal performance.