How can I scrape websites that use GraphQL APIs with Rust?

Scraping websites that use GraphQL APIs with Rust requires a different approach than traditional HTML scraping. GraphQL APIs provide structured data through queries, making them more efficient for data extraction when you know the exact data structure you need. Rust's strong type system and excellent HTTP libraries make it an ideal choice for GraphQL API scraping.

Understanding GraphQL APIs

GraphQL is a query language and runtime for APIs that allows clients to request exactly the data they need. Unlike REST APIs with multiple endpoints, GraphQL typically uses a single endpoint with POST requests containing query strings that specify the required data structure.
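
To make that concrete, here is a minimal sketch of the JSON payload a GraphQL endpoint typically expects: a single object with a query string and an optional variables object. The endpoint URL and field names are placeholders, not a real API.

use serde_json::json;

fn main() {
    // A GraphQL request is a JSON document sent via POST to one endpoint.
    // "query" holds the GraphQL document; "variables" holds its arguments.
    let body = json!({
        "query": "query($id: ID!) { user(id: $id) { name email } }",
        "variables": { "id": "42" }
    });

    // Placeholder endpoint; real APIs usually expose a single /graphql URL.
    println!("POST https://api.example.com/graphql");
    println!("{}", body);
}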

Essential Rust Dependencies

To scrape GraphQL APIs effectively in Rust, you'll need these key dependencies in your Cargo.toml:

[dependencies]
reqwest = { version = "0.11", features = ["json"] }
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"
tokio = { version = "1.0", features = ["full"] }
graphql_client = "0.13"
anyhow = "1.0"

Basic GraphQL Query with Reqwest

Here's a fundamental example of making GraphQL requests using the reqwest library:

use reqwest::Client;
use serde::{Deserialize, Serialize};
use serde_json::json;
use std::collections::HashMap;

#[derive(Serialize, Deserialize, Debug)]
struct GraphQLResponse<T> {
    data: Option<T>,
    errors: Option<Vec<GraphQLError>>,
}

#[derive(Serialize, Deserialize, Debug)]
struct GraphQLError {
    message: String,
    locations: Option<Vec<HashMap<String, i32>>>,
    path: Option<Vec<String>>,
}

#[derive(Serialize)]
struct GraphQLRequest {
    query: String,
    variables: Option<serde_json::Value>,
}

async fn execute_graphql_query<T>(
    client: &Client,
    url: &str,
    query: &str,
    variables: Option<serde_json::Value>,
) -> Result<T, Box<dyn std::error::Error>>
where
    T: for<'de> Deserialize<'de>,
{
    let request = GraphQLRequest {
        query: query.to_string(),
        variables,
    };

    let response = client
        .post(url)
        .header("Content-Type", "application/json")
        .json(&request)
        .send()
        .await?;

    let graphql_response: GraphQLResponse<T> = response.json().await?;

    if let Some(errors) = graphql_response.errors {
        return Err(format!("GraphQL errors: {:?}", errors).into());
    }

    graphql_response.data.ok_or("No data in response".into())
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = Client::new();
    let url = "https://api.example.com/graphql";

    let query = r#"
        query {
            users {
                id
                name
                email
            }
        }
    "#;

    #[derive(Deserialize, Debug)]
    struct User {
        id: String,
        name: String,
        email: String,
    }

    #[derive(Deserialize, Debug)]
    struct UsersResponse {
        users: Vec<User>,
    }

    let result: UsersResponse = execute_graphql_query(&client, url, query, None).await?;
    println!("Users: {:?}", result.users);

    Ok(())
}

Using GraphQL Client for Type Safety

The graphql_client crate provides compile-time type safety by generating Rust types from GraphQL schemas:

use graphql_client::{GraphQLQuery, Response};
use reqwest::Client;

#[derive(GraphQLQuery)]
#[graphql(
    schema_path = "schema.graphql",
    query_path = "queries/users.graphql",
    response_derives = "Debug"
)]
struct GetUsers;

async fn fetch_users_with_client() -> Result<(), Box<dyn std::error::Error>> {
    let client = Client::new();

    let variables = get_users::Variables {};
    let request_body = GetUsers::build_query(variables);

    let response = client
        .post("https://api.example.com/graphql")
        .header("User-Agent", "rust-graphql-scraper/1.0")
        .json(&request_body)
        .send()
        .await?;

    let response_body: Response<get_users::ResponseData> = response.json().await?;

    if let Some(errors) = response_body.errors {
        eprintln!("GraphQL errors: {:?}", errors);
    }

    if let Some(data) = response_body.data {
        println!("Fetched {} users", data.users.len());
        for user in data.users {
            println!("User: {} ({})", user.name, user.email);
        }
    }

    Ok(())
}

Handling Authentication

Many GraphQL APIs require authentication. Here's how to handle different authentication methods:

Bearer Token Authentication

use reqwest::header::{HeaderMap, HeaderValue, AUTHORIZATION};
use reqwest::Client;

async fn authenticated_graphql_request() -> Result<(), Box<dyn std::error::Error>> {
    let mut headers = HeaderMap::new();
    headers.insert(
        AUTHORIZATION,
        HeaderValue::from_str("Bearer your_access_token_here")?,
    );

    let client = Client::builder()
        .default_headers(headers)
        .build()?;

    let query = r#"
        query {
            me {
                id
                profile {
                    displayName
                }
            }
        }
    "#;

    // Execute query with authenticated client
    let response = execute_graphql_query::<serde_json::Value>(
        &client,
        "https://api.example.com/graphql",
        query,
        None,
    ).await?;

    println!("User profile: {:?}", response);
    Ok(())
}

API Key Authentication

use serde_json::json;

async fn api_key_authentication() -> Result<(), Box<dyn std::error::Error>> {
    let client = Client::new();
    let url = "https://api.example.com/graphql";

    let query = r#"
        query($apiKey: String!) {
            data(apiKey: $apiKey) {
                items {
                    id
                    title
                }
            }
        }
    "#;

    let variables = json!({
        "apiKey": "your_api_key_here"
    });

    let result: serde_json::Value = execute_graphql_query(
        &client,
        url,
        query,
        Some(variables),
    ).await?;

    println!("API response: {:?}", result);
    Ok(())
}

Advanced GraphQL Scraping Techniques

Pagination Handling

Many GraphQL APIs implement cursor-based pagination. Here's how to handle it:

use serde_json::json;

#[derive(Deserialize, Debug)]
#[serde(rename_all = "camelCase")]
struct PageInfo {
    has_next_page: bool,
    end_cursor: Option<String>,
}

async fn scrape_all_pages(
    client: &Client,
    url: &str,
) -> Result<Vec<serde_json::Value>, Box<dyn std::error::Error>> {
    let mut all_items = Vec::new();
    let mut cursor: Option<String> = None;

    loop {
        let query = r#"
            query($cursor: String) {
                items(first: 100, after: $cursor) {
                    nodes {
                        id
                        title
                        createdAt
                    }
                    pageInfo {
                        hasNextPage
                        endCursor
                    }
                }
            }
        "#;

        let variables = if let Some(ref cursor_val) = cursor {
            json!({ "cursor": cursor_val })
        } else {
            json!({})
        };

        #[derive(Deserialize, Debug)]
        struct ItemsResponse {
            items: PaginatedItems,
        }

        #[derive(Deserialize, Debug)]
        struct PaginatedItems {
            nodes: Vec<serde_json::Value>,
            #[serde(rename = "pageInfo")]
            page_info: PageInfo,
        }

        let response: ItemsResponse = execute_graphql_query(
            client,
            url,
            query,
            Some(variables),
        ).await?;

        all_items.extend(response.items.nodes);

        if !response.items.page_info.has_next_page {
            break;
        }

        cursor = response.items.page_info.end_cursor;

        // Add delay to respect rate limits
        tokio::time::sleep(tokio::time::Duration::from_millis(100)).await;
    }

    Ok(all_items)
}

Error Handling and Retry Logic

Implement robust error handling with exponential backoff:

use tokio::time::{sleep, Duration};

async fn execute_with_retry<T>(
    client: &Client,
    url: &str,
    query: &str,
    variables: Option<serde_json::Value>,
    max_retries: u32,
) -> Result<T, Box<dyn std::error::Error>>
where
    T: for<'de> Deserialize<'de>,
{
    let mut attempt = 0;

    loop {
        match execute_graphql_query(client, url, query, variables.clone()).await {
            Ok(result) => return Ok(result),
            Err(e) => {
                attempt += 1;
                if attempt >= max_retries {
                    return Err(e);
                }

                let delay = Duration::from_millis(2_u64.pow(attempt) * 1000);
                eprintln!("Request failed (attempt {}), retrying in {:?}: {}", attempt, delay, e);
                sleep(delay).await;
            }
        }
    }
}

Rate Limiting and Ethical Scraping

When scraping GraphQL APIs, it's crucial to implement rate limiting to avoid overwhelming the server:

use std::sync::Arc;
use tokio::sync::Semaphore;
use tokio::time::{sleep, Duration};

struct RateLimitedScraper {
    client: Client,
    semaphore: Arc<Semaphore>,
}

impl RateLimitedScraper {
    fn new(max_concurrent_requests: usize) -> Self {
        Self {
            client: Client::new(),
            // The semaphore caps how many requests can be in flight at once;
            // combined with the per-request delay below, it throttles throughput.
            semaphore: Arc::new(Semaphore::new(max_concurrent_requests)),
        }
    }

    async fn execute_query<T>(
        &self,
        url: &str,
        query: &str,
        variables: Option<serde_json::Value>,
    ) -> Result<T, Box<dyn std::error::Error>>
    where
        T: for<'de> Deserialize<'de>,
    {
        let _permit = self.semaphore.acquire().await?;

        let result = execute_graphql_query(&self.client, url, query, variables).await?;

        // Add small delay between requests
        sleep(Duration::from_millis(100)).await;

        Ok(result)
    }
}

Integration with Browser Automation

For GraphQL APIs that require complex authentication or are embedded in web applications, you might need to combine Rust with browser automation tools. While Rust's browser automation ecosystem is less mature than Puppeteer's, you can monitor network requests in Puppeteer to capture the GraphQL calls a page makes and then replicate them in Rust, as sketched below.
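
As a sketch of that workflow (with the endpoint, cookie, and query shown here as placeholders for whatever the captured request actually contains), the following replays a GraphQL request using headers copied from the browser's network tab:

use reqwest::Client;
use serde_json::json;

// Replays a GraphQL request captured from the browser's network tab.
// Copy the real endpoint, headers, and query from the captured request;
// the values below are placeholders.
async fn replay_captured_request() -> Result<(), Box<dyn std::error::Error>> {
    let client = Client::new();

    let body = json!({
        "query": "query { viewer { id name } }",
        "variables": {}
    });

    let response = client
        .post("https://app.example.com/graphql")
        .header("Content-Type", "application/json")
        // Session cookie and user agent copied from the captured request.
        .header("Cookie", "session=PASTE_SESSION_COOKIE_HERE")
        .header("User-Agent", "Mozilla/5.0 (compatible; rust-graphql-scraper)")
        .json(&body)
        .send()
        .await?;

    println!("Status: {}", response.status());
    println!("Body: {}", response.text().await?);
    Ok(())
}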

Best Practices

  1. Schema Introspection: Use GraphQL's introspection capabilities to understand the API structure:
async fn introspect_schema(client: &Client, url: &str) -> Result<(), Box<dyn std::error::Error>> {
    let introspection_query = r#"
        query IntrospectionQuery {
            __schema {
                types {
                    name
                    kind
                    description
                }
            }
        }
    "#;

    let result: serde_json::Value = execute_graphql_query(
        client,
        url,
        introspection_query,
        None,
    ).await?;

    println!("Schema: {:#}", result);
    Ok(())
}
  2. Query Optimization: Request only the fields you need to minimize bandwidth and response time.

  3. Caching: Implement response caching for frequently accessed data to reduce API calls (see the sketch after this list).

  4. Error Recovery: Handle partial failures gracefully and implement appropriate fallback mechanisms.
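
For the caching point above, here is a minimal sketch of an in-memory cache keyed by the query text and its serialized variables. The struct and key scheme are illustrative assumptions rather than part of any particular library, and it reuses the execute_graphql_query helper defined earlier in this article.

use std::collections::HashMap;
use reqwest::Client;
use serde_json::Value;

// Minimal in-memory response cache. Illustrative only: production code
// would add expiry and bound the cache size.
struct CachedGraphQLClient {
    client: Client,
    url: String,
    cache: HashMap<String, Value>,
}

impl CachedGraphQLClient {
    fn new(url: &str) -> Self {
        Self {
            client: Client::new(),
            url: url.to_string(),
            cache: HashMap::new(),
        }
    }

    async fn query(
        &mut self,
        query: &str,
        variables: Option<Value>,
    ) -> Result<Value, Box<dyn std::error::Error>> {
        // The cache key combines the query text with its variables.
        let key = format!("{}|{}", query, serde_json::to_string(&variables)?);

        if let Some(cached) = self.cache.get(&key) {
            return Ok(cached.clone());
        }

        // Reuses the execute_graphql_query helper from the first example.
        let result: Value =
            execute_graphql_query(&self.client, &self.url, query, variables).await?;
        self.cache.insert(key, result.clone());
        Ok(result)
    }
}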

Conclusion

Scraping GraphQL APIs with Rust provides excellent performance and type safety. The combination of reqwest for HTTP requests, serde for JSON handling, and graphql_client for type-safe queries creates a robust foundation for GraphQL scraping projects. Remember to always respect the API's terms of service, implement proper rate limiting, and handle errors gracefully to create reliable scraping applications.

For complex scenarios involving JavaScript-heavy applications, consider using browser automation tools such as Puppeteer to handle AJAX requests and capture the underlying GraphQL calls, then reimplement those queries in Rust for optimal performance.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
