What is a Dictionary in C# and how can I use it for web scraping?

A Dictionary<TKey, TValue> in C# is a powerful collection type that stores key-value pairs, allowing fast lookups, insertions, and deletions. In web scraping, dictionaries are essential for organizing scraped data, caching HTTP responses, storing configuration settings, and mapping relationships between scraped elements.

Understanding Dictionary in C#

A Dictionary is part of the System.Collections.Generic namespace and provides O(1) average-time complexity for lookups, making it ideal for handling large datasets common in web scraping operations.

Basic Dictionary Syntax

using System.Collections.Generic;

// Create a dictionary with string keys and string values
Dictionary<string, string> productData = new Dictionary<string, string>();

// Add key-value pairs
productData.Add("title", "Wireless Mouse");
productData.Add("price", "$29.99");
productData.Add("rating", "4.5");

// Alternative initialization syntax
var productInfo = new Dictionary<string, string>
{
    { "title", "Wireless Mouse" },
    { "price", "$29.99" },
    { "rating", "4.5" }
};

// Access values by key
string productTitle = productData["title"];

// Check if key exists
if (productData.ContainsKey("price"))
{
    Console.WriteLine($"Price: {productData["price"]}");
}
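
You will also frequently iterate over entries or update and remove them while cleaning scraped data. A few common operations on the same productData dictionary:

// Iterate over all key-value pairs
foreach (var pair in productData)
{
    Console.WriteLine($"{pair.Key}: {pair.Value}");
}

// The indexer overwrites an existing value (Add would throw for a duplicate key)
productData["price"] = "$24.99";

// Remove returns false if the key is absent
productData.Remove("rating");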

Using Dictionary for Web Scraping

1. Storing Scraped Product Data

When scraping e-commerce websites, dictionaries are perfect for organizing product information:

using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack;

public class ProductScraper
{
    public async Task<List<Dictionary<string, string>>> ScrapeProducts(string url)
    {
        var products = new List<Dictionary<string, string>>();

        using (var client = new HttpClient())
        {
            var html = await client.GetStringAsync(url);
            var doc = new HtmlDocument();
            doc.LoadHtml(html);

            var productNodes = doc.DocumentNode.SelectNodes("//div[@class='product']");

            if (productNodes != null)
            {
                foreach (var node in productNodes)
                {
                    var product = new Dictionary<string, string>
                    {
                        ["name"] = node.SelectSingleNode(".//h2[@class='title']")?.InnerText?.Trim() ?? "",
                        ["price"] = node.SelectSingleNode(".//span[@class='price']")?.InnerText?.Trim() ?? "",
                        ["description"] = node.SelectSingleNode(".//p[@class='desc']")?.InnerText?.Trim() ?? "",
                        ["url"] = node.SelectSingleNode(".//a")?.GetAttributeValue("href", "") ?? "",
                        ["image"] = node.SelectSingleNode(".//img")?.GetAttributeValue("src", "") ?? ""
                    };

                    products.Add(product);
                }
            }
        }

        return products;
    }
}
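
One refinement worth noting: the example above creates and disposes an HttpClient for every call, which can exhaust sockets in long-running scrapers. A common pattern is to share a single instance across requests, sketched here:

public class ProductScraper
{
    // One HttpClient reused for all requests avoids socket exhaustion
    private static readonly HttpClient Client = new HttpClient();

    // ScrapeProducts can then call Client.GetStringAsync(url) directly,
    // with no using block
}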

2. Caching HTTP Responses

Dictionaries are excellent for implementing simple HTTP response caches to avoid redundant requests:

public class CachedScraper
{
    private Dictionary<string, string> _cache = new Dictionary<string, string>();
    private HttpClient _client = new HttpClient();

    public async Task<string> GetPageContent(string url)
    {
        // Check if content is already cached
        if (_cache.ContainsKey(url))
        {
            Console.WriteLine($"Retrieved from cache: {url}");
            return _cache[url];
        }

        // Fetch and cache the content
        Console.WriteLine($"Fetching from web: {url}");
        var content = await _client.GetStringAsync(url);
        _cache[url] = content;

        return content;
    }

    public void ClearCache()
    {
        _cache.Clear();
    }
}
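
The cache above never expires, so long-running scrapers may serve stale pages. One way to bound staleness is to store the fetch timestamp alongside the content in a tuple value; here is a minimal sketch, with a five-minute TTL chosen arbitrarily for illustration:

public class ExpiringCache
{
    private readonly Dictionary<string, (DateTime FetchedAt, string Content)> _cache =
        new Dictionary<string, (DateTime FetchedAt, string Content)>();
    private readonly TimeSpan _ttl = TimeSpan.FromMinutes(5);

    public bool TryGet(string url, out string content)
    {
        // A hit only counts if the entry is still fresh
        if (_cache.TryGetValue(url, out var entry) && DateTime.UtcNow - entry.FetchedAt < _ttl)
        {
            content = entry.Content;
            return true;
        }

        content = null;
        return false;
    }

    public void Set(string url, string content) =>
        _cache[url] = (DateTime.UtcNow, content);
}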

3. Storing Configuration and Headers

Use dictionaries to manage HTTP headers and scraping configurations:

public class WebScraperConfig
{
    public Dictionary<string, string> Headers { get; set; }
    public Dictionary<string, int> Settings { get; set; }

    public WebScraperConfig()
    {
        Headers = new Dictionary<string, string>
        {
            ["User-Agent"] = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
            ["Accept"] = "text/html,application/xhtml+xml",
            ["Accept-Language"] = "en-US,en;q=0.9",
            ["Accept-Encoding"] = "gzip, deflate, br"
        };

        Settings = new Dictionary<string, int>
        {
            ["timeout"] = 30000,
            ["maxRetries"] = 3,
            ["delayMs"] = 1000
        };
    }

    public HttpClient CreateConfiguredClient()
    {
        var client = new HttpClient();
        client.Timeout = TimeSpan.FromMilliseconds(Settings["timeout"]);

        foreach (var header in Headers)
        {
            client.DefaultRequestHeaders.Add(header.Key, header.Value);
        }

        return client;
    }
}
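
One caution: because this configuration sends an Accept-Encoding header itself, the server may respond with compressed bytes that GetStringAsync will not decode. If you keep that header, enable automatic decompression on the handler (this uses the standard System.Net DecompressionMethods flags; Brotli additionally requires DecompressionMethods.Brotli on .NET Core 3.0+):

var handler = new HttpClientHandler
{
    // Transparently decompress gzip and deflate responses
    AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate
};
var client = new HttpClient(handler);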

4. Mapping and Transforming Data

Dictionaries can map scraped data to clean, standardized formats:

public class DataMapper
{
    private Dictionary<string, string> _fieldMapping = new Dictionary<string, string>
    {
        ["product_name"] = "name",
        ["product_price"] = "price",
        ["product_desc"] = "description",
        ["product_img"] = "image"
    };

    public Dictionary<string, string> NormalizeScrapedData(Dictionary<string, string> rawData)
    {
        var normalized = new Dictionary<string, string>();

        foreach (var mapping in _fieldMapping)
        {
            if (rawData.ContainsKey(mapping.Key))
            {
                normalized[mapping.Value] = rawData[mapping.Key];
            }
        }

        return normalized;
    }
}
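
For example, raw field names scraped from one site normalize in a single pass:

var mapper = new DataMapper();

var raw = new Dictionary<string, string>
{
    ["product_name"] = "Wireless Mouse",
    ["product_price"] = "$29.99"
};

var clean = mapper.NormalizeScrapedData(raw);
Console.WriteLine(clean["name"]); // Wireless Mouse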

5. Tracking Visited URLs

When implementing web crawlers, dictionaries help track visited pages and prevent duplicate requests:

public class WebCrawler
{
    private Dictionary<string, bool> _visitedUrls = new Dictionary<string, bool>();
    private Queue<string> _urlQueue = new Queue<string>();

    public async Task CrawlWebsite(string startUrl, int maxPages = 100)
    {
        _urlQueue.Enqueue(startUrl);

        while (_urlQueue.Count > 0 && _visitedUrls.Count < maxPages)
        {
            var currentUrl = _urlQueue.Dequeue();

            // Skip if already visited
            if (_visitedUrls.ContainsKey(currentUrl))
            {
                continue;
            }

            // Mark as visited
            _visitedUrls[currentUrl] = true;

            Console.WriteLine($"Crawling: {currentUrl}");

            // Scrape the page and extract links
            var links = await ExtractLinks(currentUrl);

            foreach (var link in links)
            {
                if (!_visitedUrls.ContainsKey(link))
                {
                    _urlQueue.Enqueue(link);
                }
            }
        }
    }

    private async Task<List<string>> ExtractLinks(string url)
    {
        // Implementation to extract links from page
        return new List<string>();
    }
}
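
Since the crawler only ever stores true in _visitedUrls, the boolean values carry no information. A HashSet<string> (also in System.Collections.Generic, with the same O(1) lookups) expresses set membership more directly, and its Add method combines the membership test and the insert:

private HashSet<string> _visitedUrls = new HashSet<string>();

// Add returns false if the URL was already present
if (!_visitedUrls.Add(currentUrl))
{
    continue;
}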

Advanced Dictionary Techniques for Web Scraping

Using TryGetValue for Safe Access

Instead of checking ContainsKey and then accessing the value, use TryGetValue, which performs a single lookup instead of two:

public string GetProductPrice(Dictionary<string, string> product)
{
    if (product.TryGetValue("price", out string price))
    {
        return price;
    }

    return "Price not available";
}
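
On .NET Core 2.0+ and .NET Standard 2.1, the built-in GetValueOrDefault extension condenses this pattern to a single expression:

// Returns the fallback when the key is missing, still with a single lookup
string price = product.GetValueOrDefault("price", "Price not available");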

Dictionary with Complex Value Types

Store more complex data structures as dictionary values:

public class ScrapedArticle
{
    public string Title { get; set; }
    public DateTime PublishDate { get; set; }
    public List<string> Tags { get; set; }
}

// Dictionary with URL as key and article object as value
var articles = new Dictionary<string, ScrapedArticle>();

articles["https://example.com/article-1"] = new ScrapedArticle
{
    Title = "Web Scraping Best Practices",
    PublishDate = DateTime.Now,
    Tags = new List<string> { "scraping", "data", "automation" }
};
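
Retrieval works exactly as with string values; TryGetValue hands back the whole object:

if (articles.TryGetValue("https://example.com/article-1", out var article))
{
    Console.WriteLine($"{article.Title} ({article.Tags.Count} tags)");
}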

Concurrent Dictionary for Multi-threaded Scraping

When implementing multithreading in C# for faster web scraping, use ConcurrentDictionary for thread-safe operations:

using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Threading.Tasks;

public class ParallelScraper
{
    private ConcurrentDictionary<string, string> _results = new ConcurrentDictionary<string, string>();

    public async Task ScrapeUrlsInParallel(List<string> urls)
    {
        var tasks = urls.Select(async url =>
        {
            var content = await FetchContent(url);
            _results.TryAdd(url, content);
        });

        await Task.WhenAll(tasks);
    }

    private async Task<string> FetchContent(string url)
    {
        using (var client = new HttpClient())
        {
            return await client.GetStringAsync(url);
        }
    }
}
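
Launching one task per URL can overwhelm both your machine and the target site. A common refinement is to throttle concurrency with a SemaphoreSlim; here is a sketch with an illustrative limit of five concurrent requests and a shared HttpClient (SemaphoreSlim lives in System.Threading):

public class ThrottledScraper
{
    private static readonly HttpClient Client = new HttpClient();
    private readonly ConcurrentDictionary<string, string> _results =
        new ConcurrentDictionary<string, string>();
    private readonly SemaphoreSlim _gate = new SemaphoreSlim(5); // at most 5 requests in flight

    public async Task ScrapeUrls(List<string> urls)
    {
        var tasks = urls.Select(async url =>
        {
            await _gate.WaitAsync();
            try
            {
                _results.TryAdd(url, await Client.GetStringAsync(url));
            }
            finally
            {
                _gate.Release();
            }
        });

        await Task.WhenAll(tasks);
    }
}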

Serializing Dictionary Data

After scraping, you often need to export data. Here's how to serialize dictionaries to JSON:

using System.Collections.Generic;
using System.IO;
using System.Text.Json;

public class DataExporter
{
    public void ExportToJson(List<Dictionary<string, string>> data, string filePath)
    {
        var options = new JsonSerializerOptions { WriteIndented = true };
        var json = JsonSerializer.Serialize(data, options);
        File.WriteAllText(filePath, json);
    }

    public List<Dictionary<string, string>> ImportFromJson(string filePath)
    {
        var json = File.ReadAllText(filePath);
        return JsonSerializer.Deserialize<List<Dictionary<string, string>>>(json);
    }
}

// Usage
var scraper = new ProductScraper();
var products = await scraper.ScrapeProducts("https://example.com/products");

var exporter = new DataExporter();
exporter.ExportToJson(products, "products.json");
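
Reading the data back is symmetric:

var loaded = exporter.ImportFromJson("products.json");
Console.WriteLine($"Loaded {loaded.Count} products");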

Best Practices

  1. Initialize with Capacity: If you know the approximate size, initialize the dictionary with a capacity to reduce internal resizing:
   var data = new Dictionary<string, string>(100);
  2. Provide Defaults: Combine TryGetValue with a fallback value when a key may be missing:
   string value = dict.TryGetValue("key", out var val) ? val : "default";
  3. Clean Keys: Normalize keys to avoid near-duplicate entries:
   string cleanKey = rawKey.Trim().ToLowerInvariant();
   dict[cleanKey] = value;
  4. Handle Missing Keys: Indexer access throws KeyNotFoundException for absent keys, so wrap it in try-catch when you cannot use TryGetValue:
   try
   {
       var value = dict["key"];
   }
   catch (KeyNotFoundException)
   {
       Console.WriteLine("Key not found");
   }

Conclusion

Dictionary is an indispensable data structure for C# web scraping projects. It provides efficient storage and retrieval of scraped data, enables caching strategies, manages configuration settings, and facilitates data transformation. By mastering Dictionary operations, you'll build more efficient, maintainable, and scalable web scraping applications.

Whether you're storing product information, tracking visited URLs, or organizing complex scraped datasets, the Dictionary collection type offers the flexibility and performance needed for professional web scraping projects in C#.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
