How do I work with arrays and lists in C# for storing scraped data?

When web scraping with C#, choosing the right data structure for storing your scraped data is crucial for performance and code maintainability. C# offers several collection types, but arrays and List<T> are the most commonly used for storing scraped data. This guide explores both options, their use cases, and best practices for web scraping scenarios.

Understanding Arrays vs Lists in C#

Arrays: Fixed-Size Collections

Arrays in C# are fixed-size collections that store elements of the same type in contiguous memory locations. Once created, their size cannot be changed.

// Declare and initialize an array
string[] productNames = new string[10];

// Or initialize with values
string[] urls = new string[] {
    "https://example.com/page1",
    "https://example.com/page2"
};

Pros:

  • Faster access time (O(1) lookup)
  • Lower memory overhead
  • Better for fixed-size datasets

Cons:

  • Fixed size (must know size beforehand)
  • Cannot add or remove elements dynamically
  • Less flexible for web scraping where result count varies

Lists: Dynamic Collections

List<T> is a generic collection that can grow or shrink dynamically. It's part of the System.Collections.Generic namespace and is the preferred choice for most web scraping scenarios.

using System.Collections.Generic;

// Create a new list
List<string> scrapedTitles = new List<string>();

// Or initialize with values
List<string> keywords = new List<string> { "web scraping", "C#", "data extraction" };

Pros:

  • Dynamic sizing (grows automatically)
  • Rich set of methods (Add, Remove, Find, etc.)
  • Perfect for unknown result counts
  • Better for web scraping scenarios

Cons:

  • Slightly higher memory overhead
  • Marginally slower than arrays for very large datasets

Storing Simple Scraped Data

Using Lists for Basic Data Storage

When scraping simple data like titles, prices, or URLs, a List<string> is typically your best choice:

using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public class SimpleWebScraper
{
    public List<string> ScrapeProductTitles(string url)
    {
        List<string> titles = new List<string>();

        var web = new HtmlWeb();
        var document = web.Load(url);

        // Extract all product titles
        var titleNodes = document.DocumentNode.SelectNodes("//h2[@class='product-title']");

        if (titleNodes != null)
        {
            foreach (var node in titleNodes)
            {
                titles.Add(node.InnerText.Trim());
            }
        }

        return titles;
    }
}

Working with Multiple Data Types

For storing different types of scraped data, use type-specific lists:

public class ProductScraper
{
    public void ScrapeProducts(string url)
    {
        List<string> productNames = new List<string>();
        List<decimal> prices = new List<decimal>();
        List<int> stockCounts = new List<int>();
        List<bool> inStock = new List<bool>();

        var web = new HtmlWeb();
        var document = web.Load(url);

        var productNodes = document.DocumentNode.SelectNodes("//div[@class='product']");

        // SelectNodes returns null when nothing matches, so guard before iterating
        if (productNodes == null) return;

        foreach (var product in productNodes)
        {
            productNames.Add(product.SelectSingleNode(".//h3").InnerText);
            prices.Add(decimal.Parse(product.SelectSingleNode(".//span[@class='price']").InnerText.Replace("$", "")));
            stockCounts.Add(int.Parse(product.SelectSingleNode(".//span[@class='stock']").InnerText));
            inStock.Add(stockCounts[stockCounts.Count - 1] > 0);
        }
    }
}

Storing Complex Scraped Data with Custom Classes

For real-world web scraping, you'll often need to store structured data with multiple fields. Creating custom classes and storing them in List<T> collections is the most maintainable approach:

using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public class Product
{
    public string Name { get; set; }
    public decimal Price { get; set; }
    public string Description { get; set; }
    public string ImageUrl { get; set; }
    public int Rating { get; set; }
    public DateTime ScrapedAt { get; set; }
}

public class ProductScraper
{
    public List<Product> ScrapeProducts(string url)
    {
        List<Product> products = new List<Product>();

        var web = new HtmlWeb();
        var document = web.Load(url);

        var productNodes = document.DocumentNode.SelectNodes("//div[@class='product-item']");

        if (productNodes != null)
        {
            foreach (var node in productNodes)
            {
                var product = new Product
                {
                    Name = node.SelectSingleNode(".//h2[@class='title']")?.InnerText.Trim(),
                    Price = ParsePrice(node.SelectSingleNode(".//span[@class='price']")?.InnerText),
                    Description = node.SelectSingleNode(".//p[@class='description']")?.InnerText.Trim(),
                    ImageUrl = node.SelectSingleNode(".//img")?.GetAttributeValue("src", ""),
                    Rating = ParseRating(node.SelectSingleNode(".//div[@class='rating']")?.InnerText),
                    ScrapedAt = DateTime.UtcNow
                };

                products.Add(product);
            }
        }

        return products;
    }

    private decimal ParsePrice(string priceText)
    {
        if (string.IsNullOrEmpty(priceText)) return 0;

        string cleaned = priceText.Replace("$", "").Replace(",", "").Trim();
        return decimal.TryParse(cleaned, out decimal result) ? result : 0;
    }

    private int ParseRating(string ratingText)
    {
        if (string.IsNullOrEmpty(ratingText)) return 0;

        return int.TryParse(ratingText.Trim(), out int result) ? result : 0;
    }
}

Advanced List Operations for Web Scraping

Filtering Scraped Data

Use LINQ to filter your scraped data efficiently:

using System.Linq;

List<Product> products = ScrapeProducts("https://example.com");

// Filter products by price
List<Product> affordableProducts = products.Where(p => p.Price < 100).ToList();

// Filter products by rating
List<Product> topRatedProducts = products.Where(p => p.Rating >= 4).ToList();

// Get products with specific keywords (guard against null names)
List<Product> searchResults = products
    .Where(p => p.Name != null && p.Name.Contains("laptop", StringComparison.OrdinalIgnoreCase))
    .ToList();

Removing Duplicates

When scraping multiple pages, you might encounter duplicate entries. Here's how to handle them:

public class ProductEqualityComparer : IEqualityComparer<Product>
{
    public bool Equals(Product x, Product y)
    {
        if (x == null || y == null) return false;
        return x.Name == y.Name && x.Price == y.Price;
    }

    public int GetHashCode(Product obj)
    {
        // HashCode.Combine handles nulls and avoids the intermediate string allocation
        return HashCode.Combine(obj.Name, obj.Price);
    }
}

// Remove duplicates
List<Product> allProducts = new List<Product>();
// ... scrape from multiple pages ...

List<Product> uniqueProducts = allProducts
    .Distinct(new ProductEqualityComparer())
    .ToList();

Sorting Scraped Data

Sort your scraped data using various criteria:

// Sort by price ascending
products.Sort((x, y) => x.Price.CompareTo(y.Price));

// Sort by price descending
products.Sort((x, y) => y.Price.CompareTo(x.Price));

// Sort by multiple criteria using LINQ
var sortedProducts = products
    .OrderByDescending(p => p.Rating)
    .ThenBy(p => p.Price)
    .ToList();

Performance Considerations

Pre-allocating List Capacity

When you have an estimate of how many items you'll scrape, pre-allocate the list capacity to improve performance:

// If you expect around 100 products
List<Product> products = new List<Product>(100);

This reduces the number of internal array reallocations as the list grows.
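You can observe this effect through the Capacity property. A quick sketch (the exact growth steps are a runtime implementation detail, so the numbers may vary across .NET versions):

var list = new List<int>();
Console.WriteLine(list.Capacity); // 0 for an empty list

for (int i = 0; i < 100; i++)
{
    list.Add(i);
}

// The backing array grew in steps (typically 4, 8, 16, 32, 64, 128)
Console.WriteLine(list.Capacity); // e.g. 128

// Pre-allocating skips those intermediate reallocations
var preallocated = new List<int>(100);
Console.WriteLine(preallocated.Capacity); // 100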

Using Arrays for Known Sizes

If you're scraping a fixed number of items (e.g., exactly 10 featured products), arrays can be more efficient:

public Product[] ScrapeFeaturedProducts(string url)
{
    Product[] featured = new Product[10];

    var web = new HtmlWeb();
    var document = web.Load(url);

    var nodes = document.DocumentNode.SelectNodes("//div[@class='featured-product']");

    // SelectNodes returns null when nothing matches
    if (nodes == null) return featured;

    for (int i = 0; i < Math.Min(nodes.Count, 10); i++)
    {
        featured[i] = ParseProduct(nodes[i]);
    }

    return featured;
}
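
The ParseProduct helper referenced above isn't defined in this snippet; here's a minimal sketch, assuming the same markup and the ParsePrice/ParseRating helpers from the earlier Product example:

private Product ParseProduct(HtmlNode node)
{
    // Reuses the null-conditional selector pattern from ScrapeProducts above
    return new Product
    {
        Name = node.SelectSingleNode(".//h2[@class='title']")?.InnerText.Trim(),
        Price = ParsePrice(node.SelectSingleNode(".//span[@class='price']")?.InnerText),
        Description = node.SelectSingleNode(".//p[@class='description']")?.InnerText.Trim(),
        ImageUrl = node.SelectSingleNode(".//img")?.GetAttributeValue("src", ""),
        Rating = ParseRating(node.SelectSingleNode(".//div[@class='rating']")?.InnerText),
        ScrapedAt = DateTime.UtcNow
    };
}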

Concurrent Collections for Parallel Scraping

When scraping multiple pages in parallel, use thread-safe collections:

using System.Collections.Concurrent;
using System.Threading.Tasks;

public async Task<List<Product>> ScrapeMultiplePages(List<string> urls)
{
    ConcurrentBag<Product> allProducts = new ConcurrentBag<Product>();

    await Parallel.ForEachAsync(urls, async (url, cancellationToken) =>
    {
        var products = await ScrapeProductsAsync(url);

        foreach (var product in products)
        {
            allProducts.Add(product);
        }
    });

    return allProducts.ToList();
}
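
ScrapeProductsAsync is assumed above; a minimal sketch using HtmlAgilityPack's async loader (HtmlWeb.LoadFromWebAsync) and the ParseProduct helper from earlier:

private async Task<List<Product>> ScrapeProductsAsync(string url)
{
    // HtmlWeb is not guaranteed to be thread-safe, so each call gets its own instance
    var web = new HtmlWeb();
    var document = await web.LoadFromWebAsync(url);

    var products = new List<Product>();
    var nodes = document.DocumentNode.SelectNodes("//div[@class='product-item']");

    if (nodes != null)
    {
        foreach (var node in nodes)
        {
            products.Add(ParseProduct(node));
        }
    }

    return products;
}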

Nested Collections for Hierarchical Data

When scraping hierarchical data (categories with products, pages with sections), use nested lists:

public class Category
{
    public string Name { get; set; }
    public List<Product> Products { get; set; }

    public Category()
    {
        Products = new List<Product>();
    }
}

public List<Category> ScrapeCatalog(string url)
{
    List<Category> categories = new List<Category>();

    var web = new HtmlWeb();
    var document = web.Load(url);

    var categoryNodes = document.DocumentNode.SelectNodes("//div[@class='category']");

    // SelectNodes returns null when nothing matches
    if (categoryNodes == null) return categories;

    foreach (var catNode in categoryNodes)
    {
        var category = new Category
        {
            Name = catNode.SelectSingleNode(".//h2")?.InnerText.Trim()
        };

        var productNodes = catNode.SelectNodes(".//div[@class='product']");

        if (productNodes != null)
        {
            foreach (var prodNode in productNodes)
            {
                category.Products.Add(ParseProduct(prodNode));
            }
        }

        categories.Add(category);
    }

    return categories;
}

Converting Between Arrays and Lists

Sometimes you need to convert between these collection types:

// List to Array
List<string> titlesList = new List<string> { "Title 1", "Title 2" };
string[] titlesArray = titlesList.ToArray();

// Array to List
string[] urlsArray = new string[] { "url1", "url2" };
List<string> urlsList = new List<string>(urlsArray);
// or
List<string> urlsList2 = urlsArray.ToList();

Best Practices for Web Scraping Data Storage

  1. Use Lists by default: Unless you have a specific reason to use arrays, List<T> offers better flexibility for web scraping.

  2. Create custom classes: For structured data, always create custom classes instead of managing parallel lists of different types.

  3. Initialize collections properly: Always initialize your lists in constructors or property declarations to avoid null reference exceptions.

  4. Use appropriate data types: Choose the right data type for each field (decimal for prices, DateTime for dates, etc.).

  5. Handle nulls gracefully: When parsing HTML content, always check for null values before accessing properties.

  6. Consider memory usage: For very large datasets (millions of records), consider processing in batches rather than storing everything in memory (see the sketch after this list).

  7. Use LINQ judiciously: While LINQ is powerful, be aware that methods like Where() and Select() create new collections, which can impact memory usage.
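
For point 6, here's a minimal batching sketch; exportBatch is a hypothetical callback standing in for your own persistence logic (appending to a file, inserting into a database, etc.):

public void ScrapeInBatches(List<string> urls, int batchSize, Action<List<Product>> exportBatch)
{
    var buffer = new List<Product>(batchSize);

    foreach (var url in urls)
    {
        buffer.AddRange(ScrapeProducts(url));

        // Flush once the buffer reaches the batch size instead of holding everything
        if (buffer.Count >= batchSize)
        {
            exportBatch(buffer);
            buffer.Clear();
        }
    }

    // Flush the final partial batch
    if (buffer.Count > 0)
    {
        exportBatch(buffer);
    }
}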

Exporting Scraped Data

Once you've stored your scraped data, you'll likely want to export it:

using System.IO;
using System.Text.Json;

// Export to JSON
public void ExportToJson(List<Product> products, string filePath)
{
    string json = JsonSerializer.Serialize(products, new JsonSerializerOptions
    {
        WriteIndented = true
    });

    File.WriteAllText(filePath, json);
}

// Export to CSV
public void ExportToCsv(List<Product> products, string filePath)
{
    using (var writer = new StreamWriter(filePath))
    {
        // Write header
        writer.WriteLine("Name,Price,Rating,Description,ImageUrl,ScrapedAt");

        // Write data (double embedded quotes so fields stay valid CSV;
        // :O writes an unambiguous ISO 8601 timestamp)
        foreach (var product in products)
        {
            string name = product.Name?.Replace("\"", "\"\"");
            string description = product.Description?.Replace("\"", "\"\"");

            writer.WriteLine($"\"{name}\",{product.Price},{product.Rating}," +
                           $"\"{description}\",\"{product.ImageUrl}\",{product.ScrapedAt:O}");
        }
    }
}

Conclusion

Working with arrays and lists in C# for web scraping is straightforward once you understand the trade-offs. For most web scraping scenarios, List<T> is the preferred choice due to its dynamic sizing and rich functionality. Combine lists with custom classes to create maintainable, type-safe code that handles complex scraped data efficiently. Whether you're extracting simple text or complex hierarchical data, C#'s collection types provide the flexibility and performance needed for professional web scraping applications.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
