How do I work with arrays and lists in C# for storing scraped data?
When web scraping with C#, choosing the right data structure for storing your scraped data is crucial for performance and code maintainability. C# offers several collection types, but arrays and Lists are the most commonly used for storing scraped data. This guide explores both options, their use cases, and best practices for web scraping scenarios.
Understanding Arrays vs Lists in C#
Arrays: Fixed-Size Collections
Arrays in C# are fixed-size collections that store elements of the same type in contiguous memory locations. Once created, their size cannot be changed.
// Declare and initialize an array
string[] productNames = new string[10];
// Or initialize with values
string[] urls = new string[] {
    "https://example.com/page1",
    "https://example.com/page2"
};
Pros:
- Fast O(1) index access with minimal overhead
- Lower memory overhead than a List<T>
- Well suited to fixed-size datasets

Cons:
- Fixed size (you must know the size beforehand)
- Elements cannot be added or removed dynamically
- Less flexible for web scraping, where result counts vary
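If you do need to "grow" an array, Array.Resize is the standard workaround; under the hood it allocates a new array and copies the elements over, which is exactly the bookkeeping List<T> automates for you. A minimal sketch:

using System;

string[] urls = new string[] { "https://example.com/page1" };

// Array.Resize allocates a NEW backing array of the requested size and
// copies the existing elements into it; the variable is updated via ref.
Array.Resize(ref urls, 2);
urls[1] = "https://example.com/page2";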
Lists: Dynamic Collections
List<T> is a generic collection that can grow or shrink dynamically. It's part of the System.Collections.Generic namespace and is the preferred choice for most web scraping scenarios.
using System.Collections.Generic;
// Create a new list
List<string> scrapedTitles = new List<string>();
// Or initialize with values
List<string> keywords = new List<string> { "web scraping", "C#", "data extraction" };
Pros:
- Dynamic sizing (grows automatically)
- Rich set of methods (Add, Remove, Find, etc.)
- Handles unknown result counts gracefully
- A natural fit for most web scraping scenarios

Cons:
- Slightly higher memory overhead than an array
- Marginally slower than arrays for very large datasets
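A quick illustration of those methods, using made-up titles:

using System.Collections.Generic;

List<string> titles = new List<string>();

titles.Add("Product A");                           // append an element
titles.Add("Product B");
titles.Remove("Product A");                        // remove the first matching element
string found = titles.Find(t => t.Contains("B"));  // first match for a predicate, or null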
Storing Simple Scraped Data
Using Lists for Basic Data Storage
When scraping simple data like titles, prices, or URLs, a List<string> is typically your best choice:
using System;
using System.Collections.Generic;
using HtmlAgilityPack;
public class SimpleWebScraper
{
    public List<string> ScrapeProductTitles(string url)
    {
        List<string> titles = new List<string>();
        var web = new HtmlWeb();
        var document = web.Load(url);

        // Extract all product titles
        var titleNodes = document.DocumentNode.SelectNodes("//h2[@class='product-title']");
        if (titleNodes != null)
        {
            foreach (var node in titleNodes)
            {
                titles.Add(node.InnerText.Trim());
            }
        }

        return titles;
    }
}
Working with Multiple Data Types
To store several fields of different types, one option is to keep a separate, type-specific list per field (so-called parallel lists). This works for quick scripts, though the custom-class approach shown in the next section is usually more maintainable:
public class ProductScraper
{
    public void ScrapeProducts(string url)
    {
        List<string> productNames = new List<string>();
        List<decimal> prices = new List<decimal>();
        List<int> stockCounts = new List<int>();
        List<bool> inStock = new List<bool>();

        var web = new HtmlWeb();
        var document = web.Load(url);

        var productNodes = document.DocumentNode.SelectNodes("//div[@class='product']");
        if (productNodes == null) return; // SelectNodes returns null when nothing matches

        foreach (var product in productNodes)
        {
            productNames.Add(product.SelectSingleNode(".//h3").InnerText);
            prices.Add(decimal.Parse(product.SelectSingleNode(".//span[@class='price']").InnerText.Replace("$", "")));
            stockCounts.Add(int.Parse(product.SelectSingleNode(".//span[@class='stock']").InnerText));
            inStock.Add(stockCounts[stockCounts.Count - 1] > 0);
        }
    }
}
Storing Complex Scraped Data with Custom Classes
For real-world web scraping, you'll often need to store structured data with multiple fields. Creating custom classes and storing them in List<T> collections is the most maintainable approach:
using System;
using System.Collections.Generic;
using HtmlAgilityPack;
public class Product
{
    public string Name { get; set; }
    public decimal Price { get; set; }
    public string Description { get; set; }
    public string ImageUrl { get; set; }
    public int Rating { get; set; }
    public DateTime ScrapedAt { get; set; }
}

public class ProductScraper
{
    public List<Product> ScrapeProducts(string url)
    {
        List<Product> products = new List<Product>();
        var web = new HtmlWeb();
        var document = web.Load(url);

        var productNodes = document.DocumentNode.SelectNodes("//div[@class='product-item']");
        if (productNodes != null)
        {
            foreach (var node in productNodes)
            {
                var product = new Product
                {
                    Name = node.SelectSingleNode(".//h2[@class='title']")?.InnerText.Trim(),
                    Price = ParsePrice(node.SelectSingleNode(".//span[@class='price']")?.InnerText),
                    Description = node.SelectSingleNode(".//p[@class='description']")?.InnerText.Trim(),
                    ImageUrl = node.SelectSingleNode(".//img")?.GetAttributeValue("src", ""),
                    Rating = ParseRating(node.SelectSingleNode(".//div[@class='rating']")?.InnerText),
                    ScrapedAt = DateTime.UtcNow
                };
                products.Add(product);
            }
        }

        return products;
    }

    private decimal ParsePrice(string priceText)
    {
        if (string.IsNullOrEmpty(priceText)) return 0;
        string cleaned = priceText.Replace("$", "").Replace(",", "").Trim();
        return decimal.TryParse(cleaned, out decimal result) ? result : 0;
    }

    private int ParseRating(string ratingText)
    {
        if (string.IsNullOrEmpty(ratingText)) return 0;
        return int.TryParse(ratingText.Trim(), out int result) ? result : 0;
    }
}
Advanced List Operations for Web Scraping
Filtering Scraped Data
Use LINQ to filter your scraped data efficiently:
using System.Linq;
List<Product> products = ScrapeProducts("https://example.com");
// Filter products by price
List<Product> affordableProducts = products.Where(p => p.Price < 100).ToList();
// Filter products by rating
List<Product> topRatedProducts = products.Where(p => p.Rating >= 4).ToList();
// Get products with specific keywords
List<Product> searchResults = products
    .Where(p => p.Name != null && p.Name.Contains("laptop", StringComparison.OrdinalIgnoreCase))
    .ToList();
Removing Duplicates
When scraping multiple pages, you might encounter duplicate entries. Here's how to handle them:
public class ProductEqualityComparer : IEqualityComparer<Product>
{
    public bool Equals(Product x, Product y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x == null || y == null) return false;
        return x.Name == y.Name && x.Price == y.Price;
    }

    public int GetHashCode(Product obj)
    {
        // Combine the same fields used in Equals so equal products hash equally
        return HashCode.Combine(obj.Name, obj.Price);
    }
}
// Remove duplicates
List<Product> allProducts = new List<Product>();
// ... scrape from multiple pages ...
List<Product> uniqueProducts = allProducts
    .Distinct(new ProductEqualityComparer())
    .ToList();
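On .NET 6 or later, LINQ's DistinctBy offers a shorter alternative to writing a comparer class; keying on a value tuple gives the same name-plus-price semantics:

// .NET 6+ alternative: value tuples provide the equality logic
List<Product> uniqueProducts2 = allProducts
    .DistinctBy(p => (p.Name, p.Price))
    .ToList();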
Sorting Scraped Data
Sort your scraped data using various criteria:
// Sort by price ascending
products.Sort((x, y) => x.Price.CompareTo(y.Price));
// Sort by price descending
products.Sort((x, y) => y.Price.CompareTo(x.Price));
// Sort by multiple criteria using LINQ
var sortedProducts = products
    .OrderByDescending(p => p.Rating)
    .ThenBy(p => p.Price)
    .ToList();
Performance Considerations
Pre-allocating List Capacity
When you have an estimate of how many items you'll scrape, pre-allocate the list capacity to improve performance:
// If you expect around 100 products
List<Product> products = new List<Product>(100);
This reduces the number of internal array reallocations as the list grows.
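You can watch this happen through the Capacity property. The exact growth sequence is a runtime implementation detail (current .NET starts at 4 and doubles), so treat the numbers in the comments as illustrative rather than guaranteed:

using System;
using System.Collections.Generic;

var numbers = new List<int>();
Console.WriteLine(numbers.Capacity); // 0 -- no backing array allocated yet

for (int i = 0; i < 5; i++)
{
    numbers.Add(i);
}
Console.WriteLine(numbers.Capacity); // typically 8 (grew 0 -> 4 -> 8 while adding 5 items)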
Using Arrays for Known Sizes
If you're scraping a fixed number of items (e.g., exactly 10 featured products), arrays can be more efficient:
public Product[] ScrapeFeaturedProducts(string url)
{
    Product[] featured = new Product[10];
    var web = new HtmlWeb();
    var document = web.Load(url);

    var nodes = document.DocumentNode.SelectNodes("//div[@class='featured-product']");
    if (nodes == null) return featured; // no matches: all slots stay null

    for (int i = 0; i < Math.Min(nodes.Count, 10); i++)
    {
        featured[i] = ParseProduct(nodes[i]);
    }

    return featured;
}
Concurrent Collections for Parallel Scraping
When scraping multiple pages in parallel, use thread-safe collections:
using System.Collections.Concurrent;
using System.Threading.Tasks;
public async Task<List<Product>> ScrapeMultiplePages(List<string> urls)
{
    ConcurrentBag<Product> allProducts = new ConcurrentBag<Product>();

    await Parallel.ForEachAsync(urls, async (url, cancellationToken) =>
    {
        var products = await ScrapeProductsAsync(url);
        foreach (var product in products)
        {
            allProducts.Add(product);
        }
    });

    return allProducts.ToList();
}
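One caveat: ConcurrentBag<T> makes no ordering guarantees, so results from different pages come back interleaved. If you need output grouped in the original URL order, a sketch like the following works, collecting each page's results into a ConcurrentDictionary keyed by URL and flattening at the end (it reuses the same assumed ScrapeProductsAsync method):

public async Task<List<Product>> ScrapeMultiplePagesOrdered(List<string> urls)
{
    var resultsByUrl = new ConcurrentDictionary<string, List<Product>>();

    await Parallel.ForEachAsync(urls, async (url, cancellationToken) =>
    {
        resultsByUrl[url] = await ScrapeProductsAsync(url);
    });

    // Flatten in the original URL order
    return urls.SelectMany(url => resultsByUrl[url]).ToList();
}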
Nested Collections for Hierarchical Data
When scraping hierarchical data (categories with products, pages with sections), use nested lists:
public class Category
{
    public string Name { get; set; }
    public List<Product> Products { get; set; }

    public Category()
    {
        Products = new List<Product>();
    }
}
public List<Category> ScrapeCatalog(string url)
{
    List<Category> categories = new List<Category>();
    var web = new HtmlWeb();
    var document = web.Load(url);

    var categoryNodes = document.DocumentNode.SelectNodes("//div[@class='category']");
    if (categoryNodes == null) return categories; // SelectNodes returns null when nothing matches

    foreach (var catNode in categoryNodes)
    {
        var category = new Category
        {
            Name = catNode.SelectSingleNode(".//h2")?.InnerText.Trim()
        };

        var productNodes = catNode.SelectNodes(".//div[@class='product']");
        if (productNodes != null)
        {
            foreach (var prodNode in productNodes)
            {
                category.Products.Add(ParseProduct(prodNode));
            }
        }

        categories.Add(category);
    }

    return categories;
}
Converting Between Arrays and Lists
Sometimes you need to convert between these collection types:
// List to Array
List<string> titlesList = new List<string> { "Title 1", "Title 2" };
string[] titlesArray = titlesList.ToArray();
// Array to List
string[] urlsArray = new string[] { "url1", "url2" };
List<string> urlsList = new List<string>(urlsArray);
// or
List<string> urlsList2 = urlsArray.ToList();
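Note that both directions copy the elements into a new collection, so mutating one collection afterwards doesn't affect the other.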
Best Practices for Web Scraping Data Storage
- Use Lists by default: Unless you have a specific reason to use arrays, List<T> offers better flexibility for web scraping.
- Create custom classes: For structured data, always create custom classes instead of managing parallel lists of different types.
- Initialize collections properly: Always initialize your lists in constructors or property declarations to avoid null reference exceptions.
- Use appropriate data types: Choose the right data type for each field (decimal for prices, DateTime for dates, etc.).
- Handle nulls gracefully: When parsing HTML content, always check for null values before accessing properties.
- Consider memory usage: For very large datasets (millions of records), process in batches rather than storing everything in memory (see the sketch after this list).
- Use LINQ judiciously: While LINQ is powerful, be aware that chaining Where() or Select() with ToList() materializes a new collection each time, which can impact memory usage.
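As a minimal sketch of that batching idea, assume a hypothetical ProcessBatch method that persists each batch (writes to a file, database, etc.) before the next pages are scraped; scraper is an instance of the ProductScraper shown earlier:

// Hypothetical batching loop: ProcessBatch and the urls collection are
// assumptions here. The point is to persist and release memory every
// 1,000 items instead of accumulating millions of products in one list.
const int batchSize = 1000;
List<Product> batch = new List<Product>(batchSize);

foreach (string url in urls)
{
    batch.AddRange(scraper.ScrapeProducts(url));
    if (batch.Count >= batchSize)
    {
        ProcessBatch(batch);  // persist this batch
        batch.Clear();        // then free it for reuse
    }
}

if (batch.Count > 0)
{
    ProcessBatch(batch);      // flush the final partial batch
}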
Exporting Scraped Data
Once you've stored your scraped data, you'll likely want to export it:
using System.IO;
using System.Text.Json;
// Export to JSON
public void ExportToJson(List<Product> products, string filePath)
{
    string json = JsonSerializer.Serialize(products, new JsonSerializerOptions
    {
        WriteIndented = true
    });
    File.WriteAllText(filePath, json);
}
// Export to CSV
public void ExportToCsv(List<Product> products, string filePath)
{
    // Escape embedded quotes by doubling them, per the CSV convention
    static string Escape(string value) => (value ?? "").Replace("\"", "\"\"");

    using (var writer = new StreamWriter(filePath))
    {
        // Write header
        writer.WriteLine("Name,Price,Rating,Description,ImageUrl,ScrapedAt");

        // Write data (the "o" format gives ISO 8601 timestamps without commas)
        foreach (var product in products)
        {
            writer.WriteLine($"\"{Escape(product.Name)}\",{product.Price},{product.Rating}," +
                $"\"{Escape(product.Description)}\",\"{Escape(product.ImageUrl)}\",{product.ScrapedAt:o}");
        }
    }
}
Conclusion
Working with arrays and lists in C# for web scraping is straightforward once you understand the trade-offs. For most web scraping scenarios, List<T> is the preferred choice due to its dynamic sizing and rich functionality. Combine lists with custom classes to create maintainable, type-safe code that handles complex scraped data efficiently. Whether you're extracting simple text or complex hierarchical data, C#'s collection types provide the flexibility and performance needed for professional web scraping applications.