How do I download files using C# during web scraping?
Downloading files is a common requirement when scraping websites, whether you need to save images, PDFs, documents, or other resources. C# provides several built-in methods for downloading files efficiently, ranging from simple synchronous downloads to advanced asynchronous operations with progress tracking.
Using HttpClient for File Downloads
The recommended approach for downloading files in modern C# is HttpClient, which provides async support and better resource management:
using System;
using System.IO;
using System.Net.Http;
using System.Threading.Tasks;
public class FileDownloader
{
private static readonly HttpClient client = new HttpClient();
public async Task DownloadFileAsync(string url, string destinationPath)
{
try
{
// Download the file as a byte array
byte[] fileBytes = await client.GetByteArrayAsync(url);
// Write to disk
await File.WriteAllBytesAsync(destinationPath, fileBytes);
Console.WriteLine($"File downloaded successfully to {destinationPath}");
}
catch (HttpRequestException e)
{
Console.WriteLine($"Error downloading file: {e.Message}");
throw;
}
}
}
// Usage
var downloader = new FileDownloader();
await downloader.DownloadFileAsync(
"https://example.com/document.pdf",
@"C:\Downloads\document.pdf"
);
This method loads the entire file into memory before writing it to disk, which works well for small to medium-sized files.
Streaming Large Files
For larger files, streaming is more memory-efficient as it doesn't load the entire file into memory:
public async Task DownloadLargeFileAsync(string url, string destinationPath)
{
using (HttpClient client = new HttpClient())
{
// Set timeout for large files
client.Timeout = TimeSpan.FromMinutes(10);
using (HttpResponseMessage response = await client.GetAsync(url, HttpCompletionOption.ResponseHeadersRead))
{
response.EnsureSuccessStatusCode();
using (Stream streamToReadFrom = await response.Content.ReadAsStreamAsync())
using (Stream streamToWriteTo = File.Open(destinationPath, FileMode.Create))
{
await streamToReadFrom.CopyToAsync(streamToWriteTo);
}
}
}
Console.WriteLine($"Large file downloaded to {destinationPath}");
}
The HttpCompletionOption.ResponseHeadersRead parameter makes GetAsync return as soon as the response headers have been read, before the body is downloaded, so you can start streaming the content to disk immediately.
Download with Progress Tracking
When downloading large files, tracking progress improves user experience:
public async Task DownloadFileWithProgressAsync(string url, string destinationPath,
IProgress<double> progress = null)
{
using (HttpClient client = new HttpClient())
{
using (HttpResponseMessage response = await client.GetAsync(url, HttpCompletionOption.ResponseHeadersRead))
{
response.EnsureSuccessStatusCode();
long? contentLength = response.Content.Headers.ContentLength;
using (Stream contentStream = await response.Content.ReadAsStreamAsync())
using (FileStream fileStream = new FileStream(destinationPath, FileMode.Create, FileAccess.Write, FileShare.None, 8192, true))
{
byte[] buffer = new byte[8192];
long totalBytesRead = 0;
int bytesRead;
while ((bytesRead = await contentStream.ReadAsync(buffer, 0, buffer.Length)) > 0)
{
await fileStream.WriteAsync(buffer, 0, bytesRead);
totalBytesRead += bytesRead;
if (contentLength.HasValue && progress != null)
{
double progressPercentage = (double)totalBytesRead / contentLength.Value * 100;
progress.Report(progressPercentage);
}
}
}
}
}
}
// Usage with progress reporting
var progressHandler = new Progress<double>(percentage =>
{
Console.WriteLine($"Download progress: {percentage:F2}%");
});
await DownloadFileWithProgressAsync(
"https://example.com/largefile.zip",
@"C:\Downloads\largefile.zip",
progressHandler
);
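If a long download may need to be aborted, for example when the user cancels or an overall scrape deadline passes, the same loop can accept a CancellationToken. The sketch below is a hedged variant of the method above, not a required signature: the cancellationToken parameter is an assumption, and it needs using System.Threading; for the token types.
public async Task DownloadFileWithProgressAsync(string url, string destinationPath,
    IProgress<double> progress = null, CancellationToken cancellationToken = default)
{
    using (HttpClient client = new HttpClient())
    using (HttpResponseMessage response = await client.GetAsync(
        url, HttpCompletionOption.ResponseHeadersRead, cancellationToken))
    {
        response.EnsureSuccessStatusCode();
        long? contentLength = response.Content.Headers.ContentLength;
        using (Stream contentStream = await response.Content.ReadAsStreamAsync())
        using (FileStream fileStream = new FileStream(destinationPath, FileMode.Create,
            FileAccess.Write, FileShare.None, 8192, true))
        {
            byte[] buffer = new byte[8192];
            long totalBytesRead = 0;
            int bytesRead;
            // Passing the token to each read/write lets the loop stop promptly when cancellation is requested
            while ((bytesRead = await contentStream.ReadAsync(buffer, 0, buffer.Length, cancellationToken)) > 0)
            {
                await fileStream.WriteAsync(buffer, 0, bytesRead, cancellationToken);
                totalBytesRead += bytesRead;
                if (contentLength.HasValue && progress != null)
                    progress.Report((double)totalBytesRead / contentLength.Value * 100);
            }
        }
    }
}
A token from new CancellationTokenSource(TimeSpan.FromMinutes(5)) would add a hard time limit on top of the progress reporting.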
Downloading Multiple Files Concurrently
When scraping multiple files, parallel downloads can significantly improve performance:
public async Task DownloadMultipleFilesAsync(Dictionary<string, string> urlToPathMap)
{
using (HttpClient client = new HttpClient())
{
var downloadTasks = urlToPathMap.Select(async kvp =>
{
try
{
byte[] fileBytes = await client.GetByteArrayAsync(kvp.Key);
await File.WriteAllBytesAsync(kvp.Value, fileBytes);
Console.WriteLine($"Downloaded: {kvp.Value}");
}
catch (Exception ex)
{
Console.WriteLine($"Failed to download {kvp.Key}: {ex.Message}");
}
});
await Task.WhenAll(downloadTasks);
}
}
// Usage
var files = new Dictionary<string, string>
{
{ "https://example.com/image1.jpg", @"C:\Downloads\image1.jpg" },
{ "https://example.com/image2.jpg", @"C:\Downloads\image2.jpg" },
{ "https://example.com/document.pdf", @"C:\Downloads\document.pdf" }
};
await DownloadMultipleFilesAsync(files);
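One caveat with the pattern above: Task.WhenAll starts every download at once, which can overwhelm the target server or exhaust your own connections when the dictionary is large. Below is a minimal sketch of capping concurrency with SemaphoreSlim and pausing briefly between requests; the maxConcurrency value and the 500 ms delay are arbitrary assumptions, and it needs using System.Threading; in addition to the namespaces already shown.
public async Task DownloadMultipleFilesThrottledAsync(
    Dictionary<string, string> urlToPathMap, int maxConcurrency = 3)
{
    using (var throttler = new SemaphoreSlim(maxConcurrency))
    using (var client = new HttpClient())
    {
        var downloadTasks = urlToPathMap.Select(async kvp =>
        {
            // Each task waits for a free slot before it starts downloading
            await throttler.WaitAsync();
            try
            {
                byte[] fileBytes = await client.GetByteArrayAsync(kvp.Key);
                await File.WriteAllBytesAsync(kvp.Value, fileBytes);
                Console.WriteLine($"Downloaded: {kvp.Value}");
                // Short pause before releasing the slot, to be polite to the server
                await Task.Delay(TimeSpan.FromMilliseconds(500));
            }
            catch (Exception ex)
            {
                Console.WriteLine($"Failed to download {kvp.Key}: {ex.Message}");
            }
            finally
            {
                throttler.Release();
            }
        });
        await Task.WhenAll(downloadTasks);
    }
}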
Using WebClient (Legacy Approach)
While WebClient is simpler, it's considered legacy (it is marked obsolete from .NET 6 onward) and HttpClient is preferred for new development:
using System.Net;
// Synchronous download
using (WebClient client = new WebClient())
{
client.DownloadFile("https://example.com/file.pdf", @"C:\Downloads\file.pdf");
}
// Asynchronous download
using (WebClient client = new WebClient())
{
client.DownloadFileCompleted += (sender, e) =>
{
if (e.Error == null)
Console.WriteLine("Download completed!");
else
Console.WriteLine($"Error: {e.Error.Message}");
};
client.DownloadProgressChanged += (sender, e) =>
{
Console.WriteLine($"{e.ProgressPercentage}% - {e.BytesReceived}/{e.TotalBytesToReceive} bytes");
};
await client.DownloadFileTaskAsync("https://example.com/file.pdf", @"C:\Downloads\file.pdf");
}
Handling Authentication and Headers
Many websites require authentication or specific headers for file downloads:
public async Task DownloadProtectedFileAsync(string url, string destinationPath,
string bearerToken = null)
{
using (HttpClient client = new HttpClient())
{
// Add authorization header
if (!string.IsNullOrEmpty(bearerToken))
{
client.DefaultRequestHeaders.Authorization =
new System.Net.Http.Headers.AuthenticationHeaderValue("Bearer", bearerToken);
}
// Add custom headers
client.DefaultRequestHeaders.Add("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64)");
client.DefaultRequestHeaders.Add("Accept", "*/*");
byte[] fileBytes = await client.GetByteArrayAsync(url);
await File.WriteAllBytesAsync(destinationPath, fileBytes);
}
}
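Note that DefaultRequestHeaders mutates the client itself, which can leak credentials across sites if you share one HttpClient. When headers differ per download, attaching them to an HttpRequestMessage is a safer option. This is a sketch under the assumption that you reuse the static client field from the first example; the method and parameter names are illustrative.
public async Task DownloadWithPerRequestHeadersAsync(string url, string destinationPath,
    string bearerToken = null)
{
    // Assumes a shared client, e.g.: private static readonly HttpClient client = new HttpClient();
    using (var request = new HttpRequestMessage(HttpMethod.Get, url))
    {
        // These headers apply only to this request, not to the shared client
        request.Headers.TryAddWithoutValidation("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64)");
        request.Headers.TryAddWithoutValidation("Accept", "*/*");
        if (!string.IsNullOrEmpty(bearerToken))
        {
            request.Headers.Authorization =
                new System.Net.Http.Headers.AuthenticationHeaderValue("Bearer", bearerToken);
        }
        using (HttpResponseMessage response = await client.SendAsync(request))
        {
            response.EnsureSuccessStatusCode();
            byte[] fileBytes = await response.Content.ReadAsByteArrayAsync();
            await File.WriteAllBytesAsync(destinationPath, fileBytes);
        }
    }
}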
Extracting Filename from Response Headers
Sometimes the filename isn't in the URL but in the response headers:
public async Task<string> DownloadAndGetFilenameAsync(string url, string downloadDirectory)
{
using (HttpClient client = new HttpClient())
{
using (HttpResponseMessage response = await client.GetAsync(url))
{
response.EnsureSuccessStatusCode();
// Try to get filename from Content-Disposition header
string filename = "downloaded_file";
if (response.Content.Headers.ContentDisposition?.FileName != null)
{
filename = response.Content.Headers.ContentDisposition.FileName.Trim('"');
}
else
{
// Fallback: extract from URL
filename = Path.GetFileName(new Uri(url).LocalPath);
}
string destinationPath = Path.Combine(downloadDirectory, filename);
byte[] fileBytes = await response.Content.ReadAsByteArrayAsync();
await File.WriteAllBytesAsync(destinationPath, fileBytes);
return destinationPath;
}
}
}
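Filenames taken from Content-Disposition or from the URL can contain characters that are invalid in paths, or traversal segments like "../", so it's worth sanitizing them before Path.Combine. A small helper sketch (the underscore replacement and fallback name are arbitrary choices); you could call it on filename just before building destinationPath in the method above.
private static string SanitizeFilename(string filename, string fallback = "downloaded_file")
{
    if (string.IsNullOrWhiteSpace(filename))
        return fallback;
    // Keep only the last path segment to defuse "../"-style values
    filename = Path.GetFileName(filename);
    // Replace any characters the file system rejects
    foreach (char invalid in Path.GetInvalidFileNameChars())
        filename = filename.Replace(invalid, '_');
    return string.IsNullOrWhiteSpace(filename) ? fallback : filename;
}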
Implementing Retry Logic
Network operations can fail, so implementing retry logic is essential for robust web scraping:
public async Task<bool> DownloadFileWithRetryAsync(string url, string destinationPath,
int maxRetries = 3)
{
int retryCount = 0;
while (retryCount < maxRetries)
{
try
{
using (HttpClient client = new HttpClient())
{
client.Timeout = TimeSpan.FromSeconds(30);
byte[] fileBytes = await client.GetByteArrayAsync(url);
await File.WriteAllBytesAsync(destinationPath, fileBytes);
Console.WriteLine($"File downloaded successfully on attempt {retryCount + 1}");
return true;
}
}
catch (Exception ex)
{
retryCount++;
Console.WriteLine($"Attempt {retryCount} failed: {ex.Message}");
if (retryCount >= maxRetries)
{
Console.WriteLine($"Failed to download after {maxRetries} attempts");
return false;
}
// Wait before retrying (exponential backoff)
await Task.Delay(TimeSpan.FromSeconds(Math.Pow(2, retryCount)));
}
}
return false;
}
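If several download paths need the same resilience, the retry-with-backoff loop can be factored into a reusable helper instead of being copied around each call site. A minimal sketch: the helper name and delegate shape are just one way to structure it, and client, url, and destinationPath in the usage lines stand in for values from your own code.
public async Task<T> WithRetryAsync<T>(Func<Task<T>> action, int maxRetries = 3)
{
    for (int attempt = 1; ; attempt++)
    {
        try
        {
            return await action();
        }
        catch (Exception ex) when (attempt < maxRetries)
        {
            Console.WriteLine($"Attempt {attempt} failed: {ex.Message}");
            // Exponential backoff before the next attempt; the final failure propagates to the caller
            await Task.Delay(TimeSpan.FromSeconds(Math.Pow(2, attempt)));
        }
    }
}
// Usage: retry only the network call, then write the file once it succeeds
byte[] fileBytes = await WithRetryAsync(() => client.GetByteArrayAsync(url));
await File.WriteAllBytesAsync(destinationPath, fileBytes);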
Complete Web Scraping Example
Here's a practical example that combines web scraping with file downloading:
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack;
public class ImageScraper
{
private static readonly HttpClient httpClient = new HttpClient();
public async Task ScrapeAndDownloadImagesAsync(string pageUrl, string downloadDirectory)
{
// Create download directory if it doesn't exist
Directory.CreateDirectory(downloadDirectory);
// Fetch the webpage
string html = await httpClient.GetStringAsync(pageUrl);
// Parse HTML
var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);
// Extract all image URLs
var imageNodes = htmlDoc.DocumentNode.SelectNodes("//img[@src]");
if (imageNodes == null)
{
Console.WriteLine("No images found on the page");
return;
}
var imageUrls = imageNodes
.Select(node => node.GetAttributeValue("src", ""))
.Where(src => !string.IsNullOrEmpty(src))
.Select(src => ConvertToAbsoluteUrl(src, pageUrl))
.ToList();
Console.WriteLine($"Found {imageUrls.Count} images to download");
// Download all images
var downloadTasks = imageUrls.Select(async (url, index) =>
{
try
{
string filename = $"image_{index + 1}{Path.GetExtension(url)}";
string destinationPath = Path.Combine(downloadDirectory, filename);
byte[] imageBytes = await httpClient.GetByteArrayAsync(url);
await File.WriteAllBytesAsync(destinationPath, imageBytes);
Console.WriteLine($"Downloaded: {filename}");
}
catch (Exception ex)
{
Console.WriteLine($"Failed to download {url}: {ex.Message}");
}
});
await Task.WhenAll(downloadTasks);
Console.WriteLine("All downloads completed");
}
private string ConvertToAbsoluteUrl(string url, string baseUrl)
{
if (Uri.IsWellFormedUriString(url, UriKind.Absolute))
return url;
var baseUri = new Uri(baseUrl);
return new Uri(baseUri, url).ToString();
}
}
// Usage
var scraper = new ImageScraper();
await scraper.ScrapeAndDownloadImagesAsync(
"https://example.com/gallery",
@"C:\Downloads\images"
);
Best Practices for File Downloads in Web Scraping
- Use HttpClient correctly: Create a single static instance and reuse it throughout your application to avoid socket exhaustion
- Implement proper error handling: Network operations can fail for many reasons, so wrap downloads in try-catch blocks and implement exception handling in your C# web scraping applications
- Set appropriate timeouts: Configure HttpClient.Timeout so a stalled connection doesn't hang your scraper indefinitely
- Stream large files: For files larger than a few megabytes, use streaming to avoid memory issues
- Respect robots.txt: Always check if file downloads are allowed
- Add delays between downloads: Limit concurrency and pause between requests so you don't overwhelm the server with simultaneous connections
- Validate file types: Check the Content-Type header to make sure you're saving the kind of file you expect (see the sketch after this list)
- Use async/await: Asynchronous downloads keep your scraper responsive and let you overlap network waits instead of blocking on each file
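For the file-type check mentioned above, here is a minimal sketch of validating the Content-Type header before saving; the application/pdf value is just an example, and client is assumed to be the shared static HttpClient shown earlier.
public async Task<bool> DownloadIfContentTypeMatchesAsync(string url, string destinationPath,
    string expectedContentType = "application/pdf")
{
    using (HttpResponseMessage response = await client.GetAsync(url, HttpCompletionOption.ResponseHeadersRead))
    {
        response.EnsureSuccessStatusCode();
        string contentType = response.Content.Headers.ContentType?.MediaType;
        if (!string.Equals(contentType, expectedContentType, StringComparison.OrdinalIgnoreCase))
        {
            // Skip the body entirely when the server reports an unexpected type
            Console.WriteLine($"Skipping {url}: unexpected Content-Type '{contentType}'");
            return false;
        }
        using (Stream contentStream = await response.Content.ReadAsStreamAsync())
        using (Stream fileStream = File.Open(destinationPath, FileMode.Create))
        {
            await contentStream.CopyToAsync(fileStream);
        }
        return true;
    }
}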
Conclusion
C# provides robust options for downloading files during web scraping operations. HttpClient with async/await is the modern, recommended approach that offers excellent performance and flexibility. For production applications, implement progress tracking, retry logic, and proper error handling to create resilient file download functionality. Whether you're downloading a single file or orchestrating multiple concurrent downloads, C# has the tools you need for efficient web scraping workflows.