How do I download files using C# during web scraping?

Downloading files is a common requirement when scraping websites, whether you need to save images, PDFs, documents, or other resources. C# provides several built-in methods for downloading files efficiently, ranging from simple synchronous downloads to advanced asynchronous operations with progress tracking.

Using HttpClient for File Downloads

The recommended approach for downloading files in modern C# is using HttpClient, which provides async support and better resource management:

using System;
using System.IO;
using System.Net.Http;
using System.Threading.Tasks;

public class FileDownloader
{
    private static readonly HttpClient client = new HttpClient();

    public async Task DownloadFileAsync(string url, string destinationPath)
    {
        try
        {
            // Download the file as a byte array
            byte[] fileBytes = await client.GetByteArrayAsync(url);

            // Write to disk
            await File.WriteAllBytesAsync(destinationPath, fileBytes);

            Console.WriteLine($"File downloaded successfully to {destinationPath}");
        }
        catch (HttpRequestException e)
        {
            Console.WriteLine($"Error downloading file: {e.Message}");
            throw;
        }
    }
}

// Usage
var downloader = new FileDownloader();
await downloader.DownloadFileAsync(
    "https://example.com/document.pdf",
    @"C:\Downloads\document.pdf"
);

This method loads the entire file into memory before writing it to disk, which works well for small to medium-sized files.

Streaming Large Files

For larger files, streaming is more memory-efficient as it doesn't load the entire file into memory:

public async Task DownloadLargeFileAsync(string url, string destinationPath)
{
    using (HttpClient client = new HttpClient())
    {
        // Set timeout for large files
        client.Timeout = TimeSpan.FromMinutes(10);

        using (HttpResponseMessage response = await client.GetAsync(url, HttpCompletionOption.ResponseHeadersRead))
        {
            response.EnsureSuccessStatusCode();

            using (Stream streamToReadFrom = await response.Content.ReadAsStreamAsync())
            using (Stream streamToWriteTo = File.Open(destinationPath, FileMode.Create))
            {
                await streamToReadFrom.CopyToAsync(streamToWriteTo);
            }
        }
    }

    Console.WriteLine($"Large file downloaded to {destinationPath}");
}

The HttpCompletionOption.ResponseHeadersRead option makes GetAsync return as soon as the response headers arrive, before the body is buffered, so you can stream the content straight to disk.

Download with Progress Tracking

When downloading large files, tracking progress improves user experience:

public async Task DownloadFileWithProgressAsync(string url, string destinationPath,
    IProgress<double> progress = null)
{
    using (HttpClient client = new HttpClient())
    {
        using (HttpResponseMessage response = await client.GetAsync(url, HttpCompletionOption.ResponseHeadersRead))
        {
            response.EnsureSuccessStatusCode();

            long? contentLength = response.Content.Headers.ContentLength;

            using (Stream contentStream = await response.Content.ReadAsStreamAsync())
            using (FileStream fileStream = new FileStream(destinationPath, FileMode.Create, FileAccess.Write, FileShare.None, 8192, true))
            {
                byte[] buffer = new byte[8192];
                long totalBytesRead = 0;
                int bytesRead;

                while ((bytesRead = await contentStream.ReadAsync(buffer, 0, buffer.Length)) > 0)
                {
                    await fileStream.WriteAsync(buffer, 0, bytesRead);
                    totalBytesRead += bytesRead;

                    if (contentLength.HasValue && progress != null)
                    {
                        double progressPercentage = (double)totalBytesRead / contentLength.Value * 100;
                        progress.Report(progressPercentage);
                    }
                }
            }
        }
    }
}

// Usage with progress reporting
var progressHandler = new Progress<double>(percentage =>
{
    Console.WriteLine($"Download progress: {percentage:F2}%");
});

await DownloadFileWithProgressAsync(
    "https://example.com/largefile.zip",
    @"C:\Downloads\largefile.zip",
    progressHandler
);
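
Progress tracking pairs naturally with cancellation, so a stalled or unwanted download can be aborted. Here is a minimal sketch of the same streaming pattern with a CancellationToken (from System.Threading) threaded through; the 30-second cutoff in the usage example is an arbitrary illustration:

public async Task DownloadWithCancellationAsync(string url, string destinationPath,
    CancellationToken cancellationToken)
{
    using (HttpClient client = new HttpClient())
    using (HttpResponseMessage response = await client.GetAsync(
        url, HttpCompletionOption.ResponseHeadersRead, cancellationToken))
    {
        response.EnsureSuccessStatusCode();

        using (Stream contentStream = await response.Content.ReadAsStreamAsync())
        using (FileStream fileStream = File.Create(destinationPath))
        {
            // CopyToAsync observes the token and throws OperationCanceledException on cancel
            await contentStream.CopyToAsync(fileStream, 81920, cancellationToken);
        }
    }
}

// Usage: give up if the download takes longer than 30 seconds
using (var cts = new CancellationTokenSource(TimeSpan.FromSeconds(30)))
{
    await DownloadWithCancellationAsync(
        "https://example.com/largefile.zip",
        @"C:\Downloads\largefile.zip",
        cts.Token
    );
}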

Downloading Multiple Files Concurrently

When scraping multiple files, parallel downloads can significantly improve performance:

public async Task DownloadMultipleFilesAsync(Dictionary<string, string> urlToPathMap)
{
    using (HttpClient client = new HttpClient())
    {
        var downloadTasks = urlToPathMap.Select(async kvp =>
        {
            try
            {
                byte[] fileBytes = await client.GetByteArrayAsync(kvp.Key);
                await File.WriteAllBytesAsync(kvp.Value, fileBytes);
                Console.WriteLine($"Downloaded: {kvp.Value}");
            }
            catch (Exception ex)
            {
                Console.WriteLine($"Failed to download {kvp.Key}: {ex.Message}");
            }
        });

        await Task.WhenAll(downloadTasks);
    }
}

// Usage
var files = new Dictionary<string, string>
{
    { "https://example.com/image1.jpg", @"C:\Downloads\image1.jpg" },
    { "https://example.com/image2.jpg", @"C:\Downloads\image2.jpg" },
    { "https://example.com/document.pdf", @"C:\Downloads\document.pdf" }
};

await DownloadMultipleFilesAsync(files);
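
Unbounded parallelism can overwhelm a server or trigger rate limiting. Below is a variation of the same method that caps concurrency with SemaphoreSlim (from System.Threading); the limit of 4 is an arbitrary example value:

public async Task DownloadMultipleFilesThrottledAsync(
    Dictionary<string, string> urlToPathMap, int maxConcurrency = 4)
{
    using (HttpClient client = new HttpClient())
    using (SemaphoreSlim semaphore = new SemaphoreSlim(maxConcurrency))
    {
        var downloadTasks = urlToPathMap.Select(async kvp =>
        {
            // Wait for a free slot before starting this download
            await semaphore.WaitAsync();
            try
            {
                byte[] fileBytes = await client.GetByteArrayAsync(kvp.Key);
                await File.WriteAllBytesAsync(kvp.Value, fileBytes);
                Console.WriteLine($"Downloaded: {kvp.Value}");
            }
            catch (Exception ex)
            {
                Console.WriteLine($"Failed to download {kvp.Key}: {ex.Message}");
            }
            finally
            {
                // Release the slot so the next queued download can start
                semaphore.Release();
            }
        });

        await Task.WhenAll(downloadTasks);
    }
}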

Using WebClient (Legacy Approach)

WebClient is simpler but has been marked obsolete since .NET 6; prefer HttpClient for new development:

using System.Net;

// Synchronous download
using (WebClient client = new WebClient())
{
    client.DownloadFile("https://example.com/file.pdf", @"C:\Downloads\file.pdf");
}

// Asynchronous download
using (WebClient client = new WebClient())
{
    client.DownloadFileCompleted += (sender, e) =>
    {
        if (e.Error == null)
            Console.WriteLine("Download completed!");
        else
            Console.WriteLine($"Error: {e.Error.Message}");
    };

    client.DownloadProgressChanged += (sender, e) =>
    {
        Console.WriteLine($"{e.ProgressPercentage}% - {e.BytesReceived}/{e.TotalBytesToReceive} bytes");
    };

    await client.DownloadFileTaskAsync("https://example.com/file.pdf", @"C:\Downloads\file.pdf");
}

Handling Authentication and Headers

Many websites require authentication or specific headers for file downloads:

public async Task DownloadProtectedFileAsync(string url, string destinationPath,
    string bearerToken = null)
{
    using (HttpClient client = new HttpClient())
    {
        // Add authorization header
        if (!string.IsNullOrEmpty(bearerToken))
        {
            client.DefaultRequestHeaders.Authorization =
                new System.Net.Http.Headers.AuthenticationHeaderValue("Bearer", bearerToken);
        }

        // Add custom headers
        client.DefaultRequestHeaders.Add("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64)");
        client.DefaultRequestHeaders.Add("Accept", "*/*");

        byte[] fileBytes = await client.GetByteArrayAsync(url);
        await File.WriteAllBytesAsync(destinationPath, fileBytes);
    }
}
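
Some sites use session cookies rather than bearer tokens. Here is a minimal sketch using HttpClientHandler with a CookieContainer; the cookie name and value are placeholders you would capture from a real login response:

using System.Net;

public async Task DownloadWithCookiesAsync(string url, string destinationPath)
{
    var cookies = new CookieContainer();
    // Placeholder session cookie; obtain the real value from your login flow
    cookies.Add(new Uri(url), new Cookie("session_id", "YOUR_SESSION_VALUE"));

    using (var handler = new HttpClientHandler { CookieContainer = cookies })
    using (var client = new HttpClient(handler))
    {
        byte[] fileBytes = await client.GetByteArrayAsync(url);
        await File.WriteAllBytesAsync(destinationPath, fileBytes);
    }
}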

Extracting Filename from Response Headers

Sometimes the filename isn't in the URL but in the response headers:

public async Task<string> DownloadAndGetFilenameAsync(string url, string downloadDirectory)
{
    using (HttpClient client = new HttpClient())
    {
        using (HttpResponseMessage response = await client.GetAsync(url))
        {
            response.EnsureSuccessStatusCode();

            // Try to get filename from Content-Disposition header
            string filename = "downloaded_file";
            if (response.Content.Headers.ContentDisposition?.FileName != null)
            {
                filename = response.Content.Headers.ContentDisposition.FileName.Trim('"');
            }
            else
            {
                // Fallback: extract from URL, keeping the default if the path has no file name
                string urlFilename = Path.GetFileName(new Uri(url).LocalPath);
                if (!string.IsNullOrEmpty(urlFilename))
                {
                    filename = urlFilename;
                }
            }

            string destinationPath = Path.Combine(downloadDirectory, filename);

            byte[] fileBytes = await response.Content.ReadAsByteArrayAsync();
            await File.WriteAllBytesAsync(destinationPath, fileBytes);

            return destinationPath;
        }
    }
}
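
A filename taken from a header or URL may contain characters that are invalid on disk, or even path separators that could escape the download directory. Here is a small helper sketch you could apply to the candidate filename before calling Path.Combine:

private static string SanitizeFilename(string candidate, string fallback = "downloaded_file")
{
    // Strip any directory components embedded in the name
    candidate = Path.GetFileName(candidate);

    // Replace characters the filesystem won't accept
    foreach (char invalid in Path.GetInvalidFileNameChars())
    {
        candidate = candidate.Replace(invalid, '_');
    }

    return string.IsNullOrWhiteSpace(candidate) ? fallback : candidate;
}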

Implementing Retry Logic

Network operations can fail, so implementing retry logic is essential for robust web scraping:

public async Task<bool> DownloadFileWithRetryAsync(string url, string destinationPath,
    int maxRetries = 3)
{
    int retryCount = 0;

    // Reuse one client across attempts instead of creating a new one per retry
    using (HttpClient client = new HttpClient())
    {
        client.Timeout = TimeSpan.FromSeconds(30);

        while (retryCount < maxRetries)
        {
            try
            {
                byte[] fileBytes = await client.GetByteArrayAsync(url);
                await File.WriteAllBytesAsync(destinationPath, fileBytes);

                Console.WriteLine($"File downloaded successfully on attempt {retryCount + 1}");
                return true;
            }
            catch (Exception ex)
            {
                retryCount++;
                Console.WriteLine($"Attempt {retryCount} failed: {ex.Message}");

                if (retryCount >= maxRetries)
                {
                    Console.WriteLine($"Failed to download after {maxRetries} attempts");
                    return false;
                }

                // Wait before retrying (exponential backoff: 2, 4, 8... seconds)
                await Task.Delay(TimeSpan.FromSeconds(Math.Pow(2, retryCount)));
            }
        }
    }

    return false;
}
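
If your project already references the Polly resilience library, the same retry-with-backoff behavior can be expressed declaratively. This is a sketch assuming the Polly NuGet package is installed and a shared HttpClient is passed in:

using Polly;

public async Task DownloadWithPollyAsync(HttpClient client, string url, string destinationPath)
{
    // Retry up to 3 times on HTTP failures, waiting 2, 4, then 8 seconds
    var retryPolicy = Policy
        .Handle<HttpRequestException>()
        .WaitAndRetryAsync(3, attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt)));

    await retryPolicy.ExecuteAsync(async () =>
    {
        byte[] fileBytes = await client.GetByteArrayAsync(url);
        await File.WriteAllBytesAsync(destinationPath, fileBytes);
    });
}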

Complete Web Scraping Example

Here's a practical example that combines web scraping with file downloading:

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack;

public class ImageScraper
{
    private static readonly HttpClient httpClient = new HttpClient();

    public async Task ScrapeAndDownloadImagesAsync(string pageUrl, string downloadDirectory)
    {
        // Create download directory if it doesn't exist
        Directory.CreateDirectory(downloadDirectory);

        // Fetch the webpage
        string html = await httpClient.GetStringAsync(pageUrl);

        // Parse HTML
        var htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(html);

        // Extract all image URLs
        var imageNodes = htmlDoc.DocumentNode.SelectNodes("//img[@src]");

        if (imageNodes == null)
        {
            Console.WriteLine("No images found on the page");
            return;
        }

        var imageUrls = imageNodes
            .Select(node => node.GetAttributeValue("src", ""))
            .Where(src => !string.IsNullOrEmpty(src))
            .Select(src => ConvertToAbsoluteUrl(src, pageUrl))
            .ToList();

        Console.WriteLine($"Found {imageUrls.Count} images to download");

        // Download all images
        var downloadTasks = imageUrls.Select(async (url, index) =>
        {
            try
            {
                string filename = $"image_{index + 1}{Path.GetExtension(url)}";
                string destinationPath = Path.Combine(downloadDirectory, filename);

                byte[] imageBytes = await httpClient.GetByteArrayAsync(url);
                await File.WriteAllBytesAsync(destinationPath, imageBytes);

                Console.WriteLine($"Downloaded: {filename}");
            }
            catch (Exception ex)
            {
                Console.WriteLine($"Failed to download {url}: {ex.Message}");
            }
        });

        await Task.WhenAll(downloadTasks);
        Console.WriteLine("All downloads completed");
    }

    private string ConvertToAbsoluteUrl(string url, string baseUrl)
    {
        if (Uri.IsWellFormedUriString(url, UriKind.Absolute))
            return url;

        var baseUri = new Uri(baseUrl);
        return new Uri(baseUri, url).ToString();
    }
}

// Usage
var scraper = new ImageScraper();
await scraper.ScrapeAndDownloadImagesAsync(
    "https://example.com/gallery",
    @"C:\Downloads\images"
);

Best Practices for File Downloads in Web Scraping

  1. Reuse HttpClient: Create a single static instance and share it across your application to avoid socket exhaustion (several examples above create short-lived clients for brevity; see the sketch after this list)
  2. Implement proper error handling: Network operations fail for many reasons, so wrap downloads in try-catch blocks and decide per-file whether to retry, skip, or abort
  3. Set appropriate timeouts: Configure HttpClient.Timeout so a stalled download cannot hang your scraper indefinitely
  4. Stream large files: For files larger than a few megabytes, use streaming to avoid loading everything into memory
  5. Respect robots.txt: Always check whether file downloads are allowed before scraping
  6. Add delays between downloads: Throttle concurrency and pause between requests to avoid overwhelming servers
  7. Validate file types: Check the Content-Type header to confirm you're downloading the expected kind of file
  8. Use async/await: Asynchronous downloads keep your application responsive and let you overlap network waits
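
Here is a minimal sketch pulling several of these practices together: one shared client with a timeout, a Content-Type check before saving, and a pause between downloads. The expected-type prefix and the one-second delay are illustrative assumptions, not fixed rules:

public static class PoliteDownloader
{
    // One shared client for the whole application (practice 1), with a timeout (practice 3)
    private static readonly HttpClient client = new HttpClient
    {
        Timeout = TimeSpan.FromSeconds(30)
    };

    public static async Task<bool> DownloadIfExpectedTypeAsync(
        string url, string destinationPath, string expectedTypePrefix = "image/")
    {
        using (HttpResponseMessage response = await client.GetAsync(url, HttpCompletionOption.ResponseHeadersRead))
        {
            response.EnsureSuccessStatusCode();

            // Validate the Content-Type header before reading the body (practice 7)
            string contentType = response.Content.Headers.ContentType?.MediaType ?? "";
            if (!contentType.StartsWith(expectedTypePrefix, StringComparison.OrdinalIgnoreCase))
            {
                Console.WriteLine($"Skipping {url}: unexpected Content-Type '{contentType}'");
                return false;
            }

            // Stream to disk rather than buffering in memory (practice 4)
            using (Stream contentStream = await response.Content.ReadAsStreamAsync())
            using (FileStream fileStream = File.Create(destinationPath))
            {
                await contentStream.CopyToAsync(fileStream);
            }
        }

        // Pause briefly between downloads to be polite to the server (practice 6);
        // the one-second delay is an example value, not a recommendation
        await Task.Delay(TimeSpan.FromSeconds(1));
        return true;
    }
}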

Conclusion

C# provides robust options for downloading files during web scraping operations. HttpClient with async/await is the modern, recommended approach that offers excellent performance and flexibility. For production applications, implement progress tracking, retry logic, and proper error handling to create resilient file download functionality. Whether you're downloading a single file or orchestrating multiple concurrent downloads, C# has the tools you need for efficient web scraping workflows.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
