What training data do I need to fine-tune an LLM for web scraping?
Fine-tuning a large language model (LLM) for web scraping requires high-quality training data consisting of HTML-to-structured-data pairs. The training dataset should include diverse examples of web content along with the correctly extracted information in your desired output format. Typically, you'll need 50-1000+ examples depending on the complexity of your extraction task and the model you're using.
Essential Components of Training Data
1. Input-Output Pairs
The foundation of any LLM fine-tuning dataset is paired examples showing what goes in and what should come out.
Required Elements:
- HTML Input: Raw or cleaned HTML snippets from web pages
- Structured Output: JSON, CSV, or other structured formats with extracted data
- System Prompts: Instructions that define the extraction task
- Consistency: Uniform formatting across all examples
Example Training Pair:
{
"messages": [
{
"role": "system",
"content": "You are a web scraping assistant that extracts product information from e-commerce HTML pages. Always return valid JSON with fields: name, price, rating, availability."
},
{
"role": "user",
"content": "<div class='product-card'>\n <h2 class='product-title'>Wireless Bluetooth Headphones</h2>\n <span class='price-tag'>$89.99</span>\n <div class='rating'>4.7 out of 5 stars</div>\n <span class='stock'>In Stock</span>\n</div>"
},
{
"role": "assistant",
"content": "{\"name\": \"Wireless Bluetooth Headphones\", \"price\": 89.99, \"rating\": 4.7, \"availability\": \"in_stock\"}"
}
]
}
2. Data Quantity Requirements
The amount of training data you need varies based on several factors:
Minimum Dataset Sizes:
- Simple extraction tasks: 50-100 examples (single website, consistent structure)
- Medium complexity: 200-500 examples (multiple page types, varied structures)
- Complex tasks: 500-1000+ examples (diverse websites, challenging edge cases)
- Production-grade models: 1000-10,000+ examples (robust, generalized extraction)
Quality vs. Quantity:
# Bad: Large dataset with inconsistent formatting
bad_dataset = [
{"input": "<div>Product A</div>", "output": "Product A"}, # String output
{"input": "<div>Product B</div>", "output": {"name": "Product B"}}, # Object
{"input": "<div>Product C</div>", "output": ["Product C", "$10"]}, # Array
]
# Good: Smaller dataset with consistent formatting
good_dataset = [
{
"messages": [
{"role": "system", "content": "Extract product info as JSON"},
{"role": "user", "content": "<div>Product A - $20</div>"},
{"role": "assistant", "content": '{"name": "Product A", "price": 20.00}'}
]
},
{
"messages": [
{"role": "system", "content": "Extract product info as JSON"},
{"role": "user", "content": "<span>Product B - $30</span>"},
{"role": "assistant", "content": '{"name": "Product B", "price": 30.00}'}
]
},
]
3. Edge Cases and Variations
Your training data must include examples that represent real-world scenarios and edge cases.
Critical Edge Cases to Include:
edge_case_examples = [
# Missing data
{
"html": "<div class='product'><h2>Item Name</h2></div>", # No price
"output": {"name": "Item Name", "price": null}
},
# Malformed HTML
{
"html": "<div class='product'><h2>Item<span>$19.99</h2></span></div>",
"output": {"name": "Item", "price": 19.99}
},
# Multiple potential values (need the correct one)
{
"html": "<div><span class='old-price'>$50</span><span class='current-price'>$39.99</span></div>",
"output": {"price": 39.99, "original_price": 50.00}
},
# Special characters and encoding
{
"html": "<h1>Café Français - 100€</h1>",
"output": {"name": "Café Français", "price": 100.00, "currency": "EUR"}
},
# Out of stock variations
{
"html": "<div class='product'><h2>Item</h2><span class='unavailable'>Currently Unavailable</span></div>",
"output": {"name": "Item", "availability": "out_of_stock"}
},
# Nested structures
{
"html": "<article><header><div><h1>Title</h1></div></header><p>Description</p></article>",
"output": {"title": "Title", "description": "Description"}
},
]
Collecting Training Data
Method 1: Manual Annotation
The most accurate but time-intensive approach.
from bs4 import BeautifulSoup
import json
def create_training_example_manually(url, html_content, manual_extraction):
"""
Create a training example from manually extracted data
"""
# Clean the HTML (optional)
soup = BeautifulSoup(html_content, 'html.parser')
# Remove unwanted elements
for element in soup(['script', 'style', 'nav', 'footer']):
element.decompose()
cleaned_html = str(soup)
# Create the training format
training_example = {
"messages": [
{
"role": "system",
"content": "Extract product information from HTML and return as JSON."
},
{
"role": "user",
"content": cleaned_html[:4000] # Limit length
},
{
"role": "assistant",
"content": json.dumps(manual_extraction, ensure_ascii=False)
}
],
"metadata": {
"source_url": url,
"created_at": "2025-01-15"
}
}
return training_example
# Example usage
html = """
<div class="product">
<h1>Ergonomic Office Chair</h1>
<div class="price">$299.99</div>
<span class="rating">4.5 stars (234 reviews)</span>
</div>
"""
manual_data = {
"name": "Ergonomic Office Chair",
"price": 299.99,
"rating": 4.5,
"review_count": 234
}
example = create_training_example_manually(
"https://example.com/product/chair",
html,
manual_data
)
# Save to JSONL file
with open('training_data.jsonl', 'a') as f:
f.write(json.dumps(example) + '\n')
Method 2: Semi-Automated Collection
Use existing scrapers to generate training data faster.
import requests
from bs4 import BeautifulSoup
import json
def collect_training_data_from_scraper(urls, css_selectors):
"""
Generate training data using CSS selectors (which you'll teach the LLM to replicate)
"""
training_data = []
for url in urls:
# Fetch the page
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
# Extract using CSS selectors
extracted_data = {}
for field_name, selector in css_selectors.items():
element = soup.select_one(selector)
if element:
extracted_data[field_name] = element.get_text(strip=True)
# Create training example
training_example = {
"messages": [
{
"role": "system",
"content": "Extract article data from HTML and return as JSON with fields: title, author, date, content_preview."
},
{
"role": "user",
"content": str(soup)[:6000]
},
{
"role": "assistant",
"content": json.dumps(extracted_data, ensure_ascii=False)
}
]
}
training_data.append(training_example)
return training_data
# Example usage
article_urls = [
"https://example.com/article-1",
"https://example.com/article-2",
"https://example.com/article-3",
]
selectors = {
"title": "h1.article-title",
"author": "span.author-name",
"date": "time.publish-date",
"content_preview": "div.article-summary"
}
training_examples = collect_training_data_from_scraper(article_urls, selectors)
# Save to file
with open('article_training_data.jsonl', 'w') as f:
for example in training_examples:
f.write(json.dumps(example) + '\n')
Method 3: Using Existing Datasets
Bootstrap with publicly available datasets or synthetic data.
import json
import random
from faker import Faker
fake = Faker()
def generate_synthetic_training_data(num_examples=100):
"""
Generate synthetic product data for training
"""
templates = [
'<div class="product"><h2>{name}</h2><span class="price">${price}</span><p class="rating">{rating} stars</p></div>',
'<article class="item"><h1>{name}</h1><div class="cost">${price}</div><span>{rating}/5</span></article>',
'<section><header>{name}</header><p>Price: ${price}</p><div>Rating: {rating}</div></section>',
]
training_data = []
for _ in range(num_examples):
# Generate fake product data
product_name = fake.catch_phrase()
price = round(random.uniform(10, 500), 2)
rating = round(random.uniform(3.0, 5.0), 1)
# Choose random template
template = random.choice(templates)
html = template.format(name=product_name, price=price, rating=rating)
# Create training example
example = {
"messages": [
{
"role": "system",
"content": "Extract product information and return JSON with name, price, and rating."
},
{
"role": "user",
"content": html
},
{
"role": "assistant",
"content": json.dumps({
"name": product_name,
"price": price,
"rating": rating
})
}
]
}
training_data.append(example)
return training_data
# Generate data
synthetic_data = generate_synthetic_training_data(200)
# Save
with open('synthetic_training.jsonl', 'w') as f:
for item in synthetic_data:
f.write(json.dumps(item) + '\n')
Data Formatting Standards
OpenAI Fine-Tuning Format
{
"messages": [
{"role": "system", "content": "System instruction"},
{"role": "user", "content": "HTML input"},
{"role": "assistant", "content": "JSON output"}
]
}
Hugging Face Format
{
"instruction": "Extract product data from the following HTML",
"input": "<div class='product'>...</div>",
"output": "{\"name\": \"Product\", \"price\": 99.99}"
}
Converting Between Formats
def convert_openai_to_huggingface(openai_format):
"""
Convert OpenAI format to Hugging Face instruction format
"""
messages = openai_format['messages']
return {
"instruction": messages[0]['content'], # system message
"input": messages[1]['content'], # user message
"output": messages[2]['content'] # assistant message
}
def convert_huggingface_to_openai(hf_format):
"""
Convert Hugging Face format to OpenAI format
"""
return {
"messages": [
{"role": "system", "content": hf_format['instruction']},
{"role": "user", "content": hf_format['input']},
{"role": "assistant", "content": hf_format['output']}
]
}
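As a usage sketch (file names here are illustrative), the converters above can be applied line by line to a JSONL dataset:
import json
# Convert an OpenAI-format JSONL file into Hugging Face instruction format.
# 'training_data.jsonl' and 'training_data_hf.jsonl' are placeholder paths.
with open('training_data.jsonl', 'r') as src, open('training_data_hf.jsonl', 'w') as dst:
    for line in src:
        openai_example = json.loads(line)
        hf_example = convert_openai_to_huggingface(openai_example)
        dst.write(json.dumps(hf_example, ensure_ascii=False) + '\n')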
Data Quality Guidelines
1. Validation and Cleaning
import json
from jsonschema import validate, ValidationError
# Define expected output schema
output_schema = {
"type": "object",
"properties": {
"name": {"type": "string"},
"price": {"type": "number"},
"rating": {"type": ["number", "null"]}
},
"required": ["name", "price"]
}
def validate_training_example(example):
"""
Validate a training example for quality
"""
errors = []
# Check structure
if 'messages' not in example:
errors.append("Missing 'messages' field")
return errors
messages = example['messages']
# Check message count
if len(messages) != 3:
errors.append(f"Expected 3 messages, got {len(messages)}")
# Check roles
expected_roles = ['system', 'user', 'assistant']
for i, msg in enumerate(messages):
if msg.get('role') != expected_roles[i]:
errors.append(f"Message {i} has wrong role: {msg.get('role')}")
# Validate JSON output
try:
assistant_output = json.loads(messages[2]['content'])
validate(instance=assistant_output, schema=output_schema)
except json.JSONDecodeError:
errors.append("Assistant message is not valid JSON")
except ValidationError as e:
errors.append(f"Output doesn't match schema: {e.message}")
# Check for empty content
for msg in messages:
if not msg.get('content', '').strip():
errors.append(f"Empty content in {msg.get('role')} message")
return errors
# Validate entire dataset
def validate_dataset(jsonl_file):
"""
Validate all examples in a training dataset
"""
valid_count = 0
total_count = 0
with open(jsonl_file, 'r') as f:
for line_num, line in enumerate(f, 1):
total_count += 1
try:
example = json.loads(line)
errors = validate_training_example(example)
if errors:
print(f"Line {line_num} errors: {errors}")
else:
valid_count += 1
except json.JSONDecodeError:
print(f"Line {line_num}: Invalid JSON")
print(f"\nValidation complete: {valid_count}/{total_count} valid examples")
return valid_count / total_count if total_count else 0.0
# Run validation
validate_dataset('training_data.jsonl')
2. Diversity Metrics
from collections import Counter
import json
import re
def analyze_dataset_diversity(jsonl_file):
"""
Analyze the diversity of your training dataset
"""
html_lengths = []
output_lengths = []
unique_structures = set()
field_counts = Counter()
with open(jsonl_file, 'r') as f:
for line in f:
example = json.loads(line)
messages = example['messages']
# Analyze HTML input
html_input = messages[1]['content']
html_lengths.append(len(html_input))
# Extract HTML tags to identify structure
tags = re.findall(r'<(\w+)', html_input)
structure = '-'.join(sorted(set(tags)))
unique_structures.add(structure)
# Analyze output
output = json.loads(messages[2]['content'])
output_lengths.append(len(messages[2]['content']))
# Count fields
for field in output.keys():
field_counts[field] += 1
print(f"Dataset Diversity Analysis:")
print(f" Total examples: {len(html_lengths)}")
print(f" Unique HTML structures: {len(unique_structures)}")
print(f" Avg HTML length: {sum(html_lengths) / len(html_lengths):.0f} chars")
print(f" Avg output length: {sum(output_lengths) / len(output_lengths):.0f} chars")
print(f"\nField frequency:")
for field, count in field_counts.most_common(10):
print(f" {field}: {count} times")
analyze_dataset_diversity('training_data.jsonl')
3. Train-Validation Split
import json
import random
def split_dataset(input_file, train_file, val_file, val_ratio=0.2):
"""
Split dataset into training and validation sets
"""
# Read all examples
with open(input_file, 'r') as f:
examples = [json.loads(line) for line in f]
# Shuffle
random.shuffle(examples)
# Split
split_idx = int(len(examples) * (1 - val_ratio))
train_examples = examples[:split_idx]
val_examples = examples[split_idx:]
# Write training set
with open(train_file, 'w') as f:
for example in train_examples:
f.write(json.dumps(example) + '\n')
# Write validation set
with open(val_file, 'w') as f:
for example in val_examples:
f.write(json.dumps(example) + '\n')
print(f"Split complete:")
print(f" Training: {len(train_examples)} examples")
print(f" Validation: {len(val_examples)} examples")
split_dataset('all_data.jsonl', 'train.jsonl', 'val.jsonl', val_ratio=0.2)
Advanced Training Data Strategies
Data Augmentation
Increase dataset size by creating variations of existing examples:
import random
import re
from bs4 import BeautifulSoup
def augment_html_example(html_input, output_data):
"""
Create variations of HTML while maintaining the same extraction
"""
soup = BeautifulSoup(html_input, 'html.parser')
augmented_examples = []
# Original example
augmented_examples.append((html_input, output_data))
# Add whitespace variations
html_with_spaces = re.sub(r'>\s*<', '>\n <', html_input)
augmented_examples.append((html_with_spaces, output_data))
# Add random class names (if extraction is class-agnostic)
random_classes = ['item', 'card', 'box', 'container', 'element']
modified_html = html_input
for tag in ['div', 'span', 'section']:
pattern = f'<{tag}>'
replacement = f'<{tag} class="{random.choice(random_classes)}">'
modified_html = modified_html.replace(pattern, replacement, 1)
augmented_examples.append((modified_html, output_data))
return augmented_examples
# Example
original_html = "<div><h2>Product Name</h2><span>$99</span></div>"
original_output = {"name": "Product Name", "price": 99}
variations = augment_html_example(original_html, original_output)
print(f"Created {len(variations)} variations from 1 example")
Active Learning
Identify which examples would most improve your model:
def identify_challenging_examples(unlabeled_html_samples, model, threshold=0.5):
"""
Use model uncertainty to find examples worth labeling
"""
challenging_examples = []
for html in unlabeled_html_samples:
# Get multiple predictions with temperature > 0
predictions = []
for _ in range(5):
response = model.predict(html, temperature=0.8)
predictions.append(response)
# Calculate agreement between predictions
unique_predictions = len(set(predictions))
agreement_rate = 1 - (unique_predictions / len(predictions))
# If model is uncertain (low agreement), this is a good training example
if agreement_rate < threshold:
challenging_examples.append({
'html': html,
'uncertainty_score': 1 - agreement_rate
})
# Return most uncertain examples
challenging_examples.sort(key=lambda x: x['uncertainty_score'], reverse=True)
return challenging_examples
# Use this to prioritize which pages to manually label
Integration with Web Scraping
When preparing training data, consider how it will be used in production:
def prepare_production_aligned_data(urls, extraction_function):
"""
Create training data that matches production scraping workflow
"""
training_data = []
for url in urls:
# Fetch exactly as you would in production
response = requests.get(
url,
headers={'User-Agent': 'Mozilla/5.0...'},
timeout=30
)
# Apply same preprocessing as production
soup = BeautifulSoup(response.content, 'html.parser')
# Remove elements you'd remove in production
for tag in soup(['script', 'style', 'nav', 'footer', 'header']):
tag.decompose()
cleaned_html = str(soup)
# Get ground truth extraction
extracted = extraction_function(soup)
# Create training example that mirrors production
example = {
"messages": [
{"role": "system", "content": "Extract structured data from preprocessed HTML."},
{"role": "user", "content": cleaned_html[:8000]},
{"role": "assistant", "content": json.dumps(extracted)}
]
}
training_data.append(example)
return training_data
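As a usage sketch, the helper above can be driven by the same selector logic your production scraper uses. The extract_product function and URLs below are hypothetical placeholders, and the snippet assumes the requests, BeautifulSoup, and json imports from the earlier examples:
# Hypothetical ground-truth extractor mirroring production CSS selectors
def extract_product(soup):
    name_el = soup.select_one('h1.product-title')
    price_el = soup.select_one('span.price')
    return {
        "name": name_el.get_text(strip=True) if name_el else None,
        "price": price_el.get_text(strip=True) if price_el else None,
    }

examples = prepare_production_aligned_data(
    ["https://example.com/product/1", "https://example.com/product/2"],
    extract_product,
)

with open('production_aligned.jsonl', 'w') as f:
    for example in examples:
        f.write(json.dumps(example) + '\n')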
Common Training Data Mistakes to Avoid
- Overfitting to specific HTML structure: Include variations
- Inconsistent output formatting: Standardize JSON schemas
- Missing edge cases: Test with incomplete/malformed data
- Too little data: Start with at least 100 quality examples
- No validation split: Always hold out 15-20% for testing
- Ignoring token limits: Keep examples within the model's context window (see the token-count sketch below)
- Static datasets: Update with new examples from production errors
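To enforce the token-limit point above, a rough per-example token count catches oversized examples before upload. The sketch below assumes the tiktoken library and its cl100k_base encoding; substitute the tokenizer that matches your target model:
import json
import tiktoken  # assumed dependency; swap in your model's own tokenizer if needed

encoding = tiktoken.get_encoding("cl100k_base")

def check_token_counts(jsonl_file, max_tokens=4096):
    """Flag training examples whose combined messages exceed a token budget."""
    with open(jsonl_file, 'r') as f:
        for line_num, line in enumerate(f, 1):
            example = json.loads(line)
            total_tokens = sum(
                len(encoding.encode(msg['content']))
                for msg in example['messages']
            )
            if total_tokens > max_tokens:
                print(f"Line {line_num}: {total_tokens} tokens (over {max_tokens})")

check_token_counts('training_data.jsonl')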
Conclusion
The quality and composition of your training data directly determines how well your fine-tuned LLM will perform at web scraping tasks. Focus on creating diverse, high-quality examples that represent real-world scenarios, including edge cases and variations. Start with 100-500 carefully curated examples and expand based on validation performance.
Remember to maintain consistent formatting, validate all examples programmatically, and align your training data with your production scraping workflow. For developers who want to skip the complexity of fine-tuning LLMs for web scraping, specialized AI web scraping tools provide pre-trained models and managed infrastructure that work out of the box.