What training data do I need to fine-tune an LLM for web scraping?
Fine-tuning a large language model (LLM) for web scraping requires high-quality training data consisting of HTML-to-structured-data pairs. The training dataset should include diverse examples of web content along with the correctly extracted information in your desired output format. Typically, you'll need 50-1000+ examples depending on the complexity of your extraction task and the model you're using.
Essential Components of Training Data
1. Input-Output Pairs
The foundation of any LLM fine-tuning dataset is paired examples showing what goes in and what should come out.
Required Elements:
- HTML Input: Raw or cleaned HTML snippets from web pages
- Structured Output: JSON, CSV, or other structured formats with extracted data
- System Prompts: Instructions that define the extraction task
- Consistency: Uniform formatting across all examples
Example Training Pair:
{
"messages": [
{
"role": "system",
"content": "You are a web scraping assistant that extracts product information from e-commerce HTML pages. Always return valid JSON with fields: name, price, rating, availability."
},
{
"role": "user",
"content": "<div class='product-card'>\n <h2 class='product-title'>Wireless Bluetooth Headphones</h2>\n <span class='price-tag'>$89.99</span>\n <div class='rating'>4.7 out of 5 stars</div>\n <span class='stock'>In Stock</span>\n</div>"
},
{
"role": "assistant",
"content": "{\"name\": \"Wireless Bluetooth Headphones\", \"price\": 89.99, \"rating\": 4.7, \"availability\": \"in_stock\"}"
}
]
}
2. Data Quantity Requirements
The amount of training data you need varies based on several factors:
Minimum Dataset Sizes:
- Simple extraction tasks: 50-100 examples (single website, consistent structure)
- Medium complexity: 200-500 examples (multiple page types, varied structures)
- Complex tasks: 500-1000+ examples (diverse websites, challenging edge cases)
- Production-grade models: 1000-10,000+ examples (robust, generalized extraction)
Quality vs. Quantity:
# Bad: Large dataset with inconsistent formatting
bad_dataset = [
{"input": "<div>Product A</div>", "output": "Product A"}, # String output
{"input": "<div>Product B</div>", "output": {"name": "Product B"}}, # Object
{"input": "<div>Product C</div>", "output": ["Product C", "$10"]}, # Array
]
# Good: Smaller dataset with consistent formatting
good_dataset = [
{
"messages": [
{"role": "system", "content": "Extract product info as JSON"},
{"role": "user", "content": "<div>Product A - $20</div>"},
{"role": "assistant", "content": '{"name": "Product A", "price": 20.00}'}
]
},
{
"messages": [
{"role": "system", "content": "Extract product info as JSON"},
{"role": "user", "content": "<span>Product B - $30</span>"},
{"role": "assistant", "content": '{"name": "Product B", "price": 30.00}'}
]
},
]
3. Edge Cases and Variations
Your training data must include examples that represent real-world scenarios and edge cases.
Critical Edge Cases to Include:
edge_case_examples = [
# Missing data
{
"html": "<div class='product'><h2>Item Name</h2></div>", # No price
"output": {"name": "Item Name", "price": null}
},
# Malformed HTML
{
"html": "<div class='product'><h2>Item<span>$19.99</h2></span></div>",
"output": {"name": "Item", "price": 19.99}
},
# Multiple potential values (need the correct one)
{
"html": "<div><span class='old-price'>$50</span><span class='current-price'>$39.99</span></div>",
"output": {"price": 39.99, "original_price": 50.00}
},
# Special characters and encoding
{
"html": "<h1>Café Français - 100€</h1>",
"output": {"name": "Café Français", "price": 100.00, "currency": "EUR"}
},
# Out of stock variations
{
"html": "<div class='product'><h2>Item</h2><span class='unavailable'>Currently Unavailable</span></div>",
"output": {"name": "Item", "availability": "out_of_stock"}
},
# Nested structures
{
"html": "<article><header><div><h1>Title</h1></div></header><p>Description</p></article>",
"output": {"title": "Title", "description": "Description"}
},
]
Collecting Training Data
Method 1: Manual Annotation
The most accurate but time-intensive approach.
from bs4 import BeautifulSoup
import json
def create_training_example_manually(url, html_content, manual_extraction):
"""
Create a training example from manually extracted data
"""
# Clean the HTML (optional)
soup = BeautifulSoup(html_content, 'html.parser')
# Remove unwanted elements
for element in soup(['script', 'style', 'nav', 'footer']):
element.decompose()
cleaned_html = str(soup)
# Create the training format
training_example = {
"messages": [
{
"role": "system",
"content": "Extract product information from HTML and return as JSON."
},
{
"role": "user",
"content": cleaned_html[:4000] # Limit length
},
{
"role": "assistant",
"content": json.dumps(manual_extraction, ensure_ascii=False)
}
],
"metadata": {
"source_url": url,
"created_at": "2025-01-15"
}
}
return training_example
# Example usage
html = """
<div class="product">
<h1>Ergonomic Office Chair</h1>
<div class="price">$299.99</div>
<span class="rating">4.5 stars (234 reviews)</span>
</div>
"""
manual_data = {
"name": "Ergonomic Office Chair",
"price": 299.99,
"rating": 4.5,
"review_count": 234
}
example = create_training_example_manually(
"https://example.com/product/chair",
html,
manual_data
)
# Save to JSONL file
with open('training_data.jsonl', 'a') as f:
f.write(json.dumps(example) + '\n')
Method 2: Semi-Automated Collection
Use existing scrapers to generate training data faster.
import requests
from bs4 import BeautifulSoup
import json
def collect_training_data_from_scraper(urls, css_selectors):
"""
Generate training data using CSS selectors (which you'll teach the LLM to replicate)
"""
training_data = []
for url in urls:
# Fetch the page
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
# Extract using CSS selectors
extracted_data = {}
for field_name, selector in css_selectors.items():
element = soup.select_one(selector)
if element:
extracted_data[field_name] = element.get_text(strip=True)
# Create training example
training_example = {
"messages": [
{
"role": "system",
"content": "Extract article data from HTML and return as JSON with fields: title, author, date, content_preview."
},
{
"role": "user",
"content": str(soup)[:6000]
},
{
"role": "assistant",
"content": json.dumps(extracted_data, ensure_ascii=False)
}
]
}
training_data.append(training_example)
return training_data
# Example usage
article_urls = [
"https://example.com/article-1",
"https://example.com/article-2",
"https://example.com/article-3",
]
selectors = {
"title": "h1.article-title",
"author": "span.author-name",
"date": "time.publish-date",
"content_preview": "div.article-summary"
}
training_examples = collect_training_data_from_scraper(article_urls, selectors)
# Save to file
with open('article_training_data.jsonl', 'w') as f:
for example in training_examples:
f.write(json.dumps(example) + '\n')
Method 3: Using Existing Datasets
Bootstrap with publicly available datasets or synthetic data.
import json
import random
from faker import Faker
fake = Faker()
def generate_synthetic_training_data(num_examples=100):
"""
Generate synthetic product data for training
"""
templates = [
'<div class="product"><h2>{name}</h2><span class="price">${price}</span><p class="rating">{rating} stars</p></div>',
'<article class="item"><h1>{name}</h1><div class="cost">${price}</div><span>{rating}/5</span></article>',
'<section><header>{name}</header><p>Price: ${price}</p><div>Rating: {rating}</div></section>',
]
training_data = []
for _ in range(num_examples):
# Generate fake product data
product_name = fake.catch_phrase()
price = round(random.uniform(10, 500), 2)
rating = round(random.uniform(3.0, 5.0), 1)
# Choose random template
template = random.choice(templates)
html = template.format(name=product_name, price=price, rating=rating)
# Create training example
example = {
"messages": [
{
"role": "system",
"content": "Extract product information and return JSON with name, price, and rating."
},
{
"role": "user",
"content": html
},
{
"role": "assistant",
"content": json.dumps({
"name": product_name,
"price": price,
"rating": rating
})
}
]
}
training_data.append(example)
return training_data
# Generate data
synthetic_data = generate_synthetic_training_data(200)
# Save
with open('synthetic_training.jsonl', 'w') as f:
for item in synthetic_data:
f.write(json.dumps(item) + '\n')
Data Formatting Standards
OpenAI Fine-Tuning Format
{
"messages": [
{"role": "system", "content": "System instruction"},
{"role": "user", "content": "HTML input"},
{"role": "assistant", "content": "JSON output"}
]
}
Hugging Face Format
{
"instruction": "Extract product data from the following HTML",
"input": "<div class='product'>...</div>",
"output": "{\"name\": \"Product\", \"price\": 99.99}"
}
Converting Between Formats
def convert_openai_to_huggingface(openai_format):
"""
Convert OpenAI format to Hugging Face instruction format
"""
messages = openai_format['messages']
return {
"instruction": messages[0]['content'], # system message
"input": messages[1]['content'], # user message
"output": messages[2]['content'] # assistant message
}
def convert_huggingface_to_openai(hf_format):
"""
Convert Hugging Face format to OpenAI format
"""
return {
"messages": [
{"role": "system", "content": hf_format['instruction']},
{"role": "user", "content": hf_format['input']},
{"role": "assistant", "content": hf_format['output']}
]
}
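As a usage sketch (file names here are illustrative), the converters above can be applied line by line to a JSONL dataset:
import json
# Convert an OpenAI-format JSONL file into Hugging Face instruction format.
# 'training_data.jsonl' and 'training_data_hf.jsonl' are placeholder paths.
with open('training_data.jsonl', 'r') as src, open('training_data_hf.jsonl', 'w') as dst:
    for line in src:
        openai_example = json.loads(line)
        hf_example = convert_openai_to_huggingface(openai_example)
        dst.write(json.dumps(hf_example, ensure_ascii=False) + '\n')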
Data Quality Guidelines
1. Validation and Cleaning
import json
from jsonschema import validate, ValidationError
# Define expected output schema
output_schema = {
"type": "object",
"properties": {
"name": {"type": "string"},
"price": {"type": "number"},
"rating": {"type": ["number", "null"]}
},
"required": ["name", "price"]
}
def validate_training_example(example):
"""
Validate a training example for quality
"""
errors = []
# Check structure
if 'messages' not in example:
errors.append("Missing 'messages' field")
return errors
messages = example['messages']
# Check message count
if len(messages) != 3:
errors.append(f"Expected 3 messages, got {len(messages)}")
# Check roles
expected_roles = ['system', 'user', 'assistant']
for i, msg in enumerate(messages):
if msg.get('role') != expected_roles[i]:
errors.append(f"Message {i} has wrong role: {msg.get('role')}")
# Validate JSON output
try:
assistant_output = json.loads(messages[2]['content'])
validate(instance=assistant_output, schema=output_schema)
except json.JSONDecodeError:
errors.append("Assistant message is not valid JSON")
except ValidationError as e:
errors.append(f"Output doesn't match schema: {e.message}")
# Check for empty content
for msg in messages:
if not msg.get('content', '').strip():
errors.append(f"Empty content in {msg.get('role')} message")
return errors
# Validate entire dataset
def validate_dataset(jsonl_file):
"""
Validate all examples in a training dataset
"""
valid_count = 0
total_count = 0
with open(jsonl_file, 'r') as f:
for line_num, line in enumerate(f, 1):
total_count += 1
try:
example = json.loads(line)
errors = validate_training_example(example)
if errors:
print(f"Line {line_num} errors: {errors}")
else:
valid_count += 1
except json.JSONDecodeError:
print(f"Line {line_num}: Invalid JSON")
print(f"\nValidation complete: {valid_count}/{total_count} valid examples")
return valid_count / total_count if total_count else 0.0
# Run validation
validate_dataset('training_data.jsonl')
2. Diversity Metrics
from collections import Counter
import json
import re
def analyze_dataset_diversity(jsonl_file):
"""
Analyze the diversity of your training dataset
"""
html_lengths = []
output_lengths = []
unique_structures = set()
field_counts = Counter()
with open(jsonl_file, 'r') as f:
for line in f:
example = json.loads(line)
messages = example['messages']
# Analyze HTML input
html_input = messages[1]['content']
html_lengths.append(len(html_input))
# Extract HTML tags to identify structure
tags = re.findall(r'<(\w+)', html_input)
structure = '-'.join(sorted(set(tags)))
unique_structures.add(structure)
# Analyze output
output = json.loads(messages[2]['content'])
output_lengths.append(len(messages[2]['content']))
# Count fields
for field in output.keys():
field_counts[field] += 1
print(f"Dataset Diversity Analysis:")
print(f" Total examples: {len(html_lengths)}")
print(f" Unique HTML structures: {len(unique_structures)}")
print(f" Avg HTML length: {sum(html_lengths) / len(html_lengths):.0f} chars")
print(f" Avg output length: {sum(output_lengths) / len(output_lengths):.0f} chars")
print(f"\nField frequency:")
for field, count in field_counts.most_common(10):
print(f" {field}: {count} times")
analyze_dataset_diversity('training_data.jsonl')
3. Train-Validation Split
import json
import random
def split_dataset(input_file, train_file, val_file, val_ratio=0.2):
"""
Split dataset into training and validation sets
"""
# Read all examples
with open(input_file, 'r') as f:
examples = [json.loads(line) for line in f]
# Shuffle
random.shuffle(examples)
# Split
split_idx = int(len(examples) * (1 - val_ratio))
train_examples = examples[:split_idx]
val_examples = examples[split_idx:]
# Write training set
with open(train_file, 'w') as f:
for example in train_examples:
f.write(json.dumps(example) + '\n')
# Write validation set
with open(val_file, 'w') as f:
for example in val_examples:
f.write(json.dumps(example) + '\n')
print(f"Split complete:")
print(f" Training: {len(train_examples)} examples")
print(f" Validation: {len(val_examples)} examples")
split_dataset('all_data.jsonl', 'train.jsonl', 'val.jsonl', val_ratio=0.2)
Advanced Training Data Strategies
Data Augmentation
Increase dataset size by creating variations of existing examples:
import random
import re
from bs4 import BeautifulSoup
def augment_html_example(html_input, output_data):
"""
Create variations of HTML while maintaining the same extraction
"""
soup = BeautifulSoup(html_input, 'html.parser')
augmented_examples = []
# Original example
augmented_examples.append((html_input, output_data))
# Add whitespace variations
html_with_spaces = re.sub(r'>\s*<', '>\n <', html_input)
augmented_examples.append((html_with_spaces, output_data))
# Add random class names (if extraction is class-agnostic)
random_classes = ['item', 'card', 'box', 'container', 'element']
modified_html = html_input
for tag in ['div', 'span', 'section']:
pattern = f'<{tag}>'
replacement = f'<{tag} class="{random.choice(random_classes)}">'
modified_html = modified_html.replace(pattern, replacement, 1)
augmented_examples.append((modified_html, output_data))
return augmented_examples
# Example
original_html = "<div><h2>Product Name</h2><span>$99</span></div>"
original_output = {"name": "Product Name", "price": 99}
variations = augment_html_example(original_html, original_output)
print(f"Created {len(variations)} variations from 1 example")
Active Learning
Identify which examples would most improve your model:
def identify_challenging_examples(unlabeled_html_samples, model, threshold=0.5):
"""
Use model uncertainty to find examples worth labeling
"""
challenging_examples = []
for html in unlabeled_html_samples:
# Get multiple predictions with temperature > 0
predictions = []
for _ in range(5):
response = model.predict(html, temperature=0.8)
predictions.append(response)
# Calculate agreement between predictions
unique_predictions = len(set(predictions))
agreement_rate = 1 - (unique_predictions / len(predictions))
# If model is uncertain (low agreement), this is a good training example
if agreement_rate < threshold:
challenging_examples.append({
'html': html,
'uncertainty_score': 1 - agreement_rate
})
# Return most uncertain examples
challenging_examples.sort(key=lambda x: x['uncertainty_score'], reverse=True)
return challenging_examples
# Use this to prioritize which pages to manually label
Integration with Web Scraping
When preparing training data, consider how it will be used in production:
def prepare_production_aligned_data(urls, extraction_function):
"""
Create training data that matches production scraping workflow
"""
training_data = []
for url in urls:
# Fetch exactly as you would in production
response = requests.get(
url,
headers={'User-Agent': 'Mozilla/5.0...'},
timeout=30
)
# Apply same preprocessing as production
soup = BeautifulSoup(response.content, 'html.parser')
# Remove elements you'd remove in production
for tag in soup(['script', 'style', 'nav', 'footer', 'header']):
tag.decompose()
cleaned_html = str(soup)
# Get ground truth extraction
extracted = extraction_function(soup)
# Create training example that mirrors production
example = {
"messages": [
{"role": "system", "content": "Extract structured data from preprocessed HTML."},
{"role": "user", "content": cleaned_html[:8000]},
{"role": "assistant", "content": json.dumps(extracted)}
]
}
training_data.append(example)
return training_data
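As a usage sketch, the helper above can be driven by the same selector logic your production scraper uses. The extract_product function and URLs below are hypothetical placeholders, and the snippet assumes the requests, BeautifulSoup, and json imports from the earlier examples:
# Hypothetical ground-truth extractor mirroring production CSS selectors
def extract_product(soup):
    name_el = soup.select_one('h1.product-title')
    price_el = soup.select_one('span.price')
    return {
        "name": name_el.get_text(strip=True) if name_el else None,
        "price": price_el.get_text(strip=True) if price_el else None,
    }

examples = prepare_production_aligned_data(
    ["https://example.com/product/1", "https://example.com/product/2"],
    extract_product,
)

with open('production_aligned.jsonl', 'w') as f:
    for example in examples:
        f.write(json.dumps(example) + '\n')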
Common Training Data Mistakes to Avoid
- Overfitting to specific HTML structure: Include variations
- Inconsistent output formatting: Standardize JSON schemas
- Missing edge cases: Test with incomplete/malformed data
- Too little data: Start with at least 100 quality examples
- No validation split: Always hold out 15-20% for testing
- Ignoring token limits: Keep examples within the model's context window (see the token-count sketch below)
- Static datasets: Update with new examples from production errors
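To enforce the token-limit point above, a rough per-example token count catches oversized examples before upload. The sketch below assumes the tiktoken library and its cl100k_base encoding; substitute the tokenizer that matches your target model:
import json
import tiktoken  # assumed dependency; swap in your model's own tokenizer if needed

encoding = tiktoken.get_encoding("cl100k_base")

def check_token_counts(jsonl_file, max_tokens=4096):
    """Flag training examples whose combined messages exceed a token budget."""
    with open(jsonl_file, 'r') as f:
        for line_num, line in enumerate(f, 1):
            example = json.loads(line)
            total_tokens = sum(
                len(encoding.encode(msg['content']))
                for msg in example['messages']
            )
            if total_tokens > max_tokens:
                print(f"Line {line_num}: {total_tokens} tokens (over {max_tokens})")

check_token_counts('training_data.jsonl')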
Conclusion
The quality and composition of your training data directly determines how well your fine-tuned LLM will perform at web scraping tasks. Focus on creating diverse, high-quality examples that represent real-world scenarios, including edge cases and variations. Start with 100-500 carefully curated examples and expand based on validation performance.
Remember to maintain consistent formatting, validate all examples programmatically, and align your training data with your production scraping workflow. For developers who want to skip the complexity of fine-tuning LLMs for web scraping, specialized AI web scraping tools provide pre-trained models and managed infrastructure that work out of the box.