How do I fine-tune an LLM for web scraping tasks?

Fine-tuning a large language model (LLM) for web scraping tasks involves training a pre-trained model on domain-specific data to improve its ability to extract structured information from HTML content. While general-purpose LLMs can handle many scraping scenarios, fine-tuning can significantly improve accuracy, reduce costs, and create specialized models for your specific use cases.

Why Fine-Tune an LLM for Web Scraping?

Before diving into the technical process, it's important to understand when fine-tuning makes sense:

When to Consider Fine-Tuning:

  • You're processing a specific website structure repeatedly
  • You need consistent extraction patterns across similar pages
  • General LLMs produce inconsistent results for your use case
  • You want to reduce per-request costs for high-volume scraping
  • You need faster inference times than larger general-purpose models

When to Stick with Pre-trained Models:

  • You're scraping diverse, constantly changing websites
  • Your scraping tasks are one-off or infrequent
  • The upfront cost and time investment don't justify the benefits
  • General-purpose models already provide acceptable accuracy

Understanding the Fine-Tuning Process

Fine-tuning adapts a pre-trained model to your specific task by continuing the training process on your custom dataset. For web scraping, this means teaching the model to recognize patterns in HTML and extract data consistently.

Step 1: Collect and Prepare Training Data

The quality of your training data directly impacts model performance. You'll need pairs of HTML input and desired JSON output.

Example Training Data Format (JSONL):

{"messages": [{"role": "system", "content": "You are a web scraping assistant that extracts product information from HTML."}, {"role": "user", "content": "<div class='product'><h2>Wireless Headphones</h2><span class='price'>$79.99</span><p class='rating'>4.5 stars</p></div>"}, {"role": "assistant", "content": "{\"name\": \"Wireless Headphones\", \"price\": 79.99, \"rating\": 4.5}"}]}
{"messages": [{"role": "system", "content": "You are a web scraping assistant that extracts product information from HTML."}, {"role": "user", "content": "<div class='product'><h2>Smart Watch</h2><span class='price'>$199.99</span><p class='rating'>4.8 stars</p></div>"}, {"role": "assistant", "content": "{\"name\": \"Smart Watch\", \"price\": 199.99, \"rating\": 4.8}"}]}

Dataset Preparation Script (Python):

import json
from bs4 import BeautifulSoup

def create_training_example(html_snippet, extracted_data):
    """
    Convert HTML and extracted data into fine-tuning format
    """
    return {
        "messages": [
            {
                "role": "system",
                "content": "You are a web scraping assistant that extracts structured data from HTML."
            },
            {
                "role": "user",
                "content": html_snippet
            },
            {
                "role": "assistant",
                "content": json.dumps(extracted_data)
            }
        ]
    }

# Prepare your dataset
training_examples = []

# Example: Product pages
html_samples = [
    '<div class="item"><h3>Laptop</h3><span>$999</span></div>',
    '<div class="item"><h3>Mouse</h3><span>$25</span></div>',
]

extracted_samples = [
    {"product": "Laptop", "price": 999},
    {"product": "Mouse", "price": 25},
]

for html, data in zip(html_samples, extracted_samples):
    training_examples.append(create_training_example(html, data))

# Save to JSONL file
with open('training_data.jsonl', 'w') as f:
    for example in training_examples:
        f.write(json.dumps(example) + '\n')
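
Before uploading, it's worth a quick format check. The sketch below uses a hypothetical validate_jsonl helper (not part of any SDK) to confirm every line parses and contains the expected system/user/assistant roles:

import json

def validate_jsonl(path):
    """Hypothetical helper: confirm each line parses and has the expected roles."""
    expected_roles = ["system", "user", "assistant"]
    with open(path) as f:
        for line_number, line in enumerate(f, start=1):
            record = json.loads(line)  # raises an error on malformed JSON
            roles = [message["role"] for message in record["messages"]]
            if roles != expected_roles:
                raise ValueError(f"Line {line_number}: unexpected roles {roles}")
    print(f"{path} looks well-formed")

validate_jsonl('training_data.jsonl')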

Step 2: Choose Your Fine-Tuning Platform

Several platforms support LLM fine-tuning for web scraping tasks:

OpenAI Fine-Tuning:

from openai import OpenAI
client = OpenAI()

# Upload training file
with open('training_data.jsonl', 'rb') as f:
    training_file = client.files.create(
        file=f,
        purpose='fine-tune'
    )

# Create fine-tuning job
fine_tune_job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
    hyperparameters={
        "n_epochs": 3,
        "batch_size": 4,
        "learning_rate_multiplier": 0.1
    }
)

print(f"Fine-tuning job created: {fine_tune_job.id}")

# Monitor the job
import time
while True:
    job_status = client.fine_tuning.jobs.retrieve(fine_tune_job.id)
    print(f"Status: {job_status.status}")

    if job_status.status in ['succeeded', 'failed', 'cancelled']:
        break

    time.sleep(60)

if job_status.status == 'succeeded':
    print(f"Fine-tuned model: {job_status.fine_tuned_model}")

Using Your Fine-Tuned Model:

# Use the fine-tuned model for scraping
response = client.chat.completions.create(
    model="ft:gpt-3.5-turbo:your-org:custom-model-id",
    messages=[
        {"role": "system", "content": "You are a web scraping assistant."},
        {"role": "user", "content": "<div class='product'><h2>Tablet</h2><span>$349</span></div>"}
    ]
)

extracted_data = json.loads(response.choices[0].message.content)
print(extracted_data)  # {"product": "Tablet", "price": 349}

Anthropic Claude Fine-Tuning:

As of now, Claude doesn't offer public fine-tuning, but you can use prompt engineering and few-shot learning techniques to achieve similar results. Contact Anthropic for enterprise custom model options.
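
Until then, few-shot prompting can approximate a specialized extractor. A minimal sketch with the Anthropic Python SDK, assuming an illustrative Claude model name:

import json
from anthropic import Anthropic

client = Anthropic()

# Few-shot examples stand in for fine-tuning: show the extraction pattern in the prompt
few_shot = [
    {"role": "user", "content": "<div class='item'><h3>Laptop</h3><span>$999</span></div>"},
    {"role": "assistant", "content": '{"product": "Laptop", "price": 999}'},
    {"role": "user", "content": "<div class='item'><h3>Tablet</h3><span>$349</span></div>"},
]

response = client.messages.create(
    model="claude-3-haiku-20240307",  # illustrative model name; substitute a current Claude model
    max_tokens=256,
    system="You are a web scraping assistant. Reply with JSON only.",
    messages=few_shot,
)

print(json.loads(response.content[0].text))  # {"product": "Tablet", "price": 349}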

Open-Source Options (Hugging Face):

from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer
from datasets import load_dataset

# Load a base model
model_name = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Mistral's tokenizer ships without a pad token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Load your dataset
dataset = load_dataset('json', data_files='training_data.jsonl')

def tokenize_function(example):
    # Combine system, user, and assistant messages into a single training string
    full_text = "\n".join(message['content'] for message in example['messages'])
    tokens = tokenizer(full_text, truncation=True, padding='max_length', max_length=512)
    # For causal LM fine-tuning, the labels are the input tokens themselves
    tokens['labels'] = tokens['input_ids'].copy()
    return tokens

tokenized_dataset = dataset.map(tokenize_function, remove_columns=['messages'])

# Configure training
training_args = TrainingArguments(
    output_dir="./scraping-llm-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    learning_rate=2e-5,
    warmup_steps=100,
    logging_dir='./logs',
    logging_steps=10,
    save_steps=500,
)

# Train the model
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset['train'],
)

trainer.train()
trainer.save_model("./scraping-llm-finetuned-final")
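
To run extractions with the saved model, a minimal inference sketch (the prompt mirrors the concatenated training format; Trainer.save_model stores the weights, so the tokenizer is reloaded from the base checkpoint):

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
model = AutoModelForCausalLM.from_pretrained("./scraping-llm-finetuned-final")

prompt = (
    "You are a web scraping assistant that extracts structured data from HTML.\n"
    "<div class='item'><h3>Tablet</h3><span>$349</span></div>\n"
)

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)

# Keep only the newly generated tokens (the output otherwise echoes the prompt)
completion = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(completion)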

Step 3: Validate and Test Your Model

After fine-tuning, thoroughly test the model on held-out validation data:

import json

def validate_extraction(test_cases):
    """
    Test the fine-tuned model on validation data
    """
    correct = 0
    total = len(test_cases)

    for html_input, expected_output in test_cases:
        response = client.chat.completions.create(
            model="ft:gpt-3.5-turbo:your-org:model-id",
            messages=[
                {"role": "system", "content": "Extract structured data from HTML."},
                {"role": "user", "content": html_input}
            ]
        )

        try:
            extracted = json.loads(response.choices[0].message.content)
            if extracted == expected_output:
                correct += 1
            else:
                print(f"Mismatch: Expected {expected_output}, Got {extracted}")
        except json.JSONDecodeError:
            print(f"Invalid JSON output: {response.choices[0].message.content}")

    accuracy = (correct / total) * 100
    print(f"Validation Accuracy: {accuracy:.2f}%")
    return accuracy

# Test cases
validation_data = [
    ('<div class="item"><h3>Keyboard</h3><span>$89</span></div>',
     {"product": "Keyboard", "price": 89}),
    ('<div class="item"><h3>Monitor</h3><span>$299</span></div>',
     {"product": "Monitor", "price": 299}),
]

validate_extraction(validation_data)

Best Practices for Fine-Tuning LLMs for Web Scraping

1. Dataset Quality and Diversity

Create Representative Examples: - Include edge cases (missing fields, malformed HTML) - Vary HTML structure while maintaining the extraction pattern - Include negative examples (what NOT to extract) - Ensure consistent output formatting

# Good: Diverse examples
examples = [
    # Standard case
    ("<div><h2>Title</h2></div>", {"title": "Title"}),
    # Missing data
    ("<div></div>", {"title": null}),
    # Nested structure
    ("<article><header><h1>Title</h1></header></article>", {"title": "Title"}),
    # Multiple candidates
    ("<div><h2>Ad Title</h2><h1>Real Title</h1></div>", {"title": "Real Title"}),
]

2. Hyperparameter Tuning

Key parameters to experiment with (a job-creation sketch follows this list):

  • Learning Rate: Start with 0.1-0.2 multiplier for OpenAI models
  • Epochs: Usually 3-10; monitor for overfitting
  • Batch Size: Smaller batches (4-8) often work better for small datasets
  • Training Data Size: Minimum 50-100 examples, ideally 500+
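
As a sketch of how these knobs map onto an OpenAI fine-tuning job, including a held-out validation set (this assumes a second JSONL file was uploaded the same way as the training file):

# Both files must already be uploaded with purpose='fine-tune'
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    validation_file=validation_file.id,  # assumed: a separate held-out JSONL uploaded earlier
    model="gpt-3.5-turbo",
    hyperparameters={
        "n_epochs": 4,
        "batch_size": 8,
        "learning_rate_multiplier": 0.2,
    },
)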

3. Iterative Improvement

def iterative_training_loop(initial_data, model_id=None):
    """
    Continuously improve the model based on errors
    """
    current_data = initial_data
    iteration = 0

    while iteration < 5:  # Max iterations
        print(f"\n=== Iteration {iteration + 1} ===")

        # Fine-tune (or continue fine-tuning)
        if model_id is None:
            model_id = train_model(current_data)
        else:
            model_id = continue_training(model_id, current_data)

        # Validate
        errors = validate_and_collect_errors(model_id)

        if len(errors) == 0:
            print("No errors found! Training complete.")
            break

        # Add error cases to training data
        current_data.extend(create_corrections(errors))
        iteration += 1

    return model_id

4. Cost Optimization

Fine-tuning can reduce long-term costs:

# Cost comparison
def calculate_cost_savings(requests_per_month, input_tokens=1000, output_tokens=200):
    """
    Compare costs: GPT-4 vs fine-tuned GPT-3.5
    """
    # GPT-4 pricing (example)
    gpt4_input_cost = (input_tokens / 1000) * 0.03
    gpt4_output_cost = (output_tokens / 1000) * 0.06
    gpt4_monthly = (gpt4_input_cost + gpt4_output_cost) * requests_per_month

    # Fine-tuned GPT-3.5 pricing (example)
    ft_input_cost = (input_tokens / 1000) * 0.012
    ft_output_cost = (output_tokens / 1000) * 0.016
    ft_monthly = (ft_input_cost + ft_output_cost) * requests_per_month

    savings = gpt4_monthly - ft_monthly
    print(f"Monthly savings: ${savings:.2f}")
    print(f"Annual savings: ${savings * 12:.2f}")

    return savings

calculate_cost_savings(10000)  # 10K requests/month

Integration with Web Scraping Workflows

Once your model is fine-tuned, integrate it into your scraping pipeline:

import json
import requests
from openai import OpenAI

client = OpenAI()

def scrape_with_finetuned_llm(url, fine_tuned_model):
    """
    Complete scraping workflow with fine-tuned LLM
    """
    # Fetch HTML
    response = requests.get(url)
    html_content = response.text

    # Optional: Clean/preprocess HTML
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html_content, 'html.parser')

    # Remove scripts, styles
    for script in soup(["script", "style"]):
        script.decompose()

    cleaned_html = str(soup)

    # Extract with fine-tuned model
    llm_response = client.chat.completions.create(
        model=fine_tuned_model,
        messages=[
            {"role": "system", "content": "Extract structured data from HTML."},
            {"role": "user", "content": cleaned_html[:8000]}  # Limit tokens
        ],
        temperature=0.1  # Lower temperature for consistent extraction
    )

    try:
        extracted_data = json.loads(llm_response.choices[0].message.content)
        return extracted_data
    except json.JSONDecodeError:
        print("Failed to parse JSON response")
        return None

# Use it
data = scrape_with_finetuned_llm(
    "https://example.com/product/123",
    "ft:gpt-3.5-turbo:your-org:scraper-v1"
)
print(data)

Monitoring and Maintenance

Track Model Performance:

from datetime import datetime

class ScrapingModelMonitor:
    def __init__(self):
        self.successful_extractions = 0
        self.failed_extractions = 0
        self.parsing_errors = []

    def log_extraction(self, url, success, error=None):
        timestamp = datetime.now()

        if success:
            self.successful_extractions += 1
        else:
            self.failed_extractions += 1
            self.parsing_errors.append({
                'url': url,
                'error': error,
                'timestamp': timestamp
            })

        # Log metrics
        if (self.successful_extractions + self.failed_extractions) % 100 == 0:
            self.report_metrics()

    def report_metrics(self):
        total = self.successful_extractions + self.failed_extractions
        success_rate = (self.successful_extractions / total) * 100

        print(f"Success Rate: {success_rate:.2f}%")
        print(f"Total Extractions: {total}")

        if success_rate < 90:
            print("WARNING: Success rate below 90%. Consider retraining.")

monitor = ScrapingModelMonitor()
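
A short usage sketch that wires the monitor into the scrape_with_finetuned_llm function from the previous section (the URLs and model ID are placeholders):

urls = [
    "https://example.com/product/1",
    "https://example.com/product/2",
]

for url in urls:
    data = scrape_with_finetuned_llm(url, "ft:gpt-3.5-turbo:your-org:scraper-v1")
    if data is not None:
        monitor.log_extraction(url, success=True)
    else:
        monitor.log_extraction(url, success=False, error="JSON parsing failed")

monitor.report_metrics()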

Advanced Techniques

Parameter-Efficient Fine-Tuning (PEFT)

For smaller models and reduced costs, consider LoRA (Low-Rank Adaptation):

from transformers import AutoModelForCausalLM
from peft import get_peft_model, LoraConfig, TaskType

# Load the base model to wrap with LoRA adapters
base_model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

# Configure LoRA
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,  # Rank
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"]
)

# Apply LoRA to model
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # Shows that only ~1% of parameters are trainable
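
The LoRA-wrapped model drops into the same Trainer setup shown earlier (this sketch reuses training_args and tokenized_dataset from the Hugging Face section); after training, only the lightweight adapter needs to be saved:

from transformers import Trainer

# Reuses training_args and tokenized_dataset from the Hugging Face section above
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset['train'],
)
trainer.train()

# Saves only the small adapter weights, not the full base model
model.save_pretrained("./scraping-llm-lora-adapter")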

Multi-Task Fine-Tuning

Train one model for multiple scraping tasks:

# Training data with task prefixes
multi_task_examples = [
    {"messages": [
        {"role": "system", "content": "Task: product_extraction"},
        {"role": "user", "content": "<div>...</div>"},
        {"role": "assistant", "content": '{"name": "..."}'}
    ]},
    {"messages": [
        {"role": "system", "content": "Task: article_extraction"},
        {"role": "user", "content": "<article>...</article>"},
        {"role": "assistant", "content": '{"title": "...", "author": "..."}'}
    ]},
]
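
At inference time, the task prefix in the system message selects the behavior. A sketch, assuming a placeholder fine-tuned model ID:

import json

# The system message's task prefix tells the multi-task model which schema to produce
response = client.chat.completions.create(
    model="ft:gpt-3.5-turbo:your-org:multi-task-scraper",  # placeholder model ID
    messages=[
        {"role": "system", "content": "Task: article_extraction"},
        {"role": "user", "content": "<article><h1>Example Headline</h1><p>By Jane Doe</p></article>"},
    ],
    temperature=0.1,
)

print(json.loads(response.choices[0].message.content))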

Conclusion

Fine-tuning an LLM for web scraping tasks can dramatically improve extraction accuracy, reduce costs, and create specialized models for your specific needs. While the initial investment in data preparation and training requires effort, the long-term benefits often justify the cost—especially for high-volume, repetitive scraping tasks.

Start with a small, high-quality dataset of 100-500 examples, validate thoroughly, and iterate based on real-world performance. As you collect more data from production use, continuously refine your model to handle edge cases and improve structured output quality.

For developers looking for a simpler solution without the overhead of fine-tuning, consider using specialized AI-powered web scraping APIs that handle the complexity for you.

Try WebScraping.AI for Your Web Scraping Needs

Looking for a powerful web scraping solution? WebScraping.AI provides an LLM-powered API that combines Chromium JavaScript rendering with rotating proxies for reliable data extraction.

Key Features:

  • AI-powered extraction: Ask questions about web pages or extract structured data fields
  • JavaScript rendering: Full Chromium browser support for dynamic content
  • Rotating proxies: Datacenter and residential proxies from multiple countries
  • Easy integration: Simple REST API with SDKs for Python, Ruby, PHP, and more
  • Reliable & scalable: Built for developers who need consistent results

Getting Started:

Get page content with AI analysis:

curl "https://api.webscraping.ai/ai/question?url=https://example.com&question=What is the main topic?&api_key=YOUR_API_KEY"

Extract structured data:

curl "https://api.webscraping.ai/ai/fields?url=https://example.com&fields[title]=Page title&fields[price]=Product price&api_key=YOUR_API_KEY"
