Collect domain-specific data for fine-tuning language models. Build instruction datasets, Q&A pairs, and specialized training corpora.
Off-the-shelf LLMs often lack expertise in your specific domain. Fine-tuning on high-quality, domain-specific data helps a model handle your terminology, formats, and use case.
Building instruction datasets manually is expensive and slow. You need automated collection of examples that match your target format and domain.
Collect the data your model needs
Extract questions and answers from FAQs, forums, and documentation.
Generate instruction-response pairs for supervised fine-tuning (SFT) and RLHF.
Build specialized text corpora for continued pre-training.
Extract multi-turn conversations for chat model training.
Collect LLM fine-tuning data
const axios = require('axios');

const API_KEY = 'your_api_key';

// Top-level await is not available in CommonJS modules,
// so the requests run inside an async function.
async function main() {
  // Extract Q&A pairs from FAQ pages
  const faqUrl = 'https://support.example.com/faq';
  const qaData = await axios.get('https://api.webscraping.ai/ai/fields', {
    params: {
      api_key: API_KEY,
      url: faqUrl,
      fields: JSON.stringify({
        qa_pairs: 'Array of objects with question and answer fields for each FAQ item'
      })
    }
  });
  console.log(qaData.data);
  // {
  //   "qa_pairs": [
  //     {
  //       "question": "How do I reset my password?",
  //       "answer": "Click the 'Forgot Password' link on the login page..."
  //     },
  //     {
  //       "question": "What payment methods do you accept?",
  //       "answer": "We accept Visa, Mastercard, PayPal..."
  //     }
  //   ]
  // }

  // Generate instruction-format training examples
  const tutorialUrl = 'https://docs.example.com/tutorial/getting-started';
  const instructionData = await axios.get('https://api.webscraping.ai/ai/question', {
    params: {
      api_key: API_KEY,
      url: tutorialUrl,
      question: 'Convert this tutorial into 5 instruction-response pairs. Format: {"instruction": "user request", "response": "assistant answer"}. Focus on practical tasks covered.'
    }
  });
  console.log(instructionData.data);

  // Extract domain-specific terminology and definitions
  const glossaryUrl = 'https://medical-reference.com/glossary';
  const domainData = await axios.get('https://api.webscraping.ai/ai/fields', {
    params: {
      api_key: API_KEY,
      url: glossaryUrl,
      fields: JSON.stringify({
        terms: 'Array of {term, definition, usage_example} objects for each medical term'
      })
    }
  });
  console.log(domainData.data);
}

// Report network or API errors instead of leaving an unhandled rejection
main().catch((err) => console.error('Request failed:', err.message));
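Once the Q&A pairs come back, they usually need to be written in whatever format your training tool expects. Here is a minimal sketch, assuming the `qa_pairs` shape from the sample output above and a chat-style JSONL layout with `messages`, `role`, and `content` keys. That layout is a common convention, not something this API prescribes, so adjust it to your trainer. The `qaPairsToJsonl` helper is a hypothetical name used for illustration.

```javascript
// Hypothetical helper: turns extracted Q&A pairs into chat-style JSONL.
// Assumption: each item looks like { question: string, answer: string }.
function qaPairsToJsonl(qaPairs) {
  const seen = new Set();
  const lines = [];
  for (const pair of qaPairs) {
    const question = typeof pair?.question === 'string' ? pair.question.trim() : '';
    const answer = typeof pair?.answer === 'string' ? pair.answer.trim() : '';
    if (!question || !answer) continue; // skip incomplete items

    const key = question.toLowerCase();
    if (seen.has(key)) continue; // skip questions already seen (case-insensitive)
    seen.add(key);

    // One JSON object per line; the "messages" layout is an assumed target format
    lines.push(JSON.stringify({
      messages: [
        { role: 'user', content: question },
        { role: 'assistant', content: answer },
      ],
    }));
  }
  return lines.join('\n');
}

// Sample input: one valid pair, one duplicate question, one empty answer
const sampleQaPairs = [
  { question: 'How do I reset my password?', answer: "Click the 'Forgot Password' link on the login page..." },
  { question: 'how do i reset my password?', answer: 'Duplicate question with different casing.' },
  { question: 'What payment methods do you accept?', answer: '' },
];

const jsonl = qaPairsToJsonl(sampleQaPairs);
console.log(jsonl);
```

Dropping empty and repeated items before writing the file keeps obviously unusable records out of the training set. A real pipeline would likely add stricter checks, such as length limits or near-duplicate detection.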
# Extract Q&A pairs
curl -G "https://api.webscraping.ai/ai/fields" \
  --data-urlencode "api_key=your_api_key" \
  --data-urlencode "url=https://support.example.com/faq" \
  --data-urlencode 'fields={"qa_pairs":"Array of {question, answer} objects"}'

# Generate instruction data
curl -G "https://api.webscraping.ai/ai/question" \
  --data-urlencode "api_key=your_api_key" \
  --data-urlencode "url=https://docs.example.com/tutorial" \
  --data-urlencode "question=Convert to 5 instruction-response training pairs"
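The instruction-generation request asks for instruction-response objects, but the exact shape of the reply is not shown here. Below is a hedged sketch, assuming the answer arrives as plain text that may wrap the JSON objects in extra wording. `parseInstructionPairs` and the sample text are hypothetical, and the simple pattern used here will miss objects whose string values contain curly braces.

```javascript
// Hypothetical helper: pulls {"instruction": ..., "response": ...} objects
// out of a free-text answer and discards anything malformed.
function parseInstructionPairs(answerText) {
  const candidates = [];
  try {
    // Best case: the whole answer is valid JSON (a single object or an array)
    const parsed = JSON.parse(answerText);
    candidates.push(...(Array.isArray(parsed) ? parsed : [parsed]));
  } catch {
    // Otherwise, look for flat {...} chunks embedded in the text
    const chunks = answerText.match(/\{[^{}]*\}/g) || [];
    for (const chunk of chunks) {
      try {
        candidates.push(JSON.parse(chunk));
      } catch {
        // Not valid JSON, ignore this chunk
      }
    }
  }
  // Keep only objects with non-empty string instruction and response fields
  return candidates.filter(
    (item) =>
      item !== null &&
      typeof item === 'object' &&
      typeof item.instruction === 'string' &&
      typeof item.response === 'string' &&
      item.instruction.trim() !== '' &&
      item.response.trim() !== ''
  );
}

// Made-up answer text for illustration only
const sampleAnswer = [
  'Here are the pairs:',
  '1. {"instruction": "Install the CLI", "response": "Run the installer from the downloads page."}',
  '2. {"instruction": "Create a project", "response": "Use the init command in an empty folder."}',
  '3. {"instruction": "", "response": "Empty instruction, so this one is dropped."}',
  '4. {not valid json}',
].join('\n');

const pairs = parseInstructionPairs(sampleAnswer);
console.log(pairs.length); // 2
console.log(pairs[0].instruction); // Install the CLI
```

Validating each pair before it enters the dataset matters because model-generated output does not always follow the requested format.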
Train models that understand your industry terminology
Fine-tune on documentation and code examples
Train on support conversations and FAQ data
Fine-tune for your brand voice and style
Get started with 1,000 free API credits. No credit card required.