Claude Prompt for Fine-tuning & Model Adaptation
End-to-end SFT dataset construction: collection, labeling, cleaning, dedup, contamination check for function-calling with strict JSON.
You are a dataset engineer. Build a high-quality SFT dataset for fine-tuning an LLM on function-calling with strict JSON. The model's ability is bottlenecked by dataset quality, not model scale — so obsess over the data.
## North Star
50k high-quality (messages, completion) pairs, deduplicated, contamination-clean, with a transparent data card.
## Collection Strategy
### Primary: web-scraped with filtering
- Extraction pipeline: describe how examples are obtained (query, export, scrape, or API)
- Rate limits and backoff
- Legal / license review: confirm usage rights for training
- Expected yield: 40k raw examples
### Secondary: GitHub PR discussions
- Use to cover long-tail and edge cases underrepresented in primary
- Expected yield: 10k
### Synthetic Augmentation
Generate additional examples via Claude Opus 4.5 to cover:
- Edge cases from error analysis of baseline model
- Underrepresented classes / intents
- Counterfactuals (what if the user asks the opposite)
Use a seed set of 300 real examples. For each, prompt Claude Opus 4.5:
```
You are generating training data for a model that performs function calling with strict JSON output.
Given this SEED example:
{seed_example}
Produce 5 NEW examples that are:
- Clearly distinct from the seed (different topic, phrasing, or edge case)
- Equally or more challenging
- Realistic (a real user might ask this)
- Fully self-contained
Output strict JSON: { "examples": [ { "input": "...", "output": "..." } ] }
```
**Rejection sampling:** generate 3x what you need, then filter by automated quality + a small human sample review.
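A minimal generation-plus-rejection-sampling sketch, assuming the Anthropic Python SDK; the model id string and the `quality_score` grader are placeholders to swap for your own:

```python
# Sketch: seed-based generation with rejection sampling.
# Assumptions: Anthropic Python SDK, model id string, and quality_score() hook.
import json
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

GEN_PROMPT = """You are generating training data for a model that performs \
function calling with strict JSON output.
Given this SEED example:
{seed}
Produce 5 NEW examples that are clearly distinct, equally or more challenging,
realistic, and fully self-contained.
Output strict JSON: {{ "examples": [ {{ "input": "...", "output": "..." }} ] }}"""

def generate_candidates(seed: dict, n_rounds: int = 3) -> list[dict]:
    """Over-generate roughly 3x, then let the caller filter (rejection sampling)."""
    candidates = []
    for _ in range(n_rounds):
        resp = client.messages.create(
            model="claude-opus-4-5",  # assumption: adjust to your model id
            max_tokens=4096,
            messages=[{"role": "user",
                       "content": GEN_PROMPT.format(seed=json.dumps(seed))}],
        )
        try:
            candidates.extend(json.loads(resp.content[0].text)["examples"])
        except (json.JSONDecodeError, KeyError, TypeError):
            continue  # malformed generations are simply dropped
    return candidates

def accept(example: dict, quality_score) -> bool:
    """quality_score is a hypothetical automated grader returning 0..1."""
    return bool(example.get("input")) and bool(example.get("output")) \
        and quality_score(example) >= 0.8
```

Generate with `generate_candidates`, keep only what passes `accept`, then route a small random sample of the survivors to human review.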
## Labeling Protocol
### Annotator Guidelines
Write a 2-3 page guideline doc covering:
- Task definition with positive and negative examples
- Output format specification (verbatim schema)
- Disambiguation rules for edge cases (list at least 10)
- When to flag as "discard" vs "label"
- Inter-annotator agreement target: Cohen's kappa ≥ 0.75
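A quick way to track the agreement target on the double-annotated sample is scikit-learn's `cohen_kappa_score`; the label lists are whatever categorical labels your guideline defines:

```python
# Minimal agreement check against the kappa >= 0.75 target, assuming two
# annotators labeled the same 20% review sample with categorical labels.
from sklearn.metrics import cohen_kappa_score

def check_agreement(labels_a: list[str], labels_b: list[str],
                    target: float = 0.75) -> float:
    kappa = cohen_kappa_score(labels_a, labels_b)
    if kappa < target:
        print(f"WARNING: kappa={kappa:.2f} below target {target}; "
              "schedule a calibration session before labeling continues.")
    return kappa
```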
### Pipeline
1. Each example labeled by 1 annotator
2. 20% sampled for second-annotator review
3. Disagreements resolved by lead annotator
4. Weekly calibration sessions with trickiest cases
### Tooling
Use Argilla with:
- Keyboard shortcuts for frequent labels
- Auto-save
- Context panel showing similar past examples
- Version-controlled guidelines doc linked from the UI
## Cleaning & Quality Filters
### Deduplication
- **Exact dedup:** SHA-256 hash of normalized input
- **Near dedup:** MinHash-LSH (128 hashes, threshold 0.85 Jaccard on 5-gram shingles)
- Expected dup rate: 8-12% — log and investigate if higher
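A dedup sketch under these parameters, assuming the `datasketch` library for MinHash-LSH; the normalization rule and the choice of the user turn as the dedup key are assumptions to adjust:

```python
# Exact + near dedup sketch. Normalization (lowercase, collapse whitespace)
# and using the user turn as the dedup key are assumptions.
import hashlib
import re
from datasketch import MinHash, MinHashLSH

def normalize(text: str) -> str:
    return re.sub(r"\s+", " ", text.strip().lower())

def exact_key(text: str) -> str:
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()

def shingles(text: str, n: int = 5) -> set[str]:
    toks = normalize(text).split()
    return {" ".join(toks[i:i + n]) for i in range(max(len(toks) - n + 1, 1))}

def dedup(examples: list[dict]) -> list[dict]:
    seen_hashes: set[str] = set()
    lsh = MinHashLSH(threshold=0.85, num_perm=128)
    kept = []
    for i, ex in enumerate(examples):
        text = ex["messages"][-2]["content"]  # assumption: last user turn as key
        h = exact_key(text)
        if h in seen_hashes:
            continue  # exact duplicate
        m = MinHash(num_perm=128)
        for s in shingles(text):
            m.update(s.encode("utf-8"))
        if lsh.query(m):  # any near-duplicate already kept?
            continue
        seen_hashes.add(h)
        lsh.insert(f"ex_{i}", m)
        kept.append(ex)
    return kept
```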
### Quality Filters
Drop examples where any of the following is true (see the filter sketch after this list):
- assistant response < 10 tokens (likely truncated)
- assistant response > 4096 tokens (likely runaway)
- user message empty or only punctuation
- language mismatch (target: English + Spanish)
- toxicity score > 0.3
- PII detected (email, phone, SSN) without redaction
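A filter sketch under these thresholds; `detect_toxicity`, `detect_pii`, and `detect_lang` are hypothetical hooks for whatever classifiers you use, and whitespace token counts are only a proxy for your real tokenizer:

```python
# Quality-filter sketch. detect_toxicity(), detect_pii(), and detect_lang()
# are hypothetical hooks: plug in your own classifiers.
def passes_filters(ex: dict, detect_toxicity, detect_pii, detect_lang) -> bool:
    user = next(m["content"] for m in ex["messages"] if m["role"] == "user")
    assistant = next(m["content"] for m in ex["messages"] if m["role"] == "assistant")
    n_tokens = len(assistant.split())  # rough proxy; swap in your tokenizer
    if n_tokens < 10 or n_tokens > 4096:
        return False                   # likely truncated or runaway
    if not user.strip() or all(not c.isalnum() for c in user):
        return False                   # empty or punctuation-only user message
    if detect_lang(user) not in {"en", "es"}:
        return False                   # language mismatch
    if detect_toxicity(assistant) > 0.3:
        return False
    if detect_pii(user) or detect_pii(assistant):
        return False                   # unredacted PII
    return True
```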
### Format Validation
For structured-output tasks:
- Parse each assistant response against the expected schema
- Drop or fix invalid
- Track invalid rate — high rate signals annotation training gap
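A validation sketch using the `jsonschema` library; `TOOL_CALL_SCHEMA` is a stand-in for the verbatim schema from the annotator guidelines:

```python
# Schema-validation sketch using the jsonschema library.
# TOOL_CALL_SCHEMA is a placeholder: use the verbatim schema from the guidelines.
import json
from jsonschema import validate, ValidationError

TOOL_CALL_SCHEMA = {
    "type": "object",
    "required": ["name", "arguments"],
    "properties": {"name": {"type": "string"}, "arguments": {"type": "object"}},
}

def validate_response(assistant_text: str) -> tuple[bool, str | None]:
    """Returns (is_valid, error_message). Track the invalid rate per batch."""
    try:
        payload = json.loads(assistant_text)
        validate(instance=payload, schema=TOOL_CALL_SCHEMA)
        return True, None
    except (json.JSONDecodeError, ValidationError) as err:
        return False, str(err)
```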
## Contamination Check
Critical: the fine-tune eval results are worthless if train leaks into eval.
Run:
- Exact match between train.input and eval.input
- 13-gram overlap between train.assistant and eval.reference
- Semantic dedup via embedding similarity > 0.92 → review manually
Run against ALL eval sets you'll report on: MMLU, HumanEval, GSM8K, MT-Bench, customer-validated golden set.
Document the contamination report in the data card.
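A sketch of the exact-match and 13-gram checks; field names follow the train.input / train.assistant / eval.reference convention above, and the embedding-based pass (cosine > 0.92) is left as a hook for your preferred embedding model:

```python
# Contamination-check sketch: exact input match plus 13-gram overlap between
# train completions and eval references. Run once per eval set you report on.
def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def contamination_report(train: list[dict], eval_set: list[dict]) -> dict:
    eval_inputs = {ex["input"].strip().lower() for ex in eval_set}
    eval_ngrams: set[tuple[str, ...]] = set()
    for ex in eval_set:
        eval_ngrams |= ngrams(ex["reference"])
    exact_hits, ngram_hits = [], []
    for ex in train:
        if ex["input"].strip().lower() in eval_inputs:
            exact_hits.append(ex["id"])
        if ngrams(ex["assistant"]) & eval_ngrams:
            ngram_hits.append(ex["id"])
    return {"exact_matches": exact_hits, "ngram_overlaps": ngram_hits}
```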
## Data Card
Produce `data_card.md` with:
- Sources with counts and license
- Collection window (date range)
- Language breakdown
- Length distribution (histogram)
- Task distribution (if multi-task)
- Known biases and limitations
- Ethical considerations
- Version + changelog
## Splits
- Train: 90% (45k)
- Dev: 5% (2.5k)
- Holdout: 5% (2.5k), frozen, never used for tuning
Stratify by task subtype so all splits have the same distribution.
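A stratified 90/5/5 split sketch with scikit-learn; the `task` field is an assumed subtype label, so rename it to match your data:

```python
# Stratified 90/5/5 split sketch. The "task" field used for stratification
# is an assumption: use whatever subtype label your data card defines.
from sklearn.model_selection import train_test_split

def split_dataset(examples: list[dict], seed: int = 42):
    labels = [ex["task"] for ex in examples]
    train, rest, _, rest_labels = train_test_split(
        examples, labels, test_size=0.10, stratify=labels, random_state=seed)
    dev, holdout = train_test_split(
        rest, test_size=0.50, stratify=rest_labels, random_state=seed)
    return train, dev, holdout  # freeze holdout: never touch it during tuning
```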
## Output Format (JSONL)
```jsonl
{"id": "ex_0001", "messages": [{"role": "system", "content": "..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}], "source": "primary", "quality": 0.92, "lang": "en"}
```
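A minimal writer sketch that enforces the record shape above before anything lands in the train/dev/holdout files:

```python
# JSONL writer sketch enforcing the record shape shown above.
import json

REQUIRED_KEYS = {"id", "messages", "source", "quality", "lang"}

def write_jsonl(path: str, records: list[dict]) -> None:
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            missing = REQUIRED_KEYS - rec.keys()
            if missing:
                raise ValueError(f"{rec.get('id', '<no id>')} missing keys: {missing}")
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
```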
## Deliverables
1. Raw collected data + provenance logs
2. Labeling guideline doc + calibration set
3. Cleaning pipeline script (`clean.py`) with tests
4. Contamination report
5. Final train.jsonl, dev.jsonl, holdout.jsonl
6. Data card
7. Reproducibility: scripts + commit hash + environment
- Use precise technical terminology appropriate for the audience
- Include code examples, configurations, or specifications where relevant
- Document assumptions, prerequisites, and dependencies
- Provide error handling and edge case considerations

Replace the bracketed placeholders with your own context before running the prompt:
- `{seed_example}`: one of your real seed examples (an `{ "input": "...", "output": "..." }` pair).