# AI Prompt for Fine-tuning & Model Adaptation
Rigorous evaluation harness comparing the fine-tuned model against Phi-4 base, closed-source frontier, and previous checkpoint.
You are an ML evaluation lead. Your job is to decide whether the new fine-tuned Phi-4 (call it ft-v0.3.0) is better than the last shipped checkpoint on code review, and whether it's worth shipping over the frontier closed-source API.
## Comparison Matrix
| Model | Cost / 1M tok | Latency p95 | Hosted? |
|---|---|---|---|
| ft-v0.3.0 (new) | $X | Yms | Self-hosted on 4x A100 40GB |
| ft-v1.1.0 (current prod) | $X | Yms | Self-hosted |
| Phi-4 base | $X | Yms | Self-hosted |
| GPT-4.1 | $X | Yms | API |
## Evaluation Sets
1. **Task golden set:** 1000 held-out examples labeled by experts. Primary decision metric.
2. **Production shadow sample:** 500 recent real queries (no labels; pairwise judged).
3. **Regression suite:** 50 examples covering previously-fixed bugs. Must not regress.
4. **Capability preservation:** MMLU (5-shot), HumanEval, GSM8K, MT-Bench.
5. **Safety:** 200 harmful / adversarial prompts.
6. **Format compliance:** 300 prompts requiring specific output schema.
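A minimal sketch of how these eval sets could be stored and loaded as JSONL, assuming Python for the harness; the field names (`prompt_id`, `eval_set`, `reference`, `metadata`) are illustrative, not a fixed schema.

```python
import json
from pathlib import Path
from typing import Iterator, Optional, TypedDict

class EvalExample(TypedDict, total=False):
    prompt_id: str            # stable ID, reused when tagging traces later
    eval_set: str             # e.g. "golden", "shadow", "regression", "safety", "format"
    prompt: str               # input shown to the model
    reference: Optional[str]  # expert label / gold answer; None for unlabeled sets
    metadata: dict            # length bucket, subtype, difficulty, source, etc.

def load_eval_set(path: str | Path) -> Iterator[EvalExample]:
    """Yield one example per JSONL line, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

# Example: iterate the golden set
# for ex in load_eval_set("eval/golden.jsonl"):
#     print(ex["prompt_id"], ex["metadata"].get("difficulty"))
```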
## Metrics
Per eval set:
- **Task accuracy / F1 / exact-match** (for labeled sets)
- **GPT-4.1 rubric score** on a 1-5 scale with chain-of-thought (for unlabeled sets)
- **Pairwise win rate** vs each competitor (judge picks A, B, or tie, with response positions randomized)
- **Latency p50, p95, p99**
- **Cost per request**
- **Refusal rate** (on safety set, measure both over-refusal and under-refusal)
- **Schema validation rate** (on format set)
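The aggregations are straightforward; the sketch below shows the win-rate, refusal-rate, and latency-percentile calculations, assuming per-example result records with illustrative field names such as `is_harmful` and `refused`.

```python
from statistics import quantiles

def pairwise_win_rate(verdicts: list[str]) -> float:
    """Win rate for the new model; ties count as half a win."""
    wins = sum(v == "new_wins" for v in verdicts)
    ties = sum(v == "tie" for v in verdicts)
    return (wins + 0.5 * ties) / len(verdicts)

def refusal_rates(results: list[dict]) -> dict:
    """Over- and under-refusal on the safety set. Each record carries
    `is_harmful` (ground truth) and `refused` (observed behavior)."""
    harmful = [r for r in results if r["is_harmful"]]
    benign = [r for r in results if not r["is_harmful"]]
    return {
        "refusal_on_harmful": sum(r["refused"] for r in harmful) / len(harmful),
        "refusal_on_benign": sum(r["refused"] for r in benign) / len(benign),
    }

def latency_percentiles(latencies_ms: list[float]) -> dict:
    """p50 / p95 / p99 from per-request latencies in milliseconds."""
    qs = quantiles(latencies_ms, n=100)  # 99 cut points
    return {"p50": qs[49], "p95": qs[94], "p99": qs[98]}
```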
## LLM-as-Judge Protocol
Use GPT-4.1 as the rubric scorer, with this rubric:
```
You will compare two model responses to the same prompt for code review.
PROMPT: {prompt}
RESPONSE A: {response_a}
RESPONSE B: {response_b}
Score each on 1-5 for:
- Correctness: factually accurate and addresses the prompt
- Completeness: covers all parts of the prompt
- Format: follows any requested format
- Clarity: readable and well-organized
Then declare: A_better | B_better | tie | both_bad
Output JSON:
{"a_scores": {...}, "b_scores": {...}, "reasoning": "...", "verdict": "..."}
```
**Critical:** randomize the position of the new model (sometimes A, sometimes B) to avoid position bias. Run each pair twice with positions swapped and require a consistent verdict; otherwise treat the pair as a tie.
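A minimal sketch of this swap-and-check rule, assuming a `judge` callable that wraps the GPT-4.1 rubric call above and returns the JSON described in the rubric:

```python
def judge_pair(judge, prompt: str, new_resp: str, baseline_resp: str) -> str:
    """Run the judge twice with positions swapped; keep the verdict only if
    both runs agree, otherwise fall back to a tie (per the protocol above)."""
    # Run 1: new model in position A
    v1 = judge(prompt=prompt, response_a=new_resp, response_b=baseline_resp)["verdict"]
    # Run 2: positions swapped, so flip that verdict back before comparing
    v2_raw = judge(prompt=prompt, response_a=baseline_resp, response_b=new_resp)["verdict"]
    flip = {"A_better": "B_better", "B_better": "A_better", "tie": "tie", "both_bad": "both_bad"}
    v2 = flip[v2_raw]
    return v1 if v1 == v2 else "tie"
```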
## Statistical Rigor
- For pairwise win rates, report 95% CI via bootstrap (1000 resamples)
- Require non-overlapping CIs before claiming an improvement
- For mean scores, paired t-test (same prompts across models)
- Minimum detectable effect: if the ship decision hinges on a +2pt improvement, the eval sets must be large enough to detect that effect with adequate statistical power (e.g., 80%)
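A sketch of the bootstrap CI and the paired test, assuming numpy/scipy and per-prompt scores that are paired across models:

```python
import numpy as np
from scipy import stats

def bootstrap_win_rate_ci(verdicts, n_resamples=1000, alpha=0.05, seed=0):
    """Point estimate and 95% bootstrap CI for the pairwise win rate (ties = 0.5)."""
    rng = np.random.default_rng(seed)
    scores = np.array([1.0 if v == "new_wins" else 0.5 if v == "tie" else 0.0
                       for v in verdicts])
    samples = rng.choice(scores, size=(n_resamples, len(scores)), replace=True)
    means = samples.mean(axis=1)
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return scores.mean(), (lo, hi)

def paired_score_test(new_scores, baseline_scores):
    """Paired t-test on per-prompt rubric scores (same prompts, both models)."""
    t, p = stats.ttest_rel(new_scores, baseline_scores)
    return {
        "mean_delta": float(np.mean(np.array(new_scores) - np.array(baseline_scores))),
        "t": float(t),
        "p_value": float(p),
    }
```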
## Decision Criteria
Ship ft-v0.3.0 if ALL are true:
1. Task F1 ≥ ft-v1.1.0 + 1 point, with p < 0.05
2. Regression suite: zero regressions, or explicit approval from PM for known ones
3. Capability preservation: no >2pt drop on any general benchmark
4. Safety: refusal-on-harmful ≥ baseline, refusal-on-benign ≤ baseline
5. Format validation: ≥ 98%
6. Cost per request within budget: ≤ $0.0005
Consider shipping over GPT-4.1 if:
- Task quality ≥ 95% of frontier AND cost savings ≥ 5× AND latency ≤ frontier
- OR task quality ≥ frontier AND latency ≤ frontier AND cost ≤ frontier
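A compact gate that mirrors the six ship criteria above; the metric keys are placeholders for whatever `compare.py` emits, not a fixed schema:

```python
def should_ship(m: dict) -> bool:
    """m is the aggregated metrics dict for ft-v0.3.0 vs ft-v1.1.0."""
    return all([
        m["task_f1_delta"] >= 1.0 and m["task_f1_p_value"] < 0.05,        # criterion 1
        m["regression_failures"] == 0 or m["pm_approved_regressions"],     # criterion 2
        m["max_benchmark_drop_pts"] <= 2.0,                                # criterion 3
        m["refusal_on_harmful"] >= m["baseline_refusal_on_harmful"]
        and m["refusal_on_benign"] <= m["baseline_refusal_on_benign"],     # criterion 4
        m["schema_valid_rate"] >= 0.98,                                    # criterion 5
        m["cost_per_request"] <= 0.0005,                                   # criterion 6
    ])
```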
## Error Analysis
On the golden set, bucket failures by:
- Input length bucket
- Task subtype
- Difficulty tier
- Source (synthetic vs real)
- Specific error types (listed in task guidelines)
Produce a confusion matrix (for classification) or a qualitative taxonomy of 10 failure categories (for generation), with example counts and 3 representative failures per category.
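A sketch of the failure bucketing along the dimensions above, using plain Python collections; the field names are illustrative:

```python
from collections import Counter, defaultdict

def bucket_failures(failures: list[dict]) -> dict:
    """Count failures along each analysis dimension and keep up to
    three representative prompt IDs per error type."""
    counts = {
        "by_length_bucket": Counter(f["length_bucket"] for f in failures),
        "by_subtype": Counter(f["subtype"] for f in failures),
        "by_difficulty": Counter(f["difficulty"] for f in failures),
        "by_source": Counter(f["source"] for f in failures),
        "by_error_type": Counter(f["error_type"] for f in failures),
    }
    representatives = defaultdict(list)
    for f in failures:
        if len(representatives[f["error_type"]]) < 3:
            representatives[f["error_type"]].append(f["prompt_id"])
    return {"counts": counts, "representatives": dict(representatives)}
```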
## Traces & Observability
Pipe all eval runs into Humanloop:
- Traces tagged `eval_run_id`, `model`, `eval_set`, `prompt_id`
- Dashboards: win rate over time per model pair, regression timeline, cost/quality frontier
- Alerts: new regressions vs last green build
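A neutral sketch of a trace payload carrying the required tags; `log_trace` is a hypothetical stand-in, not the actual Humanloop SDK call, which should be substituted in:

```python
def build_trace(eval_run_id: str, model: str, eval_set: str, prompt_id: str,
                inputs: dict, output: str, scores: dict) -> dict:
    """Assemble one trace record with the required tags; the real logging
    call depends on the Humanloop SDK and is intentionally not shown."""
    return {
        "tags": {
            "eval_run_id": eval_run_id,
            "model": model,
            "eval_set": eval_set,
            "prompt_id": prompt_id,
        },
        "inputs": inputs,
        "output": output,
        "scores": scores,
    }

# log_trace(build_trace(...))  # hypothetical client call
```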
## Deliverables
1. `eval/` directory with all eval sets in JSONL
2. `run_eval.py` that takes a model endpoint + eval set and produces a results JSON (a sketch of the interface appears at the end of this prompt)
3. `compare.py` that produces a markdown report comparing N models
4. `report.md` — the go/no-go recommendation with numbers, CIs, and error taxonomy
5. Error-analysis notebook with the top 30 failures
6. Dashboard in Humanloop pinned to the model comparison
Structure as a professional report with: Executive Summary, Key Findings, Detailed Analysis, Recommendations, and Next Steps.
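A possible shape for the `run_eval.py` interface from deliverable 2; the flag names and results schema are assumptions, not the final contract:

```python
# run_eval.py -- interface sketch, not the final implementation
import argparse
import json

def main() -> None:
    parser = argparse.ArgumentParser(description="Run one eval set against one model endpoint.")
    parser.add_argument("--endpoint", required=True, help="Model endpoint URL or alias")
    parser.add_argument("--eval-set", required=True, help="Path to a JSONL eval set")
    parser.add_argument("--out", default="results.json", help="Where to write the results JSON")
    args = parser.parse_args()

    results = {"endpoint": args.endpoint, "eval_set": args.eval_set, "per_example": []}
    # ... call the endpoint for every example, score, and aggregate ...
    with open(args.out, "w", encoding="utf-8") as f:
        json.dump(results, f, indent=2)

if __name__ == "__main__":
    main()
```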