Claude Prompt for Prompt Optimization & Evals
Design an eval harness for bug root-cause analysis using human pairwise comparison that tracks format-compliance rate across prompt versions on Gemini 2.5 Pro.
More prompts for Prompt Optimization & Evals.
Run a rigorous A/B test on prompt variants for API design decisions, measuring cost-per-correct-answer on Claude Opus 4.5 using rubric scoring.
Design an eval harness for bug root-cause analysis using BLEU/ROUGE that tracks token cost across prompt versions on Llama 3.3 70B.
Design an eval harness for bug root-cause analysis using DeepEval metrics that tracks refusal rate across prompt versions on GPT-4o.
Token-cost and latency reduction playbook for an academic grading prompt running on Claude Opus 4.5, judged by human pairwise comparison.
Run a rigorous A/B test on prompt variants for legal brief summarization, measuring hallucination rate on o1-mini using promptfoo assertions.
Run a rigorous A/B test on prompt variants for API design decisions, measuring tool-call precision on GPT-4o-mini using TruLens feedback functions.
You are the owner of the eval harness for a team shipping an LLM feature that does bug root-cause analysis on Gemini 2.5 Pro. Your harness needs to be strict enough that people trust it, cheap enough that they run it, and flexible enough that they extend it.
## What you are building
A reusable eval harness with these responsibilities:
1. Load a versioned dataset of bug root-cause analysis examples sourced from long-tail real traffic samples.
2. Run any registered prompt variant against Gemini 2.5 Pro with pinned decoding params.
3. Score each output using human pairwise comparison against a per-example ground truth or rubric.
4. Log metrics, especially format-compliance rate, and guardrail metrics (refusal rate, format compliance, safety).
5. Produce a diff report between two variants.
6. Be runnable both in CI (on every prompt PR) and ad-hoc locally.
## Deliverable
Produce a complete design doc with the following sections:
### Architecture
A sketch (text is fine) of:
```
Dataset v{N} → Runner → Model call → Output → Judge → Metrics store → Report
                 ↑                                                       ↓
          Prompt registry                                      CI gate (pass/fail)
```
### Dataset spec
- Schema: { id, input, expected, stratum, tags, source_url, created_at, retired_at } (one example record is sketched after this list)
- Sourcing plan from long-tail real traffic samples
- Refresh cadence (how often to add new examples from production)
- Retirement policy (when examples become stale)
- Sampling strategy for CI (small fast set) vs. full (slow, nightly)
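A minimal sketch of one record under the schema above, written as the Python dict that a `load_dataset` helper would yield for a single JSONL line; every field value here is an invented placeholder, not real traffic.

```python
# Illustrative record only; ids, URLs, and text are placeholders.
example_record = {
    "id": "bug-0142",
    "input": "Stack trace, failing request payload, and the last three commits touching payment_service.py",
    "expected": "Root cause: retry handler re-enters the charge path without an idempotency key",
    "stratum": "concurrency",
    "tags": ["backend", "long-tail"],
    "source_url": "https://logs.internal.example/requests/9f2c",  # placeholder
    "created_at": "2025-11-04",
    "retired_at": None,  # stays null until the example is retired as stale
}
```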
### Runner spec
- How to pin Gemini 2.5 Pro version (include exact version string)
- Decoding params are stored alongside the prompt, not hard-coded
- Retry + timeout behavior
- Caching: runs are deterministic by (prompt_hash, example_id, model_version, decoding_hash)
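A sketch of the cache key implied by the determinism rule above; the helper name and truncated hashes are illustrative, and the version string is a placeholder for whatever pinned Gemini 2.5 Pro identifier the team settles on.

```python
import hashlib
import json

def cache_key(prompt_text: str, example_id: str, model_version: str, decoding: dict) -> str:
    """Deterministic key: the same (prompt, example, model, decoding) always maps to one cached run."""
    prompt_hash = hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()[:12]
    # sort_keys so {"temperature": 0, "top_p": 1} hashes identically regardless of insertion order
    decoding_hash = hashlib.sha256(json.dumps(decoding, sort_keys=True).encode("utf-8")).hexdigest()[:12]
    return f"{prompt_hash}:{example_id}:{model_version}:{decoding_hash}"

# e.g. cache_key(prompt_text, "bug-0142", "gemini-2.5-pro", {"temperature": 0.0, "top_p": 1.0})
```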
### Judging spec — using human pairwise comparison
- Define the scoring procedure precisely.
- If any part of the pairwise comparison is delegated to an LLM judge rather than humans, pin the judge model (different from the model under test) and publish the judge prompt; treat it as a first-class artifact.
- Calibrate the pairwise judgments against a small, independently human-labeled set; report inter-judge agreement (κ) before trusting them.
- Flakiness mitigation: average three judge runs or take a majority vote if variance is high (a sketch follows this list).
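A sketch of that mitigation for the LLM-judge case: repeat the comparison and take a majority vote. `call_judge` is a stand-in for the pinned judge model and is assumed to return "A", "B", or "tie" for a single comparison.

```python
from collections import Counter

def pairwise_verdict(call_judge, example: dict, output_a: str, output_b: str, runs: int = 3) -> str:
    """Majority vote over repeated judge runs; returns "A", "B", or "tie"."""
    votes = [call_judge(example, output_a, output_b) for _ in range(runs)]
    winner, count = Counter(votes).most_common(1)[0]
    # No strict majority (e.g. one vote each for A, B, tie): record a tie and flag for human review.
    return winner if count > runs // 2 else "tie"
```

For the calibration step, compare a sample of these verdicts against the human-labeled set and compute κ on that sample before the judge is allowed to gate anything.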
### Metrics
- Primary: format-compliance rate
- Guardrails: refusal_rate, format_compliance, safety_violations, p95_latency_ms, mean_tokens, $/example
- Per-stratum slices
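A sketch of the per-stratum slicing, assuming each scored row records the example's stratum plus a boolean for the primary metric; the row shape and key names are assumptions, not part of the spec above.

```python
from collections import defaultdict

def format_compliance_by_stratum(rows: list[dict]) -> dict[str, float]:
    """rows look like {"stratum": "concurrency", "format_ok": True, ...}; returns the rate per stratum."""
    hits: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for row in rows:
        totals[row["stratum"]] += 1
        hits[row["stratum"]] += int(row["format_ok"])
    return {stratum: hits[stratum] / totals[stratum] for stratum in totals}
```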
### Reporting
Example report table (Markdown):
| variant | format-compliance rate | refusal% | format% | p95_ms | $/ex |
| --- | --- | --- | --- | --- | --- |
Plus a "Biggest disagreements" section for qualitative review.
### CI gating
- PRs that modify a prompt file must include an eval run.
- Block the PR if format-compliance rate drops >2% OR any guardrail crosses its threshold (a gate sketch follows this list).
- Override requires explicit approver and a written justification committed to the PR.
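A sketch of the gate rule; it reads the ">2%" drop as two percentage points absolute (an assumption the doc should pin down), and the metric and threshold names are placeholders.

```python
def gate(baseline: dict, candidate: dict, guardrail_limits: dict[str, float]) -> tuple[bool, list[str]]:
    """Return (passed, reasons); an empty reasons list means the PR may merge."""
    reasons = []
    # Interpreting "drops >2%" as more than 2 percentage points absolute vs. the baseline run.
    if baseline["format_compliance_rate"] - candidate["format_compliance_rate"] > 0.02:
        reasons.append("format-compliance rate dropped by more than 2 points vs. baseline")
    for metric, limit in guardrail_limits.items():
        if candidate.get(metric, 0.0) > limit:
            reasons.append(f"{metric}={candidate[metric]:.3f} exceeds limit {limit}")
    return (not reasons, reasons)
```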
### Code sketch
Provide a ~40-line Python skeleton using plain stdlib plus a single thin model client (the `openai` client pointed at Gemini's OpenAI-compatible endpoint works). No fancy frameworks. Functions: `load_dataset`, `run_variant`, `judge`, `score`, `report`, `gate`.
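One possible shape for that skeleton, offered as a starting point rather than the answer the doc should contain. It assumes Gemini 2.5 Pro is reached through its OpenAI-compatible endpoint via the `openai` client (base URL and key supplied by the environment), that the dataset is JSONL matching the schema above, and that decoding params travel with the prompt; the judge, scorer, reporter, and gate are left as stubs.

```python
import json
from pathlib import Path
from openai import OpenAI  # assumes the Gemini OpenAI-compatible endpoint is configured via env vars

client = OpenAI()  # base_url and api_key come from the environment, not from code

def load_dataset(path: str) -> list[dict]:
    """One JSON object per line; skip retired examples."""
    rows = [json.loads(line) for line in Path(path).read_text().splitlines() if line.strip()]
    return [r for r in rows if r.get("retired_at") is None]

def run_variant(prompt: dict, examples: list[dict], model: str = "gemini-2.5-pro") -> list[dict]:
    """prompt = {"text": ..., "decoding": {...}} pulled from the prompt registry; pin the exact model version in real use."""
    outputs = []
    for ex in examples:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "system", "content": prompt["text"]},
                      {"role": "user", "content": ex["input"]}],
            **prompt["decoding"],  # decoding params live with the prompt, never hard-coded here
        )
        outputs.append({"example_id": ex["id"], "output": resp.choices[0].message.content})
    return outputs

def judge(example: dict, output_a: str, output_b: str) -> str:
    """Pairwise comparison ("A", "B", or "tie"); human labels or a pinned, versioned LLM judge."""
    raise NotImplementedError

def score(examples: list[dict], outputs_a: list[dict], outputs_b: list[dict]) -> dict:
    """Aggregate verdicts and guardrail metrics into per-variant dicts."""
    raise NotImplementedError

def report(metrics_by_variant: dict) -> str:
    """Render the Markdown comparison table plus the biggest disagreements."""
    raise NotImplementedError

def gate(baseline: dict, candidate: dict) -> bool:
    """True if the candidate passes the regression and guardrail thresholds."""
    raise NotImplementedError
```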
## Constraints
- Don't recommend a paid SaaS eval platform unless the team already uses it.
- Don't let judge prompts live un-versioned.
- Keep the first working version buildable in one afternoon.