Claude Prompt for Evals & Observability
Golden-set regression harness for RAG over internal docs with Claude Opus 4.5 pairwise scoring, CI integration, and budget-aware runs.
You are responsible for preventing regressions on an LLM app serving RAG over internal docs. Build a regression test suite that runs on every PR and blocks merges on quality drops.
## What Regression Means Here
Not "exact-match snapshot test" — LLM output is non-deterministic and stochastic. Instead:
- **Aggregate metric on golden set:** does avg judge_score drop?
- **Per-item assertion:** did any high-value golden case specifically fail?
- **Schema/format compliance:** did output format break?
- **Latency/cost:** did these inflate beyond budget?
## Golden Set Structure
```jsonl
{
  "id": "gold_001",
  "category": "billing question",
  "difficulty": "easy | medium | hard",
  "prompt": "...",
  "context": "... (optional retrieval context)",
  "reference_answer": "...",
  "assertions": [
    {"type": "contains", "value": "API rate limit"},
    {"type": "format", "value": "json"},
    {"type": "max_length", "value": 500},
    {"type": "refuses", "value": false}
  ],
  "min_judge_score": 4,
  "tags": ["..."]
}
```
- **id:** stable, never reused after retirement
- **category:** stratify for balanced eval
- **difficulty:** harder cases should have higher weight or separate pass criteria
- **assertions:** lightweight deterministic checks (regex, substring, format, length)
- **min_judge_score:** per-item floor; an item scoring below it is a hard fail.
- **tags:** for filtering (`smoke`, `critical`, `known-flaky`)
Target size: ~100 cases initially, growing to ~500 over time.
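For concreteness, here is a minimal TypeScript shape for these records. The field names mirror the JSONL above; the `Assertion` union is one plausible encoding of the four assertion types, not a prescribed API:

```typescript
// Golden-case record; fields mirror the JSONL schema above.
type Assertion =
  | { type: "contains"; value: string }                      // substring must appear
  | { type: "format"; value: "json" | "markdown" | "text" }  // output format check
  | { type: "max_length"; value: number }                    // character budget
  | { type: "refuses"; value: boolean };                     // expected refusal behavior

interface GoldenCase {
  id: string;                 // stable; never reused after retirement
  category: string;           // stratification key for balanced eval
  difficulty: "easy" | "medium" | "hard";
  prompt: string;
  context?: string;           // optional retrieval context
  reference_answer: string;
  assertions: Assertion[];    // cheap deterministic checks
  min_judge_score: number;    // per-item floor; below = hard fail
  tags: string[];             // "smoke", "critical", "known-flaky", ...
}
```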
## Curation
### Sources
- Top user complaints ("model said X, should have said Y") — highest priority
- Known-previously-broken cases (every bug fix adds its golden case)
- Safety / refusal scenarios
- Format compliance
- Long-input edge cases
- Multi-turn conversations (if applicable)
- Low-resource languages (if multilingual)
- Prompt injection red-team cases
### Refresh
- Monthly: add 20-50 new cases from production issues
- Quarterly: audit for staleness (remove cases whose expected behavior changed)
- Never: retroactively edit an existing case's expected output to match the current model (this hides regressions)
### Versioning
Golden set is checked into the repo under `evals/golden/v{N}/`. Increment version on schema changes. Keep old versions runnable for historical comparison.
## Runner
### CLI
```bash
pnpm eval run \
  --set evals/golden/v3/ \
  --model-endpoint production \
  --judge "Claude Opus 4.5 pairwise" \
  --parallel 16 \
  --out "results/run_$(date +%s).json"
```
### Budget Controls
A full golden run with Claude Opus 4.5 pairwise judging costs ~$150 and takes ~10 min. Speed up CI with tiers:
- **Smoke tier** (~20 cases, ~$0.80, ~2 min): every PR
- **Full tier** (all cases): nightly on main + pre-release
- **Extended tier** (adds red-team and low-resource-language cases): weekly
Select tier via tag filter: `--tags smoke,critical`.
### Parallelism
- Run up to 32 concurrent LLM requests (see the pool/retry sketch after this list)
- Rate-limit to provider quotas (10k rpm, etc.)
- Retry transient errors (5xx, rate-limit) with exponential backoff
- Fail fast on auth errors
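A sketch of both behaviors. The backoff constants and the `status` field on errors are illustrative, not any specific SDK's API:

```typescript
// Bounded concurrency: at most `limit` tasks in flight at once.
async function runPool<T, R>(
  items: T[],
  limit: number,
  worker: (item: T) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  const lanes = Array.from({ length: Math.min(limit, items.length) }, async () => {
    // Safe without locks: JS is single-threaded between awaits.
    while (next < items.length) {
      const i = next++;
      results[i] = await worker(items[i]);
    }
  });
  await Promise.all(lanes);
  return results;
}

// Exponential backoff with jitter on transient errors (429/5xx);
// auth errors (401/403) fail fast instead of retrying.
async function withRetry<T>(fn: () => Promise<T>, maxAttempts = 5): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (err: any) {
      const status: number = err?.status ?? 0;
      if (status === 401 || status === 403) throw err; // fail fast on auth
      const transient = status === 429 || status >= 500;
      if (!transient || attempt >= maxAttempts) throw err;
      const backoffMs = Math.min(30_000, 250 * 2 ** attempt) * (0.5 + Math.random());
      await new Promise((resolve) => setTimeout(resolve, backoffMs));
    }
  }
}
```

In the runner, each model call would be wrapped as `withRetry(() => callModel(item))` inside the pool worker (`callModel` is a stand-in for your app's endpoint), with `limit` set from `--parallel`.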
## Scoring
Per item:
1. Run all deterministic assertions → any fail = item fail
2. If assertions pass, call judge with rubric → score per dimension
3. Check `min_judge_score` → below = item fail
4. Aggregate: item_pass = all(assertions pass) AND judge_overall >= min_judge_score
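A sketch of this per-item flow, reusing the `GoldenCase` and `Assertion` types from above. `looksLikeRefusal` and `callJudge` are assumed helpers, and the judge is taken to return per-dimension scores plus an overall:

```typescript
interface JudgeResult {
  overall: number;
  dimensions: Record<string, number>; // score per rubric dimension
}

// Assumed helpers, not real APIs: a refusal classifier and the judge call.
declare function looksLikeRefusal(output: string): boolean;
declare function callJudge(item: GoldenCase, output: string): Promise<JudgeResult>;

function runAssertion(a: Assertion, output: string): boolean {
  switch (a.type) {
    case "contains":   return output.includes(a.value);
    case "max_length": return output.length <= a.value;
    case "refuses":    return looksLikeRefusal(output) === a.value;
    case "format":
      if (a.value !== "json") return true; // non-JSON formats: assume a dedicated checker
      try { JSON.parse(output); return true; } catch { return false; }
  }
}

async function scoreItem(
  item: GoldenCase,
  output: string
): Promise<{ pass: boolean; judge?: JudgeResult }> {
  // 1. Deterministic assertions: any failure fails the item (and skips the judge, saving cost).
  if (!item.assertions.every((a) => runAssertion(a, output))) return { pass: false };
  // 2-4. Judge with rubric, then apply the per-item floor.
  const judge = await callJudge(item, output);
  return { pass: judge.overall >= item.min_judge_score, judge };
}
```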
Aggregate metrics:
- **Pass rate** (count(item_pass) / total)
- **Mean judge score per dimension**
- **Mean judge score weighted by difficulty**
- **Per-category pass rate**
- **Latency p95**
- **Cost total**
## Pass/Fail Criteria for CI
Build FAILS if ANY of the following (a gating sketch follows this list):
- Overall pass rate drops > 1 percentage point vs last green main
- Any `tag=critical` case fails
- Any `tag=safety` case fails (zero tolerance)
- Judge score on any dimension drops > 0.3
- Latency p95 up > 30%
- Cost up > 30%
- Schema compliance < 98%
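A sketch of the merge gate, comparing the current run against the last green run on main. The `Metrics` field names are assumptions about the runner's output; rates are 0-1 fractions:

```typescript
interface Metrics {
  passRate: number;                       // 0..1
  dimensionMeans: Record<string, number>; // mean judge score per dimension
  latencyP95Ms: number;
  costUsd: number;
  schemaCompliance: number;               // 0..1
  failedItems: { id: string; tags: string[] }[];
}

// Returns the violated criteria; an empty array means the build passes.
function gate(current: Metrics, baseline: Metrics): string[] {
  const failures: string[] = [];
  if (baseline.passRate - current.passRate > 0.01)
    failures.push("overall pass rate dropped > 1 pt vs last green main");
  for (const item of current.failedItems) {
    if (item.tags.includes("critical")) failures.push(`critical case failed: ${item.id}`);
    if (item.tags.includes("safety"))   failures.push(`safety case failed: ${item.id}`);
  }
  for (const [dim, mean] of Object.entries(current.dimensionMeans)) {
    if ((baseline.dimensionMeans[dim] ?? mean) - mean > 0.3)
      failures.push(`judge score on "${dim}" dropped > 0.3`);
  }
  if (current.latencyP95Ms > baseline.latencyP95Ms * 1.3) failures.push("latency p95 up > 30%");
  if (current.costUsd > baseline.costUsd * 1.3)           failures.push("cost up > 30%");
  if (current.schemaCompliance < 0.98)                    failures.push("schema compliance < 98%");
  return failures;
}
```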
On failure, post GitHub PR comment with:
- Summary table: metric, before, after, delta
- List of newly-failing items with IDs + short description
- Link to full results in Galileo
## Flake Handling
Some cases are inherently noisy (open-ended generation). Mark with tag `flaky` and:
- Run the case 3 times and take the majority verdict (sketched after this list)
- Track a separate aggregate metric for flaky cases
- Keep an investigation backlog: aim for zero flaky cases by migrating them to cleaner assertions over time
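A minimal sketch of the majority rule, reusing `scoreItem` from the scoring section; `generate` stands in for whatever produces the app's output for a case:

```typescript
// Flaky cases: generate and score three times; pass on 2-of-3 majority.
async function scoreFlaky(
  item: GoldenCase,
  generate: (item: GoldenCase) => Promise<string>
): Promise<boolean> {
  let passes = 0;
  for (let i = 0; i < 3; i++) {
    const output = await generate(item);
    if ((await scoreItem(item, output)).pass) passes++;
  }
  return passes >= 2;
}
```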
## Results Storage
- Every run: results JSON in Galileo + S3 archive
- Queryable by: `commit_sha`, `branch`, `model_version`, `golden_version` (see the record sketch after this list)
- Dashboard: pass rate over time, per-category trends, regression timeline
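An illustrative shape for the archived run record; the field names are assumptions, with the query keys above as top-level fields and `Metrics` reused from the gate sketch:

```typescript
// One archived eval run; top-level fields are the query dimensions.
interface EvalRun {
  run_id: string;
  commit_sha: string;
  branch: string;
  model_version: string;
  golden_version: string; // e.g. "v3"
  started_at: string;     // ISO 8601 timestamp
  metrics: Metrics;       // aggregates from the scoring section
  results_uri: string;    // S3 archive location for the full JSON
}
```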
## Shadow / Canary
For models going to production:
1. Nightly run of new model against same golden set as prod model
2. Compare the two models' outputs pairwise, with Claude Opus 4.5 as judge
3. Auto-promote criteria: the new model wins or ties on ≥ 95% of cases AND loses no critical-tagged case (decision sketched below)
4. Manual approval for borderline outcomes
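A sketch of the auto-promote decision given per-case pairwise verdicts for the candidate model; the verdict shape is an assumption:

```typescript
// "win" | "tie" | "loss" is the candidate model's outcome on each golden case.
interface PairwiseVerdict {
  caseId: string;
  tags: string[];
  outcome: "win" | "tie" | "loss";
}

function shouldAutoPromote(verdicts: PairwiseVerdict[]): boolean {
  if (verdicts.length === 0) return false;
  const winOrTieRate =
    verdicts.filter((v) => v.outcome !== "loss").length / verdicts.length;
  const criticalLoss = verdicts.some(
    (v) => v.outcome === "loss" && v.tags.includes("critical")
  );
  return winOrTieRate >= 0.95 && !criticalLoss; // otherwise: manual review
}
```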
## Anti-Patterns (do not do these)
- **Updating expected outputs to match current model** when they fail → you're hiding regressions
- **Skipping safety cases because they're hard** → safety is non-negotiable
- **Running evals only at release time** → bugs compound undetected
- **Single-dimension judge score** → you'll miss axis-specific regressions
## Deliverables
1. Golden set directory with versioned JSONL
2. Runner CLI with budget tiers
3. Scorer with deterministic + judge layers
4. CI integration (GitHub Actions / Buildkite)
5. PR comment template
6. Galileo dashboard
7. Curation playbook for adding new cases
Structure the output as a professional report with: Executive Summary, Key Findings, Detailed Analysis, Recommendations, and Next Steps.

Replace the bracketed placeholders with your own context before running the prompt:
- `["..."]`: fill in your specific tags.