Claude Prompt for Evals & Observability
Golden-set regression harness for RAG over internal docs with Claude Opus 4.5 pairwise scoring, CI integration, and budget-aware runs.
You are responsible for preventing regressions on an LLM app serving RAG over internal docs. Build a regression test suite that runs on every PR and blocks merges on quality drops.
## What Regression Means Here
Not "exact-match snapshot test" — LLM output is non-deterministic and stochastic. Instead:
- **Aggregate metric on golden set:** does avg judge_score drop?
- **Per-item assertion:** did any high-value golden case specifically fail?
- **Schema/format compliance:** did output format break?
- **Latency/cost:** did these inflate beyond budget?
## Golden Set Structure
```jsonl
{
  "id": "gold_001",
  "category": "billing question",
  "difficulty": "easy | medium | hard",
  "prompt": "...",
  "context": "... (optional retrieval context)",
  "reference_answer": "...",
  "assertions": [
    {"type": "contains", "value": "API rate limit"},
    {"type": "format", "value": "json"},
    {"type": "max_length", "value": 500},
    {"type": "refuses", "value": false}
  ],
  "min_judge_score": 4,
  "tags": ["..."]
}
```
- **id:** stable, never reused after retirement
- **category:** stratify for balanced eval
- **difficulty:** harder cases should have higher weight or separate pass criteria
- **assertions:** lightweight deterministic checks (regex, substring, format, length)
- **min_judge_score:** per-item floor; an item scoring below it is a hard fail.
- **tags:** for filtering (`smoke`, `critical`, `known-flaky`)
Target size: ~100 cases initially, growing to ~500 over time.
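For concreteness, here is a minimal TypeScript shape for these records. The field names mirror the JSONL above; the `Assertion` union is one plausible encoding of the four assertion types, not a prescribed API:

```typescript
// Golden-case record; fields mirror the JSONL schema above.
type Assertion =
  | { type: "contains"; value: string }                      // substring must appear
  | { type: "format"; value: "json" | "markdown" | "text" }  // output format check
  | { type: "max_length"; value: number }                    // character budget
  | { type: "refuses"; value: boolean };                     // expected refusal behavior

interface GoldenCase {
  id: string;                 // stable; never reused after retirement
  category: string;           // stratification key for balanced eval
  difficulty: "easy" | "medium" | "hard";
  prompt: string;
  context?: string;           // optional retrieval context
  reference_answer: string;
  assertions: Assertion[];    // cheap deterministic checks
  min_judge_score: number;    // per-item floor; below = hard fail
  tags: string[];             // "smoke", "critical", "known-flaky", ...
}
```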
## Curation
### Sources
- Top user complaints ("model said X, should have said Y") — highest priority
- Known-previously-broken cases (every bug fix adds its golden case)
- Safety / refusal scenarios
- Format compliance
- Long-input edge cases
- Multi-turn conversations (if applicable)
- Low-resource languages (if multilingual)
- Prompt injection red-team cases
### Refresh
- Monthly: add 20-50 new cases from production issues
- Quarterly: audit for staleness (remove cases whose expected behavior changed)
- Never: retroactively edit an existing case's expected output to match the current model (this hides regressions)
### Versioning
Golden set is checked into the repo under `evals/golden/v{N}/`. Increment version on schema changes. Keep old versions runnable for historical comparison.
## Runner
### CLI
```bash
pnpm eval run \
  --set evals/golden/v3/ \
  --model-endpoint production \
  --judge "Claude Opus 4.5 pairwise" \
  --parallel 16 \
  --out "results/run_$(date +%s).json"
```
### Budget Controls
A full golden run with Claude Opus 4.5 pairwise judging costs ~$150 and takes ~10 min. Speed up CI with tiers:
- **Smoke tier** (~20 cases, ~$0.80, ~2 min): every PR
- **Full tier** (all cases): nightly on main + pre-release
- **Extended tier** (adds red-team and low-resource-language cases): weekly
Select tier via tag filter: `--tags smoke,critical`.
### Parallelism
- Run up to 32 concurrent LLM requests (see the pool/retry sketch after this list)
- Rate-limit to provider quotas (10k rpm, etc.)
- Retry transient errors (5xx, rate-limit) with exponential backoff
- Fail fast on auth errors
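A sketch of both behaviors. The backoff constants and the `status` field on errors are illustrative, not any specific SDK's API:

```typescript
// Bounded concurrency: at most `limit` tasks in flight at once.
async function runPool<T, R>(
  items: T[],
  limit: number,
  worker: (item: T) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  const lanes = Array.from({ length: Math.min(limit, items.length) }, async () => {
    // Safe without locks: JS is single-threaded between awaits.
    while (next < items.length) {
      const i = next++;
      results[i] = await worker(items[i]);
    }
  });
  await Promise.all(lanes);
  return results;
}

// Exponential backoff with jitter on transient errors (429/5xx);
// auth errors (401/403) fail fast instead of retrying.
async function withRetry<T>(fn: () => Promise<T>, maxAttempts = 5): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (err: any) {
      const status: number = err?.status ?? 0;
      if (status === 401 || status === 403) throw err; // fail fast on auth
      const transient = status === 429 || status >= 500;
      if (!transient || attempt >= maxAttempts) throw err;
      const backoffMs = Math.min(30_000, 250 * 2 ** attempt) * (0.5 + Math.random());
      await new Promise((resolve) => setTimeout(resolve, backoffMs));
    }
  }
}
```

In the runner, each model call would be wrapped as `withRetry(() => callModel(item))` inside the pool worker (`callModel` is a stand-in for your app's endpoint), with `limit` set from `--parallel`.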
## Scoring
Per item:
1. Run all deterministic assertions → any fail = item fail
2. If assertions pass, call judge with rubric → score per dimension
3. Check `min_judge_score` → below = item fail
4. Aggregate: item_pass = all(assertions pass) AND judge_overall >= min_judge_score
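A sketch of this per-item flow, reusing the `GoldenCase` and `Assertion` types from above. `looksLikeRefusal` and `callJudge` are assumed helpers, and the judge is taken to return per-dimension scores plus an overall:

```typescript
interface JudgeResult {
  overall: number;
  dimensions: Record<string, number>; // score per rubric dimension
}

// Assumed helpers, not real APIs: a refusal classifier and the judge call.
declare function looksLikeRefusal(output: string): boolean;
declare function callJudge(item: GoldenCase, output: string): Promise<JudgeResult>;

function runAssertion(a: Assertion, output: string): boolean {
  switch (a.type) {
    case "contains":   return output.includes(a.value);
    case "max_length": return output.length <= a.value;
    case "refuses":    return looksLikeRefusal(output) === a.value;
    case "format":
      if (a.value !== "json") return true; // non-JSON formats: assume a dedicated checker
      try { JSON.parse(output); return true; } catch { return false; }
  }
}

async function scoreItem(
  item: GoldenCase,
  output: string
): Promise<{ pass: boolean; judge?: JudgeResult }> {
  // 1. Deterministic assertions: any failure fails the item (and skips the judge, saving cost).
  if (!item.assertions.every((a) => runAssertion(a, output))) return { pass: false };
  // 2-4. Judge with rubric, then apply the per-item floor.
  const judge = await callJudge(item, output);
  return { pass: judge.overall >= item.min_judge_score, judge };
}
```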
Aggregate metrics:
- **Pass rate** (count(item_pass) / total)
- **Mean judge score per dimension**
- **Mean judge score weighted by difficulty**
- **Per-category pass rate**
- **Latency p95**
- **Cost total**
## Pass/Fail Criteria for CI
Build FAILS if ANY of the following (a gating sketch follows this list):
- Overall pass rate drops > 1 percentage point vs last green main
- Any `tag=critical` case fails
- Any `tag=safety` case fails (zero tolerance)
- Judge score on any dimension drops > 0.3
- Latency p95 up > 30%
- Cost up > 30%
- Schema compliance < 98%
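A sketch of the merge gate, comparing the current run against the last green run on main. The `Metrics` field names are assumptions about the runner's output; rates are 0-1 fractions:

```typescript
interface Metrics {
  passRate: number;                       // 0..1
  dimensionMeans: Record<string, number>; // mean judge score per dimension
  latencyP95Ms: number;
  costUsd: number;
  schemaCompliance: number;               // 0..1
  failedItems: { id: string; tags: string[] }[];
}

// Returns the violated criteria; an empty array means the build passes.
function gate(current: Metrics, baseline: Metrics): string[] {
  const failures: string[] = [];
  if (baseline.passRate - current.passRate > 0.01)
    failures.push("overall pass rate dropped > 1 pt vs last green main");
  for (const item of current.failedItems) {
    if (item.tags.includes("critical")) failures.push(`critical case failed: ${item.id}`);
    if (item.tags.includes("safety"))   failures.push(`safety case failed: ${item.id}`);
  }
  for (const [dim, mean] of Object.entries(current.dimensionMeans)) {
    if ((baseline.dimensionMeans[dim] ?? mean) - mean > 0.3)
      failures.push(`judge score on "${dim}" dropped > 0.3`);
  }
  if (current.latencyP95Ms > baseline.latencyP95Ms * 1.3) failures.push("latency p95 up > 30%");
  if (current.costUsd > baseline.costUsd * 1.3)           failures.push("cost up > 30%");
  if (current.schemaCompliance < 0.98)                    failures.push("schema compliance < 98%");
  return failures;
}
```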
On failure, post GitHub PR comment with:
- Summary table: metric, before, after, delta
- List of newly-failing items with IDs + short description
- Link to full results in Galileo
## Flake Handling
Some cases are inherently noisy (open-ended generation). Mark with tag `flaky` and:
- Run the case 3 times and take the majority verdict (sketched after this list)
- Track a separate aggregate metric for flaky cases
- Keep an investigation backlog: aim for zero flaky cases by migrating them to cleaner assertions over time
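A minimal sketch of the majority rule, reusing `scoreItem` from the scoring section; `generate` stands in for whatever produces the app's output for a case:

```typescript
// Flaky cases: generate and score three times; pass on 2-of-3 majority.
async function scoreFlaky(
  item: GoldenCase,
  generate: (item: GoldenCase) => Promise<string>
): Promise<boolean> {
  let passes = 0;
  for (let i = 0; i < 3; i++) {
    const output = await generate(item);
    if ((await scoreItem(item, output)).pass) passes++;
  }
  return passes >= 2;
}
```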
## Results Storage
- Every run: results JSON in Galileo + S3 archive
- Queryable by: `commit_sha`, `branch`, `model_version`, `golden_version` (see the record sketch after this list)
- Dashboard: pass rate over time, per-category trends, regression timeline
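An illustrative shape for the archived run record; the field names are assumptions, with the query keys above as top-level fields and `Metrics` reused from the gate sketch:

```typescript
// One archived eval run; top-level fields are the query dimensions.
interface EvalRun {
  run_id: string;
  commit_sha: string;
  branch: string;
  model_version: string;
  golden_version: string; // e.g. "v3"
  started_at: string;     // ISO 8601 timestamp
  metrics: Metrics;       // aggregates from the scoring section
  results_uri: string;    // S3 archive location for the full JSON
}
```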
## Shadow / Canary
For models going to production:
1. Nightly run of new model against same golden set as prod model
2. Compare the two models' outputs pairwise, with Claude Opus 4.5 as judge
3. Auto-promote criteria: the new model wins or ties on ≥ 95% of cases AND loses no critical-tagged case (decision sketched below)
4. Manual approval for borderline outcomes
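A sketch of the auto-promote decision given per-case pairwise verdicts for the candidate model; the verdict shape is an assumption:

```typescript
// "win" | "tie" | "loss" is the candidate model's outcome on each golden case.
interface PairwiseVerdict {
  caseId: string;
  tags: string[];
  outcome: "win" | "tie" | "loss";
}

function shouldAutoPromote(verdicts: PairwiseVerdict[]): boolean {
  if (verdicts.length === 0) return false;
  const winOrTieRate =
    verdicts.filter((v) => v.outcome !== "loss").length / verdicts.length;
  const criticalLoss = verdicts.some(
    (v) => v.outcome === "loss" && v.tags.includes("critical")
  );
  return winOrTieRate >= 0.95 && !criticalLoss; // otherwise: manual review
}
```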
## Anti-Patterns (do not do these)
- **Updating expected outputs to match current model** when they fail → you're hiding regressions
- **Skipping safety cases because they're hard** → safety is non-negotiable
- **Running evals only at release time** → bugs compound undetected
- **Single-dimension judge score** → you'll miss axis-specific regressions
## Deliverables
1. Golden set directory with versioned JSONL
2. Runner CLI with budget tiers
3. Scorer with deterministic + judge layers
4. CI integration (GitHub Actions / Buildkite)
5. PR comment template
6. Galileo dashboard
7. Curation playbook for adding new cases
Structure the output as a professional report with: Executive Summary, Key Findings, Detailed Analysis, Recommendations, and Next Steps.

Replace the bracketed placeholders with your own context before running the prompt:
- `["..."]`: fill in your specific tags.