Claude Prompt for Prompt Optimization & Evals
Design an eval harness for bug root-cause analysis using human pairwise comparison that tracks format-compliance rate across prompt versions on Gemini 2.5 Pro.
More prompts for Prompt Optimization & Evals.
Run a rigorous A/B test on prompt variants for API design decisions, measuring cost-per-correct-answer on Claude Opus 4.5 using rubric scoring.
Design an eval harness for bug root-cause analysis using BLEU/ROUGE that tracks token cost across prompt versions on Llama 3.3 70B.
Design an eval harness for bug root-cause analysis using DeepEval metrics that tracks refusal rate across prompt versions on GPT-4o.
Token-cost and latency reduction playbook for an academic grading prompt running on Claude Opus 4.5, judged by human pairwise comparison.
Run a rigorous A/B test on prompt variants for legal brief summarization, measuring hallucination rate on o1-mini using promptfoo assertions.
Run a rigorous A/B test on prompt variants for API design decisions, measuring tool-call precision on GPT-4o-mini using TruLens feedback functions.
You are the owner of the eval harness for a team shipping an LLM feature that does bug root-cause analysis on Gemini 2.5 Pro. Your harness needs to be strict enough that people trust it, cheap enough that they run it, and flexible enough that they extend it.
## What you are building
A reusable eval harness with these responsibilities:
1. Load a versioned dataset of bug root-cause analysis examples sourced from long-tail real traffic samples.
2. Run any registered prompt variant against Gemini 2.5 Pro with pinned decoding params.
3. Score each output using human pairwise comparison against a per-example ground truth or rubric.
4. Log metrics, especially format-compliance rate, and guardrail metrics (refusal rate, format compliance, safety).
5. Produce a diff report between two variants.
6. Be runnable both in CI (on every prompt PR) and ad-hoc locally.
## Deliverable
Produce a complete design doc with the following sections:
### Architecture
A sketch (text is fine) of:
```
Dataset v{N} → Runner → Model call → Output → Judge → Metrics store → Report
                 ↑                                                       ↓
          Prompt registry                                      CI gate (pass/fail)
```
### Dataset spec
- Schema: { id, input, expected, stratum, tags, source_url, created_at, retired_at } (one example record is sketched after this list)
- Sourcing plan from long-tail real traffic samples
- Refresh cadence (how often to add new examples from production)
- Retirement policy (when examples become stale)
- Sampling strategy for CI (small fast set) vs. full (slow, nightly)
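A minimal sketch of one record under the schema above, written as the Python dict that a `load_dataset` helper would yield for a single JSONL line; every field value here is an invented placeholder, not real traffic.

```python
# Illustrative record only; ids, URLs, and text are placeholders.
example_record = {
    "id": "bug-0142",
    "input": "Stack trace, failing request payload, and the last three commits touching payment_service.py",
    "expected": "Root cause: retry handler re-enters the charge path without an idempotency key",
    "stratum": "concurrency",
    "tags": ["backend", "long-tail"],
    "source_url": "https://logs.internal.example/requests/9f2c",  # placeholder
    "created_at": "2025-11-04",
    "retired_at": None,  # stays null until the example is retired as stale
}
```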
### Runner spec
- How to pin Gemini 2.5 Pro version (include exact version string)
- Decoding params are stored alongside the prompt, not hard-coded
- Retry + timeout behavior
- Caching: runs are deterministic by (prompt_hash, example_id, model_version, decoding_hash)
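A sketch of the cache key implied by the determinism rule above; the helper name and truncated hashes are illustrative, and the version string is a placeholder for whatever pinned Gemini 2.5 Pro identifier the team settles on.

```python
import hashlib
import json

def cache_key(prompt_text: str, example_id: str, model_version: str, decoding: dict) -> str:
    """Deterministic key: the same (prompt, example, model, decoding) always maps to one cached run."""
    prompt_hash = hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()[:12]
    # sort_keys so {"temperature": 0, "top_p": 1} hashes identically regardless of insertion order
    decoding_hash = hashlib.sha256(json.dumps(decoding, sort_keys=True).encode("utf-8")).hexdigest()[:12]
    return f"{prompt_hash}:{example_id}:{model_version}:{decoding_hash}"

# e.g. cache_key(prompt_text, "bug-0142", "gemini-2.5-pro", {"temperature": 0.0, "top_p": 1.0})
```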
### Judging spec — using human pairwise comparison
- Define the scoring procedure precisely.
- If any part of the pairwise comparison is delegated to an LLM judge rather than humans, pin the judge model (different from the model under test) and publish the judge prompt; treat it as a first-class artifact.
- Calibrate the pairwise judgments against a small, independently human-labeled set; report inter-judge agreement (κ) before trusting them.
- Flakiness mitigation: average three judge runs or take a majority vote if variance is high (a sketch follows this list).
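A sketch of that mitigation for the LLM-judge case: repeat the comparison and take a majority vote. `call_judge` is a stand-in for the pinned judge model and is assumed to return "A", "B", or "tie" for a single comparison.

```python
from collections import Counter

def pairwise_verdict(call_judge, example: dict, output_a: str, output_b: str, runs: int = 3) -> str:
    """Majority vote over repeated judge runs; returns "A", "B", or "tie"."""
    votes = [call_judge(example, output_a, output_b) for _ in range(runs)]
    winner, count = Counter(votes).most_common(1)[0]
    # No strict majority (e.g. one vote each for A, B, tie): record a tie and flag for human review.
    return winner if count > runs // 2 else "tie"
```

For the calibration step, compare a sample of these verdicts against the human-labeled set and compute κ on that sample before the judge is allowed to gate anything.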
### Metrics
- Primary: format-compliance rate
- Guardrails: refusal_rate, format_compliance, safety_violations, p95_latency_ms, mean_tokens, $/example
- Per-stratum slices
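A sketch of the per-stratum slicing, assuming each scored row records the example's stratum plus a boolean for the primary metric; the row shape and key names are assumptions, not part of the spec above.

```python
from collections import defaultdict

def format_compliance_by_stratum(rows: list[dict]) -> dict[str, float]:
    """rows look like {"stratum": "concurrency", "format_ok": True, ...}; returns the rate per stratum."""
    hits: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for row in rows:
        totals[row["stratum"]] += 1
        hits[row["stratum"]] += int(row["format_ok"])
    return {stratum: hits[stratum] / totals[stratum] for stratum in totals}
```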
### Reporting
Example report table (Markdown):
| variant | format-compliance rate | refusal% | format% | p95_ms | $/ex |
| --- | --- | --- | --- | --- | --- |
Plus a "Biggest disagreements" section for qualitative review.
### CI gating
- PRs that modify a prompt file must include an eval run.
- Block the PR if format-compliance rate drops >2% OR any guardrail crosses its threshold (a gate sketch follows this list).
- Override requires explicit approver and a written justification committed to the PR.
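A sketch of the gate rule; it reads the ">2%" drop as two percentage points absolute (an assumption the doc should pin down), and the metric and threshold names are placeholders.

```python
def gate(baseline: dict, candidate: dict, guardrail_limits: dict[str, float]) -> tuple[bool, list[str]]:
    """Return (passed, reasons); an empty reasons list means the PR may merge."""
    reasons = []
    # Interpreting "drops >2%" as more than 2 percentage points absolute vs. the baseline run.
    if baseline["format_compliance_rate"] - candidate["format_compliance_rate"] > 0.02:
        reasons.append("format-compliance rate dropped by more than 2 points vs. baseline")
    for metric, limit in guardrail_limits.items():
        if candidate.get(metric, 0.0) > limit:
            reasons.append(f"{metric}={candidate[metric]:.3f} exceeds limit {limit}")
    return (not reasons, reasons)
```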
### Code sketch
Provide a ~40-line Python skeleton using plain stdlib plus a single thin model client (the `openai` client pointed at Gemini's OpenAI-compatible endpoint works). No fancy frameworks. Functions: `load_dataset`, `run_variant`, `judge`, `score`, `report`, `gate`.
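One possible shape for that skeleton, offered as a starting point rather than the answer the doc should contain. It assumes Gemini 2.5 Pro is reached through its OpenAI-compatible endpoint via the `openai` client (base URL and key supplied by the environment), that the dataset is JSONL matching the schema above, and that decoding params travel with the prompt; the judge, scorer, reporter, and gate are left as stubs.

```python
import json
from pathlib import Path
from openai import OpenAI  # assumes the Gemini OpenAI-compatible endpoint is configured via env vars

client = OpenAI()  # base_url and api_key come from the environment, not from code

def load_dataset(path: str) -> list[dict]:
    """One JSON object per line; skip retired examples."""
    rows = [json.loads(line) for line in Path(path).read_text().splitlines() if line.strip()]
    return [r for r in rows if r.get("retired_at") is None]

def run_variant(prompt: dict, examples: list[dict], model: str = "gemini-2.5-pro") -> list[dict]:
    """prompt = {"text": ..., "decoding": {...}} pulled from the prompt registry; pin the exact model version in real use."""
    outputs = []
    for ex in examples:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "system", "content": prompt["text"]},
                      {"role": "user", "content": ex["input"]}],
            **prompt["decoding"],  # decoding params live with the prompt, never hard-coded here
        )
        outputs.append({"example_id": ex["id"], "output": resp.choices[0].message.content})
    return outputs

def judge(example: dict, output_a: str, output_b: str) -> str:
    """Pairwise comparison ("A", "B", or "tie"); human labels or a pinned, versioned LLM judge."""
    raise NotImplementedError

def score(examples: list[dict], outputs_a: list[dict], outputs_b: list[dict]) -> dict:
    """Aggregate verdicts and guardrail metrics into per-variant dicts."""
    raise NotImplementedError

def report(metrics_by_variant: dict) -> str:
    """Render the Markdown comparison table plus the biggest disagreements."""
    raise NotImplementedError

def gate(baseline: dict, candidate: dict) -> bool:
    """True if the candidate passes the regression and guardrail thresholds."""
    raise NotImplementedError
```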
## Constraints
- Don't recommend a paid SaaS eval platform unless the team already uses it.
- Don't let judge prompts live un-versioned.
- Keep the first working version buildable in one afternoon.