Design an eval harness for bug root-cause analysis using DeepEval metrics that tracks tool-call precision across prompt versions on Claude 3.7 Sonnet.
Design an eval harness for bug root-cause analysis using semantic similarity that tracks format-compliance rate across prompt versions on Claude 4.5 Sonnet.
Design an eval harness for bug root-cause analysis using BERTScore that tracks format-compliance rate across prompt versions on Claude Haiku 4.
Design an eval harness for bug root-cause analysis using promptfoo assertions that tracks hallucination rate across prompt versions on DeepSeek-V3.
Design an eval harness for bug root-cause analysis using human pairwise comparison that tracks hallucination rate across prompt versions on Llama 3.3 70B.
Design an eval harness for bug root-cause analysis using factuality with retrieval that tracks hallucination rate across prompt versions on Mistral Large.
Design an eval harness for bug root-cause analysis using embedding distance that tracks user satisfaction (CSAT) across prompt versions on Qwen 2.5 72B.
Design an eval harness for bug root-cause analysis using rubric scoring that tracks user satisfaction (CSAT) across prompt versions on o1-mini.
Design an eval harness for bug root-cause analysis using LLM-as-judge that tracks inter-judge agreement across prompt versions on Grok 3.
Design an eval harness for bug root-cause analysis using tool-call accuracy that tracks inter-judge agreement across prompt versions on GPT-4o.
Design an eval harness for bug root-cause analysis using G-Eval that tracks cost-per-correct-answer across prompt versions on GPT-4o-mini.
Design an eval harness for bug root-cause analysis using exact match that tracks cost-per-correct-answer across prompt versions on Claude 3.7 Sonnet.
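
As a rough illustration of the exact-match, cost-per-correct-answer variant above, the following is a minimal sketch rather than a definitive harness. It assumes each model response has already been collected and tagged with a prompt version and a per-call cost in USD; the names EvalRecord, normalize, and cost_per_correct are hypothetical and not taken from any library, and the whitespace and case normalization is one possible choice for "exact match".

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class EvalRecord:
    """One graded model response (hypothetical schema for illustration)."""
    prompt_version: str  # e.g. "v1", "v2"
    predicted: str       # root cause returned by the model
    expected: str        # reference root cause label
    cost_usd: float      # assumed to be computed elsewhere from token usage


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences do not count as misses."""
    return " ".join(text.lower().split())


def cost_per_correct(records):
    """Aggregate exact-match accuracy and cost per correct answer for each prompt version."""
    totals = defaultdict(lambda: {"cost": 0.0, "correct": 0, "n": 0})
    for r in records:
        t = totals[r.prompt_version]
        t["cost"] += r.cost_usd
        t["n"] += 1
        if normalize(r.predicted) == normalize(r.expected):
            t["correct"] += 1

    report = {}
    for version, t in totals.items():
        report[version] = {
            "accuracy": t["correct"] / t["n"],
            # With zero correct answers the ratio is undefined; report infinity.
            "cost_per_correct": (
                t["cost"] / t["correct"] if t["correct"] else float("inf")
            ),
        }
    return report


if __name__ == "__main__":
    sample = [
        EvalRecord("v1", "Null pointer in parser", "null pointer in parser", 0.01),
        EvalRecord("v1", "Race condition", "off-by-one in loop", 0.01),
        EvalRecord("v2", "Memory leak", "off-by-one in loop", 0.03),
    ]
    for version, stats in sorted(cost_per_correct(sample).items()):
        print(version, stats)
```

Total spend is divided by the number of correct answers, not by the number of calls, so a prompt version that is cheaper per call but less accurate can still come out more expensive on this metric.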