Category Not Found

1252 prompts

Build LLM-as-judge Eval Harness for incident post-mortems on Llama 3.1 405B

Design an eval harness for incident post-mortems using LLM-as-judge that tracks hallucination rate across prompt versions on Llama 3.1 405B.

Build tool-call accuracy Eval Harness for incident post-mortems on Mistral Small 3

Design an eval harness for incident post-mortems using tool-call accuracy that tracks hallucination rate across prompt versions on Mistral Small 3.

Build G-Eval Eval Harness for incident post-mortems on o1

Design an eval harness for incident post-mortems using G-Eval that tracks hallucination rate across prompt versions on o1.

Build LLM-as-judge Eval Harness for incident post-mortems on o3-mini

Design an eval harness for incident post-mortems using LLM-as-judge that tracks user satisfaction (CSAT) across prompt versions on o3-mini.

Build tool-call accuracy Eval Harness for incident post-mortems on Command R+

Design an eval harness for incident post-mortems using tool-call accuracy that tracks user satisfaction (CSAT) across prompt versions on Command R+.

Build G-Eval Eval Harness for incident post-mortems on GPT-4.1

Design an eval harness for incident post-mortems using G-Eval that tracks inter-judge agreement across prompt versions on GPT-4.1.

Build exact-match Eval Harness for incident post-mortems on Claude 3.5 Sonnet

Design an eval harness for incident post-mortems using exact match that tracks inter-judge agreement across prompt versions on Claude 3.5 Sonnet.

Build JSON schema validation Eval Harness for incident post-mortems on Claude Sonnet 4

Design an eval harness for incident post-mortems using JSON schema validation that tracks cost-per-correct-answer across prompt versions on Claude Sonnet 4.

Build TruLens feedback functions Eval Harness for incident post-mortems on Claude Opus 4.5

Design an eval harness for incident post-mortems using TruLens feedback functions that tracks cost-per-correct-answer across prompt versions on Claude Opus 4.5.