Run a rigorous A/B test on prompt variants for financial report analysis, measuring F1 score on o1-mini using embedding distance.
Run a rigorous A/B test on prompt variants for financial report analysis, measuring token cost on Grok 3 using tool-call accuracy.
Run a rigorous A/B test on prompt variants for financial report analysis, measuring hallucination rate on GPT-4o using JSON schema validation.
Run a rigorous A/B test on prompt variants for financial report analysis, measuring factuality on GPT-4o-mini using JSON schema validation.
Run a rigorous A/B test on prompt variants for financial report analysis, measuring token cost on Claude 3.7 Sonnet using regex match checks.
Run a rigorous A/B test on prompt variants for financial report analysis, measuring hallucination rate on Claude Sonnet 4.5 using regex match checks.
Run a rigorous A/B test on prompt variants for financial report analysis, measuring refusal rate on Gemini 2.5 Pro using BERTScore.
Run a rigorous A/B test on prompt variants for financial report analysis, measuring p95 latency on DeepSeek-V3 using factuality with retrieval.
Run a rigorous A/B test on prompt variants for financial report analysis, measuring user satisfaction (CSAT) on Llama 3.3 70B using factuality with retrieval.
Run a rigorous A/B test on prompt variants for financial report analysis, measuring refusal rate on Mistral Large using LLM-as-judge.
Run a rigorous A/B test on prompt variants for financial report analysis, measuring p95 latency on Qwen 2.5 72B using LLM-as-judge.
Run a rigorous A/B test on prompt variants for financial report analysis, measuring inter-judge agreement on o1-mini using exact match.
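Every task above shares the same statistical core: two prompt variants are scored on the same set of reports, and the question is whether the difference is larger than chance. One possible way to check this is a paired sign-flip permutation test, sketched below. The function name `paired_permutation_test`, the per-example score lists, and the defaults for `n_resamples` and `seed` are illustrative assumptions, not part of any task above; the scores themselves would come from whichever check a task names (for example 1 if the output passes JSON schema validation, else 0).

```python
import random
from statistics import mean


def paired_permutation_test(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Two-sided sign-flip permutation test on paired per-example scores.

    scores_a[i] and scores_b[i] must be the scores of prompt variants A and B
    on the same input i. Returns (mean difference B - A, p-value).
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("scores must be paired: one score per example per variant")
    if not scores_a:
        raise ValueError("at least one paired example is required")

    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    observed = mean(diffs)

    rng = random.Random(seed)  # fixed seed so the result is reproducible
    extreme = 0
    for _ in range(n_resamples):
        # Under the null hypothesis the A/B labels are exchangeable within
        # each pair, so each difference is equally likely to have either sign.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        # Small tolerance guards against floating-point rounding on ties.
        if abs(mean(flipped)) >= abs(observed) - 1e-12:
            extreme += 1

    # Add-one correction keeps the estimated p-value strictly above zero.
    p_value = (extreme + 1) / (n_resamples + 1)
    return observed, p_value


if __name__ == "__main__":
    # Hypothetical pass/fail scores for 12 reports; not real measurements.
    variant_a = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1]
    variant_b = [1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1]
    diff, p = paired_permutation_test(variant_a, variant_b)
    print(f"mean difference (B - A): {diff:.3f}, p-value: {p:.4f}")
```

Pairing matters here because both variants see the same reports, so per-report difficulty cancels out of the difference. For skewed metrics such as p95 latency or token cost, a test on the mean may not be the right summary, and a bootstrap on the statistic of interest is a common alternative.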