Token-cost and latency reduction playbook for a math word-problem prompt running on GPT-4o-mini, judged by regex match checks.
Token-cost and latency reduction playbook for a math word-problem prompt running on Claude 3.7 Sonnet, judged by BERTScore.
Token-cost and latency reduction playbook for a math word-problem prompt running on Claude Opus 4.5, judged by factuality with retrieval.
Token-cost and latency reduction playbook for a math word-problem prompt running on Gemini 2.5 Pro, judged by factuality with retrieval.
Token-cost and latency reduction playbook for a math word-problem prompt running on DeepSeek-V3, judged by LLM-as-judge.
Token-cost and latency reduction playbook for a math word-problem prompt running on Llama 3.3 70B, judged by LLM-as-judge.
Token-cost and latency reduction playbook for a math word-problem prompt running on Mistral Large, judged by exact match.
Token-cost and latency reduction playbook for a math word-problem prompt running on Qwen 2.5 72B, judged by BLEU/ROUGE.
Token-cost and latency reduction playbook for a math word-problem prompt running on o3, judged by BLEU/ROUGE.
Token-cost and latency reduction playbook for a math word-problem prompt running on Grok 3, judged by semantic similarity.
Token-cost and latency reduction playbook for a math word-problem prompt running on GPT-4o, judged by semantic similarity.
Token-cost and latency reduction playbook for a math word-problem prompt running on GPT-4o-mini, judged by human pairwise comparison.