Design A/B rollout analysis and drift detection for citation accuracy in a production LLM code assistant.
Design a pairwise + rubric LLM-as-judge prompt for customer support chat with bias mitigation, calibration, and reproducibility.
Design a pairwise + rubric LLM-as-judge prompt for code generation with bias mitigation, calibration, and reproducibility.
Design a pairwise + rubric LLM-as-judge prompt for SQL generation with bias mitigation, calibration, and reproducibility.
Design a pairwise + rubric LLM-as-judge prompt for technical summarization with bias mitigation, calibration, and reproducibility.
Design a pairwise + rubric LLM-as-judge prompt for a tool-use agent with bias mitigation, calibration, and reproducibility.
Design a pairwise + rubric LLM-as-judge prompt for long-doc QA with bias mitigation, calibration, and reproducibility.
Design a pairwise + rubric LLM-as-judge prompt for creative writing with bias mitigation, calibration, and reproducibility.
Design a pairwise + rubric LLM-as-judge prompt for translation with bias mitigation, calibration, and reproducibility.
Design a pairwise + rubric LLM-as-judge prompt for medical Q&A with bias mitigation, calibration, and reproducibility.
Design a pairwise + rubric LLM-as-judge prompt for legal reasoning with bias mitigation, calibration, and reproducibility.
Design a pairwise + rubric LLM-as-judge prompt for multi-turn dialogue with bias mitigation, calibration, and reproducibility.
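
One common reading of "bias mitigation" in the pairwise prompts above is controlling for position bias: the judge is run once in each presentation order, and a verdict is kept only when both runs agree. The sketch below illustrates that idea under stated assumptions. The names `debiased_pairwise`, `always_first`, and `prefers_longer` are hypothetical, and the two stub judges stand in for a real LLM call, which is not shown.

```python
from typing import Callable, Literal

Verdict = Literal["A", "B", "tie"]
# A judge sees (prompt, response in slot A, response in slot B) and returns a verdict.
Judge = Callable[[str, str, str], Verdict]


def debiased_pairwise(judge: Judge, prompt: str, resp_a: str, resp_b: str) -> Verdict:
    """Run the judge in both presentation orders; keep only consistent verdicts."""
    first = judge(prompt, resp_a, resp_b)    # resp_a shown in slot A
    swapped = judge(prompt, resp_b, resp_a)  # resp_b shown in slot A
    # Map the swapped-order verdict back to the original labels.
    mapped_back = {"A": "B", "B": "A", "tie": "tie"}[swapped]
    # Disagreement across orders is treated as a tie (an assumption of this sketch).
    return first if first == mapped_back else "tie"


def always_first(prompt: str, a: str, b: str) -> Verdict:
    """Hypothetical stub: a fully position-biased judge that always picks slot A."""
    return "A"


def prefers_longer(prompt: str, a: str, b: str) -> Verdict:
    """Hypothetical stub: a judge whose verdict does not depend on slot order."""
    if len(a) == len(b):
        return "tie"
    return "A" if len(a) > len(b) else "B"


if __name__ == "__main__":
    print(debiased_pairwise(always_first, "q", "x", "y"))
    print(debiased_pairwise(prefers_longer, "q", "short", "a longer answer"))
```

Collapsing disagreements to a tie is only one possible policy. A design could instead log the disagreement rate as a calibration signal or escalate inconsistent pairs to human review.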