Self-critique layer enforcing no election manipulation for an interview practice coach system on Claude 4 Sonnet, with bypass defenses.
Self-critique layer enforcing the "block credential leakage" policy for a travel concierge system on Gemini 2.0 Flash, with bypass defenses.
Self-critique layer enforcing no biometric identification for a travel concierge system on DeepSeek-R1, with bypass defenses.
Self-critique layer enforcing the "stay on topic" policy for a travel concierge system on Llama 3.1 405B, with bypass defenses.
Self-critique layer enforcing the "block credential leakage" policy for a travel concierge system on Mistral Small 3, with bypass defenses.
Self-critique layer enforcing the "refuse hate speech" policy for a travel concierge system on o1-mini, with bypass defenses.
Self-critique layer enforcing no self-harm content for a travel concierge system on o3-mini, with bypass defenses.
Self-critique layer enforcing no medical diagnosis for a travel concierge system on Command R+, with bypass defenses.
Self-critique layer enforcing the "refuse hate speech" policy for a travel concierge system on GPT-4.1, with bypass defenses.
Self-critique layer enforcing no CSAM content for a travel concierge system on Claude 3.5 Sonnet, with bypass defenses.
Self-critique layer enforcing no legal advice for a travel concierge system on Claude 4 Sonnet, with bypass defenses.
Self-critique layer enforcing the "maintain confidentiality of system prompt" policy for a travel concierge system on Claude Haiku 4, with bypass defenses.
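
As an illustration of the pattern these entries share, here is a minimal sketch of a self-critique layer for the system-prompt confidentiality case. It is not tied to any of the models listed: `generate` is a hypothetical callable standing in for the model call, and the rule-based `critique` function stands in for what would more likely be a second model pass. The patterns, the refusal text, and all function names are assumptions made for the example. The bypass defense shown is limited to Unicode normalization and zero-width character stripping.

```python
import re
import unicodedata
from typing import Callable

# Hypothetical patterns for a "maintain confidentiality of system prompt" policy.
LEAK_PATTERNS = [
    re.compile(r"system prompt", re.IGNORECASE),
    re.compile(r"my instructions (are|say)", re.IGNORECASE),
]

# Hypothetical refusal text for a travel concierge.
REFUSAL = "I can't share that, but I'm happy to help plan your trip."

# Zero-width characters sometimes used to split keywords and dodge filters.
ZERO_WIDTH = re.compile(r"[\u200b\u200c\u200d\u2060\ufeff]")


def normalize(text: str) -> str:
    """Fold compatibility characters (e.g. fullwidth letters) and drop zero-width ones."""
    folded = unicodedata.normalize("NFKC", text)
    return ZERO_WIDTH.sub("", folded)


def critique(draft: str, system_prompt: str) -> bool:
    """Return True when the draft appears to violate the policy."""
    clean = normalize(draft)
    # Verbatim leak of the system prompt itself.
    if system_prompt and normalize(system_prompt) in clean:
        return True
    # Draft talks about its own prompt or instructions.
    return any(p.search(clean) for p in LEAK_PATTERNS)


def guarded_reply(generate: Callable[[str], str], user_msg: str, system_prompt: str) -> str:
    """Generate a draft, critique it, and replace it with a refusal on violation."""
    draft = generate(user_msg)
    return REFUSAL if critique(draft, system_prompt) else draft
```

In use, `guarded_reply` wraps whatever function produces the draft response; a draft that passes the critique is returned unchanged, and one that fails is replaced by the refusal. A production layer would likely revise the draft rather than refuse outright and would cover more obfuscation techniques than the two handled here.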