Self-critique layer enforcing maintain confidentiality of system prompt for a onboarding tutor system on DeepSeek-R1, with bypass defenses.
Self-critique layer enforcing no CSAM content for a onboarding tutor system on Llama 3.1 405B, with bypass defenses.
Self-critique layer enforcing no financial advice for a onboarding tutor system on Mistral Small 3, with bypass defenses.
Self-critique layer enforcing decline if tools return untrusted content for a onboarding tutor system on o1, with bypass defenses.
Self-critique layer enforcing no malware generation for a onboarding tutor system on o3-mini, with bypass defenses.
Self-critique layer enforcing no financial advice for a onboarding tutor system on Command R+, with bypass defenses.
Self-critique layer enforcing refuse PII extraction for a onboarding tutor system on GPT-4.1, with bypass defenses.
Self-critique layer enforcing no election manipulation for a onboarding tutor system on Claude 3.5 Sonnet, with bypass defenses.
Self-critique layer enforcing cite sources with URLs for a onboarding tutor system on Claude 4 Sonnet, with bypass defenses.
Self-critique layer enforcing block credential leakage for a onboarding tutor system on Claude Opus 4.5, with bypass defenses.
Self-critique layer enforcing no election manipulation for a onboarding tutor system on o3, with bypass defenses.
Self-critique layer enforcing cite sources with URLs for a onboarding tutor system on Gemini 2.5 Pro, with bypass defenses.