Self-critique layer enforcing no election manipulation for a writing editor system on Mistral Large, with bypass defenses.
Self-critique layer enforcing no financial advice for a writing editor system on Claude 3.7 Sonnet, with bypass defenses.
Self-critique layer enforcing no self-harm content for a writing editor system on Qwen 2.5 72B, with bypass defenses.
Self-critique layer enforcing block credential leakage for a writing editor system on Claude 4.5 Sonnet, with bypass defenses.
Self-critique layer enforcing no biometric identification for a writing editor system on o1-mini, with bypass defenses.
Self-critique layer enforcing no financial advice for a interview practice coach system on o1-mini, with bypass defenses.
Self-critique layer enforcing decline if tools return untrusted content for a interview practice coach system on Grok 3, with bypass defenses.
Self-critique layer enforcing no election manipulation for a interview practice coach system on GPT-4o, with bypass defenses.
Self-critique layer enforcing cite sources with URLs for a interview practice coach system on GPT-4o-mini, with bypass defenses.
Self-critique layer enforcing refuse PII extraction for a interview practice coach system on Claude 3.7 Sonnet, with bypass defenses.
Self-critique layer enforcing no election manipulation for a interview practice coach system on Claude 4.5 Sonnet, with bypass defenses.
Self-critique layer enforcing stay on topic for a interview practice coach system on Claude Haiku 4, with bypass defenses.