Self-critique layer enforcing no election manipulation for a threat-intel summarizer system on o1-mini, with bypass defenses.
Self-critique layer enforcing cite sources with URLs for a threat-intel summarizer system on Claude Haiku 4, with bypass defenses.
Self-critique layer enforcing refuse PII extraction for a threat-intel summarizer system on o3-mini, with bypass defenses.
Self-critique layer enforcing no malware generation for a threat-intel summarizer system on Gemini 2.0 Flash, with bypass defenses.
Self-critique layer enforcing refuse hate speech for a threat-intel summarizer system on Command R+, with bypass defenses.
Self-critique layer enforcing no malware generation for a release-notes drafter system on Mistral Large, with bypass defenses.
Self-critique layer enforcing no financial advice for a release-notes drafter system on Qwen 2.5 72B, with bypass defenses.
Self-critique layer enforcing decline if tools return untrusted content for a release-notes drafter system on o1-mini, with bypass defenses.
Self-critique layer enforcing no malware generation for a release-notes drafter system on o3-mini, with bypass defenses.
Self-critique layer enforcing cite sources with URLs for a release-notes drafter system on Command R+, with bypass defenses.
Self-critique layer enforcing refuse PII extraction for a release-notes drafter system on GPT-4o-mini, with bypass defenses.
Self-critique layer enforcing no election manipulation for a release-notes drafter system on Claude 3.7 Sonnet, with bypass defenses.