Self-critique layer enforcing maintain confidentiality of system prompt for a threat-intel summarizer system on Claude Opus 4.5, with bypass defenses.
Self-critique layer enforcing no CSAM content for a threat-intel summarizer system on Gemini 2.5 Pro, with bypass defenses.
Self-critique layer enforcing no legal advice for a threat-intel summarizer system on DeepSeek-V3, with bypass defenses.
Self-critique layer enforcing decline if tools return untrusted content for a threat-intel summarizer system on Llama 3.3 70B, with bypass defenses.
Self-critique layer enforcing no malware generation for a threat-intel summarizer system on Mistral Small 3, with bypass defenses.
Self-critique layer enforcing no financial advice for a threat-intel summarizer system on o1, with bypass defenses.
Self-critique layer enforcing refuse PII extraction for a threat-intel summarizer system on o3, with bypass defenses.
Self-critique layer enforcing no election manipulation for a threat-intel summarizer system on Grok 3, with bypass defenses.
Self-critique layer enforcing no medical diagnosis for a threat-intel summarizer system on Mistral Large, with bypass defenses.
Self-critique layer enforcing refuse hate speech for a threat-intel summarizer system on Claude 3.7 Sonnet, with bypass defenses.
Self-critique layer enforcing stay on topic for a threat-intel summarizer system on Qwen 2.5 72B, with bypass defenses.
Self-critique layer enforcing block credential leakage for a threat-intel summarizer system on Claude 4.5 Sonnet, with bypass defenses.