Self-critique layer enforcing stay on topic for a release-notes drafter system on Claude 4.5 Sonnet, with bypass defenses.
Self-critique layer enforcing block credential leakage for a release-notes drafter system on Claude Haiku 4, with bypass defenses.
Self-critique layer enforcing no biometric identification for a release-notes drafter system on Gemini 2.0 Flash, with bypass defenses.
Self-critique layer enforcing stay on topic for a release-notes drafter system on DeepSeek-R1, with bypass defenses.
Self-critique layer enforcing block credential leakage for a release-notes drafter system on GPT-4.1, with bypass defenses.
Layered defense design for a customer support agent deployment against jailbreak prefix attacks, using content provenance tagging on Mistral Large.
Layered defense design for a customer support agent deployment against role-play jailbreak attacks, using retrieval trust scoring on Claude Haiku 4.
Layered defense design for a customer support agent deployment against multi-turn manipulation attacks, using retrieval trust scoring on GPT-4o.
Layered defense design for a customer support agent deployment against data exfiltration via summaries attacks, using structured function-call-only interface on Qwen 2.5 72B.
Layered defense design for a customer support agent deployment against system prompt extraction attacks, using structured function-call-only interface on Gemini 2.5 Pro.
Layered defense design for a customer support agent deployment against payload smuggling in code blocks attacks, using hash-based prompt pinning on GPT-4o-mini.
Layered defense design for a customer support agent deployment against Unicode homoglyph attack attacks, using hash-based prompt pinning on o1.