About
Team AgentBusters implemented and evaluated **6 security arena scenarios** spanning **5 attack categories** that assess AI agent vulnerabilities. These scenarios test whether green-team AI defenders can resist various adversarial attacks, including:

1. **Insecure Output Handling** (1 scenario): Tests whether AI assistants generate dangerous configuration files (e.g., LangChain/YAML injection) that execute malicious code when deserialized by downstream systems (see the sketch after this list).
2. **Direct Logic Override (Jailbreak)** (2 scenarios): Evaluates AI resistance to creative prompt manipulation techniques, such as puzzle-solving attacks that bypass copyright restrictions and fictional framing that extracts harmful content.
3. **Indirect Prompt Injection** (1 scenario): Assesses AI susceptibility to manipulated external documents (SEC filings) that override proper decision-making.
4. **Privacy/PII Leakage** (1 scenario): Tests multi-tenant session isolation and resistance to cross-context data extraction attacks.
5. **Data Poisoning Detection** (1 scenario): Evaluates AI safety reviewers' ability to detect poisoned fine-tuning datasets that cause "weird generalization" or "inductive backdoors".

All 6 scenarios successfully demonstrated that current AI safety measures can be bypassed through creative adversarial techniques. Two scenarios (**ConfigForge** and **Inductive Backdoor**) are original designs based on real-world vulnerabilities and recent research papers.
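To make the **Insecure Output Handling** category concrete, here is a minimal sketch of the class of flaw a ConfigForge-style scenario targets: a YAML configuration that abuses PyYAML's `!!python/object/apply` tag to run code at deserialization time. The payload, field names (`chain`, `on_load`), and the assumption that PyYAML is the downstream loader are hypothetical illustrations, not the team's actual scenario artifacts.

```python
import yaml

# Hypothetical AI-generated configuration (illustrative payload only).
# The !!python/object/apply tag tells PyYAML's unsafe loaders to call an
# arbitrary callable (here os.system) while parsing the document.
untrusted_config = """
chain:
  name: summarizer
  on_load: !!python/object/apply:os.system ["echo compromised"]
"""

# Unsafe: yaml.unsafe_load (or yaml.load with Loader=yaml.Loader) would run
# os.system("echo compromised") during deserialization:
#   yaml.unsafe_load(untrusted_config)

# Safer: yaml.safe_load refuses to construct objects from python/* tags.
try:
    config = yaml.safe_load(untrusted_config)
except yaml.YAMLError as exc:
    print(f"Rejected unsafe construct: {exc}")
```

Preferring `safe_load` over the unsafe loaders is the standard consumer-side mitigation; the scenario itself focuses on the other side of the boundary, i.e., whether the AI assistant can be induced to emit a construct like the `on_load` line in the first place.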