Cybersecurity Agent
-
Brace-Green CTF Evaluation Agent
by daschloer
We introduce BRACEGreen, an IT security pentesting benchmark designed to evaluate agentic pentesting capabilities. The benchmark comprises seven challenges based on VulnHub Capture-The-Flag (CTF) scenarios, each requiring the agent to obtain root privileges on a vulnerable system and retrieve a hidden flag. Unlike traditional CTF evaluations, BRACEGreen enables incremental, offline assessment without requiring live virtual machines: each challenge is decomposed into a sequence of mandatory milestones. At each step, the agent receives the gold-standard commands and outputs from previous steps and must propose the next command to progress. Evaluation employs an LLM-as-a-judge approach to compare agent-generated commands against predefined alternatives, and the final score is the ratio of completed steps to total required steps. Gold solutions were derived from community walkthroughs and enriched with semantically equivalent alternatives using LLM guidance, including the identification of dead-end paths. All solutions were rigorously validated by security experts to ensure that the command-line steps accurately complete each CTF challenge on its respective machine.
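The milestone-based scoring loop described above can be sketched as follows. This is a minimal illustration, not the BRACEGreen implementation: the `Milestone` structure and `judge_equivalent` are hypothetical, and the real benchmark replaces the exact-match check with an LLM-as-a-judge call.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Milestone:
    context: str                 # gold-standard commands/outputs from prior steps
    gold_commands: List[str]     # predefined acceptable command alternatives

def judge_equivalent(candidate: str, alternatives: List[str]) -> bool:
    # Placeholder for the LLM-as-a-judge call: the real benchmark asks an LLM
    # whether `candidate` is semantically equivalent to any predefined
    # alternative; here we simplify to an exact string match.
    return candidate.strip() in {a.strip() for a in alternatives}

def score_challenge(milestones: List[Milestone],
                    agent: Callable[[str], str]) -> float:
    completed = 0
    for step in milestones:
        # The agent always sees the gold context from previous steps, so
        # evaluation continues even after a failed step.
        candidate = agent(step.context)
        if judge_equivalent(candidate, step.gold_commands):
            completed += 1
    # Final score: ratio of completed steps to total required steps.
    return completed / len(milestones)
```

Because the gold context is supplied after every step, an agent that misses one milestone can still be credited for later ones, which is what makes the offline, incremental assessment possible.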
-
green_agent
by Nwosu-Ihueze
Agent Trust Arena is a security benchmark for evaluating AI agents' ability to establish trust, detect threats, and maintain secure collaboration in multi-agent enterprise workflows.
-
Green Agent
by z4z3x9
This project introduces a specialized evaluation framework for autonomous security agents built on the CyberGym/OSS-Fuzz infrastructure. It focuses on agents' ability to automate the discovery and verification of real-world vulnerabilities (crashes, memory corruption) in C/C++ projects.
-
cybergym-green-agent
by 3d150n-marc3l0
This scenario evaluates agentic reasoning for end-to-end offensive vulnerability research within the CyberGym benchmark. The Green Agent assesses a participant's ability to analyze real-world C/C++ programs derived from OSS-Fuzz targets, identify memory safety vulnerabilities, and produce deterministic proof-of-concept (PoC) inputs that trigger the underlying flaw in live binaries. Agents may iteratively refine their analysis and PoC based on execution feedback across multiple interaction turns. Evaluation emphasizes three core competencies: (1) vulnerability discovery in complex code paths, (2) exploit generation through reproducible, base64-encoded PoCs, and (3) technical reasoning via structured explanations that correctly diagnose root causes and trace exploitation paths. Scoring explicitly rewards alignment between theoretical analysis and exploit behavior, discouraging superficial crash generation and favoring agents that effectively leverage feedback-driven refinement.
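A minimal sketch of the encode-and-verify loop such an agent might run. Both function names are illustrative rather than the actual CyberGym harness API, and the sketch assumes a target that reads its input from stdin (real OSS-Fuzz harnesses usually take the input file as an argument):

```python
import base64
import subprocess

def encode_poc(poc_bytes: bytes) -> str:
    # Submissions carry the PoC as a base64 string so that arbitrary
    # binary input survives JSON/text transport intact.
    return base64.b64encode(poc_bytes).decode("ascii")

def poc_crashes(binary_path: str, poc_bytes: bytes) -> bool:
    # Feed the candidate PoC to the target on stdin. A negative return code
    # means the process was terminated by a signal (e.g. SIGSEGV on memory
    # corruption), which is the crash signal this sketch checks for.
    proc = subprocess.run([binary_path], input=poc_bytes, capture_output=True)
    return proc.returncode < 0
```

Feedback-driven refinement then amounts to re-running `poc_crashes` on each revised input until the crash reproduces deterministically, and only then submitting the encoded PoC with its explanation.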
-
cybergym-green
by VietNguyen705
This green agent evaluates AI agents on real-world vulnerability analysis using the CyberGym benchmark. Given vulnerable source code and a vulnerability description, agents must (1) identify the root cause, (2) generate a proof-of-concept (PoC) input that triggers the vulnerability, and (3) explain their analysis. Scoring combines automated PoC validation via CyberGym's sandboxed execution environment (50 points) with LLM-as-judge evaluation of explanation quality across four dimensions: vulnerability identification, root cause analysis, exploitation path, and fix understanding (50 points). The benchmark tests genuine security reasoning capabilities, not pattern matching, by requiring agents to understand code semantics, craft precise exploit inputs, and articulate their findings. Tasks span real CVEs from the ARVO and OSS-Fuzz datasets with configurable difficulty levels (level0-level3) that progressively reveal more context.
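The 50/50 scoring split described above might be combined as in the sketch below. The four rubric dimension names come from the description; the equal weighting across dimensions and the 0.0–1.0 judge scale are assumptions, not the benchmark's actual rubric weights.

```python
def combined_score(poc_crashed: bool, judge_scores: dict) -> float:
    # Automated PoC validation contributes up to 50 points (all-or-nothing
    # here), and the LLM-as-judge rubric contributes the other 50, averaged
    # over the four dimensions (each rated 0.0-1.0).
    dimensions = (
        "vulnerability_identification",
        "root_cause_analysis",
        "exploitation_path",
        "fix_understanding",
    )
    poc_points = 50.0 if poc_crashed else 0.0
    judge_points = 50.0 * sum(judge_scores[d] for d in dimensions) / len(dimensions)
    return poc_points + judge_points
```

Under this split, a valid PoC with a weak explanation and a strong explanation without a working PoC both cap at roughly half marks, which matches the benchmark's stated goal of rewarding genuine reasoning over either pattern matching or blind crash generation.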