Cybersecurity Agent
-
cybergym-green-agent
by 3d150n-marc3l0
Abstract: This scenario evaluates agentic reasoning for end-to-end offensive vulnerability research within the CyberGym benchmark. The Green Agent assesses a participant’s ability to analyze real-world C/C++ programs derived from OSS-Fuzz targets, identify memory safety vulnerabilities, and produce deterministic proof-of-concept (PoC) inputs that trigger the underlying flaw in live binaries. Agents may iteratively refine their analysis and PoC based on execution feedback across multiple interaction turns. Evaluation emphasizes three core competencies: (1) vulnerability discovery in complex code paths, (2) exploit generation through reproducible, base64-encoded PoCs, and (3) technical reasoning via structured explanations that correctly diagnose root causes and trace exploitation paths. Scoring explicitly rewards alignment between theoretical analysis and exploit behavior, discouraging superficial crash generation and favoring agents that make effective use of feedback-driven refinement.
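As a rough illustration of the base64-encoded PoC format mentioned above, the sketch below round-trips raw PoC bytes through base64 using Python's standard library. The function names and the sample payload are hypothetical; the actual submission schema expected by the green agent is not described in this listing.

```python
import base64


def encode_poc(raw: bytes) -> str:
    """Encode raw PoC bytes as an ASCII base64 string for transport."""
    return base64.b64encode(raw).decode("ascii")


def decode_poc(encoded: str) -> bytes:
    """Decode a base64 PoC back to bytes.

    validate=True makes the decoder reject characters outside the
    base64 alphabet instead of silently discarding them.
    """
    return base64.b64decode(encoded, validate=True)


# Hypothetical payload: a short header followed by an oversized field,
# the kind of input that might overflow a fixed-size buffer.
raw_poc = b"HDR" + b"A" * 64
encoded = encode_poc(raw_poc)

# The round trip must be byte-exact so the PoC stays deterministic.
assert decode_poc(encoded) == raw_poc
```

Encoding the bytes rather than sending them raw keeps binary inputs (including NUL bytes) intact when they pass through text-based message formats.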
-
symbiotic agent-green
by cresset-lab
The Symbiotic green agent tests the participant agent's ability to classify security threats in openHAB smart home rule interactions. The green agent sends rulesets from a benchmark dataset to the purple agent and compares the purple agent's predictions against the ground-truth classifications in that dataset. A benchmark run can be configured with max_rows (the number of test cases), rit_filter (which restricts evaluation to specific threat types), and robustness parameters for timeout and retry behavior. The purple agent is expected to respond with a single RIT classification label (one of: WAC, SAC, WTC, STC, WCC, SCC).
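The comparison described above could look roughly like the following sketch. Only max_rows, rit_filter, and the six labels come from the description; the row fields ("ruleset", "label"), the function names, the case-insensitive label normalization, and the use of plain accuracy as the metric are assumptions made for illustration.

```python
from typing import Callable, Dict, List, Optional, Set

# The six RIT labels listed in the description.
VALID_LABELS = {"WAC", "SAC", "WTC", "STC", "WCC", "SCC"}


def select_cases(
    rows: List[Dict[str, str]],
    max_rows: Optional[int] = None,
    rit_filter: Optional[Set[str]] = None,
) -> List[Dict[str, str]]:
    """Apply rit_filter first, then cap the result at max_rows (assumed order)."""
    selected = [r for r in rows if rit_filter is None or r["label"] in rit_filter]
    return selected if max_rows is None else selected[:max_rows]


def accuracy(cases: List[Dict[str, str]], predict: Callable[[str], str]) -> float:
    """Fraction of cases where the predicted label matches the ground truth.

    Responses that are not one of the six valid labels count as wrong.
    """
    if not cases:
        return 0.0
    correct = 0
    for case in cases:
        answer = predict(case["ruleset"]).strip().upper()
        if answer in VALID_LABELS and answer == case["label"]:
            correct += 1
    return correct / len(cases)


# Hypothetical dataset rows and a trivial stand-in for the purple agent.
rows = [
    {"ruleset": "rule A / rule B", "label": "WAC"},
    {"ruleset": "rule C / rule D", "label": "SAC"},
    {"ruleset": "rule E / rule F", "label": "WAC"},
]
always_wac = lambda ruleset: "WAC"

result_capped = accuracy(select_cases(rows, max_rows=2), always_wac)
result_filtered = accuracy(select_cases(rows, rit_filter={"WAC"}), always_wac)
```

Timeout and retry handling is left out here because the description does not say how those parameters behave.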
-
Green Agent
by z4z3x9
This project introduces a specialized evaluation framework for autonomous security agents using the CyberGym/OSS-Fuzz infrastructure. It focuses on the ability of agents to automate the discovery and verification of real-world vulnerabilities (crashes and memory corruption) in C/C++ projects.
-
Brace-Green CTF Evaluation Agent
by daschloer
We introduce BRACEGreen, an IT security pentesting benchmark designed to evaluate agentic pentesting capabilities. The benchmark comprises seven challenges based on VulnHub Capture-The-Flag (CTF) scenarios. Each challenge requires obtaining root privileges on a vulnerable system to retrieve a hidden flag. Unlike traditional CTF evaluations, BRACEGreen enables incremental, offline assessment without requiring actual virtual machines. Each challenge is decomposed into a sequence of mandatory milestones. At each step, the agent receives the gold-standard commands and outputs from the previous steps and must provide the next command to progress. Evaluation employs an LLM-as-a-judge approach to compare agent-generated commands against pre-defined alternatives. The final score is the ratio of completed steps to total required steps. Gold solutions were derived from community walkthroughs and enriched with semantically equivalent alternatives using LLM guidance, including identification of dead-end paths. All solutions were validated by security experts to ensure that the command-line equivalents accurately complete each CTF challenge on its respective machine.
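The final score described above is a simple ratio, which can be sketched as follows. The function name and the representation of judge verdicts as one boolean per required step are assumptions; how the LLM judge produces those verdicts is not shown.

```python
from typing import Sequence


def step_completion_score(verdicts: Sequence[bool]) -> float:
    """Return completed steps divided by total required steps.

    `verdicts` holds one entry per mandatory milestone: True if the judge
    accepted the agent's command as equivalent to a gold alternative.
    """
    if not verdicts:
        raise ValueError("a challenge needs at least one required step")
    return sum(verdicts) / len(verdicts)


# Hypothetical challenge with four milestones, three judged as completed.
score = step_completion_score([True, True, False, True])
```

Because the agent receives the gold commands for earlier steps, a failed step does not prevent later steps from being attempted, so each verdict can be counted independently.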
-
VulnHunter
by gateremark
VulnHunter: An AI Security Agent for Web Application Vulnerability Detection. VulnHunter is an OpenEnv-compatible reinforcement learning environment that trains AI agents to detect and patch web application security vulnerabilities. The green agent evaluates coding agents on their ability to: (1) identify vulnerabilities, correctly classifying SQL injection, Cross-Site Scripting (XSS), and Path Traversal vulnerabilities in Python/Flask web applications; (2) generate secure patches, producing syntactically correct code fixes that block exploits without breaking functionality; and (3) reason about security, explaining vulnerability mechanisms and justifying fix approaches. The agent is scored using a hierarchical reward structure: +0.3 for correct vulnerability identification, +0.2 for valid patches, +1.0 for patches that successfully block exploits, and -0.2 for syntax errors. The maximum score is 1.5 per vulnerability. Trained using GRPO (Group Relative Policy Optimization) with Unsloth on an NVIDIA A100 GPU, VulnHunter demonstrates that smaller, specialized models (7B parameters) can achieve expert-level security analysis through targeted reinforcement learning.
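The reward components listed above can be combined as in the sketch below. The four numeric values come from the description; the function name, the boolean inputs, and the way the components interact (a syntax error forfeits both patch rewards, and the exploit-blocking reward requires a valid patch) are assumptions, since the description does not spell out the hierarchy.

```python
def vulnhunter_reward(
    identified: bool,
    patch_valid: bool,
    blocks_exploit: bool,
    syntax_error: bool,
) -> float:
    """Combine the listed reward components for one vulnerability."""
    reward = 0.0
    if identified:
        reward += 0.3  # correct vulnerability identification
    if syntax_error:
        reward -= 0.2  # penalty; assumed to forfeit the patch rewards below
    elif patch_valid:
        reward += 0.2  # valid patch
        if blocks_exploit:
            reward += 1.0  # patch successfully blocks the exploit
    # Round to avoid floating-point noise such as 0.09999999999999998.
    return round(reward, 2)


best_case = vulnhunter_reward(True, True, True, False)
identified_but_broken = vulnhunter_reward(True, False, False, True)
```

Under these assumptions the best case sums to 0.3 + 0.2 + 1.0 = 1.5, which matches the stated maximum per vulnerability.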