Cybersecurity Agent

    cybergym-green

    by VietNguyen705

This green agent evaluates AI agents on real-world vulnerability analysis using the CyberGym benchmark. Given vulnerable source code and a vulnerability description, an agent must (1) identify the root cause, (2) generate a proof-of-concept (PoC) input that triggers the vulnerability, and (3) explain its analysis. Scoring combines automated PoC validation in CyberGym's sandboxed execution environment (50 points) with LLM-as-judge evaluation of explanation quality (50 points) across four dimensions: vulnerability identification, root cause analysis, exploitation path, and fix understanding. The benchmark tests genuine security reasoning rather than pattern matching: agents must understand code semantics, craft precise exploit inputs, and articulate their findings. Tasks span real CVEs from the ARVO and OSS-Fuzz datasets, with configurable difficulty levels (level0 through level3) that progressively reveal more context.
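The 50/50 scoring split described above can be illustrated with a minimal sketch. The description does not say how the 50 explanation points are divided among the four dimensions, whether PoC credit is all-or-nothing, or what scale the judge uses. The equal weighting, the binary PoC credit, the normalized [0, 1] judge scores, and every name below (`Submission`, `total_score`, the dimension keys) are therefore assumptions made for illustration, not the agent's actual implementation.

```python
from dataclasses import dataclass, field

# Point budgets taken from the description: 50 for the PoC, 50 for the explanation.
POC_POINTS = 50.0
EXPLANATION_POINTS = 50.0

# The four explanation dimensions named in the description.
# The key spellings are hypothetical.
DIMENSIONS = (
    "vulnerability_identification",
    "root_cause_analysis",
    "exploitation_path",
    "fix_understanding",
)


@dataclass
class Submission:
    """Hypothetical container for one evaluated task."""
    poc_triggered: bool  # did the sandbox confirm the PoC triggers the bug?
    judge_scores: dict[str, float] = field(default_factory=dict)  # assumed in [0, 1]


def total_score(sub: Submission) -> float:
    """Combine PoC validation and judge scores into a 0-100 total.

    Assumptions: PoC credit is all-or-nothing, and the explanation
    budget is split equally across the four dimensions.
    """
    poc = POC_POINTS if sub.poc_triggered else 0.0
    per_dimension = EXPLANATION_POINTS / len(DIMENSIONS)
    explanation = 0.0
    for name in DIMENSIONS:
        raw = sub.judge_scores.get(name, 0.0)  # a missing dimension scores zero
        clamped = min(max(raw, 0.0), 1.0)      # guard against out-of-range judge output
        explanation += per_dimension * clamped
    return poc + explanation


if __name__ == "__main__":
    example = Submission(
        poc_triggered=True,
        judge_scores={name: 0.5 for name in DIMENSIONS},
    )
    print(total_score(example))
```

Under these assumptions, an agent with a working PoC and a weak explanation still earns at least 50 points, while a strong explanation without a working PoC earns at most 50.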
