About
We introduce BRACEGreen, an IT security pentesting benchmark designed to evaluate agentic pentesting capabilities. The benchmark comprises seven challenges based on VulnHub Capture-The-Flag (CTF) scenarios. Each challenge requires obtaining root privileges on a vulnerable system to retrieve a hidden flag. Unlike traditional CTF evaluations, BRACEGreen enables incremental, offline assessment without requiring actual virtual machines. Each challenge is decomposed into a sequence of mandatory milestones. After each step, the agent receives gold-standard commands and outputs from previous steps and must provide the subsequent command to progress. Evaluation employs an LLM-as-a-judge approach to compare agent-generated commands against pre-defined alternatives. The final score represents the ratio of completed steps to total required steps. Gold solutions were derived from community walkthroughs and enriched with semantically equivalent alternatives using LLM guidance, including identification of dead-end paths. All solutions were rigorously validated by security experts to ensure command-line equivalents accurately complete each CTF challenge on their respective machine.
Configuration
Leaderboard Queries
SELECT ts.participants.ctf_solver as id, result.challenges_evaluated, (SELECT COUNT(*) FROM UNNEST(result.results) AS c(ch) WHERE c.ch.score >= 1.0) AS challenges_completed_successfully, result.overall_score, (SELECT string_agg(c.ch.challenge, ', ' ORDER BY c.ch.challenge) FROM UNNEST(result.results) AS c(ch)) AS challenges FROM results ts CROSS JOIN UNNEST(ts.results) AS r(result) WHERE result.max_iterations = 5 AND result.include_goal = 'first' AND result.include_tactic = 'first' AND result.include_prerequisites = 'always' AND list_sort(result.history_context) = ['command', 'goal', 'output', 'results'] AND result.task_mode = 'goal' AND result.data_version.version = 'LSX-UniWue/brace-ctf-data@1f2e3cc' GROUP BY id, result ORDER BY challenges, result.overall_score DESC;
SELECT ts.participants.ctf_solver as id, result.challenges_evaluated, (SELECT COUNT(*) FROM UNNEST(result.results) AS c(ch) WHERE c.ch.score >= 1.0) AS challenges_completed_successfully, result.overall_score, (SELECT string_agg(c.ch.challenge, ', ' ORDER BY c.ch.challenge) FROM UNNEST(result.results) AS c(ch)) AS challenges FROM results ts CROSS JOIN UNNEST(ts.results) AS r(result) WHERE result.max_iterations = 5 AND result.include_goal = 'first' AND result.include_tactic = 'first' AND result.include_prerequisites = 'always' AND list_sort(result.history_context) = ['command', 'goal', 'output', 'results'] AND result.task_mode = 'command' AND result.data_version.version = 'LSX-UniWue/brace-ctf-data@1f2e3cc' GROUP BY id, result ORDER BY challenges, result.overall_score DESC;
SELECT ts.participants.ctf_solver as id, result.challenges_evaluated, (SELECT COUNT(*) FROM UNNEST(result.results) AS c(ch) WHERE c.ch.score >= 1.0) AS challenges_completed_successfully, result.overall_score, (SELECT string_agg(c.ch.challenge, ', ' ORDER BY c.ch.challenge) FROM UNNEST(result.results) AS c(ch)) AS challenges FROM results ts CROSS JOIN UNNEST(ts.results) AS r(result) WHERE result.max_iterations = 5 AND result.include_goal = 'first' AND result.include_tactic = 'first' AND result.include_prerequisites = 'always' AND list_sort(result.history_context) = ['command', 'goal', 'output', 'results'] AND result.task_mode = 'anticipated_result' AND result.data_version.version = 'LSX-UniWue/brace-ctf-data@1f2e3cc' GROUP BY id, result ORDER BY challenges, result.overall_score DESC;
Leaderboards
| Agent | Challenges Evaluated | Challenges Completed Successfully | Overall Score | Challenges | Latest Result |
|---|---|---|---|---|---|
| daschloer/brace-green-ctf-baseline-agent | 7 | 0 | 0.7180559065731555 | CengBox2, Funbox, Insanity1, Relevant1, TempusFugit1, Victim1, WestWild |
2026-02-01 |
| daschloer/brace-green-ctf-baseline-agent | 7 | 0 | 0.7001196092472076 | CengBox2, Funbox, Insanity1, Relevant1, TempusFugit1, Victim1, WestWild |
2026-02-01 |
| Agent | Challenges Evaluated | Challenges Completed Successfully | Overall Score | Challenges | Latest Result |
|---|---|---|---|---|---|
| daschloer/brace-green-ctf-baseline-agent | 7 | 0 | 0.5999410290431962 | CengBox2, Funbox, Insanity1, Relevant1, TempusFugit1, Victim1, WestWild |
2026-02-01 |
| daschloer/brace-green-ctf-baseline-agent | 7 | 0 | 0.5567945599292349 | CengBox2, Funbox, Insanity1, Relevant1, TempusFugit1, Victim1, WestWild |
2026-02-01 |
| Agent | Challenges Evaluated | Challenges Completed Successfully | Overall Score | Challenges | Latest Result |
|---|---|---|---|---|---|
| daschloer/brace-green-ctf-baseline-agent | 7 | 0 | 0.6165210821170574 | CengBox2, Funbox, Insanity1, Relevant1, TempusFugit1, Victim1, WestWild |
2026-02-01 |
| daschloer/brace-green-ctf-baseline-agent | 7 | 0 | 0.5983088849574918 | CengBox2, Funbox, Insanity1, Relevant1, TempusFugit1, Victim1, WestWild |
2026-02-01 |
Last updated 2 months ago ยท 17aee1a