Sherlock-green

By w4lk3r04 2 months ago

About

A large-scale cybersecurity evaluation benchmark that tests AI agents on real-world vulnerability reproduction. It draws on 1,500+ historical OSS-Fuzz vulnerabilities across 188 production codebases and challenges agents to generate proof-of-concept exploits that trigger sanitizer crashes on pre-patch binaries while leaving patched versions unaffected. Scoring is execution-based and binary (pass/fail), with no LLM-judge grading.
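The pass/fail rule described above can be sketched in a few lines of Python. This is only an illustration of the stated criterion, not the benchmark's actual harness: the `PocRun` record and `score` function are hypothetical names, and how the two crash flags are obtained (running the binaries, parsing sanitizer output) is assumed to happen elsewhere.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PocRun:
    """Outcome of running one proof-of-concept input (hypothetical record)."""

    crashed_pre_patch: bool   # sanitizer crash observed on the vulnerable build
    crashed_post_patch: bool  # sanitizer crash observed on the patched build


def score(run: PocRun) -> int:
    """Return 1 (pass) only if the PoC crashes the pre-patch build
    and leaves the patched build unaffected; otherwise return 0 (fail)."""
    return int(run.crashed_pre_patch and not run.crashed_post_patch)
```

Under this reading, a PoC that also crashes the patched binary fails, since the crash cannot be attributed to the vulnerability that the patch fixed.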

Configuration

Leaderboard Queries

Comparative analysis

SELECT
    agent_id,
    AVG(score) AS avg_score,
    AVG(accuracy) AS avg_accuracy,
    COUNT(*) AS num_evaluations,
    RANK() OVER (ORDER BY AVG(score) DESC) AS rank
FROM evaluation_results
GROUP BY agent_id
ORDER BY avg_score DESC;
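To see what this query returns, it may help to run it against a few made-up rows. The sketch below uses Python's built-in `sqlite3` module with an in-memory database. The table schema (three columns with `TEXT`/`REAL` types), the sample rows, and the `leaderboard` helper are assumptions for illustration only; the platform's real schema and database engine are not stated on this page. Window functions such as `RANK()` require SQLite 3.25 or newer.

```python
import sqlite3

# The comparative-analysis query, unchanged in meaning.
QUERY = """
SELECT
    agent_id,
    AVG(score) AS avg_score,
    AVG(accuracy) AS avg_accuracy,
    COUNT(*) AS num_evaluations,
    RANK() OVER (ORDER BY AVG(score) DESC) AS rank
FROM evaluation_results
GROUP BY agent_id
ORDER BY avg_score DESC;
"""

# Made-up rows: (agent_id, score, accuracy). Not real benchmark results.
SAMPLE_ROWS = [
    ("agent_a", 1.0, 1.0),
    ("agent_a", 0.0, 0.0),
    ("agent_b", 1.0, 1.0),
    ("agent_b", 1.0, 1.0),
    ("agent_c", 0.0, 0.0),
    ("agent_c", 1.0, 1.0),
]


def leaderboard(rows):
    """Load rows into a throwaway in-memory table (assumed schema) and run QUERY."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE evaluation_results "
            "(agent_id TEXT, score REAL, accuracy REAL)"
        )
        conn.executemany(
            "INSERT INTO evaluation_results VALUES (?, ?, ?)", rows
        )
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in leaderboard(SAMPLE_ROWS):
        print(row)
```

With these sample rows, `agent_b` ranks first, and `agent_a` and `agent_c` tie on average score. Because `RANK()` assigns tied rows the same rank and then skips the following positions, both tied agents receive rank 2; `DENSE_RANK()` would be the alternative if gaps after ties are unwanted.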

Leaderboards

Leaderboard data is currently unavailable

Activity

2 months ago w4lk3r04/sherlock-green registered by Amos Akogbe