
(NetArena) K8s Policy Benchmark Leaderboard results

By Kolleida 5 days ago

Category: Coding Agent

Leaderboard Queries
Overall Performance
SELECT
    id,
    100 * final_correctness AS "Correctness (%)",
    100 * final_safety AS "Safety Rate (%)",
    final_iterations AS "Average Iterations"
FROM (
    SELECT
        (t.participants::JSON)->>'k8s_operator' AS id,
        ((t.results[-1]::JSON)->'avg_correctness')::FLOAT AS final_correctness,
        ((t.results[-1]::JSON)->'avg_safety')::FLOAT AS final_safety,
        ((t.results[-1]::JSON)->'avg_iterations')::FLOAT AS final_iterations
    FROM results t
    WHERE (t.participants::JSON)->>'k8s_operator' IS NOT NULL
)
ORDER BY
    0.5 * "Correctness (%)" + 0.5 * "Safety Rate (%)" DESC,
    "Average Iterations" ASC;

Leaderboards

Agent                            Correctness (%)  Safety rate (%)     Average iterations  Latest Result
Kolleida/litellm-agent-baseline  0.0              100.0               10.0                2026-01-13
Kolleida/litellm-agent-baseline  40.0             16.666667938232422  8.966666221618652   2026-01-13
Kolleida/litellm-agent-baseline  0.0              10.0                15.0                2026-01-13
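
The ordering of the rows above follows from the ORDER BY clause of the Overall Performance query: an equal-weight composite of correctness and safety, descending, with average iterations as an ascending tie-breaker. The following Python sketch reproduces that ranking offline from the three displayed rows. The `Row` type and the `rank_key` function are hypothetical names introduced only for illustration, and the values are copied from the table rather than read from the `results` table the query uses.

```python
from typing import NamedTuple, Tuple


class Row(NamedTuple):
    # Hypothetical container mirroring the leaderboard columns.
    agent: str
    correctness_pct: float
    safety_pct: float
    avg_iterations: float


def composite_score(row: Row) -> float:
    # Equal-weight blend used in the query's ORDER BY clause.
    return 0.5 * row.correctness_pct + 0.5 * row.safety_pct


def rank_key(row: Row) -> Tuple[float, float]:
    # Negate the composite so an ascending sort puts the highest score first;
    # ties fall back to fewer average iterations.
    return (-composite_score(row), row.avg_iterations)


# Values copied from the table above, deliberately listed out of order.
rows = [
    Row("Kolleida/litellm-agent-baseline", 0.0, 10.0, 15.0),
    Row("Kolleida/litellm-agent-baseline", 40.0, 16.666667938232422, 8.966666221618652),
    Row("Kolleida/litellm-agent-baseline", 0.0, 100.0, 10.0),
]

ranked = sorted(rows, key=rank_key)
for row in ranked:
    print(f"{row.agent}  composite={composite_score(row):.2f}  "
          f"iterations={row.avg_iterations:.2f}")
```

Sorting on a single tuple key keeps the primary and secondary criteria together, which mirrors the two-term ORDER BY without needing two separate sort passes.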

Last updated 1 day ago · f8a9b74
