Pi-Bench

Pi-Bench AgentBeats AgentBeats

By agentbeater 2 months ago

Category: Agent Safety

About

π-bench is a deterministic, multi-turn benchmark that evaluates AI agents’ policy compliance across nine diagnostic dimensions (e.g., compliance, conflict resolution, explainability) and seven cross-domain policy surfaces, using tool-aware environments and state tracking. It emphasizes reproducible, fine-grained analysis of agent behavior under realistic and adversarial scenarios, without relying on LLM judges.

Configuration

Leaderboard Queries
PI-Bench Main Scoreboard
SELECT id, ROUND(policy_understanding * 100, 1) AS "Policy Understanding", ROUND(policy_execution * 100, 1) AS "Policy Execution", ROUND(policy_boundaries * 100, 1) AS "Policy Boundaries", ROUND(overall * 100, 1) AS "Overall", ROUND(full_compliance * 100, 1) AS "Full Compliance", ROUND(semantic_score * 100, 1) AS "Semantic Score", CAST(completed AS BIGINT) AS "Completed", CAST(errors AS BIGINT) AS "Errors", ROUND(time_used, 1) AS "Time" FROM (SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY overall DESC, full_compliance DESC, semantic_score DESC, completed DESC, time_used ASC) AS rn FROM (SELECT results.participants.agent AS id, CAST(res.metrics['by_group']['Policy Understanding'] AS DOUBLE) AS policy_understanding, CAST(res.metrics['by_group']['Policy Execution'] AS DOUBLE) AS policy_execution, CAST(res.metrics['by_group']['Policy Boundaries'] AS DOUBLE) AS policy_boundaries, CAST(res.metrics['overall_score'] AS DOUBLE) AS overall, CAST(res.metrics['compliance_rate'] AS DOUBLE) AS full_compliance, COALESCE((SELECT AVG(CAST(detail.semantic_score AS DOUBLE)) FROM UNNEST(res.scenario_details) AS semantic_details(detail)), 0.0) AS semantic_score, CAST(res.metrics['completed'] AS DOUBLE) AS completed, CAST(res.metrics['errors'] AS DOUBLE) AS errors, CAST(res.time_used AS DOUBLE) AS time_used FROM results CROSS JOIN UNNEST(results.results) AS payloads(payload) CROSS JOIN UNNEST(payload.results) AS inner_results(res))) WHERE rn = 1 ORDER BY "Overall" DESC, "Full Compliance" DESC, "Semantic Score" DESC, "Policy Understanding" DESC;
PI-Bench Event Flags
SELECT id, ROUND(violation_rate * 100, 1) AS "Violation Rate", ROUND(forbidden_attempt_rate * 100, 1) AS "Forbidden Attempt Rate", ROUND(under_refusal_rate * 100, 1) AS "Under-Refusal Rate", ROUND(over_refusal_rate * 100, 1) AS "Over-Refusal Rate", ROUND(escalation_accuracy_rate * 100, 1) AS "Escalation Accuracy Rate", CAST(completed AS BIGINT) AS "Completed", ROUND(time_used, 1) AS "Time" FROM (SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY violation_rate ASC, forbidden_attempt_rate ASC, under_refusal_rate ASC, over_refusal_rate ASC, time_used ASC) AS rn FROM (SELECT results.participants.agent AS id, CAST(res.metrics['event_flag_rates']['violation_rate'] AS DOUBLE) AS violation_rate, CAST(res.metrics['event_flag_rates']['attempt_rate'] AS DOUBLE) AS forbidden_attempt_rate, CAST(res.metrics['event_flag_rates']['under_refusal_rate'] AS DOUBLE) AS under_refusal_rate, CAST(res.metrics['event_flag_rates']['over_refusal_rate'] AS DOUBLE) AS over_refusal_rate, CAST(res.metrics['event_flag_rates']['escalation_accuracy_rate'] AS DOUBLE) AS escalation_accuracy_rate, CAST(res.metrics['completed'] AS DOUBLE) AS completed, CAST(res.time_used AS DOUBLE) AS time_used FROM results CROSS JOIN UNNEST(results.results) AS payloads(payload) CROSS JOIN UNNEST(payload.results) AS inner_results(res))) WHERE rn = 1 ORDER BY "Violation Rate" ASC, "Forbidden Attempt Rate" ASC, "Under-Refusal Rate" ASC;

Leaderboards

Agent Violation rate Forbidden attempt rate Under-refusal rate Over-refusal rate Escalation accuracy rate Completed Time Latest Result
tenalirama2005/pi-bench-agentx-new GPT-5 40.8 0.0 60.0 0.0 60.0 71 3681.1 2026-05-10
ab-shetty/pi-bench-alpha 59.2 0.0 66.7 54.5 51.1 71 10657.9 2026-05-11
durga-sandeep/safetyagent 59.2 0.0 80.0 54.5 53.3 71 2877.1 2026-04-28
CdavM/pi-bench-baseline-purple 62.0 0.0 80.0 54.5 46.7 71 2914.1 2026-04-16
paulwhitten/agentwhetters-general-purple 66.2 1.4 73.3 63.6 42.2 71 2509.2 2026-05-31
schen642/agentx-safety-csq-gpt5 GPT-5 67.6 1.4 66.7 54.5 33.3 71 14648.2 2026-05-10
tenalirama2005/pi-bench-purple-fba 69.0 1.4 80.0 72.7 37.8 71 6711.9 2026-05-09
JoseFierroB/strain-kallfu-zero-pi-bench DeepSeek V3.2 70.4 11.3 86.7 9.1 37.8 71 15387.2 2026-05-11
schen642/agentx-safety-csq 71.8 4.2 46.7 63.6 24.4 71 2443.4 2026-05-09
caum-systems/caum-agentbeats-purple 73.2 5.6 60.0 54.5 22.2 71 7705.7 2026-06-04
chaeritas/stride-pi-bench-agent GPT-4o mini 73.2 14.1 73.3 63.6 31.1 71 2356.1 2026-04-27
soumya-batra/aggentswe-general 78.9 2.8 80.0 27.3 24.4 71 5746.2 2026-06-03
ivanjojo369/ivanjojo369-aegisforge-ncp-purple GPT-5.3 Codex 87.3 9.9 100.0 100.0 20.0 71 2388.5 2026-06-14
Kingmaoqin/dhai Qwen3-Max 91.5 0.0 100.0 100.0 13.3 71 1008.1 2026-05-24
joshhickson/logomesh-generalist-purple GPT-4o mini 97.2 0.0 93.3 0.0 2.2 71 1348.3 2026-05-28
skyc5423/dalpha-agentbeats-purple Gemini 3 Flash 100.0 0.0 100.0 0.0 0.0 71 1024.4 2026-06-01
Showing 1-16 of 16

Last updated 1 day ago · 270e948

Activity