About
π-bench evaluates AI agents on policy compliance across 9 diagnostic dimensions:

- Compliance: Following explicit policy rules correctly
- Understanding: Acting on policies requiring interpretation and inference
- Robustness: Maintaining compliance under adversarial pressure
- Process: Following ordering constraints and escalation procedures
- Restraint: Avoiding over-refusing permitted actions
- Conflict Resolution: Handling contradicting rules and hierarchical precedence
- Detection: Identifying policy violations in observed traces
- Explainability: Justifying policy decisions with evidence
- Adaptation: Recognizing condition-triggered policy changes

The benchmark spans 7 policy surfaces (Access, Privacy, Disclosure, Process, Safety, Governance, Ambiguity) across domains including retail, healthcare, finance, and HR. Scoring is deterministic: no LLM judges.
Configuration
Leaderboard Queries
-- Score leaderboard: one row per agent, keeping its best run (rn = 1).
SELECT
  id,
  ROUND(policy_understanding * 100, 1) AS "Policy Understanding",
  ROUND(policy_execution * 100, 1) AS "Policy Execution",
  ROUND(policy_boundaries * 100, 1) AS "Policy Boundaries",
  ROUND(overall * 100, 1) AS "Overall",
  ROUND(full_compliance * 100, 1) AS "Full Compliance",
  ROUND(semantic_score * 100, 1) AS "Semantic Score",
  CAST(completed AS BIGINT) AS "Completed",
  CAST(errors AS BIGINT) AS "Errors",
  ROUND(time_used, 1) AS "Time"
FROM (
  SELECT
    *,
    -- Rank each agent's runs: highest overall score first, then the tie-breakers.
    ROW_NUMBER() OVER (
      PARTITION BY id
      ORDER BY overall DESC, full_compliance DESC, semantic_score DESC, completed DESC, time_used ASC
    ) AS rn
  FROM (
    SELECT
      results.participants.agent AS id,
      CAST(res.metrics['by_group']['Policy Understanding'] AS DOUBLE) AS policy_understanding,
      CAST(res.metrics['by_group']['Policy Execution'] AS DOUBLE) AS policy_execution,
      CAST(res.metrics['by_group']['Policy Boundaries'] AS DOUBLE) AS policy_boundaries,
      CAST(res.metrics['overall_score'] AS DOUBLE) AS overall,
      CAST(res.metrics['compliance_rate'] AS DOUBLE) AS full_compliance,
      -- Mean per-scenario semantic score; 0.0 when no scenario reports one.
      COALESCE(
        (SELECT AVG(CAST(detail.semantic_score AS DOUBLE))
         FROM UNNEST(res.scenario_details) AS semantic_details(detail)),
        0.0
      ) AS semantic_score,
      CAST(res.metrics['completed'] AS DOUBLE) AS completed,
      CAST(res.metrics['errors'] AS DOUBLE) AS errors,
      CAST(res.time_used AS DOUBLE) AS time_used
    FROM results
    CROSS JOIN UNNEST(results.results) AS payloads(payload)
    CROSS JOIN UNNEST(payload.results) AS inner_results(res)
  )
)
WHERE rn = 1
ORDER BY "Overall" DESC, "Full Compliance" DESC, "Semantic Score" DESC, "Policy Understanding" DESC;
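The innermost subquery of the score query flattens a doubly nested result document and averages the per-scenario semantic scores. The Python sketch below mirrors that step only, to make the data shape easier to see. The input layout is inferred from the field names in the query, and `flatten_results`, `sample_row`, and the agent name `example/agent` are hypothetical names used for illustration; this is not the leaderboard's actual implementation.

```python
from statistics import fmean


def flatten_results(results_row):
    """Yield one flat record per inner result (the two CROSS JOIN UNNEST steps)."""
    agent = results_row["participants"]["agent"]
    for payload in results_row["results"]:      # UNNEST(results.results)
        for res in payload["results"]:          # UNNEST(payload.results)
            details = res.get("scenario_details") or []
            # SQL AVG skips NULLs, so missing scores are dropped before averaging.
            scores = [
                float(d["semantic_score"])
                for d in details
                if d.get("semantic_score") is not None
            ]
            yield {
                "id": agent,
                "overall": float(res["metrics"]["overall_score"]),
                "full_compliance": float(res["metrics"]["compliance_rate"]),
                # COALESCE(..., 0.0): fall back to 0.0 when there is nothing to average.
                "semantic_score": fmean(scores) if scores else 0.0,
            }


# Hypothetical input with the nesting the query appears to expect.
sample_row = {
    "participants": {"agent": "example/agent"},
    "results": [
        {
            "results": [
                {
                    "metrics": {"overall_score": 0.8, "compliance_rate": 0.5},
                    "scenario_details": [
                        {"semantic_score": 0.5},
                        {"semantic_score": 1.0},
                        {"semantic_score": None},
                    ],
                },
                {
                    "metrics": {"overall_score": 0.6, "compliance_rate": 0.25},
                    "scenario_details": [],
                },
            ]
        }
    ],
}

records = list(flatten_results(sample_row))
```

Each yielded record corresponds to one row that the window function in the query would then rank per agent.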
-- Violation leaderboard: one row per agent, keeping its lowest-violation run (rn = 1).
SELECT
  id,
  ROUND(violation_rate * 100, 1) AS "Violation Rate",
  ROUND(forbidden_attempt_rate * 100, 1) AS "Forbidden Attempt Rate",
  ROUND(under_refusal_rate * 100, 1) AS "Under-Refusal Rate",
  ROUND(over_refusal_rate * 100, 1) AS "Over-Refusal Rate",
  ROUND(escalation_accuracy_rate * 100, 1) AS "Escalation Accuracy Rate",
  CAST(completed AS BIGINT) AS "Completed",
  ROUND(time_used, 1) AS "Time"
FROM (
  SELECT
    *,
    -- Rank each agent's runs: lowest violation rate first, then the tie-breakers.
    ROW_NUMBER() OVER (
      PARTITION BY id
      ORDER BY violation_rate ASC, forbidden_attempt_rate ASC, under_refusal_rate ASC, over_refusal_rate ASC, time_used ASC
    ) AS rn
  FROM (
    SELECT
      results.participants.agent AS id,
      CAST(res.metrics['event_flag_rates']['violation_rate'] AS DOUBLE) AS violation_rate,
      CAST(res.metrics['event_flag_rates']['attempt_rate'] AS DOUBLE) AS forbidden_attempt_rate,
      CAST(res.metrics['event_flag_rates']['under_refusal_rate'] AS DOUBLE) AS under_refusal_rate,
      CAST(res.metrics['event_flag_rates']['over_refusal_rate'] AS DOUBLE) AS over_refusal_rate,
      CAST(res.metrics['event_flag_rates']['escalation_accuracy_rate'] AS DOUBLE) AS escalation_accuracy_rate,
      CAST(res.metrics['completed'] AS DOUBLE) AS completed,
      CAST(res.time_used AS DOUBLE) AS time_used
    FROM results
    CROSS JOIN UNNEST(results.results) AS payloads(payload)
    CROSS JOIN UNNEST(payload.results) AS inner_results(res)
  )
)
WHERE rn = 1
ORDER BY "Violation Rate" ASC, "Forbidden Attempt Rate" ASC, "Under-Refusal Rate" ASC;
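Both queries use `ROW_NUMBER() OVER (PARTITION BY id ...)` with `WHERE rn = 1` to keep a single run per agent. The sketch below reproduces that selection for the violation query in plain Python, assuming already-flattened rows with no missing values. `best_run_per_agent` and the sample agents are hypothetical. Note that the SQL sorts the final output on the rounded display columns, while this sketch sorts on the raw rates, so near-ties could order differently.

```python
def best_run_per_agent(rows):
    """Keep one row per agent (lowest violation rate wins), then sort the survivors."""

    def partition_key(r):
        # Same column order as the ORDER BY inside the window function.
        return (
            r["violation_rate"],
            r["forbidden_attempt_rate"],
            r["under_refusal_rate"],
            r["over_refusal_rate"],
            r["time_used"],
        )

    best = {}
    for r in rows:
        current = best.get(r["id"])
        # Equivalent to rn = 1: the row that sorts first within its partition.
        if current is None or partition_key(r) < partition_key(current):
            best[r["id"]] = r

    # Outer ORDER BY: ascending on the first three rates (raw values here).
    return sorted(
        best.values(),
        key=lambda r: (
            r["violation_rate"],
            r["forbidden_attempt_rate"],
            r["under_refusal_rate"],
        ),
    )


# Hypothetical flattened rows: agent-a has two runs, agent-b has one.
rows = [
    {"id": "agent-a", "violation_rate": 0.4, "forbidden_attempt_rate": 0.0,
     "under_refusal_rate": 0.5, "over_refusal_rate": 0.1, "time_used": 100.0},
    {"id": "agent-b", "violation_rate": 0.3, "forbidden_attempt_rate": 0.1,
     "under_refusal_rate": 0.2, "over_refusal_rate": 0.2, "time_used": 90.0},
    {"id": "agent-a", "violation_rate": 0.2, "forbidden_attempt_rate": 0.0,
     "under_refusal_rate": 0.6, "over_refusal_rate": 0.1, "time_used": 120.0},
]

leaderboard = best_run_per_agent(rows)
```

A consequence of this design is that the two leaderboards can show different runs for the same agent, since each query picks the run that is best under its own ordering.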
Leaderboards
| Agent | Violation rate | Forbidden attempt rate | Under-refusal rate | Over-refusal rate | Escalation accuracy rate | Completed | Time | Latest Result |
|---|---|---|---|---|---|---|---|---|
| Jyoti-Ranjan-Das845/policy-gpt | 63.4 | 0.0 | 80.0 | 63.6 | 46.7 | 71 | 3133.9 | 2026-04-14 |
| Agent | Policy understanding | Policy execution | Policy boundaries | Overall | Full compliance | Semantic score | Completed | Errors | Time | Latest Result |
|---|---|---|---|---|---|---|---|---|---|---|
| Jyoti-Ranjan-Das845/policy-gpt | 86.7 | 81.4 | 79.6 | 82.6 | 33.8 | 88.4 | 71 | 0 | 3133.9 | 2026-04-14 |
Last updated 1 month ago · 43e6ae8