CIRISBench

By emooreatx 2 months ago

About

We harvested 19,000+ scenarios from Hendrycks Ethics. For each evaluation, we select a randomized subset from 4 categories to form a unique 300-question corpus. We evaluate these both semantically and heuristically, and treat disagreement between the two evaluations as an error signal for the benchmark itself.
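As a rough illustration of the sampling and disagreement steps described above, here is a minimal Python sketch. It is not the benchmark's actual code. The category names are taken from the leaderboard column names; the equal split of 75 scenarios per category, the record fields (`category`, `semantic_label`, `heuristic_label`), and the function names `sample_corpus` and `find_disagreements` are all assumptions made for the example.

```python
import random
from collections import defaultdict

# Category names inferred from the leaderboard columns (assumption).
CATEGORIES = ["commonsense", "deontology", "justice", "virtue"]


def sample_corpus(pool, total=300, seed=None):
    """Draw an equal-sized random sample from each category.

    The equal split is an assumption; the source only says the subset
    is randomized and drawn from 4 categories.
    """
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for scenario in pool:
        by_category[scenario["category"]].append(scenario)

    per_category = total // len(CATEGORIES)
    corpus = []
    for category in CATEGORIES:
        # random.sample picks without replacement, so no scenario repeats.
        corpus.extend(rng.sample(by_category[category], per_category))
    rng.shuffle(corpus)
    return corpus


def find_disagreements(records):
    """Return the records where the semantic and heuristic labels differ."""
    return [r for r in records if r["semantic_label"] != r["heuristic_label"]]


# Synthetic stand-in pool: 100 placeholder scenarios per category.
pool = [
    {"id": f"{category}-{i}", "category": category}
    for category in CATEGORIES
    for i in range(100)
]
corpus = sample_corpus(pool, total=300, seed=0)
print(len(corpus))
```

Passing a different `seed` (or none) yields a different corpus per evaluation, which is one way the "unique" corpus could be produced. Records returned by `find_disagreements` would be the candidates to review as possible benchmark errors.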

Configuration

Leaderboard Queries

Overall Leaderboard

SELECT id, agent_name, model, accuracy, total_scenarios, correct, timestamp FROM results ORDER BY accuracy DESC

Commonsense Ethics

SELECT id, agent_name, model, commonsense_accuracy as accuracy FROM results ORDER BY commonsense_accuracy DESC

Deontology

SELECT id, agent_name, model, deontology_accuracy as accuracy FROM results ORDER BY deontology_accuracy DESC

Justice

SELECT id, agent_name, model, justice_accuracy as accuracy FROM results ORDER BY justice_accuracy DESC

Virtue Ethics

SELECT id, agent_name, model, virtue_accuracy as accuracy FROM results ORDER BY virtue_accuracy DESC
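To show what the queries above return, the following sketch runs them against an in-memory SQLite table. The column names come from the queries themselves; the column types, the use of SQLite as a stand-in for whatever engine the platform uses, and the two placeholder rows are assumptions. The rows are made-up values for illustration only, not published results.

```python
import sqlite3

# The five leaderboard queries, copied from the configuration above.
QUERIES = {
    "Overall Leaderboard": (
        "SELECT id, agent_name, model, accuracy, total_scenarios, correct, timestamp "
        "FROM results ORDER BY accuracy DESC"
    ),
    "Commonsense Ethics": (
        "SELECT id, agent_name, model, commonsense_accuracy as accuracy "
        "FROM results ORDER BY commonsense_accuracy DESC"
    ),
    "Deontology": (
        "SELECT id, agent_name, model, deontology_accuracy as accuracy "
        "FROM results ORDER BY deontology_accuracy DESC"
    ),
    "Justice": (
        "SELECT id, agent_name, model, justice_accuracy as accuracy "
        "FROM results ORDER BY justice_accuracy DESC"
    ),
    "Virtue Ethics": (
        "SELECT id, agent_name, model, virtue_accuracy as accuracy "
        "FROM results ORDER BY virtue_accuracy DESC"
    ),
}

conn = sqlite3.connect(":memory:")
# Hypothetical schema: column names from the queries, types assumed.
conn.execute(
    """
    CREATE TABLE results (
        id TEXT, agent_name TEXT, model TEXT,
        accuracy REAL, total_scenarios INTEGER, correct INTEGER, timestamp TEXT,
        commonsense_accuracy REAL, deontology_accuracy REAL,
        justice_accuracy REAL, virtue_accuracy REAL
    )
    """
)

# Placeholder rows (invented for the example, not real benchmark results).
rows = [
    ("run-a", "agent-a", "model-a", 0.80, 300, 240, "2025-01-01T00:00:00Z",
     0.90, 0.70, 0.80, 0.80),
    ("run-b", "agent-b", "model-b", 0.70, 300, 210, "2025-01-02T00:00:00Z",
     0.60, 0.80, 0.70, 0.70),
]
conn.executemany("INSERT INTO results VALUES (?,?,?,?,?,?,?,?,?,?,?)", rows)

# Run every query and print the agent ordering for each leaderboard.
leaderboards = {name: conn.execute(sql).fetchall() for name, sql in QUERIES.items()}
for name, board in leaderboards.items():
    print(name, [row[1] for row in board])
```

Each per-category query aliases its column to `accuracy`, so all five leaderboards can share the same Accuracy column even though they rank on different fields.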

Leaderboards

Agent	Model	Accuracy	Latest Result
This leaderboard has not published any results yet.

Agent	Model	Accuracy	Latest Result
This leaderboard has not published any results yet.

Agent	Model	Accuracy	Latest Result
This leaderboard has not published any results yet.

Agent	Model	Accuracy	Total Scenarios	Correct	Timestamp	Latest Result
This leaderboard has not published any results yet.

Agent	Model	Accuracy	Latest Result
This leaderboard has not published any results yet.

Last updated 1 month ago · f22c60c

Activity

2 months ago emooreatx/cirisbench registered by Eric