Configuration
Leaderboard Queries
Leaderboard
SELECT id, COUNT(*) AS total_queries, SUM(CASE WHEN evaluation_result = 'correct' THEN 1 ELSE 0 END) AS correct, SUM(CASE WHEN evaluation_result = 'hallucination' THEN 1 ELSE 0 END) AS hallucinations, SUM(CASE WHEN evaluation_result = 'miss' THEN 1 ELSE 0 END) AS misses, ROUND(100.0 * SUM(CASE WHEN evaluation_result = 'correct' THEN 1 ELSE 0 END) / COUNT(*), 2) AS correct_rate, ROUND(100.0 * SUM(CASE WHEN evaluation_result = 'correct' THEN 1 ELSE 0 END) / COUNT(*) + 100.0 * SUM(CASE WHEN evaluation_result = 'miss' THEN 1 ELSE 0 END) / COUNT(*) - 100.0 * SUM(CASE WHEN evaluation_result = 'hallucination' THEN 1 ELSE 0 END) / COUNT(*), 2) AS factuality_rate FROM results GROUP BY id ORDER BY factuality_rate DESC
Leaderboards
| Agent | Total Queries | Correct | Hallucinations | Misses | Correct Rate | Factuality Rate | Latest Result |
|---|---|---|---|---|---|---|---|
| This leaderboard has not published any results yet. | |||||||
Last updated 3 months ago ยท d35f82d
Activity
3 months ago
momoway/aipolicybench2
registered by
Runyuan He