aipolicybench2 Leaderboard results

By momoway 3 weeks ago

Category: Web Agent

Leaderboard Queries
Leaderboard
SELECT
  id,
  COUNT(*) AS total_queries,
  SUM(CASE WHEN evaluation_result = 'correct' THEN 1 ELSE 0 END) AS correct,
  SUM(CASE WHEN evaluation_result = 'hallucination' THEN 1 ELSE 0 END) AS hallucinations,
  SUM(CASE WHEN evaluation_result = 'miss' THEN 1 ELSE 0 END) AS misses,
  ROUND(100.0 * SUM(CASE WHEN evaluation_result = 'correct' THEN 1 ELSE 0 END) / COUNT(*), 2) AS correct_rate,
  ROUND(
    100.0 * SUM(CASE WHEN evaluation_result = 'correct' THEN 1 ELSE 0 END) / COUNT(*)
    + 100.0 * SUM(CASE WHEN evaluation_result = 'miss' THEN 1 ELSE 0 END) / COUNT(*)
    - 100.0 * SUM(CASE WHEN evaluation_result = 'hallucination' THEN 1 ELSE 0 END) / COUNT(*),
    2
  ) AS factuality_rate
FROM results
GROUP BY id
ORDER BY factuality_rate DESC
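
To see what the query above returns, here is a minimal sketch that runs it against an in-memory SQLite table. Only the table name `results` and the columns `id` and `evaluation_result` come from the query. The column types, the helper `run_leaderboard`, and the sample rows (`model_a`, `model_b`) are assumptions made up for illustration, not published results. The leaderboard's real backend may also use a different SQL engine. Note that, as written, `factuality_rate` adds the miss rate to the correct rate before subtracting the hallucination rate.

```python
import sqlite3

# The leaderboard query, copied from the page (whitespace only changed).
QUERY = """
SELECT
  id,
  COUNT(*) AS total_queries,
  SUM(CASE WHEN evaluation_result = 'correct' THEN 1 ELSE 0 END) AS correct,
  SUM(CASE WHEN evaluation_result = 'hallucination' THEN 1 ELSE 0 END) AS hallucinations,
  SUM(CASE WHEN evaluation_result = 'miss' THEN 1 ELSE 0 END) AS misses,
  ROUND(100.0 * SUM(CASE WHEN evaluation_result = 'correct' THEN 1 ELSE 0 END) / COUNT(*), 2) AS correct_rate,
  ROUND(
    100.0 * SUM(CASE WHEN evaluation_result = 'correct' THEN 1 ELSE 0 END) / COUNT(*)
    + 100.0 * SUM(CASE WHEN evaluation_result = 'miss' THEN 1 ELSE 0 END) / COUNT(*)
    - 100.0 * SUM(CASE WHEN evaluation_result = 'hallucination' THEN 1 ELSE 0 END) / COUNT(*),
    2
  ) AS factuality_rate
FROM results
GROUP BY id
ORDER BY factuality_rate DESC
"""

# Hypothetical sample data: (id, evaluation_result) pairs, invented for this sketch.
SAMPLE_ROWS = [
    ("model_a", "correct"),
    ("model_a", "correct"),
    ("model_a", "miss"),
    ("model_a", "hallucination"),
    ("model_b", "correct"),
    ("model_b", "hallucination"),
    ("model_b", "hallucination"),
    ("model_b", "miss"),
]


def run_leaderboard(rows):
    """Load rows into an in-memory `results` table and run the leaderboard query."""
    conn = sqlite3.connect(":memory:")
    try:
        # Assumed schema: the page does not show the real column types.
        conn.execute("CREATE TABLE results (id TEXT, evaluation_result TEXT)")
        conn.executemany("INSERT INTO results VALUES (?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in run_leaderboard(SAMPLE_ROWS):
        # Columns: id, total_queries, correct, hallucinations, misses,
        # correct_rate, factuality_rate
        print(row)
```

Each returned tuple follows the column order of the `SELECT` list, with the highest `factuality_rate` first.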

Leaderboards

This leaderboard has not published any results yet.

Last updated 3 weeks ago · d35f82d

Activity

3 weeks ago momoway/aipolicybench2 registered by Runyuan He