ResearchToolBench

By arunshar 2 months ago

About

ResearchToolBench evaluates research agents across three domains (academic, news, technical) by combining concepts from the τ²-Bench Challenge and OpenEnv Challenge.

Key features:
- Dual-control environments (τ²-bench style): In the technical domain, BOTH agent AND user have tools, requiring coordination for troubleshooting tasks
- Gymnasium-style APIs (OpenEnv): step(), reset(), state(), close() for RL compatibility
- Multi-dimensional evaluation: Tool use (20%), source citation (20%), fact accuracy (25%), policy compliance (15%), and database state comparison (20%)
- pass^k reliability metric from τ²-bench measuring agent consistency

The benchmark tests agents on literature review, news verification, and technical troubleshooting tasks with verifiable outcomes.
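
As a rough illustration of how the evaluation weights and the pass^k metric listed above could be computed, here is a minimal Python sketch. The function names (`total_score`, `pass_hat_k`), the dictionary keys, and the assumption that each dimension is scored in [0, 1] are hypothetical and not taken from the benchmark's code. The pass^k estimator shown is the commonly used combinatorial form, C(c, k) / C(n, k) averaged over tasks, where c is the number of successful trials out of n; the benchmark's own implementation may differ.

```python
from math import comb

# Weights as listed in the description; the key names are hypothetical.
WEIGHTS = {
    "tool_use": 0.20,
    "source_citation": 0.20,
    "fact_accuracy": 0.25,
    "policy_compliance": 0.15,
    "db_state": 0.20,
}


def total_score(dimension_scores):
    """Weighted sum of per-dimension scores (each assumed to be in [0, 1])."""
    missing = set(WEIGHTS) - set(dimension_scores)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    return sum(WEIGHTS[name] * dimension_scores[name] for name in WEIGHTS)


def pass_hat_k(successes_per_task, n_trials, k):
    """Estimate pass^k: the chance that k trials of a task all succeed.

    successes_per_task holds, for each task, the number of successful
    trials out of n_trials. The per-task estimate is C(c, k) / C(n, k),
    and the result is the mean over tasks.
    """
    if not successes_per_task:
        raise ValueError("at least one task is required")
    if not 1 <= k <= n_trials:
        raise ValueError("k must satisfy 1 <= k <= n_trials")
    # math.comb(c, k) is 0 when k > c, so tasks with too few successes add 0.
    numerator = sum(comb(c, k) for c in successes_per_task)
    return numerator / (comb(n_trials, k) * len(successes_per_task))
```

With k = 1 the estimator reduces to the plain average success rate, while larger k penalizes agents that solve a task only some of the time, which is what makes it a consistency measure.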

Configuration

Leaderboard Queries

Overall Score

SELECT agent_name, total_score, pass_at_1 FROM results ORDER BY total_score DESC
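
To show what this query returns, the sketch below runs it against an in-memory SQLite table. The table schema, the column types, and the helper name `run_leaderboard_query` are assumptions for illustration only; the platform's actual query engine and the layout of `results` are not stated here, and SQLite is used purely as a stand-in.

```python
import sqlite3

# The query exactly as configured for the "Overall Score" leaderboard.
QUERY = (
    "SELECT agent_name, total_score, pass_at_1 "
    "FROM results ORDER BY total_score DESC"
)


def run_leaderboard_query(rows):
    """Load (agent_name, total_score, pass_at_1) rows and run the query.

    The schema below is a guess at what `results` might look like.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE results ("
            "agent_name TEXT, total_score REAL, pass_at_1 REAL)"
        )
        conn.executemany("INSERT INTO results VALUES (?, ?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    # Placeholder rows, made up for this example; not real benchmark results.
    sample = [("agent_a", 0.5, 0.25), ("agent_b", 0.75, 0.5)]
    for row in run_leaderboard_query(sample):
        print(row)
```

The query sorts by `total_score` in descending order, so the highest-scoring agent appears first.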

Leaderboards

Leaderboard data is currently unavailable

Activity

2 months ago arunshar/researchtoolbench registered by Arun Sharma