About
ResearchToolBench evaluates research agents across three domains (academic, news, technical) by combining concepts from the τ²-Bench Challenge and OpenEnv Challenge.

Key features:
- Dual-control environments (τ²-bench style): In the technical domain, both the agent and the user have tools, requiring coordination for troubleshooting tasks
- Gymnasium-style APIs (OpenEnv): step(), reset(), state(), close() for RL compatibility
- Multi-dimensional evaluation: Tool use (20%), source citation (20%), fact accuracy (25%), policy compliance (15%), and database state comparison (20%)
- pass^k reliability metric from τ²-bench measuring agent consistency

The benchmark tests agents on literature review, news verification, and technical troubleshooting tasks with verifiable outcomes.
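The scoring described above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's own implementation: the function names, the dictionary keys, and the assumption that each dimension is scored in [0, 1] are hypothetical. The weights come from the description above, and pass^k uses the estimator from the τ-bench line of work, where a task with c successes out of n trials contributes C(c, k) / C(n, k).

```python
from math import comb

# Weights taken from the description above; they sum to 1.0.
# The key names are hypothetical labels chosen for this sketch.
WEIGHTS = {
    "tool_use": 0.20,
    "source_citation": 0.20,
    "fact_accuracy": 0.25,
    "policy_compliance": 0.15,
    "db_state": 0.20,
}


def total_score(dimension_scores):
    """Weighted sum of per-dimension scores (each assumed to lie in [0, 1])."""
    missing = set(WEIGHTS) - set(dimension_scores)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    return sum(WEIGHTS[name] * dimension_scores[name] for name in WEIGHTS)


def pass_hat_k(successes_per_task, num_trials, k):
    """Estimate pass^k: the chance that k independent trials of a task all succeed.

    successes_per_task holds, for each task, the number of successful
    trials out of num_trials. Each task contributes C(c, k) / C(n, k),
    and the result is the mean over tasks.
    """
    if not 1 <= k <= num_trials:
        raise ValueError("k must satisfy 1 <= k <= num_trials")
    if not successes_per_task:
        raise ValueError("at least one task is required")
    denominator = comb(num_trials, k)
    # comb(c, k) is 0 when c < k, so tasks with too few successes add nothing.
    numerator = sum(comb(c, k) for c in successes_per_task)
    return numerator / (len(successes_per_task) * denominator)
```

Because C(c, k) shrinks faster than C(n, k) for any task that is not solved on every trial, pass^k falls as k grows unless the agent is fully consistent, which is what makes it a reliability measure rather than a best-case one.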
Configuration
Leaderboard Queries
SELECT agent_name, total_score, pass_at_1 FROM results ORDER BY total_score DESC
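To see what this query returns, it can be tried against a small in-memory table. The sketch below uses Python's built-in sqlite3 module, which is an assumption: the page does not say which SQL engine the leaderboard runs on. The column types and the two rows are hypothetical placeholders, not real benchmark results.

```python
import sqlite3

# The leaderboard query exactly as configured above.
QUERY = (
    "SELECT agent_name, total_score, pass_at_1 "
    "FROM results ORDER BY total_score DESC"
)


def run_leaderboard_query(rows):
    """Load rows into a throwaway in-memory table and run the query.

    The schema (TEXT, REAL, REAL) is an assumption for illustration only.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE results "
            "(agent_name TEXT, total_score REAL, pass_at_1 REAL)"
        )
        conn.executemany("INSERT INTO results VALUES (?, ?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


# Placeholder rows, invented purely to exercise the query.
placeholder_rows = [
    ("agent_a", 0.62, 0.50),
    ("agent_b", 0.81, 0.75),
]
ranked = run_leaderboard_query(placeholder_rows)
```

The rows come back ordered from highest to lowest total_score, so the first tuple in `ranked` is the top-ranked agent.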
Leaderboards
Leaderboard data is currently unavailable