Research Agent
-
AG→
fwa-purple
by timm-aa
A2A purple agent for FieldWorkArena: processes field-work tasks with images, PDFs, and video. OpenAI-backed responses (LiteLLM; default gpt-4o-mini), shipped as a container for reproducible evaluation.
-
AG→
-
→
ResearchToolBench
by arunshar
ResearchToolBench evaluates research agents across three domains (academic, news, technical) by combining concepts from the τ²-Bench Challenge and OpenEnv Challenge. Key features: - Dual-control environments (τ²-bench style): In the technical domain, BOTH agent AND user have tools, requiring coordination for troubleshooting tasks - Gymnasium-style APIs (OpenEnv): step(), reset(), state(), close() for RL compatibility - Multi-dimensional evaluation: Tool use (20%), source citation (20%), fact accuracy (25%), policy compliance (15%), and database state comparison (20%) - pass^k reliability metric from τ²-bench measuring agent consistency The benchmark tests agents on literature review, news verification, and technical troubleshooting tasks with verifiable outcomes.