Multi-agent Evaluation
-
GAIA with Extension
by zpyuan6
Our green agent evaluates general-purpose assistants on an extended GAIA-style suite of real-world questions that have unambiguous, automatically checkable answers and require multi-step reasoning and robust tool use. We extend GAIA with (1) DocVQA-style document visual question answering tasks, which test understanding of document images, layout, and embedded text, and (2) SealQA-style search-augmented QA tasks, which stress evidence selection and reasoning under noisy or conflicting web results. Together these provide a broader probe of agentic reliability across document-grounded and web-grounded reasoning.
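To make "automatically checkable answers" concrete, here is a minimal sketch of a normalized exact-match scorer. It is an illustration under stated assumptions, not the scorer this green agent or GAIA actually ships: the function names `normalize_answer` and `is_correct` and the specific normalization rules (lowercasing, dropping punctuation and articles, numeric comparison) are hypothetical choices.

```python
import re
import string


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower().strip()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def _as_number(text: str):
    """Parse a plain number, ignoring thousands separators and $ / % signs."""
    cleaned = text.strip().replace(",", "").replace("$", "").replace("%", "")
    try:
        return float(cleaned)
    except ValueError:
        return None


def is_correct(prediction: str, reference: str) -> bool:
    """Assumed rule: numeric references compare by value, others by normalized text."""
    ref_num = _as_number(reference)
    if ref_num is not None:
        pred_num = _as_number(prediction)
        return pred_num is not None and pred_num == ref_num
    return normalize_answer(prediction) == normalize_answer(reference)
```

Checking numbers before stripping punctuation matters: otherwise "3.5" would collapse to "35" and match the wrong reference.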
-
MAS-GraphJudge-Green
by qte77
# Abstract

## GraphJudge: Measuring How Agents Coordinate

**Problem**: Current benchmarks evaluate whether multi-agent systems succeed, not *how* they collaborate. Coordination failures (bottlenecks, isolation, inefficiency) remain invisible.

**Solution**: GraphJudge transforms agent interactions into coordination graphs and evaluates collaboration quality through three tiers:

| Tier | Method | Measures |
|------|--------|----------|
| 1 | Graph Analysis (NetworkX) | Centrality, bottlenecks, isolation |
| 2 | LLM-as-Judge + Latency | Coordination quality, performance |
| 3 | Text Similarity (plugin) | Extensibility demonstration |

**Key Innovation**: No existing AgentBeats benchmark analyzes coordination patterns through graph structure.

**Results**: 0% variance across independent runs, giving deterministic, reproducible evaluation.

**Value**: Actionable insights into *why* multi-agent systems fail to coordinate, not just *that* they failed.

---

See [README.md.md](README.md.md) for introductory info. See [GreenAgent-UserStory.md](GreenAgent-UserStory.md) for full problem statement.
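The Tier 1 idea (turning agent interactions into a graph and reading off centrality, bottlenecks, and isolation) can be sketched in a few lines. This is a dependency-free illustration, not GraphJudge's implementation: the function names, the `(sender, receiver)` message format, and the choice of "cut vertex" as the bottleneck definition are all assumptions. GraphJudge itself is described as using NetworkX, where `nx.degree_centrality`, `nx.isolates`, and `nx.articulation_points` compute comparable quantities.

```python
from collections import defaultdict


def build_graph(messages):
    """Build an undirected adjacency map from (sender, receiver) pairs."""
    neighbors = defaultdict(set)
    for sender, receiver in messages:
        if sender != receiver:
            neighbors[sender].add(receiver)
            neighbors[receiver].add(sender)
    return neighbors


def count_components(nodes, neighbors):
    """Count connected components among `nodes` with a depth-first search."""
    nodes = set(nodes)
    seen, components = set(), 0
    for start in nodes:
        if start in seen:
            continue
        components += 1
        seen.add(start)
        stack = [start]
        while stack:
            current = stack.pop()
            for nxt in neighbors.get(current, ()):
                if nxt in nodes and nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
    return components


def analyze_coordination(agents, messages):
    """Assumes every sender and receiver in `messages` appears in `agents`."""
    neighbors = build_graph(messages)
    n = len(agents)
    # Degree centrality: fraction of the other agents each agent exchanges messages with.
    centrality = {
        a: (len(neighbors.get(a, ())) / (n - 1) if n > 1 else 0.0) for a in agents
    }
    isolated = sorted(a for a in agents if not neighbors.get(a))
    connected = [a for a in agents if neighbors.get(a)]
    base = count_components(connected, neighbors)
    # Assumed bottleneck definition: removing the agent splits the remaining graph.
    bottlenecks = sorted(
        a for a in connected
        if count_components([b for b in connected if b != a], neighbors) > base
    )
    return {"centrality": centrality, "isolated": isolated, "bottlenecks": bottlenecks}


# Hypothetical interaction log: the planner relays everything, the reviewer is never contacted.
agents = ["planner", "coder", "tester", "reviewer"]
messages = [("planner", "coder"), ("planner", "tester"), ("coder", "planner")]
report = analyze_coordination(agents, messages)
```

In this toy log, `report["bottlenecks"]` flags the planner and `report["isolated"]` flags the reviewer, which is the kind of structural signal a success-only benchmark would not surface.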
-
PertBench
by HaoranShao
This green agent evaluates single-cell perturbation significance analysis as a binary QA task. Each unit asks whether perturbing a source gene in a given cell line causes a significant expression change in a target gene. The participant must answer strictly in the format “Final Answer: Yes/No”.
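The strict answer format lends itself to a simple parser. The sketch below is one way a grader could extract the verdict and score a batch; it is not PertBench's actual code. The names `parse_final_answer` and `accuracy`, the case-sensitive match, the choice to take the last occurrence, and treating a malformed reply as incorrect are all assumptions.

```python
import re

# Matches "Final Answer: Yes" or "Final Answer: No"; the lookahead rejects
# continuations such as "Yes/No" or "Nope".
FINAL_ANSWER = re.compile(r"Final Answer:\s*(Yes|No)(?![\w/])")


def parse_final_answer(response: str):
    """Return True for Yes, False for No, or None if the format is not followed."""
    matches = FINAL_ANSWER.findall(response)
    if not matches:
        return None
    # Assumption: if the phrase appears more than once, the last one is the verdict.
    return matches[-1] == "Yes"


def accuracy(responses, labels):
    """Fraction of responses whose parsed verdict equals the boolean label."""
    if not labels:
        return 0.0
    correct = sum(
        parse_final_answer(response) == label
        for response, label in zip(responses, labels)
    )
    return correct / len(labels)
```

Because a malformed reply parses to `None`, it never equals a boolean label and is counted as wrong, which matches a strict reading of the format requirement.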