Multi-agent Evaluation
g-agent
by harshada-javeri
Our Green Agent evaluates an agent’s ability to perform end-to-end, real-world reasoning tasks that require multi-step planning, tool usage, verification, and error recovery. Built by agentifying and extending the GAIA benchmark, the agent executes tasks such as information synthesis, structured reasoning, tool-assisted research, and correctness validation under explicit constraints. Rather than scoring single-turn answers, the benchmark measures outcome validity, spec compliance, hallucination resistance, and agent reliability across full task trajectories. Automated graders and verifier agents assess whether tasks are completed correctly, safely, and reproducibly, including detection of partial completion, unsupported claims, and policy violations. This enables robust evaluation of agentic behavior beyond prompt-based performance.
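The grading flow described above can be sketched as a small verifier routine. This is a minimal illustration, not the benchmark's actual implementation: the `Trajectory` record, the `grade` function, and all field names are hypothetical, and the quasi-exact-match normalization is borrowed from GAIA-style answer scoring. It shows how a grader might combine outcome validity (answer correctness), spec compliance (a step budget), and a crude unsupported-claim check over a full trajectory rather than a single-turn answer.

```python
from dataclasses import dataclass, field

@dataclass
class Trajectory:
    """Hypothetical record of one agent run (field names are illustrative)."""
    final_answer: str
    steps: list = field(default_factory=list)          # tool calls / reasoning steps
    cited_sources: list = field(default_factory=list)  # evidence gathered via tools

def normalize(text: str) -> str:
    """Quasi-exact match: lowercase and keep only alphanumerics."""
    return "".join(ch for ch in text.lower() if ch.isalnum())

def grade(traj: Trajectory, gold_answer: str, max_steps: int = 20) -> dict:
    """Score one trajectory on outcome validity and spec compliance."""
    correct = normalize(traj.final_answer) == normalize(gold_answer)
    within_budget = len(traj.steps) <= max_steps
    # Flag answers produced from tool use but with no cited evidence as
    # potentially unsupported claims (a crude hallucination-resistance proxy).
    supported = bool(traj.cited_sources) or not traj.steps
    return {
        "correct": correct,
        "spec_compliant": within_budget,
        "supported": supported,
        "pass": correct and within_budget and supported,
    }

run = Trajectory(final_answer="Paris ", steps=["search", "read"], cited_sources=["wiki"])
print(grade(run, "paris"))  # all checks pass for this run
```

A real verifier agent would replace the exact-match and evidence checks with model-based judgments, but the aggregation pattern, several orthogonal checks combined into a single pass/fail verdict per trajectory, stays the same.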