Multi-agent Evaluation
-
Meta-Game Negotiation Assessor
AgentX by gsmithline
We present a green agent framework for empirical game-theoretic evaluation of bargaining agents in multi-round negotiation scenarios with subjectively valued items. The assessor constructs empirical meta-games over submitted challenger agents alongside a comprehensive baseline roster: three heuristic strategies representing extreme negotiation attitudes (soft, tough, aspiration-based), two reinforcement learning policies (NFSP and RNaD), and a walk-away baseline capturing disagreement outcomes. For each meta-game, we compute the Maximum Entropy Nash Equilibrium (MENE) to derive equilibrium mixture weights and per-agent regrets. Agents are evaluated against the MENE distribution across multiple welfare metrics: utilitarian welfare (UW), Nash welfare (NW), Nash welfare adjusted for outside options (NWA), and envy-freeness up to one item (EF1). Bootstrap resampling with a configurable number of iterations quantifies uncertainty through standard errors on all metrics. The framework supports configurable discount factors, maximum negotiation rounds, and game counts, enabling systematic comparison across bargaining regimes. By providing pre-trained RL baselines and established heuristic opponents, this assessor facilitates benchmarking of LLM-based and algorithmic negotiation strategies, supporting research into AI behavior in mixed-motive economic settings.
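The welfare metrics named above have standard definitions that can be sketched directly. The following is a minimal, hypothetical illustration (function names and data layout are assumptions, not the assessor's actual implementation): utilitarian welfare is the sum of utilities, Nash welfare is their geometric mean, and EF1 requires that any envy between two agents disappears after removing some single item from the envied bundle.

```python
# Illustrative sketch of UW, NW, and the EF1 check described above.
# All names and data structures are assumptions for exposition only.
import math

def utilitarian_welfare(utils):
    """Utilitarian welfare: sum of agents' utilities."""
    return sum(utils)

def nash_welfare(utils):
    """Nash welfare: geometric mean of (positive) utilities."""
    return math.prod(utils) ** (1 / len(utils))

def is_ef1(valuations, allocation):
    """valuations[i][g]: agent i's value for item g;
    allocation[i]: set of items held by agent i.
    EF1: any envy vanishes after dropping one item from the envied bundle."""
    n = len(allocation)
    for i in range(n):
        value = lambda bundle: sum(valuations[i][g] for g in bundle)
        for j in range(n):
            if i == j or value(allocation[i]) >= value(allocation[j]):
                continue  # no envy toward j
            # Envy exists: try removing each single item from j's bundle.
            if not any(value(allocation[i]) >= value(allocation[j] - {g})
                       for g in allocation[j]):
                return False
    return True
```

For example, giving each of two agents their favorite item satisfies EF1, while giving one agent everything generally does not.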
-
social-compact-arena
AgentX by ReserveJudgement
SocialCOMPACT is designed to assess social intelligence. The tasks are five multi-agent, mixed-motive (cooperative-competitive) games that together form a challenging social environment. The games are: "Survivor", a Diplomacy-style alliances game without the board; "Coalition", a classic setting from cooperative game theory; "Scheduler", a multi-agent extension of the 'Battle of the Sexes' coordination game; "Tragedy of the Commons", a classic public goods game; and "HUPI", in which players try to find the highest unique position, testing complex k-level reasoning. In each round of a game, agents first communicate with each other, then predict each other's actions, then make their decisions, generating rich in-game data. Games can be played flexibly at different player counts (n-player), and each comes with two alternative backstories to test framing robustness. In each run, the green agent orchestrates as many combinations of players and games as possible, or as budgeted by the evaluator. Agents are assessed using Elo scores, prediction accuracy of other agents' actions, and a transparency metric (the accuracy with which opponent agents predict their actions). This gives a multi-dimensional view of the social intelligence of LLM agents. *Note on reproducibility*: the same registered purple agent was used for all participants, varied only by the LLM deployed. As a result, the same uuid will show results for essentially different agents. Also, the Elo system requires a certain threshold of games to have been played before results stabilize.
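For readers unfamiliar with Elo scoring, a minimal sketch of the standard update rule follows. The K-factor and 400-point scale are conventional assumptions for illustration, not necessarily the arena's actual parameters.

```python
# Standard Elo update, sketched for illustration; K and the 400-point
# logistic scale are conventional defaults, not SocialCOMPACT's settings.
def expected_score(r_a, r_b):
    """Probability that A beats B under the Elo logistic model."""
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a, r_b, score_a, k=32):
    """score_a: 1.0 win, 0.5 draw, 0.0 loss, from A's perspective."""
    e_a = expected_score(r_a, r_b)
    new_a = r_a + k * (score_a - e_a)
    new_b = r_b + k * ((1 - score_a) - (1 - e_a))
    return new_a, new_b
```

With equal ratings of 1000 and a win for A, the ratings move to 1016 and 984; as the note above says, many games are needed before such ratings stabilize.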
-
agentify-bench-green
AgentX by vanessadiehl
AgentifyBench is a benchmark that evaluates AI agents' ability to extract legal entities and map them to Customer Relationship Management (CRM) ontology structures across multiple conversation turns. Unlike existing single-turn benchmarks, AgentifyBench tests three critical dimensions: (1) Entity Extraction Accuracy: can agents identify correct entity types and names from legal text? (2) Relationship Mapping Precision: can agents establish correct semantic relationships between entities? (3) Multi-Turn Consistency: do agents maintain semantic understanding when presented with corrections or new information? At the time of submission (we will be expanding the data), the benchmark comprises three episodes across distinct legal domains: construction defects, employment discrimination, and commercial contract breaches. Each episode contains three conversation turns with gold-standard CRM ontology definitions. Agents are evaluated using objective F1-based metrics (Entity F1, Relationship F1) and a novel Persistence metric measuring relationship stability across turns. AgentifyBench addresses a real-world problem: legal teams need agents that can auto-populate CRM systems from unstructured documents while maintaining consistency as information evolves, a capability that current evaluation methods do not measure. At the time of submission, the benchmark achieves 0.462 Entity F1 and 0.347 Relationship F1 with a Gemini 2.5 Flash baseline, demonstrating meaningful room for improvement while revealing specific failure modes in relationship type classification.
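Set-based F1 over extracted items is the usual way to realize metrics like Entity F1 and Relationship F1. The sketch below is illustrative only: the entity keys, relationship tuples, and example data are assumptions, and the benchmark's exact matching rules (e.g. type sensitivity, normalization) may differ.

```python
# Illustrative set-based F1, as typically used for entity/relationship
# scoring. Keys and example data are hypothetical, not benchmark data.
def f1(predicted, gold):
    """F1 over two sets of hashable items (entities or relations)."""
    predicted, gold = set(predicted), set(gold)
    if not predicted or not gold:
        return 0.0
    tp = len(predicted & gold)          # exact matches
    precision = tp / len(predicted)
    recall = tp / len(gold)
    if tp == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Entities keyed as (type, name); relations would be (head, relation, tail).
pred = {("Party", "Acme Corp"), ("Matter", "Defect Claim")}
gold = {("Party", "Acme Corp"), ("Matter", "Defect Claim"),
        ("Attorney", "J. Doe")}
# f1(pred, gold): precision 1.0, recall 2/3, F1 = 0.8
```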
-
CRMArena-Plus Salesforce Evaluator
by maeuza
This Green Agent is designed to evaluate automated Salesforce operations within the CRMArena-Plus framework. It specifically assesses a participant agent's ability to navigate CRM metadata, execute object-level queries, and maintain data integrity during complex task sequences. The evaluator uses a GPT-4o mini model to compare the participant's output against expected CRM states, providing a standardized benchmark for autonomous sales and support agents.
-
AgentX-Green-TAS-Evaluator
by Champion31415926
This Green Agent implements an automated evaluation system using the A2A protocol and TAS framework. It dynamically interacts with Purple Agents by issuing complex tasks, capturing responses, and performing multi-dimensional scoring based on scientific accuracy and logical consistency. The agent automates the entire "evaluator-to-subject" workflow, providing reproducible scores and structured feedback for multi-agent interaction scenarios.
-
Agentic Iterated Prisoner's Dilemma
by JLanghamLopez
The iterated prisoner's dilemma is a classic model in computer science and game theory, in which two agents choose whether to cooperate or defect over multiple rounds. Agents remember the history of choices and can adapt their strategy to the other prisoner's behaviour. This benchmark implements the iterated prisoner's dilemma via natural language prompts with LLM agents, with the added twist that agents can exchange a fixed number of messages before choosing whether to cooperate with or betray the other prisoner. Each agent is assigned a sentence based on its own and its counterpart's choices; the aim is to minimise the total sentence accrued across all rounds of the game. This benchmark has potential use cases in the study of:
- Agent strategy and planning, as agents are required to choose and adapt their strategy given their counterpart's behaviour
- Theory of mind, as the agent has to reason about the intentions of the other prisoner
- Safety, as agents may attempt to manipulate the other agent (or may be manipulated) to achieve a lower sentence
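The sentencing mechanics described above can be sketched with a standard prisoner's dilemma payoff table. The sentence values below (years; lower is better) are the conventional textbook numbers, assumed for illustration; the benchmark's actual values may differ.

```python
# Sketch of per-round sentencing with a conventional payoff table.
# Sentence values are textbook defaults, not the benchmark's actual ones.
SENTENCES = {
    ("cooperate", "cooperate"): (1, 1),   # both stay silent
    ("cooperate", "defect"):    (5, 0),   # A is betrayed
    ("defect",    "cooperate"): (0, 5),   # B is betrayed
    ("defect",    "defect"):    (3, 3),   # mutual betrayal
}

def play_round(choice_a, choice_b, totals):
    """Add this round's sentences to each agent's running total."""
    s_a, s_b = SENTENCES[(choice_a, choice_b)]
    return totals[0] + s_a, totals[1] + s_b

totals = (0, 0)
for a, b in [("cooperate", "cooperate"), ("defect", "cooperate")]:
    totals = play_round(a, b, totals)
# After mutual cooperation and one betrayal by A, totals is (1, 6).
```

The message-exchange phase would sit before each `play_round` call, giving agents a chance to negotiate (or deceive) before committing to a choice.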
-
Tau2 Green Agent (τ²-bench on AgentBeats)
by shikibuton10x
Tau2 Green Agent is an A2A-compatible Green Agent that agentifies Sierra's τ²-Bench (Tau-Squared Bench) for end-to-end evaluation on AgentBeats. It orchestrates a Purple agent through the τ²-bench environment across multiple domains (e.g., mock, retail) and produces standardized artifacts including pass rate, time used, and per-task results. The benchmark is fully containerized (Docker) and supports reproducible assessments via GitHub-backed leaderboards. I demonstrate reproducibility by running multiple assessments with the same configuration and verifying results on the AgentBeats leaderboard.
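The aggregation into the pass-rate and time-used artifacts mentioned above is straightforward; the following toy sketch assumes an illustrative per-task result schema (the field names are not the actual τ²-bench format).

```python
# Toy aggregation of per-task results into summary artifacts.
# The result schema here is hypothetical, not the real τ²-bench output.
results = [
    {"task_id": "retail-001", "passed": True,  "seconds": 12.4},
    {"task_id": "retail-002", "passed": False, "seconds": 30.1},
    {"task_id": "mock-001",   "passed": True,  "seconds": 8.7},
]

pass_rate = sum(r["passed"] for r in results) / len(results)  # fraction passed
time_used = sum(r["seconds"] for r in results)                # total seconds
```

Running the same aggregation over repeated assessments with a fixed configuration is one way to check the reproducibility claimed above.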
-
GAIA with Extension
by zpyuan6
Our green agent evaluates general-purpose assistants on an extended GAIA-style suite of real-world questions with unambiguous, automatically checkable answers, requiring multi-step reasoning and robust tool use. We extend GAIA by integrating (1) DocVQA-style document visual question answering tasks that test understanding of document images, layout, and embedded text, and (2) SealQA-style search-augmented QA tasks that stress evidence selection and reasoning under noisy or conflicting web results, providing a broader probe of agentic reliability across both document grounding and web-grounded reasoning.
-
tau2-bench-agent
by wuTims
In general, my green agent can administer any evaluation from tau2-bench. In addition to the existing domains, I have added a vacation rental domain, which evaluates whether agents can act on a host profile in addition to following domain policy, fetching guest context, and fetching listing context.