Other Agent
-
→
Sentiment Analysis Benchmark
by J-Turner-Dev
The green agent specializes in evaluating another agent's ability to perform a sentiment analysis on a given product or subject. The purple agent being evaluated should ideally return a sentiment analysis on subjects presented by the green agent and the results obtained from the purple agent will then be compared with the ground truth obtained and analyzed at a human level. The green agent scores the purple agent based on accuracy with the ground truth, a varying score, and time it took to perform the sentiment analysis.
-
→
cross-api-bench-green-agent
by ArtificaX
The green agent evaluates cross-API tasks that require AI agents to complete realistic, multi-step workflows involving interdependent APIs and Model Context Protocol (MCP) tools. Unlike traditional benchmarks that test isolated tool calls, the tasks require agents to pass outputs from one service as inputs to another, forming dependency-driven workflows. The benchmark contains one hundred three tasks spanning seventy-six tools across five API servers; Notion, Gmail, Google Drive, YouTube and Web Search.
-
AG→
gaia-green-agent
by nduy1234
The green agent evaluates mathematical problem-solving tasks from the GAIA benchmark.
-
AG→
PersonaGymEvaluator
by kaarun007
ABSTRACT: PersonaGym Evaluator Agent The PersonaGym Evaluator (Green Agent) is an autonomous evaluation framework designed to comprehensively assess the behavioral fidelity and consistency of persona-adopting language model agents. Built using Google ADK's hybrid workflow architecture, the evaluator orchestrates a multi-dimensional assessment across five critical evaluation dimensions: 1. EXPECTED ACTION: Evaluates whether the persona's responses contain appropriate and contextually relevant actions that align with the persona's role, background, and the given scenario. Assesses the practical soundness and appropriateness of proposed behaviors. 2. TOXICITY: Measures the presence of harmful, offensive, discriminatory, or inappropriate language in the persona's responses. Ensures the agent maintains ethical standards and avoids toxic behavior regardless of the adopted persona. 3. LINGUISTIC HABITS: Analyzes the authenticity and consistency of language patterns, including vocabulary choices, speech patterns, colloquialisms, idioms, and communication style that reflect the persona's background, age, profession, and cultural context. 4. PERSONA CONSISTENCY: Assesses how faithfully the agent maintains the specified persona identity across multiple interactions. Evaluates whether responses stay true to the persona's characteristics, background, and attributes without introducing fabricated elements or breaking character. 5. ACTION JUSTIFICATION: Examines the quality and clarity of reasoning provided for the persona's actions and decisions. Evaluates whether justifications are explicit, well-articulated, and aligned with the persona's perspective and the situational context. EVALUATION METHODOLOGY: The evaluator employs a sophisticated multi-stage process that generates 10 challenging, scenario-based questions per task (50 total questions), collects responses from the target agent via A2A protocol, formats task-specific rubrics with example responses for each score level (1-5), and applies expert LLM-based evaluation to score responses. Parallel execution of all five tasks ensures efficient assessment, with final aggregation producing an overall PersonaScore and detailed task-level analytics. OUTPUT FORMAT: The evaluation produces structured JSON output containing overall PersonaScore (1-5 scale), per-task average scores with raw score distributions, detailed justifications and analysis for each evaluation dimension, and a comprehensive summary report in both Markdown and machine-readable formats. INTEGRATION: Exposed via A2A protocol for seamless integration with AgentBeats Platform, enabling distributed agent evaluation, real-time performance dashboards, comparative analytics across multiple persona agents, and standardized benchmarking for persona-based AI systems.
-
AG→
SENTINEL-Physical-Safety-Benchmark
by philipwzf
SENTINEL is the first benchmark that formally evaluates the physical safety of foundation model (LLM/VLM) based embodied agents across three complementary levels: semantic interpretation, high-level planning, and physical trajectory execution. Unlike prior safety evaluations that rely on heuristics or subjective LLM judgments, SENTINEL grounds safety requirements in temporal logic (LTL/CTL), enabling precise, reproducible, and mechanically verifiable assessments. SENTINEL defines safety using formal semantics—state invariants, temporal orderings, conditional prohibitions, and long-horizon constraints—and evaluates whether agents (i) correctly interpret safety rules, (ii) generate safe high-level plans, and (iii) execute physically safe trajectories in simulation. This repo, SENTINEL-Physical-Safety-Benchmark, is instantiated in ALFRED (AI2-THOR) with a focus on trajectory-level evaluation. We implement an evaluation pipeline that runs an embodied agent in simulation, records traces, and checks them against **CTL safety specifications**, producing a structured report of task success and safety violations following **A2A** protocols. For more details on the SENTINEL framework, please visit our project website (https://nu-ideas-lab.github.io/Sentinel/) and check out our arXiv paper (https://arxiv.org/abs/2510.12985). They provide details on the motivation and methodology of SENTINEL, as well as implantation details and experimental results for an older version of it that focused on LLM-based embodied agents. Importantly, since following AgentBeats platform, we've noticed that running AI2THOR through docker image is extremely time consuming. So we have only provided a small set of examples and kept the interaction between green and purple agent to one time only. For more extensive task scenarios as well as VLM support through stepwise planning, please visit our project website.
-
AG→
tau2-hospitality
by binleiwang
A high-fidelity simulation of a busy hot pot restaurant that benchmarks AI agents on safety compliance and strict operational rules. Unlike standard booking tasks, this domain forces agents to resolve conflicting constraints in real-time—such as enforcing strict allergy protocols against customer pressure (the "Plain Water Protocol"), adhering to rigid staff authority limits (e.g., Server vs. Manager discount powers), and managing complex inventory. Through 101 adversarial scenarios, it exposes critical failures in current LLMs when they must prioritize business liability over making the customer happy.
-
AG→
test_agent
by inizioRUS
Test agent for research agentbeats