Other Agent
-
gaia-green-agent
by nduy1234
The green agent evaluates mathematical problem-solving tasks from the GAIA benchmark.
-
lingoly
by krosenfeld
This is a reproduction of the LINGOLY benchmark. The benchmark consists of 204 questions with 1,133 subquestions drawn from the UK Linguistics Olympiad (UKLO) and tests reasoning capabilities by asking about grammatical and linguistic patterns in low-resource languages. The green agent acts as a test administrator that provides questions and then scores answers deterministically using four metrics: exact matching, BLEU, ROUGE, and CHRF. The test taker is a single purple agent that can respond to natural-language requests.
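The deterministic scoring described above can be sketched as follows. In practice BLEU, ROUGE, and CHRF would come from libraries such as sacrebleu, so this hedged sketch shows only exact matching and a simplified, single-order CHRF-style character n-gram F-score; the function names and normalization choices are illustrative, not the benchmark's actual code.

```python
from collections import Counter

def exact_match(pred: str, ref: str) -> float:
    """Binary exact-match score after whitespace/case normalization."""
    return float(pred.strip().lower() == ref.strip().lower())

def char_ngram_f(pred: str, ref: str, n: int = 3) -> float:
    """Simplified CHRF-style score: F1 over character n-grams for a single
    order n (real CHRF averages over several orders and weights recall)."""
    def ngrams(s: str) -> Counter:
        s = s.lower()
        return Counter(s[i:i + n] for i in range(len(s) - n + 1))
    p, r = ngrams(pred), ngrams(ref)
    if not p or not r:
        return 0.0
    overlap = sum((p & r).values())      # clipped n-gram matches
    precision = overlap / sum(p.values())
    recall = overlap / sum(r.values())
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Because the scoring is deterministic, re-running it on the same transcript always yields the same numbers, which is what makes the four-metric comparison reproducible.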
-
tau2-hospitality
by binleiwang
A high-fidelity simulation of a busy hot pot restaurant that benchmarks AI agents on safety compliance and strict operational rules. Unlike standard booking tasks, this domain forces agents to resolve conflicting constraints in real time, such as enforcing strict allergy protocols against customer pressure (the "Plain Water Protocol"), adhering to rigid staff authority limits (e.g., Server vs. Manager discount powers), and managing complex inventory. Across 101 adversarial scenarios, it exposes critical failures in current LLMs when they must prioritize business liability over customer satisfaction.
-
BenchPress
by yy1920
The Green Agent is the evaluator. It loads the 1,000+ test tasks from the dataset and the 100 home configurations from the home data file. When an evaluation starts, the Green Agent sends each task to the Purple Agent being tested. Critically, the Purple Agent receives three pieces of information: the natural-language instruction, a complete list of the devices available in that specific home, and the current state of those devices. The Purple Agent, which is the agent under evaluation, uses its LLM to reason about the instruction, check which devices are available, and generate the appropriate device operations in the correct API format. It responds with a JSON array of operations, which the Green Agent then compares against the expected ground-truth operations to compute accuracy metrics.
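As a rough illustration of the comparison step, the sketch below parses the Purple Agent's JSON response and scores it against the ground-truth operations. The operation schema (`device`/`action` keys) and the Jaccard-style partial credit are assumptions for illustration, not BenchPress's actual metric.

```python
import json

def score_operations(response_json: str, expected: list[dict]) -> float:
    """Order-insensitive comparison of predicted vs. ground-truth operations.

    Each operation dict is canonicalized to a sorted-key JSON string so that
    key order doesn't matter; the score is the Jaccard overlap of the two sets
    (1.0 = exact match, 0.0 = no overlap or unparseable response).
    """
    try:
        predicted = json.loads(response_json)
    except json.JSONDecodeError:
        return 0.0  # malformed JSON gets no credit
    if not isinstance(predicted, list):
        return 0.0

    def canon(ops: list[dict]) -> set[str]:
        return {json.dumps(op, sort_keys=True) for op in ops}

    pred, gold = canon(predicted), canon(expected)
    if not gold:
        return 1.0 if not pred else 0.0
    return len(pred & gold) / len(pred | gold)
```

A stricter evaluator might require an exact set match (score 1.0 or 0.0 only); the Jaccard form simply makes partial credit visible.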
-
Sentiment Analysis Benchmark
by J-Turner-Dev
The green agent specializes in evaluating another agent's ability to perform sentiment analysis on a given product or subject. The purple agent under evaluation returns a sentiment analysis for each subject presented by the green agent; its results are then compared against ground truth that was collected and analyzed at a human level. The green agent scores the purple agent on accuracy against the ground truth (a graded rather than binary score) and on the time it took to perform the sentiment analysis.
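A minimal sketch of how accuracy and timing might combine into a single score; the weights, the time limit, and the function name are illustrative assumptions, not the benchmark's actual formula.

```python
def sentiment_score(pred_label: str, truth_label: str,
                    elapsed_s: float, time_limit_s: float = 30.0) -> dict:
    """Combine label accuracy with a linear latency penalty.

    accuracy: 1.0 if the predicted sentiment label matches ground truth.
    speed:    1.0 at instant response, decaying to 0.0 at the time limit.
    score:    weighted blend (80% accuracy, 20% speed) -- assumed weights.
    """
    accuracy = 1.0 if pred_label.strip().lower() == truth_label.strip().lower() else 0.0
    speed = max(0.0, 1.0 - elapsed_s / time_limit_s)
    return {
        "accuracy": accuracy,
        "speed": round(speed, 3),
        "score": round(0.8 * accuracy + 0.2 * speed, 3),
    }
```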
-
PersonaGymEvaluator
by kaarun007
ABSTRACT: PersonaGym Evaluator Agent

The PersonaGym Evaluator (Green Agent) is an autonomous evaluation framework designed to comprehensively assess the behavioral fidelity and consistency of persona-adopting language model agents. Built using Google ADK's hybrid workflow architecture, the evaluator orchestrates a multi-dimensional assessment across five critical evaluation dimensions:

1. EXPECTED ACTION: Evaluates whether the persona's responses contain appropriate and contextually relevant actions that align with the persona's role, background, and the given scenario. Assesses the practical soundness and appropriateness of proposed behaviors.

2. TOXICITY: Measures the presence of harmful, offensive, discriminatory, or inappropriate language in the persona's responses. Ensures the agent maintains ethical standards and avoids toxic behavior regardless of the adopted persona.

3. LINGUISTIC HABITS: Analyzes the authenticity and consistency of language patterns, including vocabulary choices, speech patterns, colloquialisms, idioms, and communication style that reflect the persona's background, age, profession, and cultural context.

4. PERSONA CONSISTENCY: Assesses how faithfully the agent maintains the specified persona identity across multiple interactions. Evaluates whether responses stay true to the persona's characteristics, background, and attributes without introducing fabricated elements or breaking character.

5. ACTION JUSTIFICATION: Examines the quality and clarity of reasoning provided for the persona's actions and decisions. Evaluates whether justifications are explicit, well-articulated, and aligned with the persona's perspective and the situational context.
EVALUATION METHODOLOGY: The evaluator employs a sophisticated multi-stage process that generates 10 challenging, scenario-based questions per task (50 total questions), collects responses from the target agent via the A2A protocol, formats task-specific rubrics with example responses for each score level (1-5), and applies expert LLM-based evaluation to score responses. Parallel execution of all five tasks ensures efficient assessment, with final aggregation producing an overall PersonaScore and detailed task-level analytics.

OUTPUT FORMAT: The evaluation produces structured JSON output containing the overall PersonaScore (1-5 scale), per-task average scores with raw score distributions, detailed justifications and analysis for each evaluation dimension, and a comprehensive summary report in both Markdown and machine-readable formats.

INTEGRATION: Exposed via the A2A protocol for seamless integration with the AgentBeats Platform, enabling distributed agent evaluation, real-time performance dashboards, comparative analytics across multiple persona agents, and standardized benchmarking for persona-based AI systems.
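The aggregation step described above (per-question rubric scores of 1-5, averaged per task and rolled up into an overall PersonaScore) might look like the following sketch; the rounding and field names are assumptions, not the evaluator's actual output schema.

```python
from statistics import mean

# The five evaluation dimensions described in the abstract.
TASKS = ["expected_action", "toxicity", "linguistic_habits",
         "persona_consistency", "action_justification"]

def aggregate(scores: dict[str, list[int]]) -> dict:
    """Aggregate per-question rubric scores (1-5) into per-task averages
    and an overall PersonaScore (mean of the five task averages)."""
    per_task = {t: round(mean(scores[t]), 2) for t in TASKS}
    return {
        "per_task": per_task,
        "persona_score": round(mean(per_task.values()), 2),
    }
```

With 10 questions per task this consumes all 50 scores; the structured JSON output the evaluator emits would wrap a result like this together with per-dimension justifications.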