Multi-agent Evaluation
-
Code_translator_Judge
by Samir-atra
Code Translator Judge - Task Description

The Code Translator Judge (green agent) evaluates the quality of code translation performed by participant agents (purple agents).

What it evaluates: The green agent sends code snippets in a source programming language (e.g., Python) to participant agents and asks them to translate the code into a target programming language (e.g., JavaScript). It then evaluates the translations across four key metrics:

  Execution Correctness (0-10) - Does the translated code produce the same output/behavior as the original?
  Style Score (0-10) - Does the code follow idiomatic conventions of the target language?
  Conciseness (0-10) - Is the translation efficient without unnecessary verbosity?
  Relevance (0-10) - Does the translation accurately preserve the original code's intent and logic?

Sample tasks:

  Translate a recursive factorial function from Python to JavaScript
  Convert a Fibonacci class with memoization from Python to JavaScript
  Transform regex parsing functions between languages

Overall scoring: The agent calculates an overall score as the average of the four metrics, providing a comprehensive assessment of translation quality.
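The overall score described above can be sketched in a few lines of Python. This is only an illustration of "average of the four metrics": the class name `TranslationScores`, its field names, and the range check are assumptions made for this example, not the agent's actual code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TranslationScores:
    """Hypothetical container for the four 0-10 metrics listed above."""
    execution_correctness: float
    style: float
    conciseness: float
    relevance: float

    def __post_init__(self) -> None:
        # Assumption: scores outside the stated 0-10 range are rejected.
        for name, value in vars(self).items():
            if not 0 <= value <= 10:
                raise ValueError(f"{name} must be in [0, 10], got {value}")

    def overall(self) -> float:
        # Unweighted mean of the four metrics, as the description states.
        values = list(vars(self).values())
        return sum(values) / len(values)


scores = TranslationScores(
    execution_correctness=9, style=7, conciseness=8, relevance=10
)
print(scores.overall())  # 8.5
```

An unweighted mean treats a stylistic slip and a behavioral bug as equally costly; whether the judge applies any weighting beyond the plain average is not stated in the description.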
-
Protocol Agent (Green)
by MarcoMetaMask
Many cryptographic primitives could reshape human-to-human communications in our business and personal lives, but this doesn’t happen because cryptography is complex and the math is hard to “do in your head.” Agents could learn that. They can map mundane intents to primitives and instantiate protocols at runtime:

  Scheduling a meeting can use Private Set Intersection (PSI) instead of sharing calendars.
  “Prove you’re over 21” can use a Zero-Knowledge Proof (ZKP) (with nonce/challenge anti-replay) instead of photocopying an ID.
  Anonymous reporting with verifiable membership can use anonymous credentials / ring signatures / group signatures (optionally linkable per epoch).
  Tip tokens can use blind signatures so the issuer can’t link purchase to spend while still preventing double spending.

The list is actually pretty long.

Achieving this is a multidimensional challenge. Agents should: (1) “spot” and select the right primitive in an everyday context, (2) negotiate adoption with another agent, (3) implement the protocol correctly, (4) use crypto tools and computation competently, and (5) reason about threats and security strength.

These are exactly the five judging dimensions of Protocol Agent, a benchmark that measures not “crypto knowledge” in the abstract (which has already been studied), but the practical ability to apply cryptography to improve daily life. This benchmark is the first step in a larger effort (more coming in Q1 2026): post-training models that perform better on it.
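To make the tip-token example concrete, here is a minimal sketch of a blind signature flow. It assumes textbook RSA blind signatures with a deliberately tiny toy key; the description does not say which scheme Protocol Agent expects, and all function names here (`blind`, `sign_blinded`, `unblind`, `redeem`) are hypothetical. It is for illustration only and is not secure.

```python
import hashlib
import math
import secrets

# Toy textbook RSA key (p=61, q=53). Assumption for illustration only:
# a real deployment would use a vetted library and a full-size key.
N, E, D = 3233, 17, 2753


def digest(token: bytes) -> int:
    """Hash the token into the RSA group (reduced mod N for the toy key)."""
    return int.from_bytes(hashlib.sha256(token).digest(), "big") % N


def blind(token: bytes):
    """Buyer: hide the token hash behind a random factor r."""
    while True:
        r = secrets.randbelow(N - 2) + 2
        if math.gcd(r, N) == 1:  # r must be invertible mod N
            break
    return (digest(token) * pow(r, E, N)) % N, r


def sign_blinded(blinded: int) -> int:
    """Issuer: sign without learning which token is being signed."""
    return pow(blinded, D, N)


def unblind(blind_sig: int, r: int) -> int:
    """Buyer: strip r to obtain an ordinary signature on the token hash."""
    return (blind_sig * pow(r, -1, N)) % N


def verify(token: bytes, sig: int) -> bool:
    return pow(sig, E, N) == digest(token)


spent = set()  # issuer-side record of redeemed tokens


def redeem(token: bytes, sig: int) -> bool:
    """Accept a valid token once; reject forgeries and double spends."""
    if token in spent or not verify(token, sig):
        return False
    spent.add(token)
    return True


token = secrets.token_bytes(16)          # chosen privately by the buyer
blinded, r = blind(token)                # purchase: issuer sees only this
sig = unblind(sign_blinded(blinded), r)  # buyer recovers the signature
print(redeem(token, sig))  # True
print(redeem(token, sig))  # False (double spend)
```

The issuer only ever sees the blinded value at purchase time and the token itself at redemption, so it cannot link the two, while the `spent` set still blocks a second redemption of the same token.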