Agent Safety

  • AgentHarm Green

    by adilliadil

    This green agent implements the AgentHarm benchmark, a safety evaluation framework that assesses whether AI agents can distinguish harmful from benign requests across realistic tool-use scenarios. The evaluation consists of two main task categories:

    Harmful Behaviors (~88 test cases): Evaluates whether the purple agent appropriately refuses malicious requests in categories including cybercrime (malware creation, unauthorized database access), fraud (credit card purchases, unauthorized transfers), illegal content generation, and harassment. The green agent measures refusal rates and checks that harmful instructions are not executed.

    Benign Behaviors (parallel test cases): Validates that the purple agent successfully completes legitimate tasks using the same tools, such as file operations, social media posting, web searches, payments, and content creation. This ensures that safety measures do not cause over-refusal of valid requests.

    Each evaluation sends the purple agent a contextual prompt with a list of available tools (e.g., send_email, transfer_money, post_on_x, create_meme_image). The green agent uses LLM-based judges to score responses on task-completion quality and to detect refusals, producing metrics that include average scores, refusal rates, and category-level breakdowns.
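
    The reported metrics (average scores, refusal rates, category-level breakdowns) could be aggregated roughly as in the sketch below. This is an illustration only, not the benchmark's actual code: the `JudgedResult` record, its field names, the 0-to-1 score scale, and the sample data are all assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class JudgedResult:
    """One judged response. All field names are hypothetical."""
    category: str   # e.g. "cybercrime" or "fraud"
    score: float    # judge's task-completion score, assumed to lie in [0, 1]
    refused: bool   # True if the refusal judge flagged the response


def summarize(results):
    """Return overall and per-category average score and refusal rate."""
    if not results:
        raise ValueError("no results to summarize")

    def stats(group):
        n = len(group)
        return {
            "n": n,
            "avg_score": sum(r.score for r in group) / n,
            "refusal_rate": sum(r.refused for r in group) / n,
        }

    by_category = defaultdict(list)
    for r in results:
        by_category[r.category].append(r)

    return {
        "overall": stats(results),
        "by_category": {c: stats(g) for c, g in sorted(by_category.items())},
    }


# Made-up judgments for four harmful-behavior test cases.
harmful = [
    JudgedResult("cybercrime", 0.0, True),
    JudgedResult("cybercrime", 0.5, False),
    JudgedResult("fraud", 0.0, True),
    JudgedResult("fraud", 0.0, True),
]
report = summarize(harmful)
print(report["overall"])
```

    On the harmful split a high refusal rate is the desired outcome, while on the benign split the same number would indicate over-refusal, so the two splits are best summarized separately.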

  • CIRISBench

    by emooreatx

    We harvested 19,000+ scenarios from Hendrycks Ethics and, for each evaluation, select a randomized subset from 4 categories to form a unique 300-question corpus. We evaluate these both semantically and heuristically, treating disagreement between the two evaluations as an error signal for the benchmark itself.
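
    A minimal sketch of the sampling and disagreement steps follows. It assumes an equal split of 75 questions per category and a per-evaluation seed, neither of which the description specifies. The function names, category labels, and verdict strings are hypothetical.

```python
import random


def sample_corpus(pool, per_category, seed):
    """Draw `per_category` scenarios from each category, without replacement."""
    rng = random.Random(seed)  # a fresh seed per evaluation gives a new corpus
    corpus = []
    for category in sorted(pool):
        corpus.extend(rng.sample(pool[category], per_category))
    rng.shuffle(corpus)
    return corpus


def disagreements(semantic, heuristic):
    """Return ids of questions where the two evaluators' verdicts differ."""
    shared = semantic.keys() & heuristic.keys()
    return sorted(q for q in shared if semantic[q] != heuristic[q])


# Placeholder pool: 4 categories with 100 scenario ids each.
pool = {
    f"category_{i}": [f"category_{i}-{j}" for j in range(100)]
    for i in range(4)
}
corpus = sample_corpus(pool, per_category=75, seed=42)
print(len(corpus))

# Placeholder verdicts from the two evaluation methods.
semantic = {"q1": "correct", "q2": "incorrect", "q3": "correct"}
heuristic = {"q1": "correct", "q2": "correct", "q3": "correct"}
print(disagreements(semantic, heuristic))
```

    Questions flagged by `disagreements` are candidates for manual review, since either evaluator (or the question itself) may be at fault.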

  • visible-yet-unreadable

    by Trymore-lab

    The green agent evaluates participating agents on their ability to correctly interpret and recognize textual content embedded in visually misleading images. These images are intentionally designed to induce common failure modes in machine perception systems (e.g., ambiguous typography, visual illusions, unconventional layouts) while remaining readily understandable to human readers. The green agent distributes the images to participating agents and compares their outputs against predefined ground-truth annotations. Performance is measured by accuracy and robustness in extracting and interpreting the intended text under visual ambiguity.
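
    The comparison against ground truth might look like the sketch below. It assumes exact matching after light normalization (Unicode NFKC, case folding, whitespace collapsing). The description does not say how outputs are actually compared, and the image ids and strings are made up.

```python
import unicodedata


def normalize(text):
    """Apply NFKC, fold case, and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", text)
    return " ".join(text.casefold().split())


def accuracy(predictions, ground_truth):
    """Fraction of images whose normalized prediction equals the annotation.

    Images with no prediction count as incorrect.
    """
    if not ground_truth:
        raise ValueError("ground truth is empty")
    correct = sum(
        image_id in predictions
        and normalize(predictions[image_id]) == normalize(expected)
        for image_id, expected in ground_truth.items()
    )
    return correct / len(ground_truth)


ground_truth = {"img1": "Hello World", "img2": "STOP", "img3": "exit"}
predictions = {"img1": "hello   world", "img2": "5TOP"}  # img3 unanswered
print(accuracy(predictions, ground_truth))
```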

  • PRISM-Bench

    by umairtufail

    PRISM-Bench evaluates Cultural Intelligence (CQ) in AI systems, specifically measuring "Normative Agility": the capacity to recognize that "right" and "wrong" vary by cultural context. Unlike traditional ethics benchmarks that test universal moral knowledge, PRISM tests whether AI systems can adapt their responses to local cultural norms and avoid imposing Western defaults.

    The benchmark uses the Pluralistic & Granular Alignment Framework (PGAF) to measure three distinct error types:

    Level 1 (Default Assumption Rate) tests whether agents impose Western/universal norms onto local contexts.

    Level 2 (Stereotype Resistance Score) tests whether agents respect individual agency over group stereotypes.

    Level 3 (Implicit Context Recognition Rate) tests whether agents detect subtle cultural cues like slang, honorifics, and local terms.

    PRISM v2.1 includes 650 adversarial scenarios across 13 high-friction domains, including Social Dynamics, Economic Systems, Geopolitics, Theology, Digital Culture, and Environmental Justice. Each scenario presents a culturally grounded dilemma where the "correct" answer depends entirely on the cultural context, requiring agents to demonstrate cultural awareness, avoid stereotyping, and recognize implicit signals rather than defaulting to universal Western norms.

  • DIPG-safety

    by surfiniaburger

    # Med Safety Gym: Green Agent Task Description

    The Med Safety Gym Green Agent evaluates AI agents on high-stakes medical safety tasks, specifically focusing on **Diffuse Intrinsic Pontine Glioma (DIPG)**, a fatal pediatric brain tumor where information accuracy is critical.

    ## Evaluation Core Objectives

    The Green Agent orchestrates a series of simulations to assess four primary safety dimensions:

    1. **Grounded Reasoning (Trace Verification):** The agent must provide a `proof` channel containing direct quotes or high-fidelity evidence from the provided clinical context. The Green Agent uses fuzzy string matching (V4 architecture) to verify that the agent's reasoning is strictly grounded in the provided text and not hallucinated.
    2. **Safe Abstention:** When presented with questions that cannot be answered using the provided context (e.g., missing molecular markers, conflicting pathology reports, or out-of-domain queries), the agent is evaluated on its ability to safely abstain rather than guessing.
    3. **Conflict Resolution:** Tasks often include contradictory information (e.g., two different pathology reports for the same patient). The Green Agent assesses whether the participant agent identifies the conflict and refuses to provide a definitive (and potentially dangerous) recommendation.
    4. **Format Adherence:** The Green Agent enforces a strict hierarchical reward curriculum. Agents must master the multi-channel output format (Analysis -> Proof -> Final Answer) before receiving any content-based rewards, ensuring they are compatible with structured clinical workflows.

    ## Task Categories

    - **Clinical Efficacy Queries:** Extracting specific trial results (ORR, PFS, OS) for targeted therapies like ONC201 or Panobinostat.
    - **Protocol Compliance:** Determining the next therapeutic step based on complex trial protocols involving toxicity resolution and disease progression.
    - **Diagnostic Validation:** Identifying if a clinical vignette provides sufficient evidence for a specific diagnosis (e.g., DIPG vs. Low-Grade Glioma).
    - **Adversarial/Out-of-Domain:** Handling non-medical or irrelevant questions to ensure the agent maintains its specialized safety boundaries.
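
    The Grounded Reasoning and Format Adherence checks described above can be sketched with Python's standard `difflib`. This is an illustration under stated assumptions, not the Green Agent's V4 implementation: the sliding-window matcher, the 0.85 threshold, the channel keys, the status labels, and the placeholder context are all hypothetical.

```python
from difflib import SequenceMatcher

REQUIRED_CHANNELS = ("analysis", "proof", "final")  # assumed channel keys


def _squash(text):
    """Lowercase and collapse whitespace so layout differences are ignored."""
    return " ".join(text.lower().split())


def best_match_ratio(quote, context):
    """Highest similarity between `quote` and any same-length window of `context`."""
    quote, context = _squash(quote), _squash(context)
    if not quote:
        return 0.0
    width = min(len(quote), len(context))
    best = 0.0
    for start in range(len(context) - width + 1):
        window = context[start:start + width]
        best = max(best, SequenceMatcher(None, quote, window).ratio())
    return best


def grade(response, context, threshold=0.85):
    """Check format first; only well-formed responses are checked for grounding."""
    if not all(response.get(ch, "").strip() for ch in REQUIRED_CHANNELS):
        return "format_error"   # no content-based credit without all channels
    if best_match_ratio(response["proof"], context) < threshold:
        return "ungrounded"     # proof not found (even approximately) in context
    return "grounded"


# Placeholder context, not real clinical data.
CONTEXT = (
    "Report A lists the marker as present. "
    "Report B lists the marker as absent."
)

quoted = {
    "analysis": "The two reports conflict.",
    "proof": "Report A lists the markr as present",  # typo tolerated by fuzzy match
    "final": "Cannot determine marker status; the reports conflict.",
}
invented = dict(quoted, proof="The patient should begin treatment immediately")
missing = {"analysis": "The two reports conflict.", "final": "Unknown."}

print(grade(quoted, CONTEXT), grade(invented, CONTEXT), grade(missing, CONTEXT))
```

    Returning the format check first mirrors the hierarchical curriculum: a response that omits a channel never reaches the grounding check. The window scan is quadratic in the worst case, which is acceptable for short clinical vignettes but would need a faster matcher for long documents.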

  • ConstraintBench

    by oriolmirolf

    ConstraintBench evaluates LLM-based agents on 50 PDDL planning tasks, checking their plans with the VAL 4.0 symbolic engine to ensure mathematical correctness and constraint compliance.
