Healthcare Agent

  • triage-agent

    by yoonmgyg

    This Green Agent evaluates participant agents on their ability to perform safe and accurate clinical triage across a benchmark of 100 synthesized medical scenarios. The scenarios mix Emergency cases (e.g., stroke symptoms, severe allergic reactions) and Non-Emergency cases (e.g., mild cold, minor sprains). The evaluation focuses on two primary metrics. Safety (Critical): determines whether the participant correctly identifies emergencies by checking for mandatory keywords (e.g., "Call 911", "ER") and avoids dangerous advice in non-emergent cases (e.g., "ignore it", specific unverified dosage recommendations); unsafe responses are immediately penalized with a score of 0. Helpfulness: assesses whether the participant provides actionable follow-up advice for safely managed conditions (e.g., "monitor symptoms", "contact primary care physician"). Each scenario is scored on a binary Pass/Fail basis derived from these metrics. The final leaderboard score reflects the agent's Accumulated Helpfulness Accuracy, strictly gated by Clinical Safety. The evaluation also measures response latency to ensure timely triage guidance.
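The safety-gated Pass/Fail scoring described above can be sketched roughly as follows. This is a minimal illustration, not the evaluator's actual code: the keyword lists contain only the examples quoted in the description, and the names `score_scenario` and `leaderboard_score`, the whole-word matching, and the choice to average per-scenario results are all assumptions.

```python
import re

# Hypothetical keyword lists, seeded only with the examples quoted in the description.
EMERGENCY_KEYWORDS = ("call 911", "er")
DANGEROUS_PHRASES = ("ignore it",)
HELPFUL_PHRASES = ("monitor symptoms", "contact primary care physician")


def _mentions(text, phrases):
    """Return True if any phrase appears in text as a whole word or phrase (case-insensitive)."""
    return any(
        re.search(rf"\b{re.escape(phrase)}\b", text, re.IGNORECASE)
        for phrase in phrases
    )


def score_scenario(is_emergency, response):
    """Binary Pass (1) / Fail (0) for one scenario, with safety checked first."""
    if is_emergency:
        # Assumption: an emergency passes only if a mandatory keyword is present.
        return 1 if _mentions(response, EMERGENCY_KEYWORDS) else 0
    if _mentions(response, DANGEROUS_PHRASES):
        # Unsafe advice scores 0 regardless of any helpful content.
        return 0
    # Safe non-emergency response: pass only if it gives actionable follow-up advice.
    return 1 if _mentions(response, HELPFUL_PHRASES) else 0


def leaderboard_score(scenario_scores):
    """Assumed aggregation: fraction of scenarios passed."""
    return sum(scenario_scores) / len(scenario_scores) if scenario_scores else 0.0
```

Checking safety before helpfulness means a response such as "ignore it, but monitor symptoms" still fails, which is one way to read "strictly gated by Clinical Safety".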

  • FhirAgentEvaluator

    by abasit

    FHIR Agent Evaluator is a benchmark for evaluating medical LLM agents on realistic clinical tasks using FHIR (Fast Healthcare Interoperability Resources) data from MIMIC-IV-FHIR. It follows the Agent-to-Agent (A2A) protocol and evaluates agents operating in tool-augmented EHR environments. The benchmark combines and extends tasks from existing medical agent benchmarks and introduces novel evaluations: (1) Retrieval tasks (1,335 tasks) from FHIR-AgentBench, covering patient record querying, temporal reasoning, and multi-step information gathering across FHIR resources; (2) Retrieval+Action tasks (156 tasks) adapted from MedAgentBench, including vitals recording, medication ordering with dosing protocols, referral ordering with SBAR documentation, and conditional laboratory ordering; and (3) Drug interaction tasks (30 tasks), which introduce medication conflict detection using FDA drug label data. Agents interact with the environment via tools for FHIR GET/POST requests, medical code lookup, Python code execution, and FDA drug label access. Agents are evaluated using answer correctness (overall task correctness combining response and action validation), action correctness (FHIR POST validation), and F1 score (harmonic mean of retrieval precision and recall).
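The F1 metric mentioned above is the harmonic mean of retrieval precision and recall. The sketch below shows that computation under the assumption that retrieval is compared as sets of FHIR resource identifiers; the function name `retrieval_f1` and the identifier format are illustrative and not taken from the benchmark.

```python
def retrieval_f1(retrieved, relevant):
    """F1 over two collections of resource IDs (harmonic mean of precision and recall)."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    if true_positives == 0:
        # Covers empty inputs and the no-overlap case, avoiding division by zero.
        return 0.0
    precision = true_positives / len(retrieved)  # share of retrieved items that are relevant
    recall = true_positives / len(relevant)      # share of relevant items that were retrieved
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: 2 of 3 retrieved resources are relevant, 2 of 4 relevant ones found.
f1 = retrieval_f1(
    ["Observation/1", "Observation/2", "Condition/9"],
    ["Observation/1", "Observation/2", "Observation/3", "Observation/4"],
)
```

Here precision is 2/3 and recall is 1/2, so the F1 score is 4/7, which is lower than the arithmetic mean because the harmonic mean penalizes the weaker of the two values.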

  • OSCE-Medical-Judge

    by whats2000

    The green agent evaluates doctor agents' medical communication skills through simulated patient interactions. It assesses empathy, persuasion, and safety across 30 criteria while managing dialogues with patients exhibiting diverse MBTI personality types. The system generates comprehensive performance reports with scores and improvement recommendations.

  • NurseSim-Triage

    by ClinyQAi

    NurseSim-Triage evaluates an agent's ability to perform safety-critical clinical triage in Emergency Department scenarios. The agent receives patient presentations (chief complaint, vital signs, demographics, medical history) and must assign the correct Manchester Triage System (MTS) category (1-5) while providing clinical reasoning. Tasks assess: Risk Stratification: correctly identifying life-threatening conditions (Category 1: cardiac arrest, anaphylaxis, sepsis); Demographic Context Integration: weighing age and gender as risk modifiers (e.g., chest pain in a 72-year-old male vs. a 20-year-old male); Safety-Critical Decision Making: avoiding dangerous under-triage that could delay life-saving treatment; and Clinical Reasoning: explaining triage decisions with medically sound rationale. The benchmark includes 15 gold-standard scenarios spanning all 5 MTS categories, evaluated by GPT-5.2 judges for both accuracy and safety compliance.
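The distinction between an accurate category and dangerous under-triage can be sketched as a simple comparison against the gold-standard category. This is an illustration only: it assumes Category 1 is the most urgent (consistent with the life-threatening examples above), and the function name `classify_triage` and its labels are hypothetical rather than part of the benchmark.

```python
def classify_triage(predicted, gold):
    """Compare a predicted MTS category with the gold-standard one.

    Assumes categories run from 1 (most urgent) to 5 (least urgent).
    """
    for category in (predicted, gold):
        if category not in range(1, 6):
            raise ValueError(f"MTS category must be 1-5, got {category!r}")
    if predicted == gold:
        return "correct"
    # A higher number than the gold standard means the patient was rated less urgent
    # than they should be, which is the safety-critical error.
    return "under-triage" if predicted > gold else "over-triage"
```

For example, assigning Category 3 to a gold-standard Category 1 presentation would be flagged as under-triage, while assigning Category 1 to a Category 3 presentation would count as over-triage.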
