Healthcare Agent

  • AG

    A2-Bench-Healthcare

    by Ahm3dAlAli

    A²-Bench (Agent Assessment Benchmark) evaluates AI agent safety, security, reliability, and regulatory compliance across three high-stakes regulated domains: Healthcare (HIPAA/HITECH), Finance (KYC/AML/SOX), and Legal (GDPR/CCPA). Each green agent presents the purple agent with realistic tasks such as patient medication management, financial transaction processing, and personal data handling within a dual-control environment where both the agent and an adversary can manipulate shared state. Agents are tested under baseline conditions and adversarial attack strategies including social engineering, prompt injection, and constraint exploitation. Scoring combines four dimensions into an A²-Score: Safety (harm prevention), Security (access control), Reliability (task completion), and Compliance (regulatory adherence), with domain-specific weighting. The benchmark includes 32 healthcare tasks, 28 finance tasks, and 24 legal tasks across varying adversarial sophistication levels (0.3–0.9), enabling fine-grained evaluation of how well agents maintain safety boundaries under pressure.
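    The description says the four dimensions are combined into one A²-Score with domain-specific weighting but does not give the weights or the formula. The sketch below assumes a normalized weighted sum. The weight values and the names `DimensionScores`, `HYPOTHETICAL_WEIGHTS`, and `a2_score` are illustrative placeholders, not taken from the benchmark.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DimensionScores:
    """Per-dimension scores, each assumed to lie in [0, 1]."""
    safety: float
    security: float
    reliability: float
    compliance: float


# ASSUMPTION: these weights are invented for illustration only;
# the benchmark's actual domain weights are not given in the description.
HYPOTHETICAL_WEIGHTS = {
    "healthcare": {"safety": 0.4, "security": 0.2, "reliability": 0.1, "compliance": 0.3},
    "finance": {"safety": 0.2, "security": 0.3, "reliability": 0.2, "compliance": 0.3},
    "legal": {"safety": 0.2, "security": 0.2, "reliability": 0.2, "compliance": 0.4},
}


def a2_score(scores: DimensionScores, domain: str) -> float:
    """Combine the four dimensions as a normalized weighted sum (assumed form)."""
    weights = HYPOTHETICAL_WEIGHTS[domain]
    total = sum(weights.values())
    weighted = sum(weights[name] * getattr(scores, name) for name in weights)
    return weighted / total


if __name__ == "__main__":
    example = DimensionScores(safety=1.0, security=0.5, reliability=0.0, compliance=1.0)
    print(round(a2_score(example, "healthcare"), 3))
```

    Dividing by the weight total keeps the result in [0, 1] even if a domain's weights do not sum exactly to one.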

  • AG

    MedAgentBench-Agentified

    by karim-elkobrossy

    The green agent evaluates whether a medical AI (purple agent) can correctly perform FHIR-based clinical reasoning tasks. These tasks fall into three categories:

      Query tasks: Retrieve and compute patient information from the FHIR server, such as identifying patients, calculating age, and extracting recent or averaged lab values.
      Write tasks: Create valid FHIR resources, including vital sign observations and consultation or lab service requests, with correct clinical structure and content.
      Conditional (protocol-driven) tasks: Apply clinical decision logic based on patient data (e.g., electrolyte levels or test recency) and, when criteria are met, generate appropriate medication orders or lab requests according to predefined medical protocols.

    Overall, the green agent checks data retrieval accuracy, clinical calculations, correct use of FHIR APIs, and adherence to clinical protocols, validating each task with task-specific grading logic.
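    A conditional (protocol-driven) task can be sketched as follows: pick the most recent lab value from a list of FHIR Observation resources and, if it crosses a threshold, build a MedicationRequest. The threshold, the medication text, and the function names are hypothetical placeholders, not the benchmark's protocol. The code works on plain dictionaries and makes no server calls.

```python
from datetime import datetime
from typing import Optional

# ASSUMPTION: placeholder threshold for illustration, not a clinical protocol.
HYPOTHETICAL_THRESHOLD = 3.5


def latest_value(observations: list) -> Optional[float]:
    """Return the numeric value of the most recent Observation, if any."""
    dated = [
        o for o in observations
        if "effectiveDateTime" in o and "valueQuantity" in o
    ]
    if not dated:
        return None
    newest = max(dated, key=lambda o: datetime.fromisoformat(o["effectiveDateTime"]))
    return float(newest["valueQuantity"]["value"])


def conditional_order(patient_id: str, observations: list) -> Optional[dict]:
    """Build a MedicationRequest only when the latest value is below the threshold."""
    value = latest_value(observations)
    if value is None or value >= HYPOTHETICAL_THRESHOLD:
        return None  # criteria not met: no order is generated
    return {
        "resourceType": "MedicationRequest",
        "status": "active",
        "intent": "order",
        "subject": {"reference": f"Patient/{patient_id}"},
        # Hypothetical placeholder text, not a real order.
        "medicationCodeableConcept": {"text": "replacement therapy (placeholder)"},
    }


if __name__ == "__main__":
    labs = [
        {"effectiveDateTime": "2023-11-01T08:00:00+00:00", "valueQuantity": {"value": 4.2}},
        {"effectiveDateTime": "2023-11-03T08:00:00+00:00", "valueQuantity": {"value": 3.1}},
    ]
    print(conditional_order("example-1", labs))
```

    Returning `None` when the criteria are not met mirrors the "when criteria are met" wording: a grader can then check both that an order appears when it should and that none appears when it should not.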

  • AG

    AI-PharmD-MedAgentBench

    by Zephyr1022

    The green agent evaluates AI models on 10 clinical reasoning tasks from Stanford MedAgentBench, testing capabilities in patient data queries, vital signs recording, laboratory analysis, medication management, and consultation ordering across standardized medical scenarios. The project also examines AI's ability to distinguish real pharmaceuticals from fabricated drug names, as explored in research titled "Drug or Pokemon?" This dual focus assesses both clinical workflow automation and AI safety in medical decision-making contexts.

  • AG

    medagentbenchmark-green-agent

    by udapy

    The Green Agent evaluates specific clinical workflows by verifying the digital footprint left by the subject (Purple Agent) within the virtual EHR environment. Rather than relying on subjective linguistic analysis, the Green Agent grades against ground truth: state changes and data accuracy. It assesses whether the correct medical facts were identified and whether the database state was altered correctly (e.g., an order row added to the correct table). The tasks evaluated fall into these core categories:

      Information Retrieval: Validating that the agent can accurately query and extract specific patient data points (e.g., "What was the last recorded creatinine level?") from the FHIR server.
      Clinical Ordering & Action: Verifying that the agent correctly executes actions such as placing medication orders, scheduling lab tests, or generating referrals, ensuring the resulting database objects match the ground truth requirements.
      Medical Documentation: Assessing the agent's ability to synthesize patient information into structured clinical notes or summaries that contain all necessary medical facts.
      Patient Communication: Evaluating the accuracy and appropriateness of drafted responses to patient inquiries.
      Clinical Reasoning & Analysis: Checking the agent's ability to perform calculations (e.g., risk scores) or aggregate complex data to form a correct clinical conclusion (e.g., identifying contraindications).
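    The state-change check described above (an order row added to the correct table) can be sketched as a before/after comparison of database snapshots. This is a minimal illustration, assuming each snapshot is a dictionary mapping table names to lists of row dictionaries. The function names and the table and field names in the example are hypothetical, not the agent's actual schema.

```python
def added_rows(before: dict, after: dict) -> dict:
    """Map each table name to the rows present in `after` but not in `before`."""
    added = {}
    for table, rows in after.items():
        old_rows = before.get(table, [])
        new_rows = [row for row in rows if row not in old_rows]
        if new_rows:
            added[table] = new_rows
    return added


def grade_order(before: dict, after: dict, table: str, expected: dict) -> bool:
    """Pass only if exactly one row was added, in the expected table,
    and that row matches every expected field."""
    diff = added_rows(before, after)
    if set(diff) != {table}:
        return False  # nothing added, or rows added to another table
    rows = diff[table]
    if len(rows) != 1:
        return False
    return all(rows[0].get(key) == value for key, value in expected.items())


if __name__ == "__main__":
    # Hypothetical table and field names, for illustration only.
    before = {"orders": [], "notes": []}
    after = {"orders": [{"patient": "p1", "type": "lab", "code": "creatinine"}], "notes": []}
    print(grade_order(before, after, "orders", {"patient": "p1", "code": "creatinine"}))
```

    Requiring that the expected table is the only one with additions penalizes side effects elsewhere in the database, which a check on the final answer text alone would not detect.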

  • AG

    BioEval

    by bertrandbuild

    It includes 12 BioNLP benchmarks across six applications (for a complete BIO agent):

      Question Answering: MedQA (USMLE-style), PubMedQA
      Named Entity Recognition: BC5CDR Chemical, NCBI Disease
      Multi-label Classification: LitCovid, Hallmarks of Cancer
      Relation Extraction: ChemProt, DDI (Drug-Drug Interactions)
      Text Simplification: PLOS, Cochrane PLS
      Summarization: PubMed (dynamic)

  • AG

    MedAgentBench

    by delgph

    MedAgentBench is a standardized benchmarking framework for evaluating LLM-based medical agents on clinically relevant reasoning and decision-making tasks. It supports reproducible, containerized evaluation and enables systematic comparison of agent performance across diverse medical scenarios.
