Computer Use Agent

  • Assessment of Spatial Intelligence (ASIN) Benchmark

    by r0m4k

    ASIN (Assessment of Spatial Intelligence) is a green-agent benchmark that evaluates an agent's ability to navigate a real-world Manhattan (NYC) route using two visual modalities: a static 2D map showing the reference route and waypoint markers, and a first-person Street View image from the agent's current location and heading. The evaluated agent must iteratively choose low-level control actions: move forward (f, 15 m), turn left/right (l <deg>, r <deg>), or finish (q). The goal is to follow the intended route and stop near the destination within a step budget. Performance is scored on route adherence (deviation from the reference polyline), progress along the route, and final distance to the target, rewarding successful completion and robust recovery from navigation errors.
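
    The action vocabulary and the route-deviation measure described above can be sketched roughly as follows. This is an illustration under stated assumptions, not ASIN's actual implementation: the function names (parse_action, apply_action, deviation_from_route), the flat local x/y coordinates in metres, and the heading convention (degrees clockwise from north) are all hypothetical. Only the action letters and the 15 m step come from the description.

```python
import math

STEP_M = 15.0  # forward step length stated in the benchmark description


def parse_action(text):
    """Parse 'f', 'l <deg>', 'r <deg>' or 'q' into a (kind, value) tuple."""
    parts = text.strip().lower().split()
    if not parts:
        raise ValueError("empty action")
    kind = parts[0]
    if kind in ("f", "q") and len(parts) == 1:
        return (kind, None)
    if kind in ("l", "r") and len(parts) == 2:
        return (kind, float(parts[1]))
    raise ValueError(f"unrecognized action: {text!r}")


def apply_action(pose, action):
    """Return the new (x, y, heading_deg) pose after one action.

    Assumed convention: x points east, y points north, heading is in
    degrees clockwise from north. 'q' leaves the pose unchanged.
    """
    x, y, heading = pose
    kind, value = action
    if kind == "f":
        rad = math.radians(heading)
        return (x + STEP_M * math.sin(rad), y + STEP_M * math.cos(rad), heading)
    if kind == "l":
        return (x, y, (heading - value) % 360.0)
    if kind == "r":
        return (x, y, (heading + value) % 360.0)
    return pose


def point_segment_distance(p, a, b):
    """Shortest distance from point p to the segment a-b (planar)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0:
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp to its endpoints.
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def deviation_from_route(p, route):
    """Distance from p to the nearest segment of the reference polyline."""
    if len(route) < 2:
        raise ValueError("route needs at least two points")
    return min(
        point_segment_distance(p, route[i], route[i + 1])
        for i in range(len(route) - 1)
    )
```

    A real evaluator would work in latitude/longitude and project to metres first; the planar version here only shows the shape of the computation.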

  • Create Your Reality

    by erinjerri

    This Green Agent evaluates an assessee agent's ability to ground multimodal observations from spatial-computing environments (specifically AR/VR devices such as Apple Vision Pro) into correct task execution. Inspired by OSWorld and WebArena, the benchmark replaces traditional desktop or web settings with a visionOS-native productivity environment where tasks originate from speech-to-text input, scene understanding, and on-device context captured through VisionKit or CoreML. The initial task suite focuses on productivity workflows such as task creation, updating, and retrieval, with a planned extension into commerce-oriented actions as the environment expands. The Green Agent verifies whether the assessee agent can transform these multimodal cues into the correct state transitions and action sequence, using deterministic scoring based on state-matching and action-assertion checks. This framework ultimately serves as the evaluation backbone for a future real productivity app, enabling agents to emulate user behavior through multimodal task creation and spatial-computing interactions.

    Create Your Reality Agent (CYRA): A Spatial Computing / AR/VR Agent Benchmark for Embodied Evaluation

    Create Your Reality Agent (CYRA) is a spatial-computing-native Green Agent benchmark built to evaluate embodied agent behavior in immersive environments, following the design principles outlined in Establishing Best Practices for Building Rigorous Agentic Benchmarks (Zhu et al., 2025) and the AgentBeats Agentified Agent Assessment (AAA) framework. Existing agent benchmarks such as OSWorld, WebArena, and τ-bench primarily evaluate agents through browser-based or API-centric tasks. CYRA extends this evaluation paradigm into spatial computing and AR/VR environments, measuring how agents perceive, reason, and act within 3D, multimodal interfaces. This work introduces spatial task competency as a first-class evaluation dimension for agentic systems.

    CYRA is implemented initially on Apple Vision Pro (visionOS), combining Swift-based spatial UI, WebKit-constrained task surfaces, speech-driven function calling, and VisionKit-assisted perception. Application state is persisted via SwiftData/CoreData, serving as the authoritative source of truth for evaluation. Structured task data, agent actions, and telemetry artifacts are stored using lambda.ai-based cloud storage, enabling reproducible replay, deterministic scoring, and post-hoc analysis. The system is designed with cross-platform abstractions to support Meta Quest and other AR/VR devices in the second phase of the hackathon.

    The hackathon proceeds in two phases. In Phase 1, CYRA operates as a Green (Evaluator) Agent, providing the environment, task definitions, and automated evaluation for productivity-focused spatial workflows, including task creation, task completion, document summarization, and structured organization within a spatial computing device. In Phase 2, a Purple (Competing) Agent is introduced and evaluated against the benchmark, extending tasks to lightweight transactional and finance-oriented workflows, such as simulated app or content selection flows inspired by AP2-style commerce interactions. This phase emphasizes cross-platform compatibility, with the goal of evaluating the same Purple Agent across visionOS and Meta Quest environments via an A2A-compatible interface.

    Tasks are defined with deterministic environment scaffolding, explicit tool interfaces, and verifiable end states to ensure task validity and outcome validity, as recommended by Zhu et al. Each assessment run records a complete execution trace, including speech input, intent parsing, function-call sequences, spatial interactions, and environment state transitions. This full-trace telemetry enables reproducible evaluation, fine-grained process analysis, and transparent scoring.

    Rather than optimizing solely for task completion, CYRA emphasizes process-level observability, robustness, and error-recovery behavior. A complementary Purple Assessor Agent implements standardized scoring logic, trace validation, and metric reporting in accordance with the AgentBeats AAA model. Together, CYRA and its assessor agent form a reusable and extensible benchmark template for evaluating agentic performance in embodied, multimodal environments. Developed under a rapid hackathon timeline, this project prioritizes transparency, modularity, and clearly documented limitations, demonstrating how rigorous agent benchmarks can be constructed even under practical constraints. CYRA contributes a novel capability space to agent evaluation and provides a foundation for future benchmarks in spatial computing and immersive AI systems.
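
    The deterministic scoring based on state-matching and action-assertion checks could look roughly like the sketch below. It is written in Python for brevity, whereas the project itself is described as Swift-based. Every name in it (state_matches, actions_in_order, score_run, and the example action names in the tests) is hypothetical. Representing state as a flat dictionary and the trace as a list of function-call names is also an assumption.

```python
def state_matches(expected, actual):
    """True if every expected key is present in actual with an equal value.

    Extra keys in the actual state are ignored (an assumption of this sketch).
    """
    return all(key in actual and actual[key] == value
               for key, value in expected.items())


def actions_in_order(required, trace):
    """True if the required action names occur in the trace in order.

    Other actions may be interleaved; this is a subsequence check.
    """
    remaining = iter(trace)
    # 'name in remaining' consumes the iterator up to the first match,
    # so later names must appear after earlier ones.
    return all(name in remaining for name in required)


def score_run(expected_state, actual_state, required_actions, trace):
    """Combine both checks into a deterministic pass/fail record."""
    state_ok = state_matches(expected_state, actual_state)
    actions_ok = actions_in_order(required_actions, trace)
    return {
        "state_ok": state_ok,
        "actions_ok": actions_ok,
        "passed": state_ok and actions_ok,
    }
```

    Because both checks are pure functions of the recorded end state and trace, replaying a stored run gives the same verdict, which is what the reproducibility claim relies on.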

  • cs294-green-agent

    by jpablomm

    The green agent assesses OSWorld desktop tasks: real-world computer interactions on Ubuntu Linux. Examples include creating and editing files, visiting websites, using applications (such as LibreOffice, GIMP, and VLC), and adjusting system settings. The agent compares the actual VM state (files, webpages, and application states) with the expected outcome to determine whether the task was completed, and returns a score from 0.0 (failed) to 1.0 (succeeded). The benchmark contains 385 tasks (370 original + 15 new) covering real desktop workflows across 10 domains: Chrome, GIMP, LibreOffice Calc, LibreOffice Impress, LibreOffice Writer, Multi-Apps, OS, Thunderbird, VLC, and VS Code.
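
    Since each task yields a score between 0.0 and 1.0 and belongs to one of the listed domains, results can be rolled up per domain and overall. The sketch below shows one plausible way to do that. The summarize function and the (domain, score) input shape are assumptions for illustration, and the description does not say how this agent actually aggregates scores.

```python
from collections import defaultdict


def summarize(results):
    """Aggregate (domain, score) pairs into per-domain and overall means.

    Each score must lie in [0.0, 1.0], matching the range the green
    agent is described as returning.
    """
    by_domain = defaultdict(list)
    for domain, score in results:
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score out of range for {domain!r}: {score}")
        by_domain[domain].append(score)

    per_domain = {d: sum(s) / len(s) for d, s in by_domain.items()}
    total = sum(len(s) for s in by_domain.values())
    # Overall mean is weighted by task count, not an average of domain means.
    overall = sum(sum(s) for s in by_domain.values()) / total if total else 0.0
    return per_domain, overall
```

    Weighting the overall figure by task count keeps small domains from dominating the headline number. Averaging the domain means instead would be an equally defensible choice.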
