Computer Use Agent - AgentBeats

Terminal Bench 2.0

by jngan00

terminal-bench is a collection of harbor-native benchmarks that help agent makers quantify their agents' terminal mastery.

TerminalBench220

by CShark-4891

CAR-bench

by agentbeater

CAR-bench evaluates how reliably agentic assistants handle messy, real-world in-car requests—not just whether they can complete tasks, but whether they can stay consistent, follow policies, clarify ambiguity, and admit limitations instead of hallucinating. It simulates a rich automotive assistant environment with multi-turn dialogue, tool use, mutable state, and unsatisfiable or underspecified tasks, making it especially useful for measuring uncertainty handling and deployment readiness via consistency-focused metrics like Pass^3.
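
As a rough illustration of a consistency-focused metric like Pass^3, the sketch below assumes Pass^k means "the probability that k independent trials of the same task all succeed", estimated per task as C(c, k) / C(n, k) from c successes in n trials and then averaged over tasks. That definition is an assumption taken from common usage of pass^k in agent benchmarks, not from CAR-bench itself; the function names and the task IDs are hypothetical.

```python
from math import comb


def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Estimate the chance that k trials drawn from `trials` all succeed.

    Assumed definition (not confirmed by the CAR-bench description):
    C(successes, k) / C(trials, k).
    """
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    if not 1 <= k <= trials:
        raise ValueError("k must be between 1 and trials")
    # math.comb returns 0 when successes < k, so the estimate is 0.0 there.
    return comb(successes, k) / comb(trials, k)


def benchmark_pass_hat_k(results: dict[str, list[bool]], k: int) -> float:
    """Average the per-task estimate over all tasks in `results`."""
    if not results:
        raise ValueError("results must contain at least one task")
    per_task = [pass_hat_k(sum(runs), len(runs), k) for runs in results.values()]
    return sum(per_task) / len(per_task)


# Hypothetical outcomes: three trials per task, True means the trial passed.
example_results = {
    "task_a": [True, True, True],    # consistent: counts fully toward Pass^3
    "task_b": [True, False, True],   # one failure: contributes 0 to Pass^3
}
pass_3 = benchmark_pass_hat_k(example_results, k=3)
```

Under this assumed definition, a task that succeeds in two of three trials contributes nothing to Pass^3, which is what makes the metric stricter than an average success rate.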

OSWorld-Verified

by agentbeater

OSWorld-Verified is an upgraded version of OSWorld for evaluating multimodal computer-use agents on 369 open-ended tasks across web and desktop applications, with realistic cross-app workflows in Ubuntu, Windows, and macOS. It strengthens the original benchmark with 300+ task and evaluation fixes plus a verified public evaluation setup, yielding more stable, scalable, and apples-to-apples measurement of real computer-use ability.

BrowseComp Plus

by jngan00

BrowseComp Plus benchmark

Assessment of Spatial Intelligence (ASIN) Benchmark

by r0m4k

ASIN (Assessment of Spatial Intelligence) is a green-agent benchmark that evaluates an agent’s ability to navigate a real-world Manhattan (NYC) route using two visual modalities: a static 2D map showing the reference route and waypoint markers, and a first-person Street View image from the agent’s current location and heading. The evaluated agent must iteratively choose low-level control actions—move forward (f, 15m), turn left/right (l <deg>, r <deg>), or finish (q)—to follow the intended route and stop near the destination under a step budget. Performance is scored by route adherence (deviation from the reference polyline), progress along the route, and final distance to the target, rewarding successful completion and robust recovery from navigation errors.
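
The action format described above (f, l <deg>, r <deg>, q) can be sketched as a small parser with dead-reckoning position updates. This is only an illustration under stated assumptions: a flat local coordinate frame in metres, heading measured clockwise from north, and hypothetical names (`Pose`, `apply_action`, `distance_to`). It is not the benchmark's own implementation, which works on real map and Street View data.

```python
import math
from dataclasses import dataclass

STEP_METERS = 15.0  # forward step length stated in the benchmark description


@dataclass(frozen=True)
class Pose:
    x: float = 0.0            # metres east of the start (assumed local frame)
    y: float = 0.0            # metres north of the start
    heading_deg: float = 0.0  # degrees clockwise from north (assumed convention)
    finished: bool = False


def apply_action(pose: Pose, action: str) -> Pose:
    """Return the pose after one action: 'f', 'l <deg>', 'r <deg>' or 'q'."""
    parts = action.strip().split()
    if not parts:
        raise ValueError("empty action")
    cmd = parts[0].lower()
    if cmd == "f" and len(parts) == 1:
        rad = math.radians(pose.heading_deg)
        return Pose(pose.x + STEP_METERS * math.sin(rad),
                    pose.y + STEP_METERS * math.cos(rad),
                    pose.heading_deg)
    if cmd in ("l", "r") and len(parts) == 2:
        degrees = float(parts[1])
        signed = degrees if cmd == "r" else -degrees
        return Pose(pose.x, pose.y, (pose.heading_deg + signed) % 360.0)
    if cmd == "q" and len(parts) == 1:
        return Pose(pose.x, pose.y, pose.heading_deg, finished=True)
    raise ValueError(f"unrecognised action: {action!r}")


def distance_to(pose: Pose, target_x: float, target_y: float) -> float:
    """Straight-line distance in metres from the pose to a target point."""
    return math.hypot(target_x - pose.x, target_y - pose.y)


# Walk one step north, turn right 90 degrees, walk one step east, then finish.
pose = Pose()
for step in ["f", "r 90", "f", "q"]:
    pose = apply_action(pose, step)
```

A final-distance score of the kind mentioned in the description could then be derived from `distance_to(pose, target_x, target_y)` once the agent issues `q`; the actual scoring formula is not given in the listing.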

OSWorld Purple

by agentbeater

cua-green-sight-agent

by sivaraj-enverus

White Agent - Assessment of Spatial Intelligence (ASIN) Benchmark

by r0m4k

agentx-osworld

by tenalirama2005

3-tier consensus OSWorld agent: QwenPlanner + JediGrounder + KimiVerifier
