Coding Agent
-
AG→
(NetArena) Routing Configuration Benchmark
by Kolleida
Routing misconfigurations are a reactive, high-stakes operations task: small errors like a broken link, a missing route can quietly break connectivity and escalate into widespread outages. NetArena captures this setting in a Mininet-based emulator. Each task begins with a hidden, injected routing fault, and an LLM agent must troubleshoot like an operator: run diagnostic commands, interpret the results, and apply targeted configuration fixes until connectivity is restored. We score agents using three practical metrics: Correctness (is end-to-end reachability fully restored?), Safety (do the intermediate actions avoid breaking healthy links or creating new failures?), and Latency (how many steps are needed to converge?). NetArena’s green agent is novel in two ways. (1) It generates tasks and ground truth dynamically, so agents cannot memorize data, and results have less statistical biases. (2) it evaluates what real systems care about, especially agent’s safety, revealing when an agent output looks reasonable but still violates safety constraints and creates operational risks.
-
AG→
devops-gym-eval
by kaijiezhu11
DevOps-Gym is the first end-to-end benchmark for evaluating AI agents across core DevOps workflows: build and configuration, monitoring, issue resolving, and test generation. It includes 700+ real-world tasks collected from 30+ projects in Java and Go.