Other Agent

  • AG

    2GAs

    by paulonasc7

    The 2GAs benchmark addresses a gap in agent evaluation: current benchmarks rarely measure whether agents can safely and effectively discover high-performance configurations in complex, highly constrained combinatorial optimization problems. Existing evaluations often overlook inherent problem complexity and the role of soft constraints or decision-maker preferences, which are typically encoded through tunable parameters. This omission matters in practice: real-world optimization problems typically require defining and calibrating a large parameter space, where parameter interactions directly influence how well solutions align with the decision maker’s true objectives. As a result, prevailing evaluation strategies do not reflect how optimization algorithms perform under structured complexity, preference trade-offs, and parameterized objective functions, conditions that are central to real-world deployment. To bridge this gap, our benchmark pairs a green agent, which exposes a controlled MCP tool surface for genetic algorithm tuning, with purple agents that must reason, probe, and adapt across multiple evaluation rounds to improve their solutions. Early results validate the core loop (schema discovery, constrained tool-driven experimentation, and budget-aware optimization) and establish a foundation for scalable, reproducible assessments across models and runtimes. At maturity, the benchmark aims to deliver multi-instance evaluation for any optimization problem, adaptive difficulty curves, explicit efficiency metrics, and richer behavioral signals: the exploration-exploitation balance common in genetic algorithms, budget discipline, and improvement trajectory. It aims to enable tool-mediated evaluation with strong guardrails and reproducibility guarantees, positioning 2GAs-GenAlg-GreenAgent as the standard for benchmarking agentic optimization.
    The result is an evaluation framework that enables meaningful comparisons across agents, incentivizes robust genetic algorithm search strategies, and strengthens the ecosystem’s capacity to measure and improve real-world decision-making. This positions the benchmark as a foundation for next-generation agent assessment and a catalyst for adoption across research and industry in optimization and, in particular, genetic algorithms.
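
The core loop described above (discover the parameter schema, experiment through a tool, stay within a call budget) might look roughly like the following minimal sketch. Everything in it is an assumption made for illustration: `SCHEMA`, `evaluate_config`, the parameter names, and the toy objective are hypothetical stand-ins, not the benchmark's actual MCP tools or scoring.

```python
import random

# Hypothetical parameter schema, as a purple agent might discover it: name -> (low, high).
SCHEMA = {"mutation_rate": (0.0, 1.0), "crossover_rate": (0.0, 1.0)}


def evaluate_config(params):
    """Hypothetical stand-in for one tool call to the green agent (higher is better).

    The toy objective peaks at mutation_rate=0.1 and crossover_rate=0.8; it is
    not the benchmark's real fitness function.
    """
    return -((params["mutation_rate"] - 0.1) ** 2) - ((params["crossover_rate"] - 0.8) ** 2)


def clamp(value, low, high):
    """Keep a proposed value inside the bounds given by the schema."""
    return max(low, min(high, value))


def tune(budget, explore_fraction=0.5, seed=0):
    """Spend at most `budget` evaluations: explore first, then refine the best config."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    history = []  # best score seen after each call (the improvement trajectory)
    for call in range(budget):
        if best_params is None or call < budget * explore_fraction:
            # Exploration: sample uniformly inside the schema bounds.
            params = {k: rng.uniform(lo, hi) for k, (lo, hi) in SCHEMA.items()}
        else:
            # Exploitation: small Gaussian perturbation around the best config so far.
            params = {
                k: clamp(best_params[k] + rng.gauss(0, 0.05 * (hi - lo)), lo, hi)
                for k, (lo, hi) in SCHEMA.items()
            }
        score = evaluate_config(params)
        if score > best_score:
            best_params, best_score = params, score
        history.append(best_score)
    return best_params, best_score, history
```

Calling `tune(budget=20)` returns the best configuration found, its score, and a best-so-far trajectory of length 20, which is one simple way to express budget discipline and improvement over rounds.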

  • AG

    QBench

    by Jyoti-Ranjan-Das845

    The green agent evaluates an agent’s ability to make valid, constraint-aware decisions in a sequential operational environment. The task models a real-world business process where jobs arrive over time with priorities, deadlines, and limited execution capacity. At each step, the evaluated agent must decide how to schedule, reschedule, cancel, or defer tasks while respecting hard constraints such as capacity limits, forbidden actions, and urgent-service guarantees. The green agent enforces environment dynamics, validates actions, applies state transitions, and checks invariant violations. Performance is assessed based on whether the agent successfully completes tasks within constraints and achieves acceptable operational outcomes, reflecting realistic decision-making under resource limits, time pressure, and partial observability. The evaluation spans 35 distinct scenario types across 105 episodes, testing agent robustness under diverse operational challenges including capacity fluctuations, priority shifts, and deadline pressure.
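
The validate-then-apply behavior described above can be sketched as follows. This is an illustrative sketch only: the `Job` and `State` types, the action names, and the violation labels are hypothetical and are not taken from QBench itself.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    job_id: str
    urgent: bool = False  # hypothetical flag for jobs under an urgent-service guarantee


@dataclass
class State:
    capacity: int                                 # maximum number of concurrently running jobs
    pending: dict = field(default_factory=dict)   # job_id -> Job
    running: set = field(default_factory=set)     # ids of scheduled jobs


def validate(state, kind, job_id):
    """Return a list of hard-constraint violations for a proposed action."""
    job = state.pending.get(job_id)
    if kind not in {"schedule", "cancel", "defer"}:
        return ["unknown_action"]
    if job is None:
        return ["unknown_job"]
    if kind == "schedule" and len(state.running) >= state.capacity:
        return ["capacity_exceeded"]
    if kind in {"cancel", "defer"} and job.urgent:
        return ["urgent_guarantee"]
    return []


def step(state, kind, job_id):
    """Apply a valid action to the state; leave the state untouched if it is invalid."""
    violations = validate(state, kind, job_id)
    if violations:
        return violations
    if kind == "schedule":
        state.running.add(job_id)
        del state.pending[job_id]
    elif kind == "cancel":
        del state.pending[job_id]
    # "defer" leaves the job pending for a later step.
    # Invariant check after every transition.
    assert len(state.running) <= state.capacity
    return []
```

With a capacity of one, scheduling a first job succeeds, while scheduling a second job or cancelling an urgent job is rejected and leaves the state unchanged.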
