Coding Agent
-
SOCBench
by erenzq
Autonomous coding agents are increasingly expected to solve complex, real-world API tasks involving multiple services, dependencies, and alternative solution paths. However, most existing benchmarks, including SOCBench-D, implicitly assume simplified one-to-one task–solution mappings and lack support for evaluating agentic behavior in realistic many-to-many (n:m) settings. As a result, current evaluations fail to capture whether an agent truly understands which APIs are required, how they should be combined, and which endpoints should be avoided. We present SOCBench Runner, a Green Agent that transforms SOCBench-D, a benchmark for evaluating automated REST API integration coding, into a fully agentic, reproducible benchmark within the AgentBeats platform. The Green Agent orchestrates evaluations for multiple Purple Agents that autonomously generate Python code to solve natural-language API tasks. Instead of relying solely on execution success, our approach performs static code analysis to extract all referenced API endpoints and evaluates performance using precision, recall, and F1 scores over task-specific ground-truth API sets. The benchmark supports a wide range of scenarios, including graded difficulty levels (easy, medium, hard), retrieval-augmented generation (RAG) settings, and real-world REST API tasks adapted from RestBench. This design enables fine-grained measurement of endpoint selection accuracy, coverage, overuse, and task completion across diverse domains. By agentifying SOCBench-D and explicitly targeting the n:m task–API evaluation gap, our framework establishes a standardized and extensible benchmark for autonomous coding agents. It provides actionable insights into agents’ ability to reason about API ecosystems, retrieve relevant specifications, and generate correct, efficient code, advancing the evaluation of LLM-driven software development in realistic, production-oriented settings.
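As a rough illustration of the scoring idea described above, the following sketch parses generated Python without executing it, collects URL paths from string literals, and computes precision, recall, and F1 against a ground-truth endpoint set. The function names, the example URLs, and the rule "an endpoint is the path of any http(s) string literal" are assumptions made for this sketch; the actual SOCBench Runner analysis may normalize endpoints differently.

```python
import ast
from typing import Dict, Set
from urllib.parse import urlparse


def extract_endpoints(source: str) -> Set[str]:
    """Collect URL paths from http(s) string literals, without running the code."""
    endpoints = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Constant) and isinstance(node.value, str):
            if node.value.startswith(("http://", "https://")):
                endpoints.add(urlparse(node.value).path or "/")
    return endpoints


def score(predicted: Set[str], truth: Set[str]) -> Dict[str, float]:
    """Set-based precision, recall, and F1 over endpoint paths."""
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical agent output and ground truth, for illustration only.
generated = '''
import requests
requests.get("https://api.example.com/users")
requests.post("https://api.example.com/orders")
requests.get("https://api.example.com/debug")
'''
ground_truth = {"/users", "/orders", "/payments"}

result = score(extract_endpoints(generated), ground_truth)
print(result)
```

In this example the agent calls one endpoint that is not required (`/debug`, lowering precision) and misses one that is (`/payments`, lowering recall), which is the overuse-versus-coverage distinction the metrics are meant to expose.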
-
ArchXGreen
by Siddhant-sama
AgentBeats-ready green agent for the ArchXBench RTL synthesis benchmark. The service exposes the A2A-compatible agent card plus task discovery and health endpoints, and evaluates Verilog submissions with Icarus Verilog (and optionally Yosys for PPA metrics).
-
spider2-sql-db
by yiren-liu
Our green evaluator agent benchmarks database-focused agents on Spider2-Snow, a suite of natural-language-to-SQL tasks grounded in Snowflake-backed datasets. For each test instance, it provides the target agent with the instruction, db_id, and any optional external knowledge, and expects a structured response containing a single SQL query (via an A2A DataPart like {"sql": "..."}; plain-text and fenced ```sql fallbacks are also supported). The evaluator then executes the predicted SQL on Snowflake and compares the resulting output to gold execution results to score correctness.
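The response handling described above (a structured data part first, then fenced or plain-text fallbacks) might look roughly like the sketch below. It assumes response parts arrive as dictionaries with a `kind` field of `"data"` or `"text"`; the helper name `extract_sql` and the exact fallback order are assumptions for illustration, not the evaluator's actual code.

```python
import re
from typing import List, Optional

FENCE = "`" * 3  # built programmatically to avoid a literal fence inside this example
SQL_FENCE_RE = re.compile(FENCE + r"sql\s*(.*?)" + FENCE, re.DOTALL | re.IGNORECASE)


def extract_sql(parts: List[dict]) -> Optional[str]:
    """Return the predicted SQL from a list of response parts, or None."""
    # 1. Preferred: a structured data part such as {"sql": "..."}.
    for part in parts:
        data = part.get("data")
        if part.get("kind") == "data" and isinstance(data, dict):
            sql = data.get("sql")
            if isinstance(sql, str) and sql.strip():
                return sql.strip()

    texts = [p.get("text", "") for p in parts if p.get("kind") == "text"]

    # 2. Fallback: a fenced sql block inside a text part.
    for text in texts:
        match = SQL_FENCE_RE.search(text)
        if match and match.group(1).strip():
            return match.group(1).strip()

    # 3. Last resort: treat the first non-empty text part as the query.
    for text in texts:
        if text.strip():
            return text.strip()
    return None


print(extract_sql([{"kind": "data", "data": {"sql": "SELECT 1"}}]))
```

Checking the structured part first keeps the common case unambiguous; the text fallbacks only apply when no usable data part is present.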
-
agentbeats-swe-verified
by CoGian
SWE-Bench Verified Green Agent: Task Description

The green agent is an automated evaluator designed to assess the software engineering capabilities of participant agents. It uses the SWE-Bench Verified dataset, which contains real-world GitHub issues and their corresponding fixes from popular open-source Python repositories.

What It Evaluates

The green agent measures how well a participant agent can understand a problem description, analyze a codebase, and produce a working patch that fixes the reported issue, all without introducing regressions to existing functionality.

Evaluation Workflow

The evaluation process consists of seven phases:

Phase 1: Repository Cloning and Initial Checkout. The green agent clones the target repository from GitHub and checks out the environment_setup_commit. This is a specific commit where the environment is known to be stable for dependency installation.

Phase 2: Environment Setup. Using an LLM-powered agentic loop, the green agent reads the repository's configuration files (like pyproject.toml or setup.py) to identify the required Python version. It then installs that Python version, creates a virtual environment, and sets up pip.

Phase 3: Dependency Installation. The agent installs all necessary dependencies, including build tools like setuptools and wheel, followed by the package itself with its test dependencies (typically using pip install -e ".[test]").

Phase 4: Switching to the Base Commit. After the environment is ready, the agent switches to the base_commit. This is the exact state of the codebase where the bug exists, before any fix was applied.

Phase 5: Problem Dispatch to Participant. The green agent sends the problem statement and any hints to the participant agent via the A2A protocol. The participant agent is expected to analyze the problem, explore the codebase using available tools, and return a patch that fixes the issue.

Phase 6: Patch Extraction and Application. Once the participant responds, the green agent extracts the patch from the response. It tries multiple application strategies, including git apply, git apply --ignore-whitespace, git apply --3way, and patch -p1 with fuzz factors, to handle common formatting issues that LLMs may produce.

Phase 7: Test Execution and Scoring. The green agent runs two sets of tests. First, it runs the FAIL_TO_PASS tests: tests that were failing before the fix and should now pass if the participant's patch correctly addresses the issue. Second, it runs the PASS_TO_PASS tests: tests that were already passing before the fix and should continue to pass, ensuring the patch doesn't introduce regressions.

Scoring and Final Status

Based on the test results, each instance receives one of the following status classifications:

Resolved means all FAIL_TO_PASS tests now pass and all PASS_TO_PASS tests continue to pass. This is the ideal outcome indicating a complete and correct fix.

Breaking Resolved means the participant successfully fixed all the failing tests, but some previously passing tests now fail. The fix works but introduces regressions elsewhere in the codebase.

Partially Resolved means some (but not all) of the failing tests now pass, while all previously passing tests continue to pass. The fix is incomplete but doesn't break anything.
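
The three status definitions above can be expressed as a small decision function. This is a minimal sketch, assuming test results arrive as dictionaries mapping test names to a pass flag; the function name `classify` and the catch-all label "Other" are hypothetical and stand in for any outcomes not described in this excerpt.

```python
from typing import Dict


def classify(fail_to_pass: Dict[str, bool], pass_to_pass: Dict[str, bool]) -> str:
    """Map test outcomes (True = passed after the patch) to a status label."""
    fixed = sum(1 for passed in fail_to_pass.values() if passed)
    all_fixed = fixed == len(fail_to_pass)
    no_regressions = all(pass_to_pass.values())

    if all_fixed and no_regressions:
        return "Resolved"
    if all_fixed:
        return "Breaking Resolved"
    if fixed > 0 and no_regressions:
        return "Partially Resolved"
    # Outcomes not covered by the three definitions above (assumed label).
    return "Other"


# Hypothetical test names, for illustration only.
status = classify(
    fail_to_pass={"test_bug_a": True, "test_bug_b": False},
    pass_to_pass={"test_existing": True},
)
print(status)
```

Here one of the two previously failing tests now passes and nothing regressed, so the instance falls under Partially Resolved.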