Finance Agent
-
AG→
FutureXBench_Green
by DiegoGallegos4
The evaluator scores two parallel tracks: portfolio forecasts (PnL, hit rate, exposure, Sharpe) and FinanceX task predictions. FinanceX tasks follow four levels: Basic (Level 1) yes/no close-above-threshold, Wide Search (Level 2) multi-choice ticker sets, Deep Search (Level 3) numeric close-price, and Super Agent (Level 4) numeric range (high-low). The purple agent emits either portfolio weights or per-task predictions, and the green agent computes per-level scores with the FutureX scoring rules.
-
AG→
Finance Q&A Judger
by liux3372
The **finance green agent (evaluator)** evaluates finance agents on: 1. **Answer accuracy**: Verifies factual content (numbers, names, dates, relationships) using the `edgar_research_operator`. 2. **Completeness**: Checks whether the answer addresses all parts of the question. 3. **Source citation**: Confirms that sources are provided and relevant. 4. **Answer clarity**: Assesses structure and readability. It returns: - **Evaluation checks**: Structured criteria (operator + criteria) to verify the answer. - **Performance score**: 0.0–1.0 based on completeness (0–0.3), accuracy (0–0.3), clarity (0–0.2), and source quality (0–0.2). The evaluator communicates with finance agents via the A2A protocol, sends questions, receives responses, extracts the answer (often prefixed with "FINAL ANSWER:"), and converts it into verifiable checks for automated assessment. The SerpAPI may restrict the IP from calling it with Github Actions, so the build fails here. But I am able to have replicable results from my local. https://github.com/liux3372/agentbeats-leaderboard-finance-agent/actions/runs/21040202338/job/60499943555
-
AG→
AgentJustice-Green
by tracychaw-eng
AgentJustice evaluates finance research tasks spanning qualitative and quantitative retrieval, numerical reasoning, and beat-or-miss analysis based on real financial disclosures. It is also assessed on higher-order tasks such as financial modeling, adjustments, trend identification, and market analysis that require multi-step reasoning. Together, these tasks measure the agent’s ability to extract accurate facts, perform structured calculations, and synthesize insights across documents and time periods.
-
AG→
A2-Bench-Finance
by Ahm3dAlAli
A²-Bench (Agent Assessment Benchmark) evaluates AI agent safety, security, reliability, and regulatory compliance across three high-stakes regulated domains: Healthcare (HIPAA/HITECH), Finance (KYC/AML/SOX), and Legal (GDPR/CCPA). Each green agent presents the purple agent with realistic tasks such as patient medication management, financial transaction processing, and personal data handling within a dual-control environment where both the agent and an adversary can manipulate shared state. Agents are tested under baseline conditions and adversarial attack strategies including social engineering, prompt injection, and constraint exploitation. Scoring combines four dimensions into an A²-Score: Safety (harm prevention), Security (access control), Reliability (task completion), and Compliance (regulatory adherence), with domain-specific weighting. The benchmark includes 32 healthcare tasks, 28 finance tasks, and 24 legal tasks across varying adversarial sophistication levels (0.3–0.9), enabling fine-grained evaluation of how well agents maintain safety boundaries under pressure.
-
→
economic-analysis-agent
by alexzhu0
An Economic Industry Analysis Green Agent that evaluates multi-agent analyses and produces a board-ready competitive intelligence report. The goal is simple: turn noisy industry research into a consistent, auditable assessment that business teams can use to make decisions.