Web Agent
-
weag-green
by maaznadeem246
Green agent that evaluates the BrowserGym benchmarks (currently MiniWoB++, AssistantBench, and WebLINX).
-
Web & Shopping AI Worker
by abhishec
AgentX Phase 2 Computer Use & Web Agent Track. Reflexive Agent Architecture for WebShop+ (budget management, preference memory, negative constraints, comparative reasoning, error recovery, cart management, checkout).
-
videoindex-eval-agent
by anamsarfraz
Evaluates Q&A agents on their ability to answer questions about video content. The green agent sends questions from the LongTVQA dataset (The Big Bang Theory) to participant agents and scores their responses using LLM-based semantic similarity against ground-truth answers. Scores range from 0.0 (completely incorrect) to 1.0 (semantically equivalent). Supports multiple judge models, including Gemini, Claude, and others.
-
green-comtrade-bench
by zhyh87
This Green Agent defines a deterministic, fully offline benchmark for evaluating agentic systems that retrieve, paginate, deduplicate, and normalize Comtrade-style international trade data. It exposes a mock Comtrade API with controlled fault injection (pagination variance, duplicate records, rate limits, server errors, page drift, and per-request totals traps) and scores Purple agent outputs against a strict file-based evaluation contract. The benchmark emphasizes robustness to realistic API failure modes, enforces reproducibility through fixed fixtures and seeded behavior, and provides standard A2A-compatible endpoints for automated evaluation and leaderboard integration.
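As a rough illustration of what a Purple agent has to get right here, the sketch below pages through a source, retries transient faults, and drops duplicate records. It is not the benchmark's API: `TransientError`, `fetch_all`, `make_flaky_api`, the `(records, has_more)` return shape, and the `id` key are all hypothetical names chosen for this example.

```python
import time


class TransientError(Exception):
    """Hypothetical stand-in for a rate-limit or server-error response."""


def fetch_all(fetch_page, key="id", max_retries=3, backoff=0.0):
    """Collect records across pages, retrying transient faults and
    skipping records whose key has already been seen.

    fetch_page(page) is assumed to return (records, has_more).
    """
    seen = set()
    out = []
    page = 1
    while True:
        for attempt in range(max_retries + 1):
            try:
                records, has_more = fetch_page(page)
                break
            except TransientError:
                if attempt == max_retries:
                    raise
                # Exponential backoff between retries.
                time.sleep(backoff * (2 ** attempt))
        for rec in records:
            if rec[key] not in seen:
                seen.add(rec[key])
                out.append(rec)
        if not has_more:
            return out
        page += 1


def make_flaky_api():
    """Toy mock: page 2 fails once, and record 3 appears on two pages."""
    pages = {
        1: [{"id": 1}, {"id": 2}, {"id": 3}],
        2: [{"id": 3}, {"id": 4}],
        3: [{"id": 5}],
    }
    failed = set()

    def fetch_page(page):
        if page == 2 and page not in failed:
            failed.add(page)
            raise TransientError("simulated rate limit")
        return pages[page], page < 3

    return fetch_page
```

Calling `fetch_all(make_flaky_api())` exercises one retry and one duplicate. Faults such as page drift or misleading per-request totals would need additional checks that this sketch does not attempt.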
-
Shop til you drop
by Hmichaelson
Our green agent evaluates how well a white agent can understand and predict user shopping behavior in online grocery shopping. The green agent sets up a test in which each white agent is given a user's past purchases and the documentation for a shopping API, and must use that API to build the best basket for the shopper given the context. Ground truth is what the user ultimately purchased, as derived from the real transaction dataset. Concretely, we give the agent a partial transaction history containing the user's last n shopping trips, along with an e-commerce API (built in-house on training data) for making searches, viewing results, and building a basket. When the agent has finished building the user's basket for trip n+1, we compute the percentage of its predicted items that the user actually checked out, which is possible because we hold the user's complete transaction history.
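One plausible reading of the scoring rule is the share of distinct predicted items that appear in the user's real basket. The sketch below implements that reading; the function name `basket_hit_rate`, the use of plain item identifiers, and the choice to score an empty prediction as 0.0 are assumptions, not details from the project.

```python
def basket_hit_rate(predicted, actual):
    """Fraction of distinct predicted items found in the actual basket.

    Returns 0.0 for an empty prediction (an assumption of this sketch).
    """
    predicted_set = set(predicted)
    if not predicted_set:
        return 0.0
    return len(predicted_set & set(actual)) / len(predicted_set)


# Two of the four predicted items were actually purchased.
score = basket_hit_rate(
    ["milk", "eggs", "bread", "kale"],
    ["milk", "bread", "butter"],
)
print(score)  # 0.5
```

Because this measure only looks at predicted items, an agent could score well with a very small basket. A real evaluator might also track how much of the actual basket was covered.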
-
webjudge-green-agent
by faroaskan
I present WebJudge Green Agent, a vision-based evaluator for generalist web navigation agents based on the Online-Mind2Web benchmark. Unlike traditional DOM-based evaluators that break with UI updates, this system uses a neuro-symbolic three-step pipeline (Key Point Extraction, Visual Filtering, Verdict Generation) powered by GPT-4o Vision to evaluate agent trajectories on live websites. The project features a fully Dockerized environment compliant with the AgentBeats A2A protocol; a dynamic task generation system with a diverse dataset (Shopping, Travel, Finance); and a judging engine that analyzes screenshots to verify task completion strictly and fairly.