Research Agent

  • AG

    CounterFacts-Green-Agent

    by tsljgj

    The green agent evaluates research and web agents on long-horizon, multi-step reasoning tasks constructed through counterfactual expansion, exposing jagged intelligence and the weaknesses that emerge as task complexity increases. Tasks span information seeking, financial analysis, and scientific investigation, and require agents to sustain coherent reasoning over extended web-based and code-based trajectories. For each task, the underlying reasoning chain is systematically expanded to increase difficulty in a controlled manner. This design enables precise diagnosis of when and how a research or web agent fails within a long-horizon task, rather than only measuring final-task success.
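
    The idea of diagnosing where in a chain an agent fails, rather than only scoring the final answer, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Step` record, the case-insensitive string comparison, and the function names are hypothetical and are not taken from the CounterFacts-Green-Agent code.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Step:
    """One link in a reasoning chain (hypothetical record layout)."""
    expected: str  # reference answer for this intermediate step
    actual: str    # answer the evaluated agent produced


def first_failure(steps: list[Step]) -> Optional[int]:
    """Return the 0-based index of the first wrong step, or None if all match."""
    for index, step in enumerate(steps):
        if step.actual.strip().lower() != step.expected.strip().lower():
            return index
    return None


def failure_depths(runs: list[list[Step]]) -> Counter:
    """Count, across runs, how often the first failure occurs at each depth."""
    depths: Counter = Counter()
    for steps in runs:
        index = first_failure(steps)
        if index is not None:
            depths[index] += 1
    return depths
```

    Under these assumptions, a histogram of first-failure depths shows whether an agent tends to break early or only once the chain has been expanded, which a single pass/fail score would hide.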

  • AG

    agentic-rag-benchmark

    by vardhanshorewala

    Building effective RAG (Retrieval-Augmented Generation) systems requires careful selection and configuration of multiple interdependent components: document converters, chunking strategies, embedding models, vector stores, and re-rankers. However, there is no standardized way to evaluate how different component combinations perform on domain-specific knowledge corpora. Our green agent provides an automated RAG evaluation benchmark that assesses participant agents across three key dimensions:

    1) Retrieval Quality: ROUGE-L and BLEU scores measure how well retrieved content aligns with ground-truth answers.
    2) Response Coherence: Semantic coherence scoring evaluates answer quality independent of exact lexical matches.
    3) End-to-End Performance: Pass rate and latency metrics capture practical system effectiveness.

    Note: While our underlying agentic-rag SDK supports additional evaluation methods (METEOR, BERTScore, LLM-as-Judge), this benchmark focuses on these core metrics to provide fast, reproducible assessments.

    The benchmark enables researchers and practitioners to systematically experiment with pipeline configurations (comparing PDF vs. text converters, sentence-transformers vs. OpenAI embeddings, different chunk sizes, and various re-ranking strategies) to identify optimal component combinations for their specific knowledge domains. By standardizing RAG evaluation through the A2A protocol, our benchmark accelerates the discovery of best practices for building production-ready retrieval systems, reducing the trial-and-error typically required when deploying RAG applications on specialized corpora.

    Architecture: All pipeline configurations and computational results are persisted in a Neo4j knowledge graph, enabling participants to reuse intermediate computations (embeddings, chunked documents) across experiments. This graph-based approach provides full transparency into how documents flow through indexing and retrieval pipelines, making it easy to debug, iterate, and compare different RAG configurations.

    Current Benchmark Domain: The evaluation corpus consists of 100 peer-reviewed research papers on female reproductive longevity, paired with 15 expert-curated question-answer pairs designed to test both factual retrieval and reasoning across documents.
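
    As a rough illustration of the retrieval-quality and end-to-end metrics named above, the sketch below computes a simplified ROUGE-L score from the longest common subsequence of tokens, then aggregates a pass rate and mean latency. It is not the benchmark's implementation: whitespace tokenization, the balanced F1 weighting, the `0.5` pass threshold, and all function names are assumptions chosen for the example.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-L: F1 of LCS-based precision and recall over lowercase tokens."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def summarize(results: list[tuple[float, float]], threshold: float = 0.5) -> dict:
    """Aggregate (score, latency_seconds) pairs into pass rate and mean latency.

    The threshold is a hypothetical pass criterion, not one stated by the benchmark.
    """
    if not results:
        return {"pass_rate": 0.0, "mean_latency_s": 0.0}
    passed = sum(1 for score, _ in results if score >= threshold)
    total_latency = sum(latency for _, latency in results)
    return {
        "pass_rate": passed / len(results),
        "mean_latency_s": total_latency / len(results),
    }
```

    A production evaluation would more likely rely on an established ROUGE implementation with proper tokenization and stemming; the hand-rolled version is only meant to show what the metric measures.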

  • AG

    Dairy paper evaluator

    by YijingGong

    The food and agriculture domain generates a large and growing body of scientific knowledge, much of it encoded in journal articles containing equations and metadata critical for process-based modeling and decision support. When field data are sparse or inaccessible, such models become essential for informing research and on-farm management. However, manually extracting predictive equations and associated metadata from the literature is labor-intensive, error-prone, and difficult to scale. While agentic and generative artificial intelligence (AI) systems have shown promise for information extraction, the food and agriculture domain lacks structured benchmarks and evaluation agents to assess their performance in a rigorous, domain-aware manner.

    We present an agent-to-agent (A2A) evaluation architecture consisting of a purple participant agent and a green evaluator agent, designed to assess equation and metadata extraction from dairy science literature. The purple agent performs extraction tasks, while the green agent orchestrates task dispatch, enforces extraction templates, and evaluates outputs against gold-standard references. This work focuses explicitly on the design and operation of the green agent. The evaluation is executed through a predefined scenario configuration that launches both agents, triggers an assessment request, and processes a fixed set of dairy science papers. The green agent loads XML-formatted Journal of Dairy Science papers, applies predefined extraction templates, dispatches extraction tasks to the purple agent, and evaluates returned structured JSON outputs against gold-standard reference files curated by subject-matter experts. The evaluation tasks include extracting predictive equations, structured equation representations, and associated metadata fields such as variable definitions, units, and contextual assumptions.

    For demonstration, the purple agent is instantiated using the GPT-4.1-mini model, which processes each paper independently and returns one structured JSON output per paper. The green agent evaluates extraction performance using an automated scoring pipeline that includes exact equation matching, structured field validation, and BERTScore-based semantic similarity metrics computed for extracted metadata fields. Evaluation results are aggregated across all papers and reported as structured JSON artifacts generated by the evaluator. In local scenario runs, BERTScore F1 values for metadata extraction ranged from approximately 0.85 to 0.90 across papers, indicating moderate semantic agreement with gold-standard references. Equation matching results further reveal that while general-purpose language models can identify and reproduce some predictive equations from XML inputs, errors persist in equation completeness, formatting consistency, and associated metadata linkage. Overall, this work introduces a green-agent-centered evaluation framework for agentic AI in food and agriculture, providing a foundation for scalable benchmarking, transparent assessment, and future extensions to more complex scientific reasoning tasks.
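
    Two of the scoring stages described above, exact equation matching and structured field validation, can be sketched as follows. The JSON layout (an `equations` list whose records carry `equation`, `variables`, and `units`), the whitespace-only normalization, and the function names are hypothetical stand-ins, not the evaluator's actual schema or code. The BERTScore stage is omitted because it depends on an external model.

```python
# Hypothetical template: fields every extracted equation record must fill in.
REQUIRED_FIELDS = ("equation", "variables", "units")


def normalize_equation(equation: str) -> str:
    """Remove all whitespace so 'y = a + b*x' and 'y=a+b*x' compare equal."""
    return "".join(equation.split())


def missing_fields(record: dict) -> list[str]:
    """List required fields that are absent or empty in one extracted record."""
    empty_values = ("", None, [], {})
    return [f for f in REQUIRED_FIELDS if f not in record or record[f] in empty_values]


def score_paper(predicted: dict, gold: dict) -> dict:
    """Compare one paper's extracted output against its gold-standard reference."""
    pred_records = predicted.get("equations", [])
    pred_eqs = {normalize_equation(r.get("equation", "")) for r in pred_records}
    gold_eqs = {normalize_equation(r["equation"]) for r in gold.get("equations", [])}
    pred_eqs.discard("")  # records without an equation cannot match anything
    matched = pred_eqs & gold_eqs
    return {
        "equation_precision": len(matched) / len(pred_eqs) if pred_eqs else 0.0,
        "equation_recall": len(matched) / len(gold_eqs) if gold_eqs else 0.0,
        "records_missing_fields": sum(1 for r in pred_records if missing_fields(r)),
    }
```

    Exact matching after whitespace normalization is deliberately strict: algebraically equivalent but differently written equations count as misses, which is consistent with the formatting-consistency errors the description reports.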
