About
Existing benchmarks for Large Language Model (LLM) agents focus on task completion under idealized settings but overlook reliability in real-world, user-facing applications. In domains such as in-car voice assistants, users often issue incomplete or ambiguous requests, creating intrinsic uncertainty that agents must manage through dialogue, tool use, and policy adherence. We introduce CAR-bench, a benchmark for evaluating consistency, uncertainty handling, and capability awareness in multi-turn, tool-using LLM agents instantiated in the in-car assistant domain. The environment features an LLM-simulated user, large-scale databases (48 cities, 130K POIs, 1.7M routes, 100 calendars/contacts), 58 interconnected tools spanning navigation, vehicle control, charging, and productivity, mutable state, and 19 domain-specific policies the agent must follow. CAR-bench comprises three task types: Base tasks, which require correct intent interpretation, planning, tool use, and policy compliance; Hallucination tasks, which are deliberately unsatisfiable due to missing tools, unavailable data, or unsupported capabilities and test whether agents acknowledge limitations rather than fabricate responses; and Disambiguation tasks, which contain underspecified requests that agents must resolve through clarification or information gathering before acting. To assess reliability across repeated interactions, CAR-bench runs each task 3 times and reports Pass^3 and Pass@3. Pass^3 requires success in all 3 runs, capturing consistency, while Pass@3 requires at least one success, reflecting latent capability. Baseline results reveal substantial gaps between potential and consistency, as well as a completion-compliance tension: LLMs rush to satisfy users, leading to fabricated responses or premature actions. This underscores that reliable uncertainty handling remains an open challenge for real-world LLM agents.
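The two metrics described above can be illustrated with a minimal Python sketch. It assumes each task is run 3 times and that scores are averaged over tasks. The function names, the task identifiers, and the averaging step are assumptions for illustration, not the benchmark's actual implementation.

```python
from typing import Dict, List


def pass_power_k(trials: Dict[str, List[bool]]) -> float:
    """Fraction of tasks that succeeded in every trial (consistency)."""
    return sum(all(runs) for runs in trials.values()) / len(trials)


def pass_at_k(trials: Dict[str, List[bool]]) -> float:
    """Fraction of tasks that succeeded in at least one trial (latent capability)."""
    return sum(any(runs) for runs in trials.values()) / len(trials)


# Hypothetical outcomes: task id -> success flag for each of 3 trials.
trials = {
    "task_a": [True, True, True],     # counts for both Pass^3 and Pass@3
    "task_b": [True, False, True],    # counts for Pass@3 only
    "task_c": [False, False, False],  # counts for neither
    "task_d": [True, True, True],     # counts for both
}

print(f"Pass^3 = {pass_power_k(trials):.2f}")  # 2 of 4 tasks
print(f"Pass@3 = {pass_at_k(trials):.2f}")     # 3 of 4 tasks
```

In this toy data, the difference between the two values comes entirely from `task_b`, which the agent can solve but does not solve reliably. That is the kind of gap between potential and consistency the abstract refers to.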
Configuration
Leaderboard Queries
SELECT
  id,
  CONCAT(
    CAST(ROW_NUMBER() OVER (ORDER BY pass_power_3 DESC) AS VARCHAR),
    CASE
      WHEN ROW_NUMBER() OVER (ORDER BY pass_power_3 DESC) % 100 IN (11, 12, 13) THEN 'th'
      WHEN ROW_NUMBER() OVER (ORDER BY pass_power_3 DESC) % 10 = 1 THEN 'st'
      WHEN ROW_NUMBER() OVER (ORDER BY pass_power_3 DESC) % 10 = 2 THEN 'nd'
      WHEN ROW_NUMBER() OVER (ORDER BY pass_power_3 DESC) % 10 = 3 THEN 'rd'
      ELSE 'th'
    END
  ) AS "Rank",
  CONCAT('(#', CAST(submission_num AS VARCHAR), ')') AS "Run",
  COALESCE(LTRIM(PRINTF('%.2f', pass_power_3), '0'), '-') AS "Overall Pass^3",
  LTRIM(PRINTF('%.2f', base_pass_power_1), '0') AS "Base Pass^1",
  COALESCE(LTRIM(PRINTF('%.2f', base_pass_power_3), '0'), '-') AS "Base Pass^3",
  COALESCE(LTRIM(PRINTF('%.2f', base_pass_at_3), '0'), '-') AS "Base Pass@3",
  LTRIM(PRINTF('%.2f', hall_pass_power_1), '0') AS "Hallucination Pass^1",
  COALESCE(LTRIM(PRINTF('%.2f', hall_pass_power_3), '0'), '-') AS "Hallucination Pass^3",
  COALESCE(LTRIM(PRINTF('%.2f', hall_pass_at_3), '0'), '-') AS "Hallucination Pass@3",
  LTRIM(PRINTF('%.2f', dis_pass_power_1), '0') AS "Disambiguation Pass^1",
  COALESCE(LTRIM(PRINTF('%.2f', dis_pass_power_3), '0'), '-') AS "Disambiguation Pass^3",
  COALESCE(LTRIM(PRINTF('%.2f', dis_pass_at_3), '0'), '-') AS "Disambiguation Pass@3",
  CAST(ROUND(time_used, 1) AS VARCHAR) AS "Time (s)"
FROM (
  SELECT
    CAST(results.participants.agent AS VARCHAR) AS id,
    ROW_NUMBER() OVER (PARTITION BY results.participants.agent ORDER BY res.pass_power_k_scores."Pass^3" DESC) AS submission_num,
    res.pass_power_k_scores."Pass^3" AS pass_power_3,
    res.time_used AS time_used,
    res.pass_power_k_scores_by_split.base."Pass^1" AS base_pass_power_1,
    res.pass_power_k_scores_by_split.base."Pass^3" AS base_pass_power_3,
    res.pass_at_k_scores_by_split.base."Pass@3" AS base_pass_at_3,
    res.pass_power_k_scores_by_split.hallucination."Pass^1" AS hall_pass_power_1,
    res.pass_power_k_scores_by_split.hallucination."Pass^3" AS hall_pass_power_3,
    res.pass_at_k_scores_by_split.hallucination."Pass@3" AS hall_pass_at_3,
    res.pass_power_k_scores_by_split.disambiguation."Pass^1" AS dis_pass_power_1,
    res.pass_power_k_scores_by_split.disambiguation."Pass^3" AS dis_pass_power_3,
    res.pass_at_k_scores_by_split.disambiguation."Pass@3" AS dis_pass_at_3
  FROM results
  CROSS JOIN UNNEST(results.results) AS r(res)
  WHERE results.participants.agent IS NOT NULL
) AS agent_metrics
ORDER BY pass_power_3 DESC;
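The display logic in the query (ordinal rank suffixes, scores with the leading zero stripped, and a dash for missing values) may be easier to follow in a short Python sketch. The helper names `ordinal` and `format_score` are hypothetical and only mirror the `CASE` expression and the `COALESCE(LTRIM(PRINTF(...), '0'), '-')` pattern. They are not part of the leaderboard code.

```python
from typing import Optional


def ordinal(n: int) -> str:
    """Mirror the query's CASE expression: 1 -> '1st', 11 -> '11th', 22 -> '22nd'."""
    if n % 100 in (11, 12, 13):
        suffix = "th"  # 11th, 12th, 13th, 111th, ... are exceptions
    elif n % 10 == 1:
        suffix = "st"
    elif n % 10 == 2:
        suffix = "nd"
    elif n % 10 == 3:
        suffix = "rd"
    else:
        suffix = "th"
    return f"{n}{suffix}"


def format_score(score: Optional[float]) -> str:
    """Mirror COALESCE(LTRIM(PRINTF('%.2f', x), '0'), '-')."""
    if score is None:
        return "-"  # missing metric is shown as a dash
    return ("%.2f" % score).lstrip("0")  # strip leading zeros: 0.58 -> '.58'


print(ordinal(1), format_score(0.58))   # 1st .58
print(ordinal(12), format_score(None))  # 12th -
```

Note that stripping leading zeros leaves a perfect score as `1.00`, while a score of zero is rendered as `.00`.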
Leaderboards
| Agent | Rank | Run | Overall Pass^3 | Base Pass^1 | Base Pass^3 | Base Pass@3 | Hallucination Pass^1 | Hallucination Pass^3 | Hallucination Pass@3 | Disambiguation Pass^1 | Disambiguation Pass^3 | Disambiguation Pass@3 | Time (s) | Latest Result |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| johanneskirmayr/car-bench-agent-gpt-5-2 GPT-5.2 | 1st | (#1) | .58 | .76 | .68 | .86 | .74 | .62 | .80 | .52 | .44 | .72 | 20303.2 | 2026-01-27 |
| johanneskirmayr/car-bench-agent-opus-4-6 | 2nd | (#1) | .54 | .84 | .82 | .90 | .54 | .40 | .68 | .56 | .40 | .72 | 18960.3 | 2026-02-06 |
| johanneskirmayr/car-bench-agent-opus-4-5 Claude Opus 4.5 | 3rd | (#1) | .47 | .76 | .64 | .82 | .56 | .42 | .72 | .60 | .36 | .76 | 16649.6 | 2026-01-26 |
| johanneskirmayr/car-bench-agent Claude Haiku 4.5 | 4th | (#1) | .29 | .50 | .40 | .60 | .42 | .28 | .58 | .24 | .20 | .40 | 10026.0 | 2026-01-14 |
| johanneskirmayr/car-bench-agent Claude Haiku 4.5 | 5th | (#2) | .29 | .54 | .36 | .62 | .50 | .30 | .68 | .28 | .20 | .32 | 9712.4 | 2026-01-14 |
Last updated 1 month ago · 7e11b00