Leaderboard

Rank System Category Score A1 A2 A3 A4 A5 A6 TS BT Cap

Click a system name to open its arc-by-arc detail page (per-session transcripts, callback ledger, per-turn rubric heatmap, judge notes, cost breakdown). Use the pairwise comparison viewer to compare two systems side by side on the same scenarios.

TrueSkill confidence intervals

Forest plot of TrueSkill μ ± 3σ for each system. Wider whiskers indicate that fewer pairwise matches have been recorded so far. The leaderboard column cites the conservative rating (μ − 3σ), the lower end of each whisker.
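As a rough illustration of how the whiskers and the conservative rating relate, the arithmetic can be sketched in a few lines of Python. The system names and the μ/σ values below are hypothetical, chosen only to show the calculation; they are not taken from the leaderboard, and this is not the site's actual scoring code.

```python
from typing import NamedTuple, Tuple


class Rating(NamedTuple):
    """A TrueSkill-style rating: mean skill (mu) and uncertainty (sigma)."""
    mu: float
    sigma: float


def interval(r: Rating, k: float = 3.0) -> Tuple[float, float]:
    """Return the whisker endpoints (mu - k*sigma, mu + k*sigma)."""
    return (r.mu - k * r.sigma, r.mu + k * r.sigma)


def conservative(r: Rating, k: float = 3.0) -> float:
    """Return the conservative rating mu - k*sigma (lower whisker end)."""
    return r.mu - k * r.sigma


# Hypothetical ratings, for illustration only.
ratings = {
    "system_a": Rating(mu=30.0, sigma=2.0),  # many matches, narrow whiskers
    "system_b": Rating(mu=32.0, sigma=4.0),  # fewer matches, wide whiskers
}

# Rank by conservative rating, highest first.
ranked = sorted(ratings, key=lambda name: conservative(ratings[name]), reverse=True)

for name in ranked:
    low, high = interval(ratings[name])
    print(f"{name}: conservative={conservative(ratings[name]):.1f} "
          f"interval=[{low:.1f}, {high:.1f}]")
```

In this made-up case system_b has the higher μ but ranks below system_a, because its wider uncertainty pulls the conservative rating down. That is the usual reason for ranking on μ − 3σ instead of μ alone: a system is not rewarded for a high estimate that rests on few matches.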
