Per-turn and arc-level judges come from different model families
to reduce family bias. Judge selections rotate quarterly, and each
rotation is validated against a small human-evaluated golden set;
agreement rates and the calibration scatter are published below.
Quarterly rotation
Quarter | Per-turn judge | Arc judge | Chair sign-off | Notes
Agreement rates
Spearman's rank correlation (ρ) between each judge's axis scores and
the human axis scores on the calibration golden set. A ρ below 0.6
triggers a re-rotation.
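
As an illustration of the check described above, the following is a minimal sketch of how ρ and the 0.6 threshold could be computed. It is not the pipeline's actual code: the function names (`average_ranks`, `spearman_rho`, `needs_rerotation`) and the sample scores are hypothetical, and it assumes tied scores receive the average of their rank positions.

```python
from statistics import mean

RHO_THRESHOLD = 0.6  # from the text: values below this trigger a re-rotation


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(judge_scores, human_scores):
    """Spearman rho, computed as the Pearson correlation of the rank vectors."""
    if len(judge_scores) != len(human_scores) or len(judge_scores) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    rj, rh = average_ranks(judge_scores), average_ranks(human_scores)
    mj, mh = mean(rj), mean(rh)
    cov = sum((a - mj) * (b - mh) for a, b in zip(rj, rh))
    var_j = sum((a - mj) ** 2 for a in rj)
    var_h = sum((b - mh) ** 2 for b in rh)
    if var_j == 0 or var_h == 0:
        raise ValueError("rho is undefined when one side is constant")
    return cov / (var_j * var_h) ** 0.5


def needs_rerotation(judge_scores, human_scores, threshold=RHO_THRESHOLD):
    """Hypothetical helper: True when agreement falls below the threshold."""
    return spearman_rho(judge_scores, human_scores) < threshold


# Made-up scores for one axis on a five-item golden set.
judge = [1, 2, 3, 4, 5]
human = [1, 3, 2, 5, 4]
print(round(spearman_rho(judge, human), 3), needs_rerotation(judge, human))
```

Because ρ depends only on rank order, a judge that scores on a different scale than the human raters but orders items the same way still reaches ρ = 1; systematic offsets of that kind show up in the calibration scatter rather than in this statistic.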
Calibration scatter
Each point is one (judge, human) pair of scores on a single arc axis.
Points on the diagonal indicate perfect agreement.