Judges & calibration

Per-turn and arc-level judges come from different model families to reduce family bias. Selections rotate quarterly. Each rotation is validated against a small human-eval golden set; agreement rates and calibration scatter are published below.

Quarterly rotation

Quarter | Per-turn judge | Arc judge | Chair sign-off | Notes

Agreement rates

Spearman ρ between judge axis score and human axis score on the calibration golden set. Values below 0.6 trigger a re-rotation.
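
As a rough illustration of the check described above, the sketch below computes Spearman ρ for one axis and compares it with the 0.6 threshold. The function names (`average_ranks`, `spearman_rho`, `needs_rerotation`), the input layout (two equal-length lists of scores for the same golden-set items), and the tie handling (average ranks) are assumptions made for this example, not the actual pipeline.

```python
from math import sqrt

# Threshold taken from the text: values below 0.6 trigger a re-rotation.
RE_ROTATION_THRESHOLD = 0.6


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with values[order[i]].
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(judge_scores, human_scores):
    """Spearman rho: the Pearson correlation of the two rank vectors."""
    if len(judge_scores) != len(human_scores) or len(judge_scores) < 2:
        raise ValueError("need two equal-length score lists with at least 2 items")
    rj = average_ranks(judge_scores)
    rh = average_ranks(human_scores)
    n = len(rj)
    mean_j = sum(rj) / n
    mean_h = sum(rh) / n
    cov = sum((a - mean_j) * (b - mean_h) for a, b in zip(rj, rh))
    var_j = sum((a - mean_j) ** 2 for a in rj)
    var_h = sum((b - mean_h) ** 2 for b in rh)
    if var_j == 0 or var_h == 0:
        raise ValueError("rho is undefined when one side gives constant scores")
    return cov / sqrt(var_j * var_h)


def needs_rerotation(rho, threshold=RE_ROTATION_THRESHOLD):
    """True when agreement falls below the re-rotation threshold."""
    return rho < threshold


# Hypothetical scores for five golden-set items on one axis.
judge = [1, 2, 3, 4, 5]
human = [1, 3, 2, 5, 4]
rho = spearman_rho(judge, human)
print(round(rho, 3), needs_rerotation(rho))
```

Because ρ depends only on rank order, a judge that scores systematically higher or lower than the human raters can still show high agreement here; an offset of that kind would appear in the calibration scatter instead.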

Calibration scatter

Each point is one (judge, human) score pair on a single arc axis. The diagonal marks perfect agreement.