Matt 64e7be2418 docs: add design spec for juror-balance toggle and round-scoping fixes

Captures the per-round toggle, side-panel deeper display, "How scores
are calculated" explainer dialog, and the cross-round contamination
fixes for getProjectDetail and getProjectRankings.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-27 12:50:32 +02:00

Juror-Balanced Scoring Toggle + Round-Scoping Fixes

Status: design
Date: 2026-04-27
Author: Matt + Claude

Goal

Two related changes to the ranking system:

1. Add a per-round toggle that controls whether the ranking dashboard ranks projects by the juror-balanced (z-normalized) score or by the raw average. The toggle persists in Round.configJson and is shared across all viewers. Admins flip it from the side panel of the admin ranking dashboard; observers see the effect (which score is "active") but don't get the toggle UI themselves, matching today's role gates on the dashboard.
2. Fix cross-round contamination in two analytics procedures (getProjectDetail, getProjectRankings) and several UI surfaces that consume them. Per-juror balance contexts must be computed within a single round; aggregate stats (avg score, evaluator count, pass rate) must be scoped to the round being viewed.

A side panel "deeper display" replaces the small ⇢ X.X annotation on the list view: the list view stays clean, and clicking into a project surfaces the raw + balanced numbers, the toggle, an explainer, and per-juror balance contributions.

Background

Juror-balanced scoring (src/server/services/juror-balance.ts) corrects for per-juror grading harshness using z-normalization. Each juror's scores are normalized against that juror's own mean and standard deviation within the round, then rescaled onto the round's overall mean and standard deviation, so balanced numbers are directly comparable to raw averages.
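
A minimal sketch of the normalize-then-rescale idea for a single round. It is not the code in juror-balance.ts: the ScorePoint shape, the function names, the use of population standard deviation, and the handling of jurors with fewer than two scores or zero spread (they fall back to round-wide stats here) are all assumptions made for illustration.

```typescript
// Hypothetical shape; the real ScorePoint in juror-balance.ts may differ.
interface ScorePoint {
  jurorId: string;
  projectId: string;
  score: number;
}

interface MeanStd {
  mean: number;
  stddev: number;
}

// Population mean and standard deviation of a non-empty list (assumption:
// the real service may use the sample standard deviation instead).
function meanStd(values: number[]): MeanStd {
  const mean = values.reduce((a, b) => a + b, 0) / values.length;
  const variance =
    values.reduce((a, v) => a + (v - mean) * (v - mean), 0) / values.length;
  return { mean, stddev: Math.sqrt(variance) };
}

// Balanced average per project, computed from ONE round's points only.
export function balancedAverages(points: ScorePoint[]): Record<string, number> {
  const result: Record<string, number> = {};
  if (points.length === 0) return result;

  const round = meanStd(points.map((p) => p.score));

  // Collect each juror's scores within the round.
  const scoresByJuror: Record<string, number[]> = {};
  for (const p of points) {
    (scoresByJuror[p.jurorId] = scoresByJuror[p.jurorId] ?? []).push(p.score);
  }

  const totals: Record<string, { sum: number; n: number }> = {};
  for (const p of points) {
    const own = meanStd(scoresByJuror[p.jurorId]);
    // Assumption: a juror with < 2 scores or zero spread uses round-wide stats.
    const usable = scoresByJuror[p.jurorId].length >= 2 && own.stddev > 0;
    const ref = usable ? own : round;
    // z-score against the reference, then rescale onto the round's scale.
    const z = ref.stddev > 0 ? (p.score - ref.mean) / ref.stddev : 0;
    const rescaled = round.mean + z * round.stddev;
    const t = (totals[p.projectId] = totals[p.projectId] ?? { sum: 0, n: 0 });
    t.sum += rescaled;
    t.n += 1;
  }

  for (const projectId of Object.keys(totals)) {
    result[projectId] = totals[projectId].sum / totals[projectId].n;
  }
  return result;
}
```

Because the rescaling is linear, averaging the rescaled scores gives the same result as averaging the z-scores first and rescaling once. If every score in the round is identical, the round spread is zero and the balanced value equals the raw value.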

The math is correct, but two scoping problems exist:

Problem 1 — getProjectDetail is round-blind. The query at src/server/routers/analytics.ts:1417-1422 pulls every SUBMITTED evaluation for a project across every round it ever participated in, then computes Avg Score / Evaluators / Pass Rate from that pool. Meanwhile the per-juror list rendered in the admin sheet at src/components/admin/round/ranking-dashboard.tsx:1034-1036 filters to the current round. Result: stats card disagrees with the visible per-juror list.

Problem 2 — getProjectRankings (programId/edition mode) pools z-context across rounds. At src/server/routers/analytics.ts:212-218, when invoked with programId (instead of roundId), evaluations from every round in the edition are fed into a single computeBalanceContext. A juror's mean/stddev is then computed across mixed contexts (e.g. quick intake screening + deep evaluation), producing meaningless personal calibration.

Other call sites (ranking.ts, ai-juror-calibration.ts) already filter by round and are unaffected.

Surfaces affected

| # | Surface | Procedure | Issue |
|---|---------|-----------|-------|
| 1 | Admin ranking dashboard side sheet | `analytics.getProjectDetail` | Stats card pulls cross-round evals |
| 2 | Observer full project detail page | `analytics.getProjectDetail` | Same; observer-side |
| 3 | Observer reports preview dialog | `analytics.getProjectDetail` | Same; observer-side |
| 4 | Admin reports overview tab rankings | `analytics.getProjectRankings` | Edition mode uses cross-round z-context |
| 5 | Admin reports detail tab rankings | `analytics.getProjectRankings` | Same |
| 6 | Admin reports overview "Balanced Avg" tile | derives from #4 | Inherits the bad numbers |
| 7 | Result lock controls | `analytics.getProjectRankings` (roundId only) | OK, already round-scoped |
| 8 | Admin ranking dashboard list | `ranking.getRoundRanking` | OK, already filters by roundId |
| 9 | AI juror calibration service | self-contained | OK, already filters by roundId |

Design

1. Round-scoping fixes

`analytics.getProjectDetail`

Add an optional roundId to the input schema.
When roundId is provided, filter submittedEvaluations (the query at line 1417) by assignment: { roundId }. The stats block computed from those evaluations becomes round-scoped automatically.
When roundId is not provided, return stats: null and a new field statsByRound: Array<{ roundId, roundName, stats }> so callers can render per-round breakdowns instead of one misleading aggregate. (The current dialogs always know which round they want — they just weren't passing it.)
Pass roundId from the three callers (#1, #2, #3 above).
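
The two return shapes can be sketched as follows. This is an in-memory illustration under stated assumptions, not the procedure itself: SubmittedEval, Stats, computeStats, and projectStats are hypothetical names, pass rate is assumed to be the fraction of "Yes" votes, and the real implementation would apply the round filter in the database query rather than on an array. Returning an empty statsByRound when roundId is given is also an assumption; the spec does not say.

```typescript
// Hypothetical evaluation shape; field names are illustrative only.
interface SubmittedEval {
  roundId: string;
  roundName: string;
  jurorId: string;
  score: number;
  passed: boolean;
}

interface Stats {
  avgScore: number;
  evaluators: number;
  passRate: number; // fraction of evaluations with a "Yes" vote, 0..1
}

interface RoundStatsEntry {
  roundId: string;
  roundName: string;
  stats: Stats;
}

function computeStats(evals: SubmittedEval[]): Stats | null {
  if (evals.length === 0) return null;
  const jurors: Record<string, true> = {};
  let total = 0;
  let yes = 0;
  for (const e of evals) {
    jurors[e.jurorId] = true;
    total += e.score;
    if (e.passed) yes += 1;
  }
  return {
    avgScore: total / evals.length,
    evaluators: Object.keys(jurors).length,
    passRate: yes / evals.length,
  };
}

export function projectStats(
  evals: SubmittedEval[],
  roundId?: string,
): { stats: Stats | null; statsByRound: RoundStatsEntry[] } {
  if (roundId !== undefined) {
    // Round-scoped: only this round's evaluations feed the stats card.
    const scoped = evals.filter((e) => e.roundId === roundId);
    return { stats: computeStats(scoped), statsByRound: [] };
  }
  // No round given: no single aggregate, one entry per round instead.
  const grouped: Record<string, SubmittedEval[]> = {};
  for (const e of evals) {
    (grouped[e.roundId] = grouped[e.roundId] ?? []).push(e);
  }
  const statsByRound = Object.keys(grouped).map((id) => ({
    roundId: id,
    roundName: grouped[id][0].roundName,
    stats: computeStats(grouped[id]) as Stats, // groups are never empty
  }));
  return { stats: null, statsByRound };
}
```

With three evaluations of 9, 8, 8 in the viewed round and two more (7 and 8) in another round, the scoped call yields an average that displays as 8.3 with 3 evaluators, whereas pooling all five would give 8.0 with 5.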

`analytics.getProjectRankings`

When called in edition mode (programId only), z-normalization must run per round, not across the pool:

Group points: ScorePoint[] by roundId (we'll need to include roundId in each point — currently evalWhere returns flat evaluations; add assignment.round.id to the select).
For each round, call computeBalanceContext(pointsForRound) and computeBalancedProjectScores(pointsForRound, ctx).
Aggregate per-project: a project's edition-level balancedScore is the unweighted mean of its per-round balanced averages. Its averageScore (raw) is the unweighted mean of its per-round raw averages.
evaluationCount becomes the total across rounds (unchanged in spirit).

In roundId mode, behavior is unchanged.
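
A sketch of the edition-mode aggregation, assuming each point already carries its roundId. The scoreRound callback stands in for the computeBalanceContext + computeBalancedProjectScores pair, whose real signatures are not shown in this spec; editionRankings, ProjectRoundScore, and the field names are hypothetical.

```typescript
// Hypothetical shapes; the real types in analytics.ts may differ.
interface ScorePoint {
  roundId: string;
  jurorId: string;
  projectId: string;
  score: number;
}

interface ProjectRoundScore {
  projectId: string;
  rawAverage: number;
  balancedAverage: number;
  evaluationCount: number;
}

// Stand-in for the per-round balance computation.
type ScoreRound = (pointsForRound: ScorePoint[]) => ProjectRoundScore[];

interface EditionRow {
  projectId: string;
  averageScore: number;
  balancedScore: number;
  evaluationCount: number;
}

export function editionRankings(
  points: ScorePoint[],
  scoreRound: ScoreRound,
): EditionRow[] {
  // 1. Group points by round so no balance context spans rounds.
  const byRound: Record<string, ScorePoint[]> = {};
  for (const p of points) {
    (byRound[p.roundId] = byRound[p.roundId] ?? []).push(p);
  }

  // 2. Score each round independently and accumulate per project.
  const acc: Record<
    string,
    { raw: number; balanced: number; rounds: number; count: number }
  > = {};
  for (const roundId of Object.keys(byRound)) {
    for (const s of scoreRound(byRound[roundId])) {
      const a = (acc[s.projectId] =
        acc[s.projectId] ?? { raw: 0, balanced: 0, rounds: 0, count: 0 });
      a.raw += s.rawAverage;
      a.balanced += s.balancedAverage;
      a.rounds += 1;
      a.count += s.evaluationCount;
    }
  }

  // 3. Unweighted mean of per-round averages; counts are summed.
  return Object.keys(acc)
    .map((projectId) => ({
      projectId,
      averageScore: acc[projectId].raw / acc[projectId].rounds,
      balancedScore: acc[projectId].balanced / acc[projectId].rounds,
      evaluationCount: acc[projectId].count,
    }))
    .sort((a, b) => b.balancedScore - a.balancedScore);
}
```

Note the consequence of the unweighted mean: a round with one evaluation counts as much as a round with four. A project averaging 8 over four evaluations in one round and 4 over a single evaluation in another gets an edition-level raw average of 6, not the pooled 7.2.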

Default round resolution (observer full project page, #2)

The observer page at /observer/projects/[projectId] doesn't know which round to focus on. Resolution logic:

Among rounds where ProjectRoundState exists for this project:
  1. If exactly one round.status = ROUND_ACTIVE, use it.
  2. Else use the most recent round with status = ROUND_CLOSED
     (ordered by sortOrder desc, or exitedAt desc as tiebreak).
  3. Else if only ROUND_DRAFT rounds exist, fall back to none (stats: null).
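
The resolution rules above could be sketched as a pure function. ParticipatedRound and resolveDefaultRoundId are hypothetical names, and the sketch assumes exitedAt is available per round for the project (for example from ProjectRoundState) and that only the three statuses named here matter. When more than one round is active, rule 1 does not apply and the function falls through to rule 2.

```typescript
// Assumption: only the statuses named in the rules are modeled here.
type RoundStatus = "ROUND_DRAFT" | "ROUND_ACTIVE" | "ROUND_CLOSED";

// Hypothetical shape: one entry per round that has a ProjectRoundState.
interface ParticipatedRound {
  id: string;
  status: RoundStatus;
  sortOrder: number;
  exitedAt: Date | null;
}

export function resolveDefaultRoundId(
  rounds: ParticipatedRound[],
): string | null {
  // Rule 1: exactly one active round wins.
  const active = rounds.filter((r) => r.status === "ROUND_ACTIVE");
  if (active.length === 1) return active[0].id;

  // Rule 2: most recent closed round, sortOrder desc then exitedAt desc.
  const exitedTime = (r: ParticipatedRound) =>
    r.exitedAt ? r.exitedAt.getTime() : 0;
  const closed = rounds
    .filter((r) => r.status === "ROUND_CLOSED")
    .sort(
      (a, b) => b.sortOrder - a.sortOrder || exitedTime(b) - exitedTime(a),
    );
  if (closed.length > 0) return closed[0].id;

  // Rule 3: nothing usable (e.g. only drafts); caller renders stats: null.
  return null;
}
```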

A small round selector chip near the stats card lets the user switch contexts; the URL updates with ?round=<id>.

2. Per-round balanced-scoring toggle

Storage

Add useBalancedRanking: boolean to Round.configJson (default true — preserve current behavior). No schema migration needed since configJson is already a flexible JSON column.
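
Since configJson is untyped JSON, the read side needs a defensive default. A minimal sketch, assuming a hypothetical helper named readUseBalancedRanking; how the codebase actually parses Round.configJson is not shown in this spec.

```typescript
// Reads the flag from an untyped JSON column. Anything other than an
// explicit boolean falls back to true, which preserves current behavior
// for rounds created before the field existed.
export function readUseBalancedRanking(configJson: unknown): boolean {
  if (typeof configJson === "object" && configJson !== null) {
    const value = (configJson as Record<string, unknown>).useBalancedRanking;
    if (typeof value === "boolean") return value;
  }
  return true;
}
```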

tRPC procedure

Extend ranking.updateConfig (or add setUseBalancedRanking). The open question was whether this should be an admin-level or observer-level procedure. The page is admin-only today, so observer access for this toggle would be a deliberate widening. Decision: keep it adminProcedure (PROGRAM_ADMIN + SUPER_ADMIN). The user's requirement was "anyone who can view should be able to toggle," and since the page is gated to admins, adminProcedure already satisfies it.

UI integration

Toggle lives at the top of the side sheet (not the list view) — labeled "Use balanced scoring for ranking" with a help icon that opens the explainer.
When toggled, the dashboard re-sorts immediately (the list-view sort at ranking-dashboard.tsx:417,879 reads from evalScores.balanced[id]?.balancedAverage; we'll wrap that in useBalancedRanking ? balanced : raw).
The list row's compact ⇢ X.X annotation is removed. Visual delta lives in the side panel only.
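
The sort switch might look like the sketch below. RankedRow and sortForRanking are hypothetical names, not the code at ranking-dashboard.tsx:417,879, and falling back to the raw average when a balanced value is missing is an assumption.

```typescript
// Hypothetical row shape for the list view.
interface RankedRow {
  projectId: string;
  rawAverage: number | null;
  balancedAverage: number | null;
}

// Returns a new array sorted by the active score, highest first.
export function sortForRanking(
  rows: RankedRow[],
  useBalancedRanking: boolean,
): RankedRow[] {
  const active = (r: RankedRow): number => {
    const chosen = useBalancedRanking ? r.balancedAverage : r.rawAverage;
    // Assumption: fall back to raw, then sort unscored rows last.
    return chosen ?? r.rawAverage ?? -Infinity;
  };
  return rows.slice().sort((a, b) => active(b) - active(a));
}
```

Returning a copy rather than sorting in place keeps the function safe to call during render.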

3. Side panel deeper display

The existing side sheet (ranking-dashboard.tsx:970-1090) gains:

Stats area (replaces the current 3-card grid)

┌──────────────────────────────────────────────────────────────┐
│ Avg Score                                                    │
│   Raw: 8.3      Balanced: 8.0  ← used for ranking            │
│                                                              │
│ Evaluators: 3        Pass Rate: 67%                          │
│                                                              │
│ ⓘ How is this calculated?  (collapsible)                     │
└──────────────────────────────────────────────────────────────┘

"Raw" and "Balanced" sit side-by-side. The active one (per the round's toggle) gets a subtle "← used for ranking" tag and bolder weight.
Both numbers always show one decimal (.toFixed(1)).
Below the numbers, a clickable affordance: "How scores are calculated" (small button or link with an info icon). Clicking opens an explainer dialog (see "Score explainer dialog" below).

Per-juror rows (extends current `Juror Evaluations` block)

Each row currently shows Name · Yes/No badge · Score: 9.0. New layout when balanced is on:

Rachid Benchaouir          Yes   Score: 9.0   (typical 7.2 → contributes 8.5)

The trailing chip is muted text. When balanced is off, the chip is hidden. Tooltip on the chip explains the calculation.
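
A small formatting sketch for the chip, assuming the server already supplies the juror's typical score and the rescaled contribution for the row. JurorContribution and formatBalanceChip are hypothetical names.

```typescript
// Hypothetical per-juror values supplied alongside each evaluation row.
interface JurorContribution {
  typicalScore: number; // the juror's average within this round
  contribution: number; // the juror's score after balancing
}

// Returns the chip text, or null when the chip should be hidden.
export function formatBalanceChip(
  useBalancedRanking: boolean,
  c: JurorContribution | null,
): string | null {
  if (!useBalancedRanking || c === null) return null;
  return `(typical ${c.typicalScore.toFixed(1)} → contributes ${c.contribution.toFixed(1)})`;
}
```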

Per-round toggle row at top

[Use balanced scoring for ranking]  [toggle]   ⓘ

Single horizontal row, just below the project header. Persists on flip. The ⓘ icon opens the same "How scores are calculated" dialog.

Score explainer dialog ("How scores are calculated")

A reusable dialog component (<ScoreExplainerDialog />) opens from the affordance in the side panel and from a matching affordance on the observer surfaces (#2, #3) so both audiences see the same explanation. Content is plain-language, not academic, and walks through one concrete worked example.

Structure:

What it does (1 paragraph) — "Different jurors have different grading styles. Some grade harshly, some leniently. Balanced scoring corrects for that so a project isn't punished for drawing harsh jurors or rewarded for drawing lenient ones."
How it works, step by step — five short numbered points:
1. For each juror, calculate their personal average and spread across all the projects they scored in this round.
2. Convert each individual score into "how many standard deviations above or below this juror's typical score": a 6 from a juror who averages 5 reads the same as a 9 from a juror who averages 8, provided the two jurors have a similar spread.
3. Average those normalized values across the project's jurors.
4. Rescale back onto the same 1–10 scale using the round's overall average and spread.
5. The result is directly comparable to the raw average — same scale, but corrected for grading style.

Worked example — a concrete table using fabricated jurors, e.g.:

| Juror | Their typical avg | Their score for "Project X" | What that means |
|-------|-------------------|-----------------------------|-----------------|
| Juror A (lenient) | 8.2 | 9.0 | Just slightly above their typical (+0.4σ) |
| Juror B (harsh) | 5.8 | 7.5 | Well above their typical (+1.5σ) |
| Juror C (typical) | 7.0 | 8.0 | Slightly above their typical (+0.7σ) |

"Raw average: (9.0 + 7.5 + 8.0) / 3 ≈ 8.2. The balanced average rescales each juror's enthusiasm to the round's overall scale and lands at 8.4: Juror B's strong endorsement (well above their harsh baseline) carries more weight than the raw 7.5 suggests."

When it kicks in / when it doesn't — short paragraph:
- Needs ≥ 2 evaluations from the round to compute a juror's spread; otherwise that juror falls back to the round-wide average.
- Needs at least one juror with non-zero spread for the round; if everyone gave identical scores, balanced equals raw.
- Computed within a single round only — a juror's grading style in an intake screening round doesn't affect their balance in a deeper evaluation round.
Why "Raw" is still shown — "We always show both numbers so admins can sanity-check. The toggle at the top of the panel decides which one is used for ranking."

The dialog is a shadcn/ui Dialog, max-width ~md, scrollable. No live data — content is static text + the static example table. Lives in src/components/shared/score-explainer-dialog.tsx so it can be imported by admin and observer surfaces alike.
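
As a quick arithmetic check on the worked example, the numbers in the table can be combined as below. The spec gives each juror's offset in σ but not the round's overall mean $\mu_R$ and spread $\sigma_R$ (symbols introduced here for illustration), so the last step is left symbolic.

```latex
\begin{align*}
\text{raw} &= \frac{9.0 + 7.5 + 8.0}{3} = \frac{24.5}{3} \approx 8.17
  \quad (\text{displayed as } 8.2) \\
\bar{z} &= \frac{0.4 + 1.5 + 0.7}{3} = \frac{2.6}{3} \approx 0.87 \\
\text{balanced} &= \mu_R + \bar{z}\,\sigma_R \approx \mu_R + 0.87\,\sigma_R
\end{align*}
```

The quoted 8.4 therefore corresponds to any round whose mean and spread satisfy $\mu_R + 0.87\,\sigma_R \approx 8.4$; if the example table is ever changed, the static copy should be rechecked against this relation.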

4. Decimal display audit

Standardize on one decimal for all balanced/raw score surfaces:

admin/reports/page.tsx:368 currently shows toFixed(2) — change to toFixed(1).
All other sites already use .toFixed(1) or compute integers.

Data flow summary

Round.configJson.useBalancedRanking ──→ ranking-dashboard reads on mount
                                    ──→ list sort uses raw or balanced based on flag
                                    ──→ side panel shows both, marks the active one

getProjectDetail({ id, roundId })  ──→ filtered submittedEvaluations
                                   ──→ round-scoped stats
                                   ──→ optionally: per-round balance context computed
                                       inline for the side panel deeper display

getProjectRankings({ programId })  ──→ group by roundId
                                   ──→ per-round balance context
                                   ──→ aggregate per-project means across rounds

Out of scope

Migrating historical ResultLock snapshots that captured the old (potentially miscomputed) edition-level rankings. Past locks were round-scoped, so they're already correct; only the read-time edition rollup was broken.
Exposing the toggle to OBSERVER role. Today it's admin-only, matching page access.
AI calibration service changes — already round-scoped.
Changing the underlying juror-balance math. The algorithm is correct; only the inputs needed scoping.

Risks

Edition rollup semantic change. Anyone currently looking at "all rounds" balanced rankings will see different numbers after the fix. This is the intended outcome, because the numbers shown today are not trustworthy, but the change should be communicated to the team.
Toggle default. Defaulting useBalancedRanking = true preserves today's behavior. Existing rounds without the field set use the default.
Side-panel re-renders. The toggle live-updates the list sort; ensure useQuery invalidations are wired so a flip in the panel triggers a re-fetch / re-sort without a full page reload.

Open items

None blocking. Implementation plan can proceed.

Acceptance criteria

With 3 round-scoped evaluations of 9, 8, 8, the side panel stats card shows Avg 8.3 (not 8.0) and Evaluators 3 (not 5).
Flipping the per-round toggle re-sorts the list view; the choice persists across page reloads and is shared across users.
The list view shows no per-row balanced delta annotation.
The side panel always shows both Raw and Balanced; the active one is marked.
Edition-level rankings (programId mode) compute one balance context per round and aggregate, never pooling across rounds.
Observer project detail page defaults to the currently-active or most-recently-closed round the project participated in.
All score displays use one decimal.
A "How scores are calculated" affordance is present in the admin side panel, the observer full project page, and the observer reports preview dialog. Clicking it opens an explainer dialog with the algorithm summary, a step-by-step plain-language walkthrough, and a worked example.
