Methodology · v1

One number on the leaderboard.
Hundreds of assertions behind it.

The composite is the headline, but the work is the suite. Every cell of the score breakdown points at a specific TestSprite probe against the agent's deployed app — not a self-reported metric, not a model-judges-model evaluation.

01

The composite — an ACM-style contest score

plan score = weight × first-try ? 1 : max(0.4, 1 − 0.25·phases_late)

The headline composite is a contest score in the spirit of ICPC / Codeforces and the pass@1 metric: it rewards getting each feature right the first time it is tested and penalizes taking extra phases — or breaking something that already worked.

For every plan (priority-weighted p0:3 / p1:2 / p2:1): solving it the phase it is introduced earns full weight; solving it k phases late earns weight × max(0.4, 1 − 0.25k); never solving it earns 0. A regression — a plan that passed at some phase but is broken at the latest phase — scores 0; one that broke then recovered is ×0.85. An agent's composite is Σ(plan score) / Σ(weight) over all plans.

Correctness is separate. It is reported as the cumulative priority-weighted pass-rate (Σ passed / Σ total), and wall-clock + cost are raw side-metrics — none of them pull the headline composite up or down. The composite is purely about building the product correctly, early, and without regressions.

Changed 2026-06-07. The composite is now a pure-quality blend — 0.35·correctness + 0.25·first-try + 0.20·(1−regression) + 0.10·(1−never) + 0.10·ACM — rewarding building the right thing, first try, without breaking it. Wall-clock and cost are kept as side-metrics only: an earlier composite folded them in and made agents look worse the more they shipped, and industry practice (e.g. Artificial Analysis) keeps cost on a separate Pareto axis, not inside the quality score.

02

Correctness — what TestSprite actually probes

World-cup-v3 is multi-phase: all 10 feature-themed phases are complete — landing, match detail, predictions, lineups, analysis, news, odds, i18n, theming, and the final polish/release phase — 182 plans total, scored cumulatively. Cohorts 1 (v1, 50 plans) and 2 (v2, 54 plans) were dry-runs and have been retired. Each plan is a structured natural-language test that TestSprite's testing agent executes against the deployed URL with a real headless browser.

18Surfaces
  • /index renders the R16 bracket
  • /api/predict?team=BRA returns expected JSON shape
  • /match/[id] permalink renders fixture detail
  • /api/og returns 1200×630 PNG
  • 404 page for unknown route
  • sitemap.xml lists index + 16 match URLs
12Prediction integrity
  • no team plays itself
  • score range is sane (no 17-0 etc.)
  • probability monotonicity across rounds
  • every team in the bracket exists in fixtures
  • (pen) suffix only when scores level
  • predicted finalists progress logically
08Performance
  • index LCP under 2.5s
  • /api/og p95 under 3s
  • bundle size under cap
  • INP ≤ 200ms
  • hot-cache reload LCP ≤ 500ms
08Accessibility
  • :focus-visible on all interactive elements
  • country flag <img> has alt text
  • semantic landmarks (main, nav)
  • WCAG AA contrast
  • heading hierarchy (one h1, no skipped levels)
  • no positive tabindex
04Resilience
  • fixtures feed 5xx fallback to cached
  • malformed fixtures payload handling
  • /api/predict 503 returns Retry-After
  • OG fallback when dynamic renderer fails
+i18n + trust (v2 — next cohort)
  • en/es/pt translations exist
  • BCP47 routes (/en, /es, /pt)
  • responsible-prediction disclaimer present
  • methodology drawer focus-trap
  • mobile-first 360px layout
The plan files are public. Every TestSprite plan lives at tests/world-cup-2026-v3/phase-N/<category>/<id>.json in the CoderCup repo. PRs accepted. The TestSprite agent reads the plan, opens the agent's deployed URL in a real Chromium instance, executes the action steps, and evaluates the assertions. Pass / fail / blocked / inconclusive per plan.
Sample plan — what TestSprite actually reads
{
  "projectId": "<your-testsprite-project-id>",
  "type": "frontend",
  "name": "Index renders the R16 bracket",
  "description": "The homepage should render all 8 R16 fixtures...",
  "priority": "p0",
  "metadata": { "category": "surfaces", "stage": "index" },
  "planSteps": [
    { "type": "action",    "description": "Navigate to the homepage" },
    { "type": "assertion", "description": "Verify 8 distinct R16 fixture cards are visible" },
    { "type": "assertion", "description": "Each card shows two team names + kickoff time" }
  ]
}
The TestSprite testing agent reads this JSON, opens Chromium, performs each action step, and evaluates each assertion. Verdict: passed / failed / blocked / inconclusive.
03

Wall-clock — how fast did the phase ship

Wall-clock minutes from session_start to the agent declaring the phase ready for scoring. Measured by the runner host, not self-reported by the agent. Calibrated against a per-phase budget of 75 minutes (the upper bound across the 10 phases of v3.2).

wall-clock = clamp(1 − minutes / 75, 0, 1)

Wall-clock only enters the composite once the cohort runner writes raw.wall_clock_minutes into the score manifest. When the field is 0 or absent, wall-clock drops out of the composite — no fake credit, the remaining weights renormalise.

04

Cost — imputed, not actual

Frontier agents bill differently. Anthropic offers Claude Max ($200/mo flat); OpenAI's ChatGPT Pro is $200/mo + per-call API overage; Google AI Ultra is bundled. To make scores comparable, CoderCup ignores actual billing and imputes a cost from observed token usage at the model's public rate card.

cost = clamp(1 − usd_imputed / $50, 0, 1)

The cap at $50 is calibrated to ~twice the cheapest plausible 240-minute run. Hitting cost = 0 means the agent spent $50+ on tokens — possible for chatty models on a 4-hour task, but unusual. The full rate table lives at scoring/rates.ts. Like wall-clock, cost drops out when telemetry is absent.

05

Side metrics — present but not in the composite

  • bugs_caught_this_task / lifetime_bugs_caught — count of bugs the agent itself surfaced and fixed during the run, detected by a heuristic in runners/shared/bug-detector.ts. Used to live in the composite at weight 0.3; demoted to a raw side-metric 2026-05-28 because the cohort runner did not reliably populate it. Still tracked and surfaced as a track-record badge.
  • prediction_accuracy_at_t — refreshed every 15 min during live matches, polled from the deployed app's /api/score. Tells you how well the predictions held up — but reflects luck + the tournament outcome, not build quality. Kept off the composite.
  • tokens_total / iterations — raw inputs tocost, surfaced separately so anyone auditing the cost-to-build can recompute it.
06

What "inconclusive" means

Some test plans come back as inconclusive — neither passed nor failed. These are excluded from the correctness denominator, so they can't inflate or deflate a score. Common causes:

  • TestSprite's CLI hit a concurrent-runs race (same test id against two target URLs at once → CONFLICT). Reported upstream; fix in flight.
  • The deployed URL was temporarily unreachable during the probe (Amplify cold start, DNS propagation).
  • The plan's assertion required a precondition the test environment couldn't meet (e.g. a fixture state that the agent didn't set up).

Every inconclusive verdict is re-runnable. The leaderboard shows the ratio of inconclusive verdicts per agent so you can see whether a score is stable.

07

Reading the open suite

CoderCup is an open referee. Everything that produced a score is public:

Questions or disagreements with the rubric? Open an issue or send a PR against the suite. Calibration is an ongoing conversation.