Test suites · world-cup-2026-v3 · all 10 phases scored · 182 plans

What TestSprite probes, line by line.

Every score on the leaderboard derives from this suite. Each plan is a structured natural-language test that the TestSprite agent reads and then executes against the deployed URL in a real headless Chromium browser. Each plan resolves to pass, fail, or blocked. The full plan JSONs are on GitHub and open to pull requests.


Suite

182 plans total · 0 agents, phases 1-10 graded · 0 plan runs executed

Phase 1 · Landing page

Bracket UI, 12 group standings, FIFA-style hero

Routes · 5

Interaction-driven navigation: R16 cell clicks, Groups nav, team-row navigation, 404 recovery, Final cell. Each plan walks a click sequence and asserts that the destination URL is distinct from the starting URL.

Data · 4

UI-rendered cardinality checks: bracket stages reach distinct pages, group rows differ, three R16 clicks yield three pages, kickoff + venue visible on match detail.

Match Detail · 3

Content assertions on match permalink pages — stage label on QF, group letter on group matches, five permalinks load.

Accessibility · 2

Keyboard Tab focus visibility and non-empty alt text on flag images.

Visual · 1

Hero treatment in the top viewport and bracket section visible below.

SEO · 1

A match URL listed in sitemap.xml renders real match content when visited.

d6e7f8a9 · A match URL listed in sitemap.xml renders real match content when visited · P1

Phase 2 · Match details

78 /match/<id> SSR permalinks with teams, flags, kickoff, venue, round

Permalinks · 5

Each /match/<id> SSR route returns 200 with team names, kickoff, venue, round, and a back-to-bracket link rendered into the initial HTML.

Details data · 4

Match-page payload checks: both team names, both flag images with alt, kickoff timestamp, stage badge — all rendered server-side.

SEO · 3

Sitemap completeness (≥80 URLs spanning index, groups, matches), per-match og:title containing both team names, and canonical link tag matching the URL.
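The sitemap-completeness check can be sketched in Python as below. This is a minimal sketch, not the suite's actual implementation: the function names, the `/group` path fragment, and the `example.com` host are assumptions made for illustration. Only the 80-URL floor and the `/match/<id>` route come from the text above.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text):
    """Return every <loc> value found in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", SITEMAP_NS) if loc.text]


def sitemap_is_complete(xml_text, min_urls=80):
    """True when the sitemap has enough URLs and spans index, groups, and matches."""
    urls = sitemap_urls(xml_text)
    paths = [urlsplit(u).path for u in urls]
    has_index = any(p in ("", "/") for p in paths)
    has_groups = any("/group" in p for p in paths)  # assumed route naming
    has_matches = any(p.startswith("/match/") for p in paths)
    return len(urls) >= min_urls and has_index and has_groups and has_matches
```

A real run would fetch `sitemap.xml` from the deployed URL first; the sketch takes the XML text as input so it stays independent of any HTTP client.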

Security · 2

Response-header hygiene: Content-Security-Policy and either X-Frame-Options or frame-ancestors directive on every match page.
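One way to express this header rule is the small Python predicate below. It is a sketch under the assumption that the response headers are already available as a plain dictionary; the function name is hypothetical.

```python
def passes_header_hygiene(headers):
    """True if a CSP is present and framing is restricted by either mechanism."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}

    csp = lowered.get("content-security-policy", "")
    if not csp.strip():
        return False

    has_x_frame_options = bool(lowered.get("x-frame-options", "").strip())

    # A CSP is a semicolon-separated list of directives; the first token
    # of each directive is its name.
    directive_names = {part.split()[0].lower() for part in csp.split(";") if part.strip()}

    return has_x_frame_options or "frame-ancestors" in directive_names
```

For example, `{"Content-Security-Policy": "default-src 'self'"}` alone would fail, because neither `X-Frame-Options` nor a `frame-ancestors` directive is present.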

Errors · 2

Unknown match id returns HTTP 404 with a rendered error page; malformed slug input lands on a graceful error page (not a stack trace).

Phase 3 · Predictions

Per-match winner + scoreline + probability bars + reasoning; KO tie resolution; champion locked at SIGSTART

Prediction shape · 5

The Prediction block is the phase deliverable: a match page renders it with a named winner (or draw), a concrete scoreline, probability bars with percentages, and probabilities that sum to approximately 1.0.

Invariants · 6

The math is non-negotiable: no team is predicted to play itself, scorelines stay within a sane 0–9 range, knockout predictions never declare a draw, tied knockouts resolve via extra-time/penalties, group stages may draw, and every prediction carries a reasoning paragraph.
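The invariants above can be sketched as a single validation function. The `Prediction` record and its field names are hypothetical, since the page does not show the real data model; the rules themselves are taken from the paragraph above.

```python
from dataclasses import dataclass
from typing import List, Optional

KNOCKOUT_TIEBREAKS = {"extra-time", "penalties"}


@dataclass
class Prediction:
    # Hypothetical shape, for illustration only.
    home: str
    away: str
    home_goals: int
    away_goals: int
    is_knockout: bool
    winner: Optional[str]             # None means a predicted draw
    reasoning: str
    decided_by: Optional[str] = None  # "extra-time" or "penalties" for tied knockouts


def invariant_violations(p: Prediction) -> List[str]:
    """Return a human-readable list of broken invariants (empty means valid)."""
    problems = []
    if p.home == p.away:
        problems.append("team plays itself")
    if not all(0 <= goals <= 9 for goals in (p.home_goals, p.away_goals)):
        problems.append("scoreline outside 0-9")
    if p.is_knockout and p.winner is None:
        problems.append("knockout predicted as a draw")
    if p.is_knockout and p.home_goals == p.away_goals and p.decided_by not in KNOCKOUT_TIEBREAKS:
        problems.append("tied knockout lacks extra-time/penalties resolution")
    if not p.reasoning.strip():
        problems.append("missing reasoning")
    return problems
```

A group-stage draw passes cleanly here, while a knockout match with a level scoreline must name both a winner and a tiebreak method.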

Champion lock · 3

The champion is locked at SIGSTART: the Final match page surfaces the predicted champion, the pick is one of the 48 FIFA 2026 teams, and the champion is reachable from the landing page.

Bracket progression · 3

Predictions are internally consistent across rounds: the bracket shows the expected cell count per knockout round, predicted finalists derive from predicted semi-final winners, and a top-probability R16 team also appears in the QF predictions.

Visible · 3

Predictions are surfaced, not buried: the block sits above the fold on desktop, bracket match cards surface the predicted winner, and the block renders cleanly at a 360px viewport.

Phase 4 · Lineups

Lineups tab: predicted XI (11 per team), formation label + pitch diagram, per-player injury/suspension notes

Tab present · 3

The Lineups tab is the phase deliverable: visible on /match pages and opening a populated panel, reachable and activatable by keyboard, and present in the document when reached by scrolling.

Starting XI · 4

Predicted XI is real structured data: exactly 11 players per team (home and away), every player a non-empty name with a valid position, and the rendered list is distinct named rows rather than a text blob.

Formation · 3

Formation label matches a valid pattern whose digits sum to ten, both teams carry a visible formation label, and a pitch diagram renders at least twenty-two position markers.
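The formation rule reduces to a pattern match plus a digit sum, as in this Python sketch. The exact pattern the suite accepts is not stated, so the hyphen-separated, three-to-five-line shape below is an assumption.

```python
import re

# Assumed label shape: single digits separated by hyphens, three to five lines
# of outfield players (for example "4-3-3" or "4-2-3-1").
FORMATION_RE = re.compile(r"[0-9](-[0-9]){2,4}")


def is_valid_formation(label):
    """True when the label matches the pattern and its digits sum to ten."""
    if not FORMATION_RE.fullmatch(label):
        return False
    # Ten outfield players; the goalkeeper is not part of the label.
    return sum(int(part) for part in label.split("-")) == 10
```

`is_valid_formation("4-2-3-1")` holds, while `"4-4-3"` fails the sum and `"433"` fails the pattern.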

Injuries · 2

At least one player status note is surfaced on the match lineups, and every injury entry uses a severity drawn from a fixed standard vocabulary.

Head check · 2

Cross-team consistency: no single player appears in both teams' starting lineups, and starting XI cardinality is exactly eleven per team across a sample of matches.
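A cross-team check of this kind might look like the following sketch. It compares players by name, which assumes names are unique within a match; the function name is hypothetical.

```python
def lineup_problems(home_xi, away_xi):
    """Return a list of problems with two starting lineups (empty means valid)."""
    problems = []
    for side, xi in (("home", home_xi), ("away", away_xi)):
        if len(xi) != 11:
            problems.append(f"{side} XI has {len(xi)} players, expected 11")
    # Assumption: a player is identified by name alone.
    shared = set(home_xi) & set(away_xi)
    if shared:
        problems.append("players in both lineups: " + ", ".join(sorted(shared)))
    return problems
```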

Responsive · 2

At a 360px viewport the lineups render without horizontal scroll and the formation pitch diagram stays legible inside the viewport.

Phase 5 · Your analysis

Analysis tab: 3-5 paragraphs per match with inline citations resolving to a References panel; no boilerplate; 200-600 chars/paragraph

Tab present · 2

The Analysis tab is visible on a match detail page and opens a panel with real content when clicked.

Paragraph count · 3

The analysis panel contains at least three and no more than five paragraphs, and two different match pages both show analysis content.

Paragraph length · 3

Each analysis paragraph is between 200 and 600 characters, and the text reads as substantive and match-specific rather than generic boilerplate.
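The count and length bounds for this phase can be checked mechanically, as in the sketch below. The function name is hypothetical and the input is assumed to be a list of paragraph strings already extracted from the panel. Whether text reads as boilerplate is a judgment call that this sketch does not attempt.

```python
def analysis_length_problems(paragraphs, min_paragraphs=3, max_paragraphs=5,
                             min_chars=200, max_chars=600):
    """Return a list of bound violations for an analysis panel."""
    problems = []
    if not min_paragraphs <= len(paragraphs) <= max_paragraphs:
        problems.append(f"found {len(paragraphs)} paragraphs")
    for index, text in enumerate(paragraphs, start=1):
        length = len(text.strip())  # ignore leading and trailing whitespace
        if not min_chars <= length <= max_chars:
            problems.append(f"paragraph {index} has {length} characters")
    return problems
```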

Citations · 2

Analysis paragraphs carry inline citation markers, and a References panel or section follows the analysis text.

Anchors resolve · 3

Clicking a citation marker scrolls to or highlights its reference, reference entries display a source URL, and the reference count matches the unique inline citations.
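The count comparison in the last clause can be sketched as follows. The marker syntax is not specified on this page, so the bracketed-number form `[1]` is an assumption, as are the function names.

```python
import re

# Assumed marker syntax: a number in square brackets, such as [1].
CITATION_RE = re.compile(r"\[(\d+)\]")


def unique_citations(text):
    """Return the set of distinct citation numbers used in the text."""
    return {int(number) for number in CITATION_RE.findall(text)}


def references_match(text, references):
    """True when the reference list is as long as the set of unique markers."""
    return len(unique_citations(text)) == len(references)
```

Repeated markers count once, so a paragraph citing `[1]` three times still needs only a single matching reference entry.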

URL liveness · 3

The first reference URL is a reachable web address, all visible reference URLs use HTTPS, and reference source links are clickable anchor tags.

Freshness · 2

Analysis text references 2026 World Cup context and contains no placeholder dates or lorem ipsum.

Uniqueness · 2

Two different match pages show different analysis text, and the analysis mentions the specific teams playing in the match.

No paywall · 2

Analysis content is fully visible without requiring login or signup, and the panel loads without intrusive external popups or overlays.

Phase 6 · Related news

News section: ≥3 fresh items per match with title, source, date, and HEAD-checked URL; source diversity across domains

Section present · 3

A Related news tab or section is visible on a match detail page, opens a populated panel on click, and shows at least three news cards.

Card structure · 3

Each news card has a readable article title, shows a source name and publication date, and is a clickable link to the source article.

Freshness · 3

News cards show dates within the last seven days, do not reference only the 2022 tournament, and contain no placeholder or fake article titles.
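The seven-day window can be written as a small date comparison. This sketch assumes the card date has already been parsed into a `datetime.date`; rejecting future dates is an added assumption, not something the page states.

```python
from datetime import date, timedelta


def is_fresh(published, today, max_age_days=7):
    """True when the publication date is within the last max_age_days days."""
    age = today - published
    # Assumption: a date in the future is treated as not fresh.
    return timedelta(0) <= age <= timedelta(days=max_age_days)
```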

Head check · 2

News source URLs use HTTPS and point to real recognizable domains.

Source diversity · 2

News cards come from at least two different source domains, and different match pages show different news article sets.
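Source diversity comes down to counting distinct hosts, roughly as sketched here. The comparison is by hostname with a leading `www.` removed, which is a simplifying assumption: subdomains of one site would count as separate sources. The URLs in the usage note are placeholders.

```python
from urllib.parse import urlsplit


def source_domains(urls):
    """Return the set of distinct hostnames, ignoring a leading 'www.'."""
    domains = set()
    for url in urls:
        host = urlsplit(url).hostname  # already lower-cased by urlsplit
        if not host:
            continue
        if host.startswith("www."):
            host = host[len("www."):]
        domains.add(host)
    return domains


def has_source_diversity(urls, min_domains=2):
    """True when the cards link to at least min_domains different hosts."""
    return len(source_domains(urls)) >= min_domains
```

For instance, `https://www.example.com/a` and `https://example.com/b` collapse to one source, so a third card from another host is needed to pass.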

Coverage · 3

News card titles mention the match teams or relevant football terms, the panel loads without visible error states, and cards are accessible without excessive scrolling.

Phase 7 · Betting odds

Odds tab: implied probabilities + consensus row + agent-vs-market consistency

Accessibility · 3

3 plans: Probability bars carry text labels, not color alone; Betting Odds tab is reachable and activatable by keyboard; Responsible-gambling / 18+ disclaimer is present on the Odds surface

Agent Consistency · 2

2 plans: Agent implied probability on the Odds tab matches the phase-3 prediction; Lowest-odds favorite aligns with the agent's predicted winner

Math Consensus · 3

3 plans: Market consensus probabilities sum to 1.0; Consensus equals the arithmetic mean of de-vigged book probabilities; Consensus is computed on probabilities, not by averaging decimal odds

Math Devig · 4

4 plans: Each book's de-vigged probabilities sum to 1.0; Raw implied probabilities sum above 1.0 (the vig is real); De-vigged probability equals raw implied normalized by the overround; …
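The de-vig and consensus arithmetic named in these plans can be sketched in a few lines of Python. Function names and the sample odds are illustrative assumptions; the formulas follow the plan titles: raw implied probability is the reciprocal of decimal odds, de-vigging divides by the overround, and consensus is the mean of de-vigged probabilities per outcome.

```python
def raw_implied(decimal_odds):
    """Raw implied probabilities: the reciprocal of each decimal price."""
    return [1.0 / price for price in decimal_odds]


def devig(decimal_odds):
    """Normalise raw implied probabilities by the overround so they sum to 1."""
    raw = raw_implied(decimal_odds)
    overround = sum(raw)  # above 1.0 whenever the book carries a margin
    return [p / overround for p in raw]


def consensus(books):
    """Arithmetic mean of de-vigged probabilities, outcome by outcome.

    The average is taken over probabilities, not over decimal odds.
    """
    per_book = [devig(odds) for odds in books]
    return [sum(column) / len(per_book) for column in zip(*per_book)]
```

With hypothetical home/draw/away prices of 2.0, 3.5, and 4.0, the raw implied probabilities are 0.5, about 0.2857, and 0.25, which sum to roughly 1.0357. Dividing each by that overround yields probabilities that sum to 1.0.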

Staleness Badge · 2

2 plans: Staleness warning appears when the oldest book is over six hours old; No staleness warning on a match whose odds are fresh
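The staleness rule depends only on the oldest book timestamp, as this sketch shows. The function name and the timezone-aware `datetime` inputs are assumptions.

```python
from datetime import datetime, timedelta, timezone


def needs_staleness_badge(book_timestamps, now, threshold=timedelta(hours=6)):
    """True when the oldest book quote is more than the threshold old."""
    oldest = min(book_timestamps)
    return now - oldest > threshold
```

Because the comparison is strict, a quote that is exactly six hours old does not trigger the warning in this sketch.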

Three Books · 2

2 plans: Every match offers at least three bookmaker lines; Bookmakers are real named sportsbooks, not invented placeholders

UI Rendered · 3

3 plans: Odds tab renders a per-book table with concrete decimal values; Consensus row and agent implied row both render with percentages; Probability bars render and their widths track the consensus values

Phase 8 · Multi-language i18n

en/es/pt locale routes, persisted switcher, real translation, no placeholder data

Data Authenticity · 3

3 plans: No mock, TBD, or placeholder data anywhere in the rendered product; Group standings are projected from predictions, not all-zero placeholders; Match predictions are real and varied and a real champion is named

Date Localization · 2

2 plans: Match kickoff time renders with locale-specific month names; Portuguese date formatting is distinct from English

HTML Lang · 2

2 plans: html lang attribute matches the selected locale; Head lists hreflang alternates for all three locales

Locale Switcher · 3

3 plans: Locale switcher is present in the nav and offers all three languages; Switching language preserves the current path; Selected locale persists across in-app navigation

No Leakage · 3

3 plans: Spanish page chrome reads in Spanish, not English; Portuguese match page translates data labels, not just nav; No hardcoded English UI strings leak onto a Spanish page

Number Localization · 2

2 plans: Decimal numbers use locale-correct separators; Portuguese decimals use a comma separator

Regression · 2

2 plans: Prior-phase features remain functional in a non-default locale; Missing translations fall back gracefully without blank or broken text

Routes Exist · 3

3 plans: All three locale root routes return 200; Match detail pages resolve under es and pt locale prefixes; Group standings page exists under all three locales

String Coverage · 4

4 plans: All three locale translation files exist and are non-empty; Locale files share an identical key shape; Translation values are real text, not key-name placeholders; …
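The "identical key shape" plan can be illustrated by flattening each locale file into dotted key paths and comparing the sets. This sketch assumes the translation files are nested JSON objects already loaded into dictionaries; the function names and sample keys are hypothetical.

```python
def key_paths(tree, prefix=""):
    """Flatten a nested dict into a set of dotted key paths."""
    paths = set()
    for key, value in tree.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            paths |= key_paths(value, path)
        else:
            paths.add(path)
    return paths


def same_key_shape(locales):
    """True when every locale exposes exactly the same key paths."""
    shapes = [key_paths(tree) for tree in locales.values()]
    return all(shape == shapes[0] for shape in shapes)


def missing_keys(locales, reference="en"):
    """Map each locale to the key paths it lacks relative to the reference."""
    expected = key_paths(locales[reference])
    return {name: expected - key_paths(tree) for name, tree in locales.items()}
```

`missing_keys` is a convenience for reporting: when the shapes differ, it names the paths that a locale is missing instead of returning a bare failure.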

Phase 9 · Matchday polish

Design-token skin + light/dark + completeness, graceful 404, a11y, perf

DOM · 2

2 plans: All images have non-empty alt text and the document lang is set; Interactive elements show a visible focus ring on keyboard focus

Matchday · 1

1 plan: The product is branded "Matchday" with its favicon

d95902e3 · The product is branded "Matchday" with its favicon · P1

Both Modes · 2

2 plans: Landing bracket and group standings render correctly in both light and dark; Match page prediction, lineup and odds render correctly in both light and dark

404 HTML · 1

1 plan: Unknown match route returns HTTP 404 with a styled page

9a1e10fa · Unknown match route returns HTTP 404 with a styled page · P1

404 JSON · 1

1 plan: Unknown API match returns 404 JSON with error and code

d0648268 · Unknown API match returns 404 JSON with error and code · P2

DOM · 1

1 plan: Images declare dimensions to avoid layout shift

308430d3 · Images declare dimensions to avoid layout shift · P2

Contrast · 2

2 plans: Light mode body and heading text meet WCAG AA contrast; Dark mode body and heading text meet WCAG AA contrast
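The AA check rests on the WCAG contrast-ratio formula, sketched below for six-digit hex colors. The function names are hypothetical. The luminance coefficients, the 0.03928 linearisation cutoff, and the 4.5:1 (normal text) and 3:1 (large text) thresholds follow the WCAG 2.x definitions.

```python
def _linear_channel(value_8bit):
    """Convert one 8-bit sRGB channel to its linear-light value."""
    c = value_8bit / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_color):
    """WCAG relative luminance of a '#rrggbb' color."""
    h = hex_color.lstrip("#")
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _linear_channel(r) + 0.7152 * _linear_channel(g) + 0.0722 * _linear_channel(b)


def contrast_ratio(foreground, background):
    """Contrast ratio between two colors, from 1.0 up to 21.0."""
    l1 = relative_luminance(foreground)
    l2 = relative_luminance(background)
    lighter, darker = max(l1, l2), min(l1, l2)
    return (lighter + 0.05) / (darker + 0.05)


def meets_aa(foreground, background, large_text=False):
    """True when the pair meets WCAG AA (4.5:1, or 3:1 for large text)."""
    return contrast_ratio(foreground, background) >= (3.0 if large_text else 4.5)
```

Black on white gives the maximum ratio of 21:1. A mid-gray such as `#777777` on white lands just under 4.5:1 and would fail for body text, while passing the 3:1 bar for large headings.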

System Pref · 2

2 plans: With prefers-color-scheme dark, the site renders in dark mode on first paint; With prefers-color-scheme light, the site renders in light mode

Toggle Persists · 2

2 plans: The nav theme toggle visibly flips between light and dark; Theme choice persists across navigation and a hard reload via localStorage

Token Adoption · 2

2 plans: Surface, border and text colors come from the provided token palette; Light and dark are two real themes, not a CSS invert

Phase 10 · Final polish · release

Branded hero image + AA contrast, cross-surface consistency (champion/scoreline/groups/prose), no-TBD authenticity, editorial bracket, light-default theme, prior-phase regression

Hero Branded · 2

2 plans: Landing shows a real photographic/generated hero image, not a flat fill or SVG; The hero title "FIFA World Cup 2026 Predictions (AI)" is real DOM text over the image

Hero Contrast · 1

1 plan: Hero title meets WCAG AA contrast over the image in light and dark

247ca6fb · Hero title meets WCAG AA contrast over the image in light and dark · P1

Champion · 2

2 plans: Predicted champion is identical on the Final page, the bracket, and the landing; The predicted champion is a real qualified nation, not a placeholder

Scoreline · 1

1 plan: Each predicted winner is the higher-scored side (no winner/scoreline contradiction)

36847d06 · Each predicted winner is the higher-scored side (no winner/scoreline contradiction) · P1

Groups · 1

1 plan: Each group’s top-2 standings match the teams advancing in the bracket

75e63de9 · Each group’s top-2 standings match the teams advancing in the bracket · P1

Prose · 1

1 plan: No prose/reasoning sentence contradicts the structured prediction

71144427 · No prose/reasoning sentence contradicts the structured prediction · P0

No TBD · 2

2 plans: No TBD/placeholder data on the landing or a match page; No placeholder/untranslated data in the es and pt locales

Bracket · 1

1 plan: Knockout bracket matches the editorial design (flag chips, rank badges, green winner, champion panel)

10ee0b7f · Knockout bracket matches the editorial design (flag chips, rank badges, green winner, champion panel) · P1

Theme · 1

1 plan: Light theme is the default on first load

4cff15dc · Light theme is the default on first load · P1

Regression · 4

4 plans: landing still SSRs real team names; /match still shows the prediction block; the odds tab still renders; en/es/pt locale routes still resolve


The v3.2 spec ships 10 feature-themed phases: landing (phase 1) → match details (phase 2) → predictions (phase 3) → lineups (phase 4) → analysis (phase 5) → related news (phase 6) → betting odds (phase 7) → multi-language (phase 8) → light/dark polish (phase 9) → final polish (phase 10). All 10 are scored — 182 plans total, re-run cumulatively against the deployed app at every later phase. Cohorts 1 (v1) and 2 (v2) have been retired as dry-runs.
