World Cup 2026 · Ratings Investigation

What the power ratings actually say
(and what they only pretend to)

A two-lens statistical read of the 48-team field, stress-tested through an owl, a potato, and an owl until only the true claims were left standing.

/parallel-skill-orchestrator + statistical-analysis + exploratory-data-analysis, then /owl/potato/owl, then /output-html

Primary data: Jeff Fogle market power ratings (48 teams). Cross-checks: Elo (35/48), Polymarket (9/48). Generated June 2026.

0

Bottom line, after the audit

The headline survived a beating, smaller and truer. Two strong pools (UEFA and CONMEBOL, dead even at 1.97) sit a full tier above everyone else; the field is left-skewed because the 48-team format swallowed a tail of minnows; and two independently built rating systems (Fogle and Elo) agree on order at ρ = 0.91. That much is solid. The adversarial pass then killed one headline outright (a "Fogle loves African teams" claim that turned out to be a rank-scale bug), downgraded three more from insight to texture, and left a short list of claims you can actually lean on.
48 · teams, 12 groups of 4
1.37 · mean rating (median 1.60)
−0.72 · skew (weak-end tail)
0.91 · Fogle vs Elo rank agreement
4 · teams at championship grade (≥2.5)
HIGH: UEFA ≈ CONMEBOL dominance; a left-skewed field with Curacao the lone true outlier; strong Fogle–Elo order agreement. Survives intact.
MEDIUM: The most useful real insight is structural: the "group of death" is about variance, not mean. Group F is a balanced grinder, Group I is top-heavy. The 12 group means are statistically indistinguishable by seeding design (ANOVA p = 0.999).
LOW (flagged): One market-vs-model gap survives correction: USA host optimism (Fogle ranks them 13 spots above Elo). Argentina is the contender the two systems most disagree on (Elo #2, Fogle #5).
CANNOT SAY: Predictive accuracy (no outcomes), attack vs defense (one number), and everything the betting principles call foundational (xG, travel, venue, lineups). Use this as a coarse prior, never as the bet.
1

The data: one number, dressed in fourteen columns

The spreadsheet's "ratings" are Jeff Fogle's market-consensus goal-supremacy ratings: higher means stronger, the scale runs from Curacao at −1.1 to Spain and France at 2.8, and 2.5 is roughly championship caliber. Crucially, it is a goal scale: the spread between two teams' ratings is their predicted goal margin.
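Because the scale is in goals, a matchup reduces to one subtraction. A minimal sketch; `predicted_margin` is a hypothetical helper name, and it ignores venue, draws and any nonlinearity in the tails (the audit flags tail extrapolation as untested).

```python
def predicted_margin(rating_a: float, rating_b: float) -> float:
    """Expected goal margin of team A over team B on a goal-supremacy scale.

    Hypothetical helper: positive favours team A. Rounded to 2 dp because the
    inputs sit on a 0.1 grid and float subtraction leaves noise digits.
    """
    return round(rating_a - rating_b, 2)


# Ratings quoted in the text.
print(predicted_margin(2.8, -1.1))  # 3.9 (France over Curacao)
```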

Read the provenance before you trust a digit. This is a single Substack snapshot (May 31 2026), neither a model output nor a measurement. It is rounded to 0.1, so 36 of 48 teams share a value with at least one other team. The source even contradicts itself on four teams (Czech Republic, Iran, Cape Verde and Haiti differ between its main list and its group section). The Confederation and Host columns and the "Congo → Congo DR" fix were added by the analyst; they are not in the source. Eight of the fourteen columns are arithmetic functions of the one Fogle number, so the sheet holds far less independent information than its width suggests.
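The tie count can be reproduced from the rating column alone. A sketch under one assumption: values are compared on the source's 0.1 grid. The toy list is illustrative only, not the real 48 ratings.

```python
from collections import Counter


def count_tied(ratings):
    """Count teams whose rating is shared with at least one other team."""
    # Round to the source's 0.1 grid so float noise cannot split a tie.
    counts = Counter(round(r, 1) for r in ratings)
    return sum(c for c in counts.values() if c > 1)


# Toy values for illustration only (not the real 48-team column).
print(count_tied([2.8, 2.8, 1.8, 1.8, 1.5, -1.1]))  # 4
```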
48 / 48 · Fogle rating complete
35 / 48 · Elo present
9 / 48 · Polymarket present
~4 / 48 · xG present (the "foundation")

Explore all 48 teams

The Lean column compares each team's rank under Fogle with its rank under Elo, computed inside the 35-team set that has both, so the comparison is apples to apples (see the audit for why that matters).

Tiers: Elite (≥2.3), Strong (1.9–2.2), Solid (1.4–1.8), Fringe (0.7–1.3), Minnow (<0.7). Hosts are marked separately. Elo rank is a global world rank; Lean re-ranks within the shared 35.
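On the 0.1 grid the tier bands above are contiguous, so they reduce to a chain of lower thresholds. A minimal sketch; `tier` is a hypothetical function name, and the rounding step is an assumption added to guard against float noise in a CSV read.

```python
def tier(rating: float) -> str:
    """Map a Fogle rating onto the report's five tiers (thresholds from the legend)."""
    r = round(rating, 1)  # snap to the source's 0.1 grid before comparing
    if r >= 2.3:
        return "Elite"
    if r >= 1.9:
        return "Strong"
    if r >= 1.4:
        return "Solid"
    if r >= 0.7:
        return "Fringe"
    return "Minnow"


# France, a host at 1.8, Canada, Curacao (ratings quoted in the text).
print([tier(r) for r in (2.8, 1.8, 1.5, -1.1)])  # ['Elite', 'Solid', 'Solid', 'Minnow']
```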

2

Statistical analysis

Descriptive shape, normality, outliers, correlation, and group tests on the 48-team cross-section.

Shape and tiers

Mean 1.367, median 1.60, SD 0.835, range 3.9. The mean sitting below the median signals moderate left skew (−0.72): a dense cluster of decent teams with a thin tail of weak ones dragging the average down. Natural tiers fall out of the 0.1 grid, though heavy ties make the tier lines softer than they look.
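The shape summary comes from a handful of moment calculations. A sketch, assuming sample SD (n − 1) and the plain moment skewness g1; the report does not say which skewness estimator produced −0.72, and the bias-adjusted version differs slightly at n = 48.

```python
import statistics


def describe(ratings):
    """Mean, median, sample SD, range and moment skewness (g1) of a rating list."""
    n = len(ratings)
    mean = statistics.fmean(ratings)
    m2 = sum((x - mean) ** 2 for x in ratings) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in ratings) / n  # third central moment
    return {
        "mean": mean,
        "median": statistics.median(ratings),
        "sd": statistics.stdev(ratings),  # sample SD, n - 1 denominator
        "range": max(ratings) - min(ratings),
        "skew": m3 / m2 ** 1.5,  # negative = long tail on the weak side
    }
```

A left tail shows up twice: the skew goes negative and the mean falls below the median, which is exactly the pairing the paragraph above reads off.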

Fogle rating distribution
Distribution of Fogle ratings: single peak near 1.6–1.7, long only on the weak side. Curacao is the lone outlier.
Histogram, QQ and box
Histogram, normal QQ, and boxplot. The points hug the QQ line in the middle and bend at the tails.

Normality (handle with care)

Shapiro–Wilk p = 0.062, D'Agostino p = 0.067 and Jarque–Bera p = 0.114 all fail to reject normality; Anderson–Darling rejects (A² = 0.79 > 0.73 critical value). Verdict: approximately normal with a real mild left tail. Honest caveat surfaced by the audit: running normality tests on a variable rounded to 0.1 with 36 ties is close to decorative. The descriptive skew is the real takeaway; the p-values are not.
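To see what one of these p-values is built from, here is a sketch of the Jarque–Bera statistic using biased sample skewness and excess kurtosis. It uses the fact that a chi-square with 2 degrees of freedom has survival function exp(−x/2). It is an illustration of the test, not the report's code, and it does nothing to fix the tie problem described above.

```python
import math


def jarque_bera(xs):
    """Jarque-Bera statistic and p-value (chi-square with 2 df)."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    skew = m3 / m2 ** 1.5
    excess_kurt = m4 / m2 ** 2 - 3.0
    jb = n / 6.0 * (skew ** 2 + excess_kurt ** 2 / 4.0)
    # Survival function of chi-square with 2 df is exp(-x / 2).
    return jb, math.exp(-jb / 2.0)
```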

Outliers

Only one team clears the 1.5×IQR fence: Curacao (−1.1), which also has the most extreme z-score (−2.95). Every flagged tail is at the weak end. The elite at 2.8 sit comfortably inside the fence: the ceiling is compressed and the floor runs long.
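A sketch of the Tukey fence plus z-score check. The quartile method is an assumption: `method="inclusive"` (linear interpolation) is used here, the report does not state its choice, and on a heavily tied grid the method can shift the fence slightly.

```python
import statistics


def tukey_outliers(ratings, k=1.5):
    """Return (value, z) for points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(ratings, n=4, method="inclusive")
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    mean, sd = statistics.fmean(ratings), statistics.stdev(ratings)
    return [(x, (x - mean) / sd) for x in ratings if x < lo or x > hi]


# The reported z for Curacao follows from the summary stats above.
print(round((-1.1 - 1.367) / 0.835, 2))  # -2.95
```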

Correlation, and which ones are real

  • Fogle vs Elo: r = 0.887, Spearman 0.915 (n=35). Genuine cross-system agreement, and robust (it survives dropping the eight most extreme teams). This is the single most reassuring number in the file.
  • Fogle vs Polymarket: r = 0.77 (n=9). Favorites only, directional at best.
  • Fogle vs Avg group spread: r = 0.976. Tautology: Avg spread is computed from Fogle, so this is arithmetic agreeing with itself.
  • Fogle vs Group difficulty: r = −0.69. A seeding artifact: strong teams do not play themselves.
Correlation heatmap
Correlation heatmap (available-case). The strong off-diagonal blocks include several mechanical, Fogle-derived relationships; the only external validators are Elo and Polymarket.
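Both coefficients can be computed with the standard library. A sketch: Spearman is Pearson applied to ranks, with tied values sharing their average rank (which matters on a 0.1 grid). It assumes the caller passes only the teams that have both values (available-case) and uses Elo rating points, where higher is better; correlating against an Elo world rank would flip the sign.

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rho as Pearson on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```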

Confederations differ; the test that proves it; the test that does not

Mean rating by confederation: UEFA 1.97 and CONMEBOL 1.97 are statistically tied at the top, well above CAF 1.06, AFC 0.81 and CONCACAF 0.62. Because variances differ wildly (Levene p = 0.001; CONCACAF spans Curacao to Mexico), the rank-based Kruskal–Wallis test is the safer choice: H = 24.6, p = 6e-05, a large effect (ε² = 0.49). Caveat: this is ecological; it describes pools, not individual teams.

Fogle by confederation
Ratings by confederation. UEFA and CONMEBOL boxes sit highest and tightest; CONCACAF is the widest, and the hosts are the strong end of a weak pool.
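A sketch of the Kruskal–Wallis H statistic with the standard tie correction, which the 0.1 grid makes relevant. It stops at H: the p-value would come from a chi-square with k − 1 degrees of freedom (for example via `scipy.stats.kruskal`), and the ε² effect size is not reproduced here because the report does not state which formula it used.

```python
from collections import Counter


def kruskal_wallis(groups):
    """Kruskal-Wallis H for a list of samples, with the tie correction."""
    pooled = [x for g in groups for x in g]
    n = len(pooled)
    counts = Counter(pooled)
    # Average rank per distinct value: ties share the mean of their positions.
    rank_of, pos = {}, 1
    for value in sorted(counts):
        c = counts[value]
        rank_of[value] = pos + (c - 1) / 2
        pos += c
    rank_term = sum(sum(rank_of[x] for x in g) ** 2 / len(g) for g in groups)
    h = 12 / (n * (n + 1)) * rank_term - 3 * (n + 1)
    correction = 1 - sum(c ** 3 - c for c in counts.values()) / (n ** 3 - n)
    return h / correction
```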

The host test, and why it proves nothing

Hosts average 1.70 vs 1.34 for the rest of the field, and a Welch t-test gives p = 0.048. Do not trust it. With n = 3 and two hosts identical at 1.8, the host SD collapses to 0.17 and manufactures the "significance". Under a permutation null, a random three-team draw beats the host mean 26.6% of the time; Mann–Whitney gives p = 0.47; and Canada (1.5) is below the field median. There is no provable host effect here.
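With only 48 teams the permutation null can be enumerated exactly rather than sampled: there are 17,296 three-team subsets. A sketch under two assumptions the report does not confirm: the hosts stay in the pool, and a draw "beats" the hosts when its mean is at or above theirs. `host_permutation_share` is a hypothetical name.

```python
from itertools import combinations
from statistics import fmean


def host_permutation_share(ratings, host_ratings):
    """Share of all size-k subsets of the field whose mean is >= the hosts' mean."""
    observed = fmean(host_ratings)
    k = len(host_ratings)
    means = [fmean(c) for c in combinations(ratings, k)]
    # Small tolerance so float noise does not flip an exact tie.
    return sum(m >= observed - 1e-12 for m in means) / len(means)
```

Called with the full 48-value rating column and the host ratings `[1.8, 1.8, 1.5]`, this is the quantity the paragraph above reports; the exact share depends on the tie and pool conventions chosen.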

The "group of death," done properly

Across all 12 groups the means are statistically indistinguishable (ANOVA F = 0.16, p = 0.999). That is not boredom; it is the fingerprint of FIFA's pot system balancing every group. The difficulty lives in the variance (see the next section).
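The F statistic is the between-group mean square over the within-group mean square; with 12 groups of 4 the degrees of freedom are (11, 36). Pot seeding pulls every group mean toward the grand mean, which shrinks the numerator and pushes F well below 1. A minimal sketch (the p-value would come from the F distribution, for example `scipy.stats.f_oneway`):

```python
def anova_f(groups):
    """One-way ANOVA F: between-group mean square / within-group mean square."""
    all_x = [x for g in groups for x in g]
    n, k = len(all_x), len(groups)
    grand = sum(all_x) / n
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))
```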

3

Exploratory data analysis

Profiling, structure, and the two pictures that carry the most meaning: the Fogle–Elo scatter and the group-difficulty spread.

Cross-system check: where Fogle and Elo disagree (corrected)

The diagonal marks rough agreement: teams above the line are rated higher by Elo than by Fogle, teams below it the reverse. Both axes are ranked inside the 35-team set that has both, the only fair way to compare.

Fogle rank vs Elo rank (35 teams with both)

The divergences that actually survive

The biggest real gaps once the rank scales are matched. Green means Fogle is more bullish than Elo, blue means Elo is more bullish. USA is the cleanest signal (host optimism the market prices and Elo ignores).

Within-set rank gap, Fogle vs Elo
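The fix the audit forced is mechanical: drop teams missing either value, then rank both systems inside that shared set before subtracting. The bug came from subtracting a 1–48 Fogle rank from a global Elo world rank that runs to 81, which inflates the gap for any team far down the world list. A sketch; the names are hypothetical, ties use competition ranking ("1224") as an assumption, and the sign matches the report (negative = Fogle more bullish, as with USA at −13).

```python
def rank_desc(scores):
    """Competition ranks (1 = best) for a dict of team -> score, higher is better."""
    ordered = sorted(scores.values(), reverse=True)
    return {t: ordered.index(s) + 1 for t, s in scores.items()}


def lean(fogle, elo_world_rank):
    """Fogle rank minus Elo rank, both re-ranked inside the shared team set.

    Negative = Fogle more bullish than Elo; positive = Elo more bullish.
    """
    shared = [t for t in fogle if elo_world_rank.get(t) is not None]
    f_rank = rank_desc({t: fogle[t] for t in shared})
    # A world rank is lower-is-better, so negate it before ranking.
    e_rank = rank_desc({t: -elo_world_rank[t] for t in shared})
    return {t: f_rank[t] - e_rank[t] for t in shared}


# Toy data: team D has no Elo, and B's world rank (60) is far down the global list.
print(lean({"A": 2.0, "B": 1.5, "C": 1.0, "D": 0.5},
           {"A": 5, "B": 60, "C": 10, "D": None}))  # {'A': 0, 'B': -1, 'C': 1}
```

The naive version would give team B a gap of 2 − 60 = −58, a signal made purely of mismatched scales.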

Group difficulty: same average, different animal

Groups F and I tie on mean rating (1.675), but they are nothing alike. Group F (Netherlands, Japan, Sweden, Tunisia) is a balanced grinder: SD 0.37, no soft team. Group I (France, Norway, Senegal, Iraq) is top-heavy: SD 1.03, France strolls while three teams fight for second, and Iraq absorbs the hardest single schedule on the board.

Group difficulty sorted
Per-group difficulty, hardest to easiest. F and I top the chart on average, but averages hide the F-vs-I difference in shape entirely.
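A sketch of the per-group profile that separates "same mean" from "same shape". Two assumptions not stated in the report: SD is the sample SD, and a team's schedule difficulty is the mean rating of its three group opponents. `group_profile` is a hypothetical name, and the test values are toy numbers, not real group ratings.

```python
import statistics


def group_profile(group):
    """Return (mean, sample SD, {team: mean rating of its opponents}) for one group."""
    ratings = list(group.values())
    difficulty = {
        team: statistics.fmean(r for other, r in group.items() if other != team)
        for team in group
    }
    return statistics.fmean(ratings), statistics.stdev(ratings), difficulty
```

Two groups can return the same mean while their SDs differ several-fold; the weakest team in a top-heavy group also picks up the highest opponent mean, which is the "hardest single schedule" pattern described for Iraq.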

Completeness and missingness

Missingness is not random. Elo is absent for 13 teams, none in the top six: six upper-middle UEFA sides (Norway, Czech Republic, Turkey, Sweden, Scotland, Bosnia), four AFC sides and the three weakest minnows. Polymarket was captured only for nine favorites. Nothing was imputed.

Missingness map
Missingness map. The Elo and Poly gaps show as the two sparse columns; the Fogle column is complete.
4

The gauntlet: owl → potato → owl

The process Michael loves. First read the numbers like an owl. Then turn into the potato and attack with no mercy. Then return to the owl to weigh which blows landed and render the real verdict.

🦉 Owl, first pass · slow · observant · analytical
Think like an owl, slow, observant and analytical. Examine this from multiple perspectives and identify the hidden factors most people overlook.
  1. The rating is a goal engine, not a power score. Fogle is a "goal supremacy" scale and the spread column is literally one rating minus the other, so France 2.8 vs Curacao −1.1 is a falsifiable prediction of about a 3.9-goal margin. (Tail extrapolation untested; see Owl 2.)
  2. The left skew is the 48-team expansion made visible. Every outlier sits at the weak end because the field absorbed a tail of minnows. The shape is telling us about the format, not just the teams.
  3. The disagreements with Elo are systematic, and that is where the money is. Fogle looked far more bullish on African sides (Ghana, South Africa) and on the USA. (The CAF half was a measurement bug; see Owl 2.)
  4. The "group of death" is a variance illusion. F and I tie on mean, but F is a balanced grinder and I is top-heavy with France plus Iraq's brutal schedule. "Hardest group" splits into hardest-to-win vs hardest-to-escape.
  5. Groups being statistically identical is the seeding working, not noise. p = 0.999 across 12 groups is manufactured by the pot system, so most "group of death" talk is theater.
  6. The sheet has less independent information than it looks. Strip the Fogle-derived columns and you have one rating, one partial validator (Elo), and a nine-team market sliver.
  7. Polymarket looks far more top-heavy than Fogle. Fogle ties Spain and France at 2.8; the market splits them about 2 to 1. (Over-read an incomplete capture; see Owl 2.)
  8. Argentina is the divergence at the top nobody flags. Fogle has them 5th, Elo has them 2nd-highest, above France, England and Brazil. (Real, but not "the biggest"; see Owl 2.)
  9. The host bump is the cleanest market-vs-model example. Hosts sit above the field, Elo is unimpressed, n = 3 forbids a significance claim.
🥔 Potato · hostile · flaws only · no politeness
Become a hostile critic looking for flaws only. No politeness. Find what's wrong. This is for when I need tough feedback, not validation.

Spawned as an independent agent with no stake in the analysis, told to verify every attack against the actual files before making it.

  1. FATAL: The CAF thesis is a rank-scale artifact. "Ghana +43, South Africa +43" subtracts a within-field Fogle rank (1–48) from a global Elo world rank (running to 81). Re-rank both inside the shared 35 and Ghana's gap is −1, South Africa's −5. The real biggest Fogle tilt is the USA. The project's own EDA file had the correct numbers; the narrative used the broken ones.
  2. FATAL: The "9-team tournament" reading misreads an incomplete scrape. Polymarket quotes the whole field; the analyst captured nine favorites. Their 1.02 sum is where the download stopped, not evidence the other 39 have no title equity.
  3. SERIOUS: The host test should never have run. Welch p = 0.048 exists only because two hosts are identical at 1.8. A random three-team draw beats the host mean 26.6% of the time, and Canada (1.5) is below the field median, so "all three hosts above the field" is false.
  4. SERIOUS: Argentina is cherry-picked. By the same gap logic, Panama (+9) and South Korea (+5) are bigger Elo-over-Fogle divergences.
  5. SERIOUS: r = 0.89 is two reputation proxies agreeing. No backtest, no ground truth, no falsification criterion.
  6. SLOPPY: "Only 6 reach 2.5" is wrong (four do). Normality tests on a 0.1 grid with 36 ties are meaningless. Spain's Elo is 93 points above France's, so their Poly gap is not "pure path." The rename and the analyst-added columns are undisclosed, yet the CAF claim runs through a hand-typed column.
💣 Gut punch: the entire deliverable is one unaudited, self-contradicting, error-bar-free Substack number from a single day, rounded so a third of the field is tied, copied into eight columns and cross-examined against itself, with its marquee insight produced by subtracting a 48-team rank from an 81-team rank.
🦉 Owl, reconciliation · weigh the blows · render verdict
Back to the owl: weigh which harsh objections are real and render the final verdict.

Every potato hit was re-checked against the data before being accepted. Most landed.

Conceded, corrections now binding

  • FATAL: The CAF divergence was a unit-mismatch bug. Re-ranked within the shared 35, Ghana goes −43 → −1, South Africa −43 → −5. What survives is narrower and cleaner: the largest real Fogle-over-Elo tilt is USA (−13), then Ivory Coast and Tunisia (−6). Host optimism, not a continental verdict.
  • FATAL: The "9-team tournament" line is retracted as a capture artifact. What stands: Spain 33% vs France 17% is a real, if partial, market split.
  • Fix: the split is not "pure path". Spain's Elo is 93 points above France's, so two systems say Spain is simply better. Fogle's tie is the outlier.
  • Fix: no provable host effect. Canada is below the median and the Welch p is an artifact. The honest residue: the market prices a host tilt (USA is the #1 Fogle-over-Elo team), but n = 3 forbids proof.
  • Fix: Argentina reframed as not the largest divergence but the most decision-relevant one, because it sits among the contenders (Elo #2 vs Fogle #5).
  • Fix: four teams reach 2.5, not six. Normality testing on this discrete grid is decorative. Provenance (single source, four self-contradictions, analyst-added columns) travels with the numbers from here on.

Where the potato overreached

  • "Eight columns are deterministic" is not a rebuttal; it is Owl-1 point 6 restated hostilely. True, and well sharpened, but not a new fatal flaw.
  • The gut punch is rhetorically total but analytically partial: the potato itself conceded the Fogle–Elo agreement is robust, not a leverage trick. A market snapshot that tracks a results-based Elo at ρ = 0.91 is a reasonable coarse prior, low resolution, not worthless.

What is left standing

UEFA and CONMEBOL dominance, the left-skewed expansion field, strong Fogle–Elo order agreement, the variance-not-mean reading of the "group of death," and one actionable market gap (USA). The gauntlet earned its keep: it killed a headline that was a bug, downgraded three stories from insight to texture, and left a smaller, truer set of claims.

5

How to actually use these ratings

Tied back to the project's own betting principles.

  • Treat Fogle as a coarse prior, then override it. It is a one-dimensional, single-source, 0.1-grid snapshot. The betting principles say xG is the foundation and Elo beats reputation, so use Fogle to frame the matchup, then let xG and matchup work move the price.
  • The one survivable market gap is USA host optimism. Fogle rates the USA 13 rank-spots above Elo. That is the market pricing the host bump the principles predict. Fade or back it deliberately; do not assume it is free value.
  • Argentina is the contender to form your own view on. Elo treats them as a co-favorite, Fogle a clear notch below Spain and France. If Elo is the sharper instrument, that is where a futures look starts.
  • Ignore "group of death" averages; read the variance. Group F is where favorites drop points (no easy game), Group I is where the second qualifier is a coin flip behind France. Same mean, opposite betting texture.
  • Do not bet the rating itself. No outcomes back it, it cannot separate attack from defense, and it is silent on travel, venue, and lineups, the very edges 2026 rewards.
6

Method, sources, and files

The pipeline ran exactly as requested. One canonical cleaned dataset was built first (single source of truth, reconciled to control totals: 48 teams, 12 groups of 4, six confederations, three hosts). Two skill agents then analyzed it in parallel and blind to each other via the parallel-skill-orchestrator: one applying statistical-analysis, one applying exploratory-data-analysis. Their findings were synthesized, read by an owl, attacked by an independently spawned potato, and reconciled by a second owl. This page is the output-html step.
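Reconciling to control totals is a few counting checks on the canonical table. A sketch, assuming each row is a dict with hypothetical keys `group`, `confederation` and `host`; the real CSV's column names are not shown in this report. It returns a list of failures rather than using `assert`, so the checks still run under `python -O`.

```python
from collections import Counter


def control_total_failures(rows):
    """Return human-readable failures against the report's control totals."""
    failures = []
    if len(rows) != 48:
        failures.append(f"expected 48 teams, got {len(rows)}")
    groups = Counter(r["group"] for r in rows)
    if len(groups) != 12 or set(groups.values()) != {4}:
        failures.append(f"expected 12 groups of 4, got {dict(groups)}")
    confeds = {r["confederation"] for r in rows}
    if len(confeds) != 6:
        failures.append(f"expected six confederations, got {len(confeds)}")
    hosts = sum(1 for r in rows if r["host"])
    if hosts != 3:
        failures.append(f"expected three hosts, got {hosts}")
    return failures
```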

Inputs

  • Fogle Market Power Ratings: Jeff Fogle, "Figure it Out" Substack, May 31 2026. Market-consensus goal-supremacy estimate (not a personal pick). 48 teams.
  • Elo: international-football.net, Jun 8 2026 (35/48). Polymarket win%: Jun 2026 (9/48). Both from the workbook's xG & Stats sheet.

Artifacts written to ratings_analysis/

wc2026_ratings_canonical.csv   // the single source of truth (48 x 14)
stats/   results.json, analyze.py, 4 figures   // statistical-analysis agent
eda/   eda_profile.json, EDA_REPORT.md, 13 figures   // exploratory-data-analysis agent
synthesis_owl1.md, potato.md, owl2.md   // the gauntlet, in full
build_html.py   // this report's generator

A correction worth keeping: the statistical agent's rank-divergence table compared a 1–48 Fogle rank against a global Elo world rank, which manufactured a false "Fogle loves CAF" signal. The EDA agent computed it correctly within the shared set. When two agents disagree on a derived number, that disagreement is the audit doing its job.