Research Note

Pokemon JEPA: Learning a Battle World Model

I trained a small Transformer on Pokemon Showdown replays to answer a simple question: does predicting the next state teach a model more about a game than predicting the final winner alone?

Intro

Pokemon Showdown is an online battle simulator where players test competitive Pokemon teams and every turn is recorded as a text log. A standard singles battle is a six-on-six game: each player has one active Pokemon, five on the bench, and chooses a move or switch every turn until one side has no usable Pokemon left.

That makes the modeling problem unusually clean. The battle log gives me a sequence of states and a final winner, but the cleanest label is also the most delayed one: somebody wins at the end. Pokemon is full of latent momentum: a full-health Pokemon can be useless into the wrong wall, a chipped setup sweeper can be decisive, and a missing sixth team slot can change the entire plan.

The experiment was to make the model learn those mechanics directly. I used a Joint-Embedding Predictive Architecture objective over structured battle states, then compared it against winner-only Transformers and tabular baselines. The punchline from the stricter replay-held-out run was not "winner accuracy went to the moon." It was more useful: the JEPA latent retained far more next-state information, cutting frozen next-HP probe MSE by 37-76% versus a winner-only latent across the replay budgets I tested.

  1. Represent each battle as 12 object tokens: both active Pokemon plus both benches.
  2. Train the latent state to predict next-turn HP movement, not only the final winner.
  3. Reuse the same latent space for a masked-species teambuilder that fills a missing sixth slot.

Background

A standard supervised model can learn Pokemon correlations: high-usage species, HP advantage, faint counts, and obvious matchups. That is already useful. The failure mode is that correlations can masquerade as understanding. If the model only sees final winners, it can be right for shallow reasons and brittle in states where momentum is hidden.

JEPA changes the pressure on the representation. Instead of asking only "who eventually won?", it also asks "what should the next board look like?" For this environment, I used next-turn HP as the predictive target. It is not a complete simulator, but it is a dense proxy for damage, recovery, switching pressure, setup, walls, and tempo.

Schematic showing a Pokemon battle state as twelve object tokens
Figure 1. The required data is already in the parsed replay state: six slots for each player, with active Pokemon and bench context encoded together.

Model

The parser converts each replay turn into a fixed sequence of 12 Pokemon objects: P1's active slot and bench, then P2's active slot and bench. The lightweight study used species, HP, and status. The teambuilder extension adds sparse item and move channels, plus a masked-species head for asking, "what should fill this missing slot?"
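To make the layout concrete, here is a minimal sketch of the 12-token state. The `PokemonToken` fields and `build_state` helper are illustrative names, not the project's actual parser API:

```python
from dataclasses import dataclass

@dataclass
class PokemonToken:
    species: str    # species id from the replay parser
    hp: float       # normalized HP in [0, 1]
    status: str     # '' | 'brn' | 'par' | 'psn' | 'slp' | 'frz'
    active: bool    # True for the currently active slot

def build_state(p1_team, p2_team):
    """Flatten both sides into a fixed 12-token sequence:
    P1 active, P1 bench (5), then P2 active, P2 bench (5)."""
    def ordered(team):
        active = [t for t in team if t.active]
        bench = [t for t in team if not t.active]
        return active + bench

    tokens = ordered(p1_team) + ordered(p2_team)
    assert len(tokens) == 12, "a standard singles state has six slots per side"
    return tokens
```

Fixing the slot order means the encoder never has to guess which side a token belongs to; position alone carries it.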

Parsed Replays: 991 ([Gen 9] PU Pokemon Showdown logs)
Transitions: 20,831 (within-replay next-state pairs)
State Tokens: 12 (six Pokemon per side)
Local GPU: 1080 Ti (12GB VRAM; minutes per training run)

Diagram of the Pokemon JEPA architecture
Figure 2. The encoder maps 12 object tokens into a latent state. One head predicts next-turn HP, and another predicts the eventual winner.

Input

A structured board state: species embeddings, normalized HP, status, and in the teambuilder model, item and move embeddings.

Latent

A compact Transformer representation over all 12 Pokemon, allowing active threats, bench answers, and opponent structure to interact through self-attention.

Objectives

Mean-squared error on next-turn HP, binary cross-entropy on winner, and masked species prediction for team completion.
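A back-of-the-envelope sketch of how the first two objectives combine, in plain Python with no framework (the masked-species cross-entropy term is omitted, and the unclipped log is fine only for moderate logits):

```python
import math

def combined_loss(pred_hp, true_hp, pred_win_logit, win_label):
    """MSE on the 12-dim next-turn HP vector plus BCE on the winner label.
    A sketch of the loss shape, not the repo's training code."""
    mse = sum((p - t) ** 2 for p, t in zip(pred_hp, true_hp)) / len(true_hp)
    p = 1.0 / (1.0 + math.exp(-pred_win_logit))   # sigmoid
    bce = -(win_label * math.log(p) + (1 - win_label) * math.log(1.0 - p))
    return mse + bce
```

The HP term fires on every transition, while the winner term fires once per replay, which is exactly the density gap the note is about.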

Concretely, each training example is a pair: the current board state and the next board's HP vector. The model is not asked to imitate a move directly. It has to compress the current situation into a latent state that is predictive of what the board will look like after the next turn.
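The pairing logic can be sketched in a few lines; `turn_states` is assumed to be a per-replay list of parsed 12-slot states, each slot a dict with an "hp" field:

```python
def make_transitions(turn_states):
    """Build within-replay (current state, next-turn HP vector) pairs.
    The winner label is attached separately, once per replay."""
    pairs = []
    for cur, nxt in zip(turn_states, turn_states[1:]):
        target_hp = [slot["hp"] for slot in nxt]  # dense 12-dim target
        pairs.append((cur, target_hp))
    return pairs
```

A replay with T parsed turns yields T-1 transitions, which is how 991 replays become 20,831 training pairs.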

Sample Pokemon JEPA input state and next-turn HP target
Figure 3. A sample input/target pair: current board in, next-turn HP values and winner label out. The HP target supplies dense supervision long before the final win label.

Results

The main comparison is deliberately small: 991 parsed public [Gen 9] PU replays from the Pokemon Showdown replays dataset, 20,831 within-replay turn transitions after filtering, and local training on a GTX 1080 Ti. The goal was not to claim a universal Pokemon engine. It was to test whether a dynamics objective changes what a compact model learns. The code, generated figures, and lightweight result artifacts are available in the public project repo.

Line chart comparing next-HP prediction error from JEPA and supervised latents

Result 1

The dynamics signal is the clean win

I freeze the trained encoders, fit a small ridge probe from latent state to next-turn HP, and evaluate on held-out replays. Lower MSE means the latent retained more information about how the board changes.

The JEPA latent is consistently easier to decode into the next HP vector than the winner-only latent. That is the core empirical claim: the objective changes what the representation keeps.
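The probe itself is tiny. A minimal numpy version of the frozen-latent ridge readout, in closed form, assuming the latents `Z` and next-HP targets `Y` have already been extracted as arrays:

```python
import numpy as np

def ridge_probe_mse(Z_train, Y_train, Z_test, Y_test, alpha=1.0):
    """Fit a ridge readout from frozen latents Z to next-turn HP vectors Y,
    then report held-out MSE. Closed form: W = (Z'Z + aI)^-1 Z'Y."""
    d = Z_train.shape[1]
    W = np.linalg.solve(Z_train.T @ Z_train + alpha * np.eye(d),
                        Z_train.T @ Y_train)
    pred = Z_test @ W
    return float(np.mean((pred - Y_test) ** 2))
```

Because the readout is linear and fixed-capacity, any difference in probe MSE has to come from what the frozen encoder kept, not from the probe itself.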

Bar chart showing JEPA next-HP probe MSE reduction versus supervised latents

Result 2

The relative gap grows with more data

This chart turns the previous curve into a percent reduction versus the supervised latent. At 25-100 training replays the advantage is already visible; by 500 replays, the JEPA latent has 76% lower next-HP probe error.

Winner accuracy remains noisy in this small replay-held-out setup, so this is the better result to lead with: the world-model objective makes the state representation more predictive of the next state.

Metric                                  | 25 Replays | 100 Replays | 500 Replays | Interpretation
JEPA latent -> next-HP probe MSE        | 0.272      | 0.146       | 0.045       | Lower is better; frozen latent, ridge readout
Supervised latent -> next-HP probe MSE  | 0.429      | 0.261       | 0.186       | Same encoder shape, trained only on winner labels
JEPA MSE reduction vs supervised latent | 37%        | 44%         | 76%         | How much next-state information the JEPA objective preserved
Winner-head accuracy                    | 52.2%      | 50.4%       | 55.2%       | Noisy under replay-held-out evaluation; not the main claim

This is why the results section should lead with dynamics, not a two-bar winner-accuracy chart. The winner head is noisy under replay-held-out evaluation, while the frozen-latent probe shows the effect of the objective directly: the JEPA representation remembers much more about how HP will move next.
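The percent reductions in the table follow directly from the two MSE rows:

```python
# Probe MSE by training-replay budget, from the results table.
jepa = {25: 0.272, 100: 0.146, 500: 0.045}
supervised = {25: 0.429, 100: 0.261, 500: 0.186}

reduction = {n: round(100 * (supervised[n] - jepa[n]) / supervised[n])
             for n in jepa}
print(reduction)  # {25: 37, 100: 44, 500: 76}
```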

PCA map of held-out JEPA latents colored by winner and next-turn HP swing

Visual Probe

Latent-space map

This is an interpretability check, not the headline metric. I take held-out battle states, encode each one into a JEPA latent vector, and project those vectors to 2D with PCA. The two panels show the exact same dots; only the color changes.

  • The left coloring asks whether nearby latent states tend to share the same eventual winner.
  • The right coloring asks whether nearby latent states also share similar next-turn HP movement.
  • If the representation were just species lookup noise, both colorings would look spatially random.

The map is intentionally modest: there are no clean separable clusters, and PCA is lossy. The useful signal is that the same latent geometry carries both strategic outcome information and tactical HP-delta information, matching what the frozen dynamics probe measured quantitatively.
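For reference, the projection is plain PCA over the held-out latents; a minimal numpy version, assuming the encoder has already produced the `latents` matrix:

```python
import numpy as np

def pca_2d(latents):
    """Project (n, d) latent vectors onto their top-2 principal
    components. Both panels share these coordinates; only the
    per-point coloring (winner vs next-turn HP swing) differs."""
    Z = latents - latents.mean(axis=0)            # center
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    return Z @ Vt[:2].T                           # (n, 2) map coordinates
```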

Heatmap comparing actual and JEPA-predicted HP deltas for one held-out Pokemon battle transition

Visual Probe

HP-delta map

This shows one high-movement held-out transition, slot by slot. The top row is the actual next-state HP delta; the bottom row is the JEPA HP head's prediction.

Because active slots can change during switches, this is a next-state slot target, not a pure same-Pokemon damage log. The useful question is whether the model has learned where large changes are likely to happen.

Examples

The most useful way to read the examples is as a sequence. First, the JEPA model evaluates battle states more like a forward simulation than a popularity score. Then the same representation transfers into teambuilding, where it fills missing roles instead of merely suggesting common Pokemon.

State Evaluation

Dynamics Example 1

A healthy team that is already losing

The label-only model sees six healthy Pokemon and several recognizable attackers. The JEPA model sees a stalled damage trajectory.

P1 board

Salazzle 100% special pressure
Bellibolt 100% bulky special
Cryogonal 100%
Grimmsnarl 100%
Paldean Tauros Aqua 100%
Rotom Mow 100%

P2 board

Froslass 100%
Coalossal 100% Salazzle answer
Scream Tail 100% special wall
Grimmsnarl 100%
Dusknoir 100%
Clawitzer 100%
Supervised 71.6% P1
JEPA 0.5% P1
Actual P2 wins
  1. Surface read: Salazzle, Bellibolt, and Rotom-Mow look like enough special pressure to carry P1.
  2. World-model read: Scream Tail absorbs that pressure, Coalossal turns Salazzle into tempo, and P1 lacks a physical breaker that changes the HP trajectory.
  3. Why it matters: the JEPA objective rewards representations that notice future board movement, so it can call a collapse before the winner label becomes obvious.

Dynamics Example 2

The head-count trap

P1 appears to have material left. The real game is about whether anything can survive the boosted Scyther sequence.

P1 board

Alcremie 100%
Gastrodon East 0% fainted
Coalossal 0% fainted
Rotom Mow 28% chipped
Paldean Tauros Blaze 45% chipped
Hisuian Braviary 100%

P2 board

Scyther 77% Swords Dance + Trailblaze
Hattrem 3%
Bombirdier 100%
Arcanine 64%
Hisuian Decidueye 100%
Rhydon 100%
Supervised 73.0% P1
JEPA 10.4% P1
Actual P2 wins
  1. Surface read: P1 still has four visible bodies and two untouched Pokemon, so the supervised model remains bullish.
  2. World-model read: Scyther has already converted setup into speed and damage. The next few turns are not independent; they are a sweep line.
  3. Why it matters: predicting HP deltas forces the latent to encode momentum, not just count remaining Pokemon.

Team Completion

Teambuilder Example 1

Finishing the sand core

Once the latent understands weather mechanics, the missing sixth slot stops being a popularity contest.

Five-slot P1 core

Hippopotas 100% sand setter
Sandaconda 100%
Sandslash 100%
Lycanroc 100%
Quaxwell 100%
?

Opponent context

Ambipom 100%
Uxie 100%
Tatsugiri 100%
Shaymin 100%
Paldean Tauros Blaze 100%
Florges 100%
Supervised pick: Grumpig

44.1% predicted P1 win

JEPA pick: Houndstone

77.6% predicted P1 win

+33.5 pts
  1. Supervised suggestion: Grumpig is a plausible generic utility pick, but it does not amplify the existing plan.
  2. JEPA suggestion: Houndstone completes the sand offense. Sand Stream plus Sand Rush changes the speed landscape, which changes the win trajectory.
  3. Modeling win: the same latent that predicts turn movement also recognizes team-level synergy.
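The pick mechanism in these examples can be read as: place each candidate species in the masked slot, then score the completed team with the winner head. A sketch, where `win_prob` is a hypothetical stand-in for the trained winner head, not the repo's API:

```python
def fill_sixth_slot(core_five, opponent, candidates, win_prob):
    """Return the candidate that maximizes the model's predicted P1 win
    probability when inserted into the masked sixth slot."""
    scored = [(win_prob(core_five + [c], opponent), c) for c in candidates]
    best_p, best_pick = max(scored)
    return best_pick, best_p
```

The masked-species head narrows the candidate list before this scoring pass, so the search is not a raw sweep over the whole tier.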

Teambuilder Example 2

Patching the defensive hole

Adding more firepower to a glass cannon team looks attractive until the opponent has a setup line.

Five-slot P1 core

Lycanroc 88% Focus Sash + Stealth Rock
Goodra 37% chipped pivot
Minior 100%
Zoroark 100%
Alolan Sandslash 100%
?

Opponent context

Snorlax 38% Belly Drum threat
Exeggutor 100%
Carbink 100%
Galarian Articuno 100%
Dusknoir 0% Trick Room revealed
Grimmsnarl 100%
Supervised pick: Arcanine

0.8% predicted P1 win

JEPA pick: Uxie

28.5% predicted P1 win

+27.7 pts
  1. Supervised suggestion: Arcanine adds another attacker to an already fragile core.
  2. JEPA suggestion: Uxie gives the team a bulky utility pivot and a way to disrupt the Snorlax / Articuno-G endgame.
  3. Modeling win: the latent does not merely ask which Pokemon is common; it asks which missing role changes the future state.

Discussion

The result changed my view of the problem. The useful artifact is not a magically better winner classifier; under a replay-held-out split, that signal is noisy and easy to overfit. The useful artifact is a latent state that can be reused: it preserves next-state information, supports masked team completion, and produces tactical disagreements that are explainable in Pokemon terms.

The current system is still a proof of concept. It uses only 991 Gen 9 PU replays, its move and item channels are sparse, and next-turn HP is only a proxy for battle quality. The next steps are clear: scale to larger Random Battles corpora, add richer move/item and weather/terrain features, probe the latent space for interpretable clusters, and connect the state model to action selection rather than only evaluation and teambuilding.
