ReadPaper Blog
NITP: Next Implicit Token Prediction for LLM Pre-training
The paper argues that standard Next-Token Prediction (NTP) trains large language models with sparse one-hot supervision in the output logit space, leaving hidden representations under-constrained and prone to degenerate, anisotropic geometry. It proposes Next Implicit Token Prediction (NITP), an auxiliary pre-training objective that asks the model to predict the next token’s latent semantic representation using shallow-layer self-supervised targets. The reported result is better representation geometry and stronger downstream performance across dense and Mixture-of-Experts models with negligible extra training cost and no inference-time cost.
Source: NITP: Next Implicit Token Prediction for LLM Pre-training

Mission Briefing: Why NTP Feels Under-Constrained
The paper’s central motivation is that Next-Token Prediction, despite being the dominant pre-training objective for large language models, may not sufficiently supervise the geometry of hidden representations. In standard NTP, the model minimizes cross-entropy against a discrete next-token label, so the strongest signal reaches the final hidden state through the output projection and the target token direction. The authors argue that this creates many weakly constrained degrees of freedom in latent space, because many different hidden-state configurations can produce similar token logits. This under-constraint matters because the representations learned during base pre-training determine how well the resulting model generalizes to downstream tasks. NITP is introduced as a way to preserve the strengths of autoregressive token prediction while adding direct supervision over how hidden states are structured.

The Warning Sign: Representation Degeneration
The paper diagnoses the failure mode as representation degeneration, a phenomenon in which hidden states collapse toward a narrow, anisotropic region of the embedding space. To study this during training, the authors track Effective Rank, which measures the effective dimensionality used by representations, and Average Cosine Similarity, which acts as a proxy for global anisotropy. Under standard NTP, the reported pattern is that Effective Rank falls rapidly while cosine similarity rises, indicating that representations become less diverse and more directionally aligned. The authors connect this behavior to prior work on anisotropy and collapsed representations in language models, including studies by Ethayarajh and others. The implication is that optimizing token likelihood alone can reward discriminative next-token accuracy while allowing semantic expressiveness in the latent space to erode.
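The two diagnostics can be sketched in a few lines of NumPy. This is an illustrative sketch, not the paper's code. It assumes Effective Rank is the exponential of the Shannon entropy of the normalized singular values (a common definition; the paper may use a variant) and that Average Cosine Similarity is the mean over all distinct pairs of hidden states. The function names and the synthetic `isotropic` and `degenerate` matrices are hypothetical.

```python
import numpy as np


def effective_rank(hidden, eps=1e-12):
    """Exponential of the entropy of the normalized singular values.

    hidden: array of shape (num_tokens, d_model).
    Assumed definition; the paper's exact formula may differ.
    """
    singular_values = np.linalg.svd(hidden, compute_uv=False)
    p = singular_values / (singular_values.sum() + eps)
    p = p[p > eps]  # drop zero entries so log() stays finite
    return float(np.exp(-(p * np.log(p)).sum()))


def average_cosine_similarity(hidden, eps=1e-12):
    """Mean cosine similarity over all distinct pairs of rows."""
    unit = hidden / (np.linalg.norm(hidden, axis=1, keepdims=True) + eps)
    sims = unit @ unit.T
    n = hidden.shape[0]
    off_diagonal_sum = sims.sum() - np.trace(sims)  # exclude self-similarity
    return float(off_diagonal_sum / (n * (n - 1)))


# Synthetic data for illustration only (not from the paper).
rng = np.random.default_rng(0)
isotropic = rng.standard_normal((256, 64))
shared_direction = rng.standard_normal(64)
# Every row is dominated by one shared direction: a toy "collapsed" geometry.
degenerate = shared_direction + 0.1 * rng.standard_normal((256, 64))

for name, states in [("isotropic", isotropic), ("degenerate", degenerate)]:
    print(name, effective_rank(states), average_cosine_similarity(states))
```

On the toy data, the collapsed matrix shows the pattern the authors describe for standard NTP: lower Effective Rank and higher Average Cosine Similarity than the isotropic one.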

The New Technique: NITP
Next Implicit Token Prediction addresses this gap by adding dense, continuous supervision in representation space alongside the usual NTP loss. Rather than asking the model only to predict the identity of the next discrete token, NITP asks the final hidden state at position t to predict an implicit token: a latent semantic representation associated with token t+1. This reframes autoregressive learning as both a token-level prediction problem and a representation-level prediction problem. The paper presents NITP as complementary to methods such as multi-token prediction, because NITP does not primarily extend the discrete prediction horizon but instead changes the kind of supervision applied to hidden states. The intended effect is to regularize the optimization landscape by reducing unconstrained latent directions and encouraging a more compact, structured representation geometry.

How It Works: Shallow Layers as Targets
The method constructs NITP targets from the model’s own shallow-layer representations, which the authors describe as retaining richer lexical and local semantic content than deeper, more task-specialized layers. During the forward pass, these shallow representations are temporally shifted so that the current final hidden state is trained to match the next token’s implicit representation. A stop-gradient operation prevents the target representation from being directly pulled by the auxiliary loss, making it a stable self-supervised anchor. The last hidden state is mapped through a projector and optimized with a cosine similarity loss against the implicit target, while the ordinary cross-entropy NTP objective remains in place. This design avoids external encoders and additional annotation, which is important for scaling the method to large pre-training runs.
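The auxiliary loss described above can be sketched as a forward-only NumPy computation. This is a minimal sketch under stated assumptions, not the authors' implementation: the projector is taken to be a single linear map, the loss is taken to be one minus cosine similarity averaged over positions, and the `aux_weight` used to combine it with the NTP loss is a hypothetical hyperparameter. NumPy has no autograd, so the stop-gradient appears only as a comment.

```python
import numpy as np


def nitp_cosine_loss(final_hidden, shallow_hidden, proj_weight, eps=1e-8):
    """Cosine loss between projected final states and shifted shallow targets.

    final_hidden:   (seq_len, d_model) last-layer hidden states.
    shallow_hidden: (seq_len, d_model) shallow-layer hidden states.
    proj_weight:    (d_model, d_model) linear projector (assumed form).
    """
    # Temporal shift: position t predicts the target at position t+1,
    # so the last position has no target and is dropped.
    predictions = final_hidden[:-1] @ proj_weight
    # In an autograd framework the targets would be treated as constants here
    # (stop-gradient), so the auxiliary loss does not pull on them directly.
    targets = shallow_hidden[1:]

    pred_unit = predictions / (np.linalg.norm(predictions, axis=-1, keepdims=True) + eps)
    target_unit = targets / (np.linalg.norm(targets, axis=-1, keepdims=True) + eps)
    cosine = (pred_unit * target_unit).sum(axis=-1)
    return float((1.0 - cosine).mean())


def combined_loss(ntp_loss, nitp_loss, aux_weight=0.1):
    """NTP cross-entropy plus the weighted auxiliary term (weight is assumed)."""
    return ntp_loss + aux_weight * nitp_loss


# Toy check with random states (illustrative only).
rng = np.random.default_rng(0)
shallow = rng.standard_normal((6, 4))
final = np.roll(shallow, -1, axis=0)  # final[t] equals shallow[t+1] for t < 5
print(nitp_cosine_loss(final, shallow, np.eye(4)))  # close to zero by construction
```

Because the projector and the cosine loss are used only to compute a training signal, nothing in this sketch is needed at generation time, which is consistent with the method adding no inference-time cost.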

Outcome: Better Geometry, Better Performance
The paper reports that NITP improves both representation geometry and downstream performance across dense and Mixture-of-Experts models ranging from 0.5B to 9B parameters. In the geometry analysis, NITP maintains higher Effective Rank and prevents excessive growth in cosine similarity compared with standard NTP, supporting the claim that the auxiliary objective mitigates representation degeneration. Empirically, the reported results show consistent downstream gains, including a 5.7% absolute improvement on MMLU-Pro for the 9B MoE model, along with 6.4% on C3 and 4.3% on CommonsenseQA. The additional training cost is described as approximately 2% more FLOPs, and the method adds no inference-time cost because the auxiliary training machinery is not needed during deployment. The paper’s broader implication is that improving the latent supervision of pre-training objectives can enhance LLM generalization without changing the autoregressive generation interface.
