Input: a window of T tokens. Target: the next token.
T token embeddings, each of dimension D -> concatenated into one vector of length T*D
Vector of length T*D -> hidden layer -> V logits (one per vocabulary token)
Predict the next token, then slide the window forward one position and repeat
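
The steps above can be sketched in a few lines of plain Python. This is only an illustration under assumptions the notes do not state: the hidden size H, the tanh activation, the random untrained weights, and greedy argmax decoding are all choices made here, and every function name (make_pairs, init_params, logits, generate) is hypothetical.

```python
import math
import random

def make_pairs(tokens, T):
    """Build (window, target) pairs: T consecutive tokens and the token after them."""
    return [(tokens[i:i + T], tokens[i + T]) for i in range(len(tokens) - T)]

def init_params(V, D, T, H, seed=0):
    """Random, untrained weights. H (hidden size) is an assumed hyperparameter."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        "E": mat(V, D),       # embedding table: one D-vector per token id
        "W1": mat(T * D, H),  # concatenated window (T*D) -> hidden (H)
        "b1": [0.0] * H,
        "W2": mat(H, V),      # hidden (H) -> V logits
        "b2": [0.0] * V,
    }

def affine(x, W, b):
    """Compute x @ W + b for a vector x and a row-major matrix W."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j] for j in range(len(b))]

def logits(params, window):
    """T token ids -> T embeddings -> vector of length T*D -> hidden -> V logits."""
    x = [value for tok in window for value in params["E"][tok]]  # length T*D
    h = [math.tanh(a) for a in affine(x, params["W1"], params["b1"])]  # assumed tanh
    return affine(h, params["W2"], params["b2"])

def generate(params, window, steps):
    """Predict a token, then slide the window: drop the oldest id, append the new one."""
    window = list(window)
    out = []
    for _ in range(steps):
        scores = logits(params, window)
        nxt = max(range(len(scores)), key=scores.__getitem__)  # greedy argmax (assumed)
        out.append(nxt)
        window = window[1:] + [nxt]
    return out
```

With T = 3, make_pairs([0, 1, 2, 3, 4], 3) yields the windows [0, 1, 2] and [1, 2, 3] with targets 3 and 4. Because the weights here are untrained, generate only demonstrates the shapes and the sliding step, not meaningful predictions.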