Lookup table W of shape (V, D)
Table P of shape (Tmax, D)
input[i] = token_emb[i] + pos_emb[i]
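
The sum above can be sketched in plain Python. This is a minimal illustration, not a definitive implementation: it assumes token_emb[i] means the row of W selected by the i-th token id and pos_emb[i] means row i of P. The toy sizes, the placeholder random values, and the name `embed` are hypothetical.

```python
import random

V, D, T_MAX = 10, 4, 8  # toy sizes, assumed for illustration only

rng = random.Random(0)
# Placeholder values; any (V, D) and (Tmax, D) tables would work here.
W = [[rng.uniform(-0.5, 0.5) for _ in range(D)] for _ in range(V)]      # token table, shape (V, D)
P = [[rng.uniform(-0.5, 0.5) for _ in range(D)] for _ in range(T_MAX)]  # position table, shape (Tmax, D)

def embed(token_ids):
    """Return input[i] = W[token_ids[i]] + P[i] for each position i."""
    if len(token_ids) > T_MAX:
        raise ValueError("sequence is longer than Tmax")
    return [
        [w + p for w, p in zip(W[tok], P[i])]
        for i, tok in enumerate(token_ids)
    ]

x = embed([3, 1, 3])  # three positions, each a vector of length D
```

In this sketch the same token id at positions 0 and 2 yields different vectors, because a different row of P is added at each position.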
k = 1/sqrt(D); sample U(-k, k)
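
A minimal sketch of this initialization, using only the standard library. The source does not say which table it applies to, so the helper takes any row count; the name `init_table` and the `seed` parameter are hypothetical.

```python
import math
import random

def init_table(rows, dim, seed=0):
    """Fill a (rows, dim) table with samples from U(-k, k), where k = 1/sqrt(dim)."""
    k = 1.0 / math.sqrt(dim)
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    return [[rng.uniform(-k, k) for _ in range(dim)] for _ in range(rows)]

table = init_table(rows=5, dim=16)  # here k = 1/sqrt(16) = 0.25
```

Because the bound scales as 1/sqrt(D), a wider embedding gets a narrower sampling range.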