DSA Studio
Self-Attention Checkpoint
Test your understanding of the attention mechanism.
1. The attention formula divides the query-key dot products by sqrt(d_k) in order to:
Prevent extreme softmax values
Speed up computation
Reduce memory
Enable parallelism
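A minimal NumPy sketch of the scaled dot-product attention that question 1 refers to (the function name and toy shapes are illustrative assumptions, not part of the quiz); the comment notes what the 1/sqrt(d_k) factor guards against:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    # Without the 1/sqrt(d_k) factor, dot products grow with d_k,
    # pushing the softmax toward near one-hot weights with tiny gradients.
    scores = Q @ K.T / np.sqrt(d_k)                      # shape (n, n)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
    return weights @ V                                   # shape (n, d_v)

# Toy example: 4 positions, d_k = d_v = 8
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
print(scaled_dot_product_attention(Q, K, V).shape)       # (4, 8)
```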
2. In causal attention, position i can attend to:
Positions 0 to i
Positions i to n
All positions
Only position i
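An illustrative sketch of the causal masking behind question 2, assuming a precomputed score matrix; entries above the diagonal (future positions) are set to -inf before the softmax:

```python
import numpy as np

def causal_attention_weights(scores):
    """Mask out positions j > i so position i only sees positions 0..i."""
    n = scores.shape[-1]
    # Boolean mask of the strict upper triangle marks the "future".
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)             # future -> -inf
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return weights / weights.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))
print(np.round(causal_attention_weights(scores), 2))
# Row i spreads weight uniformly over positions 0..i; masked positions get 0.
```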
3. Multi-head attention allows the model to:
Attend to different representation subspaces
Process longer sequences
Use less memory
Train faster
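A rough sketch of the head split behind question 3 (the helper name and dimensions are assumptions): the model dimension is divided into num_heads slices, and each head runs attention in its own lower-dimensional subspace:

```python
import numpy as np

def split_heads(X, num_heads):
    """Reshape (n, d_model) into (num_heads, n, d_model // num_heads)."""
    n, d_model = X.shape
    d_head = d_model // num_heads
    return X.reshape(n, num_heads, d_head).transpose(1, 0, 2)

# Each head attends over its own d_head-dimensional slice of the
# representation, so different heads can specialize on different patterns.
X = np.arange(6 * 8, dtype=float).reshape(6, 8)   # n=6 positions, d_model=8
heads = split_heads(X, num_heads=2)
print(heads.shape)   # (2, 6, 4): 2 heads, 6 positions, 4 dims per head
```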
4. Self-attention's complexity is O(n^?) with respect to sequence length (fill in the exponent).
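For intuition on question 4, the score matrix compares every query against every key, so its size grows with the square of the sequence length; a shape-only check with toy sizes (not a benchmark):

```python
import numpy as np

for n in (128, 256, 512):
    Q = np.zeros((n, 64))
    K = np.zeros((n, 64))
    scores = Q @ K.T
    print(n, scores.shape, scores.size)   # entry count grows quadratically
# 128 (128, 128) 16384
# 256 (256, 256) 65536
# 512 (512, 512) 262144
```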
5. Q, K, V in attention stand for:
Query, Key, Value
Quality, Kernel, Vector
Quantized, Known, Variable
Queue, Key, Validation
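For question 5, a small sketch of where the three tensors come from in self-attention: three learned linear projections of the same input sequence (the matrix names W_Q, W_K, W_V and the sizes are illustrative assumptions):

```python
import numpy as np

# In self-attention, queries, keys, and values are all computed from
# the same input X via separate learned projection matrices.
rng = np.random.default_rng(0)
n, d_model, d_k = 6, 16, 8
X = rng.normal(size=(n, d_model))
W_Q = rng.normal(size=(d_model, d_k))   # query projection
W_K = rng.normal(size=(d_model, d_k))   # key projection
W_V = rng.normal(size=(d_model, d_k))   # value projection
Q, K, V = X @ W_Q, X @ W_K, X @ W_V
print(Q.shape, K.shape, V.shape)        # (6, 8) each
```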