Transformer Block Checkpoint

1. A transformer block consists of:

   - Attention + FFN + LayerNorm + Residuals
   - Only attention
   - Only FFN
   - Convolutions

2. Pre-norm means layer normalization is applied:

   - Before attention/FFN
   - After attention/FFN
   - Only at the end
   - Not at all

3. The FFN typically expands dimensions by:

   - 4x
   - 2x
   - 8x
   - 1x

4. Residual connections help with training deep networks by enabling:

5. GPT-2 uses which activation in the FFN?

   - GELU
   - ReLU
   - Sigmoid
   - Tanh
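
For review after attempting the questions, the pieces named above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: it uses a single attention head with causal masking, omits biases, the attention output projection, and the learned LayerNorm scale and shift, and all names (`transformer_block`, `params`, `random_matrix`) are hypothetical.

```python
import math
import random

def layer_norm(x, eps=1e-5):
    """Normalize one vector to zero mean, unit variance (learned scale/shift omitted)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def gelu(v):
    """Tanh approximation of GELU."""
    return 0.5 * v * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (v + 0.044715 * v ** 3)))

def linear(x, w):
    """Multiply vector x (length n) by weight matrix w (n rows, m columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(xs, wq, wk, wv):
    """Single-head causal self-attention over a list of token vectors."""
    qs = [linear(x, wq) for x in xs]
    ks = [linear(x, wk) for x in xs]
    vs = [linear(x, wv) for x in xs]
    d = len(qs[0])
    out = []
    for t, q in enumerate(qs):
        # Causal mask: position t only attends to positions 0..t.
        scores = [sum(a * b for a, b in zip(q, ks[s])) / math.sqrt(d) for s in range(t + 1)]
        weights = softmax(scores)
        out.append([sum(w * vs[s][j] for s, w in enumerate(weights)) for j in range(d)])
    return out

def ffn(x, w_in, w_out):
    """Position-wise feed-forward: expand, apply GELU, project back."""
    return linear([gelu(h) for h in linear(x, w_in)], w_out)

def transformer_block(xs, params):
    """Pre-norm block: x + Attn(LN(x)), then x + FFN(LN(x))."""
    normed = [layer_norm(x) for x in xs]
    attn = self_attention(normed, params["wq"], params["wk"], params["wv"])
    xs = [[a + b for a, b in zip(x, y)] for x, y in zip(xs, attn)]  # residual 1
    ff = [ffn(layer_norm(x), params["w_in"], params["w_out"]) for x in xs]
    return [[a + b for a, b in zip(x, y)] for x, y in zip(xs, ff)]  # residual 2

def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d_model = 8
params = {
    "wq": random_matrix(d_model, d_model, rng),
    "wk": random_matrix(d_model, d_model, rng),
    "wv": random_matrix(d_model, d_model, rng),
    "w_in": random_matrix(d_model, 4 * d_model, rng),   # hidden width is 4 * d_model
    "w_out": random_matrix(4 * d_model, d_model, rng),
}
tokens = [[rng.gauss(0.0, 1.0) for _ in range(d_model)] for _ in range(3)]
out = transformer_block(tokens, params)
print(len(out), len(out[0]))
```

Because each sublayer's output is added to its input, setting every weight to zero makes the block return its input unchanged, which is a quick way to check that both residual paths are wired correctly.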