
The Transformer Family Version 2.0

Many new Transformer architecture improvements have been proposed since my last post on “The Transformer Family” about three years ago. Here I did a big refactoring and enrichment of that 2020 post, restructuring the hierarchy of sections and improving many sections with more recent papers. Version 2.0 is a superset of the old version and about twice the length.

Notations

| Symbol | Meaning |
| --- | --- |
| $d$ | The model size / hidden state dimension / positional encoding size. |
| $h$ | The number of heads in the multi-head attention layer. |
| $L$ | The segment length of the input sequence. |
| $N$ | The total number of attention layers in the model; not considering MoE. |
| $\mathbf{X} \in \mathbb{R}^{L \times d}$ | The input sequence where each element has been mapped into an embedding vector of size $d$, same as the model size. |
| $\mathbf{W}^k \in \mathbb{R}^{d \times d_k}$ | The key weight matrix. |
| $\mathbf{W}^q \in \mathbb{R}^{d \times d_k}$ | The query weight matrix. |
| $\mathbf{W}^v \in \mathbb{R}^{d \times d_v}$ | The value weight matrix. Often we have $d_k = d_v = d$. |
| $\mathbf{W}^k_i, \mathbf{W}^q_i \in \mathbb{R}^{d \times d_k/h}; \mathbf{W}^v_i \in \mathbb{R}^{d \times d_v/h}$ | The weight matrices per head. |
| $\mathbf{W}^o \in \mathbb{R}^{d_v \times d}$ | The output weight matrix. |
| $\mathbf{Q} = \mathbf{X}\mathbf{W}^q \in \mathbb{R}^{L \times d_k}$ | The query embedding inputs. |
| $\mathbf{K} = \mathbf{X}\mathbf{W}^k \in \mathbb{R}^{L \times d_k}$ | The key embedding inputs. |
| $\mathbf{V} = \mathbf{X}\mathbf{W}^v \in \mathbb{R}^{L \times d_v}$ | The value embedding inputs. |
| $\mathbf{q}_i, \mathbf{k}_i \in \mathbb{R}^{d_k}, \mathbf{v}_i \in \mathbb{R}^{d_v}$ | Row vectors in the query, key and value matrices, $\mathbf{Q}$, $\mathbf{K}$ and $\mathbf{V}$. |
| $S_i$ | A collection of key positions for the $i$-th query $\mathbf{q}_i$ to attend to. |
| $\mathbf{A} \in \mathbb{R}^{L \times L}$ | The self-attention matrix between an input sequence of length $L$ and itself. $\mathbf{A} = \text{softmax}(\mathbf{Q}\mathbf{K}^\top / \sqrt{d_k})$. |
| $a_{ij} \in \mathbf{A}$ | The scalar attention score between query $\mathbf{q}_i$ and key $\mathbf{k}_j$. |
| $\mathbf{P} \in \mathbb{R}^{L \times d}$ | The positional encoding matrix, where the $i$-th row $\mathbf{p}_i$ is the positional encoding for input $\mathbf{x}_i$. |
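
To make the notation concrete, here is a minimal NumPy sketch of the scaled dot-product self-attention defined above, $\mathbf{A} = \text{softmax}(\mathbf{Q}\mathbf{K}^\top / \sqrt{d_k})$ applied to $\mathbf{V}$; the dimensions and random weights are illustrative placeholders, not trained parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention: A = softmax(Q K^T / sqrt(d_k)), output = A V.

    X: (L, d) input sequence; W_q, W_k: (d, d_k); W_v: (d, d_v).
    Returns the (L, d_v) attention output and the (L, L) attention matrix A.
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v                   # (L, d_k), (L, d_k), (L, d_v)
    A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1)  # (L, L)
    return A @ V, A

# Toy shapes with d_k = d_v = d = 8 and random placeholder weights.
L, d = 4, 8
rng = np.random.default_rng(0)
X = rng.normal(size=(L, d))
out, A = self_attention(X, rng.normal(size=(d, d)), rng.normal(size=(d, d)), rng.normal(size=(d, d)))
print(out.shape, A.shape)  # (4, 8) (4, 4)
```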

Transformer Basics

The Transformer (which will be referred to as the “vanilla Transformer” to distinguish it from other enhanced versions; Vaswani, et al., 2017) model has an encoder-decoder architecture, as commonly used in many NMT models. Later, simplified Transformer variants were shown to achieve great performance in language modeling tasks, such as the encoder-only BERT and the decoder-only GPT.
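
As a rough illustration of how a multi-head attention layer combines the per-head weight matrices $\mathbf{W}^q_i, \mathbf{W}^k_i, \mathbf{W}^v_i$ and the output projection $\mathbf{W}^o$ from the notation table, here is a minimal NumPy sketch (not a full Transformer block; layer normalization, residual connections and the feed-forward sublayer are omitted):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o):
    """Multi-head self-attention with per-head weight matrices.

    X: (L, d); W_q, W_k, W_v: lists of h matrices of shape (d, d_k/h) or (d, d_v/h);
    W_o: (d_v, d). Each head runs scaled dot-product attention independently;
    the head outputs are concatenated and projected by W_o.
    """
    heads = []
    for W_q_i, W_k_i, W_v_i in zip(W_q, W_k, W_v):
        Q, K, V = X @ W_q_i, X @ W_k_i, X @ W_v_i
        A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1)
        heads.append(A @ V)                           # (L, d_v / h) per head
    return np.concatenate(heads, axis=-1) @ W_o       # (L, d)

# Toy example with h = 2 heads and d_k = d_v = d = 8 (random placeholder weights).
L, d, h = 4, 8, 2
rng = np.random.default_rng(1)
X = rng.normal(size=(L, d))
W_q = [rng.normal(size=(d, d // h)) for _ in range(h)]
W_k = [rng.normal(size=(d, d // h)) for _ in range(h)]
W_v = [rng.normal(size=(d, d // h)) for _ in range(h)]
W_o = rng.normal(size=(d, d))
print(multi_head_attention(X, W_q, W_k, W_v, W_o).shape)  # (4, 8)
```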