Comparing the Attention Matrices of Three Transformer Architectures

Category: Attention and Transformers · Difficulty: Intermediate · Related lecture: L06

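This note compares how three Transformer architectures, Encoder (BERT), Decoder (GPT), and Encoder-Decoder (T5), differ in their attention mechanisms. It covers the constraint each places on the attention matrix, the mathematical form of Cross-Attention, and the rotation construction behind RoPE positional encoding.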

1. Formalizing the three attention matrices

Encoder (BERT), bidirectional full attention:

$A_{ij} = \frac{q_i^T k_j}{\sqrt{d_k}} \quad \forall\, i, j \in \{1, \ldots, n\}$

Here $A$ is the matrix of pre-softmax scores. It is unconstrained: every token can attend to every token, including later positions.
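The unconstrained score matrix can be sketched in a few lines of plain Python. This is a minimal illustration, not an actual BERT implementation: the helper names `attention_scores` and `softmax`, and the toy values of `Q` and `K`, are assumptions made for this example.

```python
import math

def attention_scores(Q, K):
    """Return the n x n matrix A with A[i][j] = q_i . k_j / sqrt(d_k)."""
    d_k = len(K[0])
    scale = math.sqrt(d_k)
    return [[sum(q * k for q, k in zip(q_i, k_j)) / scale for k_j in K]
            for q_i in Q]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

# Toy input (made-up values): n = 3 tokens, d_k = 2.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

A = attention_scores(Q, K)               # no mask: all n x n entries are finite
weights = [softmax(row) for row in A]    # every token attends to every token
```

Because no entry is masked, every weight is strictly positive, including those for positions $j > i$.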

Decoder (GPT), causal mask:

$A_{ij} = \begin{cases} \frac{q_i^T k_j}{\sqrt{d_k}} & \text{if } j \le i \\ -\infty & \text{if } j > i \end{cases}$

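After the softmax, every position with $j > i$ receives weight 0, so token $i$ attends only to itself and earlier tokens. This preserves the causality that autoregressive generation requires.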
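The effect of the $-\infty$ entries can be checked with a small sketch, again in plain Python. The names `causal_scores` and `softmax` and the toy inputs are assumptions for illustration; production implementations typically add a precomputed mask tensor rather than looping.

```python
import math

NEG_INF = float("-inf")

def causal_scores(Q, K):
    """A[i][j] = q_i . k_j / sqrt(d_k) if j <= i, else -inf."""
    d_k = len(K[0])
    scale = math.sqrt(d_k)
    n = len(Q)
    A = []
    for i in range(n):
        row = []
        for j in range(n):
            if j <= i:
                row.append(sum(q * k for q, k in zip(Q[i], K[j])) / scale)
            else:
                row.append(NEG_INF)  # future position: masked out
        A.append(row)
    return A

def softmax(row):
    """Stable softmax; exp(-inf) evaluates to 0.0, so masked entries vanish."""
    m = max(row)  # finite, because the diagonal entry is never masked
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

# Toy input (made-up values): n = 3 tokens, d_k = 2.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

A = causal_scores(Q, K)
weights = [softmax(row) for row in A]  # lower-triangular weight matrix
```

The first row of `weights` puts all of its mass on position 0, since the first token has nothing earlier to attend to.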

Encoder-Decoder (T5), Cross-Attention:

$\text{CrossAttn}(Z, H) = \text{softmax}\!\left(\frac{(ZW_Q)(HW_K)^T}{\sqrt{d_k}}\right)(HW_V)$

where $Z \in \mathbb{R}^{m \times d}$ is the Decoder hidden states and $H \in \mathbb{R}^{n \times d}$ is the Encoder output. Q comes from the Decoder and K/V come from the Encoder, so the attention matrix has shape $m \times n$.
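To make the shapes concrete, here is a minimal single-head sketch of the CrossAttn formula in plain Python. It is not T5's implementation: the helpers `matmul`, `transpose`, `softmax_rows`, and `cross_attention` are hypothetical names, and the identity projection matrices are chosen only to keep the toy example readable.

```python
import math

def matmul(A, B):
    """Plain nested-list matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def softmax_rows(S):
    """Row-wise numerically stable softmax."""
    out = []
    for row in S:
        m = max(row)
        exps = [math.exp(s - m) for s in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def cross_attention(Z, H, W_Q, W_K, W_V):
    """softmax((Z W_Q)(H W_K)^T / sqrt(d_k)) (H W_V) for a single head."""
    Q = matmul(Z, W_Q)  # m x d_k, queries from the Decoder
    K = matmul(H, W_K)  # n x d_k, keys from the Encoder
    V = matmul(H, W_V)  # n x d_k, values from the Encoder
    d_k = len(W_K[0])
    scores = [[s / math.sqrt(d_k) for s in row]
              for row in matmul(Q, transpose(K))]
    P = softmax_rows(scores)  # m x n attention matrix
    return P, matmul(P, V)

# Toy input (made-up values): m = 2 Decoder positions, n = 3 Encoder positions, d = 2.
Z = [[1.0, 0.0], [0.0, 1.0]]
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
I2 = [[1.0, 0.0], [0.0, 1.0]]  # identity projections, for illustration only

P, out = cross_attention(Z, H, I2, I2, I2)
```

Here `P` has 2 rows and 3 columns (one row per Decoder position, one column per Encoder position), and `out` has one $d_k$-dimensional vector per Decoder position.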

RoPE positional encoding (Su et al., 2021):

$f(x, m) = \begin{pmatrix} x_1 \cos m\theta_1 - x_2 \sin m\theta_1 \\ x_1 \sin m\theta_1 + x_2 \cos m\theta_1 \\ \vdots \end{pmatrix}$

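where $\theta_i = 10000^{-2(i-1)/d}$ for $i = 1, \ldots, d/2$, and each consecutive pair of coordinates $(x_{2i-1}, x_{2i})$ is rotated by the angle $m\theta_i$. Advantages: the score $q_m^T k_n$ between a rotated query at position $m$ and a rotated key at position $n$ depends on the positions only through $m-n$ (translation invariance), and the encoding adds no learned parameters.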
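The relative-position property can be checked numerically. The sketch below rotates consecutive coordinate pairs as in $f(x, m)$, using a 0-based pair index `p` with frequency `base ** (-2 * p / d)`. The function names `rope` and `dot` and the vectors `q` and `k` are assumptions for illustration; real implementations differ in how they pair up the dimensions.

```python
import math

def rope(x, pos, base=10000.0):
    """Rotate each pair (x[2p], x[2p+1]) by the angle pos * theta_p."""
    d = len(x)
    assert d % 2 == 0, "RoPE needs an even dimension"
    out = []
    for p in range(d // 2):
        theta = base ** (-2.0 * p / d)  # 0-based pair index p
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        x1, x2 = x[2 * p], x[2 * p + 1]
        out.extend([x1 * c - x2 * s, x1 * s + x2 * c])
    return out

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

# Made-up query and key vectors with d = 4.
q = [0.3, -1.2, 0.8, 0.5]
k = [1.1, 0.4, -0.7, 0.9]

s_near = dot(rope(q, 5), rope(k, 2))       # positions (5, 2), offset 3
s_far = dot(rope(q, 105), rope(k, 102))    # positions (105, 102), offset 3
s_other = dot(rope(q, 5), rope(k, 3))      # positions (5, 3), offset 2
```

`s_near` and `s_far` agree up to floating-point error because both pairs of positions have the same offset, while `s_other` differs because its offset is different. Each rotation also preserves the vector norm, since it is an orthogonal transform applied pair by pair.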