Cross-Attention

Category: Deep Learning Fundamentals

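Definition

Cross-Attention is an attention mechanism in which the Query comes from one sequence (usually the Decoder) while the Key and Value come from another sequence (usually the Encoder). It is a core component of the Encoder-Decoder Transformer architecture, letting the Decoder "look at" the encoded representation of the entire input sequence at every generation step.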

Mathematical Form

$$\text{CrossAttn}(Z, H) = \text{Softmax}\left(\frac{(Z W^Q)(H W^K)^\top}{\sqrt{d_k}}\right) (H W^V)$$

$Z$ : Decoder hidden states (provide the Query)

$H$ : Encoder output (provides the Key and Value)

$W^Q, W^K, W^V$ : learnable projection matrices
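The formula above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not a reference implementation: the function names `softmax` and `cross_attention`, the toy dimensions, and the random weights are all assumptions made for the example, and batching, masking, and multiple heads are left out.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(Z, H, W_q, W_k, W_v):
    """Single-head cross-attention (hypothetical helper for illustration).

    Z: (n_dec, d_model) Decoder hidden states, source of the Query.
    H: (n_enc, d_model) Encoder output, source of the Key and Value.
    """
    Q = Z @ W_q                       # (n_dec, d_k)
    K = H @ W_k                       # (n_enc, d_k)
    V = H @ W_v                       # (n_enc, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (n_dec, n_enc)
    weights = softmax(scores, axis=-1)
    return weights @ V, weights

# Toy sizes chosen only for the example.
rng = np.random.default_rng(0)
n_dec, n_enc, d_model, d_k = 3, 5, 8, 4
Z = rng.normal(size=(n_dec, d_model))
H = rng.normal(size=(n_enc, d_model))
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

out, weights = cross_attention(Z, H, W_q, W_k, W_v)
print(out.shape, weights.shape)  # (3, 4) (3, 5)
```

Each row of `weights` is a distribution over Encoder positions for one Decoder position, so the output has one row per Decoder position regardless of the Encoder length.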

Key Points

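Difference from self-attention: in self-attention, Q, K, and V all come from the same sequence; in Cross-Attention, Q comes from the target side and K/V come from the source side, which lets information flow between the two sequences

Encoder-Decoder bridge: in seq2seq tasks such as machine translation and text summarization, Cross-Attention is the only channel connecting the encoder and the decoder in a Transformer, taking over the role that Bahdanau Attention played in earlier RNN-based seq2seq models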

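Multimodal fusion: in vision-language models such as Flamingo, Cross-Attention lets the language model attend to visual features, enabling cross-modal alignment. LLaVA, by contrast, projects visual features into the language model's input token sequence and relies on self-attention rather than Cross-Attention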

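Not used in Decoder-only models: pure Decoder models such as the GPT series have no Cross-Attention; all context is concatenated into the prompt and handled by causal self-attention

Computational complexity: $O(N_{\text{dec}} \times N_{\text{enc}} \times d)$ for the attention scores and the weighted sum, where $N_{\text{dec}}$ and $N_{\text{enc}}$ are the sequence lengths of the Decoder and Encoder and $d$ is the hidden dimension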
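To make the complexity bound concrete, the sketch below counts multiply-accumulate operations for the two matrix products that dominate Cross-Attention. The helper `cross_attention_macs` is hypothetical and written only for this example; it assumes a single head and ignores the Q/K/V projections and the softmax itself.

```python
def cross_attention_macs(n_dec: int, n_enc: int, d: int) -> dict:
    """Rough multiply-accumulate count (illustrative assumption, single head)."""
    scores = n_dec * n_enc * d        # Q K^T: one d-length dot product per (dec, enc) pair
    weighted_sum = n_dec * n_enc * d  # weights @ V: n_enc terms for each of d outputs per dec position
    return {
        "scores": scores,
        "weighted_sum": weighted_sum,
        "total": scores + weighted_sum,
    }

costs = cross_attention_macs(n_dec=20, n_enc=100, d=64)
print(costs["total"])  # 256000

# Doubling the Encoder length doubles the cost; the Decoder length is unchanged.
doubled = cross_attention_macs(n_dec=20, n_enc=200, d=64)
print(doubled["total"] // costs["total"])  # 2
```

The cost grows linearly in each sequence length separately, in contrast to self-attention over one sequence of length $N$, where the same count scales with $N^2$.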

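Representative Work

Vaswani et al. (2017): Attention Is All You Need (the original Transformer Encoder-Decoder)

Raffel et al. (2020): T5 (a unified text-to-text framework; an Encoder-Decoder model with Cross-Attention in every Decoder layer)

Alayrac et al. (2022): Flamingo (cross-modal Cross-Attention)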
