DeepViT

Category: Network Architecture

Definition

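DeepViT studies the attention collapse problem that appears as ViT grows deeper (the attention maps of deep layers become nearly identical) and proposes the Re-attention mechanism to address it.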

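Finding: once ViT exceeds a certain depth, performance stops improving and instead degrades, because the attention maps in deep layers converge toward one another (attention collapse).

Proposal: Re-attention applies a learnable linear combination across heads to the attention matrices, which increases the diversity of attention in deep layers.

Re-attention adds very little computational overhead and can serve as a drop-in replacement for standard self-attention.

It allows ViT to be trained effectively at depths of up to 32 layers.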
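One way to make attention collapse concrete is to measure how similar the attention maps of two layers are. The sketch below is a minimal illustration, not the paper's evaluation code: the function name `layer_similarity`, the row-wise cosine similarity, and the averaging over heads and tokens are assumptions made for this example, and the attention maps are random stand-ins rather than outputs of a trained ViT.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def layer_similarity(attn_p, attn_q, eps=1e-12):
    """Mean cosine similarity between the attention maps of two layers.

    attn_p, attn_q: arrays of shape (heads, tokens, tokens).
    Each attention row is compared with the row for the same head and
    token in the other layer, then the result is averaged (an assumption
    of this sketch).
    """
    dot = np.sum(attn_p * attn_q, axis=-1)
    norm = np.linalg.norm(attn_p, axis=-1) * np.linalg.norm(attn_q, axis=-1)
    return float(np.mean(dot / (norm + eps)))

# Random stand-ins for the attention maps of two layers.
rng = np.random.default_rng(0)
heads, tokens = 4, 8
layer_a = softmax(rng.normal(size=(heads, tokens, tokens)))
layer_b = softmax(rng.normal(size=(heads, tokens, tokens)))

sim_same = layer_similarity(layer_a, layer_a)  # identical maps: fully collapsed
sim_diff = layer_similarity(layer_a, layer_b)  # unrelated maps: lower similarity
```

A value close to 1 between neighbouring deep layers would indicate that those layers compute almost the same attention map, which is the symptom described above.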

$\text{Re-Attention}(Q, K, V) = \Theta(\text{Softmax}(\frac{QK^T}{\sqrt{d}}))V$

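where $\Theta \in \mathbb{R}^{H \times H}$ is a learnable transformation matrix that mixes the attention maps across the $H$ heads.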
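The formula can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `re_attention`, the tensor layout `(heads, tokens, dim)`, and the index convention for `theta` are choices made here, `theta` is a fixed array rather than a trained parameter, and the sketch covers only the terms in the formula above (any normalization applied after mixing in a real implementation is left out).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def re_attention(q, k, v, theta):
    """Re-attention for a single sequence.

    q, k, v: arrays of shape (heads, tokens, dim).
    theta:   array of shape (heads, heads); row g holds the weights used
             to build output head g from all input heads (assumed layout).
    """
    d = q.shape[-1]
    # Standard per-head attention maps, shape (heads, tokens, tokens).
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d))
    # Mix the maps across the head dimension with theta.
    mixed = np.einsum("gh,htu->gtu", theta, attn)
    return mixed @ v

rng = np.random.default_rng(0)
heads, tokens, dim = 4, 6, 8
q = rng.normal(size=(heads, tokens, dim))
k = rng.normal(size=(heads, tokens, dim))
v = rng.normal(size=(heads, tokens, dim))

theta = rng.normal(size=(heads, heads))
out = re_attention(q, k, v, theta)

# With theta set to the identity, the result is ordinary multi-head attention.
baseline = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(dim)) @ v
out_identity = re_attention(q, k, v, np.eye(heads))
```

In this sketch the only extra parameters are the $H \times H$ entries of `theta` per layer (16 for the 4 heads used here), which is consistent with the small overhead noted above.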

DeepViT (Zhou et al., 2021): original paper