CS224N / Study Notes

SwinV2

Category: Network Architecture

Definition

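Swin Transformer V2 is a hierarchical vision Transformer in the ViT family. It combines shifted window attention with an improved training recipe to scale to larger models (3B parameters) and higher input resolutions.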
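
To make the shifted-window idea concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the function name `window_ids` and its parameters are hypothetical, and it only shows which window each position lands in with and without a cyclic shift. It leaves out the attention mask that the real model applies to positions that wrap around the image border.

```python
def window_ids(height, width, window, shift=0):
    """Return a grid whose cells hold the index of the window they fall in.

    With shift > 0 the grid is cyclically shifted before partitioning, so the
    resulting windows straddle the borders of the unshifted windows.
    (Hypothetical helper for illustration only.)
    """
    if height % window or width % window:
        raise ValueError("height and width must be multiples of window")
    windows_per_row = width // window
    grid = []
    for y in range(height):
        row = []
        for x in range(width):
            # Position after the cyclic shift (wraps around the border).
            ys = (y - shift) % height
            xs = (x - shift) % width
            row.append((ys // window) * windows_per_row + xs // window)
        grid.append(row)
    return grid


regular = window_ids(4, 4, window=2)           # four aligned 2x2 windows
shifted = window_ids(4, 4, window=2, shift=1)  # windows offset by one pixel
```

In the 4×4 example, the four central positions belong to four different windows in `regular` but share a single window in `shifted`. Alternating the two layouts across layers is what lets information cross window borders.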

Key Points

Replaces dot-product attention with scaled cosine attention to improve training stability

Log-spaced continuous position bias supports transfer across resolutions and window sizes

Scales to 1536×1536 input resolution and 3 billion parameters

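Set state-of-the-art results at the time of publication on ImageNet-V2 classification, COCO object detection, ADE20K semantic segmentation, and Kinetics-400 video action classification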
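
The first two points can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the authors' code. The names `cosine_attention_weights` and `log_spaced` and the fixed `tau=0.1` are hypothetical; in the paper the temperature is a learnable scalar, and the log-spaced coordinates are fed to a small MLP that produces the bias, which is omitted here.

```python
import math


def cosine_similarity(q, k, eps=1e-12):
    """Cosine of the angle between vectors q and k."""
    dot = sum(a * b for a, b in zip(q, k))
    norm_q = math.sqrt(sum(a * a for a in q))
    norm_k = math.sqrt(sum(b * b for b in k))
    return dot / max(norm_q * norm_k, eps)


def cosine_attention_weights(queries, keys, tau=0.1, bias=None):
    """Row-wise softmax of cos(q_i, k_j) / tau + bias[i][j].

    tau is fixed here for simplicity (an assumption of this sketch).
    """
    weights = []
    for i, q in enumerate(queries):
        logits = [cosine_similarity(q, k) / tau for k in keys]
        if bias is not None:
            logits = [value + bias[i][j] for j, value in enumerate(logits)]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(value - peak) for value in logits]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


def log_spaced(delta):
    """Map a relative offset to sign(delta) * log(1 + |delta|)."""
    return math.copysign(math.log1p(abs(delta)), delta)


keys = [[10.0, 0.0], [0.0, 3.0]]
small = cosine_attention_weights([[1.0, 0.0]], keys)
large = cosine_attention_weights([[100.0, 0.0]], keys)  # same direction, 100x norm
```

Because the cosine ignores vector norms, `small` and `large` hold the same weights, which is the property that keeps attention logits bounded as activations grow in large models. For the position bias, going from an 8×8 to a 16×16 window widens the linear offset range from [-7, 7] to [-15, 15], while `log_spaced` widens it only from about ±2.08 to about ±2.77, so the bias network has to extrapolate much less.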

Representative Work

Liu et al., “Swin Transformer V2: Scaling Up Capacity and Resolution” (CVPR 2022)

Related Concepts

ViT: the baseline Vision Transformer

DeiT: another ViT variant, trained data-efficiently with distillation