Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model

Authors: Chunting Zhou, Lili Yu, Arun Babu, Kushal Tirumala, Michihiro Yasunaga, Leonid Shamis, Jacob Kahn, Xuezhe Ma, Luke Zettlemoyer, Omer Levy Year: 2024 Venue: arXiv Category: Vision tasks

Paper Notes: Transfusion

One-Sentence Summary

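Proposes Transfusion, which trains a single Transformer on text with next-token prediction and on images with a diffusion loss at the same time, avoiding the information loss caused by discretizing images.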

Core Contributions

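Mixed training objective: text uses the standard causal language modeling loss (cross-entropy) and images use a diffusion denoising loss (MSE); the two losses are optimized jointly in the same model.
Continuous image representation: images are encoded by a VAE into continuous latent patches rather than discrete tokens, which preserves more visual information and avoids the fidelity loss caused by VQ quantization.
U-Net-style attention: intra-image bidirectional attention is introduced for the diffusion part, so that patches of the same image can attend to each other in both directions (while text keeps a causal mask), accommodating the characteristics of both modalities.
Scaling advantage: at the 7B-parameter scale, Transfusion significantly outperforms a purely discrete-token approach (such as Chameleon) on both image generation (FID) and text understanding, and its scaling curves are better.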
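
The mixed objective and the mixed attention pattern listed above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the paper's implementation: the function names, the `is_image` / `image_id` inputs, the weighting factor `lam`, and the choice of noise prediction as the MSE target are all assumptions made for the example.

```python
import numpy as np

def mixed_attention_mask(is_image, image_id):
    """mask[i, j] is True if position i may attend to position j.

    Causal for every position, plus bidirectional attention between
    patches that belong to the same image (hypothetical inputs:
    is_image flags patch positions, image_id labels each image span).
    """
    is_image = np.asarray(is_image, dtype=bool)
    image_id = np.asarray(image_id)
    n = len(is_image)
    causal = np.tril(np.ones((n, n), dtype=bool))
    same_image = (
        is_image[:, None]
        & is_image[None, :]
        & (image_id[:, None] == image_id[None, :])
    )
    return causal | same_image

def text_cross_entropy(logits, targets):
    """Mean next-token cross-entropy over text positions."""
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

def diffusion_mse(pred_noise, true_noise):
    """Mean squared error between predicted and true noise (assumed target)."""
    return np.mean((pred_noise - true_noise) ** 2)

def combined_loss(logits, targets, pred_noise, true_noise, lam=1.0):
    """Joint objective: text loss plus lam times the image loss.

    lam is a hypothetical balancing weight introduced for this sketch.
    """
    return text_cross_entropy(logits, targets) + lam * diffusion_mse(
        pred_noise, true_noise
    )

# Example sequence: 2 text tokens, 3 patches of one image, 2 text tokens.
is_image = [0, 0, 1, 1, 1, 0, 0]
image_id = [-1, -1, 0, 0, 0, -1, -1]
mask = mixed_attention_mask(is_image, image_id)
```

In the example sequence, the first image patch (position 2) can attend to the last patch of the same image (position 4), while a text token before the image (position 1) still cannot see any patch, and text after the image sees the whole image through the ordinary causal mask.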

Related Concepts