The Off-policy Problem and RoPE Position Scaling

Category: Inference and Evaluation · Difficulty: Advanced · Related lecture: L13

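This article covers two technical topics that matter when training and deploying reasoning models: the distribution drift caused by off-policy training and how to address it, and long-context extension methods for RoPE positional encoding (Position Interpolation and YaRN).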

📐 Formalizing the Off-policy Problem

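Training distribution in off-policy RL: the training data is generated by an old policy $\pi_{old}$ or a reference model, but the policy being optimized and eventually deployed is $\pi_\theta$, which has already been updated and no longer matches $\pi_{old}$.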

Importance sampling correction (the IS ratio):

$\rho_t = \frac{\pi_\theta(y_t \mid y_{<t}, x)}{\pi_{old}(y_t \mid y_{<t}, x)}$

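If $\pi_\theta$ and $\pi_{old}$ differ substantially, the variance of the IS ratio explodes (a single token can have $\rho_t = 100$, for example) and gradient updates become unstable.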

PPO's clip is simply a truncation of the IS ratio:

$\rho_t^{\text{clip}} = \text{clip}(\rho_t,\ 1-\epsilon,\ 1+\epsilon)$

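This keeps any single update from moving the policy too far. In the full PPO objective, the clipped term $\rho_t^{\text{clip}} A_t$ is combined with the unclipped term $\rho_t A_t$ through a $\min$, where $A_t$ is the advantage estimate, so clipping only removes the incentive to push the ratio further outside $[1-\epsilon,\ 1+\epsilon]$.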
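As a minimal sketch of the ratio and clipping steps above, the NumPy snippet below computes $\rho_t$ from per-token log-probabilities and forms the PPO-style clipped surrogate. The function name `clipped_surrogate`, the choice $\epsilon = 0.2$, and the toy probabilities and advantages are all illustrative assumptions, not values from the lecture.

```python
import numpy as np

def clipped_surrogate(logp_new, logp_old, advantages, eps=0.2):
    """Per-token PPO-style clipped surrogate (a quantity to be maximized).

    Hypothetical helper for illustration; eps=0.2 is an assumed value.
    """
    ratio = np.exp(logp_new - logp_old)            # rho_t = pi_theta / pi_old
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)  # rho_t^clip
    # Pessimistic choice: the smaller of the unclipped and clipped terms.
    objective = np.minimum(ratio * advantages, clipped * advantages)
    return ratio, objective

# Toy per-token probabilities under the old and the current policy.
logp_old = np.log(np.array([0.50, 0.01, 0.30]))
logp_new = np.log(np.array([0.55, 0.60, 0.10]))
advantages = np.array([1.0, 1.0, -1.0])

ratio, objective = clipped_surrogate(logp_new, logp_old, advantages)
print("IS ratios:", ratio)
print("clipped surrogate:", objective)
```

The second token has a ratio of 60 because the old policy assigned it very low probability; without clipping, that single token would dominate the update.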

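On-policy Distillation: the current policy $\pi_\theta$ generates the training data itself (online sampling) and is then trained toward the target distribution on those samples. This removes off-policy drift, but the data must be regenerated at every step, which is computationally expensive.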

📐 Deriving RoPE Position Scaling

RoPE basics: for position $pos$, dimensions $2i$ and $2i+1$ of the vector are rotated as a pair:

$R(pos) = \begin{bmatrix} \cos(pos \cdot \theta_i) & -\sin(pos \cdot \theta_i) \\ \sin(pos \cdot \theta_i) & \cos(pos \cdot \theta_i) \end{bmatrix}, \quad \theta_i = \frac{1}{10000^{2i/d}}$
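The rotation above can be sketched in a few lines of NumPy. This is an illustrative implementation that assumes the interleaved pairing (dimensions $2i$ and $2i+1$) used in the formula; the helper names `rope_angles` and `apply_rope` are hypothetical, and real model code often uses a different but equivalent memory layout.

```python
import numpy as np

def rope_angles(pos, d, base=10000.0):
    """Rotation angles pos * theta_i for i = 0 .. d/2 - 1."""
    i = np.arange(d // 2)
    theta = base ** (-2.0 * i / d)   # theta_i = 1 / base^(2i/d)
    return pos * theta

def apply_rope(x, pos, base=10000.0):
    """Rotate each (2i, 2i+1) pair of a 1-D vector x by pos * theta_i."""
    ang = rope_angles(pos, x.shape[-1], base)
    cos, sin = np.cos(ang), np.sin(ang)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin
    out[1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)

# Same relative offset (3) at two different absolute positions.
score_a = apply_rope(q, 5) @ apply_rope(k, 2)
score_b = apply_rope(q, 105) @ apply_rope(k, 102)
print(score_a, score_b)
```

Because each pair is rotated by an angle linear in $pos$, the query-key dot product depends only on the position difference, which is why the two scores agree up to floating-point error.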

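The problem: the model is trained on positions $[0, L_{\text{train}})$. At inference, tokens at positions $\geq L_{\text{train}}$ produce rotation angles $pos \cdot \theta_i$ outside the range seen during training. This matters most for the low-frequency dimensions, which never complete a full period within $L_{\text{train}}$, so the resulting $\cos$/$\sin$ combinations are out of distribution and performance degrades.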

Position Interpolation (PI) (Chen et al., 2023) compresses positions linearly:

$pos' = pos \times \frac{L_{\text{train}}}{L_{\text{test}}}$

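This scales the maximum position from $L_{\text{test}}$ back down to $L_{\text{train}}$, so no position falls outside the training distribution. The cost: the gap between adjacent positions shrinks by the same factor, which reduces positional resolution, so a small amount of fine-tuning is needed.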
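A short sketch of the PI rescaling, showing both the benefit and the cost described above. The context lengths 2048 and 8192 and the helper name `pi_positions` are assumptions chosen for illustration.

```python
import numpy as np

def pi_positions(positions, l_train, l_test):
    """Linearly compress positions: pos' = pos * l_train / l_test."""
    scale = l_train / l_test
    return np.asarray(positions, dtype=np.float64) * scale

l_train, l_test = 2048, 8192          # assumed example lengths
positions = np.arange(l_test)
scaled = pi_positions(positions, l_train, l_test)

print("largest scaled position:", scaled.max())     # stays below l_train
print("gap between neighbours:", scaled[1] - scaled[0])
```

Every scaled position stays inside the trained range, but neighbouring tokens are now only a quarter of a position apart, which is the loss of resolution that fine-tuning has to compensate for.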

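YaRN (Yet another RoPE extensioN) treats frequency dimensions differently: low-frequency dimensions are interpolated, while high-frequency dimensions are left to extrapolate, which preserves fine-grained local position information. It performs better than plain PI and needs less fine-tuning.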
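The per-dimension treatment can be sketched as a blend between the interpolated frequency $\theta_i / s$ and the original $\theta_i$, with $s = L_{\text{test}} / L_{\text{train}}$. The snippet below is a simplified illustration in the spirit of YaRN, not a full implementation: the ramp thresholds `alpha=1` and `beta=32`, the function name `blended_frequencies`, and the example sizes are assumptions, and the attention-temperature adjustment that YaRN also applies is omitted.

```python
import numpy as np

def blended_frequencies(d, l_train, scale, alpha=1.0, beta=32.0, base=10000.0):
    """Blend interpolated and original RoPE frequencies per dimension.

    alpha and beta are assumed ramp thresholds, measured in full rotations
    completed over the training context.
    """
    i = np.arange(d // 2)
    theta = base ** (-2.0 * i / d)
    # Full rotations each dimension completes within l_train positions.
    rotations = l_train * theta / (2.0 * np.pi)
    # gamma = 0: fully interpolate (low frequency);
    # gamma = 1: leave unchanged, i.e. extrapolate (high frequency).
    gamma = np.clip((rotations - alpha) / (beta - alpha), 0.0, 1.0)
    theta_new = (1.0 - gamma) * theta / scale + gamma * theta
    return theta, theta_new

theta, theta_new = blended_frequencies(d=128, l_train=4096, scale=4.0)
print("highest-frequency dim ratio:", theta_new[0] / theta[0])
print("lowest-frequency dim ratio:", theta_new[-1] / theta[-1])
```

Dimensions that rotate many times within the training context keep their original frequency, dimensions that rotate less than once are compressed exactly as in PI, and those in between are interpolated partially.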