#PPO — 5 entries

Papers (2)
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model
- Training language models to follow instructions with human feedback

Further reading (3)
- DeepSeek-R1 training pipeline and a comparison of RL methods
- Full mathematical derivation of RLHF
- The mathematical mechanism behind DAPO's asymmetric clipping ratio