# RLHF
8 entries in total
Lectures (1)
Papers (5)
- Let's Verify Step by Step
- AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback
- DAPO: An Open-Source LLM Reinforcement Learning System at Scale
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model
- Training language models to follow instructions with human feedback