CS224N / Study Notes
Lectures 22
  • Stanford CS224N: NLP with Deep Learning (Winter 2026)
  • CS224N 学习计划
  • L01: Introduction and the History of NLP
  • L02: Word Vectors
  • L03: Backpropagation and Neural Networks
  • L04: Language Models and Recurrent Neural Networks
  • L05: Attention and Transformers
  • L06: Final Projects & Practical Tips
  • L07: Pretraining
  • L08: Post-training
  • L09: Efficient Adaptation (PEFT)
  • L10: RAG and Language Agents
  • L11: Evaluation
  • L12: Reasoning 1/2
  • L13: Reasoning 2/2
  • L14: Tokenization and Multilinguality
  • L15: Interpretability (Guest: Been Kim)
  • L16: AI's Impact on Humanity
  • L17: Multimodality (Guest: Luke Zettlemoyer)
  • L18: LoRA Without Regret (Guest: John Schulman)
  • L19: Open Questions in NLP
  • CS224N Final Project
Assignments 4
  • A1: Introduction to Word Vectors
  • A2: Neural Networks & Dependency Parsing
  • A3: Self-Attention & Transformers
  • A4: LLM Evaluation & Red-Teaming
Concepts 494
NAS & Automated Design 4
  • Neural Architecture Search
  • DARTS
  • Once-for-All
  • slimmable
NLP Fundamentals 19
  • BLEU
  • BPE
  • Co-occurrence Matrix
  • Dependency Parsing
  • Distributional Semantics
  • GloVe
  • GPT
  • Language Model
  • Machine Translation
  • N-gram
  • Negative Sampling
  • NER
  • ROUGE
  • SentencePiece
  • Sentiment Analysis
  • Tokenization
  • Transition-Based Parsing
  • Word Embedding
  • Word2Vec
Pruning & Sparsification 52
  • Magnitude Pruning
  • Pruning Compression Ratio
  • Structured Pruning
  • Learnable Gating
  • Network Pruning
  • ART
  • Block Influence
  • CDPruner
  • DeepHoyer
  • depth pruning
  • DivPrune
  • DynamicViT
  • EfficientVLA
  • EViT
  • FastV
  • GOHSP
  • GraSP
  • IMP
  • LAMP
  • LayerDrop
  • ...32 more items
Foundational Theory 60
  • Ergodic Dynamical Systems
  • Residual Energy
  • Measure Theory
  • Normalized Second Moment
  • Kernel Methods
  • Kernel Functions
  • Computational Complexity
  • Modular Arithmetic
  • Spherical Codes
  • Total Variation
  • Temporal Difference
  • Greedy Algorithms
  • Sampling Without Replacement
  • Error-Performance Trade-off
  • Ablation
  • Information Retrieval
  • Compression Order Theorem
  • Oscillation Modes
  • Integer Linear Programming
  • Orthogonal Projection
  • ...40 more items
Uncategorized 7
  • A800
  • AI Safety
  • Jailbreak
  • NVIDIA H20
  • Red-Teaming
  • Sycophancy
  • Value Alignment
Datasets & Evaluation 35
  • BBH
  • Benchmarking
  • BoolQ
  • CIFAR-10
  • CIFAR-100
  • COCO
  • FineWeb-Edu
  • FLOPs
  • GenEval
  • GPQA
  • GSM8K
  • HellaSwag
  • HotpotQA
  • HumanEval
  • ImageNet
  • IoU
  • LIBERO
  • LiveCodeBench
  • LLM Evaluation
  • LLM-as-Judge
  • ...15 more items
Model Growth 6
  • Training Dynamics
  • Catastrophic Forgetting
  • function-preserving
  • GradMax
  • Loss of Plasticity
  • progressive training
Deep Learning Fundamentals 57
  • Residual Connections
  • Majority Voting
  • Binary Search
  • Focal Frequency Loss
  • Temperature Scaling
  • Information Entropy
  • Morphological Operations
  • Cosine Similarity
  • Autoregressive Decoding
  • Adaptive Thresholding
  • Activation Function
  • AdaLN
  • Backpropagation
  • Batch Normalization
  • Binary Cross-Entropy
  • Computation Graph
  • Cross-Attention
  • Cross-Entropy Loss
  • Cross-Modal Attention
  • DeepNorm
  • ...37 more items
Knowledge Distillation 15
  • Feature Distillation
  • Knowledge Distillation
  • Self-Distillation
  • DeFeat
  • DistiLLM
  • FGFI
  • FitNet
  • FreeKD
  • GID
  • GKD
  • Hinton KD
  • KL Divergence
  • MasKD
  • MiniLLM
  • self-distillation
Network Architectures 88
  • Residual Connections
  • Recurrent Neural Networks
  • ALiBi
  • BeiT
  • BERT
  • BigBird
  • CaiT
  • classification head
  • CLIP
  • ConvGRU
  • ConvNext
  • CycleNet
  • Decision Transformer
  • DeepSeek
  • DeepViT
  • DeiT
  • DeltaNet
  • DenseFormer
  • DenseNet
  • DINOv2
  • ...68 more items
Vision Tasks 7
  • Class-Incremental Learning
  • Stereo Matching
  • Diffusion Models
  • EDSR
  • GFL
  • SwinIR
  • Vision-Language Models
Training & Optimization 39
  • AdamW
  • Alignment
  • BALD
  • BatchBALD
  • Constitutional AI
  • continual learning
  • Cosine Annealing
  • Cosine Decay
  • CosineAnnealingLR
  • Curriculum Learning
  • DDPG
  • DeepSpeed
  • DiLoCo
  • DPO
  • DreamBooth
  • EASY
  • EMA
  • EWC
  • FedAvg
  • Fine-tuning
  • ...19 more items
Quantization & Low-Rank 52
  • Scalar Quantization
  • Low-Rank Decomposition
  • Mixed Precision
  • Quantization Distribution
  • Vector Quantization
  • AdaLoRA
  • Adapter
  • AQLM
  • ASVD
  • AutoBit
  • AWQ
  • BitNet
  • BitNet b1.58
  • DFMC
  • DoRA
  • GPTQ
  • Hadamard rotation
  • HALO
  • HGQ
  • HQQ
  • ...32 more items
Efficient Inference & Deployment 53
  • Dynamic Routing
  • Acceptance Rate
  • ACT
  • adaptive computation
  • ADEPT
  • AnchorAttention
  • CALM
  • collaborative inference
  • Cumulative Acceptance Rate
  • DeeBERT
  • DSA
  • dynamic depth
  • EAGLE
  • early exit
  • edge AI
  • EE-LLM
  • Element-wise LUT
  • ELUT
  • FAST
  • Fast Graph Decoder
  • ...33 more items
Papers 73
NLP Fundamentals 4
  • Distributed Representations of Words and Phrases and their Compositionality
  • Efficient Estimation of Word Representations in Vector Space
  • GloVe: Global Vectors for Word Representation
  • Improving Distributional Similarity with Lessons Learned from Word Embeddings
_To Organize 1
  • A Novel Pulse-Agile Waveform Design Based on Random FM Waveforms for Range Sidelobe Suppression and Range Ambiguity Mitigation
Pruning & Sparsification 11
  • Adaptive MLP Pruning for Large Vision Transformers
  • Alternating Gradient Flow Utility: A Unified Metric for Structural Pruning and Dynamic Routing in Deep Networks
  • Bielik-Minitron-7B: Compressing Large Language Models via Structured Pruning and Knowledge Distillation for the Polish Language
  • Deterministic Differentiable Structured Pruning for Large Language Models
  • Diet Your LLM: Dimension-wise Global Pruning of LLMs via Merging Task-specific Importance Score
  • HiAP
  • IWP: Token Pruning as Implicit Weight Pruning in Large Vision Language Models
  • Rényi Entropy: A New Token Pruning Metric for Vision Transformers
  • ResPrune: Text-Conditioned Subspace Reconstruction for Visual Token Pruning in Large Vision-Language Models
  • The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks
  • VLA-IAP: Training-Free Visual Token Pruning via Interaction Alignment for Vision-Language-Action Models
Foundational Theory 8
  • Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
  • Demystifying When Pruning Works via Representation Hierarchies
  • Language Models are Few-Shot Learners
  • Learning Representations by Backpropagating Errors
  • Let's Verify Step by Step
  • On the difficulty of training Recurrent Neural Networks
  • Scaling LLM Test-Time Compute Optimally Can be More Effective than Scaling Model Parameters
  • Self-Consistency Improves Chain of Thought Reasoning in Language Models
Datasets & Evaluation 4
  • AlpacaEval: An Automatic Evaluator for Instruction-Following Language Models
  • Challenges and Opportunities in NLP Benchmarking
  • Holistic Evaluation of Language Models
  • Measuring Massive Multitask Language Understanding
Model Growth 4
  • Anatomical Heterogeneity in Transformer Language Models
  • Grow, Assess, Compress: Adaptive Backbone Scaling for Memory-Efficient Class Incremental Learning
  • Grow, Don't Overwrite: Fine-tuning Without Forgetting
  • Growing Networks with Autonomous Pruning
Network Architectures 10
  • Attention Is All You Need
  • Attention Residuals
  • BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
  • Contextual Word Representations: A Contextual Introduction
  • Image Transformer
  • Language Models are Unsupervised Multitask Learners
  • Layer Normalization
  • The Illustrated BERT, ELMo, and co.
  • The Illustrated Transformer
  • The Llama 3 Herd of Models
Vision Tasks 3
  • Chameleon: Mixed-Modal Early-Fusion Foundation Models
  • Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models
  • Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model
Training & Optimization 7
  • AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback
  • DAPO: An Open-Source LLM Reinforcement Learning System at Scale
  • DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
  • Direct Preference Optimization: Your Language Model is Secretly a Reward Model
  • How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources
  • Scaling Instruction-Finetuned Language Models
  • Training language models to follow instructions with human feedback
Quantization & Low-Rank 10
  • Beyond Outliers: A Data-Free Layer-wise Mixed-Precision Quantization Approach
  • Big2Small: A Unifying Neural Network Framework for Model Compression
  • BinaryAttention
  • Bitnet.cpp: Efficient Edge Inference for Ternary LLMs
  • LLVQ
  • LoRA: Low-Rank Adaptation of Large Language Models
  • Parameter-Efficient Transfer Learning for NLP
  • Prune-then-Quantize or Quantize-then-Prune? Understanding the Impact of Compression Order in Joint Model Compression
  • RAMP: Reinforcement Adaptive Mixed-Precision Quantization for Efficient On-Device LLM Inference
  • SERQ: Saliency-Aware Low-Rank Error Reconstruction for LLM Quantization
Efficient Inference & Deployment 11
  • Fast Inference from Transformers via Speculative Decoding
  • Fast-FoundationStereo: Real-Time Zero-Shot Stereo Matching
  • FlashHead: Efficient Drop-In Replacement for the Classification Head in Language Model Inference
  • Language Agents: Foundations, Prospects, and Risks
  • MSA: Memory Sparse Attention for Efficient End-to-End Memory Model Scaling to 100M Tokens
  • PagedAttention
  • ReAct: Synergizing Reasoning and Acting in Language Models
  • Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
  • Self-Distillation for Multi-Token Prediction
  • TIDE: Token-Informed Depth Execution for Per-Token Early Exit in LLM Inference
  • Toolformer: Language Models Can Teach Themselves to Use Tools
Further Reading 48
Reasoning & Evaluation 17
  • A Probabilistic View of Chain-of-Thought
  • DeepSeek-R1 Training Pipeline and RL Method Comparison
  • Scaling Laws and Chinchilla Optimality
  • Reward Design for Agent Evaluation
  • Formal Derivations for RAG and Agent Systems
  • Formalizing Goodhart's Law
  • NLP Evaluation Metrics and Protocols
  • Speculative Decoding: Algorithm and Speedup Analysis
  • Off-Policy Issues and RoPE Position Scaling
  • Best-of-N Sampling Analysis
  • The Linear Representation Hypothesis and CAVs
  • Calibrated Models Must Hallucinate: An Information-Theoretic Proof
  • Formalizing Algorithmic Monoculture
  • The Mathematics of DAPO's Asymmetric Clipping Ratios
  • G-Vendi Score: Measuring Data Diversity via Gradients
  • Deriving the RLP Information Gain Reward
  • The GRPO Objective and Its Relation to Pass@K
Probabilistic Models 3
  • Complete HMM Derivation: Forward Algorithm, Viterbi Decoding, and Baum-Welch EM
  • PCFGs and the CYK Algorithm: Probabilistic Syntactic Parsing
  • N-gram Language Models and Smoothing Techniques
Attention & Transformers 4
  • Complete Derivation of Self-Attention
  • Multi-Head Attention and Core Transformer Components
  • Transformer Computational Complexity Analysis
  • Comparing Attention Matrices Across Three Transformer Architectures
Neural Network Fundamentals 6
  • Complete Derivation of Neural Network Forward Propagation and Backpropagation
  • Matrix Calculus: Jacobians and the Chain Rule
  • Complete Derivation of Activation Function Derivatives
  • Complete LSTM Derivation and Vanishing Gradient Analysis
  • Probabilistic Foundations of Language Models
  • Gradient Products and Conditional Language Model Derivations
Word Vectors & Representation Learning 4
  • Word2Vec Skip-gram Objective and Gradient Derivation
  • Negative Sampling: Theory and Derivation
  • GloVe Objective Function Derivation
  • Word Analogy Formula and Window Classification Derivation
Pretraining & Fine-tuning 14
  • Complete Steps of the BPE Algorithm
  • Probing Tasks
  • Pretraining Objectives and Architecture Comparison
  • Complete Mathematical Derivation of RLHF
  • Complete Derivations of DPO and GRPO
  • SimPO vs. DPO: Comparative Derivation
  • A General Framework for Structured Pruning
  • Derivations of LoRA, Adapters, and Prompt Tuning
  • Byte-Level Model Analysis
  • Complete BPE Algorithm Pseudocode
  • Tokenization Theory and Multilingual Analysis
  • The Two-Stage Constitutional AI Algorithm
  • Transfusion's Mixed Loss Function
  • The Complete Mathematical Structure of LoRA (Guest Lecture)

#word-embeddings

5 entries total

Lectures (1)

  • L02: Word Vectors

Papers (4)

  • GloVe: Global Vectors for Word Representation
  • Improving Distributional Similarity with Lessons Learned from Word Embeddings
  • Distributed Representations of Words and Phrases and their Compositionality
  • Efficient Estimation of Word Representations in Vector Space

CS224N: NLP with Deep Learning · Stanford Winter 2026 · Personal Study Notes