June 13, 2026 · 4 min read

Solving OPSD

Solving OPSD (basically)

Author: ar0cket1 (@ar0cket1)
Date: 2026-06-13
Source: X Article | Tweet
Related: Previous post: On Policy Self Distillation | RLRT paper | OPSD arxiv
Tags: #RL #Distillation #OPSD #SelfHintedTeacher #GRPO #Training

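Overview

Over the course of one week, ar0cket1 (an independent ML researcher) produced a systematic analysis of OPSD (On Policy Self Distillation) and claims to have "basically solved" it. The core argument: a hint-conditioned teacher is inherently in a state of "regret", so any exploration by the student provokes the teacher, which leads to response length collapse and failed exploration. The author finds that "positive pressure" (the teacher agreeing with the student's token more than the student itself does) fixes almost all stability problems when used alone, while "negative pressure" (the teacher disagreeing with the student) is harmful noise. The proposal is to keep only positive pressure and mix it with GRPO, with a projected order-of-magnitude (OOM) speedup of RL.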

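Key terms

OPSD student: Olmo 3-7B-think SFT (no hint)
OPSD teacher: Olmo 3-7B-think SFT + hint conditioning
OPD student: Olmo 3-7B-think SFT
OPD teacher: Olmo 3-7B-think final RL checkpoint (reference upper bound)
Dataset: the math data from Nemotron Math v2
positive pressure: the teacher assigns a token higher probability than the student does (reinforcing direction)
negative pressure: the teacher assigns a token lower probability than the student does (negative-reinforcement direction)
KL: token-level only (full-vocab KL is too noisy to be usable)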
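
The two pressure definitions can be made concrete with a minimal sketch. It assumes that per-token log-probabilities of the sampled tokens are already available from both models; the helper name `split_pressure` and the sample numbers are illustrative and do not come from the original post.

```python
from typing import List, Tuple


def split_pressure(
    teacher_logprobs: List[float],
    student_logprobs: List[float],
) -> Tuple[List[float], List[float]]:
    """Split the per-token teacher/student log-prob gap into two channels.

    Hypothetical helper: gap_t = log p_teacher(y_t) - log p_student(y_t),
    evaluated only on the sampled token y_t (token-level, not full-vocab).
    """
    if len(teacher_logprobs) != len(student_logprobs):
        raise ValueError("log-prob sequences must have equal length")
    positive, negative = [], []
    for t_lp, s_lp in zip(teacher_logprobs, student_logprobs):
        gap = t_lp - s_lp
        positive.append(max(gap, 0.0))  # teacher rates the token higher than the student
        negative.append(min(gap, 0.0))  # teacher rates the token lower than the student
    return positive, negative


# Made-up log-probs for a 4-token response.
teacher = [-0.2, -1.5, -0.1, -3.0]
student = [-0.9, -0.5, -0.1, -1.0]
pos, neg = split_pressure(teacher, student)
print("positive pressure:", pos)
print("negative pressure:", neg)
```

Each token contributes to at most one channel, so the two lists can be weighted, dropped, or inspected independently.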

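Key arguments

1. Hint variations are highly controllable

Simple hint rewrites reliably control the magnitude of the hinted teacher's output while preserving its overall structure. This is learnable, and the rewriting can be done entirely offline before training (off policy).

OPSD's KL shock is not a real problem: a reasonable hint rewrite eliminates it. An example hint (from the author's hint generator):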

Give positive weight to reasoning that first extracts structural constraints from 2n-1=x², such as parity of x and the parametrization n=(x²+1)/2. It is also strong progress to set u=n² in the second equation and recognize y²-2u²=-1 as a negative Pell equation, while remembering that only Pell solutions with u a perfect square can correspond to integer n. ...

A rewrite like this, which does not reveal much of the answer, preserves most of the structure while steadily bringing the magnitude down.

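2. The regretful teacher is the root pathology of OPSD

The hinted teacher is regret-conditioned: it already knows the hint, so whenever the student explores or deviates from the hint, the teacher "panics". According to the literature, this collapses response length, which in turn kills exploration.

Through trace analysis, the author finds:

Negative pressure carries almost no real information: it consists entirely of disagreement and does not spill over into positive pressure
Yet once negative pressure is included in training it causes collapse (every paper in the literature includes it to some degree, and even the best papers weight negative more heavily than positive)
So negative pressure is simply cut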
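
Cutting negative pressure amounts to clamping the per-token signal at zero before it is used as a training weight. The sketch below is one possible reading, not the author's code: it treats the clamped gap as a fixed per-token weight (what would be a stop-gradient in a real framework) on the student's log-likelihood. The name `positive_only_loss` and the numbers are hypothetical.

```python
from typing import List, Tuple


def positive_only_loss(
    teacher_logprobs: List[float],
    student_logprobs: List[float],
) -> Tuple[float, List[float]]:
    """Weighted negative log-likelihood that keeps only positive pressure.

    Assumption: weight_t = max(log p_teacher(y_t) - log p_student(y_t), 0),
    held constant, so minimizing the loss raises the student's probability
    only on tokens the teacher rates higher than the student does.
    """
    if not student_logprobs or len(teacher_logprobs) != len(student_logprobs):
        raise ValueError("need two non-empty sequences of equal length")
    weights = [max(t - s, 0.0) for t, s in zip(teacher_logprobs, student_logprobs)]
    total = sum(w * s for w, s in zip(weights, student_logprobs))
    return -total / len(student_logprobs), weights


# Made-up log-probs for a 4-token response.
loss, weights = positive_only_loss(
    teacher_logprobs=[-0.2, -1.5, -0.1, -3.0],
    student_logprobs=[-0.9, -0.5, -0.1, -1.0],
)
print("weights:", weights)
print("loss:", loss)
```

Tokens where the teacher disagrees receive weight zero, so they contribute no gradient at all instead of a pushing-down signal.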

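3. Positive pressure solves almost everything

Split OPSD into a positive and a negative channel:

Negative OPSD differs greatly from OPD and puts a lot of mass on low-entropy positions (dangerous)
Positive OPSD matches OPD almost perfectly and is extremely safe

Manual annotation of traces shows that positive pressure appears almost exactly where supervision is wanted:

near promising repairs
representation shifts
constraint checks

This ICL effect is much stronger than the author, the literature, or most people expected.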

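4. Qualitative case evidence (math competition traces)

The "repairs" that positive pressure identifies automatically are almost all genuine algorithmic progress:

Geometric series construction: rewritten +0.843, geometric +1.396 (on the 3^44 / fourth-power problem)
Lower-bound invariant activation: use +0.776, modular +1.344, modulo +0.684
Functional equation normalization: f(x+f(x))=f(x)f(1) +0.776, f(x+zf(x))=f(x)f(z) +0.493
Anchoring a=f(1): NT +0.716, h20 +1.510
Lagrangian structure: Lag +0.882/+1.014, constraint back-substitution Sub +0.475
Geometric dimension reduction: 3D cone → axial cross-section, cross +0.692/+0.820, cone +1.200
Graph coloring structure: colorings of the 13 ranks +0.565/+0.879, cycle graph +0.352
Sophie Germain factorization: Wait +0.453, product +0.244/+0.772, squares +0.659
Column-product invariant: column +0.555, arrange +0.699

The only "slightly harmful" cases are some boilerplate tokens, and those are relatively minor and possibly harmless.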

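5. Self-reinforcing dynamic

The stronger the model, or the later the RL stage, the more positive supervision dominates. This is a self-reinforcing dynamic, which suggests the approach is likely to perform very well at scale.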

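6. Critique of the SOTA (Rebellious Self Teacher / RLRT)

What the current SOTA RLRT paper does:

It observes that negative pressure is anti-exploration → it reverses the supervision on disagreement, encouraging tokens that deviate from the hint, conditional on a correct answer

The author's view:

This "appears" to work, but it essentially sidesteps the regretful-teacher mechanism rather than fixing it
Its stability comes from the extra condition of a "correct final answer"
RLRT uses full-solution hints → it needs heavy regularization such as clipping, GRPO mixing, and decay (the author's softer hints address the KL instability directly)
Reversing disagreement mathematically pollutes the signal: useful exploration mixed with a lot of bad reasoning
The stronger the model, the more destructive the reversal becomes and the further it drifts from OPD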
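
The difference between the three token-weighting schemes discussed here can be shown side by side. This sketch is based solely on the description in these notes, not on the RLRT paper's actual objective. The function `token_weight`, the mode names, and the choice to give disagreement tokens zero weight in incorrect rollouts are all assumptions made for illustration.

```python
def token_weight(gap: float, mode: str, answer_correct: bool = False) -> float:
    """Per-token training weight under three hypothetical schemes.

    gap = log p_teacher(y_t) - log p_student(y_t) on the sampled token.
    """
    if mode == "opsd":
        # Plain OPSD: positive and negative pressure are both applied.
        return gap
    if mode == "positive_only":
        # Negative pressure is dropped entirely.
        return max(gap, 0.0)
    if mode == "reversed":
        # Disagreement is flipped into encouragement when the rollout's
        # final answer is correct (as described in these notes).
        if gap < 0.0 and answer_correct:
            return -gap
        # Assumption: otherwise behave like positive_only.
        return max(gap, 0.0)
    raise ValueError(f"unknown mode: {mode}")


# Made-up gaps for a 4-token rollout whose final answer is correct.
gaps = [0.7, -1.0, 0.0, -2.0]
opsd = [token_weight(g, "opsd") for g in gaps]
positive_only = [token_weight(g, "positive_only") for g in gaps]
reversed_correct = [token_weight(g, "reversed", answer_correct=True) for g in gaps]
print("opsd:         ", opsd)
print("positive_only:", positive_only)
print("reversed:     ", reversed_correct)
```

In the reversed scheme every disagreed token in a correct rollout is rewarded in proportion to the disagreement, regardless of whether the token was useful exploration or bad reasoning, which is the pollution the author objects to.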

Conclusion and next steps

"So you can consider it solved (going to run small scale training soon to confirm but I'm extremely confident)."

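The author's proposal: positive pressure only + optional GRPO mixing (about 50%)

Expected effect: an order-of-magnitude (OOM) RL speedup plus stable training
Applicable to continual learning, making it feasible at scale
Large labs are invited to validate with large-scale self-teacher training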
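
How positive pressure might be combined with GRPO can be sketched as follows. The notes only say "about 50%" and do not specify the mixing rule, so the linear interpolation, the function names, and the use of a population standard deviation are assumptions. The group-normalized advantage is the standard GRPO form.

```python
import statistics
from typing import List


def grpo_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-relative advantages: (r_i - mean) / (std + eps) within one prompt group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption
    return [(r - mean) / (std + eps) for r in rewards]


def mixed_token_signal(
    rollout_advantage: float,
    teacher_logprobs: List[float],
    student_logprobs: List[float],
    mix: float = 0.5,
) -> List[float]:
    """Blend a sequence-level GRPO advantage with token-level positive pressure.

    Hypothetical rule: signal_t = mix * A + (1 - mix) * max(gap_t, 0).
    """
    return [
        mix * rollout_advantage + (1.0 - mix) * max(t - s, 0.0)
        for t, s in zip(teacher_logprobs, student_logprobs)
    ]


# Four rollouts for one prompt, rewarded 1.0 when the final answer is correct.
advantages = grpo_advantages([1.0, 0.0, 0.0, 1.0])
signal = mixed_token_signal(
    advantages[0],
    teacher_logprobs=[-0.2, -1.5, -0.1, -3.0],
    student_logprobs=[-0.9, -0.5, -0.1, -1.0],
)
print("advantages:", advantages)
print("mixed signal for rollout 0:", signal)
```

With `mix=0.5` the outcome reward and the dense teacher signal contribute equally. Setting `mix=0.0` recovers positive pressure only.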

"training a model to train a model after training"

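Key insights

Hint rewriting is a free lunch: generated once offline, with controllable magnitude, and it eliminates the KL shock
Regret is the essence of the teacher problem: hint conditioning leaves the teacher permanently living in the counterfactual of "if only the student had listened"
Decoupling positive and negative is the core finding: all prior literature mixed the two, and the author shows that they are two entirely different signals
Negative pressure is noise plus disaster: mathematically it carries no new information, and in practice it collapses response length
Grazing on positive only is safe: it matches OPD almost perfectly and is self-reinforcing at scale
ICL is much stronger than expected: positive pressure lands almost perfectly "where supervision should be"
Do not trust the reverse trick: the SOTA negative-pressure reversal pollutes the signal, and the stronger the model, the worse it gets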