杨艺韬 · 2026-04-28 · 5,191 characters · about 10 minutes

Chapter 7 The Gate Transformed: sqrtsoftplus, noaux_tc, and routed_scaling

“Routing is the politics of MoE.” —— Jeff Dean

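The Gate decides which experts each token is sent to. When it chooses badly, the 384 experts of a MoE layer are like an orchestra that has gone deaf: what they play is noise.

7.1 Introduction: the Gate's engineering problem with 384 experts

One MoE layer in V4 has 384 routed experts plus 1 shared expert. Every token must activate 6 routed experts. Which 6? That is the core question the Gate has to answer.

Unpacked into a set of engineering constraints:

Correctness: the 6 chosen experts should be the 6 most relevant to the current token; otherwise the model loses expressive power
Balance: the usage frequency of all 384 experts must stay close; otherwise a handful of experts carry most of the compute while the rest sit idle
Stability: training must not trap rarely chosen experts in a "chosen less and less" loop (routing collapse)
Efficiency: the Gate's own compute must be small; otherwise scoring 384 experts eats back the FLOPs that MoE saved
Flexibility: the Gate must not depend on a particular batch size or hardware; the same weights have to work across deployment setups

V4's Gate has a concrete design for each constraint: sqrtsoftplus (correctness + stability), noaux_tc (balance without polluting the main loss), the bias term (dynamic balancing), and routed_scaling_factor (output magnitude calibration).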

flowchart TB
  Token["token hidden state"] --> Gate{Gate}
  Gate -->|sqrtsoftplus scoring| Scores["scores for 384 experts"]
  Scores -->|add bias| BiasedScores["384 biased scores"]
  BiasedScores -->|topk=6| Top6Idx["indices of the 6 selected experts"]
  Scores -.original scores.-> Weights["6 weights"]
  Top6Idx --> Out["the 6 experts actually activated"]
  Weights -->|scale by routed_scaling=2.5| FinalWeights["final routing weights"]

This chapter takes the Gate apart detail by detail.

7.2 The Gate class: a source-level overview

V4's Gate class (inference/model.py):

class Gate(nn.Module):
    """MoE gating: computes expert routing scores and selects top-k experts."""
    def __init__(self, layer_id: int, args: ModelArgs):
        super().__init__()
        self.dim = args.dim
        self.topk = args.n_activated_experts
        self.score_func = args.score_func
        self.route_scale = args.route_scale
        self.hash = layer_id < args.n_hash_layers
        self.weight = nn.Parameter(torch.empty(args.n_routed_experts, args.dim))
        if self.hash:
            self.tid2eid = nn.Parameter(
                torch.empty(args.vocab_size, args.n_activated_experts, dtype=torch.int32),
                requires_grad=False
            )
            self.bias = None
        else:
            self.bias = nn.Parameter(torch.empty(args.n_routed_experts, dtype=torch.float32))

    def forward(self, x: torch.Tensor, input_ids: Optional[torch.Tensor] = None) -> Tuple[torch.Tensor, torch.Tensor]:
        scores = linear(x.float(), self.weight.float())
        if self.score_func == "softmax":
            scores = scores.softmax(dim=-1)
        elif self.score_func == "sigmoid":
            scores = scores.sigmoid()
        else:
            scores = F.softplus(scores).sqrt()
        original_scores = scores

        if self.bias is not None:
            scores = scores + self.bias

        if self.hash:
            indices = self.tid2eid[input_ids]
        else:
            indices = scores.topk(self.topk, dim=-1)[1]

        weights = original_scores.gather(1, indices)
        if self.score_func != "softmax":
            weights /= weights.sum(dim=-1, keepdim=True)
        weights *= self.route_scale

        return weights, indices

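Break the class into 4 parts:

Static parameter: weight: [n_routed_experts, dim], the projection matrix used for routing scores
Routing strategy: the hash flag decides between hash routing and learned routing
Bias term: only layers with hash=False have one; it is what noaux_tc uses
Score function: switched by the score_func string (softmax / sigmoid / sqrtsoftplus)

The 6 steps of forward:

linear(x, weight) computes the raw score of every expert
Apply the activation chosen by score_func (V4 uses sqrtsoftplus)
Add the bias (non-hash layers only)
Select indices with topk (hash layers look them up in the tid2eid table)
gather the unbiased scores at those positions as the routing weights
Normalize (skipped for softmax), then multiply by route_scale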

7.2·Supplement: data flow of the 6 steps in Gate.forward

The 6 steps of Gate.forward from §7.2, drawn as a data-flow diagram:

flowchart TB
  X["x: [B*S, 7168] BF16"] --> Step1["1. linear(x.float, weight.float)<br/>→ scores: [B*S, 384] FP32"]
  Step1 --> Step2{"2. score_func"}
  Step2 -->|sqrtsoftplus| SqSp["scores = softplus(x).sqrt()"]
  Step2 -.softmax / sigmoid.-> SqSp
  SqSp --> Cache["cache original_scores = scores"]
  Cache --> Step3["3. + bias (non-hash layers only)"]
  Step3 --> Step4{"4. hash layer?"}
  Step4 -->|yes| Lookup["indices = tid2eid[input_ids]"]
  Step4 -->|no| Topk["indices = scores.topk(6)"]
  Lookup --> Step5["5. weights = original_scores.gather(indices)"]
  Topk --> Step5
  Step5 --> Step6["6. weights /= sum<br/>weights *= route_scale (2.5)"]
  Step6 --> Out["return (weights, indices)"]

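Note that the bias only takes part in top-k selection (after step 3 it only affects step 4), while weights are taken from original_scores (step 5 uses the values cached before step 3). This is the noaux_tc "two-track scoring" as it appears in code.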
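To make the two tracks concrete, here is a minimal sketch with toy numbers (1 token, 4 experts, top-2; all values are assumptions chosen for illustration). It shows the bias changing which experts are selected while the weights still come from the unbiased scores.

```python
import torch

# Toy values, not taken from any real checkpoint.
original_scores = torch.tensor([[0.9, 0.8, 0.3, 0.2]])
bias = torch.tensor([-0.5, 0.0, 0.4, 0.0])   # expert 0 "too busy", expert 2 "too idle"
topk = 2

def route(scores, bias, topk):
    biased = scores + bias                    # track 1: used for selection only
    idx = biased.topk(topk, dim=-1)[1]
    w = scores.gather(1, idx)                 # track 2: weights from unbiased scores
    return w / w.sum(dim=-1, keepdim=True), idx

w_nobias, idx_nobias = route(original_scores, torch.zeros(4), topk)
w_bias, idx_bias = route(original_scores, bias, topk)
print(idx_nobias.tolist(), idx_bias.tolist())
print(w_bias)
```

With the bias applied, experts 1 and 2 are selected instead of 0 and 1, yet the weights are 0.8 and 0.3 renormalized, not the biased values.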

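7.3 sqrtsoftplus: V4's new score function

V4's score function is F.softplus(scores).sqrt():

\text{sqrtsoftplus}(x) = \sqrt{\ln(1 + e^x)}

Why this odd-looking function? Compare it with softmax and sigmoid:

Function	Output range	Normalized to a probability	Numerical stability	Monotonic	Training gradient
softmax	[0, 1]	yes	underflows in the long tail	yes	good
sigmoid	[0, 1]	no	moderate	yes	good
sqrtsoftplus	[0, ∞)	no	very stable	yes	smoother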
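The rows of the table can be checked numerically. The sketch below evaluates sigmoid and sqrtsoftplus on a few hand-picked logits (the logit values are assumptions for illustration) and confirms that both functions rank the logits identically, since both are monotonic.

```python
import math
import torch
import torch.nn.functional as F

def sqrtsoftplus(x: torch.Tensor) -> torch.Tensor:
    return F.softplus(x).sqrt()

logits = torch.tensor([-4.0, 0.0, 4.0, 16.0])
sig = torch.sigmoid(logits)
ssp = sqrtsoftplus(logits)

for x, a, b in zip(logits.tolist(), sig.tolist(), ssp.tolist()):
    print(f"x={x:6.1f}  sigmoid={a:.4f}  sqrtsoftplus={b:.4f}")

# Same ordering under both functions; only the spacing differs.
same_ranking = torch.equal(torch.argsort(sig), torch.argsort(ssp))
print("same ranking:", same_ranking)
```

At x = 0 the value is sqrt(ln 2) ≈ 0.83, and for large x the function behaves like sqrt(x), so it keeps growing where sigmoid saturates near 1.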

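Key advantages of sqrtsoftplus:

Advantage 1: unbounded output

softmax squeezes outputs into [0, 1]: when one expert is extremely relevant its score still tops out near 1.0, so the gap to a moderately relevant expert is compressed. sqrtsoftplus has no upper bound, so a highly relevant expert can score 5, 10, or 50. All three functions are monotonic, so the ranking of the raw scores is the same under each; the extra spread matters once the bias is added before topk and when the selected scores are normalized into weights.

Advantage 2: smooth behavior in the low-score region

sigmoid outputs 0.5 at x ≈ 0, so a "neutral" token gives every expert roughly 0.5 and top-k selection degenerates toward random. sqrtsoftplus outputs about 0.83 at x ≈ 0 but decays toward 0 for negative x (approximately as e^(x/2)), so clearly irrelevant experts get near-zero scores and top-k selection is more stable.

Advantage 3: sqrt stretches small differences

sqrt has slope greater than 1 for inputs below 0.25, so it stretches gaps between small softplus values: sqrt(0.01)=0.1 vs sqrt(0.04)=0.2 turns a gap of 0.03 into 0.1. For larger inputs the gap is no longer stretched: sqrt(0.1)=0.316 vs sqrt(0.5)=0.707. Within the low range, this lets the Gate still choose decisively among experts with similar scores.

Advantage 4: smoother training gradients

The gradient of softplus, 1/(1+e^(-x)), is the sigmoid: smooth and nonzero everywhere. The gradient of sqrt, 0.5/sqrt(x), is large for small x and small for large x, which gives low-scoring experts more chances to be updated during training and counters the Matthew effect (the rich get richer).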
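The chain rule combines the two gradients above into sigmoid(x) / (2 · sqrt(softplus(x))). The following sketch checks that closed form against autograd and compares it with the sigmoid gradient at a negative logit; the sample points are arbitrary assumptions.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-3.0, 0.0, 3.0], dtype=torch.float64, requires_grad=True)
y = F.softplus(x).sqrt()
y.sum().backward()

xd = x.detach()
closed_form = torch.sigmoid(xd) / (2 * F.softplus(xd).sqrt())
sigmoid_grad = torch.sigmoid(xd) * (1 - torch.sigmoid(xd))

print("autograd    :", x.grad)
print("closed form :", closed_form)
print("sigmoid grad:", sigmoid_grad)
```

At x = -3 the sqrtsoftplus gradient is larger than the sigmoid gradient, which matches the claim that low-scoring experts receive a stronger update signal.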

V3 used sigmoid; V4 switches to sqrtsoftplus. The change is a targeted response to the long-tailed expert distribution seen late in V3 training.

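7.4 Industrializing noaux_tc: how the bias term works

V4's topk_method="noaux_tc" works together with the bias term to achieve load balancing without an auxiliary loss. How the mechanism works:

The problem with the usual aux-loss approach:

total_loss = main_loss + λ × aux_loss
aux_loss = sum_e (frequency_e^2)   # penalize experts that are chosen too often

The trouble with an aux loss is that it competes with the main loss for gradient: to lower aux_loss, the model gives up some quality on the main task. Turning expert balance into a hard training objective ends up hurting expressive power.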
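For reference, a runnable sketch of one common aux loss is shown here. It follows the Switch Transformer-style formulation (fraction of assignments times mean router probability), which is an assumption on my part and not the simplified sum of squared frequencies in the pseudocode above. The point is that the loss is differentiable with respect to the router logits, so it injects gradient into the Gate.

```python
import torch

def switch_style_aux_loss(logits: torch.Tensor, topk: int) -> torch.Tensor:
    n_tokens, n_experts = logits.shape
    probs = logits.softmax(dim=-1)
    idx = probs.topk(topk, dim=-1)[1]
    # f_e: fraction of routing slots assigned to expert e (not differentiable)
    f = torch.bincount(idx.flatten(), minlength=n_experts).float() / (n_tokens * topk)
    # P_e: mean router probability of expert e (differentiable)
    p = probs.mean(dim=0)
    return n_experts * (f * p).sum()

n_experts = 4
balanced = 10.0 * torch.eye(n_experts).repeat(2, 1)   # token i prefers expert i % 4
collapsed = torch.zeros(8, n_experts)
collapsed[:, 0] = 10.0                                 # every token prefers expert 0

loss_balanced = switch_style_aux_loss(balanced, topk=1)
loss_collapsed = switch_style_aux_loss(collapsed, topk=1)
print(loss_balanced.item(), loss_collapsed.item())
```

The balanced router sits at the minimum of 1, while the collapsed one approaches the number of experts.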

The noaux_tc solution:

score_for_topk = original_score + bias[e]
score_for_weight = original_score   # no bias added

# the bias is updated outside the training loop:
for e in experts:
    if frequency[e] > target:
        bias[e] -= step
    else:
        bias[e] += step

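What makes it elegant:

The bias only affects top-k selection: it decides which experts are activated
The bias never enters the routing weight: an expert's output weight is set by its original score
The bias is updated entirely outside the training loop, so it does not pollute the main-loss gradient
As a result, what the model learns about "which expert is most relevant" stays clean, and the bias only adjusts "which expert should currently be used more"

The role of the bias in V4's source is very clear: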

if self.bias is not None:
    scores = scores + self.bias                    # the bias is only for topk

if self.hash:
    indices = self.tid2eid[input_ids]
else:
    indices = scores.topk(self.topk, dim=-1)[1]    # topk sees the biased scores

weights = original_scores.gather(1, indices)       # weights use the original scores

original_scores is cached from scores before the bias is added; the biased scores feed topk and the original scores feed the weight gather. This "two-track scoring" is the engineering core of noaux_tc.

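7.5 routed_scaling_factor: the final scaling of routing weights

V4 uses route_scale = 2.5 (from routed_scaling_factor in config.json). In the last step of forward:

weights *= self.route_scale

Every routing weight is multiplied by 2.5. It looks like plain constant scaling, but it matters for the magnitude of the MoE output.

Consider V4's final MoE output formula:

y = sum_e (routed_weight[e] × routed_expert[e](x)) + shared_expert(x)

Without scaling, the routing weights sum to 1 after normalization, so the routed part contributes with a total weight capped at 1.0. The shared expert's output has no such normalization constraint; its magnitude can be anything.

This creates an imbalance: the shared expert may dominate the output while the 6 routed experts together contribute only on the order of 1.0.

route_scale = 2.5 lifts the total weight of the routed part to 2.5 so that the routed experts and the shared expert are balanced in magnitude. The specific value 2.5 was tuned from output-variance monitoring during training, so that the output norms of the two paths stay close.

If you train your own V4-style MoE, adjust this constant to the output magnitude of your shared expert; it is not a fixed magic number.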
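The effect of normalization followed by scaling is easy to verify. This sketch uses toy sizes and random logits (both assumptions) and shows that each token's routing weights sum to route_scale.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n_tokens, n_experts, topk, route_scale = 5, 16, 6, 2.5   # toy sizes

scores = F.softplus(torch.randn(n_tokens, n_experts)).sqrt()
idx = scores.topk(topk, dim=-1)[1]
weights = scores.gather(1, idx)
weights = weights / weights.sum(dim=-1, keepdim=True)    # each row now sums to 1
weights = weights * route_scale                           # each row now sums to route_scale

row_sums = weights.sum(dim=-1)
print(row_sums)
```

Whether that total is the right size depends on the shared expert's output norm, which this sketch does not model.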

7.6 The two-track architecture of hash layers and learned layers

V4 uses hash routing in the first 3 layers (num_hash_layers=3) and learned routing in the remaining 58. From Gate.__init__:

self.hash = layer_id < args.n_hash_layers
self.weight = nn.Parameter(torch.empty(args.n_routed_experts, args.dim))
if self.hash:
    self.tid2eid = nn.Parameter(
        torch.empty(args.vocab_size, args.n_activated_experts, dtype=torch.int32),
        requires_grad=False
    )
    self.bias = None
else:
    self.bias = nn.Parameter(torch.empty(args.n_routed_experts, dtype=torch.float32))

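Characteristics of hash layers:

tid2eid: [vocab_size, n_activated_experts] is a direct lookup table, similar to an embedding, that maps a token id to a list of expert ids
requires_grad=False: it takes no gradient updates; it is a precomputed fixed mapping
bias = None: hash layers have no bias, because routing is fixed and needs no balancing

In forward: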

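if self.hash:
    indices = self.tid2eid[input_ids]    # direct table lookup
else:
    indices = scores.topk(self.topk, dim=-1)[1]  # learned routing

Why hash the first 3 layers? Because the hidden state there is still very close to the raw token embedding, routing directly on the token id is more efficient and more stable than learning a router on the hidden state. Chapter 8 covers the design philosophy of hash routing.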
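A toy version of the lookup path is sketched below. The table contents are random and purely hypothetical (how the real tid2eid is generated is not shown in this chapter), and the cast to int64 before gather is a choice made for this sketch.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, n_experts, topk = 10, 8, 2        # toy sizes, assumptions

# Hypothetical table: each token id maps to `topk` distinct expert ids.
tid2eid = torch.stack(
    [torch.randperm(n_experts)[:topk] for _ in range(vocab_size)]
).to(torch.int32)

input_ids = torch.tensor([3, 7, 3])            # token 3 appears twice
indices = tid2eid[input_ids]                   # [3, topk]; the hidden state is not consulted

# Weights still come from the score track, as in Gate.forward.
scores = F.softplus(torch.randn(3, n_experts)).sqrt()
weights = scores.gather(1, indices.long())     # int64 index for gather in this sketch
weights = weights / weights.sum(dim=-1, keepdim=True)
print(indices, weights)
```

The two occurrences of token 3 get identical expert ids regardless of their hidden states, which is the property the hash layers rely on.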

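7.7 How the bias is updated during training

V4's training source is not public, but the standard bias update (described in the DeepSeek-V3 paper) can be inferred from the architecture:

# after the training step (every step, or every N steps)
with torch.no_grad():
    # count how many times each expert was used in this step
    expert_load = ...  # [n_routed_experts]

    # target load under a uniform distribution (batch_size counted in tokens)
    target_load = batch_size * topk / n_routed_experts

    # adjust the bias
    for e in range(n_routed_experts):
        if expert_load[e] > target_load:
            gate.bias[e] -= step_size  # this expert is too busy: push it down
        else:
            gate.bias[e] += step_size  # this expert is too idle: lift it up

    # keep the bias from running away (absolute cap)
    gate.bias.clamp_(min=-bias_max, max=bias_max)

step_size is a hyperparameter, typically on the order of 1e-3. bias_max stops an expert that goes unused for a long time from pushing its bias without bound; it is usually set between 0.5 and 1.0.

Key points of this update:

It runs entirely inside a no_grad context, so it takes no part in backpropagation
It works per batch, which keeps single-token decisions from making the bias jitter
It has a clamp as a backstop against runaway bias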
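The per-expert loop can be written without Python-level iteration. The helper below is a sketch under the same assumptions as the loop (fixed step, sign-based direction, clamp); update_bias_ is a hypothetical name, not a function from the V4 code.

```python
import torch

def update_bias_(bias, indices, n_experts, topk, step_size=1e-3, bias_max=1.0):
    """In-place sign-based bias update; returns the per-expert load."""
    with torch.no_grad():
        load = torch.bincount(indices.flatten(), minlength=n_experts).float()
        target = indices.shape[0] * topk / n_experts
        # Same branch structure as the loop: busy experts go down, all others go up.
        delta = torch.where(
            load > target,
            torch.full_like(load, -step_size),
            torch.full_like(load, step_size),
        )
        bias.add_(delta).clamp_(-bias_max, bias_max)
    return load

bias = torch.zeros(4)
indices = torch.tensor([[0, 1], [0, 2], [0, 1]])   # 3 tokens, top-2, 4 experts
load = update_bias_(bias, indices, n_experts=4, topk=2)
print(load, bias)
```

Here the target is 1.5 slots per expert, so experts 0 and 1 are pushed down and experts 2 and 3 are lifted.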

V4 carries this mechanism over from V3. Chapter 18 covers the training pipeline in detail, based on the public technical report.

7.8 The Gate compared with other MoE models

V4's Gate side by side with Mixtral, Qwen MoE, GShard, and others:

Model	Experts	top-k	Score function	Balancing mechanism	Shared expert
GShard (Google)	64-2048	2	softmax	aux loss	no
Switch Transformer	64-2048	1	softmax	aux loss	no
Mixtral 8x7B	8	2	softmax	aux loss	no
Qwen2-MoE	64	8	softmax	aux loss	yes
Qwen3-MoE	128	8	softmax	aux loss	no
DeepSeek-V3	256	8	sigmoid	noaux_tc	yes
DeepSeek-V4	384	6	sqrtsoftplus	noaux_tc	yes
Llama 4 MoE	16-128	1	sigmoid	aux loss	yes

V4's Gate stands apart on three dimensions:

noaux_tc: in this table, only DeepSeek-V3 and V4 balance load without relying on an aux loss
sqrtsoftplus: the only score function here that is neither softmax nor sigmoid
shared expert + 384 routed: extreme fine granularity plus shared general knowledge

All of these choices trace back to the engineering route of the DeepSeekMoE paper (arXiv:2401.06066). V4 pushes that route to 384 experts and moves one step beyond V3's sigmoid by switching to sqrtsoftplus.

7.9 Hands-on: write your own noaux_tc Gate

import torch
import torch.nn as nn
import torch.nn.functional as F

class MyGate(nn.Module):
    def __init__(self, dim=512, n_experts=64, topk=6, route_scale=1.0):
        super().__init__()
        self.topk = topk
        self.route_scale = route_scale
        self.weight = nn.Parameter(torch.randn(n_experts, dim))
        self.bias = nn.Parameter(torch.zeros(n_experts), requires_grad=False)

    def forward(self, x):
        # x: [B, dim]
        scores = F.linear(x.float(), self.weight.float())
        scores = F.softplus(scores).sqrt()           # sqrtsoftplus
        original = scores

        biased = scores + self.bias
        idx = biased.topk(self.topk, dim=-1)[1]      # [B, topk]
        weights = original.gather(1, idx)             # [B, topk]
        weights = weights / weights.sum(dim=-1, keepdim=True)
        weights = weights * self.route_scale

        return weights, idx


# Test: simulate the noaux_tc bias update
gate = MyGate()
x = torch.randn(64, 512)
weights, idx = gate(x)

# Count how many times each expert was used
expert_load = torch.bincount(idx.flatten(), minlength=64)
target = 64 * 6 / 64  # tokens * topk / n_experts = 6

# Update the bias
with torch.no_grad():
    step = 1e-3
    for e in range(64):
        if expert_load[e] > target:
            gate.bias[e] -= step
        else:
            gate.bias[e] += step

print("bias after update:", gate.bias[:10])

After a few steps you can observe that the bias of over-selected experts is pushed down, so they are less likely to be chosen on the next step. That is all the magic there is to noaux_tc.
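To see the balancing effect over many steps, the sketch below runs the same sign-based update on a fixed, deliberately skewed set of scores. Every number here (sizes, skew, step, step count) is an assumption chosen so that the toy converges quickly; it is not a reproduction of V4 training.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n_tokens, n_experts, topk = 256, 16, 2
step, n_steps = 0.002, 500

# Fixed logits with a built-in skew: higher expert ids are preferred.
logits = 0.3 * torch.randn(n_tokens, n_experts) + torch.linspace(0.0, 1.0, n_experts)
scores = F.softplus(logits).sqrt()
bias = torch.zeros(n_experts)
target = n_tokens * topk / n_experts           # 32 slots per expert

def loads(b: torch.Tensor) -> torch.Tensor:
    idx = (scores + b).topk(topk, dim=-1)[1]
    return torch.bincount(idx.flatten(), minlength=n_experts).float()

initial_max = loads(bias).max().item()
for _ in range(n_steps):
    load = loads(bias)
    bias += torch.where(load > target, torch.full_like(bias, -step), torch.full_like(bias, step))
final_max = loads(bias).max().item()

print("max load before:", initial_max, "after:", final_max, "target:", target)
```

The busiest expert's load should drop toward the target while the unbiased scores, and therefore the routing weights, are left untouched.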

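7.9·Supplement: the Gate's "routing stability curve" during V4 training

Throughout training, V4's Gate has to keep routing stable, meaning the set of experts assigned to a given token should not change drastically between early and late training. Otherwise expert training turns into whack-a-mole: the kind of tokens each expert receives keeps shifting, and the experts never become differentiated.

Routing stability over the course of training, stage by stage:

Stage 1: first 5% of steps (warm-up)

The weight is randomly initialized, the bias is 0, and the Gate output is close to noise. The 6 experts chosen for each token are nearly random. The purpose of this stage is to let every expert see enough tokens, laying the groundwork for later differentiation.

Stage 2: 5%-30% of steps (routing starts to differentiate)

The Gate weight begins to learn differences, and the scores of different experts start to separate clearly. The noaux_tc bias is not yet well tuned, so some experts may be over-selected.

Stage 3: 30%-80% of steps (stable differentiation)

The bias is fully effective, and expert utilization is balanced to within ±20%. The pattern of "when to choose which expert" learned by the Gate weight is basically stable, and the distribution of token types entering each expert no longer changes sharply.

Stage 4: 80%-100% of steps (refinement)

The learning rate decays into the cooldown phase, the routing pattern is nearly fixed, and the Gate weight only fine-tunes the choice for borderline tokens. Routing stability is very high here: tokens of the same batch are assigned to highly consistent experts across steps.

The combination of sqrtsoftplus + noaux_tc + hash routing in the first 3 layers makes V4's stability curve steeper than that of V3's sigmoid + noaux_tc, meaning V4 reaches stable differentiation earlier (at about 70% of the steps V3 needs), which saves training cost.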
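One way to put a number on "routing stability" is the overlap between the expert sets chosen for the same tokens at two checkpoints. The metric below (mean Jaccard overlap) and the name routing_overlap are my own assumptions for illustration; the chapter does not specify how stability is measured.

```python
import torch

def routing_overlap(idx_a: torch.Tensor, idx_b: torch.Tensor) -> torch.Tensor:
    """Mean Jaccard overlap between two [n_tokens, topk] expert-index tensors.

    Assumes expert ids within each row are unique, as topk guarantees.
    """
    # match[t, i, j] is True when slot i of idx_a and slot j of idx_b name the same expert
    match = idx_a.unsqueeze(2) == idx_b.unsqueeze(1)
    inter = match.any(dim=2).sum(dim=1).float()          # |A ∩ B| per token
    union = idx_a.shape[1] + idx_b.shape[1] - inter      # |A ∪ B| per token
    return (inter / union).mean()

early = torch.tensor([[0, 1, 2], [3, 4, 5]])
late = torch.tensor([[2, 1, 5], [3, 4, 5]])
print(routing_overlap(early, late).item())
```

A value near 1 means the same tokens keep reaching the same experts; tracking it across checkpoints would trace the curve described above.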

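Understanding this curve matters a great deal for fine-tuning V4: if the fine-tuning lr is too large, it destroys the stable routing learned in stages 3-4. Set the fine-tuning lr to roughly 1/100 of the pre-training max lr to avoid routing collapse.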

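7.9·Supplement 2: the hidden coupling between the Gate and sparse attention

V4's Gate looks completely independent of attention, but there is one hidden coupling point: hash routing in the first 3 layers makes the KV content seen by sparse attention more stable.

The mechanism: hash routing in the first 3 layers fixes which token categories each expert receives. That gives the hidden-state output of those layers high stability: a given token id passes through the same set of experts for the whole of training, so its hidden-state representation is stable too.

Attention from layer 4 onward sees this hash-stabilized hidden state, so the pattern the sparse-attention Indexer learns for "which KV entries matter" is more stable as well. If the first 3 layers used learned routing, the hidden state would swing widely early in training, the Indexer would have to keep chasing that movement, and training would be harder.

This causal chain, "hash stabilization in the first 3 layers → stable sparse selection in later attention", is one of the finer points of V4's engineering: designs that look independent reinforce each other. It also explains why you cannot shrink V4 by simply removing one component: each component provides stability guarantees for another.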

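7.9·Extension: 5 common problems when debugging a Gate

Implementing your own V4-style Gate tends to hit a few typical problems. Their symptoms and how to locate them:

Problem 1: Gate outputs are all nearly equal and topk degenerates to random

Symptom: score differences across experts are below 1e-3, and the 6 experts chosen by topk follow expert-id order almost exactly, so some experts go unactivated for a long time.

Cause: the Gate weight is initialized too small, or the input hidden state is too uniform.

Fix: initialize the Gate weight with a truncated normal (std = 1 / sqrt(dim)); check that the upstream RMSNorm normalizes correctly.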
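A sketch of that initialization with torch.nn.init.trunc_normal_ follows. The sizes and the ±2·std cut-off are assumptions; note that the a and b arguments are absolute bounds, so they need to be scaled with std.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dim, n_experts = 512, 64                      # toy sizes
std = dim ** -0.5

weight = nn.Parameter(torch.empty(n_experts, dim))
# a and b are absolute cut-offs, so scale them with std (here: truncate at ±2 std)
nn.init.trunc_normal_(weight, mean=0.0, std=std, a=-2 * std, b=2 * std)

x = torch.randn(1024, dim)                    # stand-in for an RMSNorm-ed hidden state
logits = x @ weight.t()
print(weight.abs().max().item(), logits.std().item())
```

With unit-variance inputs this keeps the logits at roughly unit scale, so the scores are neither collapsed together nor saturated at initialization.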

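Problem 2: the bias runs away, with ever larger absolute values

Symptom: after a few thousand training steps the bias of some experts exceeds ±10.

Cause: the bias update step is too large, or the training data distribution is severely skewed and some experts stay idle for a long time.

Fix: clamp the bias to ±0.5; check the bias distribution regularly, and when an expert has gone without updates for a long time, force-route tokens to it with hash routing for a few steps to feed it data.

Problem 3: sqrtsoftplus is numerically unstable

Symptom: the loss suddenly becomes NaN early in training.

Cause: for large negative x, F.softplus(x) can underflow to 0. The derivative of sqrt at 0 is inf, and inf multiplied by the vanishing softplus gradient gives NaN.

Fix: apply clamp(min=1e-6) before the sqrt so that its input is never 0.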
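The failure and the fix can be reproduced on a single value. This is a minimal sketch in float32 with an extreme logit of -200 (an assumption chosen to force the underflow); it compares the gradient with and without the clamp.

```python
import torch
import torch.nn.functional as F

def grad_of(fn, value: float) -> torch.Tensor:
    x = torch.tensor([value], dtype=torch.float32, requires_grad=True)
    fn(x).sum().backward()
    return x.grad

def unsafe(x: torch.Tensor) -> torch.Tensor:
    return F.softplus(x).sqrt()

def safe(x: torch.Tensor) -> torch.Tensor:
    return F.softplus(x).clamp(min=1e-6).sqrt()

g_unsafe = grad_of(unsafe, -200.0)   # softplus underflows to 0, sqrt backward divides by 0
g_safe = grad_of(safe, -200.0)       # clamp keeps the sqrt input away from 0
print(g_unsafe, g_safe)
```

The clamp zeroes the gradient for the clamped entries instead of producing NaN, which is usually acceptable for experts that are this far from being selected.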

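Problem 4: route_scale does not match the shared expert's output magnitude

Symptom: the model's output norm is abnormally large or small.

Cause: route_scale=2.5 is an empirical value from the V4 team; other models may need a different value.

Fix: monitor the norm ratio of routed output to shared output during training and tune route_scale until the two are close to 1:1.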
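Such a monitor can be as small as the helper below. The function name and the stand-in activations are hypothetical; in a real run the two tensors would be the routed-path and shared-path outputs captured from the MoE layer.

```python
import torch

def routed_shared_norm_ratio(routed_out: torch.Tensor, shared_out: torch.Tensor) -> float:
    """Mean per-token L2 norm of the routed path divided by that of the shared path."""
    routed_norm = routed_out.norm(dim=-1).mean()
    shared_norm = shared_out.norm(dim=-1).mean()
    return (routed_norm / shared_norm.clamp(min=1e-12)).item()

torch.manual_seed(0)
shared_out = torch.randn(32, 64)               # stand-in activations, not real model outputs
routed_unscaled = 0.4 * torch.randn(32, 64)    # routed path before route_scale

ratio_before = routed_shared_norm_ratio(routed_unscaled, shared_out)
ratio_after = routed_shared_norm_ratio(2.5 * routed_unscaled, shared_out)
print(ratio_before, ratio_after)
```

In this toy setup a scale of 2.5 moves the ratio from about 0.4 to about 1.0; with real activations the required scale would have to be read off the measured ratio.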

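Problem 5: unstable training at the boundary between hash and learned layers

Symptom: the loss at layer 3 (the boundary between hash and learned routing) jitters more than at other layers.

Cause: the "signal characteristics" of the fixed-expert output of hash layers and the dynamic-expert output of learned layers differ a lot at the boundary.

Fix: use a larger weight decay for the first 3 layers to speed up their stabilization.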

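Almost all 5 problems showed up in V4's training curves, and the V4 team's tuned values (bias clamp / route_scale=2.5 / epsilon before sqrt) came out of them. The inference code in §7.2 contains neither a bias clamp nor an epsilon, so these would be training-side settings.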

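7.10 Further reading

DeepSeekMoE paper (arXiv:2401.06066): fine-grained experts + shared experts
DeepSeek-V3 report (arXiv:2412.19437): where noaux_tc comes from
Switch Transformer paper (arXiv:2101.03961): early industrialization of MoE
Mixtral paper (arXiv:2401.04088): the coarse-grained sparse MoE route (8 experts, top-2)
Chapter 8 of this book: hash routing in detail
Chapter 9 of this book: the Expert class and its SwiGLU implementation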

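7.10·Supplement: the floating-point precision path of the Gate forward pass

A few precision details in V4's Gate forward deserve a closer look.

Step 1, score computation:

scores = linear(x.float(), self.weight.float())

Note the two .float() calls: both the input x and the weight are upcast to float32 for this matmul. Most other Linear layers differ (they mostly run in BF16/FP8). The reason: topk amplifies score differences. A gap of only a few rounding errors can change which experts topk returns, so the Gate has to compute in FP32.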
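A two-value example shows how rounding can erase a real score gap. The logits are hand-picked assumptions; the only fact relied on is that BF16 keeps 8 significant bits, so values near 1.0 are spaced 2**-7 ≈ 0.0078 apart.

```python
import torch

logits_fp32 = torch.tensor([1.000, 1.001, 0.500])   # experts 0 and 1 differ by 1e-3
logits_bf16 = logits_fp32.to(torch.bfloat16)         # rounds both to the same BF16 value

distinct_in_fp32 = bool(logits_fp32[1] > logits_fp32[0])
tied_in_bf16 = bool(logits_bf16[0] == logits_bf16[1])
print(distinct_in_fp32, tied_in_bf16)
```

In float32 expert 1 wins a top-1 selection outright; in BF16 the two experts tie and the choice falls to the tie-breaking behavior of the kernel.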

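Step 2, sqrtsoftplus:

scores = F.softplus(scores).sqrt()

softplus and sqrt both run in float32, avoiding BF16 precision drift at large or small values.

Step 3, bias addition:

scores = scores + self.bias

self.bias is float32, the same precision as scores, so nothing is lost.

Step 4, topk:

topk itself is deterministic, but tie handling is sensitive to floating-point precision. If two experts' scores are extremely close (differing by 1e-6), different hardware may choose different experts. V4 relies on the spread-widening property of sqrtsoftplus to make ties rare, which is another implicit benefit of sqrtsoftplus.

Step 5, weight normalization:

weights = original_scores.gather(1, indices)
if self.score_func != "softmax":
    weights /= weights.sum(dim=-1, keepdim=True)
weights *= self.route_scale

This also runs in float32, because division is precision-sensitive.

Step 6, return:

weights and indices are consumed by the downstream experts only after they leave the Gate, and those experts run internally in BF16 / FP4. Going from the Gate's float32 to the downstream BF16 is one precision drop, but each weight is only a scalar multiplier, so the loss is manageable.

Overall, V4's Gate is float32 end to end, with no BF16/FP8 intermediate step. This is a small price paid for stable routing decisions.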

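7.10·Supplement 2: the training math of noaux_tc, or how the bias converges

V4's noaux_tc replaces the aux loss with a bias, and there is derivable math behind how the bias settles into a stable distribution.

Update equation:

Let n_e be the number of tokens expert e receives in a step, and let the target count be target = batch_size × topk / n_routed (with batch_size counted in tokens). The bias update rule, matching the fixed-step loop in §7.4 and §7.7, is:

b_e[t+1] = b_e[t] + step × sign(target - n_e[t])

This is a discrete "target-feedback" controller: the sign of (target - n_e) is the error signal and step is the fixed control increment.

Steady-state analysis:

Assume the training data distribution is stable. The bias then converges to a steady state b* at which every expert's count equals target. Heuristically, in the steady state:

b_e* + s_e ≈ const + (-log p_e),   where s_e is the mean of sqrtsoftplus(scores_e)

Here p_e is the natural activation probability of expert e without the bias. The steady-state bias is approximately a negative log-probability, which is the mathematical link between noaux_tc and reweighting in importance sampling.

Convergence speed:

The step size sets the convergence speed. Too large: the bias jitters and topk selection becomes unstable. Too small: convergence is slow and early training is unbalanced. V4's step is estimated to be on the order of 1e-3, which lets the bias converge within a few thousand steps.

Concurrency with weight training:

The bias is updated in a no_grad context and the weight is updated by backpropagation. The two alternate: the bias adjusts the routing distribution while the weight learns what each expert represents. This decoupling of "routing distribution vs representation learning" is the fundamental reason noaux_tc improves on an aux loss.

Synergy with sqrtsoftplus:

sqrtsoftplus leaves the score unbounded, so the effect of adding the bias on the relative ranking is clear. With softmax scores the 384 values share a total of 1, so a typical score is about 1/384 ≈ 0.0026 and an additive bias of a useful size is hard to calibrate against scores that small. Choosing sqrtsoftplus over softmax is partly about making the bias work well.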
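The scale mismatch is easy to see numerically. The sketch below uses random logits over 384 experts (an assumption) and compares the typical size of a softmax score with that of a sqrtsoftplus score; a bias step of 1e-3 is then a large fraction of the former and a small fraction of the latter.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n_experts = 384
logits = torch.randn(4, n_experts)

softmax_scores = logits.softmax(dim=-1)
ssp_scores = F.softplus(logits).sqrt()

softmax_mean = softmax_scores.mean().item()   # fixed at 1 / 384 because each row sums to 1
ssp_mean = ssp_scores.mean().item()           # order 1 for logits near 0
step = 1e-3
print(softmax_mean, ssp_mean)
print("step / typical score:", step / softmax_mean, step / ssp_mean)
```

This is only an argument about scale; it does not show that a bias cannot be made to work with softmax scores.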

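7.11 Chapter summary

V4's Gate is about 30 lines of code built from 4 engineering elements: sqrtsoftplus, noaux_tc, the bias term, and route_scale
sqrtsoftplus replaces sigmoid, giving the Gate a wider score dynamic range and smoother training gradients
noaux_tc + the bias term replace the aux loss, so expert balancing does not pollute the main loss
route_scale=2.5 calibrates the output magnitude of the routed and shared paths
The first 3 layers use hash routing, sidestepping the difficulty of learning a router when early-layer hidden states are too close to the token embedding
Compared with GShard, Mixtral, Qwen, and similar models, V4 is distinctive on three dimensions: noaux_tc, sqrtsoftplus, and 384 routed + shared experts

Chapter 8 covers hash routing: why the first 3 layers do not learn routing, and how the tid2eid table is generated.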