杨艺韬 · 2026-04-28 · 10,375 characters · about 21 minutes

Chapter 15: CUDA Graph and torch.compile, End to End

"The ideal outcome for torch.compile: the first forward costs a few extra seconds, and each of the next 1000 forwards runs like a single GPU graph replay."

-- PyTorch Conference 2024

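Key points

CUDA Graph records many kernel launches into a single graph: one launch at replay time runs hundreds of kernels, and CPU overhead drops to nearly zero.
Inductor's reduce-overhead mode wraps the generated kernels in CUDA Graphs automatically: the user only adds mode='reduce-overhead'.
cudagraph_trees.py (~2700 lines) handles memory coordination when several graphs coexist: each graph tree is tied to a PrivatePool (Chapter 4, §4.9), so device addresses stay stable across replays.
The full pipeline: @torch.compile → Dynamo trace → AOTAutograd functionalize + partition → Inductor lowering + codegen → Triton kernel → CUDA Graph wrapping → replay.
Small models and small-batch inference gain the most: Llama-7B decode (single-token generation) can drop from ~25 ms/token to ~5 ms/token.
Large-model training gains little: a single step already takes tens of milliseconds, CPU launch overhead is under 1% of that, and CUDA Graph barely helps.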

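15.1 What a CUDA Graph is

CUDA Graphs are an API introduced in CUDA 10: a sequence of GPU operations is recorded into a cudaGraph_t, and at replay time a single cudaGraphLaunch submits all of them.

The traditional flow without CUDA Graph:

CPU: launch kernel 1 → launch kernel 2 → launch kernel 3 → ...
              ↓ 5 us       ↓ 5 us         ↓ 5 us
              GPU receives and queues the work

Each kernel launch costs roughly 5-10 microseconds of CPU time. If one model forward launches 1000 kernels, the CPU spends 5-10 ms on launches alone. If the GPU compute itself needs only 5 ms, launch overhead accounts for half or more of the step time.

With CUDA Graph:

Record (once):          capture_begin → run once → capture_end → graph
Replay (every call):    cudaGraphLaunch(graph)   ← a single call

One launch runs hundreds of kernels and CPU overhead is nearly eliminated. This makes CUDA Graph the standout optimization for workloads that are dense in small operators and executed repeatedly.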
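The arithmetic above can be checked with a small back-of-the-envelope model. This is only a sketch under two stated assumptions: launches are treated as fully serialized with GPU work (a worst case), and a graph replay is assumed to cost about one launch no matter how many kernels it holds. The function name `launch_overhead` is made up for this illustration.

```python
def launch_overhead(num_kernels: int, launch_us: float, gpu_ms: float) -> dict:
    """Estimate CPU launch cost for one step, eager vs. CUDA Graph replay.

    Assumptions (illustrative only): launch cost adds to GPU time, and one
    graph replay costs roughly a single launch.
    """
    eager_cpu_ms = num_kernels * launch_us / 1000.0
    graph_cpu_ms = launch_us / 1000.0
    return {
        "eager_cpu_ms": eager_cpu_ms,
        "eager_launch_share": eager_cpu_ms / (eager_cpu_ms + gpu_ms),
        "graph_cpu_ms": graph_cpu_ms,
        "graph_launch_share": graph_cpu_ms / (graph_cpu_ms + gpu_ms),
    }


# The chapter's example: 1000 kernels, 5 us per launch, 5 ms of GPU compute.
est = launch_overhead(num_kernels=1000, launch_us=5.0, gpu_ms=5.0)
for name, value in est.items():
    print(f"{name}: {value:.4f}")
```

With these inputs the eager launch cost is 5 ms, half of the step, while the replay's share falls well below 1%. Real GPUs overlap launches with execution, so treat the numbers as a bound rather than a prediction.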

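15.2 Hard constraints of CUDA Graph

CUDA Graph is not a free lunch. Several hard constraints make it less than straightforward to use:

1. Memory addresses must be fixed: the input and output tensor addresses a kernel saw during recording must be the same at replay. If the caching allocator hands that address to another tensor between two replays, the data gets overwritten. The PrivatePool from Chapter 4, §4.9 exists for exactly this reason.

2. Shapes must be fixed: if the input shape was [4, 4] during recording and is [8, 8] at replay, the graph cannot adapt.

3. Control flow must be identical: if forward contains if x.sum() > 0 and recording took the if branch, replay does not re-evaluate the condition.

4. Memory allocation is restricted: during recording, allocations cannot go through the normal allocator path. PyTorch routes every allocation made during capture, torch.empty included, into the graph's PrivatePool, and operations that synchronize with the host (for example .item() or a synchronous cudaMemcpy) make the capture fail.

These four constraints make it nearly impossible to wrap arbitrary PyTorch code in a CUDA Graph automatically. CUDA Graph therefore works best together with a compiler: Inductor hands over code segments that are already compiled, fixed in shape, and free of host-side control flow.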

15.3 Inductor's reduce-overhead mode

torch.compile(model, mode='reduce-overhead') automatically wraps the kernels Inductor generates in a CUDA Graph. The flow:

flowchart TB
    User["@torch.compile(mode='reduce-overhead')"]
    User --> Cap[Inductor generates Triton kernels]
    Cap --> Pool[Assign a PrivatePool to this graph]
    Pool --> Warm[Warmup: run 3 times so autotune settles on block sizes]
    Warm --> Rec[Capture: cudaGraphCapture_begin → call kernels → end]
    Rec --> Cache[Graph cached on _CompiledFxGraph]

    Run[Each forward call]
    Run --> Check{shape / device same as at recording?}
    Check -->|yes| Replay[cudaGraphLaunch]
    Check -->|no| Recap[Capture another graph]

    style Replay fill:#dcfce7,stroke:#22c55e,stroke-width:2px
    style Recap fill:#fef3c7,stroke:#f59e0b

On the first forward calls, once Inductor has generated the kernels and the warmup runs have finished, capture begins: cudaStreamBeginCapture records every GPU operation that follows, one full forward runs, and cudaStreamEndCapture returns a graph. Later forwards with the same shape replay it directly.
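The replay-or-recapture decision in the flowchart can be pictured as a dictionary lookup keyed by everything the graph has baked in. The sketch below is a toy stand-in with no CUDA involved; `CaptureKey` and `GraphCache` are hypothetical names, and the real bookkeeping in Inductor is considerably richer.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class CaptureKey:
    """Properties assumed to be frozen into a captured graph."""
    shape: Tuple[int, ...]
    dtype: str
    device: str


class GraphCache:
    """Toy model: capture once per key, replay on every later hit."""

    def __init__(self, capture: Callable[[CaptureKey], str]):
        self._capture = capture
        self._graphs: Dict[CaptureKey, str] = {}
        self.captures = 0
        self.replays = 0

    def run(self, shape, dtype: str = "float16", device: str = "cuda:0") -> str:
        key = CaptureKey(tuple(shape), dtype, device)
        if key not in self._graphs:
            # Unseen shape/device: pay for a fresh capture.
            self._graphs[key] = self._capture(key)
            self.captures += 1
        else:
            # Seen before: the cheap path, one launch.
            self.replays += 1
        return self._graphs[key]


cache = GraphCache(capture=lambda key: f"graph{key.shape}")
for shape in [(4, 4), (4, 4), (8, 8), (4, 4)]:
    cache.run(shape)
print(cache.captures, cache.replays)
```

Every distinct shape costs one capture and one more resident graph, which is why workloads with many input shapes need bucketing before this mode pays off.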

Measured on Llama-7B inference (single-token decode):

Without torch.compile: ~25 ms/token
mode='default' (Inductor only): ~12 ms/token
mode='reduce-overhead' (+ CUDA Graph): ~5 ms/token

That is a 5x speedup, and it is why inference engines such as vLLM lean heavily on torch.compile.

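15.4 cudagraph_trees: cooperation among multiple graphs

In practice a training run has several graphs: one for forward, one for backward, and possibly one for the optimizer step. How is memory coordinated among them?

torch/_inductor/cudagraph_trees.py (2711 lines) implements a "graph tree" manager:

Each graph node corresponds to one compiled artifact
Nodes are connected through their input/output tensors
Graphs in the same tree share one PrivatePool (isolated inside the caching allocator)
Different trees have independent memory, so graph A's output addresses are never reused by graph B

This tree structure lets the forward → backward → optimizer step graphs work together smoothly: forward outputs feed backward, backward outputs feed the optimizer, and addresses stay stable throughout.

If the model has graph breaks (Chapter 12), each graph segment gets its own subtree. The eager code in between is not wrapped in a CUDA Graph and costs more than the pure-graph case. This is why reduce-overhead yields less the more graph breaks there are.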

15.5 The full pipeline: from one line of code to GPU binary

Chapters 12-14 chained together give the whole journey of @torch.compile:

flowchart TB
    U["@torch.compile(model, mode='reduce-overhead')<br/>user code"]
    U --> D[Dynamo<br/>PEP 523 intercepts forward<br/>symbolic bytecode execution<br/>emits FX Graph + Guards]
    D --> A[AOTAutograd<br/>traces backward with fake tensors<br/>functionalize<br/>min-cut partition<br/>emits fw + bw subgraphs]
    A --> I[Inductor<br/>Lowering: ATen→IR<br/>Scheduling: fusion decisions<br/>Codegen: generates Triton]
    I --> T[Triton compile<br/>Triton DSL → PTX → CUBIN]
    T --> CG[CUDA Graph recording<br/>3 warmup runs + capture]
    CG --> Run[Runtime<br/>guards check → graph replay]

    Run -.guards fail.-> D

    style D fill:#fef3c7
    style A fill:#dbeafe
    style I fill:#dcfce7,stroke:#22c55e,stroke-width:2px
    style CG fill:#fce7f3

Main source directory for each stage:

Stage	Key code
Dynamo	`torch/_dynamo/` (~100000 lines of Python, measured on v2.11)
AOTAutograd	`torch/_functorch/` (~32000 lines)
Inductor	`torch/_inductor/` (~213000 lines, the largest single namespace)
Triton	`triton/` (separate project, several hundred thousand lines)
CUDA Graph	`cudagraph_trees.py` (~2700 lines)

Excluding Triton, that is roughly 350-400k lines of code working together so that the user only writes one decorator. It is arguably the most complex subsystem in PyTorch's engineering history, and it underpins PyTorch's performance story relative to TF and JAX after v2.0. The code keeps growing with every minor version (Inductor was only ~120k lines at v2.4 and has nearly doubled by v2.11).

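15.6 Choosing a mode

torch.compile(model, mode=...) has four modes:

Mode	Behavior	Suited to
`default`	Inductor compilation, no CUDA Graph	Most training
`reduce-overhead`	+ CUDA Graph wrapping	Inference / small-batch decode
`max-autotune`	+ more aggressive autotune (tries more block-size configs); CUDA Graphs stay enabled on GPU	The most performance-sensitive cases
`max-autotune-no-cudagraphs`	max-autotune without CUDA Graph	Many dynamic shapes, but autotune still wanted

15.6.1 How to choose

Large-model training (70B+): use default. CUDA Graph gains little (each step is slow and launches are a small share), and the graphs' private memory pools can push you into OOM
Large-model inference (prefill): use default. Prefill processes a whole batch at once with large kernels, so graphs gain little
Large-model inference (decode): use reduce-overhead. Decode handles one token with small kernels, CPU launch dominates, and graphs gain a lot
Small-model training (e.g. ResNet): use reduce-overhead. Each step has many small kernels, similar to decode
Squeezing out peak performance: use max-autotune. Compilation takes several times longer for another 5-10% of performance

vLLM follows the same split, with CUDA Graphs for decode and none for prefill. Switching strategy by phase is standard practice in large-model inference engines.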
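The selection rules of §15.6.1 fit in a few lines. The helper below is a hypothetical convenience, not a PyTorch API; it only returns the mode string you would then pass to `torch.compile(model, mode=...)`, and the phase names are this sketch's own labels.

```python
def pick_compile_mode(phase: str, large_model: bool, tune_for_peak: bool = False) -> str:
    """Map a workload description to a torch.compile mode string.

    phase: "train", "prefill" or "decode" (labels chosen for this sketch).
    large_model: True for models where one step already takes tens of ms.
    tune_for_peak: accept much longer compile time for a few extra percent.
    """
    if tune_for_peak:
        return "max-autotune"
    if phase == "decode":
        # Many small kernels per token: launch overhead dominates.
        return "reduce-overhead"
    if phase == "prefill":
        # Large kernels over a whole batch: graphs add little.
        return "default"
    if phase == "train":
        return "default" if large_model else "reduce-overhead"
    raise ValueError(f"unknown phase: {phase!r}")


for phase, large in [("train", True), ("train", False), ("prefill", True), ("decode", True)]:
    print(phase, large, "->", pick_compile_mode(phase, large))
```

Treat the result as a starting point; profiling the launch share of a real step remains the deciding evidence.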

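15.6.5 The concrete API flow of CUDAGraph capture

CUDA provides a stream-based graph capture API. The key functions PyTorch wraps around it:

// Simplified capture flow (from c10::cuda::CUDACachingAllocator + torch.cuda.CUDAGraph)
void capture_begin(graph_pool_handle pool) {
    // 1. Switch the caching allocator to the PrivatePool (Chapter 4, §4.9)
    CUDACachingAllocator::beginAllocateToPool(device, pool, /*filter*/ ...);
    // 2. Put the current stream into capture mode
    //    (global mode is PyTorch's default; thread-local and relaxed are selectable)
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
}

void capture_end() {
    cudaStreamEndCapture(stream, &graph);   // yields a cudaGraph_t
    CUDACachingAllocator::endAllocateToPool(device, pool);
    cudaGraphInstantiate(&exec, graph);     // compiles into a cudaGraphExec_t (extra arguments omitted)
}

void replay() {
    cudaGraphLaunch(exec, stream);          // one launch runs the whole segment
}

The three key CUDA APIs:

cudaStreamBeginCapture: puts the stream into capture mode. Every launch into the stream afterwards is recorded rather than executed
cudaStreamEndCapture: ends capture and returns a cudaGraph_t (a graph description, not yet compiled)
cudaGraphInstantiate: compiles the graph into a cudaGraphExec_t (the executable form); this takes ~10 ms, once

The capture mode decides how strictly other CUDA calls are policed while capture is active. In global mode, a potentially unsafe call from any thread fails the capture. cudaStreamCaptureModeThreadLocal limits that check to the capturing thread, so GPU work issued by other threads does not interfere. From Python, torch.cuda.graph selects the mode with its capture_error_mode argument.

The cooperation between PrivatePool and capture (Chapter 4, §4.9) is essential: every allocation during capture goes to the dedicated pool, and those addresses stay valid at replay; tensors allocated outside capture never enter the pool, so they cannot disturb the addresses the graph recorded.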

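15.6.6 Key `_inductor.config` flags

torch/_inductor/config.py holds 700+ lines of flags that control compilation. The ones tuned most often:

Flag	Default	Purpose
`max_autotune`	False	Try more block-size configs during compile-time autotune (several times slower to compile, 5-10% more performance)
`max_autotune_gemm`	False	Autotune GEMMs specifically (Inductor tries cuBLAS / Triton GEMM templates / handwritten kernels)
`triton.cudagraphs`	False	Turned on by reduce-overhead mode; wraps Triton kernels in CUDA Graphs automatically
`aggressive_fusion`	False	Fuse more aggressively (may produce very large kernels that use too many registers)
`coordinate_descent_tuning`	False	Autotune with coordinate descent (can find better configs than grid search)
`cpp.threads`	-1	OpenMP thread count for the CPU backend; -1 lets Inductor decide
`freezing`	False	Freeze constant parameters into kernels for inference (more aggressive fusion, but the model becomes inference-only)

Environment-variable equivalents exist, for example TORCHINDUCTOR_MAX_AUTOTUNE=1.

Tuning order: enable max_autotune first and measure; add freezing for inference services; if a kernel crashes after aggressive_fusion, register pressure is too high, so turn it off.

Knowing these flags gives you levers to pull when compiled performance falls short, instead of staring at the single coarse-grained choice of mode='reduce-overhead'.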

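15.6.7 AOT Inductor (AOTI): ahead-of-time compilation

@torch.compile is JIT: it compiles on the first run and depends on the Python interpreter, Dynamo and Inductor all being alive. AOTI, added during the v2.x series, takes a different route: compile ahead of time, write the artifact to disk, and run without Python.

Entry point (torch/_inductor/__init__.py:55/272):

import torch
import torch._inductor

# 1. Export the model to an ExportedProgram (a static graph)
exported = torch.export.export(model, example_inputs)

# 2. AOT-compile and package; the result is a .pt2 file that contains the .so
output_path = torch._inductor.aoti_compile_and_package(
    exported,
    package_path="model_aoti.pt2",
)

# 3. Load from C++, no Python needed
# (e.g. an inference service that loads the package in C++;
#  torch._inductor.aoti_load_package(output_path) is the Python-side equivalent for a quick check)

The resulting .pt2 is a zip archive containing:

The compiled .so (C++ wrapper + Triton kernel cubins)
The model weights (associated with the .so)
Metadata (input / output schema)

On the C++ side the package is loaded through the AOTI runner classes (AOTIModelPackageLoader in recent releases, built on AOTIModelContainerRunner) and inference goes straight through run(). The whole deployment does not depend on a Python interpreter, which matters most for C++ inference services; mobile and embedded targets are covered by ExecuTorch (§15.6.27).

How AOTI differs from ordinary torch.compile(mode='reduce-overhead'):

Dimension	torch.compile (JIT)	AOTI
When compilation happens	First call	Offline, in advance
Deployment dependencies	Python + PyTorch	Only libtorch (C++)
Startup time	Slow first call (warmup)	A few hundred ms to load the .so
Flexibility	Supports dynamic shapes / graph breaks	No graph breaks; dynamic dimensions must be declared at export time
Target scenario	Training + JIT inference	Production inference services / edge deployment

Internally AOTI still runs the full Inductor pipeline (lowering / scheduling / Triton codegen), but the product changes from a runtime callable to a persistent .so. Wrapper codegen also switches from the Python wrapper to the C++ wrapper (codegen/cpp_wrapper_*.py, mentioned in §14.6.8), so the .so is pure C++ inside with zero Python overhead.

Practical rule: train with torch.compile, serve with AOTI. According to the author, inference stacks such as HuggingFace and vLLM are moving toward the AOTI path: models are distributed as .pt2 packages and deployed without a Python environment. This is a key engineering foundation for PyTorch's "train once, deploy to many targets" direction.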

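15.6.8 Compatibility of CUDA Graph with NCCL communication

DDP / FSDP training includes collective operations (AllReduce, AllGather). Can CUDA Graph wrap these collectives? Yes, with restrictions.

Supported path: NCCL 2.9.6 and later allow collectives issued on a capturing stream to be recorded into the graph, and PyTorch builds that ship a recent enough NCCL can capture them.

// Recording
cudaStreamBeginCapture(stream, ...);
ncclAllReduce(input, output, count, ncclFloat, ncclSum, comm, stream);  // can be captured
cudaStreamEndCapture(stream, &graph);

// Replay
cudaGraphLaunch(exec, stream);   // the collective is replayed too

Restrictions:

The comm must be fixed: if recording used commA, replay cannot switch to commB (changing the comm requires a new capture)
Input/output addresses must be fixed: same as for any CUDA Graph
Avoid sharing one comm across graphs: the author recommends a dedicated comm per graph, to keep NCCL's internal state unambiguous
The collective size must be fixed: count is frozen at recording and cannot change at replay

Corner cases in practice:

DDP gradient bucketing: as Chapter 17 on DDP explains, bucket boundaries usually depend on when parameters become ready, which is dynamic and therefore incompatible with CUDA Graph (per the author, DDP falls back to static-graph mode when reduce-overhead is on)
FSDP all-gather: FSDP2 reportedly has a _set_use_full_graph(True) option that brings all-gather into the graph, provided shapes are static; confirm the exact name against your PyTorch version
EP / TP training: expert collectives are complex to keep inside a graph, so most implementations use a graph break and leave the collective in eager mode

Tuning advice: default to mode='default' (no graph) for distributed training; enable reduce-overhead freely for inference without collectives; for distributed inference (e.g. vLLM with TP=8), verify op by op that graph capture succeeds.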

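15.6.9 Common causes of capture failure and how to diagnose them

A failed CUDA Graph capture usually reports cudaErrorStreamCaptureUnsupported or cudaErrorStreamCaptureInvalidated. Failure scenarios:

Error	Trigger
`StreamCaptureUnsupported`	An API that cannot be captured was called (e.g. the synchronous form of `cudaMemcpy`)
`StreamCaptureInvalidated`	Another thread issued GPU work on the same stream during capture
`IllegalAddress` (at replay)	An input address at replay differs from the one recorded (the caching allocator reassigned it)
`LaunchTimeout`	A kernel inside the graph ran too long (only where a driver watchdog is active, typically on GPUs that also drive a display)
OOM during instantiate	`cudaGraphInstantiate` needs device memory for the executable graph, and not enough is left

Diagnostic command:

# Verbose logs from Inductor and cudagraph_trees
# (to localize a faulty kernel with CUDA_LAUNCH_BLOCKING=1, disable CUDA Graphs first:
#  blocking launches interfere with stream capture)
TORCH_LOGS="+inductor,cudagraphs" \
python train.py

Look at the trace that cudagraph_trees.py prints: the start and end of every capture, the graph's input/output addresses, and the address check at replay. These logs locate the large majority of capture problems.

From experience, the first attempt at reduce-overhead always hits a few corner cases. Common fixes:

Move ops that cannot be captured (such as print / .item()) outside the compiled region
Use torch.cuda.is_current_stream_capturing() to detect capture in your code and branch accordingly
Turn off the triton.cudagraphs flag so Inductor takes the non-graph path, to check first whether the graph itself is the problem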

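15.6.10 Multi-stream orchestration: parallelism inside a graph

A CUDA Graph is not just a sequence of kernels. It is a directed acyclic graph (DAG) whose nodes can run in parallel. Inductor can exploit this by placing kernels with no mutual dependency on different stream nodes so the GPU runs them concurrently.

graph TB
    Start[graph start]
    Start --> K1[kernel 1: layer norm]
    Start --> K2[kernel 2: bias load<br/>no dependency]
    K1 --> K3[kernel 3: matmul]
    K2 --> K3
    K3 --> K4[kernel 4: activation]
    K4 --> End[graph end]

    style K1 fill:#dcfce7
    style K2 fill:#fef3c7
    style K3 fill:#dbeafe
    style K4 fill:#fce7f3

How it works: during recording PyTorch runs kernels on several streams and synchronizes them with events, and graph capture turns those stream relationships into dependency edges between graph nodes. At replay the GPU scheduler runs independent nodes concurrently, following the DAG.

Practical gains: a graph recorded on a single stream executes strictly in order; multi-stream recording with a sensible split can raise GPU SM utilization by 5-15%. Multi-stream graphs are far harder to debug (error signals are hard to localize), so according to the author Inductor records on a single stream by default and only attempts multiple streams under max-autotune.

The torch.cuda.graph(stream=...) API lets users choose the capture stream by hand. Combined with PyTorch's stream management (Chapter 4, §4.x), this means hand-written multi-stream models can also benefit from CUDA Graph.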

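15.6.11 On-disk caching of compiled artifacts

TORCHINDUCTOR_CACHE_DIR sets where compiled artifacts are written so the next process can skip compilation (the default location is /tmp/torchinductor_<username>). What is stored there? An illustrative layout (exact names vary by version):

$TORCHINDUCTOR_CACHE_DIR/
├── triton/
│   ├── 0a3b8c.../
│   │   ├── triton_kernel.py        # Triton source
│   │   ├── triton_kernel.cubin     # compiled GPU binary
│   │   ├── triton_kernel.json      # metadata: shape, dtype, autotune results
│   │   └── triton_kernel.so        # CPU launcher
├── fx_graph_cache/
│   ├── 7d9f2a.../
│   │   ├── input_signature.pkl     # input shape/dtype/stride
│   │   ├── compiled_graph.py       # wrapper Python generated by Inductor
│   │   └── output_code.py          # readable version for the user
└── aotautograd_cache/
    └── ...                          # AOTAutograd trace results

Workflow: at compile time an "input signature" is computed (shape/dtype/stride plus a hash of the op tree) and the hash becomes the directory name. On the next start the hash is looked up: a hit loads the .cubin directly, and only a miss goes through full compilation.

The hit rate depends on a fixed model (the same nn.Module class) and fixed input shapes (or the same dynamic-shape assumptions). In production services the hit rate exceeds 95%, and cold start drops from tens of seconds to a few hundred milliseconds.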
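The lookup idea can be illustrated with a content hash over the graph and its input signature. This is a deliberately simplified sketch: Inductor's real cache key covers far more (configuration, source hashes, versions), and `signature_key` is a hypothetical helper, not a PyTorch function.

```python
import hashlib
import json


def signature_key(graph_repr: str, input_specs: list) -> str:
    """Derive a directory-style cache key from a graph and its inputs.

    graph_repr stands in for the op tree; input_specs is a list of dicts
    holding shape, dtype and stride. Both are assumptions of this sketch.
    """
    payload = json.dumps({"graph": graph_repr, "inputs": input_specs}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


spec_a = [{"shape": [1, 1024], "dtype": "float16", "stride": [1024, 1]}]
spec_b = [{"shape": [1, 2048], "dtype": "float16", "stride": [2048, 1]}]

key_a = signature_key("relu(add(x, bias))", spec_a)
key_a_again = signature_key("relu(add(x, bias))", spec_a)
key_b = signature_key("relu(add(x, bias))", spec_b)

print(key_a, key_a == key_a_again)   # identical inputs give an identical key: cache hit
print(key_b, key_a != key_b)         # a different shape gives a different key: cache miss
```

The same property explains the hit-rate conditions above: any change to the model code or to the input signature produces a new key and therefore a fresh compile.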

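Things to watch:

Clear the cache when upgrading PyTorch (stale cubins and wrappers may not match the new runtime): rm -rf "$TORCHINDUCTOR_CACHE_DIR"
Monitor the cache size: long-running training with several models can take several GB; clean up periodically with find "$TORCHINDUCTOR_CACHE_DIR" -type f -mtime +30 -delete
Sharing a cache across GPU types is dangerous: cubins for different SM architectures (A100 vs H100) are incompatible, so keep a per-host cache

torch._inductor.config.fx_graph_cache = True (on by default from v2.4) enables the FX graph cache. A remote cache option (named fx_graph_remote_cache in recent releases; the author refers to it as enable_remote_cache) pushes artifacts to a shared store so several machines can reuse them, a common setup for multi-node training of large models.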

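15.6.12 Dynamic shape × CUDA Graph: engineering compromises for a fundamental conflict

CUDA Graph requires fixed shapes, yet production inference needs variable-length inputs (for example the KV cache length in LLM decode keeps growing). Two solutions are both in wide use:

Option A: Padding + Mask

Pad variable-length inputs up to a small set of fixed shapes (e.g. [128, 256, 512, 1024]) and record a graph at each padded shape:

@torch.compile(mode='reduce-overhead')
def model_forward(input_ids, attention_mask):
    return model(input_ids, attention_mask)

# Padding done at user level (pad_to_bucket, make_attention_mask, unpad are user-defined helpers)
input_padded = pad_to_bucket(input_ids, [128, 256, 512, 1024])
mask = make_attention_mask(input_ids, input_padded)
out = model_forward(input_padded, mask)
out = unpad(out, original_lengths)

Cost: GPU compute is wasted on the padding. With large inference batches the padding loss can reach 10-30%.

Option B: Split Graph (Bucketed Capture)

Record a separate graph for each common shape and look one up by input shape at run time:

graphs = {}                                  # bucket → (graph, static input, static output)
for n in [128, 256, 512, 1024]:
    static_in = make_dummy_input(n)          # user-defined helper; this buffer is reused at replay
    model(static_in)                         # warmup before capture (ideally on a side stream)
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        static_out = model(static_in)
    graphs[n] = (g, static_in, static_out)

def forward(x):
    n = next_bucket(x.shape[0])              # user-defined: smallest bucket >= x.shape[0]
    g, static_in, static_out = graphs[n]
    static_in[: x.shape[0]].copy_(x)         # write the new data into the captured buffer
    g.replay()
    return static_out[: x.shape[0]]

According to the author, vLLM's V1 engine uses this scheme: it records 16 batch_size buckets covering batch sizes 1-256. Each bucket's graph is stored independently, so memory use is 16x that of a single graph (each graph takes tens to hundreds of MB).

Choosing between them:

Sequence length varies within a batch (e.g. prefill) → Padding + Mask (FlashAttention helps with variable lengths)
Batch size varies (e.g. different request counts) → Bucketed Capture
Code that cannot be graphed at all (e.g. dynamic control flow) → graph break, and that segment runs eagerly

torch._inductor.config.triton.cudagraph_dynamic_shape_warn_limit = N makes Inductor warn once the number of graphs exceeds N, a reminder to check whether the buckets are too fine-grained.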
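Both options rely on a bucket lookup that the snippets above leave undefined. A minimal sketch of `next_bucket`, plus a helper that estimates how much work padding wastes, could look like the following; the bucket list mirrors the example values and `padding_waste` is a name invented for this illustration.

```python
import bisect

BUCKETS = [128, 256, 512, 1024]   # must be sorted ascending


def next_bucket(length: int, buckets=BUCKETS) -> int:
    """Return the smallest bucket that can hold `length`."""
    i = bisect.bisect_left(buckets, length)
    if i == len(buckets):
        raise ValueError(f"length {length} exceeds the largest bucket {buckets[-1]}")
    return buckets[i]


def padding_waste(lengths, buckets=BUCKETS) -> float:
    """Fraction of padded positions that carry no real data."""
    padded = sum(next_bucket(n, buckets) for n in lengths)
    return 1.0 - sum(lengths) / padded


requests = [100, 200, 300, 900]
print([next_bucket(n) for n in requests])        # bucket chosen per request
print(f"wasted: {padding_waste(requests):.1%}")  # share of compute spent on padding
```

For these four requests the padded total is 1920 positions against 1500 real ones, so about 22% of the work is padding, inside the 10-30% range quoted for Option A. Finer buckets reduce the waste but raise the number of graphs to capture and keep resident.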

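15.6.13 CUDA Graph gains across model sizes

Measured data as reported by the author (H100, FP16, batch=1):

Model	No compile	`default`	`reduce-overhead`
ResNet-50 inference	2.5 ms	1.8 ms (1.4x)	0.6 ms (4.2x)
BERT-base inference	3.0 ms	2.4 ms (1.25x)	0.9 ms (3.3x)
Llama-7B prefill (seq=1024)	18 ms	12 ms (1.5x)	11 ms (1.6x)
Llama-7B decode	25 ms	12 ms (2.1x)	5 ms (5.0x)
Llama-70B prefill	95 ms	78 ms (1.2x)	75 ms (1.27x)
Llama-70B decode	70 ms	35 ms (2.0x)	22 ms (3.2x)

Patterns:

Small models / small batches: 3-4x end to end with reduce-overhead (CPU launch overhead dominates)
Large-model prefill: CUDA Graph adds under 10% on top of default (the kernels themselves are slow and launches are a small share)
Large-model decode: 3-5x end to end (hundreds of kernels per token, so launch cost accumulates)

Counter-example: Llama-70B prefill is only 1.27x faster with reduce-overhead, yet it costs several extra seconds of graph capture and reserves PrivatePool memory. Leaving reduce-overhead off is the better choice here.

This table is the engineering basis for deciding whether to turn on reduce-overhead. Enabling it blindly risks OOM and slower compilation for little gain. Profile first, look at the share of launch overhead, then decide.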
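The speedups in the table are relative to the uncompiled baseline, which mixes Inductor's contribution with CUDA Graph's. Dividing the `default` latency by the `reduce-overhead` latency isolates what the graph alone adds. The sketch below only re-derives ratios from the table's own numbers.

```python
# (no compile, default, reduce-overhead) latencies in ms, copied from the table.
LATENCY_MS = {
    "ResNet-50 inference": (2.5, 1.8, 0.6),
    "BERT-base inference": (3.0, 2.4, 0.9),
    "Llama-7B prefill": (18.0, 12.0, 11.0),
    "Llama-7B decode": (25.0, 12.0, 5.0),
    "Llama-70B prefill": (95.0, 78.0, 75.0),
    "Llama-70B decode": (70.0, 35.0, 22.0),
}


def graph_only_gain(default_ms: float, reduce_overhead_ms: float) -> float:
    """Speedup contributed by CUDA Graph on top of plain Inductor."""
    return default_ms / reduce_overhead_ms


gains = {name: graph_only_gain(d, r) for name, (_, d, r) in LATENCY_MS.items()}
for name, gain in gains.items():
    print(f"{name:22s} {gain:.2f}x")
```

Seen this way the contrast is sharper: the graph alone contributes about 2.4x for Llama-7B decode but only about 1.04x for Llama-70B prefill.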

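15.6.14 CUDA Graph on multiple GPUs

In multi-GPU training or inference each GPU records its own graph. Key points:

One PrivatePool per GPU: no interference between them
One graph instance per GPU: rank 0 and rank 1 each capture independently
Collective coordination: during recording all ranks call the NCCL collective together and capture records it; at replay all ranks launch their graphs together and the collective synchronizes naturally

Startup order matters: all ranks must enter capture together, otherwise some ranks wait on a collective while others are still capturing, and the job hangs. A dist.barrier() before capture is the usual way to line the ranks up.

@torch.compile(mode='reduce-overhead')
def forward_loss(model, batch):
    return model(batch).sum()

# 4-GPU training
dist.barrier()    # required: sync all ranks
for i in range(num_warmup):
    forward_loss(model, batch).backward()   # backward is called outside the compiled function
dist.barrier()    # sync again once warmup is done

Failure mode: ranks capture at different times → "watchdog timeout" (NCCL's internal watchdog detects that some ranks are not responding). Chapter 16 on process groups covers the watchdog mechanism.

From experience: before combining distributed execution with reduce-overhead, get it working on a single GPU, profile the gain, and confirm the switch is worth it; add the barrier logic when moving to the distributed deployment.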

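15.6.15 The PyTorch wrapper API for capture

PyTorch exposes torch.cuda.CUDAGraph as a high-level wrapper for capture:

static_input = torch.randn(batch_shape, device='cuda')   # fixed buffer, refilled before each replay

# warmup on a side stream, as the PyTorch docs recommend
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        optimizer.zero_grad(set_to_none=True)
        loss = model(static_input).sum()
        loss.backward()
        optimizer.step()
torch.cuda.current_stream().wait_stream(s)

# capture
g = torch.cuda.CUDAGraph()
optimizer.zero_grad(set_to_none=True)      # grads are then allocated from the graph's pool
with torch.cuda.graph(g):
    static_loss = model(static_input).sum()
    static_loss.backward()
    optimizer.step()

# replay
for batch in dataloader:
    static_input.copy_(batch.input)        # same tensor, new data
    g.replay()
    print(static_loss.item())              # same output tensor, refreshed by the replay

Key constraints:

Input and output are fixed tensors: write new data with .copy_() before each replay; the kernels inside the graph still see the original tensors
The whole step is inside capture: forward + backward + optimizer are recorded together, otherwise grad state inside and outside the graph diverges
Warmup is required: run at least 3 iterations so the caching allocator stabilizes and autotune settles on block sizes

This API sits below torch.compile(mode='reduce-overhead') and gives precise control over which segment goes into the graph. Training frameworks such as Megatron-LM and DeepSpeed use this layer for their own hand-written graph capture rather than going through torch.compile.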

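15.6.16 vLLM piecewise CUDA Graph: what a production inference engine looks like

vLLM's V1 engine is one of the most sophisticated applications of CUDA Graph. The problems it faces:

Batch size changes dynamically (every inference request is scheduled independently)
Sequence length changes dynamically (the KV cache keeps growing)
Attention kernels must distinguish prefill from decode (different FlashAttention modes)

Piecewise CUDA Graph is vLLM's answer: cut the model forward into several pieces, record a graph for each, and combine them at run time according to the batch.

graph LR
    Input[input tensors] --> P[Pre-graph<br/>outside the graph: parse metadata, prepare scheduling]
    P --> G1[Graph 1: embedding]
    G1 --> G2[Graph 2: layer 1 attention<br/>per-batch-size graph]
    G2 --> G3[Graph 3: layers 2-32<br/>fused, per-shape graph]
    G3 --> G4[Graph 4: norm + lm_head]
    G4 --> Out[output]

    style P fill:#fee2e2,stroke:#ef4444
    style G2 fill:#dcfce7
    style G3 fill:#fef3c7

Each piece solves one sub-problem:

embedding graph: only the batch dimension of the input varies, so 16 batch-size buckets suffice
attention graph: recorded per shape (different KV cache lengths map to different graphs)
layer graph: all transformer layers are fused together, and identical structures reuse one copy

Splitting brings the total number of graphs down from "whole model × 16 batch sizes × 8 sequence lengths = 128" to "4 pieces × 16 configurations = 64", halving memory use.

In its source (the author points to vllm/model_executor/models/llama.py and the V1 engine code; paths move between releases), vLLM captures graphs directly with torch.cuda.graph() instead of relying on mode='reduce-overhead'. Understanding this PyTorch API layer makes it clear how vLLM builds the scheme. It is a good sample of a compile stack and a production engine working together.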

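15.6.17 Quantifying the memory cost of a graph instance

How much device memory does one cudaGraphExec_t take? In the source, _count_storages_in_pool in cudagraph_trees.py computes PrivatePool usage. Measured data (Llama-7B):

Part	Size
Model weights (fp16)	~14 GB
KV cache (per token, batch=1)	~512 KB (2 × 32 layers × 4096 × 2 bytes)
graph executable (forward)	~50 MB
graph executable (backward)	~80 MB
PrivatePool (input/output/intermediate)	~500 MB
Total overhead per graph	~630 MB

If you record 16 graphs (different batch sizes), the extra memory is 16 × 630 MB ≈ 10 GB. An 80 GB H100 can absorb that; a 24 GB consumer GPU cannot.

Optimization techniques:

Share the PrivatePool: put the graphs in the same tree (cudagraph_trees) so the pool is reused → total overhead drops to 16 × 130 MB plus one 500 MB pool ≈ 2.6 GB
Release graph instances: a graph that has been idle for a long time can be destroyed with cudaGraphExecDestroy and re-instantiated when needed, at a cost of tens of ms per instantiate
Instantiate on demand: the author mentions a lazy_capture option added in v2.6 that creates graphs when first replayed, giving fast startup at the price of a slow first replay

These numbers tell you how many graphs are safe when memory is tight. Rule of thumb with a shared pool: each spare GB of device memory holds about 8 fp16 Llama-7B graphs (at ~130 MB each).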
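The budgeting arithmetic is easy to script. The sketch below uses the table's per-graph figures as constants and compares independent pools with a single shared pool; `graph_memory_mb` is a hypothetical helper, and real pool sizes depend on the model and batch size.

```python
EXEC_MB = 50 + 80   # forward + backward graph executables, from the table
POOL_MB = 500       # PrivatePool for inputs, outputs and intermediates


def graph_memory_mb(num_graphs: int, shared_pool: bool) -> int:
    """Extra device memory for `num_graphs` captured graphs, in MB."""
    if shared_pool:
        # Every graph keeps its executable; the pool is paid for once.
        return num_graphs * EXEC_MB + POOL_MB
    return num_graphs * (EXEC_MB + POOL_MB)


independent = graph_memory_mb(16, shared_pool=False)
shared = graph_memory_mb(16, shared_pool=True)
print(f"16 graphs, independent pools: {independent} MB")
print(f"16 graphs, shared pool:       {shared} MB")
print(f"saved: {independent - shared} MB")
```

With 16 graphs the shared pool removes 15 of the 16 pool copies, 7500 MB in this model. A shared pool is only sound when the graphs never need their buffers live at the same time.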

15.6.18 cudagraphify: the interface between Inductor and graphs

The cudagraphify function in torch/_inductor/cudagraph_trees.py (line 472 as measured on v2.11) is where Inductor calls into CUDA Graph:

def cudagraphify(
    model: Callable,
    inputs: List[torch.Tensor],
    static_input_idxs: Sequence[int] = (),
    is_backward: bool = False,
    ...
) -> Callable:
    """
    Wrap a callable as a CUDA Graph replay function.
    Returns a new callable that, when called:
      1. copies inputs into the static input tensors
      2. calls cudaGraphLaunch
      3. returns the static output tensors
    """

Key concepts:

static_input_idxs: marks which inputs have a fixed address (such as model weights, which are always the same tensor). These need no copy before each replay
dynamic input: tensors the user fills with new data on each call (such as the input batch). Recording uses a placeholder tensor, and replay copies the new data into it

The logic in detail:

# At recording time
static_inputs = [t.clone() if i not in static_input_idxs else t for i, t in enumerate(inputs)]
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    output = model(*static_inputs)
static_output = output

# At replay time
def wrapped(*new_inputs):
    for i, new_t in enumerate(new_inputs):
        if i not in static_input_idxs:
            static_inputs[i].copy_(new_t)
    graph.replay()
    return static_output

Replacing dynamic inputs with a copy lets the graph handle different input data on every call while keeping addresses fixed. The price is a few microseconds of copy before each replay, far less than the launch overhead it saves.

Knowing this interface removes the confusion when reading Inductor's generated wrapper code: the static_inputs[i].copy_(...) statements were not written by the user, they were wrapped in automatically by cudagraphify.
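The copy-in pattern can be demonstrated without a GPU by letting Python lists play the role of tensors. This is a toy model, not Inductor's implementation: "capture" merely stores a function over fixed buffers, which is the one property a real CUDA Graph depends on. `StaticReplayer` and `add` are names invented for the sketch.

```python
from typing import Callable, List, Sequence


class StaticReplayer:
    """Toy version of the copy-in/replay pattern using lists as 'tensors'."""

    def __init__(self, fn: Callable, example_inputs: Sequence[list],
                 static_input_idxs: Sequence[int] = ()):
        self._fn = fn
        self.static_idxs = set(static_input_idxs)
        # Dynamic inputs get a private placeholder; static ones are used in place.
        self.buffers: List[list] = [
            buf if i in self.static_idxs else list(buf)
            for i, buf in enumerate(example_inputs)
        ]
        # The "capture" run fixes the identity of the output buffer.
        self.output: list = list(fn(*self.buffers))

    def __call__(self, *new_inputs: list) -> list:
        for i, new in enumerate(new_inputs):
            if i not in self.static_idxs:
                self.buffers[i][:] = new           # in-place copy keeps the buffer identity
        self.output[:] = self._fn(*self.buffers)   # "replay" over the same buffers
        return self.output


def add(x: list, w: list) -> list:
    return [a + b for a, b in zip(x, w)]


weights = [1.0, 2.0]
replayer = StaticReplayer(add, [[0.0, 0.0], weights], static_input_idxs=(1,))
placeholder_id = id(replayer.buffers[0])

out1 = replayer([1.0, 1.0], weights)
out2 = replayer([5.0, 5.0], weights)
print(out2, out1 is out2, id(replayer.buffers[0]) == placeholder_id)
```

The last line also shows a practical pitfall: `out1` and `out2` are the same object, so a result kept from an earlier replay is overwritten by the next one. Clone the output if it must outlive the following call.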

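15.6.19 Looking inside a CUDA Graph with Nsight

NVIDIA Nsight Systems is the standard tool for inspecting CUDA Graph performance:

nsys profile --trace=cuda,nvtx --cuda-graph-trace=node --output=report python train.py
nsys-ui report.nsys-rep

The timeline shows:

CUDA Graph Launch (purple rectangle): one cudaGraphLaunch on the CPU timeline, only a few microseconds wide
Kernels inside the graph: expanded on the GPU timeline, each kernel shown separately (this needs node-level graph tracing, as in the command above)
Dependencies between streams: arrows that reveal the parallel structure inside the graph
Memory copies: static_input copies appear as short H2D / D2D rectangles

Without graphs, CPU launches take 50%+ of the timeline. With graphs the launch share drops below 5%, and the visual contrast is striking.

Going further:

Gaps between kernels: these expose scheduling latency, which graphs cannot remove either (fusion has to address it)
Unnecessary stream syncs: some event waits turn out to be redundant, and the graph topology can be optimized
Unexpected memory allocation: if an allocation slips in during recording, the profile shows cudaMalloc inside the graph region, where it should not be

In practice: when tuning reduce-overhead mode, open Nsight and read the timeline first, then decide which flag to change. Tweaking max_autotune / cudagraph_dynamic_shape blindly wastes time.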

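15.6.20 cudaGraphExecUpdate: updating a graph in place

CUDA provides cudaGraphExecUpdate (present since CUDA 10.2): it lets an already instantiated graph take new parameters, such as kernel arguments or the size of a memory copy, without a fresh instantiate. According to the author, PyTorch v2.5+ uses it to reuse graphs whose parameters change only slightly.

Example: during inference batch_size stays the same but the input pointer changes (different requests place their input at different addresses). The traditional approach is to re-capture and re-instantiate; the alternative is cudaGraphExecUpdate(exec, new_graph), which patches the new graph's parameters into the existing exec. That takes a few hundred microseconds, far less than a new instantiate (tens of ms).

The author reports that inference engines such as vLLM and SGLang rely on this heavily, patching the graph per batch instead of re-capturing, for a 2-3x throughput gain. Whether a given engine version does so is worth checking in its source; copying new data into static buffers and replaying an unchanged graph (§15.6.18) is the other common technique.

Limitation: an update cannot change the graph's structure (node count, dependency edges), only node parameters (kernel args, memory addresses). Structural changes still require a new capture.

Either way, the takeaway is why an engine such as vLLM can serve hundreds of requests with different inputs per second: it does not recompile per request, it reuses one graph and changes only the data or parameters. This is a key engineering optimization in LLM serving engines.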

15.6.21 The .pt2 package: AOTI's artifact format

§15.6.7 mentioned that AOTI emits a .pt2 file. Here is what it looks like inside (an illustrative listing):

unzip -l model_aoti.pt2

Archive:  model_aoti.pt2
   Length      Date    Time    Name
---------     -------- -----   ----
     1024     2026-04-21 12:00   data/version
   524288     2026-04-21 12:00   data/weights/0.weight        # fp16 weight tensor
   524288     2026-04-21 12:00   data/weights/0.bias
       ...
  4194304     2026-04-21 12:00   data/aotinductor/cuPVQ.so    # compiled .so
    65536     2026-04-21 12:00   data/aotinductor/cuPVQ.cubin # GPU binary
     2048     2026-04-21 12:00   data/aotinductor/manifest.json # metadata
     1024     2026-04-21 12:00   archive_format

Contents of manifest.json (simplified):

{
  "model_name": "Llama-7B",
  "input_specs": [
    {"name": "input_ids", "dtype": "int64", "shape": [1, 1024]},
    {"name": "attention_mask", "dtype": "int64", "shape": [1, 1024]}
  ],
  "output_specs": [...],
  "cuda_version": "12.4",
  "sm_arch": "sm_90",
  "torch_version": "2.11.0",
  "metadata": {...}
}

Loading flow (C++ side):

The loader (AOTIModelPackageLoader in recent releases, which constructs an AOTIModelContainerRunner) opens the .pt2 and extracts it to a temp directory
dlopen loads the .so
dlsym fetches the AOTInductorModelContainerCreate function pointer
Calling it creates the model handle, which has the weight tensors mapped internally
Each user call to runner.run(inputs) goes straight into the forward function inside the .so

The .pt2 format is positioned as the future of PyTorch model distribution, replacing both the v1 .pt (pickle) and the TorchScript .pt (C++ inference). The author also notes integration work with HuggingFace's safetensors (from v2.6, weights inside a .pt2 can be stored in safetensors form).

Because a .pt2 is a zip with a standard layout, you can simply unzip it when debugging a deployment, for example to check that the weights are right or that the cubin matches the SM architecture. This kind of open format reflects the PyTorch team's engineering philosophy.

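15.6.22 Cross-platform compatibility of compiled artifacts

How portable .pt2 artifacts are across environments:

Dimension	Compatibility
CUDA driver version	Backward compatible (built with 12.4, runs on 12.5+)
CUDA toolkit version	Not forward compatible (built with 12.4, running on 12.0 fails with "PTX version too high")
SM architecture	Not compatible at all (an artifact built for A100 sm_80 cannot run on H100 sm_90, and vice versa)
PyTorch version	Compatible with the next minor version only (v2.11 → v2.12 works, v2.11 → v2.13 is not guaranteed)
Operating system	Linux x86_64 and ARM are incompatible (the .so is platform-specific)

Engineering consequences:

Deploying to several GPU types: build one .pt2 per SM architecture (there is no fat binary, because cubins are SM-specific)
PyTorch upgrades: an old .pt2 may not run on a new PyTorch, so rebuild on every upgrade
Docker image layering: pin CUDA + PyTorch versions in the base image to reduce compatibility problems

A sound deployment flow:

The training machine exports the .pt2 (with metadata)
The deployment pipeline detects the target's SM architecture and CUDA / PyTorch versions
Mismatch → re-export on the target (same PyTorch model, fresh AOT compile)
Match → deploy directly

Engines such as vLLM and TensorRT-LLM build in this "compile for the target hardware" flow. The compatibility matrix shows that .pt2 is not a "build once, run anywhere" silver bullet but an intermediate artifact compiled for specific hardware.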
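The match/mismatch step of that flow can be written as a small check over the manifest fields shown in §15.6.21. Everything here is a sketch: the field names follow the chapter's simplified manifest, `needs_recompile` is a hypothetical helper, and the rules encode the compatibility table rather than an official policy.

```python
def parse_version(version: str) -> tuple:
    """'12.4' -> (12, 4); '2.11.0' -> (2, 11). Only major.minor is compared."""
    return tuple(int(part) for part in version.split(".")[:2])


def needs_recompile(manifest: dict, target: dict) -> list:
    """Return the reasons a .pt2 must be rebuilt for `target` (empty = deployable)."""
    reasons = []
    if manifest["sm_arch"] != target["sm_arch"]:
        reasons.append("SM architecture differs")
    if parse_version(target["cuda_version"]) < parse_version(manifest["cuda_version"]):
        reasons.append("target CUDA is older than the build CUDA")
    build_major, build_minor = parse_version(manifest["torch_version"])
    tgt_major, tgt_minor = parse_version(target["torch_version"])
    if tgt_major != build_major or not (0 <= tgt_minor - build_minor <= 1):
        reasons.append("PyTorch version outside the supported window")
    return reasons


manifest = {"sm_arch": "sm_90", "cuda_version": "12.4", "torch_version": "2.11.0"}
same_gpu = {"sm_arch": "sm_90", "cuda_version": "12.5", "torch_version": "2.12.0"}
older_gpu = {"sm_arch": "sm_80", "cuda_version": "12.4", "torch_version": "2.11.0"}
newer_torch = {"sm_arch": "sm_90", "cuda_version": "12.4", "torch_version": "2.13.0"}

for name, target in [("same_gpu", same_gpu), ("older_gpu", older_gpu), ("newer_torch", newer_torch)]:
    print(name, needs_recompile(manifest, target) or "deploy as-is")
```

Returning the reasons rather than a bare boolean gives the pipeline something to log when it decides to rebuild on the target machine.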

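15.6.23 Engineering trade-offs of compile time

torch.compile is slow on the first run. Where does the time go? A breakdown (Llama-7B forward + backward):

Stage	Time	Notes
Dynamo trace	800 ms	inlines the whole forward
AOTAutograd functionalize + partition	1200 ms	traces backward + min-cut
Inductor lowering (ATen → IR)	600 ms
Inductor scheduling + fusion	400 ms
Inductor codegen (generates Triton)	800 ms	several hundred kernels
Triton compile (Triton DSL → PTX → cubin)	8000 ms	The bulk: tens of ms per kernel
autotune (extra, max_autotune mode)	5000 ms	tries 5-10 block sizes
Total (default)	~12 s
Total (max_autotune)	~17 s

Triton compilation takes about two-thirds of the default total (8 s of ~12 s). Ways to reduce it:

Warm the compile cache: write cubins to disk after the first run so the next process loads them directly
AsyncCompile: parallel compilation (§14.x), with several workers compiling different kernels at once
Skip autotune: the production recommendation is to compile once with max_autotune, store the result in the cache, and use plain default mode afterwards

Measured: with AsyncCompile and 8 workers, Triton compile time drops from 8 s to 2 s. With TORCHINDUCTOR_CACHE_DIR, the second start hits the cache and takes under 100 ms.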
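Summing the table's rows makes the shares explicit and shows how much total compile time parallel Triton compilation actually recovers. The sketch below uses only the stage times listed above; the dictionary keys are labels chosen for this illustration.

```python
PHASES_MS = {
    "dynamo_trace": 800,
    "aot_autograd": 1200,
    "inductor_lowering": 600,
    "inductor_scheduling": 400,
    "inductor_codegen": 800,
    "triton_compile": 8000,
}
AUTOTUNE_MS = 5000   # extra cost in max_autotune mode

total_default = sum(PHASES_MS.values())
total_max_autotune = total_default + AUTOTUNE_MS
triton_share = PHASES_MS["triton_compile"] / total_default


def total_with_faster_triton(speedup: float) -> float:
    """Default-mode total if only the Triton stage gets `speedup` times faster."""
    triton = PHASES_MS["triton_compile"]
    return total_default - triton + triton / speedup


print(f"default total:      {total_default / 1000:.1f} s")
print(f"max_autotune total: {total_max_autotune / 1000:.1f} s")
print(f"Triton share:       {triton_share:.0%}")
print(f"Triton 8 s -> 2 s:  {total_with_faster_triton(4.0) / 1000:.1f} s total")
```

The rows add up to 11.8 s for default and 16.8 s with autotune. Cutting Triton from 8 s to 2 s halves the default total to 5.8 s, after which the remaining stages, all serial here, become the next bottleneck.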

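Engineering decisions:

Training jobs: compile time is a one-off cost and acceptable
Inference service cold start: compile time equals go-live latency. Precompiling, writing to disk, and shipping the cache in an image layer is standard practice
Debugging and editing code: every code change invalidates the cache and triggers another compile of a few seconds to tens of seconds. This is a workflow pain point

Knowing how compile time is distributed lets you optimize where it counts. If the pain is frequent code edits with slow recompiles, look at the cache hit rate and whether partial caching is possible (so that changing one piece of code does not invalidate everything); if the pain is slow cold start, look at on-disk caching and baking the cache into images.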

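15.6.24 Compatibility of reduce-overhead with autocast

Chapter 20 covers mixed precision (autocast). How well does autocast work with CUDA Graph?

Early v2.x: fragile. autocast decides fp16 versus fp32 per op at dispatch time, and CUDA Graph recording cannot represent that kind of runtime branch, so combining reduce-overhead with autocast could fail.

v2.4+: autocast decisions are lowered into the graph's static structure (the author attributes this to lower_to_aot_decomp in the FX graph stage). Each op's dtype is fixed at trace time and the graph contains no runtime type dispatch, so autocast and reduce-overhead work together.

Measured: Llama-7B training with fp16 autocast + reduce-overhead is 1.8x faster than pure fp32 + reduce-overhead and 1.5x faster than fp16 without reduce-overhead. Peak performance needs both optimizations stacked.

Engineering tip: autocast decisions are fixed during the Dynamo trace. If you trace under with autocast() and later call without autocast, the compiled artifact may misbehave (inconsistent dtypes). Put autocast inside the compiled function so Dynamo traces through it completely:

@torch.compile(mode='reduce-overhead')
def step(model, batch):
    with torch.autocast('cuda', dtype=torch.float16):
        return model(batch)

This way autocast and compile both end up in the graph, with no mismatch between an outer and an inner context.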

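15.6.25 Why warmup is required, internally

Before torch.cuda.graph(g) you must warm up a few times (3 is the number used in the PyTorch docs' examples). There are four reasons:

1. Stable caching-allocator state: the caching allocator from Chapter 4, §4.5 calls cudaMalloc the first time it sees a new shape. On the second and third runs with the same shape the allocation hits the cache and lands at the same address. Warmup brings about this steady state, so the addresses the recorded graph sees are reliable.

2. cuDNN / cuBLAS algorithm selection: the first call to at::native::cudnn_convolution_forward and similar functions can trigger algo selection (try several algorithms, keep the fastest, cache it), most visibly with cuDNN benchmarking enabled. Warmup finishes the selection so capture does not include this one-off cost.

3. NCCL communicator init: in distributed training the first NCCL collective sets up the communicator (tens of ms). Warmup gets it out of the way.

4. Lazy PTX → SASS compilation: on the first launch of a kernel the CUDA driver compiles PTX to SASS machine code (a few ms). Warmup completes this step.

Why are 3 runs enough? The first resolves allocation + algo + JIT; the second hits the caches and establishes stability; the third confirms it matches the second. Further runs have diminishing returns.

In practice: if some ops in your model are especially slow the first time (for example a slow cuDNN algo selection), raise warmup to 5 runs. For manual capture the count is simply the length of your own warmup loop; for Inductor, the author points to _inductor.config.triton.cudagraph_skip_dynamic_warmup, so check config.py in your version for the exact name and semantics.

Warmup is not superstition; it has concrete technical reasons. "Three warmup runs, then capture" in production code moves these one-off costs into the warmup phase so that capture records the steady state.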

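15.6.26 cudagraph_mark_step_begin: a hint that separates training steps

torch._dynamo.mark_static_address and torch.compiler.cudagraph_mark_step_begin are hint APIs that v2.x offers to users. The former marks a tensor's address as permanently fixed (so cudagraph_trees need not check it on every step); the latter tells cudagraph_trees that a new training step has begun.

# Training loop
for batch in dataloader:
    torch.compiler.cudagraph_mark_step_begin()    # hint!
    loss = model(batch).sum()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

Why is this needed? cudagraph_trees maintains a graph tree (§15.4) and has to know whether a step boundary lies between two invocations. At a step boundary it will:

Mark the previous step's outputs as releasable (freed inside the PrivatePool)
Realign inputs to the PrivatePool (to prevent address drift)
Check whether the graph cache is still valid

Without mark_step_begin, cudagraph_trees infers the boundary heuristically (the author gives detecting that input.grad was zeroed as an example). Heuristics have corner cases, so the explicit call is more robust. Per the author, HuggingFace's accelerate library inserts the call by default so users benefit transparently.

This hint shows that there is a protocol between compile and the PyTorch training loop: it is not a black box, and there are interfaces you can use to intervene. If production code in reduce-overhead mode hits sporadic "address mismatch" errors, adding mark_step_begin is the first thing to try.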

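15.6.27 ExecuTorch / mobile deployment: the compile stack's other destination

ExecuTorch, a separate project in the PyTorch ecosystem, is the "PyTorch on edge devices" solution. How it relates to AOTI:

AOTI: server-side ahead-of-time compilation that emits a .pt2 containing cubins (GPU binaries)
ExecuTorch: compilation for mobile / embedded targets that emits a .pte file (no cubin; it holds PyTorch IR plus delegate registrations)

graph TB
    Model[PyTorch nn.Module]
    Model --> Exp[torch.export]
    Exp --> EP[ExportedProgram]

    EP -->|server GPU| AOTI[AOTI compile]
    AOTI --> PT2[.pt2<br/>contains cubin]

    EP -->|mobile| ET[ExecuTorch backend]
    ET --> PTE[.pte<br/>no GPU binary]
    PTE --> XNN[XNNPACK]
    PTE --> COR[Core ML]
    PTE --> NN[Android NNAPI]
    PTE --> QC[Qualcomm QNN]

    style AOTI fill:#dcfce7
    style ET fill:#fef3c7

ExecuTorch's core mechanism is the delegate: PyTorch IR is handed to the target platform's native engine (XNNPACK on CPU, Core ML on Apple, NNAPI on Android, QNN on Qualcomm). A .pte stores the partitioned IR plus a delegation table, and the runtime hands each partition to the matching engine.

Engineering practice:

Server inference: AOTI (.pt2 + libtorch)
iOS apps: ExecuTorch + Core ML delegate (uses the Apple Neural Engine automatically)
Android apps: ExecuTorch + XNNPACK / NNAPI delegate
Embedded / IoT: ExecuTorch + custom delegate (requires hardware-vendor support)

In v2.x the PyTorch team designed the export path as a unified front end: AOTI and ExecuTorch share torch.export as the entry point, produce the same ExportedProgram, and then branch to different backends. This is the architectural basis of "trace once, deploy to many targets".

The full picture shows PyTorch's ambition: not just a training framework but an end-to-end stack from training to deployment. The compile stack described in Chapters 12-15 is the core engine of that strategy.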

15.6.28 A diagnosis flow for compile-stack errors

When something fails, how do you tell which layer is responsible? The standard flow:

flowchart TD
    Err[Compile error / poor performance]
    Err --> Q1{Is there an error message?}
    Q1 -->|yes| L1[Read the first traceback line<br/>which file?]
    Q1 -->|no, performance problem| Profile[torch.profiler / Nsight]

    L1 --> CD[in torch/_dynamo/]
    L1 --> CA[in torch/_functorch/]
    L1 --> CI[in torch/_inductor/]
    L1 --> CT[in triton/]

    CD --> SD["TORCH_LOGS=dynamo,recompiles<br/>see where the trace gets stuck"]
    CA --> SA["TORCH_LOGS=aot_graphs<br/>inspect the backward graph"]
    CI --> SI["TORCH_LOGS=output_code<br/>see what Inductor generated"]
    CT --> ST["add prints by hand to check kernel inputs"]

    Profile --> P1{CPU launch share > 30%?}
    P1 -->|yes| RO[switch to reduce-overhead]
    P1 -->|no| P2{SM utilization inside kernels < 80%?}
    P2 -->|yes| MA[max_autotune + aggressive_fusion]
    P2 -->|no| P3{Any graph breaks?}
    P3 -->|yes| FG[fullgraph=True and refactor the code]

    style L1 fill:#fef3c7
    style Profile fill:#dbeafe

Common errors per layer:

Layer	Typical error	How to approach it
Dynamo	"Unsupported: ..."	graph-break on that op, or disable tracing for it
AOTAutograd	"fake tensor mode failed: ..."	the op has no fake tensor implementation registered; add a decomp
Inductor lowering	"no lowering for op X"	the op has no lowering registered; write a custom one
Inductor scheduling	"no fusion possible"	does not affect correctness; it is a performance hint
Triton compile	"register pressure too high"	turn off aggressive_fusion
CUDA Graph	"address mismatch"	add `cudagraph_mark_step_begin`

Following this flow locates the large majority of compile problems quickly. Changing mode flags blindly wastes time.
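The left half of the flowchart, mapping the first traceback frame to a layer and a logging setting, can be expressed as a lookup. The helper below is hypothetical and matches only on path fragments; the `TORCH_LOGS` values are the ones named in the flowchart.

```python
# (path fragment, layer, suggested next step), checked in order.
LAYER_HINTS = [
    ("torch/_dynamo/", "Dynamo", "TORCH_LOGS=dynamo,recompiles"),
    ("torch/_functorch/", "AOTAutograd", "TORCH_LOGS=aot_graphs"),
    ("torch/_inductor/", "Inductor", "TORCH_LOGS=output_code"),
    ("triton/", "Triton", "print kernel inputs manually"),
]


def locate_layer(traceback_path: str) -> tuple:
    """Guess the compile-stack layer from the file path in a traceback frame."""
    normalized = traceback_path.replace("\\", "/")
    for fragment, layer, hint in LAYER_HINTS:
        if fragment in normalized:
            return layer, hint
    return "unknown", "profile with torch.profiler or Nsight"


frames = [
    "/site-packages/torch/_dynamo/symbolic_convert.py",
    "/site-packages/torch/_inductor/codegen/triton.py",
    "/site-packages/triton/compiler/compiler.py",
    "/home/me/train.py",
]
for frame in frames:
    print(frame, "->", locate_layer(frame))
```

The order of the entries matters: a file such as `torch/_inductor/codegen/triton.py` belongs to Inductor, so the Inductor fragment is tested before the generic `triton/` one.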

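15.6.29 PyTorch compile vs JAX vs TF graph: a comparison of philosophies

Comparing PyTorch's compile stack with its peers clarifies the design choices each made:

Dimension	PyTorch (torch.compile)	JAX (jit)	TensorFlow Graph
Programming model	Imperative + compilation	Functional (pure functions + jit)	Declarative (build the graph, then run)
Side effects	Yes (SideEffects tracking)	No (pure functions required)	No (variables are graph nodes)
Control flow	graph break + re-trace	jax.lax.cond / jax.lax.scan under jit	tf.cond / tf.while_loop
Trace granularity	per-frame (PEP 523)	per-function (@jit decorator)	whole graph
Backward computation	AOTAutograd, automatic	grad / vjp / jvp	tf.GradientTape
Compiled artifact	Triton kernel + Python wrapper	XLA HLO → cubin	XLA HLO → cubin
Deployment path	AOTI + ExecuTorch	jax.export + IREE	SavedModel + TFLite

Core differences:

PyTorch: the dynamic graph is the top-level semantics and compilation is wrapped around it afterwards. The cost is a complex trace and graph-break handling; the benefit is that users get acceleration with little or no change to their code
JAX: pure functions are the top-level semantics and tracing is simple (pytree in, pytree out). The cost is that users must write in a functional style and express side effects explicitly
TF Graph (1.x): build the graph, then sess.run. The most static, the easiest to optimize, and the hardest to write. TF 2.x moved to eager + tf.function, a trace-on-call design close to JAX's

Why PyTorch came out ahead, in the author's view: users had been writing imperative Python for years, and the cost of rewriting was very high. The PyTorch v2.0 team chose to absorb the tracing complexity internally and leave the user interface unchanged, which the author sees as the root cause of researchers continuing to move from TF / JAX to PyTorch since v2.x. The engineering complexity behind the one-line @torch.compile decorator was spent on the core goal of not making users change their code.

The philosophical difference carries a lesson: good engineering is not only technical implementation but also a product decision about whom to optimize for and along which dimension. PyTorch optimized for researchers' iteration speed, TF optimized static graphs for deployment, and in the LLM era PyTorch has prevailed.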

15.6.30 TORCH_COMPILE_DEBUG: a full view of the compile flow

TORCH_COMPILE_DEBUG=1 makes PyTorch dump every intermediate product of compilation to disk. It is the heavy artillery for debugging compile problems:

TORCH_COMPILE_DEBUG=1 python train.py

Output directory (about 50-200 MB per compile; file names vary somewhat between versions):

torch_compile_debug/run_2026_04_28_15_30_xxx/
├── torchdynamo/
│   └── debug.log                  # detailed Dynamo trace log
└── torchinductor/
    └── model__0_forward_1.0/      # one directory per compiled graph
        ├── fx_graph_readable.py   # FX graph handed to Inductor, readable form
        ├── fx_graph_runnable.py   # standalone repro of that graph
        ├── ir_pre_fusion.txt      # Inductor IR before fusion
        ├── ir_post_fusion.txt     # after fusion
        └── output_code.py         # final wrapper, Triton kernel sources included

Every file is readable Python or text. You can see:

output_code.py is the final wrapper Inductor generates, with the call order, shapes and strides of every Triton kernel
ir_post_fusion.txt shows how each Inductor IR node was fused (which ops "fused_0" contains)
The Triton kernels in output_code.py are real Triton source, and the file can be run on its own (python output_code.py) when chasing a Triton bug

Debugging scenarios:

An op raises an error: find the op's position in the FX graph in fx_graph_readable.py
Fusion does not happen: compare ir_pre_fusion.txt with ir_post_fusion.txt to find out why
A kernel performs poorly: read its Triton source in output_code.py and compare expected shapes with actual ones

For more detail, TORCH_LOGS="+inductor" prints verbose logs (including the reason for each fusion decision). Combining TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+inductor,+aot,+dynamo" shows what every layer of the compile stack is doing.

15.6.31 What a real Inductor output_code looks like

To make "what compile produces" concrete, here is a real output_code.py sample (trimmed, for y = relu(x + bias)):

# output_code.py (generated by Inductor)
# (imports of torch, empty_strided_cuda and grid are omitted in this trimmed sample)
import triton
import triton.language as tl
from torch._inductor.runtime import triton_helpers

@triton.jit
def triton_poi_fused_relu_add_0(in_ptr0, in_ptr1, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    tmp0 = tl.load(in_ptr0 + xindex, xmask)        # x
    tmp1 = tl.load(in_ptr1 + xindex, xmask)        # bias
    tmp2 = tmp0 + tmp1                              # add
    tmp3 = tl.full([1], 0, tl.float32)
    tmp4 = triton_helpers.maximum(tmp3, tmp2)      # relu
    tl.store(out_ptr0 + xindex, tmp4, xmask)


def call(args):
    arg0_x, arg1_bias = args
    args.clear()
    out = empty_strided_cuda((1024,), (1,), torch.float32)
    triton_poi_fused_relu_add_0[grid(1024)](
        arg0_x, arg1_bias, out, 1024, XBLOCK=128, num_warps=4
    )
    return (out,)

What this shows:

Inductor fused add + relu into one Triton kernel (triton_poi_fused_relu_add_0; the name hints at the fused ops)
The kernel reads in_ptr0 / in_ptr1 directly, adds, applies relu and writes back at once, with no intermediate tensor: the best memory access pattern
The wrapper call creates the output buffer, launches the kernel and returns. This is the "static callable" that cudagraphify wraps

Without compile, eager execution of the same code (with bias the same shape as x) performs one add (read 2 tensors, write 1) plus one relu (read 1, write 1) = 5 memory accesses. After compile: 2 reads + 1 write = 3. For a memory-bound operator that is up to 5/3 ≈ 1.67x faster.

Reading output_code shows that the real gain from compile is not magic: it comes from eliminating the intermediate tensors between ATen ops. This is the fundamental mechanism by which Inductor accelerates PyTorch.
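The access counting generalizes to a simple traffic estimate. The sketch assumes a purely memory-bound kernel, tensors of equal size, and no cache reuse, so the ratio is an upper bound on the speedup rather than a measurement; `bytes_moved` is a helper invented for the illustration.

```python
def bytes_moved(numel: int, bytes_per_elem: int, reads: int, writes: int) -> int:
    """Global-memory traffic for tensors of `numel` elements each."""
    return numel * bytes_per_elem * (reads + writes)


NUMEL = 1024   # matches the (1024,) float32 buffer in the sample
FP32 = 4

# Eager: add reads x and bias and writes tmp; relu reads tmp and writes out.
eager = bytes_moved(NUMEL, FP32, reads=3, writes=2)
# Fused: read x and bias, write out; the intermediate never leaves registers.
fused = bytes_moved(NUMEL, FP32, reads=2, writes=1)

print(f"eager: {eager} bytes, fused: {fused} bytes")
print(f"upper bound on speedup: {eager / fused:.2f}x")
```

Longer pointwise chains save more: every additional fused op removes one write and one read of an intermediate tensor.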

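Compare this artifact with the Dynamo OutputGraph of §12.8.27: the FX Graph Dynamo produces is a sequence of logical operators, and Inductor compiles it into concrete GPU kernels plus a memory plan. The two layers have clearly separate responsibilities, so each can be optimized and replaced independently. This is why the PyTorch team has been able to keep adding optimizations to Inductor throughout the v2.x years (fusion patterns, autotune improvements, the evolution of cudagraph_trees) without disturbing the stability of the Dynamo interface above it. A stable interface with an evolving interior is the greatest engineering value of this compile stack, and the architectural principle most worth remembering after reading the source in Chapters 12-15.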

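15.7 A few engineering lessons

Using torch.compile end to end in practice:

1. Warmup is mandatory: the first forward compiles for seconds to minutes. Warm up a few batches before a production service takes traffic, or the first request will time out

2. Use mark_dynamic for dynamic shapes: torch._dynamo.mark_dynamic(input, 0) makes Inductor generate kernels that handle multiple batch sizes. Otherwise every shape recompiles

3. Persist compiled artifacts with TORCHINDUCTOR_CACHE_DIR: a restarted process does not recompile. This saves cold-start time in production services

4. A graph break disables reduce-overhead for that segment: fullgraph=True forbids breaks and forces you to write trace-friendly code

5. Avoid host-synchronizing transfers inside a compiled function: D2H reads such as .item() or .cpu() trigger a graph break or cannot be captured. Do them outside and pass the results in

6. CUDA Graph memory use: every graph takes at least one PrivatePool. Memory can balloon when training or serving with many shapes

7. torch._dynamo.reset() clears all caches: handy when switching model architectures or debugging

8. Be careful with reduce-overhead in distributed training: CUDA Graph is not fully compatible with NCCL collectives (v2.x keeps improving, but corner cases remain). Chapter 17 on DDP goes into detail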

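15.8 Cross-references

"vLLM 内核探秘" (an inside look at the vLLM core), Chapter 8, model runner: vLLM's V1 engine makes heavy use of piecewise CUDA Graphs to handle variable-length prefill / decode
Chapter 4, §4.9 PrivatePool: the core of the cooperation between CUDA Graph and the caching allocator
Chapters 12-14: this chapter ties those three together, and understanding the upstream pipeline is the basis for understanding this one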

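15.9 The whole compile stack in one sentence

The core idea of Chapters 12-15, condensed into one sentence:

torch.compile = "Python frame interception + automatic backward capture + functionalized IR + greedy fusion + Triton codegen + CUDA Graph wrapping": each piece solves one concrete problem, and chained together they let PyTorch keep the dynamic-graph experience while reaching performance close to a static graph.

With this pipeline in mind, it becomes clear why the PyTorch team spent roughly five years after v1.0 building toward v2.x: every piece represents years of research and tens of thousands of lines of engineering, stacked up behind the one-line @torch.compile decorator.

The next chapter opens Part 7 on distributed training: it takes apart ProcessGroup and NCCL, the communication base of multi-GPU training, to show how a collective reaches the GPU, how the watchdog detects a hang, and how communicators are cached and reused.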