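NVIDIA Nsight is a professional performance analysis toolset for CUDA applications. Its profiling module collects GPU hardware counters, memory access patterns, kernel execution timing, and related data, giving a quantitative basis for optimizing deep learning frameworks.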

DeepSpeed can be profiled directly with Nsight when it runs on a single machine, but multi-node runs are problematic. This post covers:

How to profile across nodes
How to profile selectively

Cross-node profiling

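In multi-node DeepSpeed training, the usual approach of launching the job directly under Nsight runs into process synchronization problems. The fix is to modify the subprocess call in the launcher source. The file to edit is ~/miniconda3/envs/dschat/lib/python3.9/site-packages/deepspeed/launcher/runner.py (the exact location depends on your environment).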

In the main function, find the line result = subprocess.Popen(cmd, env=env) and, just before it, rewrite the command as:

cmd = cmd[:8] + ['/usr/local/cuda-11.8/bin/nsys', 'profile'] + cmd[8:]

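With this change, every training node is launched through the nsys command-line tool, so profiling data is collected on all nodes during the same run. The CUDA toolkit path must match the version that is actually installed. With the latest DeepSpeed release, v0.16.8, verify compatibility with CUDA 12.x before relying on this setup.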
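
The splice above can be written as a small helper, which makes the insertion point explicit instead of hard-coding it in a slice. This is a minimal sketch, not part of DeepSpeed: the function name wrap_with_nsys and the demo command are hypothetical, and the index 8 used in the original edit reflects one particular launcher command layout, so check the actual contents of cmd (for example by printing it) before choosing the index.

```python
from typing import List, Optional, Sequence


def wrap_with_nsys(
    cmd: Sequence[str],
    insert_at: int,
    nsys_bin: str = "/usr/local/cuda-11.8/bin/nsys",  # adjust to the installed CUDA version
    nsys_args: Optional[Sequence[str]] = None,
) -> List[str]:
    """Return a copy of cmd with `nsys profile [nsys_args...]` spliced in at insert_at."""
    if not 0 <= insert_at <= len(cmd):
        raise ValueError(f"insert_at={insert_at} is outside the command (length {len(cmd)})")
    prefix = [nsys_bin, "profile", *(nsys_args or [])]
    # Everything before insert_at stays in front of nsys; everything after is run under it.
    return list(cmd[:insert_at]) + prefix + list(cmd[insert_at:])


# Hypothetical command, only to show the effect of the splice.
demo_cmd = ["launcher", "--opt", "value", "python", "train.py"]
wrapped = wrap_with_nsys(demo_cmd, insert_at=3)
print(wrapped)
```

In runner.py the equivalent call would be cmd = wrap_with_nsys(cmd, insert_at=8), placed just before the subprocess.Popen line. The helper returns a new list and leaves its input untouched.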

Selective profiling

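To profile only a specific phase of a full training run, use the CUDA profiler start/stop API together with NVTX range markers. Insert the following control logic into the training script: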

import torch

# args, dist, a, a_list, local_groups and the collective functions below
# come from the surrounding benchmark script.
warmup_iter = 5
for i in range(10):
    if i == warmup_iter:
        torch.cuda.cudart().cudaProfilerStart()  # begin capture once warm-up is done
    if i >= warmup_iter:
        torch.cuda.nvtx.range_push(f"iteration {i}")  # open an NVTX range for this iteration

    # Communication operator under test
    if args.method == 'reduce_scatter':
        ans = reduce_scatter_coalesced([a], dist.get_world_group())
    elif args.method == 'a2a_quant_reduce':
        ans = all_to_all_quant_reduce([a], local_groups)
    elif args.method == 'all_gather':
        dist.all_gather(a_list, a, group=dist.get_world_group())

    if i >= warmup_iter:
        torch.cuda.nvtx.range_pop()  # close the NVTX range
torch.cuda.cudart().cudaProfilerStop()  # end capture
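
The same start/push/pop/stop pattern can be packaged so that the NVTX range is always closed, even if the profiled step raises. The sketch below is one possible arrangement, not the author's code: ProfileWindow and torch_window are hypothetical names, and the profiler hooks are passed in as plain callables so the control flow can be checked without a GPU. torch_window wires in the same PyTorch calls used in the loop above.

```python
from contextlib import contextmanager


class ProfileWindow:
    """Start capture at start_iter, wrap each later iteration in a named range, stop once at the end."""

    def __init__(self, start_iter, start_fn, stop_fn, push_fn, pop_fn):
        self.start_iter = start_iter
        self.start_fn, self.stop_fn = start_fn, stop_fn
        self.push_fn, self.pop_fn = push_fn, pop_fn
        self.started = False

    @contextmanager
    def step(self, i):
        if i == self.start_iter and not self.started:
            self.start_fn()
            self.started = True
        active = self.started
        if active:
            self.push_fn(f"iteration {i}")
        try:
            yield
        finally:
            if active:
                self.pop_fn()  # keep push/pop balanced even on exceptions

    def close(self):
        if self.started:
            self.stop_fn()
            self.started = False


def torch_window(start_iter):
    """Build a ProfileWindow backed by the CUDA profiler API and NVTX (requires PyTorch with CUDA)."""
    import torch
    rt = torch.cuda.cudart()
    return ProfileWindow(start_iter, rt.cudaProfilerStart, rt.cudaProfilerStop,
                         torch.cuda.nvtx.range_push, torch.cuda.nvtx.range_pop)


# Dry run with recording hooks instead of real profiler calls.
events = []
window = ProfileWindow(
    start_iter=2,
    start_fn=lambda: events.append("start"),
    stop_fn=lambda: events.append("stop"),
    push_fn=lambda name: events.append(f"push {name}"),
    pop_fn=lambda: events.append("pop"),
)
for i in range(4):
    with window.step(i):
        pass  # the collective under test would run here
window.close()
print(events)
```

In a real script, window = torch_window(warmup_iter) replaces the recording hooks, and the communication call goes inside the with block.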

Run nsys with these additional options:

--capture-range=cudaProfilerApi --capture-range-end=stop-shutdown

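With this configuration, Nsight captures only the GPU activity between cudaProfilerStart and cudaProfilerStop, which avoids collecting redundant data. Combined with the NVTX markers, the timeline view makes it easy to see how communication and computation overlap and to spot gaps in the pipeline.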
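
When nsys is injected through the launcher rather than typed on the command line, these options have to become part of the spliced prefix. The sketch below builds that prefix as a list; the function name nsys_capture_prefix and the report name comm_bench_%h are hypothetical. The optional --output value relies on the nsys behavior of expanding %h to the host name, which gives each node its own report file; treat that as something to confirm against the nsys version you have installed.

```python
from typing import List, Optional


def nsys_capture_prefix(
    nsys_bin: str = "/usr/local/cuda-11.8/bin/nsys",  # adjust to the installed CUDA version
    output: Optional[str] = None,
) -> List[str]:
    """Build an `nsys profile` prefix that records only between cudaProfilerStart and cudaProfilerStop."""
    prefix = [
        nsys_bin,
        "profile",
        "--capture-range=cudaProfilerApi",      # wait for cudaProfilerStart before recording
        "--capture-range-end=stop-shutdown",    # end the session when cudaProfilerStop is called
    ]
    if output is not None:
        prefix.append(f"--output={output}")
    return prefix


# Hypothetical report name; %h is intended to become the host name on each node.
prefix = nsys_capture_prefix(output="comm_bench_%h")
print(" ".join(prefix))
```

In runner.py this list would take the place of the two-element ['/usr/local/cuda-11.8/bin/nsys', 'profile'] in the splice shown earlier.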

Last modified on 2025-05-22