mindstudio-probe 1.0.3__py3-none-any.whl → 1.0.4__py3-none-any.whl
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {mindstudio_probe-1.0.3.dist-info → mindstudio_probe-1.0.4.dist-info}/LICENSE +201 -201
- {mindstudio_probe-1.0.3.dist-info → mindstudio_probe-1.0.4.dist-info}/METADATA +36 -34
- mindstudio_probe-1.0.4.dist-info/RECORD +276 -0
- {mindstudio_probe-1.0.3.dist-info → mindstudio_probe-1.0.4.dist-info}/WHEEL +1 -1
- {mindstudio_probe-1.0.3.dist-info → mindstudio_probe-1.0.4.dist-info}/entry_points.txt +1 -0
- msprobe/README.md +101 -237
- msprobe/{config/config.json → config.json} +49 -49
- msprobe/core/advisor/advisor.py +124 -124
- msprobe/core/advisor/advisor_const.py +59 -59
- msprobe/core/advisor/advisor_result.py +58 -58
- msprobe/core/common/const.py +341 -318
- msprobe/core/common/exceptions.py +99 -99
- msprobe/core/common/{file_check.py → file_utils.py} +478 -283
- msprobe/core/common/log.py +76 -69
- msprobe/core/common/utils.py +385 -616
- msprobe/core/common_config.py +85 -71
- msprobe/core/compare/acc_compare.py +299 -298
- msprobe/core/compare/check.py +95 -95
- msprobe/core/compare/compare_cli.py +49 -49
- msprobe/core/compare/highlight.py +223 -222
- msprobe/core/compare/multiprocessing_compute.py +149 -149
- msprobe/core/compare/npy_compare.py +295 -295
- msprobe/core/compare/utils.py +430 -429
- msprobe/core/data_dump/data_collector.py +154 -144
- msprobe/core/data_dump/data_processor/base.py +314 -293
- msprobe/core/data_dump/data_processor/factory.py +59 -59
- msprobe/core/data_dump/data_processor/mindspore_processor.py +186 -198
- msprobe/core/data_dump/data_processor/pytorch_processor.py +366 -389
- msprobe/core/data_dump/json_writer.py +96 -116
- msprobe/core/data_dump/scope.py +178 -178
- msprobe/core/grad_probe/constant.py +70 -70
- msprobe/core/grad_probe/grad_compare.py +171 -175
- msprobe/core/grad_probe/utils.py +64 -52
- msprobe/docs/01.installation.md +89 -0
- msprobe/docs/02.config_introduction.md +165 -0
- msprobe/docs/03.config_examples.md +247 -0
- msprobe/docs/04.acl_config_examples.md +76 -0
- msprobe/docs/05.data_dump_PyTorch.md +198 -0
- msprobe/docs/06.data_dump_MindSpore.md +243 -0
- msprobe/docs/07.accuracy_checker_PyTorch.md +274 -0
- msprobe/docs/08.accuracy_checker_online_PyTorch.md +198 -0
- msprobe/docs/09.accuracy_checker_MindSpore.md +68 -0
- msprobe/docs/10.accuracy_compare_PyTorch.md +245 -0
- msprobe/docs/11.accuracy_compare_MindSpore.md +202 -0
- msprobe/docs/12.overflow_check_PyTorch.md +79 -0
- msprobe/docs/13.overflow_check_MindSpore.md +31 -0
- msprobe/{pytorch/doc/parse_tool.md → docs/14.data_parse_PyTorch.md} +283 -286
- msprobe/docs/15.free_benchmarking_PyTorch.md +164 -0
- msprobe/{doc/grad_probe/grad_probe.md → docs/17.grad_probe.md} +207 -207
- msprobe/docs/FAQ_PyTorch.md +177 -0
- msprobe/docs/S02.report_free_benchmarking_validation_performance_baseline.md +146 -0
- msprobe/docs/img/free_benchmark_framework.png +0 -0
- msprobe/mindspore/__init__.py +1 -1
- msprobe/mindspore/api_accuracy_checker/api_accuracy_checker.py +254 -245
- msprobe/mindspore/api_accuracy_checker/api_info.py +69 -69
- msprobe/mindspore/api_accuracy_checker/api_runner.py +155 -151
- msprobe/mindspore/api_accuracy_checker/base_compare_algorithm.py +196 -196
- msprobe/mindspore/api_accuracy_checker/cmd_parser.py +6 -0
- msprobe/mindspore/api_accuracy_checker/compute_element.py +238 -223
- msprobe/mindspore/api_accuracy_checker/main.py +8 -15
- msprobe/mindspore/api_accuracy_checker/type_mapping.py +113 -113
- msprobe/mindspore/api_accuracy_checker/utils.py +79 -62
- msprobe/mindspore/cell_processor.py +34 -34
- msprobe/mindspore/common/const.py +106 -87
- msprobe/mindspore/common/log.py +37 -37
- msprobe/mindspore/common/utils.py +81 -57
- msprobe/mindspore/compare/distributed_compare.py +75 -75
- msprobe/mindspore/compare/ms_compare.py +219 -117
- msprobe/mindspore/compare/ms_graph_compare.py +348 -317
- msprobe/mindspore/compare/ms_to_pt_api.yaml +399 -399
- msprobe/mindspore/debugger/debugger_config.py +66 -74
- msprobe/mindspore/debugger/precision_debugger.py +126 -107
- msprobe/mindspore/dump/dump_tool_factory.py +35 -35
- msprobe/mindspore/dump/hook_cell/api_registry.py +118 -104
- msprobe/mindspore/dump/hook_cell/hook_cell.py +55 -53
- msprobe/mindspore/dump/hook_cell/support_wrap_ops.yaml +922 -925
- msprobe/mindspore/dump/hook_cell/wrap_api.py +113 -0
- msprobe/mindspore/dump/jit_dump.py +72 -56
- msprobe/mindspore/dump/kernel_graph_dump.py +59 -60
- msprobe/mindspore/dump/kernel_kbyk_dump.py +64 -65
- msprobe/mindspore/free_benchmark/api_pynative_self_check.py +116 -116
- msprobe/mindspore/free_benchmark/common/config.py +12 -12
- msprobe/mindspore/free_benchmark/common/handler_params.py +17 -17
- msprobe/mindspore/free_benchmark/common/utils.py +71 -71
- msprobe/mindspore/free_benchmark/data/support_wrap_ops.yaml +842 -842
- msprobe/mindspore/free_benchmark/decorator/dec_forward.py +43 -42
- msprobe/mindspore/free_benchmark/decorator/decorator_factory.py +107 -107
- msprobe/mindspore/free_benchmark/handler/base_handler.py +90 -90
- msprobe/mindspore/free_benchmark/handler/check_handler.py +41 -41
- msprobe/mindspore/free_benchmark/handler/fix_handler.py +36 -36
- msprobe/mindspore/free_benchmark/handler/handler_factory.py +21 -21
- msprobe/mindspore/free_benchmark/perturbation/add_noise.py +67 -67
- msprobe/mindspore/free_benchmark/perturbation/base_perturbation.py +21 -21
- msprobe/mindspore/free_benchmark/perturbation/bit_noise.py +63 -63
- msprobe/mindspore/free_benchmark/perturbation/exchange_value.py +51 -0
- msprobe/mindspore/free_benchmark/perturbation/improve_precision.py +35 -34
- msprobe/mindspore/free_benchmark/perturbation/no_change.py +12 -12
- msprobe/mindspore/free_benchmark/perturbation/perturbation_factory.py +29 -27
- msprobe/mindspore/free_benchmark/self_check_tool_factory.py +33 -33
- msprobe/mindspore/grad_probe/global_context.py +90 -91
- msprobe/mindspore/grad_probe/grad_analyzer.py +231 -231
- msprobe/mindspore/grad_probe/grad_monitor.py +27 -27
- msprobe/mindspore/grad_probe/grad_stat_csv.py +131 -131
- msprobe/mindspore/grad_probe/hook.py +94 -92
- msprobe/mindspore/grad_probe/utils.py +29 -28
- msprobe/mindspore/ms_config.py +128 -126
- msprobe/mindspore/overflow_check/kernel_graph_overflow_check.py +44 -45
- msprobe/mindspore/overflow_check/overflow_check_tool_factory.py +34 -34
- msprobe/mindspore/runtime.py +4 -4
- msprobe/mindspore/service.py +378 -354
- msprobe/mindspore/task_handler_factory.py +24 -24
- msprobe/msprobe.py +105 -107
- msprobe/pytorch/__init__.py +3 -3
- msprobe/pytorch/api_accuracy_checker/common/config.py +53 -55
- msprobe/pytorch/api_accuracy_checker/common/utils.py +214 -165
- msprobe/pytorch/api_accuracy_checker/compare/algorithm.py +213 -213
- msprobe/pytorch/api_accuracy_checker/compare/api_precision_compare.py +606 -581
- msprobe/pytorch/api_accuracy_checker/compare/api_precision_standard.yaml +132 -132
- msprobe/pytorch/api_accuracy_checker/compare/api_precision_threshold.yaml +390 -390
- msprobe/pytorch/api_accuracy_checker/compare/compare.py +386 -381
- msprobe/pytorch/api_accuracy_checker/compare/compare_column.py +73 -73
- msprobe/pytorch/api_accuracy_checker/compare/compare_utils.py +245 -244
- msprobe/pytorch/api_accuracy_checker/config.yaml +10 -10
- msprobe/pytorch/api_accuracy_checker/run_ut/data_generate.py +335 -332
- msprobe/pytorch/api_accuracy_checker/run_ut/multi_run_ut.py +200 -199
- msprobe/pytorch/api_accuracy_checker/run_ut/run_overflow_check.py +133 -134
- msprobe/pytorch/api_accuracy_checker/run_ut/run_ut.py +592 -581
- msprobe/pytorch/api_accuracy_checker/run_ut/run_ut_utils.py +70 -74
- msprobe/pytorch/api_accuracy_checker/run_ut/torch_ut_setting.json +7 -4
- msprobe/pytorch/api_accuracy_checker/tensor_transport_layer/attl.py +197 -202
- msprobe/pytorch/api_accuracy_checker/tensor_transport_layer/client.py +325 -324
- msprobe/pytorch/api_accuracy_checker/tensor_transport_layer/device_dispatch.py +204 -204
- msprobe/pytorch/api_accuracy_checker/tensor_transport_layer/server.py +219 -218
- msprobe/pytorch/api_accuracy_checker/tensor_transport_layer/ssl_config.py +10 -10
- msprobe/pytorch/bench_functions/__init__.py +15 -15
- msprobe/pytorch/bench_functions/apply_adam_w.py +28 -28
- msprobe/pytorch/bench_functions/confusion_transpose.py +19 -19
- msprobe/pytorch/bench_functions/fast_gelu.py +55 -55
- msprobe/pytorch/bench_functions/layer_norm_eval.py +6 -6
- msprobe/pytorch/bench_functions/linear.py +12 -12
- msprobe/pytorch/bench_functions/matmul_backward.py +48 -48
- msprobe/pytorch/bench_functions/npu_fusion_attention.py +509 -421
- msprobe/pytorch/bench_functions/rms_norm.py +15 -15
- msprobe/pytorch/bench_functions/rotary_mul.py +52 -52
- msprobe/pytorch/bench_functions/scaled_mask_softmax.py +26 -26
- msprobe/pytorch/bench_functions/swiglu.py +55 -55
- msprobe/pytorch/common/__init__.py +2 -2
- msprobe/pytorch/common/compare_script.template +14 -14
- msprobe/pytorch/common/log.py +20 -31
- msprobe/pytorch/common/parse_json.py +39 -39
- msprobe/pytorch/common/utils.py +305 -300
- msprobe/pytorch/compare/distributed_compare.py +66 -66
- msprobe/pytorch/compare/mapping.yaml +607 -607
- msprobe/pytorch/compare/match.py +34 -33
- msprobe/pytorch/compare/pt_compare.py +50 -40
- msprobe/pytorch/debugger/debugger_config.py +95 -95
- msprobe/pytorch/debugger/precision_debugger.py +125 -125
- msprobe/pytorch/free_benchmark/__init__.py +8 -8
- msprobe/pytorch/free_benchmark/common/constant.py +70 -70
- msprobe/pytorch/free_benchmark/common/counter.py +71 -71
- msprobe/pytorch/free_benchmark/common/enums.py +37 -37
- msprobe/pytorch/free_benchmark/common/params.py +129 -129
- msprobe/pytorch/free_benchmark/common/utils.py +102 -102
- msprobe/pytorch/free_benchmark/compare/grad_saver.py +179 -179
- msprobe/pytorch/free_benchmark/compare/single_benchmark.py +104 -104
- msprobe/pytorch/free_benchmark/main.py +105 -105
- msprobe/pytorch/free_benchmark/perturbed_layers/base_layer.py +13 -13
- msprobe/pytorch/free_benchmark/perturbed_layers/layer_factory.py +41 -41
- msprobe/pytorch/free_benchmark/perturbed_layers/npu/add_noise.py +90 -90
- msprobe/pytorch/free_benchmark/perturbed_layers/npu/bit_noise.py +104 -104
- msprobe/pytorch/free_benchmark/perturbed_layers/npu/change_value.py +63 -63
- msprobe/pytorch/free_benchmark/perturbed_layers/npu/improve_precision.py +68 -68
- msprobe/pytorch/free_benchmark/perturbed_layers/npu/no_change.py +28 -28
- msprobe/pytorch/free_benchmark/perturbed_layers/npu/npu_base_layser.py +45 -45
- msprobe/pytorch/free_benchmark/perturbed_layers/run_cpu.py +19 -19
- msprobe/pytorch/free_benchmark/result_handlers/base_handler.py +217 -217
- msprobe/pytorch/free_benchmark/result_handlers/check_handler.py +39 -39
- msprobe/pytorch/free_benchmark/result_handlers/fix_handler.py +23 -23
- msprobe/pytorch/free_benchmark/result_handlers/handler_factory.py +30 -30
- msprobe/pytorch/free_benchmark/result_handlers/preheat_handler.py +170 -170
- msprobe/pytorch/function_factory.py +76 -75
- msprobe/pytorch/functional/dump_module.py +39 -39
- msprobe/pytorch/grad_probe/grad_monitor.py +91 -90
- msprobe/pytorch/grad_probe/grad_stat_csv.py +128 -128
- msprobe/pytorch/hook_module/api_registry.py +161 -161
- msprobe/pytorch/hook_module/hook_module.py +120 -120
- msprobe/pytorch/hook_module/support_wrap_ops.yaml +1879 -1877
- msprobe/pytorch/hook_module/utils.py +30 -29
- msprobe/pytorch/hook_module/wrap_aten.py +110 -110
- msprobe/pytorch/hook_module/wrap_distributed.py +78 -78
- msprobe/pytorch/hook_module/wrap_functional.py +105 -105
- msprobe/pytorch/hook_module/wrap_npu_custom.py +93 -84
- msprobe/pytorch/hook_module/wrap_tensor.py +71 -71
- msprobe/pytorch/hook_module/wrap_torch.py +86 -86
- msprobe/pytorch/hook_module/wrap_vf.py +62 -62
- msprobe/pytorch/module_processer.py +138 -138
- msprobe/pytorch/online_dispatch/__init__.py +20 -20
- msprobe/pytorch/online_dispatch/compare.py +236 -236
- msprobe/pytorch/online_dispatch/dispatch.py +271 -271
- msprobe/pytorch/online_dispatch/dump_compare.py +155 -156
- msprobe/pytorch/online_dispatch/single_compare.py +391 -391
- msprobe/pytorch/online_dispatch/torch_ops_config.yaml +49 -49
- msprobe/pytorch/online_dispatch/utils.py +130 -146
- msprobe/pytorch/parse.py +4 -4
- msprobe/pytorch/parse_tool/cli.py +32 -32
- msprobe/pytorch/parse_tool/lib/compare.py +260 -271
- msprobe/pytorch/parse_tool/lib/config.py +52 -52
- msprobe/pytorch/parse_tool/lib/file_desc.py +31 -31
- msprobe/pytorch/parse_tool/lib/interactive_cli.py +102 -102
- msprobe/pytorch/parse_tool/lib/parse_exception.py +54 -54
- msprobe/pytorch/parse_tool/lib/parse_tool.py +158 -158
- msprobe/pytorch/parse_tool/lib/utils.py +316 -321
- msprobe/pytorch/parse_tool/lib/visualization.py +85 -91
- msprobe/pytorch/pt_config.py +188 -187
- msprobe/pytorch/service.py +246 -252
- mindstudio_probe-1.0.3.dist-info/RECORD +0 -272
- msprobe/config/README.md +0 -539
- msprobe/mindspore/doc/compare.md +0 -58
- msprobe/mindspore/doc/dump.md +0 -217
- msprobe/mindspore/dump/hook_cell/wrap_functional.py +0 -91
- msprobe/mindspore/dump/hook_cell/wrap_tensor.py +0 -63
- msprobe/pytorch/doc/FAQ.md +0 -193
- msprobe/pytorch/doc/api_accuracy_checker.md +0 -313
- msprobe/pytorch/doc/api_accuracy_checker_online.md +0 -187
- msprobe/pytorch/doc/dump.md +0 -260
- msprobe/pytorch/doc/msprobe/321/207/342/226/223/342/225/233/321/205/342/225/221/320/266/321/205/342/225/226/320/265/321/205/320/225/342/225/226/321/206/320/245/342/226/221/321/206/320/235/320/276dump/321/206/320/260/320/227/321/205/320/227/320/226/321/206/320/220/320/267/321/210/320/223/342/225/234/321/205/320/257/342/225/221/321/207/342/225/221/342/224/220/321/206/320/232/320/265/321/205/320/241/320/232.md +0 -182
- msprobe/pytorch/doc/ptdbg_ascend_compare.md +0 -240
- msprobe/pytorch/doc/ptdbg_ascend_overview.md +0 -68
- msprobe/pytorch/doc/ptdbg_ascend_quickstart.md +0 -381
- msprobe/pytorch/doc/run_overflow_check.md +0 -25
- msprobe/pytorch/doc//321/205/320/254/320/270/321/207/342/225/221/342/224/220/321/207/342/226/223/342/225/233/321/205/342/225/221/320/266/321/206/320/277/320/244/321/205/320/277/342/225/243.md +0 -90
- msprobe/pytorch/doc//321/206/320/247/320/260/321/206/320/260/320/227/321/206/320/255/320/226/321/205/342/225/226/320/265/321/205/320/225/342/225/226/321/205/320/254/342/225/221/321/206/320/251/320/277/321/211/320/272/320/234/321/210/320/277/320/221/321/205/320/242/320/234/321/206/320/220/320/267/321/210/320/223/342/225/234/321/205/320/257/342/225/221/321/207/342/225/221/342/224/220/321/206/320/232/320/265/321/205/320/241/320/232.md +0 -151
- {mindstudio_probe-1.0.3.dist-info → mindstudio_probe-1.0.4.dist-info}/top_level.txt +0 -0
- /msprobe/{pytorch/doc → docs}/img/BLOOM-7B_1.png +0 -0
- /msprobe/{pytorch/doc → docs}/img/BLOOM-7B_2.png +0 -0
- /msprobe/{pytorch/doc → docs}/img/BLOOM-7B_3.png +0 -0
- /msprobe/{pytorch/doc → docs}/img/BLOOM-7B_4.png +0 -0
- /msprobe/{pytorch/doc → docs}/img/GPT-3_1.png +0 -0
- /msprobe/{pytorch/doc → docs}/img/GPT-3_2.png +0 -0
- /msprobe/{pytorch/doc → docs}/img/GPT-3_3.png +0 -0
- /msprobe/{pytorch/doc → docs}/img/GPT-3_4.png +0 -0
- /msprobe/{pytorch/doc → docs}/img/GPT-3_5.png +0 -0
- /msprobe/{pytorch/doc → docs}/img/GPT-3_6.png +0 -0
- /msprobe/{pytorch/doc → docs}/img/GPT-3_7.png +0 -0
- /msprobe/{pytorch/doc → docs}/img/GPT-3_8.png +0 -0
- /msprobe/{pytorch/doc → docs}/img/YOLOV5S_1.png +0 -0
- /msprobe/{pytorch/doc → docs}/img/YOLOV5S_2.png +0 -0
- /msprobe/{pytorch/doc → docs}/img/accuracy_checking_details.png +0 -0
- /msprobe/{pytorch/doc → docs}/img/accuracy_checking_result.png +0 -0
- /msprobe/{pytorch/doc → docs}/img/api_precision_compare_details.png +0 -0
- /msprobe/{pytorch/doc → docs}/img/api_precision_compare_result.png +0 -0
- /msprobe/{pytorch/doc → docs}/img/auto_analyze_log.png +0 -0
- /msprobe/{pytorch/doc → docs}/img/compare_result_pkl.png +0 -0
- /msprobe/{pytorch/doc → docs}/img/compare_result_pkl_md5.png.png +0 -0
- /msprobe/{pytorch/doc → docs}/img/cpu_info.png +0 -0
- /msprobe/{config → docs}/img/free_benchmark.png +0 -0
- /msprobe/{doc/grad_probe/img/image-1.png → docs/img/grad_probe_image-1.png} +0 -0
- /msprobe/{doc/grad_probe/img/image-2.png → docs/img/grad_probe_image-2.png} +0 -0
- /msprobe/{doc/grad_probe/img/image-3.png → docs/img/grad_probe_image-3.png} +0 -0
- /msprobe/{doc/grad_probe/img/image-4.png → docs/img/grad_probe_image-4.png} +0 -0
- /msprobe/{doc/grad_probe/img/image.png → docs/img/grad_probe_image.png} +0 -0
- /msprobe/{pytorch/doc → docs}/img/module_compare.png +0 -0
msprobe/mindspore/doc/dump.md
DELETED
@@ -1,217 +0,0 @@
# **Precision Data Collection**

The msprobe tool collects precision data by adding its dump interfaces to the training script and then launching training.

Running a dump requires the msprobe tool to be installed. See the "Tool Installation" section of the [MindStudio Precision Debugging Tool](../../README.md).

## Dump Interfaces

### PrecisionDebugger

**Description**

Loads a dump configuration file to determine the detailed dump settings.

PrecisionDebugger can be added at any point after `from msprobe.mindspore import PrecisionDebugger`. See "**Example Code**" for detailed usage.

**Prototype**

```Python
PrecisionDebugger(config_path=None)
```

**Parameters**

| Parameter | Description | Required |
| ----------- | ------------------------------------------------------------ | -------- |
| config_path | Path to the dump configuration file, String. Example: "./config.json". If not set, the default settings in [config.json](../../config) are used. config.json supports more parameters; to dump precision data in additional scenarios, configuring [config.json](../../config/config.json) is recommended. See the [Configuration Guide](https://gitee.com/ascend/mstt/blob/master/debug/accuracy_tools/msprobe/config/README.md) for details. | No |
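For reference, a minimal sketch of what a config.json passed via `config_path` could contain. The field names shown here (task, dump_path, rank, step, level and a task-specific block) are illustrative assumptions only; consult the configuration guide linked above for the authoritative schema.

```json
{
    "task": "statistics",
    "dump_path": "./dump_path",
    "rank": [],
    "step": [],
    "level": "L1",
    "statistics": {
        "scope": [],
        "list": [],
        "data_mode": ["all"],
        "summary_mode": "statistics"
    }
}
```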
### start

**Description**

Starts dumping.

Add it after model initialization. It must be placed inside the for loop together with the stop function.

**Prototype**

```Python
debugger.start(model=None)
```

This is a class method; it can be called as either debugger.start(model=None) or PrecisionDebugger.start(model=None).

**Parameters**

| Parameter | Description | Required |
| ----------- |---------------------------------------------------------------------------------------| -------- |
| model | A specific mindspore.nn.Cell, unset by default. At level L1, passing model enables dumping of primitive ops; without it, primitive ops cannot be dumped. | No |

### stop

**Description**

Stops dumping.

Add it anywhere after **start**. It must be placed inside the for loop together with the start function. To dump backward data, add it after the backward computation.

Supported only in the MindSpore dynamic graph scenario.

**Prototype**

```Python
debugger.stop()
```

This is a class method; it can be called as either debugger.stop() or PrecisionDebugger.stop().

### step

**Description**

Marks the end of a step.

Add it after the last **stop** call, or wherever a step ends.

Supported only in the MindSpore dynamic graph scenario.

**Prototype**

```Python
debugger.step()
```

This is a class method; it can be called as either debugger.step() or PrecisionDebugger.step().

## Example Code

### MindSpore Static Graph Scenario

```Python
from msprobe.mindspore import PrecisionDebugger
debugger = PrecisionDebugger(config_path="./config.json")
# Do not place the initialization above inside loop code
# The call below can also be written as PrecisionDebugger.start()
debugger.start()
...
```

### MindSpore Dynamic Graph Scenario

When the model is trained with a for loop, insert debugger.start() at the beginning of each iteration and debugger.stop() together with debugger.step() at the end of each iteration:

```Python
import mindspore as ms
from msprobe.mindspore import PrecisionDebugger

# Do not place the PrecisionDebugger initialization inside loop code
debugger = PrecisionDebugger(config_path="./config.json")

# Define and initialize the model, loss function, etc.
# ...

# Iterating over the dataset is usually where training starts
for data, label in data_loader:
    debugger.start()  # enable data dump
    net = Model()
    # the logic below is what the model executes in each step
    grad_net = ms.grad(net)(data)
    # ...
    debugger.stop()  # disable data dump
    debugger.step()  # finish the dump for this step
```

When the model's train method is used instead of a for loop, pass MsprobeStep(debugger) in the callbacks parameter:

```Python
from msprobe.mindspore.common.utils import MsprobeStep
from msprobe.mindspore import PrecisionDebugger

# Initialize PrecisionDebugger
debugger = PrecisionDebugger(config_path="./config.json")

# start() is then called automatically at the beginning of each step, and stop() and step() at the end of each step.
# This means you do not need to add start, stop and step manually inside the loop; the framework completes the dump automatically.
trainer.train(1, dataset_train, callbacks=[loss_monior, MsprobeStep(debugger)])
```

## Dump Output Files

### MindSpore Static Graph Scenario

After training finishes, the tool saves the dumped data in the directory specified by the dump_path parameter.

- When jit_level is O0/O1

  See [Synchronous Dump Data Object Directory](https://www.mindspore.cn/tutorials/experts/zh-CN/r2.3.0rc2/debug/dump.html#%E5%90%8C%E6%AD%A5dump%E6%95%B0%E6%8D%AE%E5%AF%B9%E8%B1%A1%E7%9B%AE%E5%BD%95) on the MindSpore website for the layout of the dump output directory.

- When jit_level is O2

  See [Asynchronous Dump Data Object Directory](https://www.mindspore.cn/tutorials/experts/zh-CN/r2.3.0rc2/debug/dump.html#%E5%BC%82%E6%AD%A5dump%E6%95%B0%E6%8D%AE%E5%AF%B9%E8%B1%A1%E7%9B%AE%E5%BD%95) on the MindSpore website for the layout of the dump output directory.

  For jit_level, see the jit_config option of [mindspore.set_context](https://www.mindspore.cn/docs/zh-CN/r2.3.0/api_python/mindspore/mindspore.JitConfig.html#mindspore-jitconfig).

### MindSpore Dynamic Graph Scenario

After training finishes, the tool saves the dumped data in the directory specified by the dump_path parameter.

An example of the dump output directory structure:

```bash
├── dump_path
│   ├── step0
│   |   ├── rank0
│   |   │   ├── dump_tensor_data
|   |   |   |   ├── MintFunctional.relu.0.backward.input.0.npy
|   |   |   |   ├── Mint.abs.0.forward.input.0.npy
|   |   |   |   ├── Functional.split.0.forward.input.0.npy
|   |   |   |   ├── Tensor.__add__.0.forward.output.0.npy
|   |   |   |   ...
|   |   |   |   └── Jit.AlexNet.0.forward.input.0.npy
│   |   |   ├── dump.json        # Forward/backward operators, their statistics, or overflow-operator information. Contains the API name of each dumped item (named as `{api_type}_{api_name}_{call count}_{forward/backward}_{input/output}.{argument index}`), dtype, shape, the max/min/mean/L2norm statistics of each item, and the md5 values when summary_mode="md5" is configured. "Argument index" is the n-th argument of the API, e.g. 1 is the first argument; if that argument is a list, numbering continues into the list, e.g. 1.1 is the first sub-argument of the API's first argument. L2norm is the L2 norm (square root).
│   |   |   ├── stack.json       # operator call-stack information
│   |   |   └── construct.json   # hierarchical structure; empty when level is L1
│   |   ├── rank1
|   |   |   ├── dump_tensor_data
|   |   |   |   └── ...
│   |   |   ├── dump.json
│   |   |   ├── stack.json
|   |   |   └── construct.json
│   |   ├── ...
│   |   |
|   |   └── rank7
│   ├── step1
│   |   ├── ...
│   ├── step2
```

During the dump, npy files are written to disk as soon as the corresponding operator or module has executed, whereas the json files are only written completely after PrecisionDebugger.stop() runs normally. If the program terminates abnormally, the npy files of the operators executed before termination are kept, but data may be missing from the json files.

Here rank is the ID of each card on the device; the data dumped on each card goes into its own directory. In non-distributed scenarios there is no rank ID and the directory is simply named rank.

In the dynamic graph scenario, enabling PSJit or PIJit to decorate a specific Cell or function makes the decorated part run fully or partially through the static graph flow. In the PSJit case, when level is set to L1 in config.json, the PSJit-decorated part is also dumped to the corresponding directory as an API; when level is L2, only the kernels of the static graph flow in the user network are dumped. In the PIJit case, once the dump tool is enabled the decorated part falls back to dynamic graph execution and is dumped at API granularity.

The mapping between npy file name prefixes and MindSpore modules is as follows:

| Prefix         | MindSpore module             |
| -------------- | ---------------------------- |
| Tensor         | mindspore.Tensor             |
| Functional     | mindspore.ops                |
| Mint           | mindspore.mint               |
| MintFunctional | mindspore.mint.nn.functional |
| Jit            | mindspore.jit                |
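As an illustration of the naming scheme above, a small helper along these lines could split a dumped npy file name into its parts. This sketch is not part of msprobe; it only assumes the `{prefix}.{api_name}.{call index}.{forward|backward}.{input|output}.{argument index}.npy` pattern visible in the example tree.

```Python
# Sketch only (not part of msprobe): split a dumped npy file name such as
# "Functional.split.0.forward.input.0.npy" into the parts described above.
def parse_dump_name(file_name):
    stem = file_name[:-len(".npy")] if file_name.endswith(".npy") else file_name
    parts = stem.split(".")
    # the call index sits right before the "forward"/"backward" token
    direction_idx = next(i for i, tok in enumerate(parts) if tok in ("forward", "backward"))
    return {
        "api_type": parts[0],                               # e.g. Functional, Tensor, Mint
        "api_name": ".".join(parts[1:direction_idx - 1]),   # e.g. split, __add__
        "call_index": parts[direction_idx - 1],             # how many times the API was called
        "direction": parts[direction_idx],                  # forward / backward
        "io": parts[direction_idx + 1],                     # input / output
        "arg_index": ".".join(parts[direction_idx + 2:]),   # e.g. "0", or "1.1" for a list element
    }


print(parse_dump_name("Tensor.__add__.0.forward.output.0.npy"))
```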
## Supported API List

msprobe maintains a fixed list of supported APIs. To remove APIs from, or add APIs to, the dump scope, edit the msprobe/mindspore/dump/hook_cell/support_wrap_ops.yaml file manually, as in the following example:

```yaml
ops:  # ops is the operator category; find the matching category and remove or add APIs under it in the following format
  - adaptive_avg_pool1d
  - adaptive_avg_pool2d
  - adaptive_avg_pool3d
```
msprobe/mindspore/dump/hook_cell/wrap_functional.py
DELETED
@@ -1,91 +0,0 @@
# Copyright 2024 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================

import os
import mindspore as ms
from msprobe.mindspore.dump.hook_cell.hook_cell import HOOKCell
from msprobe.core.common.utils import Const, load_yaml


cur_path = os.path.dirname(os.path.realpath(__file__))
yaml_path = os.path.join(cur_path, "support_wrap_ops.yaml")


def load_ops_functions():
    ops_func = {f: getattr(ms.ops, f) for f in dir(ms.ops)}
    mint_ops_func = {f: getattr(ms.mint, f) for f in dir(ms.mint)}
    mint_func_ops_func = {f: getattr(ms.mint.nn.functional, f) for f in dir(ms.mint.nn.functional)}
    return ops_func, mint_ops_func, mint_func_ops_func


def get_functional_ops():
    ops_func, mint_ops_func, mint_func_ops_func = load_ops_functions()
    config = load_yaml(yaml_path)
    wrap_functional = config.get("ops")
    wrap_mint = config.get("mint.ops")
    wrap_mint_functional = config.get("mint.nn.functional")
    return (
        set(wrap_functional) & set(ops_func.keys()),
        set(wrap_mint) & set(mint_ops_func.keys()),
        set(wrap_mint_functional) & set(mint_func_ops_func.keys())
    )


class HOOKFunctionalOP(object):
    pass


class HOOKMintOP(object):
    pass


class HOOKMintNNFunctionalOP(object):
    pass


class FunctionalOPTemplate(HOOKCell):
    def __init__(self, op_name, op_dict, prefix, hook):
        self.op_name = op_name
        self.op_func = op_dict[op_name]
        self.prefix_op_name_ = prefix + str(op_name.split(Const.SEP)[-1]) + Const.SEP
        super().__init__(hook)

    def construct(self, *args, **kwargs):
        if self.op_name.startswith('dropout'):
            return args[0] if args else kwargs.get('input')
        return self.op_func(*args, **kwargs)


def wrap_functional_op(op_name, op_dict, prefix, hook):
    def op_template(*args, **kwargs):
        return FunctionalOPTemplate(op_name, op_dict, prefix, hook)(*args, **kwargs)
    return op_template


def wrap_functional_ops_and_bind(ops, op_dict, prefix, hook, hook_class):
    for op_name in ops:
        if callable(op_dict[op_name]):
            setattr(hook_class, Const.ATTR_NAME_PREFIX + op_name, wrap_functional_op(op_name, op_dict, prefix, hook))


def setup_hooks(hook):
    functional_ops, mint_ops, mint_func_ops = get_functional_ops()
    wrap_functional_ops_and_bind(
        functional_ops, {f: getattr(ms.ops, f) for f in dir(ms.ops)}, "Functional.", hook, HOOKFunctionalOP)
    wrap_functional_ops_and_bind(
        mint_ops, {f: getattr(ms.mint, f) for f in dir(ms.mint)}, "Mint.", hook, HOOKMintOP)
    wrap_functional_ops_and_bind(
        mint_func_ops, {f: getattr(ms.mint.nn.functional, f) for f in dir(ms.mint.nn.functional)}, "MintFunctional.", hook, HOOKMintNNFunctionalOP)

msprobe/mindspore/dump/hook_cell/wrap_tensor.py
DELETED
@@ -1,63 +0,0 @@
# Copyright 2024 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================

import os
import mindspore as ms
from msprobe.mindspore.dump.hook_cell.hook_cell import HOOKCell
from msprobe.core.common.utils import Const, load_yaml


cur_path = os.path.dirname(os.path.realpath(__file__))
yaml_path = os.path.join(cur_path, "support_wrap_ops.yaml")


TensorFunc = {}
for f in dir(ms.Tensor):
    TensorFunc[f] = getattr(ms.Tensor, f)


def get_tensor_ops():
    yaml_data = load_yaml(yaml_path)
    wrap_tensor_ops = yaml_data.get('tensor')
    _tensor_ops = dir(ms.Tensor)
    return set(wrap_tensor_ops) & set(_tensor_ops)


class HOOKTensor(object):
    pass


class TensorOPTemplate(HOOKCell):

    def __init__(self, op_name, hook):
        self.op_name_ = op_name
        self.prefix_op_name_ = "Tensor." + str(op_name) + Const.SEP
        super().__init__(hook)

    def construct(self, *args, **kwargs):
        return TensorFunc[str(self.op_name_)](*args, **kwargs)


def wrap_tensor_op(op_name, hook):
    def tensor_op_template(*args, **kwargs):
        return TensorOPTemplate(op_name, hook)(*args, **kwargs)
    return tensor_op_template


def wrap_tensor_ops_and_bind(hook):
    _tensor_ops = get_tensor_ops()
    for op_name in _tensor_ops:
        if callable(TensorFunc[op_name]):
            setattr(HOOKTensor, Const.ATTR_NAME_PREFIX + str(op_name), wrap_tensor_op(op_name, hook))
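Neither deleted module applies the wrappers by itself; in this package the actual registration is handled elsewhere (api_registry.py, also touched in this release). Purely as an illustration of what the `ATTR_NAME_PREFIX`-prefixed attributes are for, a registration step could look roughly like the sketch below; the helper name and the direct `setattr` onto `ms.ops` are assumptions, not msprobe's real API.

```Python
# Illustrative sketch only: bind the hooked templates that setup_hooks() and
# wrap_tensor_ops_and_bind() attach to the HOOK* container classes onto a
# target namespace. The real binding lives in api_registry.py (not shown here).
import mindspore as ms
from msprobe.core.common.utils import Const


def bind_hooked_ops(hook_class, target):
    for attr_name in dir(hook_class):
        if attr_name.startswith(Const.ATTR_NAME_PREFIX):
            api_name = attr_name[len(Const.ATTR_NAME_PREFIX):]
            # replace the original API with the wrapped version that triggers the registered hooks
            setattr(target, api_name, getattr(hook_class, attr_name))

# e.g. bind_hooked_ops(HOOKFunctionalOP, ms.ops)
```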
msprobe/pytorch/doc/FAQ.md
DELETED
@@ -1,193 +0,0 @@
# Accuracy Pre-check Tool

1. During dump and run_ut, does the pre-check tool require jit compilation (jit_compile) to be switched on or off consistently in both phases?

   Answer: yes.

2. For APIs such as type_as that involve data type conversion, are the pre-check results meaningful?

   Because these APIs first raise and then lower precision on the CPU side, their validity results are of limited reference value.

3. run_ut reports: ERROR: Got unsupported ScalarType BFloat16

   Answer: please use the latest version of the tool.

4. For the Dropout operator the CPU and NPU randomness should differ, so why does the comparison report a match?

   Answer: this is expected. The tool handles this operator specially and only checks that the proportion of zero positions roughly matches the configured p value.

5. Why does the dtype of floating-point data differ between bench and CPU?

   Answer: for fp16 data the CPU computes at the higher precision fp32. This precision conclusion is aligned with the operator side; computing at higher precision on the CPU gets closer to the true value.

6. After adding the pre-check tool, slicing operations fail with `IndexError: too many indices for tensor of dimension x` or `TypeError: len() of a 0-d tensor`.

   Answer: comment out `- __getitem__` under `Tensor:` in mstt/debug/accuracy_tools/msprobe/pytorch/hook_module/support_wrap_ops.yaml in the tool directory; the tool then skips dumping this API. If this API is at a key location you need to dump, you can instead comment out the type check that raises the error, based on the error stack.

7. After adding the pre-check tool, F.gelu triggers ValueError errors such as `activation_func must be F.gelu`.

   Answer: comment out `- gelu` under `functional:` in mstt/debug/accuracy_tools/msprobe/pytorch/hook_module/support_wrap_ops.yaml in the tool directory; the tool then skips dumping this API. If this API is at a key location you need to dump, you can instead comment out the type check that raises the error, based on the error stack.

8. After adding the pre-check tool, AsStrided-related or compilation-related errors occur, such as `Failed to compile Op [AsStrided]`.

   Answer: comment out `- t` and `- transpose` under `Tensor:` in mstt/debug/accuracy_tools/msprobe/pytorch/hook_module/support_wrap_ops.yaml in the tool directory.

9. Which operations do the Tensor magic methods correspond to?

   Answer:

   | Tensor magic method | Operation |
   | --------------- | ---------------- |
   | `__add__`       | +                |
   | `__and__`       | &                |
   | `__bool__`      | returns the Tensor's boolean value |
   | `__div__`       | /                |
   | `__eq__`        | ==               |
   | `__ge__`        | >=               |
   | `__gt__`        | >                |
   | `__iadd__`      | +=               |
   | `__iand__`      | &=               |
   | `__idiv__`      | /=               |
   | `__ifloordiv__` | //=              |
   | `__ilshift__`   | <<=              |
   | `__imod__`      | %=               |
   | `__imul__`      | *=               |
   | `__ior__`       | \|=              |
   | `__irshift__`   | >>=              |
   | `__isub__`      | -=               |
   | `__ixor__`      | ^=               |
   | `__lshift__`    | <<               |
   | `__matmul__`    | matrix multiplication |
   | `__mod__`       | %                |
   | `__mul__`       | *                |
   | `__nonzero__`   | same as `__bool__` |
   | `__or__`        | \|               |
   | `__radd__`      | + (reversed)     |
   | `__rmul__`      | * (reversed)     |
   | `__rshift__`    | >>               |
   | `__sub__`       | -                |
   | `__truediv__`   | same as `__div__` |
   | `__xor__`       | ^                |

# Accuracy Comparison Tool

## Tool Usage

### Dumping specified fused operators

The dump-specified-API feature currently supports dumping the inputs and outputs of specified fused operators; they have to be added in mstt/debug/accuracy_tools/msprobe/pytorch/hook_module/support_wrap_ops.yaml. For example, the following code snippet calls the fused softmax operator:

```
def npu_forward_fused_softmax(self, input_, mask):
    resl = torch_npu.npu_scaled_masked_softmax(input_, mask, self.scale, False)
    return resl
```

To dump the inputs and outputs of the npu_scaled_masked_softmax operator called here, simply add the fused operator yourself under `torch_npu:` in support_wrap_ops.yaml:

```
- npu_scaled_masked_softmax
```

(The npu_scaled_masked_softmax fused operator is already supported by the tool; this is only an example.)

## FAQ

### 1. Do multiple dumps into the same directory conflict?

Yes. Dumping into the same directory multiple times overwrites the previous results; use the dump_path parameter to change the dump directory.

### 2. How do I dump operator-level data?

Set level to L2 mode.

### 3. The comparison finds that NPU and benchmark APIs cannot be fully aligned?

Differences caused by the torch version and the hardware are normal.

## Error Cases

### 2. HCCL error: error code: EI0006

**Symptom**

When using the msprobe tool, the error "error code: EI0006" is reported.

**Cause**

An old CANN software version causes an incompatibility.

**Resolution**

Upgrade to a newer CANN software version.

### 3. torch_npu._C._clear_overflow_npu() RuntimeError NPU error, error code is 107002

If you hit this error while running overflow detection, resolve it as follows:
For single-card runs, add the code below; 0 is the card ID, choose an idle card of your own.

```
torch.npu.set_device('npu:0')
```

For multi-card runs, change the card ID in the code accordingly; for example, when the process uses card {rank}, add:

```
torch.npu.set_device(f'npu:{rank}')
```

If you hit this error while running the accuracy comparison feature, try installing the latest version of msprobe.

### 4. Are dumped files such as VF_lstm_99_forward_input.1.0.npy and VF_lstm_99_forward_input.1.1.npy normal?

npy files with suffixes like 1.0/1.1/1.2 are normal; for example, they are produced when the input data is [[tensor1, tensor2, tensor3]].

### 5. compare reports: The current file contains stack information, please turn on the stack_mode

Set stack_mode=True in the comparison script, for example:

```
from msprobe.pytorch import compare
dump_result_param={
    "npu_json_path": "./npu_dump/dump.json",
    "bench_json_path": "./gpu_dump/dump.json",
    "stack_json_path": "./npu_dump/stack.json",
    "is_print_compare_log": True
}
compare(dump_result_param, output_path="./output", stack_mode=True)
```

### 6. Dumping kernel-level data of a specified backward API reports: NameError: name 'torch_npu' is not defined

- In an NPU environment, install torch_npu.
- In a GPU environment, dumping kernel-level data of specified APIs is not yet supported.

### 7. After configuring dump_path, the tool reports: [ERROR]The file path /home/xxx/dump contains special characters

- Check whether the absolute dump path you set contains special characters; make sure the path name only contains upper/lower-case letters, digits, underscores, slashes, dots and hyphens.
- Note that if the script is executed from /home/abc++/ and dump_path="./dump" is set, the path actually checked by the tool is the absolute path /home/abc++/dump; ++ counts as special characters and triggers this error.

### 8. The backward gradient data of the matmul weight cannot be dumped

- matmul expects 2-D inputs. When an input is not 2-D, it is flattened to 2-D via a view operation before matmul is performed, so during backward the backward_hook only sees the gradients of the data in the UnsafeViewBackward step and cannot reach the gradients in the MmBackward step, i.e. the backward gradient of the weight.
- A typical example: when the input to linear is not 2-D and there is no bias, output = input.matmul(weight.t()) is called, so the backward gradient data of the linear layer's weight cannot be obtained.

### 9. The dtype of some APIs in dump.json is float16, but the npy file read for that API shows float32

- When dumping, msprobe has to move the original data from NPU to CPU and then convert it to a numpy type. The NPU-to-CPU logic is the same as GPU-to-CPU, and in both cases the dtype may change from float16 to float32. If the dtypes do not match, the dtype recorded in the pkl file is authoritative for the dumped data.

### 10. After using dataloader, the exception Exception("msprobe: exit after iteration {}".format(max(self.config.step)) is raised

- This is expected; dataloader ends the program via raise, and the stack trace can be ignored.

### 11. After adding the msprobe tool, slicing operations fail with `IndexError: too many indices for tensor of dimension x` or `TypeError: len() of a 0-d tensor`.

- Comment out `- __getitem__` under `Tensor:` in mstt/debug/accuracy_tools/msprobe/pytorch/hook_module/support_wrap_ops.yaml in the tool directory; the tool then skips dumping this API. If this API is at a key location you need to dump, you can instead comment out the type check that raises the error, based on the error stack.

### 12. After adding the msprobe tool, F.gelu triggers ValueError errors such as `activation_func must be F.gelu`.

- Comment out `- gelu` under `functional:` in mstt/debug/accuracy_tools/msprobe/pytorch/hook_module/support_wrap_ops.yaml in the tool directory; the tool then skips dumping this API. If this API is at a key location you need to dump, you can instead comment out the type check that raises the error, based on the error stack.

### 13. After adding the msprobe tool, AsStrided-related or compilation-related errors occur, such as `Failed to compile Op [AsStrided]`.

- Comment out `- t` and `- transpose` under `Tensor:` in mstt/debug/accuracy_tools/msprobe/pytorch/hook_module/support_wrap_ops.yaml in the tool directory.