mindstudio-probe 1.1.0__py3-none-any.whl → 1.2.1__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (299)
  1. {mindstudio_probe-1.1.0.dist-info → mindstudio_probe-1.2.1.dist-info}/METADATA +7 -6
  2. mindstudio_probe-1.2.1.dist-info/RECORD +396 -0
  3. {mindstudio_probe-1.1.0.dist-info → mindstudio_probe-1.2.1.dist-info}/WHEEL +1 -1
  4. {mindstudio_probe-1.1.0.dist-info → mindstudio_probe-1.2.1.dist-info}/entry_points.txt +0 -1
  5. msprobe/CMakeLists.txt +5 -0
  6. msprobe/README.md +51 -20
  7. msprobe/config.json +2 -3
  8. msprobe/core/advisor/advisor.py +8 -3
  9. msprobe/core/common/const.py +264 -15
  10. msprobe/core/common/exceptions.py +27 -3
  11. msprobe/core/common/file_utils.py +176 -26
  12. msprobe/core/common/inplace_op_checker.py +15 -0
  13. msprobe/core/common/inplace_ops.yaml +3 -0
  14. msprobe/core/common/log.py +27 -9
  15. msprobe/core/common/utils.py +204 -77
  16. msprobe/core/common_config.py +49 -14
  17. msprobe/core/compare/acc_compare.py +274 -198
  18. msprobe/core/compare/check.py +32 -33
  19. msprobe/core/compare/compare_cli.py +32 -14
  20. msprobe/core/compare/highlight.py +283 -127
  21. msprobe/core/compare/layer_mapping/__init__.py +19 -0
  22. msprobe/core/compare/layer_mapping/data_scope_parser.py +246 -0
  23. msprobe/core/compare/layer_mapping/layer_mapping.py +249 -0
  24. msprobe/core/compare/layer_mapping/postprocess_pass.py +95 -0
  25. msprobe/core/compare/merge_result/merge_result.py +380 -0
  26. msprobe/core/compare/merge_result/merge_result_cli.py +31 -0
  27. msprobe/core/compare/multiprocessing_compute.py +2 -2
  28. msprobe/core/compare/npy_compare.py +135 -144
  29. msprobe/core/compare/utils.py +419 -274
  30. msprobe/core/data_dump/data_collector.py +60 -28
  31. msprobe/core/data_dump/data_processor/base.py +84 -36
  32. msprobe/core/data_dump/data_processor/factory.py +5 -3
  33. msprobe/core/data_dump/data_processor/mindspore_processor.py +152 -18
  34. msprobe/core/data_dump/data_processor/pytorch_processor.py +267 -110
  35. msprobe/core/data_dump/json_writer.py +29 -1
  36. msprobe/core/data_dump/scope.py +119 -39
  37. msprobe/core/grad_probe/constant.py +27 -13
  38. msprobe/core/grad_probe/grad_compare.py +18 -1
  39. msprobe/core/grad_probe/utils.py +30 -2
  40. msprobe/core/overflow_check/abnormal_scene.py +189 -0
  41. msprobe/core/overflow_check/api_info.py +55 -0
  42. msprobe/core/overflow_check/checker.py +138 -0
  43. msprobe/core/overflow_check/filter.py +157 -0
  44. msprobe/core/overflow_check/ignore_rules.yaml +55 -0
  45. msprobe/core/overflow_check/level.py +22 -0
  46. msprobe/core/overflow_check/utils.py +28 -0
  47. msprobe/docs/01.installation.md +96 -7
  48. msprobe/docs/02.config_introduction.md +50 -23
  49. msprobe/docs/03.config_examples.md +2 -9
  50. msprobe/docs/04.kernel_dump_PyTorch.md +73 -0
  51. msprobe/docs/05.data_dump_PyTorch.md +93 -61
  52. msprobe/docs/06.data_dump_MindSpore.md +200 -95
  53. msprobe/docs/07.accuracy_checker_PyTorch.md +28 -28
  54. msprobe/docs/08.accuracy_checker_online_PyTorch.md +1 -6
  55. msprobe/docs/09.accuracy_checker_MindSpore.md +44 -8
  56. msprobe/docs/10.accuracy_compare_PyTorch.md +114 -50
  57. msprobe/docs/11.accuracy_compare_MindSpore.md +340 -48
  58. msprobe/docs/12.overflow_check_PyTorch.md +2 -2
  59. msprobe/docs/13.overflow_check_MindSpore.md +6 -6
  60. msprobe/docs/15.free_benchmarking_PyTorch.md +4 -5
  61. msprobe/docs/16.free_benchmarking_MindSpore.md +56 -37
  62. msprobe/docs/17.grad_probe.md +5 -6
  63. msprobe/docs/19.monitor.md +561 -0
  64. msprobe/docs/20.monitor_performance_baseline.md +52 -0
  65. msprobe/docs/21.visualization_PyTorch.md +466 -0
  66. msprobe/docs/22.visualization_MindSpore.md +481 -0
  67. msprobe/docs/23.generate_operator_PyTorch.md +107 -0
  68. msprobe/docs/24.code_mapping_Mindspore.md +28 -0
  69. msprobe/docs/25.tool_function_introduction.md +29 -0
  70. msprobe/docs/26.data_dump_PyTorch_baseline.md +37 -0
  71. msprobe/docs/27.dump_json_instruction.md +521 -0
  72. msprobe/docs/FAQ.md +29 -2
  73. msprobe/docs/accuracy_checker_MindSpore/accuracy_checker_MindSpore_baseline.md +14 -0
  74. msprobe/docs/data_dump_MindSpore/data_dump_MindSpore_baseline.md +22 -0
  75. msprobe/docs/data_dump_MindSpore/dynamic_graph_quick_start_example.md +211 -0
  76. msprobe/docs/img/compare_result.png +0 -0
  77. msprobe/docs/img/merge_result.png +0 -0
  78. msprobe/docs/img/monitor/cpu_info.png +0 -0
  79. msprobe/docs/img/visualization/fuzzy_match_ms.png +0 -0
  80. msprobe/docs/img/visualization/fuzzy_match_pt.png +0 -0
  81. msprobe/docs/img/visualization/tensorboard_1.png +0 -0
  82. msprobe/docs/img/visualization/tensorboard_2.png +0 -0
  83. msprobe/docs/img/visualization/vis_browser_1.png +0 -0
  84. msprobe/docs/img/visualization/vis_browser_2.png +0 -0
  85. msprobe/docs/img/visualization/vis_precision_info.png +0 -0
  86. msprobe/docs/img/visualization/vis_search_info.png +0 -0
  87. msprobe/docs/img/visualization/vis_show_info.png +0 -0
  88. msprobe/docs/img/visualization/vis_showcase.png +0 -0
  89. msprobe/docs/img/visualization/vis_unmatch_info.png +0 -0
  90. msprobe/docs/visualization/GPTModel.png +0 -0
  91. msprobe/docs/visualization/ParallelMLP.png +0 -0
  92. msprobe/docs/visualization/layer_mapping_example.md +132 -0
  93. msprobe/docs/visualization/mapping.png +0 -0
  94. msprobe/docs/visualization/mapping1.png +0 -0
  95. msprobe/docs/visualization/module_name.png +0 -0
  96. msprobe/docs/visualization/module_name1.png +0 -0
  97. msprobe/docs/visualization/no_mapping.png +0 -0
  98. msprobe/docs/visualization/no_mapping1.png +0 -0
  99. msprobe/docs/visualization/no_mapping_analyze.png +0 -0
  100. msprobe/docs/visualization/top_layer.png +0 -0
  101. msprobe/mindspore/__init__.py +25 -0
  102. msprobe/mindspore/api_accuracy_checker/api_accuracy_checker.py +151 -151
  103. msprobe/mindspore/api_accuracy_checker/api_info.py +21 -6
  104. msprobe/mindspore/api_accuracy_checker/api_runner.py +43 -18
  105. msprobe/mindspore/api_accuracy_checker/base_compare_algorithm.py +21 -7
  106. msprobe/mindspore/api_accuracy_checker/checker_support_api.yaml +77 -0
  107. msprobe/mindspore/api_accuracy_checker/cmd_parser.py +64 -1
  108. msprobe/mindspore/api_accuracy_checker/compute_element.py +64 -31
  109. msprobe/mindspore/api_accuracy_checker/data_manager.py +301 -0
  110. msprobe/mindspore/api_accuracy_checker/main.py +28 -3
  111. msprobe/mindspore/api_accuracy_checker/multi_api_accuracy_checker.py +212 -0
  112. msprobe/mindspore/api_accuracy_checker/multi_data_manager.py +60 -0
  113. msprobe/mindspore/api_accuracy_checker/type_mapping.py +22 -5
  114. msprobe/mindspore/api_accuracy_checker/utils.py +34 -17
  115. msprobe/mindspore/cell_processor.py +33 -12
  116. msprobe/mindspore/code_mapping/bind.py +264 -0
  117. msprobe/mindspore/code_mapping/cmd_parser.py +40 -0
  118. msprobe/mindspore/code_mapping/graph.py +49 -0
  119. msprobe/mindspore/code_mapping/graph_parser.py +226 -0
  120. msprobe/mindspore/code_mapping/main.py +24 -0
  121. msprobe/mindspore/code_mapping/processor.py +34 -0
  122. msprobe/mindspore/common/const.py +35 -13
  123. msprobe/mindspore/common/log.py +5 -9
  124. msprobe/mindspore/common/utils.py +88 -4
  125. msprobe/mindspore/compare/distributed_compare.py +22 -24
  126. msprobe/mindspore/compare/ms_compare.py +333 -268
  127. msprobe/mindspore/compare/ms_graph_compare.py +95 -52
  128. msprobe/mindspore/debugger/debugger_config.py +7 -1
  129. msprobe/mindspore/debugger/precision_debugger.py +87 -12
  130. msprobe/mindspore/dump/dump_tool_factory.py +3 -1
  131. msprobe/mindspore/dump/hook_cell/api_registry.py +95 -18
  132. msprobe/mindspore/dump/hook_cell/hook_cell.py +60 -38
  133. msprobe/mindspore/dump/hook_cell/primitive_hooks.py +45 -30
  134. msprobe/mindspore/dump/hook_cell/support_wrap_ops.yaml +36 -1
  135. msprobe/mindspore/dump/hook_cell/wrap_api.py +92 -1
  136. msprobe/mindspore/dump/jit_dump.py +17 -5
  137. msprobe/mindspore/dump/kernel_dump/kernel_config.py +33 -0
  138. msprobe/mindspore/dump/kernel_graph_dump.py +9 -4
  139. msprobe/mindspore/dump/kernel_kbyk_dump.py +2 -4
  140. msprobe/mindspore/dym_loader/hook_dynamic_loader.cc +140 -0
  141. msprobe/mindspore/dym_loader/hook_dynamic_loader.h +53 -0
  142. msprobe/mindspore/free_benchmark/api_pynative_self_check.py +156 -41
  143. msprobe/mindspore/free_benchmark/common/handler_params.py +1 -2
  144. msprobe/mindspore/free_benchmark/common/utils.py +19 -4
  145. msprobe/mindspore/free_benchmark/data/support_wrap_ops.yaml +0 -204
  146. msprobe/mindspore/free_benchmark/handler/base_handler.py +3 -3
  147. msprobe/mindspore/free_benchmark/handler/check_handler.py +4 -5
  148. msprobe/mindspore/free_benchmark/handler/fix_handler.py +4 -4
  149. msprobe/mindspore/free_benchmark/handler/handler_factory.py +4 -4
  150. msprobe/mindspore/free_benchmark/perturbation/add_noise.py +2 -2
  151. msprobe/mindspore/free_benchmark/perturbation/base_perturbation.py +15 -6
  152. msprobe/mindspore/free_benchmark/perturbation/bit_noise.py +2 -2
  153. msprobe/mindspore/free_benchmark/perturbation/exchange_value.py +2 -2
  154. msprobe/mindspore/free_benchmark/perturbation/improve_precision.py +13 -6
  155. msprobe/mindspore/free_benchmark/perturbation/perturbation_factory.py +2 -2
  156. msprobe/mindspore/free_benchmark/self_check_tool_factory.py +2 -2
  157. msprobe/mindspore/grad_probe/global_context.py +28 -8
  158. msprobe/mindspore/grad_probe/grad_analyzer.py +50 -24
  159. msprobe/mindspore/grad_probe/grad_monitor.py +16 -1
  160. msprobe/mindspore/grad_probe/grad_stat_csv.py +33 -5
  161. msprobe/mindspore/grad_probe/hook.py +35 -12
  162. msprobe/mindspore/grad_probe/utils.py +18 -5
  163. msprobe/mindspore/mindtorch/__init__.py +18 -0
  164. msprobe/mindspore/mindtorch/mindtorch_adaptor.py +255 -0
  165. msprobe/mindspore/ms_config.py +27 -16
  166. msprobe/mindspore/overflow_check/kernel_graph_overflow_check.py +9 -4
  167. msprobe/mindspore/runtime.py +15 -0
  168. msprobe/mindspore/service.py +285 -113
  169. msprobe/mindspore/task_handler_factory.py +15 -0
  170. msprobe/msprobe.py +48 -10
  171. msprobe/pytorch/__init__.py +8 -6
  172. msprobe/pytorch/api_accuracy_checker/common/config.py +62 -0
  173. msprobe/pytorch/api_accuracy_checker/common/utils.py +31 -16
  174. msprobe/pytorch/api_accuracy_checker/compare/algorithm.py +41 -8
  175. msprobe/pytorch/api_accuracy_checker/compare/api_precision_compare.py +103 -271
  176. msprobe/pytorch/api_accuracy_checker/compare/api_precision_standard.yaml +4 -1
  177. msprobe/pytorch/api_accuracy_checker/compare/compare.py +69 -68
  178. msprobe/pytorch/api_accuracy_checker/compare/compare_column.py +54 -0
  179. msprobe/pytorch/api_accuracy_checker/compare/compare_input.py +51 -0
  180. msprobe/pytorch/api_accuracy_checker/compare/compare_utils.py +2 -4
  181. msprobe/pytorch/api_accuracy_checker/generate_op_script/config_op.json +9 -0
  182. msprobe/pytorch/api_accuracy_checker/generate_op_script/op_generator.py +478 -0
  183. msprobe/pytorch/api_accuracy_checker/generate_op_script/operator_replication.template +365 -0
  184. msprobe/pytorch/api_accuracy_checker/precision_standard/absolute_threshold.py +106 -0
  185. msprobe/pytorch/api_accuracy_checker/precision_standard/accumulative_error_compare.py +107 -0
  186. msprobe/pytorch/api_accuracy_checker/precision_standard/base_standard.py +151 -0
  187. msprobe/pytorch/api_accuracy_checker/precision_standard/benchmark_compare.py +226 -0
  188. msprobe/pytorch/api_accuracy_checker/precision_standard/binary_consistency.py +68 -0
  189. msprobe/pytorch/api_accuracy_checker/precision_standard/standard_config.py +218 -0
  190. msprobe/pytorch/api_accuracy_checker/precision_standard/standard_register.py +104 -0
  191. msprobe/pytorch/api_accuracy_checker/precision_standard/thousandth_standard.py +63 -0
  192. msprobe/pytorch/api_accuracy_checker/precision_standard/ulp_compare.py +200 -0
  193. msprobe/pytorch/api_accuracy_checker/run_ut/data_generate.py +63 -2
  194. msprobe/pytorch/api_accuracy_checker/run_ut/multi_run_ut.py +21 -15
  195. msprobe/pytorch/api_accuracy_checker/run_ut/run_overflow_check.py +54 -22
  196. msprobe/pytorch/api_accuracy_checker/run_ut/run_ut.py +140 -71
  197. msprobe/pytorch/api_accuracy_checker/run_ut/run_ut_utils.py +49 -8
  198. msprobe/pytorch/api_accuracy_checker/tensor_transport_layer/attl.py +9 -24
  199. msprobe/pytorch/api_accuracy_checker/tensor_transport_layer/client.py +4 -12
  200. msprobe/pytorch/api_accuracy_checker/tensor_transport_layer/device_dispatch.py +5 -3
  201. msprobe/pytorch/api_accuracy_checker/tensor_transport_layer/dump_dispatch.py +9 -4
  202. msprobe/pytorch/api_accuracy_checker/tensor_transport_layer/server.py +3 -11
  203. msprobe/pytorch/api_accuracy_checker/tensor_transport_layer/utils.py +2 -2
  204. msprobe/pytorch/bench_functions/confusion_transpose.py +5 -1
  205. msprobe/pytorch/bench_functions/matmul_backward.py +12 -0
  206. msprobe/pytorch/bench_functions/npu_fusion_attention.py +142 -16
  207. msprobe/pytorch/bench_functions/rotary_mul.py +4 -0
  208. msprobe/pytorch/bench_functions/swiglu.py +10 -2
  209. msprobe/pytorch/common/parse_json.py +7 -6
  210. msprobe/pytorch/common/utils.py +101 -7
  211. msprobe/pytorch/compare/distributed_compare.py +17 -30
  212. msprobe/pytorch/compare/pt_compare.py +44 -22
  213. msprobe/pytorch/debugger/debugger_config.py +46 -27
  214. msprobe/pytorch/debugger/precision_debugger.py +42 -12
  215. msprobe/pytorch/dump/kernel_dump/kernel_config.py +33 -0
  216. msprobe/pytorch/dump/module_dump/module_dump.py +86 -0
  217. msprobe/pytorch/{module_processer.py → dump/module_dump/module_processer.py} +81 -10
  218. msprobe/pytorch/free_benchmark/common/constant.py +15 -0
  219. msprobe/pytorch/free_benchmark/common/counter.py +15 -0
  220. msprobe/pytorch/free_benchmark/common/enums.py +15 -0
  221. msprobe/pytorch/free_benchmark/common/params.py +10 -2
  222. msprobe/pytorch/free_benchmark/common/utils.py +29 -4
  223. msprobe/pytorch/free_benchmark/compare/grad_saver.py +20 -5
  224. msprobe/pytorch/free_benchmark/compare/single_benchmark.py +2 -0
  225. msprobe/pytorch/free_benchmark/perturbed_layers/npu/add_noise.py +3 -1
  226. msprobe/pytorch/free_benchmark/perturbed_layers/npu/bit_noise.py +6 -4
  227. msprobe/pytorch/free_benchmark/perturbed_layers/npu/change_value.py +2 -0
  228. msprobe/pytorch/free_benchmark/perturbed_layers/npu/improve_precision.py +4 -0
  229. msprobe/pytorch/free_benchmark/result_handlers/base_handler.py +41 -47
  230. msprobe/pytorch/free_benchmark/result_handlers/fix_handler.py +6 -5
  231. msprobe/pytorch/free_benchmark/result_handlers/preheat_handler.py +0 -4
  232. msprobe/pytorch/grad_probe/grad_monitor.py +23 -6
  233. msprobe/pytorch/grad_probe/grad_stat_csv.py +40 -10
  234. msprobe/pytorch/hook_module/__init__.py +1 -1
  235. msprobe/pytorch/hook_module/hook_module.py +14 -11
  236. msprobe/pytorch/hook_module/register_optimizer_hook.py +59 -0
  237. msprobe/pytorch/hook_module/support_wrap_ops.yaml +35 -0
  238. msprobe/pytorch/hook_module/wrap_distributed.py +6 -8
  239. msprobe/pytorch/hook_module/wrap_functional.py +0 -38
  240. msprobe/pytorch/monitor/__init__.py +0 -0
  241. msprobe/pytorch/monitor/anomaly_analyse.py +201 -0
  242. msprobe/pytorch/monitor/anomaly_detect.py +425 -0
  243. msprobe/pytorch/monitor/csv2tb.py +166 -0
  244. msprobe/pytorch/monitor/distributed/__init__.py +0 -0
  245. msprobe/pytorch/monitor/distributed/distributed_ops.yaml +19 -0
  246. msprobe/pytorch/monitor/distributed/stack_blacklist.yaml +5 -0
  247. msprobe/pytorch/monitor/distributed/wrap_distributed.py +283 -0
  248. msprobe/pytorch/monitor/features.py +108 -0
  249. msprobe/pytorch/monitor/module_hook.py +1076 -0
  250. msprobe/pytorch/monitor/module_metric.py +172 -0
  251. msprobe/pytorch/monitor/module_spec_verifier.py +95 -0
  252. msprobe/pytorch/monitor/optimizer_collect.py +333 -0
  253. msprobe/pytorch/monitor/unittest/__init__.py +0 -0
  254. msprobe/pytorch/monitor/unittest/test_monitor.py +160 -0
  255. msprobe/pytorch/monitor/utils.py +321 -0
  256. msprobe/pytorch/monitor/visualizer.py +59 -0
  257. msprobe/pytorch/online_dispatch/__init__.py +2 -3
  258. msprobe/pytorch/online_dispatch/compare.py +29 -38
  259. msprobe/pytorch/online_dispatch/dispatch.py +58 -27
  260. msprobe/pytorch/online_dispatch/dump_compare.py +21 -9
  261. msprobe/pytorch/online_dispatch/single_compare.py +53 -32
  262. msprobe/pytorch/online_dispatch/torch_ops_config.yaml +1 -1
  263. msprobe/pytorch/online_dispatch/utils.py +49 -21
  264. msprobe/pytorch/parse_tool/lib/compare.py +21 -27
  265. msprobe/pytorch/parse_tool/lib/config.py +6 -8
  266. msprobe/pytorch/parse_tool/lib/file_desc.py +15 -1
  267. msprobe/pytorch/parse_tool/lib/interactive_cli.py +10 -10
  268. msprobe/pytorch/parse_tool/lib/parse_exception.py +7 -7
  269. msprobe/pytorch/parse_tool/lib/parse_tool.py +12 -12
  270. msprobe/pytorch/parse_tool/lib/utils.py +33 -53
  271. msprobe/pytorch/parse_tool/lib/visualization.py +11 -10
  272. msprobe/pytorch/pt_config.py +31 -8
  273. msprobe/pytorch/service.py +188 -108
  274. msprobe/visualization/__init__.py +14 -0
  275. msprobe/visualization/builder/__init__.py +14 -0
  276. msprobe/visualization/builder/graph_builder.py +222 -0
  277. msprobe/visualization/builder/msprobe_adapter.py +227 -0
  278. msprobe/visualization/compare/__init__.py +14 -0
  279. msprobe/visualization/compare/graph_comparator.py +180 -0
  280. msprobe/visualization/compare/mode_adapter.py +197 -0
  281. msprobe/visualization/graph/__init__.py +14 -0
  282. msprobe/visualization/graph/base_node.py +119 -0
  283. msprobe/visualization/graph/distributed_analyzer.py +318 -0
  284. msprobe/visualization/graph/graph.py +209 -0
  285. msprobe/visualization/graph/node_colors.py +95 -0
  286. msprobe/visualization/graph/node_op.py +39 -0
  287. msprobe/visualization/graph_service.py +288 -0
  288. msprobe/visualization/utils.py +217 -0
  289. mindstudio_probe-1.1.0.dist-info/RECORD +0 -287
  290. msprobe/docs/04.acl_config_examples.md +0 -78
  291. msprobe/mindspore/compare/layer_mapping.py +0 -146
  292. msprobe/mindspore/compare/modify_mapping.py +0 -107
  293. msprobe/mindspore/free_benchmark/decorator/dec_forward.py +0 -57
  294. msprobe/mindspore/free_benchmark/decorator/decorator_factory.py +0 -122
  295. msprobe/pytorch/functional/module_dump.py +0 -84
  296. {mindstudio_probe-1.1.0.dist-info → mindstudio_probe-1.2.1.dist-info}/LICENSE +0 -0
  297. {mindstudio_probe-1.1.0.dist-info → mindstudio_probe-1.2.1.dist-info}/top_level.txt +0 -0
  298. /msprobe/mindspore/{free_benchmark/decorator → code_mapping}/__init__.py +0 -0
  299. /msprobe/pytorch/{functional → dump/module_dump}/__init__.py +0 -0
msprobe/core/overflow_check/checker.py +138 -0
@@ -0,0 +1,138 @@
+ # Copyright (c) 2024-2024, Huawei Technologies Co., Ltd.
+ # All rights reserved.
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ # http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+
+ from typing import Dict, List, Optional, Any
+
+ from msprobe.core.common.const import Const
+
+ from msprobe.core.overflow_check.abnormal_scene import InputAnomalyOutputNormalScene, InputAnomalyOutputAnomalyScene, \
+     InputNormalOutputAnomalyScene, NumericalMutationScene, AnomalyScene
+ from msprobe.core.overflow_check.api_info import APIInfo
+ from msprobe.core.overflow_check.filter import IgnoreFilter
+ from msprobe.core.overflow_check.level import OverflowLevel
+
+
+ class StatisticsFields:
+     """Constants for the statistics field names."""
+     CRITICAL_APIS = 'critical_apis'
+     HIGH_PRIORITY_APIS = 'high_priority_apis'
+     MEDIUM_PRIORITY_APIS = 'medium_priority_apis'
+     ANOMALY_DETAILS = 'anomaly_details'
+
+     # All fields
+     ALL_FIELDS = [CRITICAL_APIS, HIGH_PRIORITY_APIS, MEDIUM_PRIORITY_APIS, ANOMALY_DETAILS]
+
+
+ class AnomalyDetector:
+     """Anomaly detector."""
+
+     def __init__(self, dump_data: Dict):
+         """
+         Initialize the detector and store dump_data.
+         Args:
+             dump_data: data in the following format
+             {
+                 "api/module": {statistics}
+             }
+         """
+         self.dump_data = dump_data
+         self.ignore_filter = IgnoreFilter()
+         self.scene_types = [
+             InputNormalOutputAnomalyScene,  # input normal, output anomalous
+             InputAnomalyOutputAnomalyScene,  # input anomalous, output anomalous
+             InputAnomalyOutputNormalScene,  # input anomalous, output normal
+             NumericalMutationScene  # output changes abruptly relative to the input
+         ]
+         self.anomaly_scenes: Dict[str, AnomalyScene] = dict()
+
+     @staticmethod
+     def _create_api_info(api_name: str, data: Dict) -> APIInfo:
+         """Create an APIInfo instance from the raw data."""
+         return APIInfo(
+             api_name=api_name,
+             input_args=data.get(Const.INPUT_ARGS, data.get(Const.INPUT, [])),
+             input_kwargs=data.get(Const.INPUT_KWARGS, {}),
+             output_data=data.get(Const.OUTPUT, [])
+         )
+
+     def get_statistics(self) -> Dict[str, List]:
+         """Get the statistics.
+
+         Field names are managed centrally through the StatisticsFields class to avoid hard-coding.
+
+         Returns:
+             Dict[str, List]: a dict containing the API list for each priority level and the anomaly details
+         """
+         stats = {field: [] for field in StatisticsFields.ALL_FIELDS}
+
+         # Map each rank to its result key
+         rank_to_key = {
+             OverflowLevel.CRITICAL: StatisticsFields.CRITICAL_APIS,
+             OverflowLevel.HIGH: StatisticsFields.HIGH_PRIORITY_APIS,
+             OverflowLevel.MEDIUM: StatisticsFields.MEDIUM_PRIORITY_APIS
+         }
+
+         for scene in self.anomaly_scenes.values():
+             stats[StatisticsFields.ANOMALY_DETAILS].append(scene.get_details())
+             # Classify the API by rank
+             key = rank_to_key.get(scene.rank, None)
+             if not key:
+                 stats[key].append(scene.api_name)
+
+         return stats
+
+     def analyze(self):
+         """
+         Analyze the call data against the anomaly scenes.
+         Returns:
+             The instance itself; if no filtering is needed, calling analyze alone is enough.
+         """
+         # Iterate over the data items
+         for api_name, data in self.dump_data.items():
+             api_info = self._create_api_info(api_name, data)
+
+             # Every scene type is checked and several may match, so the following rules apply:
+             # - the highest-severity match wins
+             # - scenes are checked in priority order, with data anomalies checked last
+             for scene_type in self.scene_types:
+                 scene = scene_type(api_info)
+                 if hasattr(scene, 'matches') and scene.matches():
+                     self.anomaly_scenes[api_name] = scene
+                     break  # stop at the first match: the highest-severity match wins
+         return self
+
+     def filter(self):
+         """
+         Filter out false positives.
+         Returns:
+             The checker itself, to allow method chaining.
+         """
+         result = dict()
+         for api_name, scene in self.anomaly_scenes.items():
+             if self.ignore_filter.apply_filter(scene.api_data):
+                 continue
+             result[api_name] = scene
+         self.anomaly_scenes = result
+         return self
+
+     def overflow_result(self) -> Dict[str, AnomalyScene]:
+         return self.anomaly_scenes
+
+     def has_overflow(self, api_name: str) -> bool:
+         return api_name in self.anomaly_scenes.keys()
+
+     def get_overflow_level(self, api_name: str) -> Optional[Any]:
+         scene = self.anomaly_scenes.get(api_name, None)
+         return scene.rank if scene else None
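The `AnomalyDetector` hunk above checks each API against an ordered list of scene types, keeps the first match, and then buckets the hits by rank. The following is a minimal standalone sketch of that flow, not the package's implementation. The scene classes in `abnormal_scene.py` are not shown in this diff, so the predicates, the rank assigned to each scene, and the `"input_args"` / `"output"` keys are all assumptions made for illustration. The sketch appends an API when a bucket key is found, which is presumably the intent; the hunk as shown tests `if not key:` before indexing `stats[key]`.

```python
from enum import Enum
from typing import Dict, List, Tuple


class Level(Enum):
    """Hypothetical stand-in for OverflowLevel."""
    MEDIUM = "medium"
    HIGH = "high"
    CRITICAL = "critical"


def _bad(stats: dict) -> bool:
    """True when a Max/Min/Mean entry is inf, -inf or nan."""
    return any(str(stats.get(k)).lower() in ("inf", "-inf", "nan")
               for k in ("Max", "Min", "Mean"))


# Ordered scene table: (name, assumed rank, predicate on (input_bad, output_bad)).
# The ranks here are illustrative guesses, not taken from the package.
SCENES = [
    ("input_normal_output_anomaly", Level.CRITICAL, lambda i, o: (not i) and o),
    ("input_anomaly_output_anomaly", Level.HIGH, lambda i, o: i and o),
    ("input_anomaly_output_normal", Level.MEDIUM, lambda i, o: i and (not o)),
]


def analyze(dump_data: Dict[str, dict]) -> Dict[str, Tuple[str, Level]]:
    """Return the first matching scene for every anomalous API."""
    hits = {}
    for api_name, data in dump_data.items():
        in_bad = any(_bad(a) for a in data.get("input_args", []))
        out_bad = any(_bad(o) for o in data.get("output", []))
        for scene_name, level, predicate in SCENES:
            if predicate(in_bad, out_bad):
                hits[api_name] = (scene_name, level)
                break  # first match wins
    return hits


def get_statistics(hits: Dict[str, Tuple[str, Level]]) -> Dict[str, List[str]]:
    """Bucket the API names by rank."""
    rank_to_key = {
        Level.CRITICAL: "critical_apis",
        Level.HIGH: "high_priority_apis",
        Level.MEDIUM: "medium_priority_apis",
    }
    stats = {key: [] for key in rank_to_key.values()}
    for api_name, (_, level) in hits.items():
        key = rank_to_key.get(level)
        if key is not None:
            stats[key].append(api_name)
    return stats


normal = {"Max": 1.0, "Min": 0.0, "Mean": 0.5}
dump = {
    "Torch.add.0.forward": {"input_args": [normal],
                            "output": [{"Max": "inf", "Min": 0.0, "Mean": "nan"}]},
    "Torch.mul.0.forward": {"input_args": [{"Max": float("nan"), "Min": 0.0, "Mean": 0.1}],
                            "output": [normal]},
    "Torch.relu.0.forward": {"input_args": [normal], "output": [normal]},
}
print(get_statistics(analyze(dump)))
```

Because the scene list is ordered, moving an entry up or down in `SCENES` changes which scene an API is reported under when more than one predicate would match.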
msprobe/core/overflow_check/filter.py +157 -0
@@ -0,0 +1,157 @@
+ # Copyright (c) 2024-2024, Huawei Technologies Co., Ltd.
+ # All rights reserved.
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ # http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+ import os.path
+ from dataclasses import dataclass, field
+ from typing import Set
+
+ from msprobe.core.common.file_utils import load_yaml
+ from msprobe.core.overflow_check.api_info import APIInfo
+ from msprobe.core.overflow_check.utils import has_nan_inf
+
+ cur_path = os.path.dirname(os.path.realpath(__file__))
+
+
+ class IgnoreFilter:
+     def __init__(self, rule_path=os.path.join(cur_path, './ignore_rules.yaml')):
+         self.rules = dict()
+         self._load_rules(rule_path)
+
+     def has_api_rule(self, api_name: str) -> bool:
+         return api_name in self.rules.keys()
+
+     def apply_filter(self, api_info: APIInfo) -> bool:
+         """
+         Apply the filter rules and return whether the call should be filtered out.
+         Args:
+             api_info: API call information
+         Returns:
+             Whether the hit is a false positive that should be filtered out.
+         """
+         torch_api = api_info.torch_api_name
+         if not self.has_api_rule(torch_api):
+             return False
+         rule = self.rules.get(torch_api)
+         if not rule.match(api_info):
+             return False
+         return True
+
+     def _load_rules(self, rule_file_path):
+         if self.rules and len(self.rules):
+             return
+         data = load_yaml(rule_file_path)
+         self.rules = dict()
+         for rule_item in data.get('ignore_nan_inf', []):
+             rule = Rule(
+                 api_name=rule_item.get('api_name', ''),
+                 desc=rule_item.get('description', ''),
+                 input_ignore=rule_item.get('input_ignore', []),
+                 output_ignore=rule_item.get('output_ignore', [])
+             )
+             if not rule.verify_field():
+                 continue
+             if self.has_api_rule(rule.api_name):
+                 continue
+             self.rules[rule.api_name] = rule
+
+
+ class Rule:
+
+     def __init__(self, api_name, desc='', input_ignore=None, output_ignore=None):
+         self.api_name = api_name
+         self.desc = desc
+         self.input_ignore = IgnoreItem()
+         self.output_ignore = IgnoreItem()
+         self._init_ignore(input_ignore, output_ignore)
+
+     def __repr__(self):
+         return (f'Rule(api_name={self.api_name}, desc={self.desc}, input_ignore={self.input_ignore}, output_ignore='
+                 f'{self.output_ignore})')
+
+     def verify_field(self):
+         if self.api_name == '':
+             return False
+         # A rule with no input or output ignore entries is invalid
+         if not (len(self.input_ignore.index) + len(self.input_ignore.name) + len(self.output_ignore.index)):
+             return False
+         return True
+
+     def match(self, api_info: APIInfo) -> bool:
+         """
+         Check whether the API info matches this rule.
+         Returns:
+             bool: True if the api_info matches this rule, False otherwise
+         """
+         # First check that the API name matches
+         api_name = api_info.torch_api_name
+         if api_name != self.api_name:
+             return False
+
+         # Check for NaN/Inf in the positional inputs
+         if self.input_ignore.index and len(api_info.input_args):
+             for idx, arg in enumerate(api_info.input_args):
+                 if has_nan_inf(arg) and not self.input_ignore.has_index(idx):
+                     return False
+
+         # Check for NaN/Inf in the input kwargs
+         if self.input_ignore.name and len(api_info.input_kwargs):
+             for name, value in api_info.input_kwargs.items():
+                 if has_nan_inf(value) and not self.input_ignore.has_name(name):
+                     return False
+
+         # Check for NaN/Inf in the outputs
+         if self.output_ignore.index and len(api_info.output_data):
+             for idx, out in enumerate(api_info.output_data):
+                 if has_nan_inf(out) and not self.output_ignore.has_index(idx):
+                     return False
+
+         return True
+
+     def _init_ignore(self, input_ignore=None, output_ignore=None):
+         """Initialize the ignore items."""
+         if input_ignore is None:
+             input_ignore = []
+         if output_ignore is None:
+             output_ignore = []
+
+         # Process the input ignore rules
+         for item in input_ignore:
+             if 'index' in item:
+                 self.input_ignore.add_index(item['index'])
+             if 'name' in item:
+                 self.input_ignore.add_name(item['name'])
+
+         # Process the output ignore rules
+         for item in output_ignore:
+             if 'index' in item:
+                 self.output_ignore.add_index(item['index'])
+
+
+ @dataclass
+ class IgnoreItem:
+     """Stores the indices and names to ignore."""
+     index: Set[int] = field(default_factory=set)
+     name: Set[str] = field(default_factory=set)
+
+     def add_index(self, idx: int):
+         self.index.add(idx)
+
+     def add_name(self, name: str):
+         self.name.add(name)
+
+     def has_index(self, idx: int) -> bool:
+         return idx in self.index
+
+     def has_name(self, name: str) -> bool:
+         return name in self.name
1
+ ignore_nan_inf:
2
+ # Create an uninitialized memory
3
+ - api_name: "torch.empty"
4
+ description: "Creates a tensor with uninitialized data. The values may contain NaN or Inf because the memory is not cleared or set to zero."
5
+ output_ignore:
6
+ - index: 0
7
+
8
+ - api_name: "torch.empty_like"
9
+ description: "Creates an uninitialized tensor with the same size, dtype, and device as the input tensor. The values may contain NaN or Inf due to uninitialized memory."
10
+ output_ignore:
11
+ - index: 0
12
+
13
+ - api_name: "torch.empty_strided"
14
+ description: "Creates a tensor with uninitialized data using specified strides. NaN or Inf may be present due to uninitialized memory."
15
+ output_ignore:
16
+ - index: 0
17
+
18
+ # Distributed func
19
+ - api_name: "distributed.recv"
20
+ description: "Receives a tensor from another process. The input tensor may contain uninitialized data before the recv call, but it will be overwritten with received data."
21
+ input_ignore:
22
+ - index: 0 # tensor (the input buffer, which may be uninitialized before receiving)
23
+ - name: tensor
24
+
25
+ - api_name: "distributed.all_gather"
26
+ description: "Gathers tensors from all processes and distributes them to each process. The tensors in tensor_list may contain uninitialized data before the all_gather call, but they will be overwritten with collected data from all processes."
27
+ input_ignore:
28
+ - index: 0 # tensor_list (the input list of tensors, which may contain uninitialized data before the all_gather call)
29
+
30
+ - api_name: "distributed.reduce_scatter"
31
+ description: "Combines reduction and scatter operations. The output tensor may contain uninitialized data before the reduce_scatter call, but it will be overwritten with the reduced and scattered data from all processes."
32
+ input_ignore:
33
+ - index: 0
34
+ - name: output
35
+
36
+ - api_name: "distributed._reduce_scatter_base"
37
+ description: "Performs a combined reduction and scatter operation using a single input tensor. The output tensor may contain uninitialized data before the _reduce_scatter_base call, but it will be overwritten with the reduced and scattered data."
38
+ input_ignore:
39
+ - index: 0
40
+
41
+ - api_name: "distributed.all_gather_into_tensor"
42
+ description: "Gathers tensors from all processes into a single output tensor. The output tensor may contain uninitialized data before the all_gather_into_tensor call, but it will be overwritten with collected data from all processes."
43
+ input_ignore:
44
+ - index: 0
45
+
46
+ - api_name: "distributed.reduce_scatter_tensor"
47
+ description: "Performs a reduction operation across all processes and scatters the result into the output tensor. The output tensor may contain uninitialized data before the reduce_scatter_tensor call, but it will be overwritten with the reduced and scattered data."
48
+ input_ignore:
49
+ - index: 0
50
+
51
+ # Tensor inplace func
52
+ - api_name: "tensor.masked_fill_"
53
+ description: "Inplace fill tensor with given value by filtered mask"
54
+ input_ignore:
55
+ - index: 0
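Each entry in the rule list above carries an `api_name` plus `input_ignore` and/or `output_ignore` items. As a rough sketch of how such entries could be turned into a lookup table, the function below takes the list in the form a YAML loader would return (plain dicts and lists) and applies the same three skip conditions the loader in this diff uses: an empty name, no ignore entries, or a name already registered. `build_rules` and the plain-dict result layout are hypothetical, and the sample entries `torch.zeros` and the duplicate `torch.empty` are invented to exercise the skip paths.

```python
from typing import Dict, List, Set


def build_rules(items: List[dict]) -> Dict[str, Dict[str, Set]]:
    """Index rule entries by api_name, skipping invalid and duplicate ones."""
    rules: Dict[str, Dict[str, Set]] = {}
    for item in items:
        name = item.get("api_name", "")
        parsed = {
            "input_index": {e["index"] for e in item.get("input_ignore", []) if "index" in e},
            "input_name": {e["name"] for e in item.get("input_ignore", []) if "name" in e},
            "output_index": {e["index"] for e in item.get("output_ignore", []) if "index" in e},
        }
        # Skip entries with no name or with nothing to ignore.
        if not name or not any(parsed.values()):
            continue
        # The first entry for a given API wins; later duplicates are ignored.
        if name in rules:
            continue
        rules[name] = parsed
    return rules


raw = [
    {"api_name": "torch.empty", "output_ignore": [{"index": 0}]},
    {"api_name": "distributed.recv", "input_ignore": [{"index": 0}, {"name": "tensor"}]},
    {"api_name": "", "input_ignore": [{"index": 0}]},               # skipped: empty name
    {"api_name": "torch.zeros"},                                     # skipped: nothing to ignore
    {"api_name": "torch.empty", "output_ignore": [{"index": 1}]},   # skipped: duplicate
]
rules = build_rules(raw)
print(sorted(rules))
```

Keeping the first entry for a duplicated name means the order of entries in the rule file matters when an API is listed twice.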
@@ -0,0 +1,22 @@
1
+ # Copyright (c) 2024-2024, Huawei Technologies Co., Ltd.
2
+ # All rights reserved.
3
+ #
4
+ # Licensed under the Apache License, Version 2.0 (the "License");
5
+ # you may not use this file except in compliance with the License.
6
+ # You may obtain a copy of the License at
7
+ #
8
+ # http://www.apache.org/licenses/LICENSE-2.0
9
+ #
10
+ # Unless required by applicable law or agreed to in writing, software
11
+ # distributed under the License is distributed on an "AS IS" BASIS,
12
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13
+ # See the License for the specific language governing permissions and
14
+ # limitations under the License.
15
+
16
+ from enum import Enum
17
+
18
+
19
+ class OverflowLevel(Enum):
20
+     MEDIUM = "medium"
21
+     HIGH = "high"
22
+     CRITICAL = "critical"
@@ -0,0 +1,28 @@
1
+ # Copyright (c) 2024-2024, Huawei Technologies Co., Ltd.
2
+ # All rights reserved.
3
+ #
4
+ # Licensed under the Apache License, Version 2.0 (the "License");
5
+ # you may not use this file except in compliance with the License.
6
+ # You may obtain a copy of the License at
7
+ #
8
+ # http://www.apache.org/licenses/LICENSE-2.0
9
+ #
10
+ # Unless required by applicable law or agreed to in writing, software
11
+ # distributed under the License is distributed on an "AS IS" BASIS,
12
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13
+ # See the License for the specific language governing permissions and
14
+ # limitations under the License.
15
+
16
+ from typing import Any
17
+
18
+ CHECK_FIELDS = ['Max', 'Min', 'Mean']
19
+ OVERFLOW_VALUES = ['inf', '-inf', 'nan']
20
+
21
+
22
+ def has_nan_inf(value: Any) -> bool:
23
+     """Check whether the value contains NaN or Inf."""
24
+     if isinstance(value, dict):
25
+         for k, v in value.items():
26
+             if k in CHECK_FIELDS and str(v).lower() in OVERFLOW_VALUES:
27
+                 return True
28
+     return False
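As a rough illustration of how the helper in this hunk behaves, the sketch below repeats `has_nan_inf` and its two constants and calls it on hand-made inputs. The sample dictionaries and the `Norm` key are assumptions made up for this example, not data from the package.

```python
from typing import Any

CHECK_FIELDS = ['Max', 'Min', 'Mean']
OVERFLOW_VALUES = ['inf', '-inf', 'nan']


def has_nan_inf(value: Any) -> bool:
    """Return True if a statistics dict holds NaN or Inf in Max, Min, or Mean."""
    if isinstance(value, dict):
        for k, v in value.items():
            # Values are compared by their lowercase string form.
            if k in CHECK_FIELDS and str(v).lower() in OVERFLOW_VALUES:
                return True
    return False


# Hypothetical sample records, invented for illustration only.
normal_stats = {"Max": 1.5, "Min": -0.2, "Mean": 0.3, "Norm": float("inf")}
overflow_stats = {"Max": float("inf"), "Min": 0.0, "Mean": float("nan")}

print(has_nan_inf(normal_stats))    # False: "Norm" is not a checked field
print(has_nan_inf(overflow_stats))  # True
print(has_nan_inf(float("nan")))    # False: non-dict values are never flagged
```

Note that only the three fields in `CHECK_FIELDS` are inspected, and any non-dict input returns `False`, so callers are expected to pass the per-tensor statistics dictionary rather than a raw value.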
@@ -16,6 +16,9 @@ pip install mindstudio-probe
16
16
 
17
17
  |Version|Release date|Supported PyTorch versions|Supported MindSpore versions|Download link|Checksum|
18
18
  |:--:|:--:|:--:|:--:|:--:|:--:|
19
+ |1.2.0|2025.1.13|1.11/2.0/2.1/2.2|2.4.0|[mindstudio_probe-1.2.0-py3-none-any.whl](https://ptdbg.obs.myhuaweicloud.com/msprobe/1.2/mindstudio_probe-1.2.0-py3-none-any.whl)|1e3aeea1706112f6ee52fd1165037936bb209138f0b9ec42ea21e2c1c8942cdc|
20
+ |1.1.1|2024.12.09|1.11/2.0/2.1/2.2|2.4.0|[mindstudio_probe-1.1.1-py3-none-any.whl](https://ptdbg.obs.myhuaweicloud.com/msprobe/1.1/mindstudio_probe-1.1.1-py3-none-any.whl)|577b597555dc155b76ba1a62d575c3546004644e140a456c3ba0824d46283735|
21
+ |1.1.0|2024.10.14|1.11/2.0/2.1/2.2|2.4.0|[mindstudio_probe-1.1.0-py3-none-any.whl](https://ptdbg.obs.myhuaweicloud.com/msprobe/1.1/mindstudio_probe-1.1.0-py3-none-any.whl)|83a5a9b7c65a357639f8c9636d88c693b4cf0eb590d4f8f5cb56395ba69b1f6d|
19
22
  |1.0.4|2024.09.09|1.11/2.0/2.1/2.2|2.4.0|[mindstudio_probe-1.0.4-py3-none-any.whl](https://ptdbg.obs.myhuaweicloud.com/msprobe/1.0/mindstudio_probe-1.0.4-py3-none-any.whl)|4e1909566a71a855b356597750c20ee43d964a22b2c2b02ac08312a5def75fd6|
20
23
  | 1.0.3 | 2024.08.23 | 1.11/2.0/2.1/2.2 | 2.4.0 | [mindstudio_probe-1.0.3-py3-none-any.whl](https://ptdbg.obs.myhuaweicloud.com/msprobe/1.0/mindstudio_probe-1.0.3-py3-none-any.whl) | 7060cc141a5b98ef770cd9220995d299393f32a61938261e632c7e8b5160bef2 |
21
24
  | 1.0.2 | 2024.08.09 | 1.11/2.0/2.1/2.2 | 2.4.0 | [mindstudio_probe-1.0.2-py3-none-any.whl](https://ptdbg.obs.myhuaweicloud.com/msprobe/1.0/mindstudio_probe-1.0.2-py3-none-any.whl) | e4a980e5d98c426ce5ce9842520d9bc031d3b3de621c74b3d59414cc6e238e0e |
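The checksum column above can be used to verify a downloaded wheel before installing it. The sketch below assumes the checksums are SHA-256 digests, which their 64-hex-digit length suggests but the README does not state; `sha256_of` and `matches_checksum` are hypothetical helper names, not part of msprobe.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: Path, expected_hex: str) -> bool:
    """Compare a file's digest with the value from the table (assumed SHA-256)."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, `matches_checksum(Path("mindstudio_probe-1.2.0-py3-none-any.whl"), "1e3aeea1706112f6ee52fd1165037936bb209138f0b9ec42ea21e2c1c8942cdc")` would return `True` only if the downloaded file matches the listed value under that assumption.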
@@ -40,18 +43,104 @@ cd mstt/debug/accuracy_tools
40
43
 
41
44
  pip install setuptools wheel
42
45
 
43
- python setup.py bdist_wheel
46
+ python setup.py bdist_wheel [--include-mod=[adump]]
44
47
  cd ./dist
45
48
  pip install ./mindstudio_probe*.whl
46
49
  ```
47
50
 
48
- # Features of historical versions
51
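+ |Parameter|Description|Required|
52
+ |--|--|:--:|
53
+ |--include-mod|Specifies optional modules. The only accepted value is `adump`, which adds the adump module when building the whl package. By default the parameter is not set and the base package is built.<br>&#8226; The adump module is used for L2-level dump in MindSpore static graph scenarios.<br>&#8226; Only MindSpore 2.5.0 and later support the adump module.<br>&#8226; When installing from source, the build environment must provide GCC 7 or later and CMake 3.14 or later.<br>&#8226; The generated whl package can only be used with the Python version and processor architecture it was built with.|No|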
49
54
 
50
- <table>
51
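- <tr><th>Version</th><th>Features</th></tr>
52
- <tr><td rowspan="2">1.0.3</td><td>[Accuracy pre-check]</br>1. Small on-disk data footprint;</br>2. Supports random generation mode and real data mode;</br>3. Single-API testing, which rules out accumulated error across the whole network.</td></tr>
53
- <tr><td>[Gradient detection]</br>1. Easy to use, with no need to insert code into the training flow.</br>2. Can precisely locate the step where a problem occurs.</td></tr>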
54
- </table>
55
+ # Feature change notes
56
+
57
+ ## 1.1.1
58
+
59
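+ [Data collection]
60
+
61
+ - dump supports data types such as processgroup, namedtuple, and slice
62
+ - Enhanced MindSpore dynamic graph dump: supports mix-mode dump, disabling dropout, and dumping forward and backward data within a controlled range
63
+
64
+ [Accuracy pre-check]
65
+
66
+ - PyTorch scenario: added automatic script generation for single-operator APIs
67
+ - MindSpore dynamic graph scenario: added support for multi-threaded pre-check with multi_run_ut
68
+ - MindSpore scenario: added support for resuming an interrupted pre-check
69
+
70
+ [Accuracy comparison]
71
+
72
+ - Added cross-framework comparison for MindSpore, supporting comparison between MindSpore and PyTorch
73
+ - Supports automatic color highlighting of abnormal comparison results
74
+
75
+ [Benchmark-free comparison]
76
+
77
+ - MindSpore dynamic graph scenario: supports benchmark-free comparison of the backward pass
78
+
79
+ [Training state monitoring]
80
+
81
+ - Added support for monitoring gradient information before communication aggregation
82
+
83
+ [Hierarchical visualization graph comparison]
84
+
85
+ - Added a hierarchical visualization graph comparison tool that supports building a graph from a single dataset, overflow detection, and building a comparison graph from two datasets; it also accepts a mapping file and supports cross-framework or same-framework comparison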
86
+
87
+ ## 1.1.0
88
+
89
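+ [General]
90
+
91
+ - The integrated training accuracy tool atat is renamed to msprobe
92
+ - msprobe supports log levels
93
+
94
+ [Data collection]
95
+
96
+ - Added an L1 dump interface that supports forward and backward dump within a specified range
97
+ - Added communication API dump for MindSpore functional interfaces
98
+
99
+ [Accuracy pre-check]
100
+
101
+ - Supports configuring the blacklist field
102
+ - Extended the list of supported fused operators
103
+
104
+ [Accuracy comparison]
105
+
106
+ - Supports comparison with data mapping and layer mapping
107
+
108
+ [Gradient tool]
109
+
110
+ - Added notes on JIT limitations in the gradient tool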
111
+
112
+ ## 1.0.4
113
+
114
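+ [Data collection]
115
+
116
+ - Supports passing a step range in config.json
117
+ - Optimized the step mechanism in the MindSpore scenario: training continues to run after the step ends
118
+
119
+ [Accuracy pre-check]
120
+
121
+ - In the PyTorch scenario, supports accuracy pre-check for some NPU fused operators
122
+
123
+ [Accuracy comparison]
124
+
125
+ - Removed the need to install PyTorch in the MindSpore scenario
126
+
127
+ [Benchmark-free comparison]
128
+
129
+ - Added a performance baseline report for the PyTorch scenario
130
+ - Supports the change_value perturbation mode in the MindSpore scenario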
131
+
132
+ ## 1.0.3
133
+
134
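+ [Accuracy pre-check]
135
+
136
+ - Reduced the amount of data written to disk
137
+ - Supports random generation mode and real data mode
138
+ - Single-API testing, which rules out accumulated error across the whole network
139
+
140
+ [Gradient detection]
141
+
142
+ - Easy to use, with no need to insert code into the training flow
143
+ - Can precisely locate the step where a problem occurs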
55
144
 
56
145
  # View msprobe tool information
57
146