siclaw 0.1.1 → 0.1.2

Files changed (267)
  1. package/README.md +74 -114
  2. package/dist/agentbox/gateway-client.d.ts +2 -1
  3. package/dist/agentbox/gateway-client.js +6 -2
  4. package/dist/agentbox/gateway-client.js.map +1 -1
  5. package/dist/agentbox/http-server.js +184 -19
  6. package/dist/agentbox/http-server.js.map +1 -1
  7. package/dist/agentbox/resource-handlers.d.ts +1 -0
  8. package/dist/agentbox/resource-handlers.js +23 -23
  9. package/dist/agentbox/resource-handlers.js.map +1 -1
  10. package/dist/agentbox/session.js +85 -5
  11. package/dist/agentbox/session.js.map +1 -1
  12. package/dist/agentbox-main.d.ts +2 -1
  13. package/dist/agentbox-main.js +65 -18
  14. package/dist/agentbox-main.js.map +1 -1
  15. package/dist/cli-credentials.d.ts +1 -0
  16. package/dist/cli-credentials.js +109 -0
  17. package/dist/cli-credentials.js.map +1 -0
  18. package/dist/cli-first-run.d.ts +11 -0
  19. package/dist/cli-first-run.js +99 -0
  20. package/dist/cli-first-run.js.map +1 -0
  21. package/dist/cli-main.js +33 -11
  22. package/dist/cli-main.js.map +1 -1
  23. package/dist/cli-setup.d.ts +5 -11
  24. package/dist/cli-setup.js +12 -225
  25. package/dist/cli-setup.js.map +1 -1
  26. package/dist/core/agent-factory.d.ts +4 -0
  27. package/dist/core/agent-factory.js +102 -151
  28. package/dist/core/agent-factory.js.map +1 -1
  29. package/dist/core/config.d.ts +10 -3
  30. package/dist/core/config.js +11 -95
  31. package/dist/core/config.js.map +1 -1
  32. package/dist/core/extensions/deep-investigation.d.ts +2 -1
  33. package/dist/core/extensions/deep-investigation.js +144 -24
  34. package/dist/core/extensions/deep-investigation.js.map +1 -1
  35. package/dist/core/extensions/setup.d.ts +8 -0
  36. package/dist/core/extensions/setup.js +669 -0
  37. package/dist/core/extensions/setup.js.map +1 -0
  38. package/dist/core/llm-proxy.js +7 -3
  39. package/dist/core/llm-proxy.js.map +1 -1
  40. package/dist/core/mcp-client.d.ts +0 -10
  41. package/dist/core/mcp-client.js +0 -65
  42. package/dist/core/mcp-client.js.map +1 -1
  43. package/dist/core/prompt.d.ts +1 -1
  44. package/dist/core/prompt.js +42 -5
  45. package/dist/core/prompt.js.map +1 -1
  46. package/dist/core/provider-presets.d.ts +14 -0
  47. package/dist/core/provider-presets.js +81 -0
  48. package/dist/core/provider-presets.js.map +1 -0
  49. package/dist/cron/cron-coordinator.d.ts +2 -0
  50. package/dist/cron/cron-coordinator.js +46 -14
  51. package/dist/cron/cron-coordinator.js.map +1 -1
  52. package/dist/cron/cron-executor.js +33 -8
  53. package/dist/cron/cron-executor.js.map +1 -1
  54. package/dist/cron/cron-scheduler.d.ts +1 -1
  55. package/dist/cron/gateway-client.d.ts +5 -0
  56. package/dist/cron/gateway-client.js +43 -8
  57. package/dist/cron/gateway-client.js.map +1 -1
  58. package/dist/cron-main.js +39 -9
  59. package/dist/cron-main.js.map +1 -1
  60. package/dist/gateway/agentbox/client.d.ts +11 -0
  61. package/dist/gateway/agentbox/client.js +18 -0
  62. package/dist/gateway/agentbox/client.js.map +1 -1
  63. package/dist/gateway/agentbox/k8s-spawner.d.ts +11 -2
  64. package/dist/gateway/agentbox/k8s-spawner.js +95 -52
  65. package/dist/gateway/agentbox/k8s-spawner.js.map +1 -1
  66. package/dist/gateway/agentbox/local-spawner.d.ts +1 -1
  67. package/dist/gateway/agentbox/local-spawner.js +4 -2
  68. package/dist/gateway/agentbox/local-spawner.js.map +1 -1
  69. package/dist/gateway/agentbox/manager.d.ts +0 -10
  70. package/dist/gateway/agentbox/manager.js +11 -30
  71. package/dist/gateway/agentbox/manager.js.map +1 -1
  72. package/dist/gateway/agentbox/types.d.ts +6 -4
  73. package/dist/gateway/cron/cron-service.d.ts +49 -0
  74. package/dist/gateway/cron/cron-service.js +259 -0
  75. package/dist/gateway/cron/cron-service.js.map +1 -0
  76. package/dist/gateway/db/init-schema.js +44 -0
  77. package/dist/gateway/db/init-schema.js.map +1 -1
  78. package/dist/gateway/db/migrate-sqlite.js +73 -4
  79. package/dist/gateway/db/migrate-sqlite.js.map +1 -1
  80. package/dist/gateway/db/repositories/chat-repo.d.ts +56 -2
  81. package/dist/gateway/db/repositories/chat-repo.js +132 -2
  82. package/dist/gateway/db/repositories/chat-repo.js.map +1 -1
  83. package/dist/gateway/db/repositories/config-repo.d.ts +31 -2
  84. package/dist/gateway/db/repositories/config-repo.js +57 -7
  85. package/dist/gateway/db/repositories/config-repo.js.map +1 -1
  86. package/dist/gateway/db/repositories/env-repo.d.ts +14 -0
  87. package/dist/gateway/db/repositories/env-repo.js +15 -2
  88. package/dist/gateway/db/repositories/env-repo.js.map +1 -1
  89. package/dist/gateway/db/repositories/model-config-repo.js +6 -5
  90. package/dist/gateway/db/repositories/model-config-repo.js.map +1 -1
  91. package/dist/gateway/db/repositories/skill-repo.d.ts +0 -5
  92. package/dist/gateway/db/repositories/skill-review-repo.d.ts +1 -0
  93. package/dist/gateway/db/repositories/skill-review-repo.js +4 -1
  94. package/dist/gateway/db/repositories/skill-review-repo.js.map +1 -1
  95. package/dist/gateway/db/repositories/skill-version-repo.js +0 -1
  96. package/dist/gateway/db/repositories/skill-version-repo.js.map +1 -1
  97. package/dist/gateway/db/repositories/system-config-repo.d.ts +1 -1
  98. package/dist/gateway/db/repositories/system-config-repo.js +2 -1
  99. package/dist/gateway/db/repositories/system-config-repo.js.map +1 -1
  100. package/dist/gateway/db/repositories/user-env-config-repo.d.ts +13 -0
  101. package/dist/gateway/db/repositories/user-env-config-repo.js +11 -0
  102. package/dist/gateway/db/repositories/user-env-config-repo.js.map +1 -1
  103. package/dist/gateway/db/repositories/workspace-repo.d.ts +3 -2
  104. package/dist/gateway/db/repositories/workspace-repo.js +6 -2
  105. package/dist/gateway/db/repositories/workspace-repo.js.map +1 -1
  106. package/dist/gateway/db/schema-mysql.d.ts +473 -51
  107. package/dist/gateway/db/schema-mysql.js +35 -4
  108. package/dist/gateway/db/schema-mysql.js.map +1 -1
  109. package/dist/gateway/db/schema-sqlite.d.ts +522 -57
  110. package/dist/gateway/db/schema-sqlite.js +38 -6
  111. package/dist/gateway/db/schema-sqlite.js.map +1 -1
  112. package/dist/gateway/db/schema.d.ts +471 -51
  113. package/dist/gateway/db/schema.js +1 -1
  114. package/dist/gateway/db/schema.js.map +1 -1
  115. package/dist/gateway/metrics-aggregator.d.ts +65 -0
  116. package/dist/gateway/metrics-aggregator.js +244 -0
  117. package/dist/gateway/metrics-aggregator.js.map +1 -0
  118. package/dist/gateway/plugins/channel-bridge.d.ts +4 -1
  119. package/dist/gateway/plugins/channel-bridge.js +78 -86
  120. package/dist/gateway/plugins/channel-bridge.js.map +1 -1
  121. package/dist/gateway/rpc-methods.d.ts +4 -2
  122. package/dist/gateway/rpc-methods.js +852 -166
  123. package/dist/gateway/rpc-methods.js.map +1 -1
  124. package/dist/gateway/security/cert-manager.d.ts +2 -2
  125. package/dist/gateway/security/cert-manager.js +4 -2
  126. package/dist/gateway/security/cert-manager.js.map +1 -1
  127. package/dist/gateway/server.d.ts +4 -8
  128. package/dist/gateway/server.js +297 -261
  129. package/dist/gateway/server.js.map +1 -1
  130. package/dist/gateway/skills/file-writer.js +17 -11
  131. package/dist/gateway/skills/file-writer.js.map +1 -1
  132. package/dist/gateway/skills/script-evaluator.js +12 -9
  133. package/dist/gateway/skills/script-evaluator.js.map +1 -1
  134. package/dist/gateway/web/dist/assets/index-0p17ZeTP.js +740 -0
  135. package/dist/gateway/web/dist/assets/index-9eP6nPUq.js +741 -0
  136. package/dist/gateway/web/dist/assets/index-9eP6nPUq.js.map +1 -0
  137. package/dist/gateway/web/dist/assets/index-DyowBCEj.css +1 -0
  138. package/dist/gateway/web/dist/assets/index-PDK5JJDO.css +1 -0
  139. package/dist/gateway/web/dist/index.html +2 -2
  140. package/dist/gateway-main.js +27 -10
  141. package/dist/gateway-main.js.map +1 -1
  142. package/dist/memory/embeddings.js +5 -4
  143. package/dist/memory/embeddings.js.map +1 -1
  144. package/dist/memory/indexer.d.ts +23 -3
  145. package/dist/memory/indexer.js +235 -23
  146. package/dist/memory/indexer.js.map +1 -1
  147. package/dist/memory/schema.js +15 -1
  148. package/dist/memory/schema.js.map +1 -1
  149. package/dist/memory/types.d.ts +18 -0
  150. package/dist/memory/types.js +6 -1
  151. package/dist/memory/types.js.map +1 -1
  152. package/dist/shared/detect-language.d.ts +12 -0
  153. package/dist/shared/detect-language.js +78 -0
  154. package/dist/shared/detect-language.js.map +1 -0
  155. package/dist/shared/diagnostic-events.d.ts +70 -0
  156. package/dist/shared/diagnostic-events.js +38 -0
  157. package/dist/shared/diagnostic-events.js.map +1 -0
  158. package/dist/shared/local-collector.d.ts +56 -0
  159. package/dist/shared/local-collector.js +284 -0
  160. package/dist/shared/local-collector.js.map +1 -0
  161. package/dist/shared/metrics-types.d.ts +64 -0
  162. package/dist/shared/metrics-types.js +25 -0
  163. package/dist/shared/metrics-types.js.map +1 -0
  164. package/dist/shared/metrics.d.ts +19 -0
  165. package/dist/shared/metrics.js +185 -0
  166. package/dist/shared/metrics.js.map +1 -0
  167. package/dist/shared/path-utils.d.ts +15 -0
  168. package/dist/shared/path-utils.js +23 -0
  169. package/dist/shared/path-utils.js.map +1 -0
  170. package/dist/shared/retry.d.ts +35 -0
  171. package/dist/shared/retry.js +61 -0
  172. package/dist/shared/retry.js.map +1 -0
  173. package/dist/tools/command-sets.d.ts +18 -2
  174. package/dist/tools/command-sets.js +207 -32
  175. package/dist/tools/command-sets.js.map +1 -1
  176. package/dist/tools/command-validator.d.ts +56 -0
  177. package/dist/tools/command-validator.js +357 -0
  178. package/dist/tools/command-validator.js.map +1 -0
  179. package/dist/tools/create-skill.js +26 -1
  180. package/dist/tools/create-skill.js.map +1 -1
  181. package/dist/tools/credential-list.js +1 -23
  182. package/dist/tools/credential-list.js.map +1 -1
  183. package/dist/tools/credential-manager.d.ts +98 -0
  184. package/dist/tools/credential-manager.js +313 -0
  185. package/dist/tools/credential-manager.js.map +1 -0
  186. package/dist/tools/deep-search/engine.js +184 -127
  187. package/dist/tools/deep-search/engine.js.map +1 -1
  188. package/dist/tools/deep-search/prompts.d.ts +10 -2
  189. package/dist/tools/deep-search/prompts.js +37 -36
  190. package/dist/tools/deep-search/prompts.js.map +1 -1
  191. package/dist/tools/deep-search/schemas.d.ts +87 -0
  192. package/dist/tools/deep-search/schemas.js +85 -0
  193. package/dist/tools/deep-search/schemas.js.map +1 -0
  194. package/dist/tools/deep-search/sub-agent.d.ts +21 -0
  195. package/dist/tools/deep-search/sub-agent.js +153 -4
  196. package/dist/tools/deep-search/sub-agent.js.map +1 -1
  197. package/dist/tools/deep-search/tool.js +1 -0
  198. package/dist/tools/deep-search/tool.js.map +1 -1
  199. package/dist/tools/deep-search/types.d.ts +2 -0
  200. package/dist/tools/deep-search/types.js.map +1 -1
  201. package/dist/tools/dp-tools.js +29 -5
  202. package/dist/tools/dp-tools.js.map +1 -1
  203. package/dist/tools/exec-utils.d.ts +85 -0
  204. package/dist/tools/exec-utils.js +294 -0
  205. package/dist/tools/exec-utils.js.map +1 -0
  206. package/dist/tools/fork-skill.js +14 -2
  207. package/dist/tools/fork-skill.js.map +1 -1
  208. package/dist/tools/investigation-feedback.d.ts +3 -0
  209. package/dist/tools/investigation-feedback.js +71 -0
  210. package/dist/tools/investigation-feedback.js.map +1 -0
  211. package/dist/tools/manage-schedule.js +16 -6
  212. package/dist/tools/manage-schedule.js.map +1 -1
  213. package/dist/tools/netns-script.js +27 -281
  214. package/dist/tools/netns-script.js.map +1 -1
  215. package/dist/tools/node-exec.d.ts +2 -14
  216. package/dist/tools/node-exec.js +18 -225
  217. package/dist/tools/node-exec.js.map +1 -1
  218. package/dist/tools/node-script.js +14 -168
  219. package/dist/tools/node-script.js.map +1 -1
  220. package/dist/tools/pod-exec.d.ts +1 -1
  221. package/dist/tools/pod-exec.js +10 -26
  222. package/dist/tools/pod-exec.js.map +1 -1
  223. package/dist/tools/pod-nsenter-exec.js +21 -225
  224. package/dist/tools/pod-nsenter-exec.js.map +1 -1
  225. package/dist/tools/pod-script.js +10 -19
  226. package/dist/tools/pod-script.js.map +1 -1
  227. package/dist/tools/restricted-bash.d.ts +1 -17
  228. package/dist/tools/restricted-bash.js +38 -252
  229. package/dist/tools/restricted-bash.js.map +1 -1
  230. package/dist/tools/run-skill.d.ts +3 -1
  231. package/dist/tools/run-skill.js +21 -1
  232. package/dist/tools/run-skill.js.map +1 -1
  233. package/dist/tools/script-resolver.d.ts +3 -1
  234. package/dist/tools/script-resolver.js +74 -30
  235. package/dist/tools/script-resolver.js.map +1 -1
  236. package/dist/tools/update-skill.js +17 -6
  237. package/dist/tools/update-skill.js.map +1 -1
  238. package/package.json +4 -2
  239. package/siclaw.mjs +10 -1
  240. package/skills/core/cluster-events/SKILL.md +1 -1
  241. package/skills/core/deep-investigation/SKILL.md +11 -0
  242. package/skills/core/deployment-rollout-debug/SKILL.md +1 -1
  243. package/skills/core/dns-debug/SKILL.md +1 -0
  244. package/skills/core/meta.json +12 -1
  245. package/skills/core/networkpolicy-debug/SKILL.md +332 -0
  246. package/skills/core/node-logs/scripts/get-node-logs.sh +19 -9
  247. package/skills/core/pod-pending-debug/SKILL.md +1 -0
  248. package/skills/core/quota-debug/SKILL.md +203 -0
  249. package/skills/core/service-debug/SKILL.md +1 -0
  250. package/skills/core/statefulset-debug/SKILL.md +280 -0
  251. package/skills/core/volcano-diagnose-pod/SKILL.md +196 -0
  252. package/skills/core/volcano-diagnose-pod/scripts/diagnose-pod.sh +175 -0
  253. package/skills/core/volcano-gang-scheduling/SKILL.md +299 -0
  254. package/skills/core/volcano-job-diagnose/SKILL.md +319 -0
  255. package/skills/core/volcano-job-diagnose/scripts/diagnose-job.sh +253 -0
  256. package/skills/core/volcano-node-resources/SKILL.md +334 -0
  257. package/skills/core/volcano-node-resources/scripts/get-node-resources.sh +281 -0
  258. package/skills/core/volcano-queue-diagnose/SKILL.md +294 -0
  259. package/skills/core/volcano-queue-diagnose/scripts/diagnose-queue.sh +283 -0
  260. package/skills/core/volcano-resource-insufficient/SKILL.md +315 -0
  261. package/skills/core/volcano-scheduler-config/SKILL.md +371 -0
  262. package/skills/core/volcano-scheduler-config/scripts/get-scheduler-config.sh +297 -0
  263. package/skills/core/volcano-scheduler-logs/SKILL.md +241 -0
  264. package/skills/core/volcano-scheduler-logs/scripts/get-scheduler-logs.sh +159 -0
  265. package/skills/platform/create-skill/SKILL.md +35 -3
  266. package/skills/platform/manage-skill/SKILL.md +9 -2
  267. package/skills/platform/update-skill/SKILL.md +17 -6
@@ -0,0 +1,299 @@
1
+ ---
2
+ name: volcano-gang-scheduling
3
+ description: >-
4
+ Gang Scheduling diagnostic guide for Volcano.
5
+ Use when PodGroup cannot schedule completely, member Pods remain Pending,
6
+ or minAvailable/minMember constraints are not satisfied.
7
+ ---
8
+
9
+ # Gang Scheduling Diagnosis
10
+
11
+ This is a diagnostic guide for Gang scheduling issues in Volcano. Gang scheduling requires that all members of a PodGroup be scheduled simultaneously. If the cluster cannot satisfy the `minMember` requirement, none of the pods will be scheduled.
12
+
13
+ **Scope:** This skill is for **diagnosis only**. Once you identify the root cause, report it to the user and stop. Do NOT attempt to modify PodGroups or resource configurations.
14
+
15
+ ## When to Use This Guide
16
+
17
+ Use this skill when:
18
+ - PodGroup status is `Inqueue` but member Pods remain `Pending`
19
+ - Events contain `minMember` related errors
20
+ - Volcano Job has `minAvailable` or `minMember` that cannot be satisfied
21
+ - Some member Pods are running, others are Pending, and the entire group won't start
22
+ - You see `FailedScheduling` events mentioning Gang constraints
23
+
24
+ ## Understanding Gang Scheduling
25
+
26
+ Gang scheduling in Volcano ensures that either all members of a workload are scheduled, or none are. This is crucial for distributed workloads such as MPI, TensorFlow, and PyTorch jobs, where a partially scheduled group holds resources without making progress.
27
+
28
+ **Key Concepts:**
29
+ - `minMember` (in PodGroup spec): Minimum number of pods that must be scheduled simultaneously
30
+ - `minResources` (in PodGroup spec): Aggregate resource floor (e.g., total GPUs) that must be available — **both** `minMember` and `minResources` must be satisfied if set
31
+ - `minAvailable` (in Job spec): Similar concept at Job level
32
+ - The scheduler checks that resources for all `minMember` pods are available **at the same time** before allocating any of them
33
+
34
+ ## Diagnostic Steps
35
+
36
+ ### Step 1: Identify the PodGroup
37
+
38
+ Find the PodGroup associated with the pending pods:
39
+
40
+ ```bash
41
+ kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.metadata.annotations.scheduling\.volcano\.sh/pod-group}'
42
+ ```
43
+
44
+ ### Step 2: Check PodGroup Status
45
+
46
+ Get detailed PodGroup information:
47
+
48
+ ```bash
49
+ kubectl get podgroup <podgroup-name> -n <namespace> -o yaml
50
+ ```
51
+
52
+ **Key fields to examine:**
53
+
54
+ | Field | Meaning | What to Look For |
55
+ |-------|---------|------------------|
56
+ | `spec.minMember` | Minimum pods required | Is this number achievable? |
57
+ | `spec.minResources` | Aggregate resource floor | Is total cluster capacity sufficient? |
58
+ | `status.phase` | Current scheduling phase | Should be `Inqueue` for ready-to-schedule |
59
+ | `status.running` | Currently running pods | Compare to minMember |
60
+ | `status.pending` | Pending pods | These are waiting for Gang constraint |
61
+ | `spec.queue` | Queue name | Check if queue has sufficient resources |
62
+
63
+ **Common scenarios:**
64
+
65
+ - `status.phase: Pending` - PodGroup is waiting to be enqueued
66
+ - `status.phase: Inqueue` - Ready for scheduling but constraint not met
67
+ - `status.running < spec.minMember` - Gang constraint not satisfied
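The scenarios above can be turned into a small triage helper. This is a minimal sketch rather than part of the skill: the function name `classify_podgroup` and its messages are hypothetical, and the arguments are assumed to be values already read from `status.phase`, `status.running`, and `spec.minMember`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: map PodGroup fields to the scenarios listed above.
# Args: <status.phase> <status.running> <spec.minMember>
classify_podgroup() {
  local phase="$1" running="${2:-0}" min_member="$3"
  if [ "$phase" = "Pending" ]; then
    echo "waiting to be enqueued"
  elif [ "$running" -lt "$min_member" ]; then
    echo "gang constraint not satisfied (${running}/${min_member} running)"
  else
    echo "gang constraint satisfied"
  fi
}

# Example with made-up values: Inqueue, 2 of 4 members running.
classify_podgroup Inqueue 2 4
```

An empty `status.running` is treated as 0 here, which is an assumption of this sketch.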
68
+
69
+ ### Step 3: Calculate Resource Requirements
70
+
71
+ Calculate the total resources needed for the Gang:
72
+
73
+ ```
74
+ Total CPU = minMember × single Pod CPU request
75
+ Total Memory = minMember × single Pod Memory request
76
+ Total GPU = minMember × single Pod GPU request (if applicable)
77
+ ```
78
+
79
+ Get a pod's resource requests:
80
+
81
+ ```bash
82
+ kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].resources.requests}'
83
+ ```
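The multiplication in this step is easy to get wrong when CPU requests mix whole cores and millicores. The following is a minimal sketch under stated assumptions: `cpu_to_millicores` and `gang_total` are hypothetical helper names, and only integer quantities such as `4` or `500m` are handled (decimal values like `0.5` are not).

```bash
#!/usr/bin/env bash
# Hypothetical helper: convert a CPU quantity ("4" or "500m") to millicores.
# Assumption: integer quantities only; "0.5" is not supported.
cpu_to_millicores() {
  local q="$1"
  case "$q" in
    *m) echo "${q%m}" ;;
    *)  echo $(( q * 1000 )) ;;
  esac
}

# Total request of the gang = minMember x per-pod request.
gang_total() {
  local min_member="$1" per_pod="$2"
  echo $(( min_member * per_pod ))
}

# Made-up values: 4 members, each requesting 4 CPUs.
per_pod_mcpu=$(cpu_to_millicores "4")
echo "Gang CPU request: $(gang_total 4 "$per_pod_mcpu")m"
```

The same `gang_total` call works for memory or GPUs as long as both arguments use one unit.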
84
+
85
+ ### Step 4: Check Cluster Resources
86
+
87
+ #### Option A: Check Node Resources
88
+
89
+ View available resources across nodes:
90
+
91
+ ```bash
92
+ kubectl get nodes -o custom-columns='NAME:.metadata.name,CPU:.status.allocatable.cpu,MEM:.status.allocatable.memory,GPU:.status.allocatable.nvidia\.com/gpu'
93
+ ```
94
+
95
+ Check current resource usage:
96
+
97
+ ```bash
98
+ kubectl top nodes
99
+ ```
100
+
101
+ #### Option B: Check by Node Labels (if pods have node affinity)
102
+
103
+ If pods target specific nodes:
104
+
105
+ ```bash
106
+ kubectl get nodes -l <label-key>=<label-value> -o wide
107
+ ```
108
+
109
+ ### Step 5: Check Events for Gang Errors
110
+
111
+ Look for Gang-specific scheduling errors:
112
+
113
+ ```bash
114
+ kubectl get events -n <namespace> --field-selector involvedObject.name=<pod-name> --sort-by='.lastTimestamp'
115
+ ```
116
+
117
+ **Common Gang-related event messages:**
118
+
119
+ | Message | Meaning | Investigation |
120
+ |---------|---------|---------------|
121
+ | `minMember not satisfied` | Gang constraint preventing scheduling | Check if total resources >= minMember requirements |
122
+ | `gang member not ready` | Some pods in the gang are not ready | Check individual pod status |
123
+ | `resource insufficient` | Not enough resources for all members | Use `volcano-resource-insufficient` skill |
124
+
125
+ ### Step 6: Verify Queue Resources
126
+
127
+ If the PodGroup is in a Queue, check if the queue has sufficient deserved resources:
128
+
129
+ ```bash
130
+ kubectl get queue <queue-name>
131
+ kubectl describe queue <queue-name>
132
+ ```
133
+
134
+ Look for:
135
+ - `status.deserved` vs `status.allocated`
136
+ - If allocated >= deserved, the queue is at capacity
137
+ - Check `status.state` is `Open` (not `Closing` or `Closed`)
138
+
139
+ ## Common Causes and Solutions
140
+
141
+ ### Cause 1: minMember Too Large
142
+
143
+ **Symptom:** `minMember` is larger than the number of pods the cluster can host at once, for example because too few nodes are large enough to fit a member pod.
144
+
145
+ **Example:**
146
+ - minMember = 10
147
+ - Each pod requests 8 GPUs
148
+ - Only 5 nodes have 8 GPUs each
149
+ - **Result:** Gang can never be satisfied
150
+
151
+ **Solution:**
152
+ - Reduce `minMember` in PodGroup spec
153
+ - Increase cluster capacity (add nodes)
154
+ - Reduce per-pod resource requests
155
+
156
+ ### Cause 2: Resource Fragmentation
157
+
158
+ **Symptom:** Total cluster resources are sufficient, but not concentrated on enough nodes to satisfy simultaneous scheduling.
159
+
160
+ **Example:**
161
+ - minMember = 4, each needs 4 CPUs
162
+ - Total cluster: 20 CPUs available
163
+ - But distributed across 10 nodes with 2 CPUs each
164
+ - **Result:** No node has 4 free CPUs, so none of the 4 members can be placed even though the total (20 CPUs) exceeds the 16 CPUs required
165
+
166
+ **Solution:**
167
+ - Configure `binpack` plugin to concentrate pods on fewer nodes
168
+ - Defragment cluster by rescheduling or draining nodes
169
+ - Adjust resource requests to fit node sizes
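Fragmentation can be checked by counting how many whole pods fit on each node instead of comparing totals. This is a minimal sketch, assuming a single resource dimension and plain integer values; `max_placeable` is a hypothetical helper, and it ignores affinity, taints, and other scheduling constraints.

```bash
#!/usr/bin/env bash
# Hypothetical helper: how many pods of a given size fit on the nodes?
# Args: <per-pod request> <free amount on node 1> <free amount on node 2> ...
# All values are plain integers in the same unit (e.g. CPUs or GPUs).
max_placeable() {
  local per_pod="$1" free fit=0
  shift
  for free in "$@"; do
    fit=$(( fit + free / per_pod ))   # integer division: whole pods only
  done
  echo "$fit"
}

# Cause 2 example: each pod needs 4 CPUs, 10 nodes have 2 free CPUs each.
echo "placeable: $(max_placeable 4 2 2 2 2 2 2 2 2 2 2) of 4 required"
```

With the Cause 1 numbers, `max_placeable 8 8 8 8 8 8` yields 5, which is below the `minMember` of 10.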
170
+
171
+ ### Cause 3: Priority Preemption
172
+
173
+ **Symptom:** Resources exist but are being used by lower-priority workloads that should be preempted.
174
+
175
+ **Check:**
176
+ - Compare PodGroup priority vs running PodGroups
177
+ - Check whether higher-priority PodGroups exist in the same queue
178
+
179
+ **Solution:**
180
+ - Ensure correct PriorityClass is assigned
181
+ - Check `priority` plugin is enabled in scheduler config
182
+
183
+ ### Cause 4: Queue Resource Exhaustion
184
+
185
+ **Symptom:** The PodGroup's queue has used all its deserved resources.
186
+
187
+ **Check:**
188
+ ```bash
189
+ kubectl get queue <queue-name> -o jsonpath='{.status.allocated}'
190
+ kubectl get queue <queue-name> -o jsonpath='{.status.deserved}'
191
+ ```
192
+
193
+ **Solution:**
194
+ - Increase queue weight or capability
195
+ - Wait for other jobs to complete
196
+ - Use `volcano-queue-diagnose` for detailed analysis
197
+
198
+ ### Cause 5: Affinity/Anti-Affinity Conflicts (Effective Node Pool Narrowing)
199
+
200
+ **Symptom:** Queue shows available capacity, but Gang still blocks. Pod scheduling constraints narrow the effective node pool below what Gang requires.
201
+
202
+ **Diagnosis — compute the effective node pool:**
203
+ ```bash
204
+ # 1. Check pod's nodeSelector
205
+ kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.nodeSelector}'
206
+
207
+ # 2. Check matching nodes
208
+ kubectl get nodes -l <selector-key>=<selector-value> -o custom-columns="NAME:.metadata.name,CPU:.status.allocatable.cpu,GPU:.status.allocatable['nvidia.com/gpu']"
209
+
210
+ # 3. Check tolerations (tainted nodes require matching tolerations)
211
+ kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.tolerations}'
212
+ kubectl get nodes -o custom-columns="NAME:.metadata.name,TAINTS:.spec.taints[*].key"
213
+ ```
214
+
215
+ Volcano scheduling is **two-phase**: first queue-level admission (capacity check), then node-level placement. A job can pass the queue check but fail node placement if all matching nodes are occupied.
216
+
217
+ **Solution:**
218
+ - Relax affinity constraints if possible
219
+ - Ensure sufficient nodes match the constraints
220
+ - Verify toleration matches for tainted nodes
221
+
222
+ ### Cause 6: Queue Has Capacity but Gang Still Blocks
223
+
224
+ **Symptom:** Queue `allocated < deserved`, PodGroup is `Inqueue`, but pods remain Pending.
225
+
226
+ **Check — verify remaining capacity vs Gang requirement:**
227
+ ```bash
228
+ # Queue remaining capacity
229
+ kubectl get queue <queue> -o jsonpath='{"deserved: "}{.status.deserved}{"\nallocated: "}{.status.allocated}'
230
+
231
+ # PodGroup minMember and minResources
232
+ kubectl get podgroup <pg> -n <ns> -o jsonpath='{"minMember: "}{.spec.minMember}{"\nminResources: "}{.spec.minResources}'
233
+ ```
234
+
235
+ Calculate: `remaining = deserved - allocated`. If `remaining < minMember × per-pod-resources`, the Gang cannot be satisfied even though the queue is not fully used.
236
+
237
+ If `minResources` is set, also verify: `remaining >= minResources` for each resource dimension.
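The comparison above can be sketched for one resource dimension as follows. The helper name `queue_fits_gang` is hypothetical, and the inputs are assumed to be integers in a single unit that you have already extracted from the queue and PodGroup.

```bash
#!/usr/bin/env bash
# Hypothetical helper: does the queue's remaining share cover the gang?
# Args: <deserved> <allocated> <minMember> <per-pod request>
# Integers in one unit for a single resource dimension (e.g. CPU cores).
queue_fits_gang() {
  local deserved="$1" allocated="$2" min_member="$3" per_pod="$4"
  local remaining=$(( deserved - allocated ))
  local needed=$(( min_member * per_pod ))
  if [ "$remaining" -lt "$needed" ]; then
    echo "blocked: remaining=${remaining} needed=${needed}"
    return 1
  fi
  echo "ok: remaining=${remaining} needed=${needed}"
}

# Made-up numbers: deserved 32, allocated 20, gang of 4 pods x 4 CPUs.
queue_fits_gang 32 20 4 4 || true
```

Run the check once per resource dimension (CPU, memory, GPU); the gang is blocked if any one of them fails.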
238
+
239
+ **Solution:**
240
+ - Wait for enough resources to free up in the queue
241
+ - Reduce `minMember` or `minResources` if the job can tolerate partial scheduling
242
+
243
+ ### Cause 7: Post-Scheduling Gang Breakage
244
+
245
+ **Symptom:** Job was Running, then moves to Aborted. Running pod count dropped below `minMember`.
246
+
247
+ This happens when pods are evicted (preemption, node failure, OOM) and the remaining count falls below the Gang constraint, causing the entire group to be torn down.
248
+
249
+ **Check:**
250
+ ```bash
251
+ # Current running vs required
252
+ kubectl get podgroup <pg> -n <ns> -o jsonpath='{"running: "}{.status.running}{"\nminMember: "}{.spec.minMember}'
253
+
254
+ # Check for eviction/preemption events
255
+ kubectl get events -n <ns> --field-selector reason=Preempted
256
+ kubectl get events -n <ns> --field-selector reason=Evicted
257
+ ```
258
+
259
+ **Solution:**
260
+ - Investigate why pods were evicted (resource pressure, preemption, node failure)
261
+ - Consider setting `reclaimable: false` on the queue so that other queues cannot reclaim its resources (this does not stop preemption within the same queue)
262
+ - Increase cluster capacity to reduce eviction pressure
263
+
264
+ ## Verification Steps
265
+
266
+ After identifying the issue, verify your analysis:
267
+
268
+ 1. **Check if issue is Gang-specific:**
269
+ - Try scheduling a single pod with same resources
270
+ - If single pod schedules, it's a Gang constraint issue
271
+ - If single pod doesn't schedule, it's a resource/affinity issue
272
+
273
+ 2. **Calculate minimum requirements:**
274
+ - Confirm minMember × per-pod-resources ≤ available resources
275
+ - Confirm enough nodes can accommodate the pods
276
+
277
+ 3. **Check scheduler logs:**
278
+ ```bash
279
+ # Use volcano-scheduler-logs skill
280
+ bash skills/core/volcano-scheduler-logs/scripts/get-scheduler-logs.sh --keyword gang
281
+ ```
282
+
283
+ ## Key Insight
284
+
285
+ Gang Scheduling constraint: **There must be enough resources to schedule `minMember` Pods at the same time.**
286
+
287
+ Even if total cluster resources are sufficient, if resources are released gradually over time (as other pods complete), the "simultaneous" requirement may not be met.
288
+
289
+ **Distinguish between:**
290
+ 1. **Total shortage** - Entire cluster lacks resources
291
+ 2. **Cannot satisfy simultaneously** - Resources exist but not on enough nodes at the same time
292
+ 3. **Queue limit** - Queue deserved resources are exhausted
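The three cases can be told apart with a few comparisons. This is a minimal sketch, not the scheduler's logic: `classify_gang_block` is a hypothetical helper, the order of the checks is an assumption, and all inputs are integers for one resource dimension.

```bash
#!/usr/bin/env bash
# Hypothetical decision helper for the three cases above.
# Args: <needed> <total free in cluster> <pods placeable now> <minMember> <queue remaining>
classify_gang_block() {
  local needed="$1" total_free="$2" placeable="$3" min_member="$4" queue_remaining="$5"
  if [ "$total_free" -lt "$needed" ]; then
    echo "total shortage"
  elif [ "$placeable" -lt "$min_member" ]; then
    echo "cannot satisfy simultaneously"
  elif [ "$queue_remaining" -lt "$needed" ]; then
    echo "queue limit"
  else
    echo "no resource blocker found"
  fi
}

# Made-up numbers: 16 CPUs needed, 20 free in total, but 0 pods placeable.
classify_gang_block 16 20 0 4 32
```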
293
+
294
+ ## See Also
295
+
296
+ - `volcano-diagnose-pod` - General Pod scheduling diagnosis
297
+ - `volcano-queue-diagnose` - Queue status and resource analysis
298
+ - `volcano-resource-insufficient` - Resource shortage diagnosis
299
+ - `volcano-scheduler-logs` - Scheduler log analysis
@@ -0,0 +1,319 @@
1
+ ---
2
+ name: volcano-job-diagnose
3
+ description: >-
4
+ Diagnose Volcano Job status and issues.
5
+ Check Job phases, task statuses, PodGroup associations, and overall job health.
6
+ ---
7
+
8
+ # Volcano Job Diagnosis
9
+
10
+ Diagnose Volcano Job (batch.volcano.sh/v1beta1) status and issues. This skill checks Job phases, task statuses, PodGroup associations, and overall job health.
11
+
12
+ **Scope:** This skill is for **diagnosis only**. Once you identify the root cause, report it to the user and stop. Do NOT attempt to modify job specs or restart jobs — that should be left to the user.
13
+
14
+ ## Usage
15
+
16
+ ```bash
17
+ bash skills/core/volcano-job-diagnose/scripts/diagnose-job.sh --job <job-name> --namespace <namespace>
18
+ ```
19
+
20
+ ## Parameters
21
+
22
+ | Parameter | Required | Description |
23
+ |-----------|----------|-------------|
24
+ | `--job JOB` | yes | Job name to diagnose |
25
+ | `--namespace NS` | no | Namespace (default: `default`) |
26
+ | `--verbose` | no | Show detailed task and pod information |
27
+
28
+ ## Examples
29
+
30
+ Diagnose a Volcano Job:
31
+ ```bash
32
+ bash skills/core/volcano-job-diagnose/scripts/diagnose-job.sh --job my-training-job --namespace training
33
+ ```
34
+
35
+ Verbose mode with task details:
36
+ ```bash
37
+ bash skills/core/volcano-job-diagnose/scripts/diagnose-job.sh --job my-training-job --namespace training --verbose
38
+ ```
39
+
40
+ ## Understanding Volcano Jobs
41
+
42
+ ### Job Structure
43
+
44
+ ```yaml
45
+ apiVersion: batch.volcano.sh/v1beta1
46
+ kind: Job
47
+ spec:
48
+   schedulerName: volcano
49
+   tasks:
50
+     - name: worker
51
+       replicas: 4
52
+       template:
53
+         spec:
54
+           containers:
55
+             - name: worker
56
+               resources:
57
+                 requests:
58
+                   cpu: "4"
59
+                   memory: "8Gi"
60
+   maxRetry: 3   # Max retries before job is Aborted
61
+   policies:
62
+     - event: PodFailed
63
+       action: RestartJob
64
+ ```
65
+
66
+ > **Note:** Volcano Jobs can also be queried using the short name `vcjob` (e.g., `kubectl get vcjob`). This is an alias for `job.batch.volcano.sh`. Be careful not to confuse with native Kubernetes `batch/v1 Job` — always use `job.batch.volcano.sh` or `vcjob` for Volcano Jobs.
67
+
68
+ ### Job Phases
69
+
70
+ | Phase | Meaning |
71
+ |-------|---------|
72
+ | `Pending` | Job is waiting for resources or admission |
73
+ | `Running` | Job is executing |
74
+ | `Completing` | Job tasks are completing |
75
+ | `Completed` | Job finished successfully |
76
+ | `Failed` | Job failed |
77
+ | `Restarting` | Job is being restarted due to policy |
78
+ | `Terminating` | Job is being terminated |
79
+ | `Aborted` | Job was aborted |
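When scripting around these phases, it helps to separate phases in which the job may still change on its own from those in which it has stopped. The sketch below is an assumption of this example, not a Volcano definition: `is_terminal_phase` is a hypothetical helper that treats `Completed`, `Failed`, and `Aborted` as stopped.

```bash
#!/usr/bin/env bash
# Hypothetical helper: has the job stopped progressing on its own?
# Assumption: Completed, Failed and Aborted are treated as stopped phases.
is_terminal_phase() {
  case "$1" in
    Completed|Failed|Aborted) return 0 ;;
    *) return 1 ;;
  esac
}

for phase in Pending Running Completed Aborted; do
  if is_terminal_phase "$phase"; then
    echo "$phase: stopped"
  else
    echo "$phase: still in progress"
  fi
done
```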
80
+
81
+ ### Task Statuses
82
+
83
+ Each task within a job has its own status:
84
+ - `Pending` - Task pods not yet scheduled
85
+ - `Running` - Task pods are running
86
+ - `Completed` - Task finished
87
+ - `Failed` - Task failed
88
+
89
+ ## Diagnostic Flow
90
+
91
+ ### Step 1: Job Overview
92
+
93
+ Get the Job status:
94
+
95
+ ```bash
96
+ kubectl get job.batch.volcano.sh <job-name> -n <namespace> -o yaml
97
+ ```
98
+
99
+ **Key fields to check:**
100
+ - `status.state.phase` - Current job phase
101
+ - `status.failed` - Number of failed pods
102
+ - `status.succeeded` - Number of succeeded pods
103
+ - `status.running` - Number of running pods
104
+ - `status.pending` - Number of pending pods
105
+
106
+ ### Step 2: Check Tasks
107
+
108
+ List all tasks and their statuses:
109
+
110
+ ```bash
111
+ kubectl get pods -n <namespace> -l volcano.sh/job-name=<job-name> -o wide
112
+ ```
113
+
114
+ **What to look for:**
115
+ - Pod phases (Pending, Running, Completed, Failed)
116
+ - Pod restart counts
117
+ - Node assignments
118
+
119
+ ### Step 3: Check PodGroup Association
120
+
121
+ Find the PodGroup created for this Job:
122
+
123
+ ```bash
124
+ kubectl get podgroups -n <namespace> -l volcano.sh/job-name=<job-name>
125
+ ```
126
+
127
+ Or read the PodGroup annotation from one of the Job's pods:
128
+
129
+ ```bash
130
+ kubectl get pods -n <namespace> -l volcano.sh/job-name=<job-name> \
131
+ -o jsonpath='{.items[0].metadata.annotations.scheduling\.volcano\.sh/pod-group}'
132
+ ```
133
+
134
+ **Next step:** If the PodGroup is not `Running` (for example, it is stuck in `Pending` or `Inqueue`), use `volcano-diagnose-pod` for detailed PodGroup analysis.
135
+
136
+ ### Step 4: Check Policies
137
+
138
+ Review job policies that may affect behavior:
139
+
140
+ ```bash
141
+ kubectl get job.batch.volcano.sh <job-name> -n <namespace> -o jsonpath='{.spec.policies}'
142
+ ```
143
+
144
+ **Common policies:**
145
+ - `PodFailed` → `RestartJob` - Restart entire job on any pod failure
146
+ - `PodFailed` → `RestartTask` - Restart only the failed task
147
+ - `PodEvicted` → `RestartTask` - Restart evicted tasks
148
+ - `PodEvicted` → `AbortJob` - Abort entire job when a pod is evicted (can cause unexpected aborts during preemption)
149
+ - `TaskCompleted` → `CompleteJob` - Complete job when task finishes
150
+
151
+ Also check `maxRetry` — when retries are exhausted the job moves to `Aborted`:
152
+ ```bash
153
+ kubectl get job.batch.volcano.sh <job-name> -n <namespace> -o jsonpath='{.spec.maxRetry}'
154
+ ```
155
+
156
+ ### Step 5: Events Analysis
157
+
158
+ Check job-related events:
159
+
160
+ ```bash
161
+ kubectl get events -n <namespace> --field-selector involvedObject.name=<job-name>
162
+ ```
163
+
164
+ **Common event patterns:**
165
+
166
+ #### `JobFailed` - Job has failed
167
+ Check the reason and message for failure details.
168
+
169
+ #### `JobRestarting` - Job is being restarted
170
+ Check the restart policy and previous failure reason.
171
+
172
+ #### `TaskFailed` - Individual task failed
173
+ Whether this fails the entire job depends on the configured policy.
174
+
175
+ ## Common Issues
176
+
177
+ ### Issue 1: Job Stuck in Pending
178
+
179
+ **Symptom:** Job phase is `Pending`, no pods created.
180
+
181
+ **Check:**
182
+ 1. PodGroup status: `kubectl get podgroups -n <ns>`
183
+ 2. Queue state: `kubectl get queue <queue>`
184
+ 3. Events: `kubectl get events -n <ns> | grep <job-name>`
185
+
186
+ **Likely causes:**
187
+ - Queue is Closed
188
+ - PodGroup cannot be enqueued (resource shortage)
189
+ - Admission webhook rejection
190
+
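The queue check can be tied to the job itself so the right queue is inspected. This is a sketch under assumptions: `job_queue_state` is a hypothetical helper, and it assumes a job without `spec.queue` uses the queue named `default`.

```shell
# Hypothetical helper: look up the queue a Volcano Job targets and print
# that queue's state (for example Open or Closed).
# Usage: job_queue_state <job-name> <namespace>
job_queue_state() {
  job="$1"
  ns="$2"
  queue="$(kubectl get job.batch.volcano.sh "$job" -n "$ns" -o jsonpath='{.spec.queue}')"
  queue="${queue:-default}"   # assumption: an unset queue means "default"
  # Queues are cluster-scoped, so no namespace flag is passed here.
  state="$(kubectl get queue "$queue" -o jsonpath='{.status.state}')"
  echo "queue=$queue state=$state"
}
```

If the state is `Closed`, the first likely cause above applies and the remaining checks can wait.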
191
+ ### Issue 2: Some Tasks Running, Others Pending
192
+
193
+ **Symptom:** Partial task scheduling (e.g., 2/4 tasks running).
194
+
195
+ **Check:**
196
+ 1. PodGroup minMember vs actual pod count
197
+ 2. Gang scheduling constraints
198
+ 3. Resource availability
199
+
200
+ **Likely causes:**
201
+ - Gang constraint not satisfied (use `volcano-gang-scheduling`)
202
+ - Resource fragmentation
203
+ - Queue quota exhausted
204
+
205
+ ### Issue 3: Job Restarting Repeatedly
206
+
207
+ **Symptom:** Job keeps restarting, never completes.
208
+
209
+ **Check:**
210
+ 1. Restart policy: `kubectl get job.batch.volcano.sh <job-name> -n <ns> -o jsonpath='{.spec.policies}'`
211
+ 2. Pod failure reasons: `kubectl describe pod <pod> -n <ns>`
212
+ 3. Container logs: `kubectl logs <pod> -n <ns>` (add `--previous` to read the logs of the crashed container)
213
+
214
+ **Likely causes:**
215
+ - Application crashing (check container logs)
216
+ - Resource pressure causing evictions
217
+ - Misconfigured restart policy
218
+
219
+ ### Issue 4: Job Failed After Some Tasks Completed
220
+
221
+ **Symptom:** Some tasks succeeded, but the job is marked `Failed`.
222
+
223
+ **Check:**
224
+ 1. Failed task details
225
+ 2. Job completion policy
226
+ 3. Task lifecycle policies
227
+
228
+ **Likely causes:**
229
+ - One critical task failed
230
+ - Completion policy is strict (all tasks must succeed)
231
+ - Lifecycle policy triggered premature job failure
232
+
233
+ ### Issue 5: Job Aborted Unexpectedly
234
+
235
+ **Symptom:** Job was Running, then moved to `Aborted`.
236
+
237
+ **Check:**
238
+ ```bash
239
+ # Check maxRetry
240
+ kubectl get job.batch.volcano.sh <job> -n <ns> -o jsonpath='{.spec.maxRetry}'
241
+
242
+ # Check for preemption/eviction events
243
+ kubectl get events -n <ns> --field-selector reason=Preempted
244
+ kubectl get events -n <ns> --field-selector reason=Evicted
245
+
246
+ # Check if running pod count dropped below minMember (Gang breakage)
247
+ kubectl get podgroup -n <ns> -l volcano.sh/job-name=<job> -o jsonpath='{"running: "}{.items[0].status.running}{"\nminMember: "}{.items[0].spec.minMember}'
248
+ ```
249
+
250
+ **Likely causes:**
251
+ - `maxRetry` exhausted — job restarted too many times
252
+ - Preemption by higher-priority job — pods evicted, triggering `PodEvicted → AbortJob` policy
253
+ - Gang breakage — pod eviction caused running count to drop below `minMember`, tearing down the entire group
254
+ - Lifecycle policy mismatch — e.g., `PodEvicted → AbortJob` when `RestartTask` would be more appropriate
255
+
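The gang breakage check reduces to comparing two numbers from the PodGroup. The sketch below is a hypothetical helper, `gang_state`, that takes the `running` and `minMember` values as arguments and treats an empty value as 0, since a zero count may be omitted from the status.

```shell
# Hypothetical helper: compare a PodGroup's running count with minMember.
# Usage: gang_state <running> <minMember>
gang_state() {
  running="${1:-0}"      # empty or missing is treated as 0
  min_member="${2:-0}"
  if [ "$running" -lt "$min_member" ]; then
    echo "broken: running=$running < minMember=$min_member"
  else
    echo "ok: running=$running >= minMember=$min_member"
  fi
}
```

For example, `gang_state 2 4` reports a broken gang, while `gang_state 4 4` reports that the constraint is satisfied.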
256
+ ## Task Lifecycle Policies
257
+
258
+ Volcano controls task coordination through lifecycle policies, not explicit task dependencies.
259
+
260
+ ```yaml
261
+ spec:
262
+ tasks:
263
+ - name: master
264
+ replicas: 1
265
+ policies:
266
+ - event: TaskCompleted
267
+ action: CompleteJob
268
+ - name: worker
269
+ replicas: 4
270
+ policies:
271
+ - event: PodFailed
272
+ action: RestartTask
273
+ ```
274
+
275
+ **Diagnosis:**
276
+ ```bash
277
+ # Check per-task status counts
278
+ kubectl get job.batch.volcano.sh <job> -n <ns> -o jsonpath='{.status.taskStatusCount}'
279
+
280
+ # Check configured policies
281
+ kubectl get job.batch.volcano.sh <job> -n <ns> -o jsonpath='{.spec.tasks[*].policies}'
282
+ ```
283
+
284
+ Look for mismatched events/actions that could cause unexpected restarts or premature completion.
285
+
286
+ ## Integration with Other Skills
287
+
288
+ Use this skill in combination with others:
289
+
290
+ ```bash
291
+ # 1. Job-level diagnosis
292
+ bash skills/core/volcano-job-diagnose/scripts/diagnose-job.sh --job my-job --namespace training
293
+
294
+ # 2. If PodGroup issues found → Pod-level diagnosis
295
+ bash skills/core/volcano-diagnose-pod/scripts/diagnose-pod.sh --pod my-job-worker-0 --namespace training
296
+
297
+ # 3. If Gang issues → Gang scheduling analysis
298
+ # (refer to volcano-gang-scheduling skill)
299
+
300
+ # 4. If Queue issues → Queue diagnosis
301
+ bash skills/core/volcano-queue-diagnose/scripts/diagnose-queue.sh --queue training-queue
302
+
303
+ # 5. Check scheduler logs for decisions
304
+ bash skills/core/volcano-scheduler-logs/scripts/get-scheduler-logs.sh --pod my-job-worker-0 --since 1h
305
+ ```
306
+
307
+ ## Environment Variables
308
+
309
+ | Variable | Default | Description |
310
+ |----------|---------|-------------|
311
+ | `VOLCANO_NAMESPACE` | `default` | Default namespace for job lookup |
312
+
313
+ ## See Also
314
+
315
+ - `volcano-diagnose-pod` - Pod-level scheduling diagnosis
316
+ - `volcano-gang-scheduling` - Gang scheduling constraint analysis
317
+ - `volcano-queue-diagnose` - Queue resource analysis
318
+ - `volcano-scheduler-logs` - Scheduler decision logs
319
+ - `deployment-rollout-debug` - Similar diagnostic flow for Deployments