siclaw 0.1.0 → 0.1.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (270)
  1. package/README.md +75 -114
  2. package/dist/agentbox/gateway-client.d.ts +2 -1
  3. package/dist/agentbox/gateway-client.js +6 -2
  4. package/dist/agentbox/gateway-client.js.map +1 -1
  5. package/dist/agentbox/http-server.js +184 -19
  6. package/dist/agentbox/http-server.js.map +1 -1
  7. package/dist/agentbox/resource-handlers.d.ts +1 -0
  8. package/dist/agentbox/resource-handlers.js +23 -23
  9. package/dist/agentbox/resource-handlers.js.map +1 -1
  10. package/dist/agentbox/session.js +85 -5
  11. package/dist/agentbox/session.js.map +1 -1
  12. package/dist/agentbox-main.d.ts +2 -1
  13. package/dist/agentbox-main.js +65 -18
  14. package/dist/agentbox-main.js.map +1 -1
  15. package/dist/cli-credentials.d.ts +1 -0
  16. package/dist/cli-credentials.js +109 -0
  17. package/dist/cli-credentials.js.map +1 -0
  18. package/dist/cli-first-run.d.ts +11 -0
  19. package/dist/cli-first-run.js +99 -0
  20. package/dist/cli-first-run.js.map +1 -0
  21. package/dist/cli-main.js +33 -11
  22. package/dist/cli-main.js.map +1 -1
  23. package/dist/cli-setup.d.ts +5 -11
  24. package/dist/cli-setup.js +12 -225
  25. package/dist/cli-setup.js.map +1 -1
  26. package/dist/core/agent-factory.d.ts +4 -0
  27. package/dist/core/agent-factory.js +102 -151
  28. package/dist/core/agent-factory.js.map +1 -1
  29. package/dist/core/config.d.ts +10 -3
  30. package/dist/core/config.js +11 -95
  31. package/dist/core/config.js.map +1 -1
  32. package/dist/core/extensions/deep-investigation.d.ts +2 -1
  33. package/dist/core/extensions/deep-investigation.js +144 -24
  34. package/dist/core/extensions/deep-investigation.js.map +1 -1
  35. package/dist/core/extensions/setup.d.ts +8 -0
  36. package/dist/core/extensions/setup.js +669 -0
  37. package/dist/core/extensions/setup.js.map +1 -0
  38. package/dist/core/llm-proxy.js +7 -3
  39. package/dist/core/llm-proxy.js.map +1 -1
  40. package/dist/core/mcp-client.d.ts +0 -10
  41. package/dist/core/mcp-client.js +0 -65
  42. package/dist/core/mcp-client.js.map +1 -1
  43. package/dist/core/prompt.d.ts +1 -1
  44. package/dist/core/prompt.js +42 -5
  45. package/dist/core/prompt.js.map +1 -1
  46. package/dist/core/provider-presets.d.ts +14 -0
  47. package/dist/core/provider-presets.js +81 -0
  48. package/dist/core/provider-presets.js.map +1 -0
  49. package/dist/cron/cron-coordinator.d.ts +2 -0
  50. package/dist/cron/cron-coordinator.js +46 -14
  51. package/dist/cron/cron-coordinator.js.map +1 -1
  52. package/dist/cron/cron-executor.js +33 -8
  53. package/dist/cron/cron-executor.js.map +1 -1
  54. package/dist/cron/cron-scheduler.d.ts +1 -1
  55. package/dist/cron/gateway-client.d.ts +5 -0
  56. package/dist/cron/gateway-client.js +43 -8
  57. package/dist/cron/gateway-client.js.map +1 -1
  58. package/dist/cron-main.js +39 -9
  59. package/dist/cron-main.js.map +1 -1
  60. package/dist/gateway/agentbox/client.d.ts +11 -0
  61. package/dist/gateway/agentbox/client.js +18 -0
  62. package/dist/gateway/agentbox/client.js.map +1 -1
  63. package/dist/gateway/agentbox/k8s-spawner.d.ts +11 -2
  64. package/dist/gateway/agentbox/k8s-spawner.js +95 -52
  65. package/dist/gateway/agentbox/k8s-spawner.js.map +1 -1
  66. package/dist/gateway/agentbox/local-spawner.d.ts +1 -1
  67. package/dist/gateway/agentbox/local-spawner.js +4 -2
  68. package/dist/gateway/agentbox/local-spawner.js.map +1 -1
  69. package/dist/gateway/agentbox/manager.d.ts +0 -10
  70. package/dist/gateway/agentbox/manager.js +11 -30
  71. package/dist/gateway/agentbox/manager.js.map +1 -1
  72. package/dist/gateway/agentbox/types.d.ts +6 -4
  73. package/dist/gateway/cron/cron-service.d.ts +49 -0
  74. package/dist/gateway/cron/cron-service.js +259 -0
  75. package/dist/gateway/cron/cron-service.js.map +1 -0
  76. package/dist/gateway/db/init-schema.js +44 -0
  77. package/dist/gateway/db/init-schema.js.map +1 -1
  78. package/dist/gateway/db/migrate-sqlite.js +73 -4
  79. package/dist/gateway/db/migrate-sqlite.js.map +1 -1
  80. package/dist/gateway/db/repositories/chat-repo.d.ts +56 -2
  81. package/dist/gateway/db/repositories/chat-repo.js +132 -2
  82. package/dist/gateway/db/repositories/chat-repo.js.map +1 -1
  83. package/dist/gateway/db/repositories/config-repo.d.ts +31 -2
  84. package/dist/gateway/db/repositories/config-repo.js +57 -7
  85. package/dist/gateway/db/repositories/config-repo.js.map +1 -1
  86. package/dist/gateway/db/repositories/env-repo.d.ts +14 -0
  87. package/dist/gateway/db/repositories/env-repo.js +15 -2
  88. package/dist/gateway/db/repositories/env-repo.js.map +1 -1
  89. package/dist/gateway/db/repositories/model-config-repo.d.ts +1 -1
  90. package/dist/gateway/db/repositories/model-config-repo.js +26 -12
  91. package/dist/gateway/db/repositories/model-config-repo.js.map +1 -1
  92. package/dist/gateway/db/repositories/skill-repo.d.ts +0 -5
  93. package/dist/gateway/db/repositories/skill-review-repo.d.ts +1 -0
  94. package/dist/gateway/db/repositories/skill-review-repo.js +4 -1
  95. package/dist/gateway/db/repositories/skill-review-repo.js.map +1 -1
  96. package/dist/gateway/db/repositories/skill-version-repo.js +0 -1
  97. package/dist/gateway/db/repositories/skill-version-repo.js.map +1 -1
  98. package/dist/gateway/db/repositories/system-config-repo.d.ts +1 -1
  99. package/dist/gateway/db/repositories/system-config-repo.js +2 -1
  100. package/dist/gateway/db/repositories/system-config-repo.js.map +1 -1
  101. package/dist/gateway/db/repositories/user-env-config-repo.d.ts +13 -0
  102. package/dist/gateway/db/repositories/user-env-config-repo.js +11 -0
  103. package/dist/gateway/db/repositories/user-env-config-repo.js.map +1 -1
  104. package/dist/gateway/db/repositories/workspace-repo.d.ts +3 -2
  105. package/dist/gateway/db/repositories/workspace-repo.js +6 -2
  106. package/dist/gateway/db/repositories/workspace-repo.js.map +1 -1
  107. package/dist/gateway/db/schema-mysql.d.ts +473 -51
  108. package/dist/gateway/db/schema-mysql.js +35 -4
  109. package/dist/gateway/db/schema-mysql.js.map +1 -1
  110. package/dist/gateway/db/schema-sqlite.d.ts +522 -57
  111. package/dist/gateway/db/schema-sqlite.js +38 -6
  112. package/dist/gateway/db/schema-sqlite.js.map +1 -1
  113. package/dist/gateway/db/schema.d.ts +471 -51
  114. package/dist/gateway/db/schema.js +1 -1
  115. package/dist/gateway/db/schema.js.map +1 -1
  116. package/dist/gateway/metrics-aggregator.d.ts +65 -0
  117. package/dist/gateway/metrics-aggregator.js +244 -0
  118. package/dist/gateway/metrics-aggregator.js.map +1 -0
  119. package/dist/gateway/plugins/channel-bridge.d.ts +4 -1
  120. package/dist/gateway/plugins/channel-bridge.js +78 -86
  121. package/dist/gateway/plugins/channel-bridge.js.map +1 -1
  122. package/dist/gateway/rpc-methods.d.ts +4 -2
  123. package/dist/gateway/rpc-methods.js +962 -163
  124. package/dist/gateway/rpc-methods.js.map +1 -1
  125. package/dist/gateway/security/cert-manager.d.ts +2 -2
  126. package/dist/gateway/security/cert-manager.js +4 -2
  127. package/dist/gateway/security/cert-manager.js.map +1 -1
  128. package/dist/gateway/server.d.ts +4 -8
  129. package/dist/gateway/server.js +297 -261
  130. package/dist/gateway/server.js.map +1 -1
  131. package/dist/gateway/skills/file-writer.js +17 -11
  132. package/dist/gateway/skills/file-writer.js.map +1 -1
  133. package/dist/gateway/skills/script-evaluator.js +12 -9
  134. package/dist/gateway/skills/script-evaluator.js.map +1 -1
  135. package/dist/gateway/web/dist/assets/index-0p17ZeTP.js +740 -0
  136. package/dist/gateway/web/dist/assets/index-9eP6nPUq.js +741 -0
  137. package/dist/gateway/web/dist/assets/index-9eP6nPUq.js.map +1 -0
  138. package/dist/gateway/web/dist/assets/index-CAmSY91d.js +675 -0
  139. package/dist/gateway/web/dist/assets/index-DMFEh8Pp.css +1 -0
  140. package/dist/gateway/web/dist/assets/index-DyowBCEj.css +1 -0
  141. package/dist/gateway/web/dist/assets/index-PDK5JJDO.css +1 -0
  142. package/dist/gateway/web/dist/index.html +2 -2
  143. package/dist/gateway-main.js +27 -10
  144. package/dist/gateway-main.js.map +1 -1
  145. package/dist/memory/embeddings.js +5 -4
  146. package/dist/memory/embeddings.js.map +1 -1
  147. package/dist/memory/indexer.d.ts +23 -3
  148. package/dist/memory/indexer.js +235 -23
  149. package/dist/memory/indexer.js.map +1 -1
  150. package/dist/memory/schema.js +15 -1
  151. package/dist/memory/schema.js.map +1 -1
  152. package/dist/memory/types.d.ts +18 -0
  153. package/dist/memory/types.js +6 -1
  154. package/dist/memory/types.js.map +1 -1
  155. package/dist/shared/detect-language.d.ts +12 -0
  156. package/dist/shared/detect-language.js +78 -0
  157. package/dist/shared/detect-language.js.map +1 -0
  158. package/dist/shared/diagnostic-events.d.ts +70 -0
  159. package/dist/shared/diagnostic-events.js +38 -0
  160. package/dist/shared/diagnostic-events.js.map +1 -0
  161. package/dist/shared/local-collector.d.ts +56 -0
  162. package/dist/shared/local-collector.js +284 -0
  163. package/dist/shared/local-collector.js.map +1 -0
  164. package/dist/shared/metrics-types.d.ts +64 -0
  165. package/dist/shared/metrics-types.js +25 -0
  166. package/dist/shared/metrics-types.js.map +1 -0
  167. package/dist/shared/metrics.d.ts +19 -0
  168. package/dist/shared/metrics.js +185 -0
  169. package/dist/shared/metrics.js.map +1 -0
  170. package/dist/shared/path-utils.d.ts +15 -0
  171. package/dist/shared/path-utils.js +23 -0
  172. package/dist/shared/path-utils.js.map +1 -0
  173. package/dist/shared/retry.d.ts +35 -0
  174. package/dist/shared/retry.js +61 -0
  175. package/dist/shared/retry.js.map +1 -0
  176. package/dist/tools/command-sets.d.ts +18 -2
  177. package/dist/tools/command-sets.js +207 -32
  178. package/dist/tools/command-sets.js.map +1 -1
  179. package/dist/tools/command-validator.d.ts +56 -0
  180. package/dist/tools/command-validator.js +357 -0
  181. package/dist/tools/command-validator.js.map +1 -0
  182. package/dist/tools/create-skill.js +26 -1
  183. package/dist/tools/create-skill.js.map +1 -1
  184. package/dist/tools/credential-list.js +1 -23
  185. package/dist/tools/credential-list.js.map +1 -1
  186. package/dist/tools/credential-manager.d.ts +98 -0
  187. package/dist/tools/credential-manager.js +313 -0
  188. package/dist/tools/credential-manager.js.map +1 -0
  189. package/dist/tools/deep-search/engine.js +184 -127
  190. package/dist/tools/deep-search/engine.js.map +1 -1
  191. package/dist/tools/deep-search/prompts.d.ts +10 -2
  192. package/dist/tools/deep-search/prompts.js +37 -36
  193. package/dist/tools/deep-search/prompts.js.map +1 -1
  194. package/dist/tools/deep-search/schemas.d.ts +87 -0
  195. package/dist/tools/deep-search/schemas.js +85 -0
  196. package/dist/tools/deep-search/schemas.js.map +1 -0
  197. package/dist/tools/deep-search/sub-agent.d.ts +21 -0
  198. package/dist/tools/deep-search/sub-agent.js +153 -4
  199. package/dist/tools/deep-search/sub-agent.js.map +1 -1
  200. package/dist/tools/deep-search/tool.js +1 -0
  201. package/dist/tools/deep-search/tool.js.map +1 -1
  202. package/dist/tools/deep-search/types.d.ts +2 -0
  203. package/dist/tools/deep-search/types.js.map +1 -1
  204. package/dist/tools/dp-tools.js +29 -5
  205. package/dist/tools/dp-tools.js.map +1 -1
  206. package/dist/tools/exec-utils.d.ts +85 -0
  207. package/dist/tools/exec-utils.js +294 -0
  208. package/dist/tools/exec-utils.js.map +1 -0
  209. package/dist/tools/fork-skill.js +14 -2
  210. package/dist/tools/fork-skill.js.map +1 -1
  211. package/dist/tools/investigation-feedback.d.ts +3 -0
  212. package/dist/tools/investigation-feedback.js +71 -0
  213. package/dist/tools/investigation-feedback.js.map +1 -0
  214. package/dist/tools/manage-schedule.js +16 -6
  215. package/dist/tools/manage-schedule.js.map +1 -1
  216. package/dist/tools/netns-script.js +27 -281
  217. package/dist/tools/netns-script.js.map +1 -1
  218. package/dist/tools/node-exec.d.ts +2 -14
  219. package/dist/tools/node-exec.js +18 -225
  220. package/dist/tools/node-exec.js.map +1 -1
  221. package/dist/tools/node-script.js +14 -168
  222. package/dist/tools/node-script.js.map +1 -1
  223. package/dist/tools/pod-exec.d.ts +1 -1
  224. package/dist/tools/pod-exec.js +10 -26
  225. package/dist/tools/pod-exec.js.map +1 -1
  226. package/dist/tools/pod-nsenter-exec.js +21 -225
  227. package/dist/tools/pod-nsenter-exec.js.map +1 -1
  228. package/dist/tools/pod-script.js +10 -19
  229. package/dist/tools/pod-script.js.map +1 -1
  230. package/dist/tools/restricted-bash.d.ts +1 -17
  231. package/dist/tools/restricted-bash.js +38 -252
  232. package/dist/tools/restricted-bash.js.map +1 -1
  233. package/dist/tools/run-skill.d.ts +3 -1
  234. package/dist/tools/run-skill.js +21 -1
  235. package/dist/tools/run-skill.js.map +1 -1
  236. package/dist/tools/script-resolver.d.ts +3 -1
  237. package/dist/tools/script-resolver.js +74 -30
  238. package/dist/tools/script-resolver.js.map +1 -1
  239. package/dist/tools/update-skill.js +17 -6
  240. package/dist/tools/update-skill.js.map +1 -1
  241. package/package.json +8 -6
  242. package/siclaw.mjs +10 -1
  243. package/skills/core/cluster-events/SKILL.md +1 -1
  244. package/skills/core/deep-investigation/SKILL.md +11 -0
  245. package/skills/core/deployment-rollout-debug/SKILL.md +1 -1
  246. package/skills/core/dns-debug/SKILL.md +1 -0
  247. package/skills/core/meta.json +12 -1
  248. package/skills/core/networkpolicy-debug/SKILL.md +332 -0
  249. package/skills/core/node-logs/scripts/get-node-logs.sh +19 -9
  250. package/skills/core/pod-pending-debug/SKILL.md +1 -0
  251. package/skills/core/quota-debug/SKILL.md +203 -0
  252. package/skills/core/service-debug/SKILL.md +1 -0
  253. package/skills/core/statefulset-debug/SKILL.md +280 -0
  254. package/skills/core/volcano-diagnose-pod/SKILL.md +196 -0
  255. package/skills/core/volcano-diagnose-pod/scripts/diagnose-pod.sh +175 -0
  256. package/skills/core/volcano-gang-scheduling/SKILL.md +299 -0
  257. package/skills/core/volcano-job-diagnose/SKILL.md +319 -0
  258. package/skills/core/volcano-job-diagnose/scripts/diagnose-job.sh +253 -0
  259. package/skills/core/volcano-node-resources/SKILL.md +334 -0
  260. package/skills/core/volcano-node-resources/scripts/get-node-resources.sh +281 -0
  261. package/skills/core/volcano-queue-diagnose/SKILL.md +294 -0
  262. package/skills/core/volcano-queue-diagnose/scripts/diagnose-queue.sh +283 -0
  263. package/skills/core/volcano-resource-insufficient/SKILL.md +315 -0
  264. package/skills/core/volcano-scheduler-config/SKILL.md +371 -0
  265. package/skills/core/volcano-scheduler-config/scripts/get-scheduler-config.sh +297 -0
  266. package/skills/core/volcano-scheduler-logs/SKILL.md +241 -0
  267. package/skills/core/volcano-scheduler-logs/scripts/get-scheduler-logs.sh +159 -0
  268. package/skills/platform/create-skill/SKILL.md +35 -3
  269. package/skills/platform/manage-skill/SKILL.md +9 -2
  270. package/skills/platform/update-skill/SKILL.md +17 -6
@@ -0,0 +1,281 @@
1
+ #!/bin/bash
2
+ # Query cluster node resources for Volcano scheduling.
3
+ # This script performs read-only operations using kubectl.
4
+ set -euo pipefail
5
+
6
+ show_help() {
7
+ cat <<EOF
8
+ Usage: $0 [options]
9
+
10
+ Query cluster node resources to understand capacity and availability.
11
+ Checks allocatable CPU, memory, GPU, and current usage.
12
+
13
+ Options:
14
+ --node NODE Query specific node only
15
+ --label LABEL Filter nodes by label (e.g., gpu=true)
16
+ --show-usage Show current resource usage (requires metrics-server)
17
+ --show-pods Show pods running on each node
18
+ --format FORMAT Output format: table (default), json, wide
19
+ -h, --help Show this help message
20
+
21
+ Examples:
22
+ $0 # All nodes
23
+ $0 --node worker-1 # Specific node
24
+ $0 --label nvidia.com/gpu.present=true # GPU nodes
25
+ $0 --show-usage --show-pods # With usage and pods
26
+ $0 --format json # JSON output
27
+ EOF
28
+ exit 0
29
+ }
30
+
31
+ # Parse arguments
32
+ NODE=""
33
+ LABEL=""
34
+ SHOW_USAGE=false
35
+ SHOW_PODS=false
36
+ FORMAT="table"
37
+
38
+ while [[ $# -gt 0 ]]; do
39
+ case $1 in
40
+ -h|--help) show_help ;;
41
+ --node) NODE="$2"; shift 2 ;;
42
+ --label) LABEL="$2"; shift 2 ;;
43
+ --show-usage) SHOW_USAGE=true; shift ;;
44
+ --show-pods) SHOW_PODS=true; shift ;;
45
+ --format) FORMAT="$2"; shift 2 ;;
46
+ *) echo "Unknown option: $1. Use --help for usage." >&2; exit 1 ;;
47
+ esac
48
+ done
49
+
50
+ # Validate format
51
+ if [[ "$FORMAT" != "table" && "$FORMAT" != "json" && "$FORMAT" != "wide" ]]; then
52
+ echo "Error: Invalid format '$FORMAT'. Use: table, json, or wide" >&2
53
+ exit 1
54
+ fi
55
+
56
+ echo "=== Volcano Node Resources ==="
57
+ [[ -n "$NODE" ]] && echo "Node: $NODE"
58
+ [[ -n "$LABEL" ]] && echo "Label filter: $LABEL"
59
+ echo "Show usage: $SHOW_USAGE"
60
+ echo "Show pods: $SHOW_PODS"
61
+ echo "Format: $FORMAT"
62
+ echo
63
+
64
+ # Build kubectl get nodes command
65
+ NODE_CMD="kubectl get nodes"
66
+ [[ -n "$LABEL" ]] && NODE_CMD="$NODE_CMD -l $LABEL"
67
+ [[ -n "$NODE" ]] && NODE_CMD="$NODE_CMD $NODE"
68
+
69
+ # Check if nodes exist
70
+ if ! $NODE_CMD -o name &>/dev/null; then
71
+ echo "Error: No nodes found matching criteria" >&2
72
+ exit 1
73
+ fi
74
+
75
+ # Function to get node resources
76
+ get_node_resources() {
77
+ local node="$1"
78
+
79
+ # Get allocatable resources
80
+ local cpu_alloc mem_alloc gpu_alloc pods_alloc
81
+ cpu_alloc=$(kubectl get node "$node" -o jsonpath='{.status.allocatable.cpu}' 2>/dev/null || echo "N/A")
82
+ mem_alloc=$(kubectl get node "$node" -o jsonpath='{.status.allocatable.memory}' 2>/dev/null || echo "N/A")
83
+ gpu_alloc=$(kubectl get node "$node" -o jsonpath='{.status.allocatable.nvidia\.com/gpu}' 2>/dev/null || echo "0")
84
+ pods_alloc=$(kubectl get node "$node" -o jsonpath='{.status.allocatable.pods}' 2>/dev/null || echo "N/A")
85
+
86
+ # Get capacity
87
+ local cpu_cap mem_cap
88
+ cpu_cap=$(kubectl get node "$node" -o jsonpath='{.status.capacity.cpu}' 2>/dev/null || echo "N/A")
89
+ mem_cap=$(kubectl get node "$node" -o jsonpath='{.status.capacity.memory}' 2>/dev/null || echo "N/A")
90
+
91
+ # Get status and age
92
+ local status age
93
+ status=$(kubectl get node "$node" -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}' 2>/dev/null || echo "Unknown")
94
+ # Calculate age in days (cross-platform: GNU date on Linux, BSD date on macOS)
95
+ age=$(kubectl get node "$node" -o jsonpath='{.metadata.creationTimestamp}' 2>/dev/null | {
96
+ IFS= read -r timestamp
97
+ if [[ -n "$timestamp" ]]; then
98
+ if date -d "$timestamp" +%s &>/dev/null; then
99
+ # GNU date (Linux)
100
+ created=$(date -d "$timestamp" +%s 2>/dev/null)
101
+ else
102
+ # BSD date (macOS)
103
+ created=$(date -j -f "%Y-%m-%dT%H:%M:%SZ" "$timestamp" +%s 2>/dev/null)
104
+ fi
105
+ now=$(date +%s)
106
+ if [[ -n "$created" && -n "$now" ]]; then
107
+ echo $(( (now - created) / 86400 ))
108
+ else
109
+ echo "N/A"
110
+ fi
111
+ else
112
+ echo "N/A"
113
+ fi
114
+ })
115
+
116
+ # Get taints
117
+ local taints
118
+ taints=$(kubectl get node "$node" -o jsonpath='{.spec.taints[*].key}' 2>/dev/null || echo "")
119
+
120
+ # Get allocated resources (from describe)
121
+ local cpu_req mem_req
122
+ if describe_output=$(kubectl describe node "$node" 2>/dev/null); then
123
+ cpu_req=$(echo "$describe_output" | grep -A 5 "Allocated resources" | awk '$1 == "cpu" {print $2; f=1; exit} END {if (!f) print "N/A"}' || true)
123
+ mem_req=$(echo "$describe_output" | grep -A 5 "Allocated resources" | awk '$1 == "memory" {print $2; f=1; exit} END {if (!f) print "N/A"}' || true)
125
+ else
126
+ cpu_req="N/A"
127
+ mem_req="N/A"
128
+ fi
129
+
130
+ # Calculate available (rough estimate)
131
+ local cpu_avail="N/A"
132
+ local mem_avail="N/A"
133
+
134
+ # Try to calculate if we have numeric values
135
+ if [[ "$cpu_alloc" =~ ^[0-9]+$ && "$cpu_req" =~ ^[0-9]+m?$ ]]; then
136
+ # Convert millicores to cores if needed
137
+ local alloc_val req_val
138
+ alloc_val=$cpu_alloc
139
+ if [[ "$cpu_req" =~ m$ ]]; then
140
+ req_val=$(echo "${cpu_req%m}" | awk '{print $1/1000}')
141
+ else
142
+ req_val=$cpu_req
143
+ fi
144
+ cpu_avail=$(awk "BEGIN {printf \"%.0f\", $alloc_val - $req_val}")
145
+ fi
146
+
147
+ # Output based on format
148
+ case "$FORMAT" in
149
+ table)
150
+ echo "Node: $node"
151
+ echo " Status: $status"
152
+ echo " Age: ${age}d"
153
+ [[ -n "$taints" ]] && echo " Taints: $taints"
154
+ echo " Resources:"
155
+ echo " CPU: Allocatable=$cpu_alloc | Requested=$cpu_req | Available=$cpu_avail"
156
+ echo " Memory: Allocatable=$mem_alloc | Requested=$mem_req | Available=$mem_avail"
157
+ [[ "$gpu_alloc" != "0" ]] && echo " GPU: Allocatable=$gpu_alloc"
158
+ [[ "$SHOW_USAGE" == "true" ]] && echo " Usage: (see metrics below)"
159
+
160
+ if [[ "$SHOW_USAGE" == "true" ]]; then
161
+ echo
162
+ echo " Resource Usage (requires metrics-server):"
163
+ if kubectl top node "$node" 2>/dev/null; then
164
+ : # success
165
+ else
166
+ echo " (Metrics not available)"
167
+ fi
168
+ fi
169
+
170
+ if [[ "$SHOW_PODS" == "true" ]]; then
171
+ echo
172
+ echo " Running Pods:"
173
+ kubectl get pods --all-namespaces --field-selector spec.nodeName="$node" -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,STATUS:.status.phase,CPU_REQ:.spec.containers[*].resources.requests.cpu,MEM_REQ:.spec.containers[*].resources.requests.memory' 2>/dev/null | head -20 || echo " (Failed to list pods)"
174
+ fi
175
+ echo
176
+ ;;
177
+
178
+ wide)
179
+ echo "$node $cpu_alloc $mem_alloc $gpu_alloc $status ${age}d"
180
+ ;;
181
+
182
+ json)
183
+ echo " {"
184
+ echo " \"name\": \"$node\","
185
+ echo " \"status\": \"$status\","
186
+ echo " \"age_days\": $age,"
187
+ [[ -n "$taints" ]] && echo " \"taints\": \"$taints\","
188
+ echo " \"allocatable\": {"
189
+ echo " \"cpu\": \"$cpu_alloc\","
190
+ echo " \"memory\": \"$mem_alloc\","
191
+ echo " \"gpu\": \"$gpu_alloc\","
192
+ echo " \"pods\": \"$pods_alloc\""
193
+ echo " },"
194
+ echo " \"requested\": {"
195
+ echo " \"cpu\": \"$cpu_req\","
196
+ echo " \"memory\": \"$mem_req\""
197
+ echo " },"
198
+ echo " \"available\": {"
199
+ echo " \"cpu\": \"$cpu_avail\","
200
+ echo " \"memory\": \"$mem_avail\""
201
+ echo " }"
202
+ echo " }"
203
+ ;;
204
+ esac
205
+ }
206
+
207
+ # Main logic
208
+ case "$FORMAT" in
209
+ table|wide)
210
+ if [[ "$FORMAT" == "wide" ]]; then
211
+ echo "NAME CPU_ALLOC MEM_ALLOC GPU_ALLOC STATUS AGE"
212
+ echo "==== ========= ========= ========= ====== ===="
213
+ fi
214
+
215
+ if [[ -n "$NODE" ]]; then
216
+ get_node_resources "$NODE"
217
+ else
218
+ # Use process substitution instead of pipe to avoid subshell
219
+ while read -r n; do
220
+ [[ -n "$n" ]] && get_node_resources "$n"
221
+ done < <(kubectl get nodes ${LABEL:+-l $LABEL} -o jsonpath='{.items[*].metadata.name}' 2>/dev/null | tr ' ' '\n')
222
+ fi
223
+ ;;
224
+
225
+ json)
226
+ echo "{"
227
+ echo " \"nodes\": ["
228
+
229
+ # Use process substitution to avoid subshell issue with 'first' variable
230
+ first=true
231
+ if [[ -n "$NODE" ]]; then
232
+ get_node_resources "$NODE"
233
+ else
234
+ while read -r n; do
235
+ if [[ -n "$n" ]]; then
236
+ [[ "$first" == "false" ]] && echo ","
237
+ get_node_resources "$n"
238
+ first=false
239
+ fi
240
+ done < <(kubectl get nodes ${LABEL:+-l $LABEL} -o jsonpath='{.items[*].metadata.name}' 2>/dev/null | tr ' ' '\n')
241
+ fi
242
+
243
+ echo
244
+ echo " ]"
245
+ echo "}"
246
+ ;;
247
+ esac
248
+
249
+ # Summary for table format
250
+ if [[ "$FORMAT" == "table" ]]; then
251
+ echo "=== Summary ==="
252
+
253
+ # Count nodes by status
254
+ total=$(kubectl get nodes ${LABEL:+-l $LABEL} 2>/dev/null | wc -l)
255
+ total=$((total - 1)) # Subtract header
256
+ ready=$(kubectl get nodes ${LABEL:+-l $LABEL} 2>/dev/null | grep -c " Ready " || true)
257
+ not_ready=$((total - ready))
258
+
259
+ echo "Total Nodes: $total"
260
+ echo " Ready: $ready"
261
+ echo " NotReady: $not_ready"
262
+
263
+ # Check for GPU nodes
264
+ gpu_nodes=$(kubectl get nodes ${LABEL:+-l $LABEL} -o jsonpath='{.items[*].status.allocatable.nvidia\.com/gpu}' 2>/dev/null | tr ' ' '\n' | grep -v "^0$" | grep -v "^$" | wc -l)
265
+ if [[ "$gpu_nodes" -gt 0 ]]; then
266
+ echo " GPU Nodes: $gpu_nodes"
267
+ fi
268
+
269
+ # Check metrics availability
270
+ if [[ "$SHOW_USAGE" == "true" ]]; then
271
+ echo
272
+ if kubectl top nodes &>/dev/null; then
273
+ echo "Metrics-server: Available"
274
+ else
275
+ echo "Metrics-server: Not Available (kubectl top nodes failed)"
276
+ fi
277
+ fi
278
+ fi
279
+
280
+ echo
281
+ echo "=== Query Complete ==="
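The availability estimate in the script above only runs when allocatable CPU is a whole number of cores, and it rounds the result to whole cores. A minimal sketch of an alternative, assuming it is acceptable to work in millicores throughout, is shown below. The helper names `cpu_to_millicores` and `cpu_available` are hypothetical and are not part of the package.

```bash
#!/bin/bash
# Hypothetical helpers (not part of siclaw): normalize Kubernetes CPU
# quantities to millicores so that "8", "0.5" and "850m" can be compared.

cpu_to_millicores() {
  local q="$1"
  if [[ "$q" =~ ^([0-9]+)m$ ]]; then
    # Already in millicores, e.g. "850m" -> 850
    echo "${BASH_REMATCH[1]}"
  elif [[ "$q" =~ ^[0-9]+(\.[0-9]+)?$ ]]; then
    # Whole or fractional cores, e.g. "8" -> 8000, "0.5" -> 500
    awk -v c="$q" 'BEGIN { printf "%.0f\n", c * 1000 }'
  else
    # Anything else (e.g. "N/A") is treated as unparseable
    return 1
  fi
}

# cpu_available ALLOCATABLE REQUESTED -> "<n>m", or "N/A" if either is unparseable
cpu_available() {
  local a r
  a=$(cpu_to_millicores "$1") || { echo "N/A"; return 0; }
  r=$(cpu_to_millicores "$2") || { echo "N/A"; return 0; }
  # 10# forces base 10 so a leading zero is not read as octal
  echo "$(( 10#$a - 10#$r ))m"
}

cpu_available 8 850m     # 7150m
cpu_available 7910m 2    # 5910m
cpu_available 8 N/A      # N/A
```

Keeping everything in millicores avoids floating-point rounding and also handles nodes whose allocatable CPU is reported with an `m` suffix.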
@@ -0,0 +1,294 @@
1
+ ---
2
+ name: volcano-queue-diagnose
3
+ description: >-
4
+ Diagnose Volcano Queue status and resource allocation.
5
+ Check queue weights, deserved resources, allocated resources,
6
+ and identify queue-related scheduling bottlenecks.
7
+ ---
8
+
9
+ # Volcano Queue Diagnosis
10
+
11
+ Diagnose Volcano Queue status, resource allocation, and scheduling bottlenecks. This skill helps understand how resources are distributed across queues and why workloads may be pending due to queue constraints.
12
+
13
+ **Scope:** This skill is for **diagnosis only**. Once you identify the root cause, report it to the user and stop. Do NOT attempt to modify queue configurations or delete queues.
14
+
15
+ **Not applicable to native ResourceQuota:** Volcano Queue and Kubernetes ResourceQuota are independent mechanisms. If the cluster does not use Volcano, use `quota-debug` instead. To check, run `kubectl get queue 2>/dev/null`; if it returns an error or nothing, Volcano is not installed.
16
+
17
+ ## Usage
18
+
19
+ ```bash
20
+ bash skills/core/volcano-queue-diagnose/scripts/diagnose-queue.sh [options]
21
+ ```
22
+
23
+ ## Parameters
24
+
25
+ | Parameter | Required | Description |
26
+ |-----------|----------|-------------|
27
+ | `--queue QUEUE` | no | Queue name to diagnose (default: all queues) |
28
+ | `--show-pods` | no | Show pods associated with each queue |
29
+ | `--verbose` | no | Show detailed resource breakdown |
30
+
31
+ ## Examples
32
+
33
+ Diagnose all queues:
34
+ ```bash
35
+ bash skills/core/volcano-queue-diagnose/scripts/diagnose-queue.sh
36
+ ```
37
+
38
+ Diagnose specific queue:
39
+ ```bash
40
+ bash skills/core/volcano-queue-diagnose/scripts/diagnose-queue.sh --queue training-queue
41
+ ```
42
+
43
+ Show verbose output with pod information:
44
+ ```bash
45
+ bash skills/core/volcano-queue-diagnose/scripts/diagnose-queue.sh --queue training-queue --show-pods --verbose
46
+ ```
47
+
48
+ ## Understanding Volcano Queues
49
+
50
+ ### Queue Concept
51
+
52
+ In Volcano, a Queue is a cluster-level resource allocation unit. Jobs and PodGroups are submitted to queues, and the scheduler distributes resources among queues based on:
53
+
54
+ 1. **Weight** - Relative share of cluster resources (proportional: weight 10 vs weight 2 = 83% vs 17%)
55
+ 2. **Capability** - Maximum resources a queue can use (ceiling, not guarantee — actual allocation depends on cluster capacity and competition)
56
+ 3. **Parent** - Hierarchical queue relationships (if enabled)
57
+
58
+ **Important:** A Queue is a **cluster-scoped** resource. PodGroups from **any namespace** can reference the same queue, so cross-namespace resource competition within a queue is expected.
59
+
60
+ ### Queue Status Fields
61
+
62
+ | Field | Meaning |
63
+ |-------|---------|
64
+ | `state` | Queue state: Open, Closed, Closing |
65
+ | `deserved` | Resources the queue should receive based on weight |
66
+ | `allocated` | Resources currently allocated to jobs in this queue |
67
+ | `used` | Resources actually used by running pods (≤ allocated) |
68
+ | `pending` | Number of PodGroups waiting in the queue |
69
+ | `running` | Number of running PodGroups |
70
+
71
+ ## Diagnostic Flow
72
+
73
+ ### Step 1: List All Queues
74
+
75
+ Get an overview of all queues:
76
+
77
+ ```bash
78
+ kubectl get queue
79
+ ```
80
+
81
+ **Output columns:**
82
+ - NAME: Queue name
83
+ - WEIGHT: Queue weight (higher = more resources)
84
+ - STATE: Open, Closed, or Closing
85
+ - PARENT: Parent queue (for hierarchical queues)
86
+
87
+ ### Step 2: Check Queue Details
88
+
89
+ Get detailed information about a specific queue:
90
+
91
+ ```bash
92
+ kubectl get queue <queue-name> -o yaml
93
+ kubectl describe queue <queue-name>
94
+ ```
95
+
96
+ **Key sections to examine:**
97
+
98
+ #### Spec (Configuration)
99
+ ```yaml
100
+ spec:
101
+ weight: 10 # Relative weight (default: 1)
102
+ capability: # Max resources allowed
103
+ cpu: "100"
104
+ memory: "200Gi"
105
+ reclaimable: true # Allow resource reclamation
106
+ ```
107
+
108
+ #### Status (Runtime State)
109
+ ```yaml
110
+ status:
111
+ state: Open # Open, Closed, or Closing
112
+ pending: 5 # PodGroups waiting
113
+ running: 10 # Running PodGroups
114
+ deserved: # Resources this queue should get
115
+ cpu: "40"
116
+ memory: "80Gi"
117
+ allocated: # Resources actually allocated
118
+ cpu: "35"
119
+ memory: "70Gi"
120
+ ```
121
+
122
+ ### Step 3: Check Queue Resource Utilization
123
+
124
+ Calculate utilization ratios:
125
+
126
+ ```
127
+ Allocation Ratio = allocated / deserved
128
+ Utilization Ratio = used / allocated
129
+ ```
130
+
131
+ **Interpretation:**
132
+ - `allocated >= deserved`: Queue is at or over its fair share
133
+ - `allocated < deserved`: Queue has room to grow
134
+ - `used << allocated`: Jobs have reserved resources but are not using them
135
+
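As a rough illustration of the ratios above, the following sketch computes a percentage from two plain numbers. The function name `ratio_pct` is hypothetical, and the inputs are assumed to be unit-free values in the same unit (for example CPU cores); quantities such as `80Gi` would need to be converted first.

```bash
#!/bin/bash
# Hypothetical helper (not part of the skill): print NUM/DEN as a percentage.
# Prints "N/A" when the denominator is zero or negative.
ratio_pct() {
  awk -v a="$1" -v d="$2" 'BEGIN {
    if (d <= 0) { print "N/A" }
    else        { printf "%.1f%%\n", (a / d) * 100 }
  }'
}

# Allocation ratio for the sample status in Step 2 (allocated 35, deserved 40)
ratio_pct 35 40    # 87.5%
ratio_pct 35 0     # N/A
```

A value at or above 100% corresponds to the `allocated >= deserved` case in the interpretation list.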
136
+ ### Step 4: Identify PodGroups in Queue
137
+
138
+ Find workloads associated with a queue:
139
+
140
+ ```bash
141
+ # Find all PodGroups in a queue
142
+ kubectl get podgroups --all-namespaces -o json | \
143
+ jq -r '.items[] | select(.spec.queue=="<queue-name>") | "\(.metadata.namespace)/\(.metadata.name)"'
144
+
145
+ # Check pending PodGroups
146
+ kubectl get podgroups --all-namespaces -o json | \
147
+ jq -r '.items[] | select(.spec.queue=="<queue-name>" and .status.phase=="Pending") | \
148
+ "\(.metadata.namespace)/\(.metadata.name): \(.status.phase)"'
149
+ ```
150
+
151
+ ### Step 5: Check Queue Events
152
+
153
+ Look for queue-related events:
154
+
155
+ ```bash
156
+ kubectl get events --all-namespaces --field-selector reason=FailedScheduling | grep -i queue
157
+ ```
158
+
159
+ ## Common Queue Issues
160
+
161
+ ### Issue 1: Queue Resource Exhaustion
162
+
163
+ **Symptom:** `allocated >= deserved`, new PodGroups stay in Pending
164
+
165
+ **Check:**
166
+ ```bash
167
+ kubectl get queue <queue> -o jsonpath='{"
168
+ Deserved: "}{.status.deserved}{"
169
+ Allocated: "}{.status.allocated}{"
170
+ Ratio: "}{.status.allocated.cpu}{"/"}{.status.deserved.cpu}{"
171
+ "}'
172
+ ```
173
+
174
+ For GPU-specific checks (GPU is often the bottleneck):
175
+ ```bash
176
+ kubectl get queue -o custom-columns="NAME:.metadata.name,GPU_CAP:.spec.capability['nvidia.com/gpu'],GPU_ALLOC:.status.allocated['nvidia.com/gpu']"
177
+ ```
178
+
179
+ **Also cross-validate capability against actual cluster capacity** — a common misconfiguration is setting `spec.capability` higher than the cluster's physical resources:
180
+ ```bash
181
+ kubectl get nodes -o custom-columns="NAME:.metadata.name,GPU:.status.allocatable['nvidia.com/gpu'],CPU:.status.allocatable.cpu,MEM:.status.allocatable.memory"
182
+ ```
183
+ If the sum of all nodes' allocatable GPUs is less than the queue's `spec.capability`, the queue can never be fully utilized. When allocation reaches the cluster's physical limit, the queue appears to have remaining capacity but no more resources can actually be scheduled.
184
+
185
+ **Solution:**
186
+ - Increase queue weight (edit `spec.weight` on the Queue; shares are recalculated relative to the other queues)
187
+ - Increase queue capability (only if cluster has physical capacity)
188
+ - Wait for other jobs to complete
189
+ - Check if other queues are over-allocated (reclaim may help)
190
+
191
+ ### Issue 2: Queue is Closed
192
+
193
+ **Symptom:** `status.state: Closed`, new PodGroups rejected
194
+
195
+ **Check:**
196
+ ```bash
197
+ kubectl get queue <queue> -o jsonpath='{.status.state}'
198
+ ```
199
+
200
+ **Solution:**
201
+ - Queue must be reopened by admin
202
+ - Use a different queue
203
+
204
+ ### Issue 3: Weight Imbalance
205
+
206
+ **Symptom:** One queue gets all resources, others starve
207
+
208
+ **Check:**
209
+ ```bash
210
+ kubectl get queue -o custom-columns='NAME:.metadata.name,WEIGHT:.spec.weight,STATE:.status.state,CPU_DESERVED:.status.deserved.cpu,CPU_ALLOC:.status.allocated.cpu,MEM_DESERVED:.status.deserved.memory,MEM_ALLOC:.status.allocated.memory'
211
+ ```
212
+
213
+ **Analysis:** Volcano distributes resources proportionally by weight. For example:
214
+ - Queue A (weight=10) + Queue B (weight=2): A gets 10/12 ≈ 83%, B gets 2/12 ≈ 17% of total cluster resources
215
+ - If Queue B has many pending jobs but low deserved resources, its weight is too low relative to others
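The weight arithmetic above can be reproduced for any set of queues. This is a simplified sketch: it only divides by the total weight and ignores capability caps and actual demand, both of which affect the real deserved values. The function name `queue_shares` is hypothetical.

```bash
#!/usr/bin/env bash
# Print each queue's proportional share from "name=weight" arguments.
queue_shares() {
  printf '%s\n' "$@" | awk -F= '
    { name[NR] = $1; weight[NR] = $2; total += $2 }
    END {
      if (total == 0) exit 1          # no usable weights supplied
      for (i = 1; i <= NR; i++)
        printf "%s %.0f%%\n", name[i], 100 * weight[i] / total
    }'
}

queue_shares A=10 B=2
```

Weights can be read from the `WEIGHT` column of the check command above and passed in as `name=weight` pairs.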
216
+
217
+ **Solution:**
218
+ - Adjust queue weights proportionally
219
+ - Check if high-weight queues have capability limits preventing allocation
220
+
221
+ ### Issue 4: Resource Reclaim Not Working
222
+
223
+ **Symptom:** Queue is over-allocated but reclaim is not triggered
224
+
225
+ **Check:**
226
+ ```bash
227
+ # Check reclaim is enabled in scheduler config
228
+ kubectl get cm volcano-scheduler-configmap -n volcano-system -o yaml | grep reclaim
229
+ ```
230
+
231
+ **Reclaim troubleshooting checklist (all must be true):**
232
+ 1. `reclaim` action must be in scheduler actions
233
+ 2. `proportion` plugin must be enabled
234
+ 3. Source queue must be under-utilized (allocated < deserved)
235
+ 4. Target queue must have over-allocated resources (allocated > deserved)
236
+ 5. Target queue must have `reclaimable: true`
237
+
238
+ Check the reclaimable flag on the specific queue:
239
+ ```bash
240
+ kubectl get queue <queue> -o jsonpath='{.spec.reclaimable}'
241
+ ```
242
+ If `reclaimable` is `false`, the queue's resources **cannot be reclaimed** even if it's over-allocated. An unset value is treated as `true` by Volcano's default, so an empty result here does not block reclaim.
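The five checklist conditions can be evaluated mechanically once the values are collected. The sketch below is an illustration only: `reclaim_possible` is a hypothetical helper, the first two arguments are `true`/`false` flags you determine from the scheduler ConfigMap, and the resource figures are assumed to be plain integers in the same unit (for example millicores).

```bash
#!/usr/bin/env bash
# Args: reclaim_action proportion_plugin src_alloc src_deserved \
#       tgt_alloc tgt_deserved tgt_reclaimable
# Prints the first failed prerequisite and returns 1, or returns 0 if all hold.
reclaim_possible() {
  local action=$1 plugin=$2 src_alloc=$3 src_deserved=$4
  local tgt_alloc=$5 tgt_deserved=$6 reclaimable=$7
  [ "$action" = true ]                || { echo "FAIL 1: reclaim action not enabled"; return 1; }
  [ "$plugin" = true ]                || { echo "FAIL 2: proportion plugin not enabled"; return 1; }
  [ "$src_alloc" -lt "$src_deserved" ] || { echo "FAIL 3: source queue is not under-utilized"; return 1; }
  [ "$tgt_alloc" -gt "$tgt_deserved" ] || { echo "FAIL 4: target queue is not over-allocated"; return 1; }
  [ "$reclaimable" = true ]           || { echo "FAIL 5: target queue is not reclaimable"; return 1; }
  echo "OK: all 5 prerequisites hold"
}

# Source queue uses 1000 of 4000 deserved; target uses 6000 of 4000 deserved.
reclaim_possible true true 1000 4000 6000 4000 true
```

The numbered `FAIL` messages map directly onto the checklist items, so the first failing line shows which prerequisite to investigate.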
243
+
244
+ **Solution:**
245
+ - Verify all 5 prerequisites above
246
+ - Check scheduler logs for reclaim attempts: use `volcano-scheduler-logs --keyword reclaim`
247
+
248
+ ## Queue Hierarchy (Advanced)
249
+
250
+ If using hierarchical queues:
251
+
252
+ ```bash
253
+ # Check parent-child relationships
254
+ kubectl get queue -o custom-columns='NAME:.metadata.name,PARENT:.spec.parent,WEIGHT:.spec.weight'
255
+ ```
256
+
257
+ **Key points:**
258
+ - Child queues share parent's deserved resources
259
+ - Weight is relative to siblings, not absolute
260
+ - Parent queue's deserved = sum of children's usage
261
+
262
+ ## Script Output Interpretation
263
+
264
+ The diagnose-queue.sh script provides:
265
+
266
+ 1. **Queue Summary Table**
267
+ - Name, State, Weight
268
+ - Pending/Running counts
269
+ - Resource allocation summary
270
+
271
+ 2. **Resource Breakdown (with --verbose)**
272
+ - CPU: deserved, allocated, usage ratio
273
+ - Memory: deserved, allocated, usage ratio
274
+ - GPU: if available
275
+
276
+ 3. **Warning Flags**
277
+ - `[OVER]` - Queue allocated > deserved
278
+ - `[FULL]` - Queue at capacity
279
+ - `[CLOSED]` - Queue not accepting new jobs
280
+ - `[HIGH_PEND]` - Many pending PodGroups
281
+
282
+ ## Environment Variables
283
+
284
+ | Variable | Default | Description |
285
+ |----------|---------|-------------|
286
+ | `VOLCANO_NAMESPACE` | `default` | Default namespace for pod lookup |
287
+
288
+ ## See Also
289
+
290
+ - `volcano-diagnose-pod` - Diagnose individual pod scheduling
291
+ - `volcano-gang-scheduling` - Gang constraint issues
292
+ - `volcano-resource-insufficient` - Resource shortage diagnosis
293
+ - `volcano-scheduler-logs` - Check scheduler decisions
294
+ - `quota-debug` - Native Kubernetes ResourceQuota/LimitRange diagnosis (non-Volcano)