siclaw 0.1.1 → 0.1.2
- package/README.md +74 -114
- package/dist/agentbox/gateway-client.d.ts +2 -1
- package/dist/agentbox/gateway-client.js +6 -2
- package/dist/agentbox/gateway-client.js.map +1 -1
- package/dist/agentbox/http-server.js +184 -19
- package/dist/agentbox/http-server.js.map +1 -1
- package/dist/agentbox/resource-handlers.d.ts +1 -0
- package/dist/agentbox/resource-handlers.js +23 -23
- package/dist/agentbox/resource-handlers.js.map +1 -1
- package/dist/agentbox/session.js +85 -5
- package/dist/agentbox/session.js.map +1 -1
- package/dist/agentbox-main.d.ts +2 -1
- package/dist/agentbox-main.js +65 -18
- package/dist/agentbox-main.js.map +1 -1
- package/dist/cli-credentials.d.ts +1 -0
- package/dist/cli-credentials.js +109 -0
- package/dist/cli-credentials.js.map +1 -0
- package/dist/cli-first-run.d.ts +11 -0
- package/dist/cli-first-run.js +99 -0
- package/dist/cli-first-run.js.map +1 -0
- package/dist/cli-main.js +33 -11
- package/dist/cli-main.js.map +1 -1
- package/dist/cli-setup.d.ts +5 -11
- package/dist/cli-setup.js +12 -225
- package/dist/cli-setup.js.map +1 -1
- package/dist/core/agent-factory.d.ts +4 -0
- package/dist/core/agent-factory.js +102 -151
- package/dist/core/agent-factory.js.map +1 -1
- package/dist/core/config.d.ts +10 -3
- package/dist/core/config.js +11 -95
- package/dist/core/config.js.map +1 -1
- package/dist/core/extensions/deep-investigation.d.ts +2 -1
- package/dist/core/extensions/deep-investigation.js +144 -24
- package/dist/core/extensions/deep-investigation.js.map +1 -1
- package/dist/core/extensions/setup.d.ts +8 -0
- package/dist/core/extensions/setup.js +669 -0
- package/dist/core/extensions/setup.js.map +1 -0
- package/dist/core/llm-proxy.js +7 -3
- package/dist/core/llm-proxy.js.map +1 -1
- package/dist/core/mcp-client.d.ts +0 -10
- package/dist/core/mcp-client.js +0 -65
- package/dist/core/mcp-client.js.map +1 -1
- package/dist/core/prompt.d.ts +1 -1
- package/dist/core/prompt.js +42 -5
- package/dist/core/prompt.js.map +1 -1
- package/dist/core/provider-presets.d.ts +14 -0
- package/dist/core/provider-presets.js +81 -0
- package/dist/core/provider-presets.js.map +1 -0
- package/dist/cron/cron-coordinator.d.ts +2 -0
- package/dist/cron/cron-coordinator.js +46 -14
- package/dist/cron/cron-coordinator.js.map +1 -1
- package/dist/cron/cron-executor.js +33 -8
- package/dist/cron/cron-executor.js.map +1 -1
- package/dist/cron/cron-scheduler.d.ts +1 -1
- package/dist/cron/gateway-client.d.ts +5 -0
- package/dist/cron/gateway-client.js +43 -8
- package/dist/cron/gateway-client.js.map +1 -1
- package/dist/cron-main.js +39 -9
- package/dist/cron-main.js.map +1 -1
- package/dist/gateway/agentbox/client.d.ts +11 -0
- package/dist/gateway/agentbox/client.js +18 -0
- package/dist/gateway/agentbox/client.js.map +1 -1
- package/dist/gateway/agentbox/k8s-spawner.d.ts +11 -2
- package/dist/gateway/agentbox/k8s-spawner.js +95 -52
- package/dist/gateway/agentbox/k8s-spawner.js.map +1 -1
- package/dist/gateway/agentbox/local-spawner.d.ts +1 -1
- package/dist/gateway/agentbox/local-spawner.js +4 -2
- package/dist/gateway/agentbox/local-spawner.js.map +1 -1
- package/dist/gateway/agentbox/manager.d.ts +0 -10
- package/dist/gateway/agentbox/manager.js +11 -30
- package/dist/gateway/agentbox/manager.js.map +1 -1
- package/dist/gateway/agentbox/types.d.ts +6 -4
- package/dist/gateway/cron/cron-service.d.ts +49 -0
- package/dist/gateway/cron/cron-service.js +259 -0
- package/dist/gateway/cron/cron-service.js.map +1 -0
- package/dist/gateway/db/init-schema.js +44 -0
- package/dist/gateway/db/init-schema.js.map +1 -1
- package/dist/gateway/db/migrate-sqlite.js +73 -4
- package/dist/gateway/db/migrate-sqlite.js.map +1 -1
- package/dist/gateway/db/repositories/chat-repo.d.ts +56 -2
- package/dist/gateway/db/repositories/chat-repo.js +132 -2
- package/dist/gateway/db/repositories/chat-repo.js.map +1 -1
- package/dist/gateway/db/repositories/config-repo.d.ts +31 -2
- package/dist/gateway/db/repositories/config-repo.js +57 -7
- package/dist/gateway/db/repositories/config-repo.js.map +1 -1
- package/dist/gateway/db/repositories/env-repo.d.ts +14 -0
- package/dist/gateway/db/repositories/env-repo.js +15 -2
- package/dist/gateway/db/repositories/env-repo.js.map +1 -1
- package/dist/gateway/db/repositories/model-config-repo.js +6 -5
- package/dist/gateway/db/repositories/model-config-repo.js.map +1 -1
- package/dist/gateway/db/repositories/skill-repo.d.ts +0 -5
- package/dist/gateway/db/repositories/skill-review-repo.d.ts +1 -0
- package/dist/gateway/db/repositories/skill-review-repo.js +4 -1
- package/dist/gateway/db/repositories/skill-review-repo.js.map +1 -1
- package/dist/gateway/db/repositories/skill-version-repo.js +0 -1
- package/dist/gateway/db/repositories/skill-version-repo.js.map +1 -1
- package/dist/gateway/db/repositories/system-config-repo.d.ts +1 -1
- package/dist/gateway/db/repositories/system-config-repo.js +2 -1
- package/dist/gateway/db/repositories/system-config-repo.js.map +1 -1
- package/dist/gateway/db/repositories/user-env-config-repo.d.ts +13 -0
- package/dist/gateway/db/repositories/user-env-config-repo.js +11 -0
- package/dist/gateway/db/repositories/user-env-config-repo.js.map +1 -1
- package/dist/gateway/db/repositories/workspace-repo.d.ts +3 -2
- package/dist/gateway/db/repositories/workspace-repo.js +6 -2
- package/dist/gateway/db/repositories/workspace-repo.js.map +1 -1
- package/dist/gateway/db/schema-mysql.d.ts +473 -51
- package/dist/gateway/db/schema-mysql.js +35 -4
- package/dist/gateway/db/schema-mysql.js.map +1 -1
- package/dist/gateway/db/schema-sqlite.d.ts +522 -57
- package/dist/gateway/db/schema-sqlite.js +38 -6
- package/dist/gateway/db/schema-sqlite.js.map +1 -1
- package/dist/gateway/db/schema.d.ts +471 -51
- package/dist/gateway/db/schema.js +1 -1
- package/dist/gateway/db/schema.js.map +1 -1
- package/dist/gateway/metrics-aggregator.d.ts +65 -0
- package/dist/gateway/metrics-aggregator.js +244 -0
- package/dist/gateway/metrics-aggregator.js.map +1 -0
- package/dist/gateway/plugins/channel-bridge.d.ts +4 -1
- package/dist/gateway/plugins/channel-bridge.js +78 -86
- package/dist/gateway/plugins/channel-bridge.js.map +1 -1
- package/dist/gateway/rpc-methods.d.ts +4 -2
- package/dist/gateway/rpc-methods.js +852 -166
- package/dist/gateway/rpc-methods.js.map +1 -1
- package/dist/gateway/security/cert-manager.d.ts +2 -2
- package/dist/gateway/security/cert-manager.js +4 -2
- package/dist/gateway/security/cert-manager.js.map +1 -1
- package/dist/gateway/server.d.ts +4 -8
- package/dist/gateway/server.js +297 -261
- package/dist/gateway/server.js.map +1 -1
- package/dist/gateway/skills/file-writer.js +17 -11
- package/dist/gateway/skills/file-writer.js.map +1 -1
- package/dist/gateway/skills/script-evaluator.js +12 -9
- package/dist/gateway/skills/script-evaluator.js.map +1 -1
- package/dist/gateway/web/dist/assets/index-0p17ZeTP.js +740 -0
- package/dist/gateway/web/dist/assets/index-9eP6nPUq.js +741 -0
- package/dist/gateway/web/dist/assets/index-9eP6nPUq.js.map +1 -0
- package/dist/gateway/web/dist/assets/index-DyowBCEj.css +1 -0
- package/dist/gateway/web/dist/assets/index-PDK5JJDO.css +1 -0
- package/dist/gateway/web/dist/index.html +2 -2
- package/dist/gateway-main.js +27 -10
- package/dist/gateway-main.js.map +1 -1
- package/dist/memory/embeddings.js +5 -4
- package/dist/memory/embeddings.js.map +1 -1
- package/dist/memory/indexer.d.ts +23 -3
- package/dist/memory/indexer.js +235 -23
- package/dist/memory/indexer.js.map +1 -1
- package/dist/memory/schema.js +15 -1
- package/dist/memory/schema.js.map +1 -1
- package/dist/memory/types.d.ts +18 -0
- package/dist/memory/types.js +6 -1
- package/dist/memory/types.js.map +1 -1
- package/dist/shared/detect-language.d.ts +12 -0
- package/dist/shared/detect-language.js +78 -0
- package/dist/shared/detect-language.js.map +1 -0
- package/dist/shared/diagnostic-events.d.ts +70 -0
- package/dist/shared/diagnostic-events.js +38 -0
- package/dist/shared/diagnostic-events.js.map +1 -0
- package/dist/shared/local-collector.d.ts +56 -0
- package/dist/shared/local-collector.js +284 -0
- package/dist/shared/local-collector.js.map +1 -0
- package/dist/shared/metrics-types.d.ts +64 -0
- package/dist/shared/metrics-types.js +25 -0
- package/dist/shared/metrics-types.js.map +1 -0
- package/dist/shared/metrics.d.ts +19 -0
- package/dist/shared/metrics.js +185 -0
- package/dist/shared/metrics.js.map +1 -0
- package/dist/shared/path-utils.d.ts +15 -0
- package/dist/shared/path-utils.js +23 -0
- package/dist/shared/path-utils.js.map +1 -0
- package/dist/shared/retry.d.ts +35 -0
- package/dist/shared/retry.js +61 -0
- package/dist/shared/retry.js.map +1 -0
- package/dist/tools/command-sets.d.ts +18 -2
- package/dist/tools/command-sets.js +207 -32
- package/dist/tools/command-sets.js.map +1 -1
- package/dist/tools/command-validator.d.ts +56 -0
- package/dist/tools/command-validator.js +357 -0
- package/dist/tools/command-validator.js.map +1 -0
- package/dist/tools/create-skill.js +26 -1
- package/dist/tools/create-skill.js.map +1 -1
- package/dist/tools/credential-list.js +1 -23
- package/dist/tools/credential-list.js.map +1 -1
- package/dist/tools/credential-manager.d.ts +98 -0
- package/dist/tools/credential-manager.js +313 -0
- package/dist/tools/credential-manager.js.map +1 -0
- package/dist/tools/deep-search/engine.js +184 -127
- package/dist/tools/deep-search/engine.js.map +1 -1
- package/dist/tools/deep-search/prompts.d.ts +10 -2
- package/dist/tools/deep-search/prompts.js +37 -36
- package/dist/tools/deep-search/prompts.js.map +1 -1
- package/dist/tools/deep-search/schemas.d.ts +87 -0
- package/dist/tools/deep-search/schemas.js +85 -0
- package/dist/tools/deep-search/schemas.js.map +1 -0
- package/dist/tools/deep-search/sub-agent.d.ts +21 -0
- package/dist/tools/deep-search/sub-agent.js +153 -4
- package/dist/tools/deep-search/sub-agent.js.map +1 -1
- package/dist/tools/deep-search/tool.js +1 -0
- package/dist/tools/deep-search/tool.js.map +1 -1
- package/dist/tools/deep-search/types.d.ts +2 -0
- package/dist/tools/deep-search/types.js.map +1 -1
- package/dist/tools/dp-tools.js +29 -5
- package/dist/tools/dp-tools.js.map +1 -1
- package/dist/tools/exec-utils.d.ts +85 -0
- package/dist/tools/exec-utils.js +294 -0
- package/dist/tools/exec-utils.js.map +1 -0
- package/dist/tools/fork-skill.js +14 -2
- package/dist/tools/fork-skill.js.map +1 -1
- package/dist/tools/investigation-feedback.d.ts +3 -0
- package/dist/tools/investigation-feedback.js +71 -0
- package/dist/tools/investigation-feedback.js.map +1 -0
- package/dist/tools/manage-schedule.js +16 -6
- package/dist/tools/manage-schedule.js.map +1 -1
- package/dist/tools/netns-script.js +27 -281
- package/dist/tools/netns-script.js.map +1 -1
- package/dist/tools/node-exec.d.ts +2 -14
- package/dist/tools/node-exec.js +18 -225
- package/dist/tools/node-exec.js.map +1 -1
- package/dist/tools/node-script.js +14 -168
- package/dist/tools/node-script.js.map +1 -1
- package/dist/tools/pod-exec.d.ts +1 -1
- package/dist/tools/pod-exec.js +10 -26
- package/dist/tools/pod-exec.js.map +1 -1
- package/dist/tools/pod-nsenter-exec.js +21 -225
- package/dist/tools/pod-nsenter-exec.js.map +1 -1
- package/dist/tools/pod-script.js +10 -19
- package/dist/tools/pod-script.js.map +1 -1
- package/dist/tools/restricted-bash.d.ts +1 -17
- package/dist/tools/restricted-bash.js +38 -252
- package/dist/tools/restricted-bash.js.map +1 -1
- package/dist/tools/run-skill.d.ts +3 -1
- package/dist/tools/run-skill.js +21 -1
- package/dist/tools/run-skill.js.map +1 -1
- package/dist/tools/script-resolver.d.ts +3 -1
- package/dist/tools/script-resolver.js +74 -30
- package/dist/tools/script-resolver.js.map +1 -1
- package/dist/tools/update-skill.js +17 -6
- package/dist/tools/update-skill.js.map +1 -1
- package/package.json +4 -2
- package/siclaw.mjs +10 -1
- package/skills/core/cluster-events/SKILL.md +1 -1
- package/skills/core/deep-investigation/SKILL.md +11 -0
- package/skills/core/deployment-rollout-debug/SKILL.md +1 -1
- package/skills/core/dns-debug/SKILL.md +1 -0
- package/skills/core/meta.json +12 -1
- package/skills/core/networkpolicy-debug/SKILL.md +332 -0
- package/skills/core/node-logs/scripts/get-node-logs.sh +19 -9
- package/skills/core/pod-pending-debug/SKILL.md +1 -0
- package/skills/core/quota-debug/SKILL.md +203 -0
- package/skills/core/service-debug/SKILL.md +1 -0
- package/skills/core/statefulset-debug/SKILL.md +280 -0
- package/skills/core/volcano-diagnose-pod/SKILL.md +196 -0
- package/skills/core/volcano-diagnose-pod/scripts/diagnose-pod.sh +175 -0
- package/skills/core/volcano-gang-scheduling/SKILL.md +299 -0
- package/skills/core/volcano-job-diagnose/SKILL.md +319 -0
- package/skills/core/volcano-job-diagnose/scripts/diagnose-job.sh +253 -0
- package/skills/core/volcano-node-resources/SKILL.md +334 -0
- package/skills/core/volcano-node-resources/scripts/get-node-resources.sh +281 -0
- package/skills/core/volcano-queue-diagnose/SKILL.md +294 -0
- package/skills/core/volcano-queue-diagnose/scripts/diagnose-queue.sh +283 -0
- package/skills/core/volcano-resource-insufficient/SKILL.md +315 -0
- package/skills/core/volcano-scheduler-config/SKILL.md +371 -0
- package/skills/core/volcano-scheduler-config/scripts/get-scheduler-config.sh +297 -0
- package/skills/core/volcano-scheduler-logs/SKILL.md +241 -0
- package/skills/core/volcano-scheduler-logs/scripts/get-scheduler-logs.sh +159 -0
- package/skills/platform/create-skill/SKILL.md +35 -3
- package/skills/platform/manage-skill/SKILL.md +9 -2
- package/skills/platform/update-skill/SKILL.md +17 -6
|
@@ -0,0 +1,299 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: volcano-gang-scheduling
|
|
3
|
+
description: >-
|
|
4
|
+
Gang Scheduling diagnostic guide for Volcano.
|
|
5
|
+
Use when PodGroup cannot schedule completely, member Pods remain Pending,
|
|
6
|
+
or minAvailable/minMember constraints are not satisfied.
|
|
7
|
+
---
|
|
8
|
+
|
|
9
|
+
# Gang Scheduling Diagnosis
|
|
10
|
+
|
|
11
|
+
This is a diagnostic guide for Gang scheduling issues in Volcano. Gang scheduling requires that at least `minMember` pods of a PodGroup be scheduled together. If the cluster cannot satisfy the `minMember` requirement, none of the pods will be scheduled.
|
|
12
|
+
|
|
13
|
+
**Scope:** This skill is for **diagnosis only**. Once you identify the root cause, report it to the user and stop. Do NOT attempt to modify PodGroups or resource configurations.
|
|
14
|
+
|
|
15
|
+
## When to Use This Guide
|
|
16
|
+
|
|
17
|
+
Use this skill when:
|
|
18
|
+
- PodGroup status is `Inqueue` but member Pods remain `Pending`
|
|
19
|
+
- Events contain `minMember` related errors
|
|
20
|
+
- Volcano Job has `minAvailable` or `minMember` that cannot be satisfied
|
|
21
|
+
- Some member Pods are running, others are Pending, and the entire group won't start
|
|
22
|
+
- You see `FailedScheduling` events mentioning Gang constraints
|
|
23
|
+
|
|
24
|
+
## Understanding Gang Scheduling
|
|
25
|
+
|
|
26
|
+
Gang scheduling in Volcano ensures that either all required members of a workload are scheduled, or none are. This is crucial for distributed workloads such as MPI, TensorFlow, and PyTorch jobs, where a partially scheduled job holds resources without making progress.
|
|
27
|
+
|
|
28
|
+
**Key Concepts:**
|
|
29
|
+
- `minMember` (in PodGroup spec): Minimum number of pods that must be scheduled simultaneously
|
|
30
|
+
- `minResources` (in PodGroup spec): Aggregate resource floor (e.g., total GPUs) that must be available — **both** `minMember` and `minResources` must be satisfied if set
|
|
31
|
+
- `minAvailable` (in Job spec): The Job-level counterpart of `minMember`; it sets the minimum number of pods the Job needs in order to run
|
|
32
|
+
- The scheduler checks if there are **simultaneous** resources for all minMember pods before allocating
|
|
33
|
+
|
|
34
|
+
## Diagnostic Steps
|
|
35
|
+
|
|
36
|
+
### Step 1: Identify the PodGroup
|
|
37
|
+
|
|
38
|
+
Find the PodGroup associated with the pending pods:
|
|
39
|
+
|
|
40
|
+
```bash
|
|
41
|
+
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.metadata.annotations.scheduling\.volcano\.sh/pod-group}'
|
|
42
|
+
```
|
|
43
|
+
|
|
44
|
+
### Step 2: Check PodGroup Status
|
|
45
|
+
|
|
46
|
+
Get detailed PodGroup information:
|
|
47
|
+
|
|
48
|
+
```bash
|
|
49
|
+
kubectl get podgroup <podgroup-name> -n <namespace> -o yaml
|
|
50
|
+
```
|
|
51
|
+
|
|
52
|
+
**Key fields to examine:**
|
|
53
|
+
|
|
54
|
+
| Field | Meaning | What to Look For |
|
|
55
|
+
|-------|---------|------------------|
|
|
56
|
+
| `spec.minMember` | Minimum pods required | Is this number achievable? |
|
|
57
|
+
| `spec.minResources` | Aggregate resource floor | Is total cluster capacity sufficient? |
|
|
58
|
+
| `status.phase` | Current scheduling phase | `Inqueue` means the group is admitted and ready to schedule |
|
|
59
|
+
| `status.running` | Currently running pods | Compare to minMember |
|
|
60
|
+
| `status.pending` | Pending pods | These are waiting for Gang constraint |
|
|
61
|
+
| `spec.queue` | Queue name | Check if queue has sufficient resources |
|
|
62
|
+
|
|
63
|
+
**Common scenarios:**
|
|
64
|
+
|
|
65
|
+
- `status.phase: Pending` - PodGroup is waiting to be enqueued
|
|
66
|
+
- `status.phase: Inqueue` - Ready for scheduling but constraint not met
|
|
67
|
+
- `status.running < spec.minMember` - Gang constraint not satisfied
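The scenarios above can be turned into a small triage helper. This is a minimal sketch, not part of the skill: the function name `classify_podgroup` and its output labels are hypothetical, and the three arguments are assumed to be copied by hand from `status.phase`, `status.running`, and `spec.minMember` of the PodGroup YAML.

```bash
# Classify a PodGroup from three values read off `kubectl get podgroup -o yaml`.
# Arguments: status.phase, status.running (may be empty when zero), spec.minMember
# The labels printed here are illustrative, not Volcano terminology.
classify_podgroup() {
  pg_phase="$1"
  pg_running="${2:-0}"
  pg_min="$3"
  if [ "$pg_phase" = "Pending" ]; then
    echo "not-enqueued"
  elif [ "$pg_phase" = "Inqueue" ] && [ "$pg_running" -lt "$pg_min" ]; then
    echo "enqueued-gang-unsatisfied"
  elif [ "$pg_running" -lt "$pg_min" ]; then
    echo "gang-unsatisfied"
  else
    echo "gang-satisfied"
  fi
}

classify_podgroup Inqueue "" 4   # prints enqueued-gang-unsatisfied
classify_podgroup Running 4 4    # prints gang-satisfied
```

An empty `status.running` is treated as zero because the field may be omitted when no pods are running.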
|
|
68
|
+
|
|
69
|
+
### Step 3: Calculate Resource Requirements
|
|
70
|
+
|
|
71
|
+
Calculate the total resources needed for the Gang:
|
|
72
|
+
|
|
73
|
+
```
|
|
74
|
+
Total CPU = minMember × single Pod CPU request
|
|
75
|
+
Total Memory = minMember × single Pod Memory request
|
|
76
|
+
Total GPU = minMember × single Pod GPU request (if applicable)
|
|
77
|
+
```
|
|
78
|
+
|
|
79
|
+
Get a pod's resource requests:
|
|
80
|
+
|
|
81
|
+
```bash
|
|
82
|
+
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].resources.requests}'
|
|
83
|
+
```
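The multiplication from this step can be sketched in shell for the CPU dimension. The helper names `cpu_to_millicores` and `gang_total_cpu_millicores` are hypothetical, and the sketch assumes a single per-pod CPU request; for multi-container pods, sum the container requests first.

```bash
# Convert a Kubernetes CPU quantity ("4", "500m", "2.5") to millicores.
cpu_to_millicores() {
  case "$1" in
    *m) echo "${1%m}" ;;
    *)  awk -v v="$1" 'BEGIN { printf "%.0f\n", v * 1000 }' ;;
  esac
}

# Total CPU (in millicores) the gang needs: minMember x per-pod request.
# Arguments: minMember, per-pod CPU request as a Kubernetes quantity.
gang_total_cpu_millicores() {
  g_min_member="$1"
  g_per_pod="$(cpu_to_millicores "$2")"
  echo $(( g_min_member * g_per_pod ))
}

gang_total_cpu_millicores 4 500m   # prints 2000
gang_total_cpu_millicores 10 4     # prints 40000
```

Working in millicores keeps the arithmetic in integers, so fractional requests such as `500m` do not need floating-point handling in the shell.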
|
|
84
|
+
|
|
85
|
+
### Step 4: Check Cluster Resources
|
|
86
|
+
|
|
87
|
+
#### Option A: Check Node Resources
|
|
88
|
+
|
|
89
|
+
View available resources across nodes:
|
|
90
|
+
|
|
91
|
+
```bash
|
|
92
|
+
kubectl get nodes -o custom-columns='NAME:.metadata.name,CPU:.status.allocatable.cpu,MEM:.status.allocatable.memory,GPU:.status.allocatable.nvidia\.com/gpu'
|
|
93
|
+
```
|
|
94
|
+
|
|
95
|
+
Check current resource usage:
|
|
96
|
+
|
|
97
|
+
```bash
|
|
98
|
+
kubectl top nodes
|
|
99
|
+
```
|
|
100
|
+
|
|
101
|
+
#### Option B: Check by Node Labels (if pods have node affinity)
|
|
102
|
+
|
|
103
|
+
If pods target specific nodes:
|
|
104
|
+
|
|
105
|
+
```bash
|
|
106
|
+
kubectl get nodes -l <label-key>=<label-value> -o wide
|
|
107
|
+
```
|
|
108
|
+
|
|
109
|
+
### Step 5: Check Events for Gang Errors
|
|
110
|
+
|
|
111
|
+
Look for Gang-specific scheduling errors:
|
|
112
|
+
|
|
113
|
+
```bash
|
|
114
|
+
kubectl get events -n <namespace> --field-selector involvedObject.name=<pod-name> --sort-by='.lastTimestamp'
|
|
115
|
+
```
|
|
116
|
+
|
|
117
|
+
**Common Gang-related event messages:**
|
|
118
|
+
|
|
119
|
+
| Message | Meaning | Investigation |
|
|
120
|
+
|---------|---------|---------------|
|
|
121
|
+
| `minMember not satisfied` | Gang constraint preventing scheduling | Check if total resources >= minMember requirements |
|
|
122
|
+
| `gang member not ready` | Some pods in the gang are not ready | Check individual pod status |
|
|
123
|
+
| `resource insufficient` | Not enough resources for all members | Use `volcano-resource-insufficient` skill |
|
|
124
|
+
|
|
125
|
+
### Step 6: Verify Queue Resources
|
|
126
|
+
|
|
127
|
+
If the PodGroup is in a Queue, check if the queue has sufficient deserved resources:
|
|
128
|
+
|
|
129
|
+
```bash
|
|
130
|
+
kubectl get queue <queue-name>
|
|
131
|
+
kubectl describe queue <queue-name>
|
|
132
|
+
```
|
|
133
|
+
|
|
134
|
+
Look for:
|
|
135
|
+
- `status.deserved` vs `status.allocated`
|
|
136
|
+
- If allocated >= deserved, the queue is at capacity
|
|
137
|
+
- Check `status.state` is `Open` (not `Closing` or `Closed`)
|
|
138
|
+
|
|
139
|
+
## Common Causes and Solutions
|
|
140
|
+
|
|
141
|
+
### Cause 1: minMember Too Large
|
|
142
|
+
|
|
143
|
+
**Symptom:** The gang needs more than the cluster can host at once: more pods than there are nodes able to hold them, or a per-pod request larger than any single node can provide.
|
|
144
|
+
|
|
145
|
+
**Example:**
|
|
146
|
+
- minMember = 10
|
|
147
|
+
- Each pod requests 8 GPUs
|
|
148
|
+
- Only 5 nodes have 8 GPUs each
|
|
149
|
+
- **Result:** Gang can never be satisfied
|
|
150
|
+
|
|
151
|
+
**Solution:**
|
|
152
|
+
- Reduce `minMember` in PodGroup spec
|
|
153
|
+
- Increase cluster capacity (add nodes)
|
|
154
|
+
- Reduce per-pod resource requests
|
|
155
|
+
|
|
156
|
+
### Cause 2: Resource Fragmentation
|
|
157
|
+
|
|
158
|
+
**Symptom:** Total cluster resources are sufficient, but not concentrated on enough nodes to satisfy simultaneous scheduling.
|
|
159
|
+
|
|
160
|
+
**Example:**
|
|
161
|
+
- minMember = 4, each needs 4 CPUs
|
|
162
|
+
- Total cluster: 20 CPUs available
|
|
163
|
+
- But distributed across 10 nodes with 2 CPUs each
|
|
164
|
+
- **Result:** Cannot find 4 nodes with 4 CPUs simultaneously
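The fragmentation example can be checked with a per-node count rather than a cluster-wide sum. This is a rough sketch under simplifying assumptions: the hypothetical helper `count_placeable_pods` looks at one resource dimension only, takes whole CPU cores, and expects the free CPU per node (allocatable minus existing requests) to be supplied by hand.

```bash
# Count how many pods of a given CPU size fit, node by node.
# $1: per-pod CPU request (whole cores); remaining args: free CPU per node.
count_placeable_pods() {
  cp_request="$1"
  shift
  cp_total=0
  for cp_free in "$@"; do
    # Integer division: a node with 2 free CPUs holds zero 4-CPU pods.
    cp_total=$(( cp_total + cp_free / cp_request ))
  done
  echo "$cp_total"
}

# The example above: 10 nodes with 2 free CPUs each, pods need 4 CPUs.
count_placeable_pods 4 2 2 2 2 2 2 2 2 2 2   # prints 0, although 20 CPUs are free
# The same 20 CPUs on 5 nodes with 4 free CPUs each would fit 5 pods.
count_placeable_pods 4 4 4 4 4 4             # prints 5
```

The gang can be placed only when the printed count is at least `minMember`.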
|
|
165
|
+
|
|
166
|
+
**Solution:**
|
|
167
|
+
- Configure `binpack` plugin to concentrate pods on fewer nodes
|
|
168
|
+
- Defragment cluster by rescheduling or draining nodes
|
|
169
|
+
- Adjust resource requests to fit node sizes
|
|
170
|
+
|
|
171
|
+
### Cause 3: Priority Preemption
|
|
172
|
+
|
|
173
|
+
**Symptom:** Resources exist but are being used by lower-priority workloads that should be preempted.
|
|
174
|
+
|
|
175
|
+
**Check:**
|
|
176
|
+
- Compare PodGroup priority vs running PodGroups
|
|
177
|
+
- Check whether higher-priority PodGroups exist in the same queue
|
|
178
|
+
|
|
179
|
+
**Solution:**
|
|
180
|
+
- Ensure correct PriorityClass is assigned
|
|
181
|
+
- Check `priority` plugin is enabled in scheduler config
|
|
182
|
+
|
|
183
|
+
### Cause 4: Queue Resource Exhaustion
|
|
184
|
+
|
|
185
|
+
**Symptom:** The PodGroup's queue has used all its deserved resources.
|
|
186
|
+
|
|
187
|
+
**Check:**
|
|
188
|
+
```bash
|
|
189
|
+
kubectl get queue <queue-name> -o jsonpath='{.status.allocated}'
|
|
190
|
+
kubectl get queue <queue-name> -o jsonpath='{.status.deserved}'
|
|
191
|
+
```
|
|
192
|
+
|
|
193
|
+
**Solution:**
|
|
194
|
+
- Increase queue weight or capability
|
|
195
|
+
- Wait for other jobs to complete
|
|
196
|
+
- Use `volcano-queue-diagnose` for detailed analysis
|
|
197
|
+
|
|
198
|
+
### Cause 5: Affinity/Anti-Affinity Conflicts (Effective Node Pool Narrowing)
|
|
199
|
+
|
|
200
|
+
**Symptom:** Queue shows available capacity, but Gang still blocks. Pod scheduling constraints narrow the effective node pool below what Gang requires.
|
|
201
|
+
|
|
202
|
+
**Diagnosis — compute the effective node pool:**
|
|
203
|
+
```bash
|
|
204
|
+
# 1. Check pod's nodeSelector
|
|
205
|
+
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.nodeSelector}'
|
|
206
|
+
|
|
207
|
+
# 2. Check matching nodes
|
|
208
|
+
kubectl get nodes -l <selector-key>=<selector-value> -o custom-columns='NAME:.metadata.name,CPU:.status.allocatable.cpu,GPU:.status.allocatable.nvidia\.com/gpu'
|
|
209
|
+
|
|
210
|
+
# 3. Check tolerations (tainted nodes require matching tolerations)
|
|
211
|
+
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.tolerations}'
|
|
212
|
+
kubectl get nodes -o custom-columns="NAME:.metadata.name,TAINTS:.spec.taints[*].key"
|
|
213
|
+
```
|
|
214
|
+
|
|
215
|
+
Volcano scheduling is **two-phase**: first queue-level admission (capacity check), then node-level placement. A job can pass the queue check but fail node placement if all matching nodes are occupied.
|
|
216
|
+
|
|
217
|
+
**Solution:**
|
|
218
|
+
- Relax affinity constraints if possible
|
|
219
|
+
- Ensure sufficient nodes match the constraints
|
|
220
|
+
- Verify toleration matches for tainted nodes
|
|
221
|
+
|
|
222
|
+
### Cause 6: Queue Has Capacity but Gang Still Blocks
|
|
223
|
+
|
|
224
|
+
**Symptom:** Queue `allocated < deserved`, PodGroup is `Inqueue`, but pods remain Pending.
|
|
225
|
+
|
|
226
|
+
**Check — verify remaining capacity vs Gang requirement:**
|
|
227
|
+
```bash
|
|
228
|
+
# Queue remaining capacity
|
|
229
|
+
kubectl get queue <queue> -o jsonpath='{"deserved: "}{.status.deserved}{"\nallocated: "}{.status.allocated}'
|
|
230
|
+
|
|
231
|
+
# PodGroup minMember and minResources
|
|
232
|
+
kubectl get podgroup <pg> -n <ns> -o jsonpath='{"minMember: "}{.spec.minMember}{"\nminResources: "}{.spec.minResources}'
|
|
233
|
+
```
|
|
234
|
+
|
|
235
|
+
Calculate: `remaining = deserved - allocated`. If `remaining < minMember × per-pod-resources`, the Gang cannot be satisfied even though the queue is not fully used.
|
|
236
|
+
|
|
237
|
+
If `minResources` is set, also verify: `remaining >= minResources` for each resource dimension.
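The headroom comparison can be sketched as a shell check for one resource dimension. The function name `queue_fits_gang` is hypothetical, and all four arguments are assumed to be plain integers in the same unit (for example whole CPUs), copied from the queue and PodGroup output above.

```bash
# Succeeds (exit 0) when the queue's remaining share covers the whole gang.
# Arguments: deserved, allocated, minMember, per-pod request (same unit).
queue_fits_gang() {
  q_remaining=$(( $1 - $2 ))
  q_needed=$(( $3 * $4 ))
  echo "remaining=$q_remaining needed=$q_needed"
  [ "$q_remaining" -ge "$q_needed" ]
}

# deserved 32 CPUs, allocated 20, gang of 4 pods x 4 CPUs
queue_fits_gang 32 20 4 4 || echo "gang blocked by queue headroom"
```

Here the queue is only 20/32 used, yet the 12 remaining CPUs are short of the 16 the gang needs, so the check fails. Repeat it per resource (CPU, memory, GPU), since any one dimension can block the gang.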
|
|
238
|
+
|
|
239
|
+
**Solution:**
|
|
240
|
+
- Wait for enough resources to free up in the queue
|
|
241
|
+
- Reduce `minMember` or `minResources` if the job can tolerate partial scheduling
|
|
242
|
+
|
|
243
|
+
### Cause 7: Post-Scheduling Gang Breakage
|
|
244
|
+
|
|
245
|
+
**Symptom:** Job was Running, then moves to Aborted. Running pod count dropped below `minMember`.
|
|
246
|
+
|
|
247
|
+
This happens when pods are evicted (preemption, node failure, OOM) and the remaining count falls below the Gang constraint, causing the entire group to be torn down.
|
|
248
|
+
|
|
249
|
+
**Check:**
|
|
250
|
+
```bash
|
|
251
|
+
# Current running vs required
|
|
252
|
+
kubectl get podgroup <pg> -n <ns> -o jsonpath='{"running: "}{.status.running}{"\nminMember: "}{.spec.minMember}'
|
|
253
|
+
|
|
254
|
+
# Check for eviction/preemption events
|
|
255
|
+
kubectl get events -n <ns> --field-selector reason=Preempted
|
|
256
|
+
kubectl get events -n <ns> --field-selector reason=Evicted
|
|
257
|
+
```
|
|
258
|
+
|
|
259
|
+
**Solution:**
|
|
260
|
+
- Investigate why pods were evicted (resource pressure, preemption, node failure)
|
|
261
|
+
- Consider setting `reclaimable: false` on the queue so that other queues cannot reclaim its resources
|
|
262
|
+
- Increase cluster capacity to reduce eviction pressure
|
|
263
|
+
|
|
264
|
+
## Verification Steps
|
|
265
|
+
|
|
266
|
+
After identifying the issue, verify your analysis:
|
|
267
|
+
|
|
268
|
+
1. **Check if issue is Gang-specific:**
|
|
269
|
+
- Try scheduling a single pod with same resources
|
|
270
|
+
- If single pod schedules, it's a Gang constraint issue
|
|
271
|
+
- If single pod doesn't schedule, it's a resource/affinity issue
|
|
272
|
+
|
|
273
|
+
2. **Calculate minimum requirements:**
|
|
274
|
+
- Confirm minMember × per-pod-resources ≤ available resources
|
|
275
|
+
- Confirm enough nodes can accommodate the pods
|
|
276
|
+
|
|
277
|
+
3. **Check scheduler logs:**
|
|
278
|
+
```bash
|
|
279
|
+
# Use volcano-scheduler-logs skill
|
|
280
|
+
bash skills/core/volcano-scheduler-logs/scripts/get-scheduler-logs.sh --keyword gang
|
|
281
|
+
```
|
|
282
|
+
|
|
283
|
+
## Key Insight
|
|
284
|
+
|
|
285
|
+
Gang Scheduling constraint: **The cluster must have enough free resources to place `minMember` Pods at the same time.**
|
|
286
|
+
|
|
287
|
+
Even if total cluster resources are sufficient, if resources are released gradually over time (as other pods complete), the "simultaneous" requirement may not be met.
|
|
288
|
+
|
|
289
|
+
**Distinguish between:**
|
|
290
|
+
1. **Total shortage** - Entire cluster lacks resources
|
|
291
|
+
2. **Cannot satisfy simultaneously** - Resources exist but not on enough nodes at the same time
|
|
292
|
+
3. **Queue limit** - Queue deserved resources are exhausted
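The three cases can be told apart with a few integer comparisons. This is an illustrative sketch only: the function `classify_gang_block`, its argument order, and its output labels are all hypothetical, it covers a single resource dimension, and the inputs are assumed to have been gathered by hand from the earlier steps.

```bash
# Arguments (one resource dimension, consistent units):
#   $1 minMember   $2 per-pod request   $3 total free in the cluster
#   $4 pods placeable node by node      $5 queue remaining (deserved - allocated)
classify_gang_block() {
  g_needed=$(( $1 * $2 ))
  if [ "$3" -lt "$g_needed" ]; then
    echo "total-shortage"
  elif [ "$4" -lt "$1" ]; then
    echo "cannot-satisfy-simultaneously"
  elif [ "$5" -lt "$g_needed" ]; then
    echo "queue-limit"
  else
    echo "no-capacity-block"
  fi
}

# 4 pods x 4 CPUs, 20 CPUs free in total, but no node can hold a pod.
classify_gang_block 4 4 20 0 32   # prints cannot-satisfy-simultaneously
```

A result of `no-capacity-block` suggests looking beyond raw capacity, for example at affinity, taints, or priority.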
|
|
293
|
+
|
|
294
|
+
## See Also
|
|
295
|
+
|
|
296
|
+
- `volcano-diagnose-pod` - General Pod scheduling diagnosis
|
|
297
|
+
- `volcano-queue-diagnose` - Queue status and resource analysis
|
|
298
|
+
- `volcano-resource-insufficient` - Resource shortage diagnosis
|
|
299
|
+
- `volcano-scheduler-logs` - Scheduler log analysis
|
|
@@ -0,0 +1,319 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: volcano-job-diagnose
|
|
3
|
+
description: >-
|
|
4
|
+
Diagnose Volcano Job status and issues.
|
|
5
|
+
Check Job phases, task statuses, PodGroup associations, and overall job health.
|
|
6
|
+
---
|
|
7
|
+
|
|
8
|
+
# Volcano Job Diagnosis
|
|
9
|
+
|
|
10
|
+
Diagnose Volcano Job (batch.volcano.sh/v1beta1) status and issues. This skill checks Job phases, task statuses, PodGroup associations, and overall job health.
|
|
11
|
+
|
|
12
|
+
**Scope:** This skill is for **diagnosis only**. Once you identify the root cause, report it to the user and stop. Do NOT attempt to modify job specs or restart jobs — that should be left to the user.
|
|
13
|
+
|
|
14
|
+
## Usage
|
|
15
|
+
|
|
16
|
+
```bash
|
|
17
|
+
bash skills/core/volcano-job-diagnose/scripts/diagnose-job.sh --job <job-name> --namespace <namespace>
|
|
18
|
+
```
|
|
19
|
+
|
|
20
|
+
## Parameters
|
|
21
|
+
|
|
22
|
+
| Parameter | Required | Description |
|
|
23
|
+
|-----------|----------|-------------|
|
|
24
|
+
| `--job JOB` | yes | Job name to diagnose |
|
|
25
|
+
| `--namespace NS` | no | Namespace (default: `default`) |
|
|
26
|
+
| `--verbose` | no | Show detailed task and pod information |
|
|
27
|
+
|
|
28
|
+
## Examples
|
|
29
|
+
|
|
30
|
+
Diagnose a Volcano Job:
|
|
31
|
+
```bash
|
|
32
|
+
bash skills/core/volcano-job-diagnose/scripts/diagnose-job.sh --job my-training-job --namespace training
|
|
33
|
+
```
|
|
34
|
+
|
|
35
|
+
Verbose mode with task details:
|
|
36
|
+
```bash
|
|
37
|
+
bash skills/core/volcano-job-diagnose/scripts/diagnose-job.sh --job my-training-job --namespace training --verbose
|
|
38
|
+
```
|
|
39
|
+
|
|
40
|
+
## Understanding Volcano Jobs
|
|
41
|
+
|
|
42
|
+
### Job Structure
|
|
43
|
+
|
|
44
|
+
```yaml
apiVersion: batch.volcano.sh/v1beta1
kind: Job
spec:
  schedulerName: volcano
  tasks:
    - name: worker
      replicas: 4
      template:
        spec:
          containers:
            - name: worker
              resources:
                requests:
                  cpu: "4"
                  memory: "8Gi"
  maxRetry: 3  # Max retries before job is Aborted
  policies:
    - event: PodFailed
      action: RestartJob
```
|
|
65
|
+
|
|
66
|
+
> **Note:** Volcano Jobs can also be queried using the short name `vcjob` (e.g., `kubectl get vcjob`). This is an alias for `job.batch.volcano.sh`. Be careful not to confuse with native Kubernetes `batch/v1 Job` — always use `job.batch.volcano.sh` or `vcjob` for Volcano Jobs.
|
|
67
|
+
|
|
68
|
+
### Job Phases
|
|
69
|
+
|
|
70
|
+
| Phase | Meaning |
|
|
71
|
+
|-------|---------|
|
|
72
|
+
| `Pending` | Job is waiting for resources or admission |
|
|
73
|
+
| `Running` | Job is executing |
|
|
74
|
+
| `Completing` | Job tasks are completing |
|
|
75
|
+
| `Completed` | Job finished successfully |
|
|
76
|
+
| `Failed` | Job failed |
|
|
77
|
+
| `Restarting` | Job is being restarted due to policy |
|
|
78
|
+
| `Terminating` | Job is being terminated |
|
|
79
|
+
| `Aborted` | Job was aborted |
|
|
80
|
+
|
|
81
|
+
### Task Statuses
|
|
82
|
+
|
|
83
|
+
Each task within a job has its own status:
|
|
84
|
+
- `Pending` - Task pods not yet scheduled
|
|
85
|
+
- `Running` - Task pods are running
|
|
86
|
+
- `Completed` - Task finished
|
|
87
|
+
- `Failed` - Task failed
|
|
88
|
+
|
|
89
|
+
## Diagnostic Flow
|
|
90
|
+
|
|
91
|
+
### Step 1: Job Overview
|
|
92
|
+
|
|
93
|
+
Get the Job status:
|
|
94
|
+
|
|
95
|
+
```bash
|
|
96
|
+
kubectl get job.batch.volcano.sh <job-name> -n <namespace> -o yaml
|
|
97
|
+
```
|
|
98
|
+
|
|
99
|
+
**Key fields to check:**
|
|
100
|
+
- `status.state.phase` - Current job phase
|
|
101
|
+
- `status.failed` - Number of pods that failed
|
|
102
|
+
- `status.succeeded` - Number of pods that succeeded
|
|
103
|
+
- `status.running` - Number of running pods
|
|
104
|
+
- `status.pending` - Number of pending pods
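
Reading the full YAML is not always necessary. The sketch below pulls only the key fields listed above into a short summary, using the same `jsonpath` style as the other commands in this guide. The wrapper name `job_summary` is hypothetical; counters that are zero are typically omitted from the status and print as empty.

```shell
# Print the phase and pod counters for one Volcano Job.
# Usage: job_summary <job-name> <namespace>
job_summary() {
  kubectl get job.batch.volcano.sh "$1" -n "$2" \
    -o jsonpath='{"phase: "}{.status.state.phase}{"\nfailed: "}{.status.failed}{"\nsucceeded: "}{.status.succeeded}{"\nrunning: "}{.status.running}{"\npending: "}{.status.pending}{"\n"}'
}
```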
|
|
105
|
+
|
|
106
|
+
### Step 2: Check Tasks
|
|
107
|
+
|
|
108
|
+
List the Job's pods (one per task replica) and their statuses:
|
|
109
|
+
|
|
110
|
+
```bash
|
|
111
|
+
kubectl get pods -n <namespace> -l volcano.sh/job-name=<job-name> -o wide
|
|
112
|
+
```
|
|
113
|
+
|
|
114
|
+
**What to look for:**
|
|
115
|
+
- Pod phases (Pending, Running, Completed, Failed)
|
|
116
|
+
- Pod restart counts
|
|
117
|
+
- Node assignments
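
For jobs with many replicas, a per-phase count is quicker to read than the full pod list. The following is a minimal sketch: `count_pod_phases` is a hypothetical helper that tallies "NAME PHASE" lines from standard input. Note that `.status.phase` reports `Succeeded` where the `STATUS` column of `kubectl get pods` shows `Completed`.

```shell
# Count pods per phase from "NAME PHASE" lines on stdin; output is sorted by phase name.
count_pod_phases() {
  awk 'NF >= 2 { n[$2]++ } END { for (p in n) print p, n[p] }' | sort
}

# Usage (requires cluster access):
#   kubectl get pods -n <namespace> -l volcano.sh/job-name=<job-name> --no-headers \
#     -o custom-columns=NAME:.metadata.name,PHASE:.status.phase | count_pod_phases
```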
|
|
118
|
+
|
|
119
|
+
### Step 3: Check PodGroup Association
|
|
120
|
+
|
|
121
|
+
Find the PodGroup created for this Job:
|
|
122
|
+
|
|
123
|
+
```bash
|
|
124
|
+
kubectl get podgroups -n <namespace> -l volcano.sh/job-name=<job-name>
|
|
125
|
+
```
|
|
126
|
+
|
|
127
|
+
Or read the PodGroup name from the annotation on one of the Job's pods:
|
|
128
|
+
|
|
129
|
+
```bash
|
|
130
|
+
kubectl get pods -n <namespace> -l volcano.sh/job-name=<job-name> \
|
|
131
|
+
  -o jsonpath='{.items[0].metadata.annotations.scheduling\.volcano\.sh/pod-group}'
|
|
132
|
+
```
|
|
133
|
+
|
|
134
|
+
**Next step:** If PodGroup status is problematic, use `volcano-diagnose-pod` for detailed PodGroup analysis.
|
|
135
|
+
|
|
136
|
+
### Step 4: Check Policies
|
|
137
|
+
|
|
138
|
+
Review job policies that may affect behavior:
|
|
139
|
+
|
|
140
|
+
```bash
|
|
141
|
+
kubectl get job.batch.volcano.sh <job-name> -n <namespace> -o jsonpath='{.spec.policies}'
|
|
142
|
+
```
|
|
143
|
+
|
|
144
|
+
**Common policies:**
|
|
145
|
+
- `PodFailed` → `RestartJob` - Restart entire job on any pod failure
|
|
146
|
+
- `PodFailed` → `RestartTask` - Restart only the failed task
|
|
147
|
+
- `PodEvicted` → `RestartTask` - Restart evicted tasks
|
|
148
|
+
- `PodEvicted` → `AbortJob` - Abort entire job when a pod is evicted (can cause unexpected aborts during preemption)
|
|
149
|
+
- `TaskCompleted` → `CompleteJob` - Complete job when task finishes
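
The raw `{.spec.policies}` output is compact JSON and easy to misread. As a sketch, the hypothetical helper `flag_risky_policies` below reads one "Event Action" pair per line and marks the combinations this section calls out; the wording of the messages is illustrative only.

```shell
# Read "Event Action" pairs on stdin and flag combinations that deserve a second look.
flag_risky_policies() {
  while read -r event action; do
    [ -n "$event" ] || continue   # skip blank lines
    case "$event:$action" in
      PodEvicted:AbortJob)  echo "WARN $event -> $action: one eviction (e.g. preemption) aborts the whole job" ;;
      PodFailed:RestartJob) echo "NOTE $event -> $action: any pod failure restarts the whole job" ;;
      *)                    echo "OK   $event -> $action" ;;
    esac
  done
}

# Usage (requires cluster access):
#   kubectl get job.batch.volcano.sh <job-name> -n <namespace> \
#     -o jsonpath='{range .spec.policies[*]}{.event}{" "}{.action}{"\n"}{end}' | flag_risky_policies
```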
|
|
150
|
+
|
|
151
|
+
Also check `maxRetry` — when retries are exhausted the job moves to `Aborted`:
|
|
152
|
+
```bash
|
|
153
|
+
kubectl get job.batch.volcano.sh <job-name> -n <namespace> -o jsonpath='{.spec.maxRetry}'
|
|
154
|
+
```
|
|
155
|
+
|
|
156
|
+
### Step 5: Events Analysis
|
|
157
|
+
|
|
158
|
+
Check job-related events:
|
|
159
|
+
|
|
160
|
+
```bash
|
|
161
|
+
kubectl get events -n <namespace> --field-selector involvedObject.name=<job-name>
|
|
162
|
+
```
|
|
163
|
+
|
|
164
|
+
**Common event patterns:**
|
|
165
|
+
|
|
166
|
+
#### `JobFailed` - Job has failed
|
|
167
|
+
Check the reason and message for failure details.
|
|
168
|
+
|
|
169
|
+
#### `JobRestarting` - Job is being restarted
|
|
170
|
+
Check the restart policy and previous failure reason.
|
|
171
|
+
|
|
172
|
+
#### `TaskFailed` - Individual task failed
|
|
173
|
+
May or may not cause the entire job to fail, depending on the configured policy.
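
Event order matters when a failure leads to a restart. The sketch below lists a job's events oldest first and narrows the selector to objects of kind `Job`, so pods or PodGroups that happen to share the name are excluded. `job_events` is a hypothetical wrapper; a native `batch/v1` Job with the same name would still match.

```shell
# List events for one Volcano Job, sorted by last occurrence (oldest first).
# Usage: job_events <job-name> <namespace>
job_events() {
  kubectl get events -n "$2" \
    --field-selector "involvedObject.name=$1,involvedObject.kind=Job" \
    --sort-by=.lastTimestamp
}
```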
|
|
174
|
+
|
|
175
|
+
## Common Issues
|
|
176
|
+
|
|
177
|
+
### Issue 1: Job Stuck in Pending
|
|
178
|
+
|
|
179
|
+
**Symptom:** Job phase is `Pending`, no pods created.
|
|
180
|
+
|
|
181
|
+
**Check:**
|
|
182
|
+
1. PodGroup status: `kubectl get podgroups -n <ns>`
|
|
183
|
+
2. Queue state: `kubectl get queue <queue>`
|
|
184
|
+
3. Events: `kubectl get events -n <ns> | grep <job-name>`
|
|
185
|
+
|
|
186
|
+
**Likely causes:**
|
|
187
|
+
- Queue is Closed
|
|
188
|
+
- PodGroup cannot be enqueued (resource shortage)
|
|
189
|
+
- Admission webhook rejection
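
To rule out the queue quickly, its state can be read directly and interpreted. This is a sketch under two assumptions: the job names its queue in `.spec.queue` and the Queue reports its state in `.status.state`. The helper name `explain_queue_state` is hypothetical.

```shell
# Interpret a Volcano Queue state string (assumed to come from .status.state).
explain_queue_state() {
  case "$1" in
    Open)    echo "queue accepts new jobs; look at resources or admission instead" ;;
    Closed)  echo "queue is closed; new jobs stay Pending until it is reopened" ;;
    Closing) echo "queue is closing; new jobs are not accepted" ;;
    *)       echo "unrecognized queue state: ${1:-<empty>}" ;;
  esac
}

# Usage (requires cluster access):
#   queue="$(kubectl get job.batch.volcano.sh <job-name> -n <ns> -o jsonpath='{.spec.queue}')"
#   explain_queue_state "$(kubectl get queue "$queue" -o jsonpath='{.status.state}')"
```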
|
|
190
|
+
|
|
191
|
+
### Issue 2: Some Tasks Running, Others Pending
|
|
192
|
+
|
|
193
|
+
**Symptom:** Partial task scheduling (e.g., 2/4 tasks running).
|
|
194
|
+
|
|
195
|
+
**Check:**
|
|
196
|
+
1. PodGroup minMember vs actual pod count
|
|
197
|
+
2. Gang scheduling constraints
|
|
198
|
+
3. Resource availability
|
|
199
|
+
|
|
200
|
+
**Likely causes:**
|
|
201
|
+
- Gang constraint not satisfied (use `volcano-gang-scheduling`)
|
|
202
|
+
- Resource fragmentation
|
|
203
|
+
- Queue quota exhausted
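
The first check, `minMember` against the actual pod count, reduces to a single comparison. A minimal sketch follows, with the hypothetical helper `gang_shortfall` taking the two numbers as arguments; an empty value is treated as 0 because zero-valued status fields may be omitted.

```shell
# Report whether the running pod count satisfies the gang's minMember.
# Usage: gang_shortfall <minMember> <running>
gang_shortfall() {
  local min="${1:-0}" running="${2:-0}"
  if [ "$running" -ge "$min" ]; then
    echo "gang satisfied ($running/$min)"
  else
    echo "gang short by $((min - running)) ($running/$min)"
  fi
}
```

The inputs can be read from the PodGroup's `.spec.minMember` and `.status.running` fields, for example with `kubectl get podgroup -n <ns> -l volcano.sh/job-name=<job> -o jsonpath=...`.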
|
|
204
|
+
|
|
205
|
+
### Issue 3: Job Restarting Repeatedly
|
|
206
|
+
|
|
207
|
+
**Symptom:** Job keeps restarting, never completes.
|
|
208
|
+
|
|
209
|
+
**Check:**
|
|
210
|
+
1. Restart policy: `kubectl get job.batch.volcano.sh <job> -n <ns> -o jsonpath='{.spec.policies}'`
|
|
211
|
+
2. Pod failure reasons: `kubectl describe pod <pod> -n <ns>`
|
|
212
|
+
3. Container logs: `kubectl logs <pod> -n <ns>`
|
|
213
|
+
|
|
214
|
+
**Likely causes:**
|
|
215
|
+
- Application crashing (check container logs)
|
|
216
|
+
- Resource pressure causing evictions
|
|
217
|
+
- Misconfigured restart policy
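
When the application is crashing, the useful output is usually in the container instance that just died, not the one currently starting. The sketch below wraps `kubectl logs --previous` for that case; `previous_logs` is a hypothetical name. It only helps when the container restarted inside the same pod: if the pod itself was deleted and recreated by a restart policy, the earlier logs are gone unless they are collected elsewhere.

```shell
# Show the tail of the logs from the previous container instance of a pod.
# Usage: previous_logs <pod> <namespace> [lines]
previous_logs() {
  kubectl logs "$1" -n "$2" --previous --tail="${3:-50}"
}
```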
|
|
218
|
+
|
|
219
|
+
### Issue 4: Job Failed After Some Tasks Completed
|
|
220
|
+
|
|
221
|
+
**Symptom:** Some tasks succeeded, but job marked as Failed.
|
|
222
|
+
|
|
223
|
+
**Check:**
|
|
224
|
+
1. Failed task details
|
|
225
|
+
2. Job completion policy
|
|
226
|
+
3. Task lifecycle policies
|
|
227
|
+
|
|
228
|
+
**Likely causes:**
|
|
229
|
+
- One critical task failed
|
|
230
|
+
- Completion policy is strict (all tasks must succeed)
|
|
231
|
+
- Lifecycle policy triggered premature job failure
|
|
232
|
+
|
|
233
|
+
### Issue 5: Job Aborted Unexpectedly
|
|
234
|
+
|
|
235
|
+
**Symptom:** Job was Running, then moved to `Aborted`.
|
|
236
|
+
|
|
237
|
+
**Check:**
|
|
238
|
+
```bash
|
|
239
|
+
# Check maxRetry
|
|
240
|
+
kubectl get job.batch.volcano.sh <job> -n <ns> -o jsonpath='{.spec.maxRetry}'
|
|
241
|
+
|
|
242
|
+
# Check for preemption/eviction events
|
|
243
|
+
kubectl get events -n <ns> --field-selector reason=Preempted
|
|
244
|
+
kubectl get events -n <ns> --field-selector reason=Evicted
|
|
245
|
+
|
|
246
|
+
# Check if running pod count dropped below minMember (Gang breakage)
|
|
247
|
+
kubectl get podgroup -n <ns> -l volcano.sh/job-name=<job> -o jsonpath='{"running: "}{.items[0].status.running}{"\nminMember: "}{.items[0].spec.minMember}'
|
|
248
|
+
```
|
|
249
|
+
|
|
250
|
+
**Likely causes:**
|
|
251
|
+
- `maxRetry` exhausted — job restarted too many times
|
|
252
|
+
- Preemption by higher-priority job — pods evicted, triggering `PodEvicted → AbortJob` policy
|
|
253
|
+
- Gang breakage — pod eviction caused running count to drop below `minMember`, tearing down the entire group
|
|
254
|
+
- Lifecycle policy mismatch — e.g., `PodEvicted → AbortJob` when `RestartTask` would be more appropriate
|
|
255
|
+
|
|
256
|
+
## Task Lifecycle Policies
|
|
257
|
+
|
|
258
|
+
Volcano controls task coordination through lifecycle policies, not explicit task dependencies.
|
|
259
|
+
|
|
260
|
+
```yaml
|
|
261
|
+
spec:
|
|
262
|
+
  tasks:
|
|
263
|
+
    - name: master
|
|
264
|
+
      replicas: 1
|
|
265
|
+
      policies:
|
|
266
|
+
        - event: TaskCompleted
|
|
267
|
+
          action: CompleteJob
|
|
268
|
+
    - name: worker
|
|
269
|
+
      replicas: 4
|
|
270
|
+
      policies:
|
|
271
|
+
        - event: PodFailed
|
|
272
|
+
          action: RestartTask
|
|
273
|
+
```
|
|
274
|
+
|
|
275
|
+
**Diagnosis:**
|
|
276
|
+
```bash
|
|
277
|
+
# Check per-task status counts
|
|
278
|
+
kubectl get job.batch.volcano.sh <job> -n <ns> -o jsonpath='{.status.taskStatusCount}'
|
|
279
|
+
|
|
280
|
+
# Check configured policies
|
|
281
|
+
kubectl get job.batch.volcano.sh <job> -n <ns> -o jsonpath='{.spec.tasks[*].policies}'
|
|
282
|
+
```
|
|
283
|
+
|
|
284
|
+
Look for mismatched events/actions that could cause unexpected restarts or premature completion.
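
The `{.spec.tasks[*].policies}` output concatenates all tasks, which makes it hard to tell which policy belongs to which task. As a sketch, a nested `range` prints one line per task with its event/action pairs; `task_policies` is a hypothetical wrapper name.

```shell
# Print one line per task: "<task>: <event>-><action> ...".
# Usage: task_policies <job-name> <namespace>
task_policies() {
  kubectl get job.batch.volcano.sh "$1" -n "$2" \
    -o jsonpath='{range .spec.tasks[*]}{.name}{":"}{range .policies[*]}{" "}{.event}{"->"}{.action}{end}{"\n"}{end}'
}
```

A task line with nothing after the colon has no task-level policies, so only the job-level `spec.policies` apply to it.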
|
|
285
|
+
|
|
286
|
+
## Integration with Other Skills
|
|
287
|
+
|
|
288
|
+
Use this skill in combination with others:
|
|
289
|
+
|
|
290
|
+
```bash
|
|
291
|
+
# 1. Job-level diagnosis
|
|
292
|
+
bash skills/core/volcano-job-diagnose/scripts/diagnose-job.sh --job my-job --namespace training
|
|
293
|
+
|
|
294
|
+
# 2. If PodGroup issues found → Pod-level diagnosis
|
|
295
|
+
bash skills/core/volcano-diagnose-pod/scripts/diagnose-pod.sh --pod my-job-worker-0 --namespace training
|
|
296
|
+
|
|
297
|
+
# 3. If Gang issues → Gang scheduling analysis
|
|
298
|
+
# (refer to volcano-gang-scheduling skill)
|
|
299
|
+
|
|
300
|
+
# 4. If Queue issues → Queue diagnosis
|
|
301
|
+
bash skills/core/volcano-queue-diagnose/scripts/diagnose-queue.sh --queue training-queue
|
|
302
|
+
|
|
303
|
+
# 5. Check scheduler logs for decisions
|
|
304
|
+
bash skills/core/volcano-scheduler-logs/scripts/get-scheduler-logs.sh --pod my-job-worker-0 --since 1h
|
|
305
|
+
```
|
|
306
|
+
|
|
307
|
+
## Environment Variables
|
|
308
|
+
|
|
309
|
+
| Variable | Default | Description |
|
|
310
|
+
|----------|---------|-------------|
|
|
311
|
+
| `VOLCANO_NAMESPACE` | `default` | Default namespace for job lookup |
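
How the bundled scripts resolve the namespace is not shown here. A plausible sketch, assuming an explicit argument takes precedence over `VOLCANO_NAMESPACE` and `default` is the last fallback, is the hypothetical function below.

```shell
# Resolve the namespace: explicit argument first, then VOLCANO_NAMESPACE, then "default".
resolve_namespace() {
  echo "${1:-${VOLCANO_NAMESPACE:-default}}"
}
```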
|
|
312
|
+
|
|
313
|
+
## See Also
|
|
314
|
+
|
|
315
|
+
- `volcano-diagnose-pod` - Pod-level scheduling diagnosis
|
|
316
|
+
- `volcano-gang-scheduling` - Gang scheduling constraint analysis
|
|
317
|
+
- `volcano-queue-diagnose` - Queue resource analysis
|
|
318
|
+
- `volcano-scheduler-logs` - Scheduler decision logs
|
|
319
|
+
- `deployment-rollout-debug` - (Similar concept for Deployments)
|