siclaw 0.1.0 → 0.1.2
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +75 -114
- package/dist/agentbox/gateway-client.d.ts +2 -1
- package/dist/agentbox/gateway-client.js +6 -2
- package/dist/agentbox/gateway-client.js.map +1 -1
- package/dist/agentbox/http-server.js +184 -19
- package/dist/agentbox/http-server.js.map +1 -1
- package/dist/agentbox/resource-handlers.d.ts +1 -0
- package/dist/agentbox/resource-handlers.js +23 -23
- package/dist/agentbox/resource-handlers.js.map +1 -1
- package/dist/agentbox/session.js +85 -5
- package/dist/agentbox/session.js.map +1 -1
- package/dist/agentbox-main.d.ts +2 -1
- package/dist/agentbox-main.js +65 -18
- package/dist/agentbox-main.js.map +1 -1
- package/dist/cli-credentials.d.ts +1 -0
- package/dist/cli-credentials.js +109 -0
- package/dist/cli-credentials.js.map +1 -0
- package/dist/cli-first-run.d.ts +11 -0
- package/dist/cli-first-run.js +99 -0
- package/dist/cli-first-run.js.map +1 -0
- package/dist/cli-main.js +33 -11
- package/dist/cli-main.js.map +1 -1
- package/dist/cli-setup.d.ts +5 -11
- package/dist/cli-setup.js +12 -225
- package/dist/cli-setup.js.map +1 -1
- package/dist/core/agent-factory.d.ts +4 -0
- package/dist/core/agent-factory.js +102 -151
- package/dist/core/agent-factory.js.map +1 -1
- package/dist/core/config.d.ts +10 -3
- package/dist/core/config.js +11 -95
- package/dist/core/config.js.map +1 -1
- package/dist/core/extensions/deep-investigation.d.ts +2 -1
- package/dist/core/extensions/deep-investigation.js +144 -24
- package/dist/core/extensions/deep-investigation.js.map +1 -1
- package/dist/core/extensions/setup.d.ts +8 -0
- package/dist/core/extensions/setup.js +669 -0
- package/dist/core/extensions/setup.js.map +1 -0
- package/dist/core/llm-proxy.js +7 -3
- package/dist/core/llm-proxy.js.map +1 -1
- package/dist/core/mcp-client.d.ts +0 -10
- package/dist/core/mcp-client.js +0 -65
- package/dist/core/mcp-client.js.map +1 -1
- package/dist/core/prompt.d.ts +1 -1
- package/dist/core/prompt.js +42 -5
- package/dist/core/prompt.js.map +1 -1
- package/dist/core/provider-presets.d.ts +14 -0
- package/dist/core/provider-presets.js +81 -0
- package/dist/core/provider-presets.js.map +1 -0
- package/dist/cron/cron-coordinator.d.ts +2 -0
- package/dist/cron/cron-coordinator.js +46 -14
- package/dist/cron/cron-coordinator.js.map +1 -1
- package/dist/cron/cron-executor.js +33 -8
- package/dist/cron/cron-executor.js.map +1 -1
- package/dist/cron/cron-scheduler.d.ts +1 -1
- package/dist/cron/gateway-client.d.ts +5 -0
- package/dist/cron/gateway-client.js +43 -8
- package/dist/cron/gateway-client.js.map +1 -1
- package/dist/cron-main.js +39 -9
- package/dist/cron-main.js.map +1 -1
- package/dist/gateway/agentbox/client.d.ts +11 -0
- package/dist/gateway/agentbox/client.js +18 -0
- package/dist/gateway/agentbox/client.js.map +1 -1
- package/dist/gateway/agentbox/k8s-spawner.d.ts +11 -2
- package/dist/gateway/agentbox/k8s-spawner.js +95 -52
- package/dist/gateway/agentbox/k8s-spawner.js.map +1 -1
- package/dist/gateway/agentbox/local-spawner.d.ts +1 -1
- package/dist/gateway/agentbox/local-spawner.js +4 -2
- package/dist/gateway/agentbox/local-spawner.js.map +1 -1
- package/dist/gateway/agentbox/manager.d.ts +0 -10
- package/dist/gateway/agentbox/manager.js +11 -30
- package/dist/gateway/agentbox/manager.js.map +1 -1
- package/dist/gateway/agentbox/types.d.ts +6 -4
- package/dist/gateway/cron/cron-service.d.ts +49 -0
- package/dist/gateway/cron/cron-service.js +259 -0
- package/dist/gateway/cron/cron-service.js.map +1 -0
- package/dist/gateway/db/init-schema.js +44 -0
- package/dist/gateway/db/init-schema.js.map +1 -1
- package/dist/gateway/db/migrate-sqlite.js +73 -4
- package/dist/gateway/db/migrate-sqlite.js.map +1 -1
- package/dist/gateway/db/repositories/chat-repo.d.ts +56 -2
- package/dist/gateway/db/repositories/chat-repo.js +132 -2
- package/dist/gateway/db/repositories/chat-repo.js.map +1 -1
- package/dist/gateway/db/repositories/config-repo.d.ts +31 -2
- package/dist/gateway/db/repositories/config-repo.js +57 -7
- package/dist/gateway/db/repositories/config-repo.js.map +1 -1
- package/dist/gateway/db/repositories/env-repo.d.ts +14 -0
- package/dist/gateway/db/repositories/env-repo.js +15 -2
- package/dist/gateway/db/repositories/env-repo.js.map +1 -1
- package/dist/gateway/db/repositories/model-config-repo.d.ts +1 -1
- package/dist/gateway/db/repositories/model-config-repo.js +26 -12
- package/dist/gateway/db/repositories/model-config-repo.js.map +1 -1
- package/dist/gateway/db/repositories/skill-repo.d.ts +0 -5
- package/dist/gateway/db/repositories/skill-review-repo.d.ts +1 -0
- package/dist/gateway/db/repositories/skill-review-repo.js +4 -1
- package/dist/gateway/db/repositories/skill-review-repo.js.map +1 -1
- package/dist/gateway/db/repositories/skill-version-repo.js +0 -1
- package/dist/gateway/db/repositories/skill-version-repo.js.map +1 -1
- package/dist/gateway/db/repositories/system-config-repo.d.ts +1 -1
- package/dist/gateway/db/repositories/system-config-repo.js +2 -1
- package/dist/gateway/db/repositories/system-config-repo.js.map +1 -1
- package/dist/gateway/db/repositories/user-env-config-repo.d.ts +13 -0
- package/dist/gateway/db/repositories/user-env-config-repo.js +11 -0
- package/dist/gateway/db/repositories/user-env-config-repo.js.map +1 -1
- package/dist/gateway/db/repositories/workspace-repo.d.ts +3 -2
- package/dist/gateway/db/repositories/workspace-repo.js +6 -2
- package/dist/gateway/db/repositories/workspace-repo.js.map +1 -1
- package/dist/gateway/db/schema-mysql.d.ts +473 -51
- package/dist/gateway/db/schema-mysql.js +35 -4
- package/dist/gateway/db/schema-mysql.js.map +1 -1
- package/dist/gateway/db/schema-sqlite.d.ts +522 -57
- package/dist/gateway/db/schema-sqlite.js +38 -6
- package/dist/gateway/db/schema-sqlite.js.map +1 -1
- package/dist/gateway/db/schema.d.ts +471 -51
- package/dist/gateway/db/schema.js +1 -1
- package/dist/gateway/db/schema.js.map +1 -1
- package/dist/gateway/metrics-aggregator.d.ts +65 -0
- package/dist/gateway/metrics-aggregator.js +244 -0
- package/dist/gateway/metrics-aggregator.js.map +1 -0
- package/dist/gateway/plugins/channel-bridge.d.ts +4 -1
- package/dist/gateway/plugins/channel-bridge.js +78 -86
- package/dist/gateway/plugins/channel-bridge.js.map +1 -1
- package/dist/gateway/rpc-methods.d.ts +4 -2
- package/dist/gateway/rpc-methods.js +962 -163
- package/dist/gateway/rpc-methods.js.map +1 -1
- package/dist/gateway/security/cert-manager.d.ts +2 -2
- package/dist/gateway/security/cert-manager.js +4 -2
- package/dist/gateway/security/cert-manager.js.map +1 -1
- package/dist/gateway/server.d.ts +4 -8
- package/dist/gateway/server.js +297 -261
- package/dist/gateway/server.js.map +1 -1
- package/dist/gateway/skills/file-writer.js +17 -11
- package/dist/gateway/skills/file-writer.js.map +1 -1
- package/dist/gateway/skills/script-evaluator.js +12 -9
- package/dist/gateway/skills/script-evaluator.js.map +1 -1
- package/dist/gateway/web/dist/assets/index-0p17ZeTP.js +740 -0
- package/dist/gateway/web/dist/assets/index-9eP6nPUq.js +741 -0
- package/dist/gateway/web/dist/assets/index-9eP6nPUq.js.map +1 -0
- package/dist/gateway/web/dist/assets/index-CAmSY91d.js +675 -0
- package/dist/gateway/web/dist/assets/index-DMFEh8Pp.css +1 -0
- package/dist/gateway/web/dist/assets/index-DyowBCEj.css +1 -0
- package/dist/gateway/web/dist/assets/index-PDK5JJDO.css +1 -0
- package/dist/gateway/web/dist/index.html +2 -2
- package/dist/gateway-main.js +27 -10
- package/dist/gateway-main.js.map +1 -1
- package/dist/memory/embeddings.js +5 -4
- package/dist/memory/embeddings.js.map +1 -1
- package/dist/memory/indexer.d.ts +23 -3
- package/dist/memory/indexer.js +235 -23
- package/dist/memory/indexer.js.map +1 -1
- package/dist/memory/schema.js +15 -1
- package/dist/memory/schema.js.map +1 -1
- package/dist/memory/types.d.ts +18 -0
- package/dist/memory/types.js +6 -1
- package/dist/memory/types.js.map +1 -1
- package/dist/shared/detect-language.d.ts +12 -0
- package/dist/shared/detect-language.js +78 -0
- package/dist/shared/detect-language.js.map +1 -0
- package/dist/shared/diagnostic-events.d.ts +70 -0
- package/dist/shared/diagnostic-events.js +38 -0
- package/dist/shared/diagnostic-events.js.map +1 -0
- package/dist/shared/local-collector.d.ts +56 -0
- package/dist/shared/local-collector.js +284 -0
- package/dist/shared/local-collector.js.map +1 -0
- package/dist/shared/metrics-types.d.ts +64 -0
- package/dist/shared/metrics-types.js +25 -0
- package/dist/shared/metrics-types.js.map +1 -0
- package/dist/shared/metrics.d.ts +19 -0
- package/dist/shared/metrics.js +185 -0
- package/dist/shared/metrics.js.map +1 -0
- package/dist/shared/path-utils.d.ts +15 -0
- package/dist/shared/path-utils.js +23 -0
- package/dist/shared/path-utils.js.map +1 -0
- package/dist/shared/retry.d.ts +35 -0
- package/dist/shared/retry.js +61 -0
- package/dist/shared/retry.js.map +1 -0
- package/dist/tools/command-sets.d.ts +18 -2
- package/dist/tools/command-sets.js +207 -32
- package/dist/tools/command-sets.js.map +1 -1
- package/dist/tools/command-validator.d.ts +56 -0
- package/dist/tools/command-validator.js +357 -0
- package/dist/tools/command-validator.js.map +1 -0
- package/dist/tools/create-skill.js +26 -1
- package/dist/tools/create-skill.js.map +1 -1
- package/dist/tools/credential-list.js +1 -23
- package/dist/tools/credential-list.js.map +1 -1
- package/dist/tools/credential-manager.d.ts +98 -0
- package/dist/tools/credential-manager.js +313 -0
- package/dist/tools/credential-manager.js.map +1 -0
- package/dist/tools/deep-search/engine.js +184 -127
- package/dist/tools/deep-search/engine.js.map +1 -1
- package/dist/tools/deep-search/prompts.d.ts +10 -2
- package/dist/tools/deep-search/prompts.js +37 -36
- package/dist/tools/deep-search/prompts.js.map +1 -1
- package/dist/tools/deep-search/schemas.d.ts +87 -0
- package/dist/tools/deep-search/schemas.js +85 -0
- package/dist/tools/deep-search/schemas.js.map +1 -0
- package/dist/tools/deep-search/sub-agent.d.ts +21 -0
- package/dist/tools/deep-search/sub-agent.js +153 -4
- package/dist/tools/deep-search/sub-agent.js.map +1 -1
- package/dist/tools/deep-search/tool.js +1 -0
- package/dist/tools/deep-search/tool.js.map +1 -1
- package/dist/tools/deep-search/types.d.ts +2 -0
- package/dist/tools/deep-search/types.js.map +1 -1
- package/dist/tools/dp-tools.js +29 -5
- package/dist/tools/dp-tools.js.map +1 -1
- package/dist/tools/exec-utils.d.ts +85 -0
- package/dist/tools/exec-utils.js +294 -0
- package/dist/tools/exec-utils.js.map +1 -0
- package/dist/tools/fork-skill.js +14 -2
- package/dist/tools/fork-skill.js.map +1 -1
- package/dist/tools/investigation-feedback.d.ts +3 -0
- package/dist/tools/investigation-feedback.js +71 -0
- package/dist/tools/investigation-feedback.js.map +1 -0
- package/dist/tools/manage-schedule.js +16 -6
- package/dist/tools/manage-schedule.js.map +1 -1
- package/dist/tools/netns-script.js +27 -281
- package/dist/tools/netns-script.js.map +1 -1
- package/dist/tools/node-exec.d.ts +2 -14
- package/dist/tools/node-exec.js +18 -225
- package/dist/tools/node-exec.js.map +1 -1
- package/dist/tools/node-script.js +14 -168
- package/dist/tools/node-script.js.map +1 -1
- package/dist/tools/pod-exec.d.ts +1 -1
- package/dist/tools/pod-exec.js +10 -26
- package/dist/tools/pod-exec.js.map +1 -1
- package/dist/tools/pod-nsenter-exec.js +21 -225
- package/dist/tools/pod-nsenter-exec.js.map +1 -1
- package/dist/tools/pod-script.js +10 -19
- package/dist/tools/pod-script.js.map +1 -1
- package/dist/tools/restricted-bash.d.ts +1 -17
- package/dist/tools/restricted-bash.js +38 -252
- package/dist/tools/restricted-bash.js.map +1 -1
- package/dist/tools/run-skill.d.ts +3 -1
- package/dist/tools/run-skill.js +21 -1
- package/dist/tools/run-skill.js.map +1 -1
- package/dist/tools/script-resolver.d.ts +3 -1
- package/dist/tools/script-resolver.js +74 -30
- package/dist/tools/script-resolver.js.map +1 -1
- package/dist/tools/update-skill.js +17 -6
- package/dist/tools/update-skill.js.map +1 -1
- package/package.json +8 -6
- package/siclaw.mjs +10 -1
- package/skills/core/cluster-events/SKILL.md +1 -1
- package/skills/core/deep-investigation/SKILL.md +11 -0
- package/skills/core/deployment-rollout-debug/SKILL.md +1 -1
- package/skills/core/dns-debug/SKILL.md +1 -0
- package/skills/core/meta.json +12 -1
- package/skills/core/networkpolicy-debug/SKILL.md +332 -0
- package/skills/core/node-logs/scripts/get-node-logs.sh +19 -9
- package/skills/core/pod-pending-debug/SKILL.md +1 -0
- package/skills/core/quota-debug/SKILL.md +203 -0
- package/skills/core/service-debug/SKILL.md +1 -0
- package/skills/core/statefulset-debug/SKILL.md +280 -0
- package/skills/core/volcano-diagnose-pod/SKILL.md +196 -0
- package/skills/core/volcano-diagnose-pod/scripts/diagnose-pod.sh +175 -0
- package/skills/core/volcano-gang-scheduling/SKILL.md +299 -0
- package/skills/core/volcano-job-diagnose/SKILL.md +319 -0
- package/skills/core/volcano-job-diagnose/scripts/diagnose-job.sh +253 -0
- package/skills/core/volcano-node-resources/SKILL.md +334 -0
- package/skills/core/volcano-node-resources/scripts/get-node-resources.sh +281 -0
- package/skills/core/volcano-queue-diagnose/SKILL.md +294 -0
- package/skills/core/volcano-queue-diagnose/scripts/diagnose-queue.sh +283 -0
- package/skills/core/volcano-resource-insufficient/SKILL.md +315 -0
- package/skills/core/volcano-scheduler-config/SKILL.md +371 -0
- package/skills/core/volcano-scheduler-config/scripts/get-scheduler-config.sh +297 -0
- package/skills/core/volcano-scheduler-logs/SKILL.md +241 -0
- package/skills/core/volcano-scheduler-logs/scripts/get-scheduler-logs.sh +159 -0
- package/skills/platform/create-skill/SKILL.md +35 -3
- package/skills/platform/manage-skill/SKILL.md +9 -2
- package/skills/platform/update-skill/SKILL.md +17 -6
|
@@ -0,0 +1,281 @@
|
|
|
1
|
+
#!/bin/bash
|
|
2
|
+
# Query cluster node resources for Volcano scheduling.
|
|
3
|
+
# This script performs read-only operations using kubectl.
|
|
4
|
+
set -euo pipefail
|
|
5
|
+
|
|
6
|
+
show_help() {
|
|
7
|
+
cat <<EOF
|
|
8
|
+
Usage: $0 [options]
|
|
9
|
+
|
|
10
|
+
Query cluster node resources to understand capacity and availability.
|
|
11
|
+
Checks allocatable CPU, memory, GPU, and current usage.
|
|
12
|
+
|
|
13
|
+
Options:
|
|
14
|
+
--node NODE Query specific node only
|
|
15
|
+
--label LABEL Filter nodes by label (e.g., gpu=true)
|
|
16
|
+
--show-usage Show current resource usage (requires metrics-server)
|
|
17
|
+
--show-pods Show pods running on each node
|
|
18
|
+
--format FORMAT Output format: table (default), json, wide
|
|
19
|
+
-h, --help Show this help message
|
|
20
|
+
|
|
21
|
+
Examples:
|
|
22
|
+
$0 # All nodes
|
|
23
|
+
$0 --node worker-1 # Specific node
|
|
24
|
+
$0 --label nvidia.com/gpu.present=true # GPU nodes
|
|
25
|
+
$0 --show-usage --show-pods # With usage and pods
|
|
26
|
+
$0 --format json # JSON output
|
|
27
|
+
EOF
|
|
28
|
+
exit 0
|
|
29
|
+
}
|
|
30
|
+
|
|
31
|
+
# Parse arguments
|
|
32
|
+
NODE=""
|
|
33
|
+
LABEL=""
|
|
34
|
+
SHOW_USAGE=false
|
|
35
|
+
SHOW_PODS=false
|
|
36
|
+
FORMAT="table"
|
|
37
|
+
|
|
38
|
+
while [[ $# -gt 0 ]]; do
|
|
39
|
+
case $1 in
|
|
40
|
+
-h|--help) show_help ;;
|
|
41
|
+
--node) NODE="$2"; shift 2 ;;
|
|
42
|
+
--label) LABEL="$2"; shift 2 ;;
|
|
43
|
+
--show-usage) SHOW_USAGE=true; shift ;;
|
|
44
|
+
--show-pods) SHOW_PODS=true; shift ;;
|
|
45
|
+
--format) FORMAT="$2"; shift 2 ;;
|
|
46
|
+
*) echo "Unknown option: $1. Use --help for usage." >&2; exit 1 ;;
|
|
47
|
+
esac
|
|
48
|
+
done
|
|
49
|
+
|
|
50
|
+
# Validate format
|
|
51
|
+
if [[ "$FORMAT" != "table" && "$FORMAT" != "json" && "$FORMAT" != "wide" ]]; then
|
|
52
|
+
echo "Error: Invalid format '$FORMAT'. Use: table, json, or wide" >&2
|
|
53
|
+
exit 1
|
|
54
|
+
fi
|
|
55
|
+
|
|
56
|
+
echo "=== Volcano Node Resources ==="
|
|
57
|
+
[[ -n "$NODE" ]] && echo "Node: $NODE"
|
|
58
|
+
[[ -n "$LABEL" ]] && echo "Label filter: $LABEL"
|
|
59
|
+
echo "Show usage: $SHOW_USAGE"
|
|
60
|
+
echo "Show pods: $SHOW_PODS"
|
|
61
|
+
echo "Format: $FORMAT"
|
|
62
|
+
echo
|
|
63
|
+
|
|
64
|
+
# Build kubectl get nodes command
|
|
65
|
+
NODE_CMD="kubectl get nodes"
|
|
66
|
+
[[ -n "$LABEL" ]] && NODE_CMD="$NODE_CMD -l $LABEL"
|
|
67
|
+
[[ -n "$NODE" ]] && NODE_CMD="$NODE_CMD $NODE"
|
|
68
|
+
|
|
69
|
+
# Check if nodes exist
|
|
70
|
+
if ! $NODE_CMD -o name &>/dev/null; then
|
|
71
|
+
echo "Error: No nodes found matching criteria" >&2
|
|
72
|
+
exit 1
|
|
73
|
+
fi
|
|
74
|
+
|
|
75
|
+
# Function to get node resources
|
|
76
|
+
get_node_resources() {
|
|
77
|
+
local node="$1"
|
|
78
|
+
|
|
79
|
+
# Get allocatable resources
|
|
80
|
+
local cpu_alloc mem_alloc gpu_alloc pods_alloc
|
|
81
|
+
cpu_alloc=$(kubectl get node "$node" -o jsonpath='{.status.allocatable.cpu}' 2>/dev/null || echo "N/A")
|
|
82
|
+
mem_alloc=$(kubectl get node "$node" -o jsonpath='{.status.allocatable.memory}' 2>/dev/null || echo "N/A")
|
|
83
|
+
gpu_alloc=$(kubectl get node "$node" -o jsonpath='{.status.allocatable.nvidia\.com/gpu}' 2>/dev/null || echo "0"); gpu_alloc="${gpu_alloc:-0}"  # jsonpath prints nothing (exit 0) when the node has no GPU resource
|
|
84
|
+
pods_alloc=$(kubectl get node "$node" -o jsonpath='{.status.allocatable.pods}' 2>/dev/null || echo "N/A")
|
|
85
|
+
|
|
86
|
+
# Get capacity
|
|
87
|
+
local cpu_cap mem_cap
|
|
88
|
+
cpu_cap=$(kubectl get node "$node" -o jsonpath='{.status.capacity.cpu}' 2>/dev/null || echo "N/A")
|
|
89
|
+
mem_cap=$(kubectl get node "$node" -o jsonpath='{.status.capacity.memory}' 2>/dev/null || echo "N/A")
|
|
90
|
+
|
|
91
|
+
# Get status and age
|
|
92
|
+
local status age
|
|
93
|
+
status=$(kubectl get node "$node" -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}' 2>/dev/null || echo "Unknown")
|
|
94
|
+
# Calculate age in days (cross-platform: GNU date on Linux, BSD date on macOS)
|
|
95
|
+
age=$(kubectl get node "$node" -o jsonpath='{.metadata.creationTimestamp}' 2>/dev/null | {
|
|
96
|
+
IFS= read -r timestamp
|
|
97
|
+
if [[ -n "$timestamp" ]]; then
|
|
98
|
+
if date -d "$timestamp" +%s &>/dev/null; then
|
|
99
|
+
# GNU date (Linux)
|
|
100
|
+
created=$(date -d "$timestamp" +%s 2>/dev/null)
|
|
101
|
+
else
|
|
102
|
+
# BSD date (macOS)
|
|
103
|
+
created=$(date -j -f "%Y-%m-%dT%H:%M:%SZ" "$timestamp" +%s 2>/dev/null)
|
|
104
|
+
fi
|
|
105
|
+
now=$(date +%s)
|
|
106
|
+
if [[ -n "$created" && -n "$now" ]]; then
|
|
107
|
+
echo $(( (now - created) / 86400 ))
|
|
108
|
+
else
|
|
109
|
+
echo "N/A"
|
|
110
|
+
fi
|
|
111
|
+
else
|
|
112
|
+
echo "N/A"
|
|
113
|
+
fi
|
|
114
|
+
})
|
|
115
|
+
|
|
116
|
+
# Get taints
|
|
117
|
+
local taints
|
|
118
|
+
taints=$(kubectl get node "$node" -o jsonpath='{.spec.taints[*].key}' 2>/dev/null || echo "")
|
|
119
|
+
|
|
120
|
+
# Get allocated resources (from describe)
|
|
121
|
+
local cpu_req mem_req
|
|
122
|
+
if describe_output=$(kubectl describe node "$node" 2>/dev/null); then
|
|
123
|
+
cpu_req=$(echo "$describe_output" | grep -A 5 "Allocated resources" | awk '$1 == "cpu" {print $2; found=1} END {if (!found) print "N/A"}' || true)  # describe lists rows as "cpu <requests> (<pct>) <limits> (<pct>)"
|
|
124
|
+
mem_req=$(echo "$describe_output" | grep -A 5 "Allocated resources" | awk '$1 == "memory" {print $2; found=1} END {if (!found) print "N/A"}' || true)  # describe lists rows as "memory <requests> (<pct>) <limits> (<pct>)"
|
|
125
|
+
else
|
|
126
|
+
cpu_req="N/A"
|
|
127
|
+
mem_req="N/A"
|
|
128
|
+
fi
|
|
129
|
+
|
|
130
|
+
# Calculate available (rough estimate)
|
|
131
|
+
local cpu_avail="N/A"
|
|
132
|
+
local mem_avail="N/A"
|
|
133
|
+
|
|
134
|
+
# Try to calculate if we have numeric values
|
|
135
|
+
if [[ "$cpu_alloc" =~ ^[0-9]+$ && "$cpu_req" =~ ^[0-9]+m?$ ]]; then
|
|
136
|
+
# Convert millicores to cores if needed
|
|
137
|
+
local alloc_val req_val
|
|
138
|
+
alloc_val=$cpu_alloc
|
|
139
|
+
if [[ "$cpu_req" =~ m$ ]]; then
|
|
140
|
+
req_val=$(echo "${cpu_req%m}" | awk '{print $1/1000}')
|
|
141
|
+
else
|
|
142
|
+
req_val=$cpu_req
|
|
143
|
+
fi
|
|
144
|
+
cpu_avail=$(awk "BEGIN {printf \"%.0f\", $alloc_val - $req_val}")
|
|
145
|
+
fi
|
|
146
|
+
|
|
147
|
+
# Output based on format
|
|
148
|
+
case "$FORMAT" in
|
|
149
|
+
table)
|
|
150
|
+
echo "Node: $node"
|
|
151
|
+
echo " Status: $status"
|
|
152
|
+
echo " Age: ${age}d"
|
|
153
|
+
[[ -n "$taints" ]] && echo " Taints: $taints"
|
|
154
|
+
echo " Resources:"
|
|
155
|
+
echo " CPU: Allocatable=$cpu_alloc | Requested=$cpu_req | Available=$cpu_avail"
|
|
156
|
+
echo " Memory: Allocatable=$mem_alloc | Requested=$mem_req | Available=$mem_avail"
|
|
157
|
+
[[ "$gpu_alloc" != "0" ]] && echo " GPU: Allocatable=$gpu_alloc"
|
|
158
|
+
[[ "$SHOW_USAGE" == "true" ]] && echo " Usage: (see metrics below)"
|
|
159
|
+
|
|
160
|
+
if [[ "$SHOW_USAGE" == "true" ]]; then
|
|
161
|
+
echo
|
|
162
|
+
echo " Resource Usage (requires metrics-server):"
|
|
163
|
+
if kubectl top node "$node" 2>/dev/null; then
|
|
164
|
+
: # success
|
|
165
|
+
else
|
|
166
|
+
echo " (Metrics not available)"
|
|
167
|
+
fi
|
|
168
|
+
fi
|
|
169
|
+
|
|
170
|
+
if [[ "$SHOW_PODS" == "true" ]]; then
|
|
171
|
+
echo
|
|
172
|
+
echo " Running Pods:"
|
|
173
|
+
kubectl get pods --all-namespaces --field-selector spec.nodeName="$node" -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,STATUS:.status.phase,CPU_REQ:.spec.containers[*].resources.requests.cpu,MEM_REQ:.spec.containers[*].resources.requests.memory' 2>/dev/null | head -20 || echo " (Failed to list pods)"
|
|
174
|
+
fi
|
|
175
|
+
echo
|
|
176
|
+
;;
|
|
177
|
+
|
|
178
|
+
wide)
|
|
179
|
+
echo "$node $cpu_alloc $mem_alloc $gpu_alloc $status ${age}d"
|
|
180
|
+
;;
|
|
181
|
+
|
|
182
|
+
json)
|
|
183
|
+
echo " {"
|
|
184
|
+
echo " \"name\": \"$node\","
|
|
185
|
+
echo " \"status\": \"$status\","
|
|
186
|
+
echo " \"age_days\": $age,"
|
|
187
|
+
[[ -n "$taints" ]] && echo " \"taints\": \"$taints\","
|
|
188
|
+
echo " \"allocatable\": {"
|
|
189
|
+
echo " \"cpu\": \"$cpu_alloc\","
|
|
190
|
+
echo " \"memory\": \"$mem_alloc\","
|
|
191
|
+
echo " \"gpu\": \"$gpu_alloc\","
|
|
192
|
+
echo " \"pods\": \"$pods_alloc\""
|
|
193
|
+
echo " },"
|
|
194
|
+
echo " \"requested\": {"
|
|
195
|
+
echo " \"cpu\": \"$cpu_req\","
|
|
196
|
+
echo " \"memory\": \"$mem_req\""
|
|
197
|
+
echo " },"
|
|
198
|
+
echo " \"available\": {"
|
|
199
|
+
echo " \"cpu\": \"$cpu_avail\","
|
|
200
|
+
echo " \"memory\": \"$mem_avail\""
|
|
201
|
+
echo " }"
|
|
202
|
+
echo " }"
|
|
203
|
+
;;
|
|
204
|
+
esac
|
|
205
|
+
}
|
|
206
|
+
|
|
207
|
+
# Main logic
|
|
208
|
+
case "$FORMAT" in
|
|
209
|
+
table|wide)
|
|
210
|
+
if [[ "$FORMAT" == "wide" ]]; then
|
|
211
|
+
echo "NAME CPU_ALLOC MEM_ALLOC GPU_ALLOC STATUS AGE"
|
|
212
|
+
echo "==== ========= ========= ========= ====== ===="
|
|
213
|
+
fi
|
|
214
|
+
|
|
215
|
+
if [[ -n "$NODE" ]]; then
|
|
216
|
+
get_node_resources "$NODE"
|
|
217
|
+
else
|
|
218
|
+
# Use process substitution instead of pipe to avoid subshell
|
|
219
|
+
while read -r n; do
|
|
220
|
+
[[ -n "$n" ]] && get_node_resources "$n"
|
|
221
|
+
done < <(kubectl get nodes ${LABEL:+-l $LABEL} -o jsonpath='{.items[*].metadata.name}' 2>/dev/null | tr ' ' '\n')
|
|
222
|
+
fi
|
|
223
|
+
;;
|
|
224
|
+
|
|
225
|
+
json)
|
|
226
|
+
echo "{"
|
|
227
|
+
echo " \"nodes\": ["
|
|
228
|
+
|
|
229
|
+
# Use process substitution to avoid subshell issue with 'first' variable
|
|
230
|
+
first=true
|
|
231
|
+
if [[ -n "$NODE" ]]; then
|
|
232
|
+
get_node_resources "$NODE"
|
|
233
|
+
else
|
|
234
|
+
while read -r n; do
|
|
235
|
+
if [[ -n "$n" ]]; then
|
|
236
|
+
[[ "$first" == "false" ]] && echo ","
|
|
237
|
+
get_node_resources "$n"
|
|
238
|
+
first=false
|
|
239
|
+
fi
|
|
240
|
+
done < <(kubectl get nodes ${LABEL:+-l $LABEL} -o jsonpath='{.items[*].metadata.name}' 2>/dev/null | tr ' ' '\n')
|
|
241
|
+
fi
|
|
242
|
+
|
|
243
|
+
echo
|
|
244
|
+
echo " ]"
|
|
245
|
+
echo "}"
|
|
246
|
+
;;
|
|
247
|
+
esac
|
|
248
|
+
|
|
249
|
+
# Summary for table format
|
|
250
|
+
if [[ "$FORMAT" == "table" ]]; then
|
|
251
|
+
echo "=== Summary ==="
|
|
252
|
+
|
|
253
|
+
# Count nodes by status
|
|
254
|
+
total=$(kubectl get nodes ${LABEL:+-l $LABEL} 2>/dev/null | wc -l)
|
|
255
|
+
total=$((total - 1)) # Subtract header
|
|
256
|
+
ready=$(kubectl get nodes ${LABEL:+-l $LABEL} 2>/dev/null | grep -c " Ready " || true)  # grep -c already prints 0 when nothing matches
|
|
257
|
+
not_ready=$((total - ready))
|
|
258
|
+
|
|
259
|
+
echo "Total Nodes: $total"
|
|
260
|
+
echo " Ready: $ready"
|
|
261
|
+
echo " NotReady: $not_ready"
|
|
262
|
+
|
|
263
|
+
# Check for GPU nodes
|
|
264
|
+
gpu_nodes=$(kubectl get nodes ${LABEL:+-l $LABEL} -o jsonpath='{.items[*].status.allocatable.nvidia\.com/gpu}' 2>/dev/null | tr ' ' '\n' | grep -cEv '^0?$' || true)  # count non-empty, non-zero entries; do not abort under pipefail when there are none
|
|
265
|
+
if [[ "$gpu_nodes" -gt 0 ]]; then
|
|
266
|
+
echo " GPU Nodes: $gpu_nodes"
|
|
267
|
+
fi
|
|
268
|
+
|
|
269
|
+
# Check metrics availability
|
|
270
|
+
if [[ "$SHOW_USAGE" == "true" ]]; then
|
|
271
|
+
echo
|
|
272
|
+
if kubectl top nodes &>/dev/null; then
|
|
273
|
+
echo "Metrics-server: Available"
|
|
274
|
+
else
|
|
275
|
+
echo "Metrics-server: Not Available (kubectl top nodes failed)"
|
|
276
|
+
fi
|
|
277
|
+
fi
|
|
278
|
+
fi
|
|
279
|
+
|
|
280
|
+
echo
|
|
281
|
+
echo "=== Query Complete ==="
|
|
@@ -0,0 +1,294 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: volcano-queue-diagnose
|
|
3
|
+
description: >-
|
|
4
|
+
Diagnose Volcano Queue status and resource allocation.
|
|
5
|
+
Check queue weights, deserved resources, allocated resources,
|
|
6
|
+
and identify queue-related scheduling bottlenecks.
|
|
7
|
+
---
|
|
8
|
+
|
|
9
|
+
# Volcano Queue Diagnosis
|
|
10
|
+
|
|
11
|
+
Diagnose Volcano Queue status, resource allocation, and scheduling bottlenecks. This skill helps understand how resources are distributed across queues and why workloads may be pending due to queue constraints.
|
|
12
|
+
|
|
13
|
+
**Scope:** This skill is for **diagnosis only**. Once you identify the root cause, report it to the user and stop. Do NOT attempt to modify queue configurations or delete queues.
|
|
14
|
+
|
|
15
|
+
**Not applicable to native ResourceQuota:** Volcano Queue and Kubernetes ResourceQuota are independent mechanisms. If the cluster does not use Volcano, use `quota-debug` instead. To check, run `kubectl get queue 2>/dev/null`. If the command fails or lists no queues, Volcano is not installed.
|
|
16
|
+
|
|
17
|
+
## Usage
|
|
18
|
+
|
|
19
|
+
```bash
|
|
20
|
+
bash skills/core/volcano-queue-diagnose/scripts/diagnose-queue.sh [options]
|
|
21
|
+
```
|
|
22
|
+
|
|
23
|
+
## Parameters
|
|
24
|
+
|
|
25
|
+
| Parameter | Required | Description |
|
|
26
|
+
|-----------|----------|-------------|
|
|
27
|
+
| `--queue QUEUE` | no | Queue name to diagnose (default: all queues) |
|
|
28
|
+
| `--show-pods` | no | Show pods associated with each queue |
|
|
29
|
+
| `--verbose` | no | Show detailed resource breakdown |
|
|
30
|
+
|
|
31
|
+
## Examples
|
|
32
|
+
|
|
33
|
+
Diagnose all queues:
|
|
34
|
+
```bash
|
|
35
|
+
bash skills/core/volcano-queue-diagnose/scripts/diagnose-queue.sh
|
|
36
|
+
```
|
|
37
|
+
|
|
38
|
+
Diagnose specific queue:
|
|
39
|
+
```bash
|
|
40
|
+
bash skills/core/volcano-queue-diagnose/scripts/diagnose-queue.sh --queue training-queue
|
|
41
|
+
```
|
|
42
|
+
|
|
43
|
+
Show verbose output with pod information:
|
|
44
|
+
```bash
|
|
45
|
+
bash skills/core/volcano-queue-diagnose/scripts/diagnose-queue.sh --queue training-queue --show-pods --verbose
|
|
46
|
+
```
|
|
47
|
+
|
|
48
|
+
## Understanding Volcano Queues
|
|
49
|
+
|
|
50
|
+
### Queue Concept
|
|
51
|
+
|
|
52
|
+
In Volcano, a Queue is a cluster-level resource allocation unit. Jobs and PodGroups are submitted to queues, and the scheduler distributes resources among queues based on:
|
|
53
|
+
|
|
54
|
+
1. **Weight** - Relative share of cluster resources (proportional: weight 10 vs weight 2 = 83% vs 17%)
|
|
55
|
+
2. **Capability** - Maximum resources a queue can use (ceiling, not guarantee — actual allocation depends on cluster capacity and competition)
|
|
56
|
+
3. **Parent** - Hierarchical queue relationships (if enabled)
|
|
57
|
+
|
|
58
|
+
**Important:** A Queue is a **cluster-scoped** resource. PodGroups from **any namespace** can reference the same queue, so cross-namespace resource competition within a queue is expected.
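
The proportional split described under **Weight** can be checked by hand. The following is a minimal sketch, assuming shares depend on weight alone (capability ceilings and actual demand are ignored); `queue_shares` and the queue names are hypothetical and not part of this skill's scripts.

```bash
# Hypothetical helper: read "name weight" pairs on stdin and print each
# queue's proportional share of the cluster, rounded to a whole percent.
queue_shares() {
  awk '
    { name[NR] = $1; weight[NR] = $2; total += $2 }
    END {
      for (i = 1; i <= NR; i++)
        printf "%s %.0f%%\n", name[i], (total > 0 ? 100 * weight[i] / total : 0)
    }'
}

# Weights 10 and 2, as in the example above.
printf '%s\n' "training 10" "default 2" | queue_shares
# prints:
# training 83%
# default 17%
```

The real `deserved` value reported by the scheduler can differ from this figure, because it also accounts for capability and for what each queue actually requests.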
|
|
59
|
+
|
|
60
|
+
### Queue Status Fields
|
|
61
|
+
|
|
62
|
+
| Field | Meaning |
|
|
63
|
+
|-------|---------|
|
|
64
|
+
| `state` | Queue state: Open, Closed, Closing |
|
|
65
|
+
| `deserved` | Resources the queue should receive based on weight |
|
|
66
|
+
| `allocated` | Resources currently allocated to jobs in this queue |
|
|
67
|
+
| `used` | Resources actually used by running pods (≤ allocated) |
|
|
68
|
+
| `pending` | Number of PodGroups waiting in the queue |
|
|
69
|
+
| `running` | Number of running PodGroups |
|
|
70
|
+
|
|
71
|
+
## Diagnostic Flow
|
|
72
|
+
|
|
73
|
+
### Step 1: List All Queues
|
|
74
|
+
|
|
75
|
+
Get an overview of all queues:
|
|
76
|
+
|
|
77
|
+
```bash
|
|
78
|
+
kubectl get queue
|
|
79
|
+
```
|
|
80
|
+
|
|
81
|
+
**Output columns:**
|
|
82
|
+
- NAME: Queue name
|
|
83
|
+
- WEIGHT: Queue weight (higher = more resources)
|
|
84
|
+
- STATE: Open, Closed, or Closing
|
|
85
|
+
- PARENT: Parent queue (for hierarchical queues)
|
|
86
|
+
|
|
87
|
+
### Step 2: Check Queue Details
|
|
88
|
+
|
|
89
|
+
Get detailed information about a specific queue:
|
|
90
|
+
|
|
91
|
+
```bash
|
|
92
|
+
kubectl get queue <queue-name> -o yaml
|
|
93
|
+
kubectl describe queue <queue-name>
|
|
94
|
+
```
|
|
95
|
+
|
|
96
|
+
**Key sections to examine:**
|
|
97
|
+
|
|
98
|
+
#### Spec (Configuration)
|
|
99
|
+
```yaml
|
|
100
|
+
spec:
|
|
101
|
+
weight: 10 # Relative weight (default: 1)
|
|
102
|
+
capability: # Max resources allowed
|
|
103
|
+
cpu: "100"
|
|
104
|
+
memory: "200Gi"
|
|
105
|
+
reclaimable: true # Allow resource reclamation
|
|
106
|
+
```
|
|
107
|
+
|
|
108
|
+
#### Status (Runtime State)
|
|
109
|
+
```yaml
|
|
110
|
+
status:
|
|
111
|
+
state: Open # Open, Closed, or Closing
|
|
112
|
+
pending: 5 # PodGroups waiting
|
|
113
|
+
running: 10 # Running PodGroups
|
|
114
|
+
deserved: # Resources this queue should get
|
|
115
|
+
cpu: "40"
|
|
116
|
+
memory: "80Gi"
|
|
117
|
+
allocated: # Resources actually allocated
|
|
118
|
+
cpu: "35"
|
|
119
|
+
memory: "70Gi"
|
|
120
|
+
```
|
|
121
|
+
|
|
122
|
+
### Step 3: Check Queue Resource Utilization
|
|
123
|
+
|
|
124
|
+
Calculate utilization ratios:
|
|
125
|
+
|
|
126
|
+
```
|
|
127
|
+
Allocation Ratio = allocated / deserved
|
|
128
|
+
Utilization Ratio = used / allocated
|
|
129
|
+
```
|
|
130
|
+
|
|
131
|
+
**Interpretation:**
|
|
132
|
+
- `allocated >= deserved`: Queue is at or over its fair share
|
|
133
|
+
- `allocated < deserved`: Queue has room to grow
|
|
134
|
+
- `used << allocated`: Jobs have reserved resources but are not using them
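
As a rough illustration of the allocation ratio, the sketch below converts CPU quantities to millicores and reports `allocated / deserved` as a truncated whole percentage. The helper names `cpu_to_millicores` and `allocation_ratio` are hypothetical, and only plain core counts (`35`, `0.5`) and millicore values (`3500m`) are handled.

```bash
# Hypothetical helper: convert a CPU quantity ("35" cores or "3500m") to millicores.
cpu_to_millicores() {
  case "$1" in
    *m) echo "${1%m}" ;;
    *)  awk -v cores="$1" 'BEGIN { printf "%d\n", cores * 1000 }' ;;
  esac
}

# Hypothetical helper: allocated CPU as a percentage of deserved CPU
# (integer division, so the result is truncated, not rounded).
allocation_ratio() {
  alloc=$(cpu_to_millicores "$1")
  deserved=$(cpu_to_millicores "$2")
  if [ "$deserved" -eq 0 ]; then
    echo "n/a"   # a deserved value of 0 gives no meaningful ratio
  else
    echo "$(( alloc * 100 / deserved ))%"
  fi
}

# allocated.cpu=35 and deserved.cpu=40, as in the status example in Step 2.
allocation_ratio 35 40
# prints: 87%
```

A result below 100% corresponds to the `allocated < deserved` case above; 100% or more means the queue is at or over its fair share.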
|
|
135
|
+
|
|
136
|
+
### Step 4: Identify PodGroups in Queue
|
|
137
|
+
|
|
138
|
+
Find workloads associated with a queue:
|
|
139
|
+
|
|
140
|
+
```bash
|
|
141
|
+
# Find all PodGroups in a queue
|
|
142
|
+
kubectl get podgroups --all-namespaces -o json | \
|
|
143
|
+
jq -r '.items[] | select(.spec.queue=="<queue-name>") | "\(.metadata.namespace)/\(.metadata.name)"'
|
|
144
|
+
|
|
145
|
+
# Check pending PodGroups
|
|
146
|
+
kubectl get podgroups --all-namespaces -o json | \
|
|
147
|
+
jq -r '.items[] | select(.spec.queue=="<queue-name>" and .status.phase=="Pending") | \
|
|
148
|
+
"\(.metadata.namespace)/\(.metadata.name): \(.status.phase)"'
|
|
149
|
+
```
|
|
150
|
+
|
|
151
|
+
### Step 5: Check Queue Events
|
|
152
|
+
|
|
153
|
+
Look for queue-related events:
|
|
154
|
+
|
|
155
|
+
```bash
|
|
156
|
+
kubectl get events --all-namespaces --field-selector reason=FailedScheduling | grep -i queue
|
|
157
|
+
```
|
|
158
|
+
|
|
159
|
+
## Common Queue Issues
|
|
160
|
+
|
|
161
|
+
### Issue 1: Queue Resource Exhaustion
|
|
162
|
+
|
|
163
|
+
**Symptom:** `allocated >= deserved`, new PodGroups stay in Pending
|
|
164
|
+
|
|
165
|
+
**Check:**
|
|
166
|
+
```bash
|
|
167
|
+
kubectl get queue <queue> -o jsonpath='{"
|
|
168
|
+
Deserved: "}{.status.deserved}{"
|
|
169
|
+
Allocated: "}{.status.allocated}{"
|
|
170
|
+
Ratio: "}{.status.allocated.cpu}{"/"}{.status.deserved.cpu}{"
|
|
171
|
+
"}'
|
|
172
|
+
```
|
|
173
|
+
|
|
174
|
+
For GPU-specific checks (GPU is often the bottleneck):
|
|
175
|
+
```bash
|
|
176
|
+
kubectl get queue -o custom-columns="NAME:.metadata.name,GPU_CAP:.spec.capability['nvidia\.com/gpu'],GPU_ALLOC:.status.allocated['nvidia\.com/gpu']"
|
|
177
|
+
```
|
|
178
|
+
|
|
179
|
+
**Also cross-validate capability against actual cluster capacity** — a common misconfiguration is setting `spec.capability` higher than the cluster's physical resources:
|
|
180
|
+
```bash
|
|
181
|
+
kubectl get nodes -o custom-columns="NAME:.metadata.name,GPU:.status.allocatable['nvidia\.com/gpu'],CPU:.status.allocatable.cpu,MEM:.status.allocatable.memory"
|
|
182
|
+
```
|
|
183
|
+
If the sum of all nodes' allocatable GPUs is less than the queue's `spec.capability`, the queue can never be fully utilized. When allocation reaches the cluster's physical limit, the queue appears to have remaining capacity but no more resources can actually be scheduled.
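
To compare the cluster total against `spec.capability`, the per-node GPU column has to be summed. The following is a minimal sketch: `sum_column` is a hypothetical helper, and it assumes the node listing is produced with `--no-headers` so that every input line is one node.

```bash
#!/usr/bin/env bash
# Hypothetical helper for illustration only.

# Sum whitespace-separated column N of stdin. Non-numeric cells, such as
# "<none>" for nodes without GPUs, are counted as 0.
sum_column() {
  awk -v col="$1" '$col ~ /^[0-9]+$/ { total += $col } END { print total + 0 }'
}

# Sample input shaped like the node listing (NAME GPU CPU MEM), no header row.
printf 'node-a 8 64 256Gi\nnode-b <none> 32 128Gi\nnode-c 4 64 256Gi\n' | sum_column 2
```

If the printed total is lower than the queue's GPU capability, the capability cannot be reached regardless of weight or reclaim settings.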
|
|
184
|
+
|
|
185
|
+
**Solution:**
|
|
186
|
+
- Increase queue weight (requires scheduler config change)
|
|
187
|
+
- Increase queue capability (only if cluster has physical capacity)
|
|
188
|
+
- Wait for other jobs to complete
|
|
189
|
+
- Check if other queues are over-allocated (reclaim may help)
|
|
190
|
+
|
|
191
|
+
### Issue 2: Queue is Closed
|
|
192
|
+
|
|
193
|
+
**Symptom:** `status.state: Closed`, new PodGroups rejected
|
|
194
|
+
|
|
195
|
+
**Check:**
|
|
196
|
+
```bash
|
|
197
|
+
kubectl get queue <queue> -o jsonpath='{.status.state}'
|
|
198
|
+
```
|
|
199
|
+
|
|
200
|
+
**Solution:**
|
|
201
|
+
- Queue must be reopened by admin
|
|
202
|
+
- Use a different queue
|
|
203
|
+
|
|
204
|
+
### Issue 3: Weight Imbalance
|
|
205
|
+
|
|
206
|
+
**Symptom:** One queue gets all resources, others starve
|
|
207
|
+
|
|
208
|
+
**Check:**
|
|
209
|
+
```bash
|
|
210
|
+
kubectl get queue -o custom-columns='NAME:.metadata.name,WEIGHT:.spec.weight,STATE:.status.state,CPU_DESERVED:.status.deserved.cpu,CPU_ALLOC:.status.allocated.cpu,MEM_DESERVED:.status.deserved.memory,MEM_ALLOC:.status.allocated.memory'
|
|
211
|
+
```
|
|
212
|
+
|
|
213
|
+
**Analysis:** Volcano distributes resources proportionally by weight. For example:
|
|
214
|
+
- Queue A (weight=10) + Queue B (weight=2): A gets 10/12 ≈ 83%, B gets 2/12 ≈ 17% of total cluster resources
|
|
215
|
+
- If Queue B has many pending jobs but low deserved resources, its weight is too low relative to others
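
The proportional split can be checked quickly for any set of weights. This is a sketch of the arithmetic only: `weight_shares` is a hypothetical helper, it assumes integer weights, and it ignores capability caps and actual demand, both of which can change what a queue really receives.

```bash
#!/usr/bin/env bash
# Hypothetical helper for illustration only.

# Print each queue's proportional share, given "name=weight" arguments.
weight_shares() {
  local total=0 pair
  for pair in "$@"; do
    total=$(( total + ${pair#*=} ))
  done
  # Nothing to divide if no weights were given or they sum to zero.
  [ "$total" -gt 0 ] || return 1
  for pair in "$@"; do
    awk -v name="${pair%%=*}" -v w="${pair#*=}" -v t="$total" \
      'BEGIN { printf "%s: %.0f%%\n", name, (w / t) * 100 }'
  done
}

# The example from the analysis: Queue A weight 10, Queue B weight 2.
weight_shares A=10 B=2
```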
|
|
216
|
+
|
|
217
|
+
**Solution:**
|
|
218
|
+
- Adjust queue weights proportionally
|
|
219
|
+
- Check if high-weight queues have capability limits preventing allocation
|
|
220
|
+
|
|
221
|
+
### Issue 4: Resource Reclaim Not Working
|
|
222
|
+
|
|
223
|
+
**Symptom:** Queue is over-allocated but reclaim is not triggered
|
|
224
|
+
|
|
225
|
+
**Check:**
|
|
226
|
+
```bash
|
|
227
|
+
# Check reclaim is enabled in scheduler config
|
|
228
|
+
kubectl get cm volcano-scheduler-configmap -n volcano-system -o yaml | grep reclaim
|
|
229
|
+
```
|
|
230
|
+
|
|
231
|
+
**Reclaim troubleshooting checklist (all must be true):**
|
|
232
|
+
1. `reclaim` action must be in scheduler actions
|
|
233
|
+
2. `proportion` plugin must be enabled
|
|
234
|
+
3. Source queue (the queue requesting resources) must be below its fair share (allocated < deserved)
|
|
235
|
+
4. Target queue (the queue resources are reclaimed from) must be over-allocated (allocated > deserved)
|
|
236
|
+
5. Target queue must have `reclaimable: true`
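
Items 3 to 5 of the checklist reduce to three comparisons once the values have been read from the two queues. The sketch below only illustrates that logic: `reclaim_possible` is a hypothetical function, and it assumes the quantities have already been converted to integers in the same unit (for example, millicores).

```bash
#!/usr/bin/env bash
# Hypothetical helper for illustration only.

# Arguments: allocated and deserved of the queue requesting resources,
# then allocated, deserved and reclaimable flag of the queue to reclaim from.
# Returns 0 only when checklist items 3, 4 and 5 all hold.
reclaim_possible() {
  local src_alloc="$1" src_deserved="$2"
  local tgt_alloc="$3" tgt_deserved="$4" tgt_reclaimable="$5"
  [ "$src_alloc" -lt "$src_deserved" ] || return 1   # item 3
  [ "$tgt_alloc" -gt "$tgt_deserved" ] || return 1   # item 4
  [ "$tgt_reclaimable" = "true" ] || return 1        # item 5
}

if reclaim_possible 10000 40000 50000 40000 true; then
  echo "prerequisites 3-5 satisfied"
else
  echo "at least one of prerequisites 3-5 is not met"
fi
```

Items 1 and 2 still have to be confirmed separately in the scheduler ConfigMap.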
|
|
237
|
+
|
|
238
|
+
Check the reclaimable flag on the specific queue:
|
|
239
|
+
```bash
|
|
240
|
+
kubectl get queue <queue> -o jsonpath='{.spec.reclaimable}'
|
|
241
|
+
```
|
|
242
|
+
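If `reclaimable` is `false`, the queue's resources **cannot be reclaimed** even if it's over-allocated. An unset field prints nothing here; Volcano is expected to treat an unset value as `true`, so confirm the default for your Volcano version before concluding that reclaim is blocked.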
|
|
243
|
+
|
|
244
|
+
**Solution:**
|
|
245
|
+
- Verify all 5 prerequisites above
|
|
246
|
+
- Check scheduler logs for reclaim attempts: use `volcano-scheduler-logs --keyword reclaim`
|
|
247
|
+
|
|
248
|
+
## Queue Hierarchy (Advanced)
|
|
249
|
+
|
|
250
|
+
If using hierarchical queues:
|
|
251
|
+
|
|
252
|
+
```bash
|
|
253
|
+
# Check parent-child relationships
|
|
254
|
+
kubectl get queue -o custom-columns='NAME:.metadata.name,PARENT:.spec.parent,WEIGHT:.spec.weight'
|
|
255
|
+
```
|
|
256
|
+
|
|
257
|
+
**Key points:**
|
|
258
|
+
- Child queues share parent's deserved resources
|
|
259
|
+
- Weight is relative to siblings, not absolute
|
|
260
|
+
- Parent queue's deserved = sum of children's usage
|
|
261
|
+
|
|
262
|
+
## Script Output Interpretation
|
|
263
|
+
|
|
264
|
+
The `diagnose-queue.sh` script provides:
|
|
265
|
+
|
|
266
|
+
1. **Queue Summary Table**
|
|
267
|
+
- Name, State, Weight
|
|
268
|
+
- Pending/Running counts
|
|
269
|
+
- Resource allocation summary
|
|
270
|
+
|
|
271
|
+
2. **Resource Breakdown (with --verbose)**
|
|
272
|
+
- CPU: deserved, allocated, usage ratio
|
|
273
|
+
- Memory: deserved, allocated, usage ratio
|
|
274
|
+
- GPU: if available
|
|
275
|
+
|
|
276
|
+
3. **Warning Flags**
|
|
277
|
+
- `[OVER]` - Queue allocated > deserved
|
|
278
|
+
- `[FULL]` - Queue at capacity
|
|
279
|
+
- `[CLOSED]` - Queue not accepting new jobs
|
|
280
|
+
- `[HIGH_PEND]` - Many pending PodGroups
|
|
281
|
+
|
|
282
|
+
## Environment Variables
|
|
283
|
+
|
|
284
|
+
| Variable | Default | Description |
|
|
285
|
+
|----------|---------|-------------|
|
|
286
|
+
| `VOLCANO_NAMESPACE` | `default` | Default namespace for pod lookup |
|
|
287
|
+
|
|
288
|
+
## See Also
|
|
289
|
+
|
|
290
|
+
- `volcano-diagnose-pod` - Diagnose individual pod scheduling
|
|
291
|
+
- `volcano-gang-scheduling` - Gang constraint issues
|
|
292
|
+
- `volcano-resource-insufficient` - Resource shortage diagnosis
|
|
293
|
+
- `volcano-scheduler-logs` - Check scheduler decisions
|
|
294
|
+
- `quota-debug` - Native Kubernetes ResourceQuota/LimitRange diagnosis (non-Volcano)
|