speek-0.0.2.tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- speek-0.0.2/LICENSE +19 -0
- speek-0.0.2/PKG-INFO +122 -0
- speek-0.0.2/README.md +79 -0
- speek-0.0.2/pyproject.toml +40 -0
- speek-0.0.2/setup.cfg +4 -0
- speek-0.0.2/speek/check_slurm_resource.py +328 -0
- speek-0.0.2/speek.egg-info/PKG-INFO +122 -0
- speek-0.0.2/speek.egg-info/SOURCES.txt +10 -0
- speek-0.0.2/speek.egg-info/dependency_links.txt +1 -0
- speek-0.0.2/speek.egg-info/entry_points.txt +2 -0
- speek-0.0.2/speek.egg-info/requires.txt +1 -0
- speek-0.0.2/speek.egg-info/top_level.txt +1 -0
speek-0.0.2/LICENSE
ADDED
@@ -0,0 +1,19 @@
Copyright (c) 2024 Dongyeop Lee

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
speek-0.0.2/PKG-INFO
ADDED
@@ -0,0 +1,122 @@
Metadata-Version: 2.4
Name: speek
Version: 0.0.2
Summary: Peek into slurm's resource info such as GPU availability, usage per user, job status, and more.
Author-email: Dongyeop Lee <dylee23@postech.ac.kr>
License: Copyright (c) 2024 Dongyeop Lee

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
Project-URL: Repository, https://github.com/edong6768/speek.git
Keywords: slurm
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: System Administrators
Classifier: License :: OSI Approved :: MIT License
Classifier: Topic :: System :: Monitoring
Classifier: Topic :: Utilities
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: rich
Dynamic: license-file

# speek

**speek** is a suite of SLURM cluster monitoring tools – from quick one-shot snapshots to a full interactive TUI.

## Installation

```sh
pip install speek
```

For the latest development version:
```sh
pip install --pre speek
```

## Commands

| Command | Description |
|---------|-------------|
| `speek0` | Classic one-shot cluster overview – GPU availability, per-user usage, job status |
| `speek-` | Compact snapshot – per-model GPU bars, trends, pending pressure |
| `speek+` | Full interactive TUI – queue, nodes, users, stats, events, shell |

## speek0 – Classic Overview

```sh
speek0 [-u USER] [-f FILE] [-t T_AVAIL]
```

| Option | Description |
|--------|-------------|
| `-u USER` | Highlight a specific user (default: self) |
| `-f FILE` | User info CSV file |
| `-t T_AVAIL` | Time window for upcoming release, e.g. `5 m`, `1 h` |

Shows a table of GPU usage per partition, ranked users with `🥇🥈🥉`, utilization-colored counts, and your current jobs.

## speek- – Compact Snapshot

```sh
speek- [-u USER]
```

Per-GPU-model view with utilization bars, free/total counts, pending pressure (`⏸N`), availability trends (`↑↓`), and your running/pending jobs. Detects down nodes and shows them as DEAD.

## speek+ – Interactive TUI

```sh
speek+
```

Full-featured Textual TUI with:

- **Cluster** – speek0-style usage table (tab 1)
- **Queue** – all cluster jobs grouped by partition, foldable
- **Nodes** – per-partition node status with usage bars
- **Users** – per-user GPU usage, fairshare, per-partition breakdown
- **Stats** – GPU usage charts, per-user stacked view, issue dashboard
- **Logs** – session CLI output (not persisted)
- **Settings** – theme, refresh rates, cache management, log scanning
- **Info** – cluster probe results, scheduling factors, error detection rules
- **Help** – keybindings reference

### Features

- 70+ color themes (base16 standard)
- OOM and error detection (11 error types) with log scanning
- Job detail popup with stdout, stderr, GPU stats, analysis
- Built-in shell with tab completion, history, sbatch suggestions
- Per-job log hints in the table
- Event notifications with read/unread tracking
- Down node detection with DEAD indicators

## Requirements

- Python 3.8+
- SLURM cluster with `squeue`, `scontrol`, `sinfo`
- Optional: `sacct`, `sprio`, `sshare`, `sreport`, `scancel` for full features
- `rich` (all commands), `textual>=0.50.0` (speek+ only)

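speek builds its tables from `scontrol show` output, which arrives as a stream of `Key=Value` tokens. The sketch below is a minimal, hypothetical illustration of parsing that format — the sample record and its field values are made up for the example and are not taken from speek or a real cluster:

```python
def parse_scontrol_record(record: str) -> dict:
    """Split one scontrol record of space-separated Key=Value tokens into a dict."""
    fields = {}
    for token in record.split():
        if '=' not in token:
            fields[token] = None  # bare flags carry no value
            continue
        key, value = token.split('=', 1)  # values may themselves contain '='
        fields[key] = value
    return fields

# Illustrative record; the values are invented for this example.
record = "PartitionName=gpu State=UP TotalNodes=4 TRES=cpu=256,mem=2000G,gres/gpu=32"
info = parse_scontrol_record(record)
```

Note the `maxsplit=1` in `token.split('=', 1)`: nested assignments such as `TRES=cpu=256,...` keep their right-hand side intact for later, more specific parsing.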
speek-0.0.2/README.md
ADDED
@@ -0,0 +1,79 @@
# speek

**speek** is a suite of SLURM cluster monitoring tools – from quick one-shot snapshots to a full interactive TUI.

## Installation

```sh
pip install speek
```

For the latest development version:
```sh
pip install --pre speek
```

## Commands

| Command | Description |
|---------|-------------|
| `speek0` | Classic one-shot cluster overview – GPU availability, per-user usage, job status |
| `speek-` | Compact snapshot – per-model GPU bars, trends, pending pressure |
| `speek+` | Full interactive TUI – queue, nodes, users, stats, events, shell |

## speek0 – Classic Overview

```sh
speek0 [-u USER] [-f FILE] [-t T_AVAIL]
```

| Option | Description |
|--------|-------------|
| `-u USER` | Highlight a specific user (default: self) |
| `-f FILE` | User info CSV file |
| `-t T_AVAIL` | Time window for upcoming release, e.g. `5 m`, `1 h` |

Shows a table of GPU usage per partition, ranked users with `🥇🥈🥉`, utilization-colored counts, and your current jobs.

## speek- – Compact Snapshot

```sh
speek- [-u USER]
```

Per-GPU-model view with utilization bars, free/total counts, pending pressure (`⏸N`), availability trends (`↑↓`), and your running/pending jobs. Detects down nodes and shows them as DEAD.

## speek+ – Interactive TUI

```sh
speek+
```

Full-featured Textual TUI with:

- **Cluster** – speek0-style usage table (tab 1)
- **Queue** – all cluster jobs grouped by partition, foldable
- **Nodes** – per-partition node status with usage bars
- **Users** – per-user GPU usage, fairshare, per-partition breakdown
- **Stats** – GPU usage charts, per-user stacked view, issue dashboard
- **Logs** – session CLI output (not persisted)
- **Settings** – theme, refresh rates, cache management, log scanning
- **Info** – cluster probe results, scheduling factors, error detection rules
- **Help** – keybindings reference

### Features

- 70+ color themes (base16 standard)
- OOM and error detection (11 error types) with log scanning
- Job detail popup with stdout, stderr, GPU stats, analysis
- Built-in shell with tab completion, history, sbatch suggestions
- Per-job log hints in the table
- Event notifications with read/unread tracking
- Down node detection with DEAD indicators

## Requirements

- Python 3.8+
- SLURM cluster with `squeue`, `scontrol`, `sinfo`
- Optional: `sacct`, `sprio`, `sshare`, `sreport`, `scancel` for full features
- `rich` (all commands), `textual>=0.50.0` (speek+ only)

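The `-t T_AVAIL` option documented above takes a width plus a unit letter (`m`, `h`, `d`). A small sketch, assuming only that documented format (not speek's actual implementation), of converting such a spec into a `timedelta`:

```python
from datetime import timedelta

# Unit letters as documented for -t: m=minutes, h=hours, d=days.
UNITS = {'m': 'minutes', 'h': 'hours', 'd': 'days'}

def parse_window(spec: str) -> timedelta:
    """Turn a window spec such as '5 m' or '1 h' into a timedelta."""
    width, unit = spec.split()
    return timedelta(**{UNITS[unit]: int(width)})
```

For example, `parse_window('5 m')` yields a five-minute window.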
speek-0.0.2/pyproject.toml
ADDED
@@ -0,0 +1,40 @@
[build-system]
requires = ["setuptools"]
build-backend = "setuptools.build_meta"

[tool.setuptools]
packages = ["speek"]

[project]
name = "speek"
version = "0.0.2"
description = "Peek into slurm's resource info such as GPU availability, usage per user, job status, and more."
readme = "README.md"
requires-python = ">=3.8"
license = {file = "LICENSE"}
keywords = ["slurm"]
authors = [
  {name = "Dongyeop Lee", email = "dylee23@postech.ac.kr"},
]
classifiers = [
  "Intended Audience :: Science/Research",
  "Intended Audience :: System Administrators",
  "License :: OSI Approved :: MIT License",
  "Topic :: System :: Monitoring",
  "Topic :: Utilities",
  "Programming Language :: Python :: 3",
  "Programming Language :: Python :: 3.8",
  "Programming Language :: Python :: 3.9",
  "Programming Language :: Python :: 3.10",
  "Programming Language :: Python :: 3.11",
  "Programming Language :: Python :: 3.12",
]
dependencies = [
  "rich",
]

[project.scripts]
speek = "speek.check_slurm_resource:main"

[project.urls]
Repository = "https://github.com/edong6768/speek.git"

speek-0.0.2/speek/check_slurm_resource.py
ADDED
@@ -0,0 +1,328 @@
import subprocess
import time
from glob import glob
import csv
import re

import argparse
from datetime import datetime, timedelta

from rich import print
from rich.table import Table
from rich.align import Align
from rich.live import Live
from rich.console import Group

parser = argparse.ArgumentParser(description="Peek into slurm resource info.")
parser.add_argument('-u', '--user', default=None, type=str, help='Specify highlighted user.')

parser.add_argument('-l', '--live', action='store_true', help='Live display of speek, refreshed every second.')
parser.set_defaults(live=False)

parser.add_argument('-f', '--file', default='auto', type=str, help='Specify file for user info.')
parser.add_argument('-t', '--t_avail', default='5 m', type=str, help='Time window width for upcoming release in {m: minutes, h: hours, d: days}. (default: 5 m)')
args = parser.parse_args()


def get_scontrol_dict(unit):
    """Parse `scontrol show <unit>` output into a {name: {field: value}} dict."""
    assert unit in ['Job', 'Partition', 'Node']

    scontrol_str = subprocess.check_output(['scontrol', 'show', unit]).decode('utf-8').replace(' ', '\n')

    scontrols = {}
    delimiter = f'{unit}Name=' if unit != 'Job' else 'JobId='
    for scontrol in scontrol_str.split(delimiter):
        if not scontrol: continue
        n, *infos = [i for i in scontrol.split('\n') if i]
        if unit == 'Job': n = int(n) if n != 'No' else 0

        scontrols[n] = {}
        for info in infos:
            if '=' not in info:
                scontrols[n][info] = None
                continue
            k, v = info.split('=', 1)
            if ',' not in v or '[' in v:
                scontrols[n][k] = v
            elif '=' in v:
                scontrols[n][k] = dict([i.split('=') for i in v.split(',')])
            else:
                scontrols[n][k] = tuple(v.split(','))
    return scontrols


def td_parse(s):
    """Parse a slurm time string ('D-HH:MM:SS' or 'HH:MM:SS') into a timedelta."""
    days, hms = s.split('-') if '-' in s else (0, s)
    dt = datetime.strptime(hms, '%H:%M:%S')
    return timedelta(days=int(days), hours=dt.hour, minutes=dt.minute, seconds=dt.second)


def consecutor(lst):
    """Render a list of numbers as collapsed ranges, e.g. [1, 2, 3, 7] -> '{1..3} 7'."""
    assert all([isinstance(i, (int, float)) for i in lst]), 'List should be all numbers.'
    lst.sort()
    if len(lst) == 0: return ''
    pi, *ll = lst
    cl = [[pi]]
    for i in ll:
        if i - pi > 1: cl.append([i])
        else: cl[-1].append(i)
        pi = i
    l_str = ' '.join([f'{{{c[0]}..{c[-1]}}}' if len(c) > 1 else f'{c[0]}' for c in cl])
    return l_str


def get_slurm_resource():
    ##############################################
    #                get user info               #
    ##############################################

    # who am I
    me = args.user
    if me is None:
        me = subprocess.check_output(['whoami']).decode('utf-8').strip()

    # who are they
    paths = glob(args.file)

    if paths:
        with open(paths[0], 'r', newline='', encoding='utf-8') as f:
            reader = csv.reader(f)
            header, *users = list(reader)

        user_info = [dict(zip(header, user)) for user in users]
        user_lookup = {}
        for user in user_info:
            if not user['name']: continue
            user_lookup[user['user']] = f"{user['name']} ({user['affiliation'].split('-')[0][:2]} {user['title']}, {user['user']})"
    else:
        user_lookup = {}


    ##############################################
    #               get gpu status               #
    ##############################################

    partitions, jobs = map(get_scontrol_dict, ('Partition', 'Job'))

    gres_names = ['GRES/gpu', 'gres/gpu']

    for gres in gres_names:
        if gres in partitions[[*partitions.keys()][0]]['TRESBillingWeights']:
            break

    # partitions = {k: v for k, v in partitions.items() if 'cpu' not in k}

    status = {'PENDING', 'RUNNING'}
    resource = {'Available', 'Total', 'Usage', 'max_user'}
    release = {'Time left', 'count', 'user'}

    NewState = lambda fields: {k: 0 for k in fields}

    user_status, gpu_resource = {}, NewState(resource)
    user_job_status = {}

    current_time = datetime.now()

    td_str = {'m': 'minutes', 'h': 'hours', 'd': 'days'}
    t_width, t_unit = args.t_avail.split()
    tw = timedelta(**{td_str[t_unit]: int(t_width)})

    if jobs:
        for id, job in jobs.items():
            j_status = job.get('JobState', None)

            if j_status in status:
                job_name = job['JobName']
                user, gpu = job['UserId'].split('(')[0].strip(), job['Partition']
                gpu_count = int(re.split(':|=', job.get('TresPerNode', 'gres:gpu:0'))[-1])

                if isinstance(gpu, tuple):
                    gpu = tuple(sorted(gpu, key=lambda x: float(partitions[x]['TRESBillingWeights'][gres]), reverse=True))
                    gpu_one = gpu[0]
                else:
                    gpu_one = gpu

                if partitions[gpu_one]['TRESBillingWeights'][gres] == '0': continue

                # user status
                u_stat = user_status.get(user, NewState(status))
                u_stat[gpu_one] = u_stat.get(gpu_one, NewState(status))

                u_stat[j_status] += gpu_count
                u_stat[gpu_one][j_status] += gpu_count

                user_status[user] = u_stat

                uj_stat = user_job_status.get(user, {})
                uj_stat[job_name] = uj_stat.get(job_name, {})

                uj_stat[job_name][gpu] = uj_stat[job_name].get(gpu, {s: [] for s in status})
                uj_stat[job_name][gpu][j_status].append((id, gpu_count))

                user_job_status[user] = uj_stat

                # gpu status
                gpu_resource[gpu] = gpu_resource.get(gpu, NewState(resource))

                if j_status == 'RUNNING':
                    gpu_resource['Available'] -= gpu_count
                    gpu_resource[gpu]['Available'] -= gpu_count

                    time_left = {'td': td_parse(job['TimeLimit']) - td_parse(job['RunTime']),
                                 'count': gpu_count, 'user': user}

                    up_re = gpu_resource[gpu].get('Upcoming release', [time_left, [time_left]])
                    up_re[0] = min(time_left, up_re[0], key=lambda x: x['td'])

                    up_re[1].append(time_left)
                    up_re[1] = [t for t in up_re[1] if t['td'] - up_re[0]['td'] < tw]

                    up_re[0]['total_count'] = sum([t['count'] for t in up_re[1]])
                    td = up_re[0]['td']
                    up_re[0]['str'] = (f'{td.days}-' if td.days else '') + f"{str(td).split(', ')[-1][:-3]} ({up_re[0]['total_count']})"

                    gpu_resource[gpu]['Upcoming release'] = up_re


    for gpu, info in partitions.items():
        if info['TRESBillingWeights'][gres] == '0': continue
        count = int(info['TRES']['gres/gpu'])

        gpu_resource[gpu] = gpu_resource.get(gpu, NewState(resource))

        for s in ['Available', 'Total']:
            gpu_resource[s] += count
            gpu_resource[gpu][s] += count

        gpu_resource['Usage'] = f"{(gpu_resource['Total'] - gpu_resource['Available'])/gpu_resource['Total']*100:.2f}%"
        gpu_resource[gpu]['Usage'] = f"{(gpu_resource[gpu]['Total'] - gpu_resource[gpu]['Available'])/gpu_resource[gpu]['Total']*100:.2f}%"

        for s in status:
            max_user = max(user_status.items(), key=lambda x: x[1].get(gpu, NewState(status))[s]) if user_status else (None, NewState(status))
            gpu_resource[gpu][f'max_{s}_user'] = max_user[0] if max_user[1].get(gpu, NewState(status))[s] else None

    ####################################################
    #                 print usage table                #
    ####################################################

    tables = []

    ranking = {0: '🥇', 1: '🥈', 2: '🥉'}
    get_state = lambda p: ('⚠️ ' if p == 100 else '🔥' if p > 90 else '🏝️ ' if p == 0 else '❄️ ' if p < 10 else '')
    pareto = '💩'
    king = {'RUNNING': '👑', 'PENDING': '⏳'}

    table1 = Table(title="Cluster Usage")

    # add columns
    partitions_list = [p for p in sorted({*partitions.keys()} - resource) if partitions[p]['TRESBillingWeights'][gres] != '0']
    partitions_list = sorted(partitions_list, key=lambda x: gpu_resource[x]['Total']*float(partitions[x]['TRESBillingWeights'][gres]), reverse=True)
    table1.add_column("User")
    for i, p in enumerate(partitions_list):
        table1.add_column(get_state(float(gpu_resource[p]['Usage'][:-1])) + p, justify="right")
    table1.add_column("Total", justify="right")


    # add rows
    for f in ['Available', 'Total', 'Usage']:
        table1.add_row(f, *[str(gpu_resource[p][f]) for p in partitions_list], str(gpu_resource[f]))
    table1.add_row(f'Until release (~{t_width}{td_str[t_unit][0]})', *[gpu_resource[p].get('Upcoming release', [{}])[0].get('str', '') for p in partitions_list], '', end_section=True)

    user_status_sorted = sorted(user_status.items(), key=lambda x: (x[1]['RUNNING'], x[1]['PENDING']), reverse=True)
    agg_running = 0
    for i, (user, info) in enumerate(user_status_sorted):
        all_running = v if agg_running < (v := gpu_resource['Total'] - gpu_resource['Available'])*0.8 else float('inf')
        agg_running += info['RUNNING']

        style = "on bright_black" if i % 2 else ""
        if user == me:
            style = "black on bright_green"

        me_section = (me in {user, user_status_sorted[min(i+1, len(user_status_sorted)-1)][0]})

        rank = ranking.get(i, i+1 if agg_running < all_running*0.8 else pareto)

        user_true = user_lookup.get(user, user)
        state_str = lambda state: (f"{v}" if (v := state['RUNNING']) else '') + (f"({v})" if (v := state['PENDING']) else '')
        king_str = lambda p: ''.join([king[s] for s in sorted(status) if user == gpu_resource[p][f'max_{s}_user']])
        table1.add_row(f'{rank:>2}. {user_true}', *[king_str(p) + state_str(info.get(p, NewState(status))) for p in partitions_list], state_str(info), style=style, end_section=me_section)

    tables.append(' \n')
    tables.append(Align(table1, align='center'))


    ##################################################
    #                 print job table                #
    ##################################################

    jobs = user_job_status.get(me, {})

    if jobs:
        table2 = Table(title=f"{user_lookup.get(me, me)}'s Job Status")

        for c in ['Status', 'Job', 'GPU', '#', 'ids']:
            table2.add_column(c)
        for s in sorted(status, reverse=True):
            jobs_f = {k: {jn: j for jn, j in v.items() if j[s]} for k, v in jobs.items() if any(j[s] for j in v.values())}
            for i, (job_name, job) in enumerate(jobs_f.items()):
                def keykey(gpu):
                    if isinstance(gpu, tuple):
                        gpu = sorted(gpu, key=lambda x: float(partitions[x]['TRESBillingWeights'][gres]))[-1]
                    return gpu_resource[gpu]['Total']*float(partitions[gpu]['TRESBillingWeights'][gres])
                job_sorted = sorted(job.keys(), key=keykey, reverse=True)
                for j, gpu in enumerate(job_sorted):
                    ids = job[gpu][s]
                    if isinstance(gpu, tuple):
                        gpu = '{' + ',\n '.join(gpu) + '}'
                    table2.add_row(s if i+j == 0 else '', job_name if j == 0 else '', gpu, str(len(ids)), consecutor([id for id, _ in ids]), end_section=((i == len(jobs_f)-1) and (j == len(job_sorted)-1)))

        tables.append(' \n ')
        tables.append(Align(table2, align='center'))
        tables.append(' \n ')

    return Group(*tables)

def main():
    if args.live:
        with Live(get_slurm_resource(), refresh_per_second=1) as live:
            while True:
                time.sleep(1)  # refresh once per second instead of busy-looping scontrol
                live.update(get_slurm_resource())
    else:
        print(get_slurm_resource())

if __name__ == '__main__':
    main()

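The `consecutor` helper in the file above collapses a user's job IDs into range notation for the `ids` column. The same idea as a standalone sketch, written fresh for illustration rather than copied from the package:

```python
def compress_ids(ids):
    """Collapse integers into range notation, e.g. [1, 2, 3, 7] -> '{1..3} 7'."""
    ids = sorted(ids)
    if not ids:
        return ''
    runs = [[ids[0]]]
    for prev, cur in zip(ids, ids[1:]):
        if cur - prev > 1:
            runs.append([cur])    # a gap starts a new run
        else:
            runs[-1].append(cur)  # consecutive (or repeated) ids extend the run
    return ' '.join(f'{{{r[0]}..{r[-1]}}}' if len(r) > 1 else f'{r[0]}' for r in runs)
```

So a job array with IDs 1, 2, 3, 7, 9, 10 renders as `{1..3} 7 {9..10}`, keeping the table column compact even for large arrays.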
speek-0.0.2/speek.egg-info/PKG-INFO
ADDED
@@ -0,0 +1,122 @@
Metadata-Version: 2.4
Name: speek
Version: 0.0.2
Summary: Peek into slurm's resource info such as GPU availability, usage per user, job status, and more.
Author-email: Dongyeop Lee <dylee23@postech.ac.kr>
License: Copyright (c) 2024 Dongyeop Lee

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
Project-URL: Repository, https://github.com/edong6768/speek.git
Keywords: slurm
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: System Administrators
Classifier: License :: OSI Approved :: MIT License
Classifier: Topic :: System :: Monitoring
Classifier: Topic :: Utilities
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: rich
Dynamic: license-file

# speek

**speek** is a suite of SLURM cluster monitoring tools – from quick one-shot snapshots to a full interactive TUI.

## Installation

```sh
pip install speek
```

For the latest development version:
```sh
pip install --pre speek
```

## Commands

| Command | Description |
|---------|-------------|
| `speek0` | Classic one-shot cluster overview – GPU availability, per-user usage, job status |
| `speek-` | Compact snapshot – per-model GPU bars, trends, pending pressure |
| `speek+` | Full interactive TUI – queue, nodes, users, stats, events, shell |

## speek0 – Classic Overview

```sh
speek0 [-u USER] [-f FILE] [-t T_AVAIL]
```

| Option | Description |
|--------|-------------|
| `-u USER` | Highlight a specific user (default: self) |
| `-f FILE` | User info CSV file |
| `-t T_AVAIL` | Time window for upcoming release, e.g. `5 m`, `1 h` |

Shows a table of GPU usage per partition, ranked users with `🥇🥈🥉`, utilization-colored counts, and your current jobs.

## speek- – Compact Snapshot

```sh
speek- [-u USER]
```

Per-GPU-model view with utilization bars, free/total counts, pending pressure (`⏸N`), availability trends (`↑↓`), and your running/pending jobs. Detects down nodes and shows them as DEAD.

## speek+ – Interactive TUI

```sh
speek+
```

Full-featured Textual TUI with:

- **Cluster** – speek0-style usage table (tab 1)
- **Queue** – all cluster jobs grouped by partition, foldable
- **Nodes** – per-partition node status with usage bars
- **Users** – per-user GPU usage, fairshare, per-partition breakdown
- **Stats** – GPU usage charts, per-user stacked view, issue dashboard
- **Logs** – session CLI output (not persisted)
- **Settings** – theme, refresh rates, cache management, log scanning
- **Info** – cluster probe results, scheduling factors, error detection rules
- **Help** – keybindings reference

### Features

- 70+ color themes (base16 standard)
- OOM and error detection (11 error types) with log scanning
- Job detail popup with stdout, stderr, GPU stats, analysis
- Built-in shell with tab completion, history, sbatch suggestions
- Per-job log hints in the table
- Event notifications with read/unread tracking
- Down node detection with DEAD indicators

## Requirements

- Python 3.8+
- SLURM cluster with `squeue`, `scontrol`, `sinfo`
- Optional: `sacct`, `sprio`, `sshare`, `sreport`, `scancel` for full features
- `rich` (all commands), `textual>=0.50.0` (speek+ only)

speek-0.0.2/speek.egg-info/dependency_links.txt
ADDED
@@ -0,0 +1 @@


speek-0.0.2/speek.egg-info/requires.txt
ADDED
@@ -0,0 +1 @@
rich

speek-0.0.2/speek.egg-info/top_level.txt
ADDED
@@ -0,0 +1 @@
speek