aluminatiai 0.2.0__tar.gz
- aluminatiai-0.2.0/GETTING_STARTED.md +284 -0
- aluminatiai-0.2.0/PKG-INFO +393 -0
- aluminatiai-0.2.0/README.md +365 -0
- aluminatiai-0.2.0/__init__.py +2 -0
- aluminatiai-0.2.0/agent.py +607 -0
- aluminatiai-0.2.0/aluminati_agent.py +818 -0
- aluminatiai-0.2.0/aluminatiai.egg-info/PKG-INFO +393 -0
- aluminatiai-0.2.0/aluminatiai.egg-info/SOURCES.txt +56 -0
- aluminatiai-0.2.0/aluminatiai.egg-info/dependency_links.txt +1 -0
- aluminatiai-0.2.0/aluminatiai.egg-info/entry_points.txt +2 -0
- aluminatiai-0.2.0/aluminatiai.egg-info/requires.txt +7 -0
- aluminatiai-0.2.0/aluminatiai.egg-info/top_level.txt +1 -0
- aluminatiai-0.2.0/attribution/__init__.py +10 -0
- aluminatiai-0.2.0/attribution/engine.py +138 -0
- aluminatiai-0.2.0/attribution/pid_resolver.py +97 -0
- aluminatiai-0.2.0/attribution/process_probe.py +112 -0
- aluminatiai-0.2.0/benchmark_m5.py +850 -0
- aluminatiai-0.2.0/cli.py +27 -0
- aluminatiai-0.2.0/collector.py +342 -0
- aluminatiai-0.2.0/config.py +70 -0
- aluminatiai-0.2.0/efficiency/__init__.py +35 -0
- aluminatiai-0.2.0/efficiency/curve_builder.py +334 -0
- aluminatiai-0.2.0/efficiency/gpu_specs.py +202 -0
- aluminatiai-0.2.0/efficiency/hardware_match.py +398 -0
- aluminatiai-0.2.0/efficiency/profiler.py +1112 -0
- aluminatiai-0.2.0/gpu_collector.py +0 -0
- aluminatiai-0.2.0/integrations/__init__.py +10 -0
- aluminatiai-0.2.0/integrations/mlflow_callback.py +151 -0
- aluminatiai-0.2.0/integrations/otel_exporter.py +139 -0
- aluminatiai-0.2.0/integrations/wandb_callback.py +137 -0
- aluminatiai-0.2.0/main.py +438 -0
- aluminatiai-0.2.0/metrics_server.py +136 -0
- aluminatiai-0.2.0/pyproject.toml +62 -0
- aluminatiai-0.2.0/requirements.txt +15 -0
- aluminatiai-0.2.0/schedulers/__init__.py +16 -0
- aluminatiai-0.2.0/schedulers/base.py +98 -0
- aluminatiai-0.2.0/schedulers/detect.py +87 -0
- aluminatiai-0.2.0/schedulers/kubernetes.py +240 -0
- aluminatiai-0.2.0/schedulers/runai.py +222 -0
- aluminatiai-0.2.0/schedulers/slurm.py +353 -0
- aluminatiai-0.2.0/setup.cfg +4 -0
- aluminatiai-0.2.0/tests/test_attribution.py +194 -0
- aluminatiai-0.2.0/tests/test_collector.py +170 -0
- aluminatiai-0.2.0/uploader.py +250 -0
@@ -0,0 +1,284 @@
# Getting Started with GPU Energy Agent

## Prerequisites

1. **NVIDIA GPU** with driver 450.80.02 or newer
2. **Python 3.8+**
3. **Linux OS** (Ubuntu, CentOS, etc.)

## Quick Start

### 1. Install Dependencies

```bash
cd agent
pip install -r requirements.txt
```

Expected output:
```
Successfully installed nvidia-ml-py3-7.352.0 psutil-5.9.6 rich-13.7.0 ...
```

### 2. Test GPU Detection

Run a quick test to verify the collector can access your GPUs:

```bash
python collector.py
```

Expected output:
```
Testing GPU Collector...
Found 4 GPUs:
GPU 0: NVIDIA A100-SXM4-40GB (GPU-abc123...)
GPU 1: NVIDIA A100-SXM4-40GB (GPU-def456...)
...

Collecting 3 samples (2s intervals)...
GPU 0: 287.4W, 98% util, 76°C, 1437.0J
...

✅ Collector test passed!
```

### 3. Run the Agent

Start monitoring all GPUs:

```bash
python main.py
```

You should see real-time metrics updating every 5 seconds:

```
GPU Energy Agent v0.1.0
Monitoring 4 GPUs
⏱️ Sampling interval: 5.0s

┏━━━━━━┳━━━━━━━━┳━━━━━━┳━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┓
┃ GPU  ┃ Power  ┃ Util ┃ Temp ┃ Energy Δ┃ Total kWh┃
┡━━━━━━╇━━━━━━━━╇━━━━━━╇━━━━━━╇━━━━━━━━━╇━━━━━━━━━━┩
│ GPU 0│ 287.4W │ 98%  │ 76°C │ 1437J   │ 0.0012   │
│ GPU 1│ 289.1W │ 97%  │ 77°C │ 1446J   │ 0.0012   │
...
```

Press `Ctrl+C` to stop.
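
The per-sample energy and `Total kWh` columns can be related to the sampled power with a little arithmetic. The following is a minimal sketch, assuming energy per sample is approximated as power multiplied by the sampling interval; the agent's actual integration method is not shown in this guide, and the function names are illustrative rather than part of the agent.

```python
# Illustrative helpers; the names are assumptions, not the agent's API.
JOULES_PER_KWH = 3_600_000  # 1 kWh = 3.6e6 J


def energy_delta_j(power_w: float, interval_s: float) -> float:
    """Approximate the energy used over one sampling interval (rectangle rule)."""
    return power_w * interval_s


def total_kwh(power_samples_w, interval_s: float) -> float:
    """Sum the per-sample energy and convert joules to kilowatt-hours."""
    total_j = sum(energy_delta_j(p, interval_s) for p in power_samples_w)
    return total_j / JOULES_PER_KWH


if __name__ == "__main__":
    # One 5 s sample at 287.4 W is about 1437 J.
    print(round(energy_delta_j(287.4, 5.0)))
    # Three such samples add up to roughly 0.0012 kWh.
    print(round(total_kwh([287.4] * 3, 5.0), 4))
```

Shorter intervals make the rectangle approximation track fast power swings more closely, which is the trade-off behind the `--interval` option.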

### 4. Export to CSV

Run for 5 minutes and save metrics to CSV:

```bash
python main.py --duration 300 --output data/test_run.csv
```

This creates `data/test_run.csv` with timestamped metrics for analysis.
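
One way to analyze the export is to total the energy per GPU. This is a sketch only: the CSV column names `gpu_index` and `energy_delta_j` are assumptions borrowed from the field names used elsewhere in the project, so check the header row of your own file before relying on them.

```python
import csv
import io
from collections import defaultdict

JOULES_PER_KWH = 3_600_000


def energy_kwh_by_gpu(csv_file) -> dict:
    """Sum the assumed `energy_delta_j` column per `gpu_index`, in kWh."""
    totals_j = defaultdict(float)
    for row in csv.DictReader(csv_file):
        totals_j[row["gpu_index"]] += float(row["energy_delta_j"])
    return {gpu: joules / JOULES_PER_KWH for gpu, joules in totals_j.items()}


if __name__ == "__main__":
    # Tiny in-memory stand-in for data/test_run.csv (hypothetical layout).
    sample = io.StringIO(
        "timestamp,gpu_index,power_draw_w,energy_delta_j\n"
        "2026-02-06T12:00:00Z,0,287.4,1437.0\n"
        "2026-02-06T12:00:05Z,0,289.0,1445.0\n"
        "2026-02-06T12:00:00Z,1,100.0,500.0\n"
    )
    for gpu, kwh in sorted(energy_kwh_by_gpu(sample).items()):
        print(f"GPU {gpu}: {kwh:.6f} kWh")
```

For a real run, replace the in-memory sample with `open("data/test_run.csv", newline="")`.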

### 5. Run Tests

#### Test 1: Unit Tests

```bash
python tests/test_collector.py
```

Expected: All tests pass ✓

#### Test 2: Overhead Benchmark

Measure agent CPU/memory impact:

```bash
python tests/benchmark_overhead.py
```

Expected results:
- CPU overhead: <0.1%
- Memory overhead: <50 MB
- Collection latency: <1ms per GPU

#### Test 3: Energy Validation

Validate energy calculations:

```bash
python tests/validate_energy.py --duration 60
```

Expected: Energy error <5% ✓

## Common Issues

### Issue: "No NVIDIA GPUs found"

**Cause**: NVIDIA driver not installed or not working

**Fix**:
```bash
# Check driver
nvidia-smi

# If not working, install driver:
# Ubuntu:
sudo apt install nvidia-driver-535

# Then reboot
sudo reboot
```

### Issue: "Failed to initialize NVML"

**Cause**: Permission issue or driver mismatch

**Fix**:
```bash
# Check driver version
nvidia-smi

# Ensure you're in the right groups
groups

# Add to video group if needed
sudo usermod -a -G video $USER
# Then log out and back in
```

### Issue: "Module 'pynvml' not found"

**Cause**: Dependencies not installed

**Fix**:
```bash
pip install -r requirements.txt
```

## Next Steps

### Week 1-2: Current Focus

- [x] Metrics collection working
- [x] Energy calculations validated
- [x] Overhead benchmarked
- [ ] Run 24-hour stability test
- [ ] Compare against external power meter

### Week 3-4: Attribution

- [ ] Add process tracking (PID → GPU mapping)
- [ ] Parse command lines for model names
- [ ] Implement job start/stop detection

### Week 5-6: Backend Integration

- [ ] Connect to FastAPI backend
- [ ] Upload metrics to database
- [ ] Real-time dashboard

## Testing Checklist

Use this checklist to verify your MVP agent:

```
□ GPU detection works (python collector.py)
□ Agent runs without errors (python main.py)
□ CPU overhead <0.1% (python tests/benchmark_overhead.py)
□ Energy error <5% (python tests/validate_energy.py)
□ CSV export works (python main.py --output data/test.csv)
□ Can run for 1+ hours without crash
□ Energy totals match expectations
```

## Getting Help

If you encounter issues:

1. Check the `logs/` directory for error messages
2. Run with verbose logging: `python main.py -v`
3. Review the [troubleshooting guide](README.md#troubleshooting)
4. Open an issue in the main repository

## Performance Tuning

### Reduce Overhead

If CPU usage is too high:

```bash
# Increase sampling interval
python main.py --interval 10

# Disable clock monitoring by editing collector.py (Python, not a shell command):
#   collector = GPUCollector(collect_clocks=False)
```

### Increase Accuracy

For better energy accuracy:

```bash
# Use faster sampling (1s interval)
python main.py --interval 1

# Trade-off: Slightly higher overhead (~0.05% CPU)
```

## Example Workflows

### Workflow 1: Validate Against Power Meter

```bash
# 1. Start a GPU workload (e.g., training)
python your_training_script.py &

# 2. Run agent for 10 minutes
python main.py --duration 600 --output data/validation.csv

# 3. Compare total kWh with power meter reading
# Note: Meter shows whole system, agent shows GPU only
```

### Workflow 2: Daily Monitoring

```bash
# Run as background service
nohup python main.py --output data/daily_$(date +%Y%m%d).csv > logs/agent.log 2>&1 &

# Check status
tail -f logs/agent.log

# Stop
pkill -f "python main.py"
```

### Workflow 3: Job Energy Profiling

```bash
# Start agent
python main.py --output data/job_profile.csv &
AGENT_PID=$!

# Run your ML job
python train_resnet.py

# Stop agent
kill $AGENT_PID

# Analyze CSV to see energy consumption during job
```

## Success Criteria (Week 1-2)

Your MVP agent is ready when:

- ✅ Runs with <0.1% CPU overhead
- ✅ Collects accurate power metrics (validated against NVML)
- ✅ Energy calculations within 5% of theoretical
- ✅ Can run continuously for 24+ hours
- ✅ CSV export works correctly
- ✅ All unit tests pass

Once these are met, move to Week 3-4 (attribution)!
@@ -0,0 +1,393 @@
Metadata-Version: 2.4
Name: aluminatiai
Version: 0.2.0
Summary: GPU energy monitoring agent - per-job cost attribution for AI teams
License: MIT
Project-URL: Homepage, https://aluminatiai.com
Project-URL: Documentation, https://aluminatiai.com/docs/agent
Project-URL: Repository, https://github.com/AgentMulder404/AluminatAI
Project-URL: Bug Tracker, https://github.com/AgentMulder404/AluminatAI/issues
Keywords: gpu,monitoring,energy,mlops,cost,nvidia
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.8
Description-Content-Type: text/markdown
Requires-Dist: pynvml>=11.5.0
Requires-Dist: requests>=2.28
Requires-Dist: python-dotenv>=1.0
Requires-Dist: rich>=13.0
Provides-Extra: prometheus
Requires-Dist: prometheus-client>=0.19; extra == "prometheus"

# AluminatAI - GPU Energy Intelligence Platform

**Know exactly what your GPUs cost. Every watt, every dollar, every job.**

AluminatAI is an open-source GPU energy monitoring platform that gives AI teams real-time visibility into power consumption, energy costs, and utilization across their GPU fleet. A lightweight Python agent runs on your GPU machines and streams metrics to a cloud dashboard where you can track spending, compare jobs, and optimize workloads.

**Live:** [https://www.aluminatiai.com/](https://aluminatiai-landing.vercel.app)

---

## How It Works

```
┌──────────────────┐     HTTPS/JSON      ┌──────────────────────────┐
│  GPU Machine     │ ──────────────────► │  AluminatAI Platform     │
│                  │     every 60s       │                          │
│ ┌──────────────┐ │                     │  ┌──────────┐            │
│ │ Python Agent │ │                     │  │ Next.js  │  Vercel    │
│ │  (pynvml)    │ │                     │  │ API      │            │
│ └──────────────┘ │                     │  └────┬─────┘            │
│                  │                     │       │                  │
│ NVIDIA A100/H100 │                     │  ┌────▼─────┐            │
│ RTX 3090/4090    │                     │  │ Supabase │ PostgreSQL │
│ Any NVIDIA GPU   │                     │  │ Database │  + RLS     │
└──────────────────┘                     │  └────┬─────┘            │
                                         │       │                  │
                                         │  ┌────▼─────┐            │
                                         │  │Dashboard │  React     │
                                         │  │   UI     │ + Recharts │
                                         │  └──────────┘            │
                                         └──────────────────────────┘
```

## Features

- **Real-Time GPU Monitoring** - Power draw, utilization, temperature, memory, and clock speeds sampled every 5 seconds
- **Energy Cost Tracking** - Calculates energy consumption in kWh and converts to dollar costs at your electricity rate
- **Job Attribution** - Track which training jobs consumed how much energy and what they cost
- **Dashboard** - Three views: Today's Cost, Jobs Table, and Utilization vs Power chart
- **Free Trial** - 30-day free trial with auto-generated API keys on signup
- **Lightweight Agent** - <1% CPU, ~50MB RAM overhead on GPU machines
- **Secure** - Row-Level Security, API key auth with `pgcrypto`, rate limiting, server-side validation
- **Minimax Scheduler** - Bonus hackathon project: AI-powered job scheduling that balances speed vs. energy cost
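
The energy cost tracking described above boils down to a unit conversion followed by a multiplication. A minimal sketch, assuming a flat electricity rate; the function names, the rate, and the job figures below are illustrative and not taken from the platform:

```python
JOULES_PER_KWH = 3_600_000  # 1 kWh = 3.6e6 J


def job_energy_kwh(avg_power_w: float, duration_s: float, gpu_count: int = 1) -> float:
    """Energy for a job, assuming every GPU draws the same average power."""
    return avg_power_w * duration_s * gpu_count / JOULES_PER_KWH


def energy_cost_usd(energy_kwh: float, rate_usd_per_kwh: float) -> float:
    """Convert energy to cost at a flat rate (no time-of-use pricing)."""
    if energy_kwh < 0 or rate_usd_per_kwh < 0:
        raise ValueError("energy and rate must be non-negative")
    return energy_kwh * rate_usd_per_kwh


if __name__ == "__main__":
    # Hypothetical job: 4 GPUs averaging 300 W each for 10 hours.
    kwh = job_energy_kwh(avg_power_w=300.0, duration_s=10 * 3600, gpu_count=4)
    print(f"{kwh:.1f} kWh, ${energy_cost_usd(kwh, 0.12):.2f} at $0.12/kWh")
```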

---

## Project Structure

```
AluminatAI/
├── aluminatai-landing/          # Next.js web platform (deployed to Vercel)
│   ├── app/
│   │   ├── api/
│   │   │   ├── metrics/ingest/  # GPU metrics ingestion endpoint
│   │   │   ├── dashboard/       # today-cost, jobs, utilization-chart
│   │   │   ├── user/profile/    # User profile + API key rotation
│   │   │   └── cron/            # Materialized view refresh
│   │   ├── dashboard/           # Protected dashboard UI
│   │   ├── login/               # Auth pages
│   │   └── page.tsx             # Landing page
│   ├── components/              # React components
│   ├── lib/                     # Auth, rate limiting, Supabase clients
│   └── database/migrations/     # SQL migrations (001-005)
│
├── agent/                       # Python GPU monitoring agent
│   ├── main.py                  # Agent entry point
│   ├── collector.py             # NVML-based GPU metrics collector
│   ├── uploader.py              # API upload with retry + local backup
│   ├── config.py                # Environment-based configuration
│   ├── install.sh               # One-line install script
│   └── tests/                   # Test suite + Colab notebook
│
├── minimax-scheduler/           # Hackathon: Minimax GPU job scheduler
│   └── backend/                 # FastAPI + minimax algorithm
│
├── backend/                     # Legacy FastAPI backend (reference)
├── frontend/                    # Legacy React frontend (reference)
├── docker/                      # Docker configs for agent + backend
├── docs/                        # Architecture docs, metrics schema
└── assets/                      # Logo and diagrams
```

---

## Quick Start

### Prerequisites

- Node.js 18+ and npm
- Python 3.8+
- A Supabase account ([supabase.com](https://supabase.com))
- An NVIDIA GPU (for the agent) or Google Colab with GPU runtime

### 1. Clone the Repository

```bash
git clone https://github.com/AgentMulder404/aluminatai-landing.git
cd aluminatai-landing
```

### 2. Set Up the Database (Supabase)

1. Create a new project at [supabase.com](https://supabase.com)
2. Go to **SQL Editor** and run the migrations in order:

```bash
# Run these SQL files in the Supabase SQL Editor:
database/migrations/002_gpu_monitoring_schema_postgres.sql
database/migrations/003_fix_materialized_view.sql
database/migrations/004_fix_trigger_permissions.sql
database/migrations/005_secure_api_keys_and_constraints.sql
```

This creates:
- `users` table with auto-generated API keys (using `pgcrypto`)
- `gpu_metrics` time-series table with CHECK constraints
- `gpu_jobs` table for job tracking
- `gpu_metrics_hourly` materialized view for fast dashboard queries
- Row-Level Security policies on all tables
- Triggers for user profile auto-creation on signup

### 3. Set Up the Web Platform

```bash
cd aluminatai-landing
npm install
```

Create a `.env.local` file:

```bash
# Supabase (from your project settings > API)
NEXT_PUBLIC_SUPABASE_URL=https://your-project.supabase.co
NEXT_PUBLIC_SUPABASE_ANON_KEY=your-anon-key
SUPABASE_SERVICE_ROLE_KEY=your-service-role-key

# Cron secret (generate with: openssl rand -base64 32)
CRON_SECRET=your-cron-secret
```

Run the development server:

```bash
npm run dev
```

Visit `http://localhost:3000` - you should see the landing page.

### 4. Create an Account

1. Click **"Start Free Trial"** on the landing page
2. Enter your name, email, and password
3. You'll be redirected to the dashboard setup page
4. Copy your API key (starts with `alum_`)

### 5. Install the GPU Agent

On your GPU machine (or Google Colab):

```bash
# Install dependencies
pip install pynvml requests python-dotenv rich

# Set environment variables
export ALUMINATAI_API_KEY="alum_your_key_here"
export ALUMINATAI_API_ENDPOINT="http://localhost:3000/api/metrics/ingest"

# Run the agent
python agent/main.py
```

Options:

```bash
# Custom sampling interval (1 second)
python agent/main.py --interval 1

# Save to CSV + upload
python agent/main.py --output data/metrics.csv

# Run for 5 minutes
python agent/main.py --duration 300

# Quiet mode (no console output)
python agent/main.py --quiet --output data/metrics.csv
```

For production, use the systemd service:

```bash
cd agent
chmod +x install.sh
sudo ./install.sh
```

### 6. Test on Google Colab (A100)

Upload `agent/tests/AluminatAI_A100_Test.ipynb` to Google Colab:

1. Go to [colab.research.google.com](https://colab.research.google.com)
2. **File > Upload notebook** and select the `.ipynb` file
3. **Runtime > Change runtime type** > select **A100 GPU**
4. Paste your API key in Cell 2
5. **Runtime > Run all**

The notebook runs 7 test suites:
- NVML hardware access
- Collector class + energy calculation
- API authentication validation
- End-to-end collect + upload
- Stress test under GPU load (8192x8192 matmul)
- API key security audit
- 60-second continuous monitoring demo

---

## API Reference

### Metrics Ingestion

```
POST /api/metrics/ingest
Header: X-API-Key: alum_your_key_here
```

**Request body** (single metric or array):

```json
[
  {
    "timestamp": "2026-02-06T12:00:00Z",
    "gpu_index": 0,
    "gpu_uuid": "GPU-abc123",
    "gpu_name": "NVIDIA A100-SXM4-40GB",
    "power_draw_w": 285.5,
    "power_limit_w": 400.0,
    "energy_delta_j": 571.0,
    "utilization_gpu_pct": 95,
    "utilization_memory_pct": 60,
    "temperature_c": 72,
    "memory_used_mb": 32000,
    "memory_total_mb": 40960
  }
]
```
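
A request of this shape can be assembled with the Python standard library alone. The sketch below only builds the request object; the helper name is hypothetical, the `Content-Type` header is an assumption, and the response format is not documented here, so the sending step is left commented out.

```python
import json
import urllib.request


def build_ingest_request(endpoint: str, api_key: str, metrics: list) -> urllib.request.Request:
    """Build (but do not send) a POST carrying a JSON array of metrics."""
    body = json.dumps(metrics).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json", "X-API-Key": api_key},
    )


if __name__ == "__main__":
    req = build_ingest_request(
        "http://localhost:3000/api/metrics/ingest",
        "alum_your_key_here",
        [{"timestamp": "2026-02-06T12:00:00Z", "gpu_index": 0, "power_draw_w": 285.5}],
    )
    print(req.get_method(), req.full_url)
    # Sending requires a running server:
    # with urllib.request.urlopen(req, timeout=10) as resp:
    #     print(resp.status)
```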

**Validation rules:**
- `power_draw_w`: 0-1500 W
- `temperature_c`: 0-120 °C
- `utilization_*_pct`: 0-100
- `timestamp`: valid ISO 8601, not more than 5 minutes in the future
- Max 1000 metrics per request
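
These rules can be mirrored client-side before uploading, so that a bad batch is caught early. The following is a rough Python sketch, not the server's implementation: it assumes `power_draw_w`, `temperature_c`, and `timestamp` are required, treats naive timestamps as UTC, and relies on `datetime.fromisoformat`, which accepts only a subset of ISO 8601.

```python
from datetime import datetime, timedelta, timezone

MAX_BATCH = 1000
MAX_FUTURE_SKEW = timedelta(minutes=5)


def _in_range(value, low, high) -> bool:
    return isinstance(value, (int, float)) and low <= value <= high


def validate_metric(metric: dict, now: datetime) -> list:
    """Return human-readable rule violations for one metric (empty if valid)."""
    errors = []
    # Assumption: power and temperature are required fields.
    if not _in_range(metric.get("power_draw_w"), 0, 1500):
        errors.append("power_draw_w must be within 0-1500 W")
    if not _in_range(metric.get("temperature_c"), 0, 120):
        errors.append("temperature_c must be within 0-120 C")
    for key, value in metric.items():
        if key.startswith("utilization_") and key.endswith("_pct"):
            if not _in_range(value, 0, 100):
                errors.append(f"{key} must be within 0-100")
    try:
        # Older Python versions reject a trailing "Z", so normalise it first.
        ts = datetime.fromisoformat(str(metric.get("timestamp")).replace("Z", "+00:00"))
        if ts.tzinfo is None:
            ts = ts.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
        if ts - now > MAX_FUTURE_SKEW:
            errors.append("timestamp is more than 5 minutes in the future")
    except ValueError:
        errors.append("timestamp must be valid ISO 8601")
    return errors


def validate_batch(payload, now: datetime) -> list:
    """Accept a single metric or a list of metrics."""
    metrics = payload if isinstance(payload, list) else [payload]
    if len(metrics) > MAX_BATCH:
        return [f"too many metrics: {len(metrics)} > {MAX_BATCH}"]
    return [f"[{i}] {e}" for i, m in enumerate(metrics) for e in validate_metric(m, now)]
```

Pass a timezone-aware `now`, for example `datetime.now(timezone.utc)`.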

**Rate limit:** 100 requests/minute per user
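
How the platform enforces this limit is not described here. One common approach is a per-user sliding window, sketched below in Python as an in-memory illustration; the class name is hypothetical, and a serverless deployment would need shared storage rather than process memory.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per user within any `window_s` seconds."""

    def __init__(self, limit: int = 100, window_s: float = 60.0, clock=time.monotonic):
        self.limit = limit
        self.window_s = window_s
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self._hits = defaultdict(deque)

    def allow(self, user_id: str) -> bool:
        now = self.clock()
        hits = self._hits[user_id]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_s:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True
```

An agent that receives a rate-limit response would typically back off and retry later rather than resend immediately.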

### Dashboard APIs

| Endpoint | Method | Auth | Rate Limit | Description |
|---|---|---|---|---|
| `/api/dashboard/today-cost` | GET | Session | 60/min | Today's energy cost |
| `/api/dashboard/jobs` | GET | Session | 60/min | Job history with pagination |
| `/api/dashboard/utilization-chart` | GET | Session | 60/min | Time-series chart data |
| `/api/user/profile` | GET | Session | - | User profile + API key |
| `/api/user/profile` | PATCH | Session | - | Update profile settings |
| `/api/user/profile` | POST | Session | 5/hr | Rotate API key |

### API Key Rotation

```bash
curl -X POST https://aluminatiai-landing.vercel.app/api/user/profile \
  -H "Content-Type: application/json" \
  -H "Cookie: your-session-cookie" \
  -d '{"action": "rotate_api_key"}'
```

---

## Security

- **API Keys**: Generated with `pgcrypto gen_random_bytes()` - 340 bits of entropy
- **Row-Level Security**: Users can only access their own data
- **Rate Limiting**: Per-user limits on all endpoints
- **Input Validation**: Server-side + database CHECK constraints
- **HTTPS**: Enforced by Vercel
- **No ambiguous characters**: API keys exclude `0, O, I, l, 1` to prevent copy errors
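
The platform generates keys in PostgreSQL with `pgcrypto`; the Python sketch below only illustrates the idea of drawing key characters from an alphabet without look-alike characters. The key length of 40 and the function name are assumptions, not the platform's values.

```python
import secrets
import string

AMBIGUOUS = set("0OIl1")
# 62 alphanumeric characters minus the 5 look-alikes leaves 57.
ALPHABET = "".join(c for c in string.ascii_letters + string.digits if c not in AMBIGUOUS)


def generate_api_key(length: int = 40, prefix: str = "alum_") -> str:
    """Return the prefix followed by `length` cryptographically random characters."""
    return prefix + "".join(secrets.choice(ALPHABET) for _ in range(length))


if __name__ == "__main__":
    print(generate_api_key())
```

Each random character contributes log2(57) bits, so the entropy of the random part grows linearly with its length.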

---

## Deployment

### Vercel (Web Platform)

```bash
# Install Vercel CLI
npm i -g vercel

# Deploy
cd aluminatai-landing
vercel

# Set environment variables in Vercel dashboard
```

### Cron Job (Materialized View Refresh)

Set up a cron job to refresh the hourly metrics view:

- **URL**: `https://your-app.vercel.app/api/cron/refresh-metrics`
- **Method**: POST
- **Header**: `Authorization: Bearer your-cron-secret`
- **Schedule**: Every hour (`0 * * * *`)

You can use [cron-job.org](https://cron-job.org) (free) or Vercel Cron.

---

## Tech Stack

| Component | Technology |
|---|---|
| Web Framework | Next.js 16 |
| UI | React 19 + Tailwind CSS 4 |
| Charts | Recharts |
| Database | Supabase PostgreSQL |
| Auth | Supabase Auth |
| GPU Agent | Python + pynvml (NVML) |
| Deployment | Vercel |
| Scheduler | Minimax with alpha-beta pruning |

---

## Minimax GPU Scheduler

A bonus hackathon project in `minimax-scheduler/` that uses game theory to optimize GPU job scheduling:

- **Speed Player (Maximizer)**: Wants to complete jobs ASAP
- **Cost Player (Minimizer)**: Wants to minimize energy costs
- **Alpha-Beta Pruning**: Efficiently explores the decision tree
- **Result**: 15-30% cost savings vs. naive FIFO scheduling
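
For readers unfamiliar with the technique, here is a generic minimax search with alpha-beta pruning over a small hand-written tree. It is a textbook sketch, not the scheduler's code: the nested-list tree and the leaf scores are made up, and how the project scores a schedule for the speed and cost players is not shown here.

```python
import math


def alphabeta(node, maximizing: bool, alpha: float = -math.inf, beta: float = math.inf):
    """Minimax value of `node`; inner nodes are lists, leaves are numeric scores."""
    if not isinstance(node, list):
        return node
    if maximizing:
        best = -math.inf
        for child in node:
            best = max(best, alphabeta(child, False, alpha, beta))
            alpha = max(alpha, best)
            if alpha >= beta:
                break  # the minimizer above would never allow this branch
        return best
    best = math.inf
    for child in node:
        best = min(best, alphabeta(child, True, alpha, beta))
        beta = min(beta, best)
        if alpha >= beta:
            break  # the maximizer above already has a better option
    return best


if __name__ == "__main__":
    # Maximizer picks a branch, then the minimizer picks a leaf within it.
    tree = [[3, 5], [2, 9], [0, 1]]
    print(alphabeta(tree, maximizing=True))
```

Pruning does not change the result; it only skips subtrees that cannot affect the choice at the root.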

```bash
cd minimax-scheduler/backend
pip install -r requirements.txt
python demo.py
```

---

## Contributing

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/my-feature`)
3. Commit your changes
4. Push to the branch (`git push origin feature/my-feature`)
5. Open a Pull Request

---

## License

This project is open source under the MIT License. See [LICENSE](LICENSE) for details.

---

Built by [@AgentMulder404](https://github.com/AgentMulder404)