aluminatiai 0.2.0__tar.gz

Files changed (44)
  1. aluminatiai-0.2.0/GETTING_STARTED.md +284 -0
  2. aluminatiai-0.2.0/PKG-INFO +393 -0
  3. aluminatiai-0.2.0/README.md +365 -0
  4. aluminatiai-0.2.0/__init__.py +2 -0
  5. aluminatiai-0.2.0/agent.py +607 -0
  6. aluminatiai-0.2.0/aluminati_agent.py +818 -0
  7. aluminatiai-0.2.0/aluminatiai.egg-info/PKG-INFO +393 -0
  8. aluminatiai-0.2.0/aluminatiai.egg-info/SOURCES.txt +56 -0
  9. aluminatiai-0.2.0/aluminatiai.egg-info/dependency_links.txt +1 -0
  10. aluminatiai-0.2.0/aluminatiai.egg-info/entry_points.txt +2 -0
  11. aluminatiai-0.2.0/aluminatiai.egg-info/requires.txt +7 -0
  12. aluminatiai-0.2.0/aluminatiai.egg-info/top_level.txt +1 -0
  13. aluminatiai-0.2.0/attribution/__init__.py +10 -0
  14. aluminatiai-0.2.0/attribution/engine.py +138 -0
  15. aluminatiai-0.2.0/attribution/pid_resolver.py +97 -0
  16. aluminatiai-0.2.0/attribution/process_probe.py +112 -0
  17. aluminatiai-0.2.0/benchmark_m5.py +850 -0
  18. aluminatiai-0.2.0/cli.py +27 -0
  19. aluminatiai-0.2.0/collector.py +342 -0
  20. aluminatiai-0.2.0/config.py +70 -0
  21. aluminatiai-0.2.0/efficiency/__init__.py +35 -0
  22. aluminatiai-0.2.0/efficiency/curve_builder.py +334 -0
  23. aluminatiai-0.2.0/efficiency/gpu_specs.py +202 -0
  24. aluminatiai-0.2.0/efficiency/hardware_match.py +398 -0
  25. aluminatiai-0.2.0/efficiency/profiler.py +1112 -0
  26. aluminatiai-0.2.0/gpu_collector.py +0 -0
  27. aluminatiai-0.2.0/integrations/__init__.py +10 -0
  28. aluminatiai-0.2.0/integrations/mlflow_callback.py +151 -0
  29. aluminatiai-0.2.0/integrations/otel_exporter.py +139 -0
  30. aluminatiai-0.2.0/integrations/wandb_callback.py +137 -0
  31. aluminatiai-0.2.0/main.py +438 -0
  32. aluminatiai-0.2.0/metrics_server.py +136 -0
  33. aluminatiai-0.2.0/pyproject.toml +62 -0
  34. aluminatiai-0.2.0/requirements.txt +15 -0
  35. aluminatiai-0.2.0/schedulers/__init__.py +16 -0
  36. aluminatiai-0.2.0/schedulers/base.py +98 -0
  37. aluminatiai-0.2.0/schedulers/detect.py +87 -0
  38. aluminatiai-0.2.0/schedulers/kubernetes.py +240 -0
  39. aluminatiai-0.2.0/schedulers/runai.py +222 -0
  40. aluminatiai-0.2.0/schedulers/slurm.py +353 -0
  41. aluminatiai-0.2.0/setup.cfg +4 -0
  42. aluminatiai-0.2.0/tests/test_attribution.py +194 -0
  43. aluminatiai-0.2.0/tests/test_collector.py +170 -0
  44. aluminatiai-0.2.0/uploader.py +250 -0
@@ -0,0 +1,284 @@
1
+ # Getting Started with GPU Energy Agent
2
+
3
+ ## Prerequisites
4
+
5
+ 1. **NVIDIA GPU** with driver 450.80.02 or newer
6
+ 2. **Python 3.8+**
7
+ 3. **Linux OS** (Ubuntu, CentOS, etc.)
8
+
9
+ ## Quick Start
10
+
11
+ ### 1. Install Dependencies
12
+
13
+ ```bash
14
+ cd agent
15
+ pip install -r requirements.txt
16
+ ```
17
+
18
+ Expected output:
19
+ ```
20
+ Successfully installed nvidia-ml-py3-7.352.0 psutil-5.9.6 rich-13.7.0 ...
21
+ ```
22
+
23
+ ### 2. Test GPU Detection
24
+
25
+ Run a quick test to verify the collector can access your GPUs:
26
+
27
+ ```bash
28
+ python collector.py
29
+ ```
30
+
31
+ Expected output:
32
+ ```
33
+ Testing GPU Collector...
34
+ Found 4 GPUs:
35
+ GPU 0: NVIDIA A100-SXM4-40GB (GPU-abc123...)
36
+ GPU 1: NVIDIA A100-SXM4-40GB (GPU-def456...)
37
+ ...
38
+
39
+ Collecting 3 samples (2s intervals)...
40
+ GPU 0: 287.4W, 98% util, 76°C, 1437.0J
41
+ ...
42
+
43
+ ✅ Collector test passed!
44
+ ```
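If `python collector.py` fails, it can help to check NVML access independently of the agent. The following is a minimal sketch, not the package's own collector: the function name `probe_gpus` is hypothetical, and it assumes only the standard `pynvml` bindings listed in the package requirements.

```python
def probe_gpus():
    """Return a list of (index, name, power_watts) tuples, or None if NVML is unavailable.

    Hypothetical helper for troubleshooting; not part of the aluminatiai package.
    """
    try:
        import pynvml  # NVML bindings; a missing module means dependencies are not installed
    except ImportError:
        return None
    try:
        pynvml.nvmlInit()
    except pynvml.NVMLError:
        # Driver missing, driver/library mismatch, or insufficient permissions
        return None
    try:
        gpus = []
        for index in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(index)
            name = pynvml.nvmlDeviceGetName(handle)
            if isinstance(name, bytes):  # older bindings return bytes
                name = name.decode()
            try:
                # NVML reports power in milliwatts
                power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0
            except pynvml.NVMLError:
                power_w = None  # some GPUs do not expose power readings
            gpus.append((index, name, power_w))
        return gpus
    finally:
        pynvml.nvmlShutdown()


if __name__ == "__main__":
    result = probe_gpus()
    if result is None:
        print("NVML unavailable: check the driver and the pynvml install")
    else:
        for index, name, power_w in result:
            power_text = f"{power_w:.1f} W" if power_w is not None else "power n/a"
            print(f"GPU {index}: {name}, {power_text}")
```

A `None` result points at the driver or the Python dependencies rather than at the agent itself.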
45
+
46
+ ### 3. Run the Agent
47
+
48
+ Start monitoring all GPUs:
49
+
50
+ ```bash
51
+ python main.py
52
+ ```
53
+
54
+ You should see real-time metrics updating every 5 seconds:
55
+
56
+ ```
57
+ GPU Energy Agent v0.1.0
58
+ 📊 Monitoring 4 GPUs
59
+ ⏱️ Sampling interval: 5.0s
60
+
61
+ ┏━━━━━━━┳━━━━━━━━┳━━━━━━┳━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━┓
62
+ ┃ GPU   ┃ Power  ┃ Util ┃ Temp ┃ Energy Δ ┃ Total kWh ┃
63
+ ┡━━━━━━━╇━━━━━━━━╇━━━━━━╇━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━┩
64
+ │ GPU 0 │ 287.4W │ 98%  │ 76°C │ 1437J    │ 0.0012    │
65
+ │ GPU 1 │ 289.1W │ 97%  │ 77°C │ 1446J    │ 0.0012    │
66
+ ...
67
+ ```
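The "Energy Δ" column in the sample output is consistent with power multiplied by the sampling interval: 287.4 W over 5 s is 1437 J. The sketch below illustrates that arithmetic under the assumption of a simple rectangle-rule integration. The function names are hypothetical and the actual collector may integrate differently.

```python
def energy_delta_j(power_w: float, interval_s: float) -> float:
    """Energy used during one sample, assuming constant power over the interval."""
    return power_w * interval_s


def joules_to_kwh(energy_j: float) -> float:
    """Convert joules to kilowatt-hours (1 kWh = 3,600,000 J)."""
    return energy_j / 3_600_000.0


sample_j = energy_delta_j(287.4, 5.0)
print(f"One 5 s sample at 287.4 W: {sample_j:.1f} J")
print(f"Same sample in kWh: {joules_to_kwh(sample_j):.6f}")
```

Shorter intervals make the constant-power assumption more accurate, which is the trade-off discussed under Performance Tuning.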
68
+
69
+ Press `Ctrl+C` to stop.
70
+
71
+ ### 4. Export to CSV
72
+
73
+ Run for 5 minutes and save metrics to CSV:
74
+
75
+ ```bash
76
+ python main.py --duration 300 --output data/test_run.csv
77
+ ```
78
+
79
+ This creates `data/test_run.csv` with timestamped metrics for analysis.
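As a starting point for that analysis, the sketch below totals energy per GPU from a CSV. The column names `gpu_index` and `energy_delta_j` are assumptions borrowed from the ingestion API fields; check the header row of your own export before relying on them. The function name is hypothetical.

```python
import csv
import io
from collections import defaultdict


def total_energy_kwh(csv_file):
    """Sum per-sample energy (joules) by GPU and return kWh per GPU index.

    Assumes 'gpu_index' and 'energy_delta_j' columns (hypothetical names).
    """
    totals_j = defaultdict(float)
    for row in csv.DictReader(csv_file):
        totals_j[int(row["gpu_index"])] += float(row["energy_delta_j"])
    return {gpu: joules / 3_600_000.0 for gpu, joules in totals_j.items()}


# Small in-memory stand-in for a real export, for illustration only
demo = io.StringIO(
    "timestamp,gpu_index,energy_delta_j\n"
    "2026-02-06T12:00:00Z,0,1437.0\n"
    "2026-02-06T12:00:05Z,0,1437.0\n"
    "2026-02-06T12:00:00Z,1,1446.0\n"
)
summary = total_energy_kwh(demo)
for gpu, kwh in sorted(summary.items()):
    print(f"GPU {gpu}: {kwh:.6f} kWh")
```

For a real run, open the file with `open("data/test_run.csv", newline="")` and pass the file object to the same function.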
80
+
81
+ ### 5. Run Tests
82
+
83
+ #### Test 1: Unit Tests
84
+
85
+ ```bash
86
+ python tests/test_collector.py
87
+ ```
88
+
89
+ Expected: All tests pass ✅
90
+
91
+ #### Test 2: Overhead Benchmark
92
+
93
+ Measure agent CPU/memory impact:
94
+
95
+ ```bash
96
+ python tests/benchmark_overhead.py
97
+ ```
98
+
99
+ Expected results:
100
+ - CPU overhead: <0.1%
101
+ - Memory overhead: <50 MB
102
+ - Collection latency: <1ms per GPU
103
+
104
+ #### Test 3: Energy Validation
105
+
106
+ Validate energy calculations:
107
+
108
+ ```bash
109
+ python tests/validate_energy.py --duration 60
110
+ ```
111
+
112
+ Expected: Energy error <5% ✅
113
+
114
+ ## Common Issues
115
+
116
+ ### Issue: "No NVIDIA GPUs found"
117
+
118
+ **Cause**: NVIDIA driver not installed or not working
119
+
120
+ **Fix**:
121
+ ```bash
122
+ # Check driver
123
+ nvidia-smi
124
+
125
+ # If not working, install driver:
126
+ # Ubuntu:
127
+ sudo apt install nvidia-driver-535
128
+
129
+ # Then reboot
130
+ sudo reboot
131
+ ```
132
+
133
+ ### Issue: "Failed to initialize NVML"
134
+
135
+ **Cause**: Permission issue or driver mismatch
136
+
137
+ **Fix**:
138
+ ```bash
139
+ # Check driver version
140
+ nvidia-smi
141
+
142
+ # Ensure you're in the right groups
143
+ groups
144
+
145
+ # Add to video group if needed
146
+ sudo usermod -a -G video $USER
147
+ # Then log out and back in
148
+ ```
149
+
150
+ ### Issue: "Module 'pynvml' not found"
151
+
152
+ **Cause**: Dependencies not installed
153
+
154
+ **Fix**:
155
+ ```bash
156
+ pip install -r requirements.txt
157
+ ```
158
+
159
+ ## Next Steps
160
+
161
+ ### Week 1-2: Current Focus
162
+
163
+ - [x] Metrics collection working
164
+ - [x] Energy calculations validated
165
+ - [x] Overhead benchmarked
166
+ - [ ] Run 24-hour stability test
167
+ - [ ] Compare against external power meter
168
+
169
+ ### Week 3-4: Attribution
170
+
171
+ - [ ] Add process tracking (PID → GPU mapping)
172
+ - [ ] Parse command lines for model names
173
+ - [ ] Implement job start/stop detection
174
+
175
+ ### Week 5-6: Backend Integration
176
+
177
+ - [ ] Connect to FastAPI backend
178
+ - [ ] Upload metrics to database
179
+ - [ ] Real-time dashboard
180
+
181
+ ## Testing Checklist
182
+
183
+ Use this checklist to verify your MVP agent:
184
+
185
+ ```
186
+ □ GPU detection works (python collector.py)
187
+ □ Agent runs without errors (python main.py)
188
+ □ CPU overhead <0.1% (python tests/benchmark_overhead.py)
189
+ □ Energy error <5% (python tests/validate_energy.py)
190
+ □ CSV export works (python main.py --output data/test.csv)
191
+ □ Can run for 1+ hours without crash
192
+ □ Energy totals match expectations
193
+ ```
194
+
195
+ ## Getting Help
196
+
197
+ If you encounter issues:
198
+
199
+ 1. Check the `logs/` directory for error messages
200
+ 2. Run with verbose logging: `python main.py -v`
201
+ 3. Review the [troubleshooting guide](README.md#troubleshooting)
202
+ 4. Open an issue in the main repository
203
+
204
+ ## Performance Tuning
205
+
206
+ ### Reduce Overhead
207
+
208
+ If CPU usage is too high:
209
+
210
+ ```bash
211
+ # Increase sampling interval
212
+ python main.py --interval 10
213
+
214
+ # Disable clock monitoring by editing collector.py (Python, not a shell command):
215
+ #   collector = GPUCollector(collect_clocks=False)
216
+ ```
217
+
218
+ ### Increase Accuracy
219
+
220
+ For better energy accuracy:
221
+
222
+ ```bash
223
+ # Use faster sampling (1s interval)
224
+ python main.py --interval 1
225
+
226
+ # Trade-off: Slightly higher overhead (~0.05% CPU)
227
+ ```
228
+
229
+ ## Example Workflows
230
+
231
+ ### Workflow 1: Validate Against Power Meter
232
+
233
+ ```bash
234
+ # 1. Start a GPU workload (e.g., training)
235
+ python your_training_script.py &
236
+
237
+ # 2. Run agent for 10 minutes
238
+ python main.py --duration 600 --output data/validation.csv
239
+
240
+ # 3. Compare total kWh with power meter reading
241
+ # Note: Meter shows whole system, agent shows GPU only
242
+ ```
243
+
244
+ ### Workflow 2: Daily Monitoring
245
+
246
+ ```bash
247
+ # Run in the background with nohup (the logs/ directory must already exist)
248
+ nohup python main.py --output data/daily_$(date +%Y%m%d).csv > logs/agent.log 2>&1 &
249
+
250
+ # Check status
251
+ tail -f logs/agent.log
252
+
253
+ # Stop
254
+ pkill -f "python main.py"
255
+ ```
256
+
257
+ ### Workflow 3: Job Energy Profiling
258
+
259
+ ```bash
260
+ # Start agent
261
+ python main.py --output data/job_profile.csv &
262
+ AGENT_PID=$!
263
+
264
+ # Run your ML job
265
+ python train_resnet.py
266
+
267
+ # Stop agent
268
+ kill $AGENT_PID
269
+
270
+ # Analyze CSV to see energy consumption during job
271
+ ```
272
+
273
+ ## Success Criteria (Week 1-2)
274
+
275
+ Your MVP agent is ready when:
276
+
277
+ - ✅ Runs with <0.1% CPU overhead
278
+ - ✅ Collects accurate power metrics (validated against NVML)
279
+ - ✅ Energy calculations within 5% of theoretical
280
+ - ✅ Can run continuously for 24+ hours
281
+ - ✅ CSV export works correctly
282
+ - ✅ All unit tests pass
283
+
284
+ Once these are met, move to Week 3-4 (attribution)!
@@ -0,0 +1,393 @@
1
+ Metadata-Version: 2.4
2
+ Name: aluminatiai
3
+ Version: 0.2.0
4
+ Summary: GPU energy monitoring agent — per-job cost attribution for AI teams
5
+ License: MIT
6
+ Project-URL: Homepage, https://aluminatiai.com
7
+ Project-URL: Documentation, https://aluminatiai.com/docs/agent
8
+ Project-URL: Repository, https://github.com/AgentMulder404/AluminatAI
9
+ Project-URL: Bug Tracker, https://github.com/AgentMulder404/AluminatAI/issues
10
+ Keywords: gpu,monitoring,energy,mlops,cost,nvidia
11
+ Classifier: Development Status :: 4 - Beta
12
+ Classifier: Intended Audience :: Developers
13
+ Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
14
+ Classifier: Programming Language :: Python :: 3
15
+ Classifier: Programming Language :: Python :: 3.8
16
+ Classifier: Programming Language :: Python :: 3.9
17
+ Classifier: Programming Language :: Python :: 3.10
18
+ Classifier: Programming Language :: Python :: 3.11
19
+ Classifier: Programming Language :: Python :: 3.12
20
+ Requires-Python: >=3.8
21
+ Description-Content-Type: text/markdown
22
+ Requires-Dist: pynvml>=11.5.0
23
+ Requires-Dist: requests>=2.28
24
+ Requires-Dist: python-dotenv>=1.0
25
+ Requires-Dist: rich>=13.0
26
+ Provides-Extra: prometheus
27
+ Requires-Dist: prometheus-client>=0.19; extra == "prometheus"
28
+
29
+ # AluminatAI - GPU Energy Intelligence Platform
30
+
31
+ **Know exactly what your GPUs cost. Every watt, every dollar, every job.**
32
+
33
+ AluminatAI is an open-source GPU energy monitoring platform that gives AI teams real-time visibility into power consumption, energy costs, and utilization across their GPU fleet. A lightweight Python agent runs on your GPU machines and streams metrics to a cloud dashboard where you can track spending, compare jobs, and optimize workloads.
34
+
35
+ **Live:** [https://www.aluminatiai.com/](https://aluminatiai-landing.vercel.app)
36
+
37
+ ---
38
+
39
+ ## How It Works
40
+
41
+ ```
42
+ ┌──────────────────┐     HTTPS/JSON     ┌──────────────────────────┐
43
+ │ GPU Machine      │ ─────────────────► │ AluminatAI Platform      │
44
+ │                  │     every 60s      │                          │
45
+ │ ┌──────────────┐ │                    │  ┌──────────┐            │
46
+ │ │ Python Agent │ │                    │  │ Next.js  │ Vercel     │
47
+ │ │ (pynvml)     │ │                    │  │ API      │            │
48
+ │ └──────────────┘ │                    │  └────┬─────┘            │
49
+ │                  │                    │       │                  │
50
+ │ NVIDIA A100/H100 │                    │  ┌────▼─────┐            │
51
+ │ RTX 3090/4090    │                    │  │ Supabase │ PostgreSQL │
52
+ │ Any NVIDIA GPU   │                    │  │ Database │ + RLS      │
53
+ └──────────────────┘                    │  └────┬─────┘            │
54
+                                         │       │                  │
55
+                                         │  ┌────▼─────┐            │
56
+                                         │  │Dashboard │ React      │
57
+                                         │  │ UI       │ + Recharts │
58
+                                         │  └──────────┘            │
59
+                                         └──────────────────────────┘
60
+ ```
61
+
62
+ ## Features
63
+
64
+ - **Real-Time GPU Monitoring** - Power draw, utilization, temperature, memory, and clock speeds sampled every 5 seconds
65
+ - **Energy Cost Tracking** - Calculates energy consumption in kWh and converts to dollar costs at your electricity rate
66
+ - **Job Attribution** - Track which training jobs consumed how much energy and what they cost
67
+ - **Dashboard** - Three views: Today's Cost, Jobs Table, and Utilization vs Power chart
68
+ - **Free Trial** - 30-day free trial with auto-generated API keys on signup
69
+ - **Lightweight Agent** - <1% CPU, ~50MB RAM overhead on GPU machines
70
+ - **Secure** - Row-Level Security, API key auth with `pgcrypto`, rate limiting, server-side validation
71
+ - **Minimax Scheduler** - Bonus hackathon project: AI-powered job scheduling that balances speed vs. energy cost
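The energy-to-cost conversion mentioned in the feature list is a single multiplication. The sketch below is illustrative only: the function names and the $0.12/kWh rate are hypothetical, and the platform's own calculation may apply different rounding or rate handling.

```python
def kwh_from_average_power(avg_power_w: float, hours: float) -> float:
    """Energy in kWh for a given average power draw sustained over a duration."""
    return avg_power_w * hours / 1000.0


def energy_cost_usd(energy_kwh: float, rate_usd_per_kwh: float) -> float:
    """Dollar cost of the energy at a flat electricity rate."""
    return energy_kwh * rate_usd_per_kwh


# Example: one GPU averaging 285.5 W for 24 hours at an assumed $0.12/kWh
daily_kwh = kwh_from_average_power(285.5, 24)
daily_cost = energy_cost_usd(daily_kwh, 0.12)
print(f"{daily_kwh:.3f} kWh/day, ${daily_cost:.2f}/day")
```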
72
+
73
+ ---
74
+
75
+ ## Project Structure
76
+
77
+ ```
78
+ AluminatAI/
79
+ ├── aluminatai-landing/           # Next.js web platform (deployed to Vercel)
80
+ │   ├── app/
81
+ │   │   ├── api/
82
+ │   │   │   ├── metrics/ingest/   # GPU metrics ingestion endpoint
83
+ │   │   │   ├── dashboard/        # today-cost, jobs, utilization-chart
84
+ │   │   │   ├── user/profile/     # User profile + API key rotation
85
+ │   │   │   └── cron/             # Materialized view refresh
86
+ │   │   ├── dashboard/            # Protected dashboard UI
87
+ │   │   ├── login/                # Auth pages
88
+ │   │   └── page.tsx              # Landing page
89
+ │   ├── components/               # React components
90
+ │   ├── lib/                      # Auth, rate limiting, Supabase clients
91
+ │   └── database/migrations/      # SQL migrations (001-005)
92
+ │
93
+ ├── agent/                        # Python GPU monitoring agent
94
+ │   ├── main.py                   # Agent entry point
95
+ │   ├── collector.py              # NVML-based GPU metrics collector
96
+ │   ├── uploader.py               # API upload with retry + local backup
97
+ │   ├── config.py                 # Environment-based configuration
98
+ │   ├── install.sh                # One-line install script
99
+ │   └── tests/                    # Test suite + Colab notebook
100
+ │
101
+ ├── minimax-scheduler/            # Hackathon: Minimax GPU job scheduler
102
+ │   └── backend/                  # FastAPI + minimax algorithm
103
+ │
104
+ ├── backend/                      # Legacy FastAPI backend (reference)
105
+ ├── frontend/                     # Legacy React frontend (reference)
106
+ ├── docker/                       # Docker configs for agent + backend
107
+ ├── docs/                         # Architecture docs, metrics schema
108
+ └── assets/                       # Logo and diagrams
109
+ ```
110
+
111
+ ---
112
+
113
+ ## Quick Start
114
+
115
+ ### Prerequisites
116
+
117
+ - Node.js 18+ and npm
118
+ - Python 3.8+
119
+ - A Supabase account ([supabase.com](https://supabase.com))
120
+ - An NVIDIA GPU (for the agent) or Google Colab with GPU runtime
121
+
122
+ ### 1. Clone the Repository
123
+
124
+ ```bash
125
+ git clone https://github.com/AgentMulder404/aluminatai-landing.git
126
+ cd aluminatai-landing
127
+ ```
128
+
129
+ ### 2. Set Up the Database (Supabase)
130
+
131
+ 1. Create a new project at [supabase.com](https://supabase.com)
132
+ 2. Go to **SQL Editor** and run the migrations in order:
133
+
134
+ ```bash
135
+ # Run these SQL files in the Supabase SQL Editor:
136
+ database/migrations/002_gpu_monitoring_schema_postgres.sql
137
+ database/migrations/003_fix_materialized_view.sql
138
+ database/migrations/004_fix_trigger_permissions.sql
139
+ database/migrations/005_secure_api_keys_and_constraints.sql
140
+ ```
141
+
142
+ This creates:
143
+ - `users` table with auto-generated API keys (using `pgcrypto`)
144
+ - `gpu_metrics` time-series table with CHECK constraints
145
+ - `gpu_jobs` table for job tracking
146
+ - `gpu_metrics_hourly` materialized view for fast dashboard queries
147
+ - Row-Level Security policies on all tables
148
+ - Triggers for user profile auto-creation on signup
149
+
150
+ ### 3. Set Up the Web Platform
151
+
152
+ ```bash
153
+ cd aluminatai-landing
154
+ npm install
155
+ ```
156
+
157
+ Create a `.env.local` file:
158
+
159
+ ```bash
160
+ # Supabase (from your project settings > API)
161
+ NEXT_PUBLIC_SUPABASE_URL=https://your-project.supabase.co
162
+ NEXT_PUBLIC_SUPABASE_ANON_KEY=your-anon-key
163
+ SUPABASE_SERVICE_ROLE_KEY=your-service-role-key
164
+
165
+ # Cron secret (generate with: openssl rand -base64 32)
166
+ CRON_SECRET=your-cron-secret
167
+ ```
168
+
169
+ Run the development server:
170
+
171
+ ```bash
172
+ npm run dev
173
+ ```
174
+
175
+ Visit `http://localhost:3000` - you should see the landing page.
176
+
177
+ ### 4. Create an Account
178
+
179
+ 1. Click **"Start Free Trial"** on the landing page
180
+ 2. Enter your name, email, and password
181
+ 3. You'll be redirected to the dashboard setup page
182
+ 4. Copy your API key (starts with `alum_`)
183
+
184
+ ### 5. Install the GPU Agent
185
+
186
+ On your GPU machine (or Google Colab):
187
+
188
+ ```bash
189
+ # Install dependencies
190
+ pip install pynvml requests python-dotenv rich
191
+
192
+ # Set environment variables
193
+ export ALUMINATAI_API_KEY="alum_your_key_here"
194
+ export ALUMINATAI_API_ENDPOINT="http://localhost:3000/api/metrics/ingest"
195
+
196
+ # Run the agent
197
+ python agent/main.py
198
+ ```
199
+
200
+ Options:
201
+
202
+ ```bash
203
+ # Custom sampling interval (1 second)
204
+ python agent/main.py --interval 1
205
+
206
+ # Save to CSV + upload
207
+ python agent/main.py --output data/metrics.csv
208
+
209
+ # Run for 5 minutes
210
+ python agent/main.py --duration 300
211
+
212
+ # Quiet mode (no console output)
213
+ python agent/main.py --quiet --output data/metrics.csv
214
+ ```
215
+
216
+ For production, use the systemd service:
217
+
218
+ ```bash
219
+ cd agent
220
+ chmod +x install.sh
221
+ sudo ./install.sh
222
+ ```
223
+
224
+ ### 6. Test on Google Colab (A100)
225
+
226
+ Upload `agent/tests/AluminatAI_A100_Test.ipynb` to Google Colab:
227
+
228
+ 1. Go to [colab.research.google.com](https://colab.research.google.com)
229
+ 2. **File > Upload notebook** and select the `.ipynb` file
230
+ 3. **Runtime > Change runtime type** > select **A100 GPU**
231
+ 4. Paste your API key in Cell 2
232
+ 5. **Runtime > Run all**
233
+
234
+ The notebook runs 7 test suites:
235
+ - NVML hardware access
236
+ - Collector class + energy calculation
237
+ - API authentication validation
238
+ - End-to-end collect + upload
239
+ - Stress test under GPU load (8192x8192 matmul)
240
+ - API key security audit
241
+ - 60-second continuous monitoring demo
242
+
243
+ ---
244
+
245
+ ## API Reference
246
+
247
+ ### Metrics Ingestion
248
+
249
+ ```
250
+ POST /api/metrics/ingest
251
+ Header: X-API-Key: alum_your_key_here
252
+ ```
253
+
254
+ **Request body** (single metric or array):
255
+
256
+ ```json
257
+ [
258
+ {
259
+ "timestamp": "2026-02-06T12:00:00Z",
260
+ "gpu_index": 0,
261
+ "gpu_uuid": "GPU-abc123",
262
+ "gpu_name": "NVIDIA A100-SXM4-40GB",
263
+ "power_draw_w": 285.5,
264
+ "power_limit_w": 400.0,
265
+ "energy_delta_j": 571.0,
266
+ "utilization_gpu_pct": 95,
267
+ "utilization_memory_pct": 60,
268
+ "temperature_c": 72,
269
+ "memory_used_mb": 32000,
270
+ "memory_total_mb": 40960
271
+ }
272
+ ]
273
+ ```
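For readers who want to exercise the endpoint without the agent, the sketch below builds (but does not send) an ingestion request using only the standard library. The helper name `build_ingest_request` is hypothetical. The endpoint URL and key are the placeholder values used elsewhere in this README, and the agent itself may build requests differently, for example with `requests`.

```python
import json
import urllib.request


def build_ingest_request(endpoint, api_key, metrics):
    """Build a POST request carrying a JSON array of metrics and the X-API-Key header."""
    body = json.dumps(metrics).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json", "X-API-Key": api_key},
        method="POST",
    )


metric = {
    "timestamp": "2026-02-06T12:00:00Z",
    "gpu_index": 0,
    "gpu_uuid": "GPU-abc123",
    "gpu_name": "NVIDIA A100-SXM4-40GB",
    "power_draw_w": 285.5,
    "power_limit_w": 400.0,
    "energy_delta_j": 571.0,
    "utilization_gpu_pct": 95,
    "utilization_memory_pct": 60,
    "temperature_c": 72,
    "memory_used_mb": 32000,
    "memory_total_mb": 40960,
}

request = build_ingest_request(
    "http://localhost:3000/api/metrics/ingest",  # placeholder endpoint
    "alum_your_key_here",                        # placeholder key
    [metric],
)
print(request.get_method(), request.full_url)
# To actually send it: urllib.request.urlopen(request, timeout=10)
```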
274
+
275
+ **Validation rules:**
276
+ - `power_draw_w`: 0-1500W
277
+ - `temperature_c`: 0-120C
278
+ - `utilization_*_pct`: 0-100
279
+ - `timestamp`: valid ISO 8601, not more than 5 minutes in the future
280
+ - Max 1000 metrics per request
281
+
282
+ **Rate limit:** 100 requests/minute per user
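Checking a batch locally before upload can avoid rejected requests. The sketch below mirrors the validation rules listed above as a client-side check. It is not the server's implementation: the function names are hypothetical, and treating the range bounds as inclusive is an assumption.

```python
from datetime import datetime, timedelta, timezone

MAX_METRICS_PER_REQUEST = 1000  # from the documented per-request limit


def validate_metric(metric, now=None):
    """Return a list of rule violations for one metric dict (empty list if none)."""
    now = now or datetime.now(timezone.utc)
    errors = []
    if not 0 <= metric["power_draw_w"] <= 1500:
        errors.append("power_draw_w outside 0-1500 W")
    if not 0 <= metric["temperature_c"] <= 120:
        errors.append("temperature_c outside 0-120 C")
    for key in ("utilization_gpu_pct", "utilization_memory_pct"):
        if not 0 <= metric[key] <= 100:
            errors.append(f"{key} outside 0-100")
    try:
        # fromisoformat on older Python versions does not accept a trailing 'Z'
        ts = datetime.fromisoformat(metric["timestamp"].replace("Z", "+00:00"))
    except ValueError:
        errors.append("timestamp is not valid ISO 8601")
    else:
        if ts.tzinfo is None:
            ts = ts.replace(tzinfo=timezone.utc)  # assume UTC when no offset is given
        if ts > now + timedelta(minutes=5):
            errors.append("timestamp more than 5 minutes in the future")
    return errors


def validate_batch(metrics, now=None):
    """Check the batch size limit, then each metric; returns a flat list of errors."""
    if len(metrics) > MAX_METRICS_PER_REQUEST:
        return [f"batch has {len(metrics)} metrics, limit is {MAX_METRICS_PER_REQUEST}"]
    return [err for m in metrics for err in validate_metric(m, now)]
```

Passing an explicit `now` keeps the timestamp check deterministic in tests.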
283
+
284
+ ### Dashboard APIs
285
+
286
+ | Endpoint | Method | Auth | Rate Limit | Description |
287
+ |---|---|---|---|---|
288
+ | `/api/dashboard/today-cost` | GET | Session | 60/min | Today's energy cost |
289
+ | `/api/dashboard/jobs` | GET | Session | 60/min | Job history with pagination |
290
+ | `/api/dashboard/utilization-chart` | GET | Session | 60/min | Time-series chart data |
291
+ | `/api/user/profile` | GET | Session | - | User profile + API key |
292
+ | `/api/user/profile` | PATCH | Session | - | Update profile settings |
293
+ | `/api/user/profile` | POST | Session | 5/hr | Rotate API key |
294
+
295
+ ### API Key Rotation
296
+
297
+ ```bash
298
+ curl -X POST https://aluminatiai-landing.vercel.app/api/user/profile \
299
+ -H "Content-Type: application/json" \
300
+ -H "Cookie: your-session-cookie" \
301
+ -d '{"action": "rotate_api_key"}'
302
+ ```
303
+
304
+ ---
305
+
306
+ ## Security
307
+
308
+ - **API Keys**: Generated with `pgcrypto gen_random_bytes()` - 340 bits of entropy
309
+ - **Row-Level Security**: Users can only access their own data
310
+ - **Rate Limiting**: Per-user limits on all endpoints
311
+ - **Input Validation**: Server-side + database CHECK constraints
312
+ - **HTTPS**: Enforced by Vercel
313
+ - **No ambiguous characters**: API keys exclude `0, O, I, l, 1` to prevent copy errors
314
+
315
+ ---
316
+
317
+ ## Deployment
318
+
319
+ ### Vercel (Web Platform)
320
+
321
+ ```bash
322
+ # Install Vercel CLI
323
+ npm i -g vercel
324
+
325
+ # Deploy
326
+ cd aluminatai-landing
327
+ vercel
328
+
329
+ # Set environment variables in Vercel dashboard
330
+ ```
331
+
332
+ ### Cron Job (Materialized View Refresh)
333
+
334
+ Set up a cron job to refresh the hourly metrics view:
335
+
336
+ - **URL**: `https://your-app.vercel.app/api/cron/refresh-metrics`
337
+ - **Method**: POST
338
+ - **Header**: `Authorization: Bearer your-cron-secret`
339
+ - **Schedule**: Every hour (`0 * * * *`)
340
+
341
+ You can use [cron-job.org](https://cron-job.org) (free) or Vercel Cron.
342
+
343
+ ---
344
+
345
+ ## Tech Stack
346
+
347
+ | Component | Technology |
348
+ |---|---|
349
+ | Web Framework | Next.js 16 |
350
+ | UI | React 19 + Tailwind CSS 4 |
351
+ | Charts | Recharts |
352
+ | Database | Supabase PostgreSQL |
353
+ | Auth | Supabase Auth |
354
+ | GPU Agent | Python + pynvml (NVML) |
355
+ | Deployment | Vercel |
356
+ | Scheduler | Minimax with alpha-beta pruning |
357
+
358
+ ---
359
+
360
+ ## Minimax GPU Scheduler
361
+
362
+ A bonus hackathon project in `minimax-scheduler/` that uses game theory to optimize GPU job scheduling:
363
+
364
+ - **Speed Player (Maximizer)**: Wants to complete jobs ASAP
365
+ - **Cost Player (Minimizer)**: Wants to minimize energy costs
366
+ - **Alpha-Beta Pruning**: Efficiently explores the decision tree
367
+ - **Result**: 15-30% cost savings vs. naive FIFO scheduling
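To make the maximizer/minimizer framing concrete, here is a generic textbook minimax with alpha-beta pruning over a small hand-built tree. It is not the scheduler's code: the tree, its leaf scores, and the function name are illustrative assumptions, with leaves standing in for whatever score the scheduler assigns to a candidate schedule.

```python
import math


def alphabeta(node, maximizing, alpha=-math.inf, beta=math.inf):
    """Minimax value of a nested-list game tree; leaves are numeric scores."""
    if not isinstance(node, list):
        return node  # leaf: score of a finished decision sequence
    if maximizing:
        best = -math.inf
        for child in node:
            best = max(best, alphabeta(child, False, alpha, beta))
            alpha = max(alpha, best)
            if alpha >= beta:
                break  # the minimizer already has a better option elsewhere
        return best
    best = math.inf
    for child in node:
        best = min(best, alphabeta(child, True, alpha, beta))
        beta = min(beta, best)
        if alpha >= beta:
            break  # the maximizer already has a better option elsewhere
    return best


# Root is the maximizer's move; each inner list is the minimizer's reply
tree = [[3, 5], [2, 9], [0, 1]]
print(alphabeta(tree, maximizing=True))
```

Pruning skips the leaves 9 and 1 here, since the first leaf of each later branch is already worse for the maximizer than the value 3 secured by the first branch.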
368
+
369
+ ```bash
370
+ cd minimax-scheduler/backend
371
+ pip install -r requirements.txt
372
+ python demo.py
373
+ ```
374
+
375
+ ---
376
+
377
+ ## Contributing
378
+
379
+ 1. Fork the repository
380
+ 2. Create a feature branch (`git checkout -b feature/my-feature`)
381
+ 3. Commit your changes
382
+ 4. Push to the branch (`git push origin feature/my-feature`)
383
+ 5. Open a Pull Request
384
+
385
+ ---
386
+
387
+ ## License
388
+
389
+ This project is open source. See [LICENSE](LICENSE) for details.
390
+
391
+ ---
392
+
393
+ Built by [@AgentMulder404](https://github.com/AgentMulder404)