groove-dev 0.27.108 → 0.27.110

Files changed (66)
  1. package/DYNAMIC_LEAF_ARCH.md +488 -0
  2. package/MERKLE_TREE_ARCHITECTURE.md +354 -0
  3. package/TRAINING_DATA.md +12 -9
  4. package/codex/browser-racing-game/README.md +45 -0
  5. package/codex/browser-racing-game/dist/assets/index-D-sGTraQ.js +47 -0
  6. package/codex/browser-racing-game/dist/assets/index-S75nJv69.css +1 -0
  7. package/codex/browser-racing-game/dist/index.html +14 -0
  8. package/codex/browser-racing-game/index.html +13 -0
  9. package/codex/browser-racing-game/package-lock.json +841 -0
  10. package/codex/browser-racing-game/package.json +15 -0
  11. package/codex/browser-racing-game/src/app.css +359 -0
  12. package/codex/browser-racing-game/src/main.ts +913 -0
  13. package/codex/browser-racing-game/tsconfig.json +20 -0
  14. package/codex/browser-racing-game/vite.config.ts +12 -0
  15. package/moe-training/client/domain-tagger.js +224 -30
  16. package/moe-training/client/trajectory-capture.js +2 -0
  17. package/moe-training/shared/envelope-schema.js +38 -0
  18. package/moe-training/test/client/domain-tagger.test.js +107 -0
  19. package/moe-training/test/shared/envelope-schema.test.js +116 -0
  20. package/node_modules/@groove-dev/cli/package.json +1 -1
  21. package/node_modules/@groove-dev/daemon/package.json +1 -1
  22. package/node_modules/@groove-dev/daemon/src/api.js +25 -5
  23. package/node_modules/@groove-dev/daemon/src/index.js +1 -1
  24. package/node_modules/@groove-dev/daemon/src/journalist.js +24 -18
  25. package/node_modules/@groove-dev/daemon/src/preview.js +113 -9
  26. package/node_modules/@groove-dev/daemon/src/process.js +12 -31
  27. package/node_modules/@groove-dev/daemon/src/providers/base.js +1 -0
  28. package/node_modules/@groove-dev/daemon/src/providers/codex.js +28 -9
  29. package/node_modules/@groove-dev/daemon/src/rotator.js +6 -1
  30. package/node_modules/@groove-dev/daemon/src/tunnel-manager.js +179 -36
  31. package/node_modules/@groove-dev/daemon/test/codex-provider.test.js +63 -0
  32. package/node_modules/@groove-dev/daemon/test/rotator.test.js +10 -10
  33. package/node_modules/@groove-dev/gui/dist/assets/{index-CEgtSfbG.js → index-B8JomvGM.js} +38 -38
  34. package/node_modules/@groove-dev/gui/dist/assets/{index-_3cJS_UG.css → index-DAlSbVyK.css} +1 -1
  35. package/node_modules/@groove-dev/gui/dist/index.html +2 -2
  36. package/node_modules/@groove-dev/gui/package.json +1 -1
  37. package/node_modules/@groove-dev/gui/src/components/preview/preview-workspace.jsx +1 -3
  38. package/node_modules/@groove-dev/gui/src/components/settings/quick-connect.jsx +22 -5
  39. package/node_modules/@groove-dev/gui/src/components/settings/ssh-wizard.jsx +9 -0
  40. package/node_modules/@groove-dev/gui/src/stores/groove.js +24 -0
  41. package/node_modules/moe-training/client/domain-tagger.js +224 -30
  42. package/node_modules/moe-training/client/trajectory-capture.js +2 -0
  43. package/node_modules/moe-training/shared/envelope-schema.js +38 -0
  44. package/node_modules/moe-training/test/client/domain-tagger.test.js +107 -0
  45. package/node_modules/moe-training/test/shared/envelope-schema.test.js +116 -0
  46. package/package.json +1 -1
  47. package/packages/cli/package.json +1 -1
  48. package/packages/daemon/package.json +1 -1
  49. package/packages/daemon/src/api.js +25 -5
  50. package/packages/daemon/src/index.js +1 -1
  51. package/packages/daemon/src/journalist.js +24 -18
  52. package/packages/daemon/src/preview.js +113 -9
  53. package/packages/daemon/src/process.js +12 -31
  54. package/packages/daemon/src/providers/base.js +1 -0
  55. package/packages/daemon/src/providers/codex.js +28 -9
  56. package/packages/daemon/src/rotator.js +6 -1
  57. package/packages/daemon/src/tunnel-manager.js +179 -36
  58. package/packages/gui/dist/assets/{index-CEgtSfbG.js → index-B8JomvGM.js} +38 -38
  59. package/packages/gui/dist/assets/{index-_3cJS_UG.css → index-DAlSbVyK.css} +1 -1
  60. package/packages/gui/dist/index.html +2 -2
  61. package/packages/gui/package.json +1 -1
  62. package/packages/gui/src/components/preview/preview-workspace.jsx +1 -3
  63. package/packages/gui/src/components/settings/quick-connect.jsx +22 -5
  64. package/packages/gui/src/components/settings/ssh-wizard.jsx +9 -0
  65. package/packages/gui/src/stores/groove.js +24 -0
  66. package/ssh/main.js +0 -2253
@@ -0,0 +1,488 @@

# Dynamic Leaf Architecture

**The tree grows itself.**

---

## 1. Executive Summary

The current Hummingbird design uses hardcoded domains: ten skill leaves (Python, React, PostgreSQL, DevOps, Rust, TypeScript, Data Science, Security, Mobile, System Design) and six reasoning leaves (research, strategy, analysis, debate, synthesis, teaching). The PoC validates the core primitive: a lightweight semantic router selects the right leaf with 93.8% accuracy in under 1ms. The architecture works.

But it does not scale.

Ten skill leaves means ten domains. Domain eleven requires manual intervention: someone must define the domain, write a description, generate a centroid, collect training data, and trigger training. This is fine for a proof of concept. It is fatal for a network that needs to serve molecular biology, legal contract analysis, Kubernetes networking, lightning physics, audio synthesis, and a thousand domains nobody has imagined yet.

This document introduces **Dynamic Leaf Architecture**: leaves emerge organically from user behavior through embedding-space clustering. No human defines domains. The tree discovers them from usage patterns. The tree grows, prunes, splits, and merges itself.

This is the difference between building a model and growing an intelligence.

| Aspect | Fixed Domains (current) | Dynamic Leaves (this doc) |
|--------|------------------------|--------------------------|
| Domain count | 10 skill + 6 reasoning | Unlimited, emergent |
| Adding a domain | Manual definition, data curation, training | Automatic from usage clusters |
| Centroid generation | Human writes description, encodes it | Cluster centroid computed from real sessions |
| Training data curation | Manual tagging and filtering | Cluster membership defines the dataset |
| Tree depth | Flat (one level) | Hierarchical (2-3 levels, data-driven) |
| Lifecycle management | None | Automated: seed, sprout, mature, prune, split, merge |

---

## 2. The Problem with Fixed Domains

Ten hardcoded skill leaves means ten domains. Domain eleven requires someone to sit down and decide: what the domain is called, what its description is, which prompts belong to it, where its training data comes from, and when it should be trained. Then they must write a centroid description, encode it through all-MiniLM-L6-v2, add it to the router, collect tagged sessions, run the five-step training pipeline, evaluate, and deploy.

This process has three failure modes:

**Boundary ambiguity.** Does "Kubernetes networking" belong in `devops_docker` or a new `networking` leaf? Does "Python data science" belong in `python_expert` or `data_science_ml`? The PoC already shows this: "Write a Python script that trains a neural network" routed to `python` instead of `data_science_ml`. Fixed taxonomies create artificial boundaries that the router must navigate with imperfect cosine similarity.

**Coverage gaps.** A user working on bioinformatics pipelines gets routed to whichever existing leaf is least wrong. A user debugging Erlang concurrency gets routed to `rust` because ownership semantics are vaguely similar. The system cannot serve what it does not know exists.

**Flywheel stall.** The Hummingbird network effect depends on the tree growing with usage. If tree growth requires manual curation, the flywheel stalls. The vision document promises "federated neuroevolution" — algorithmic pruning of dormant leaves and propagation of successful ones. That requires an automated leaf lifecycle, not a hardcoded list maintained by a human.

Fixed domains are training wheels. They proved the router works. Now the training wheels come off.

---

## 3. Core Principle: The Embedding Is the Domain

The key insight is that you do not need domain labels. The embedding vector IS the domain definition.

A session about "debugging asyncio race conditions" and a session about "writing async HTTP clients in Python" naturally embed close to each other in the MiniLM vector space. You do not need a human to label both as "python_async." The math does it.

This eliminates domain taxonomy maintenance, edge-case arguments, and granularity debates. The router already works this way — cosine similarity against centroids. We just stop hardcoding the centroids and let them emerge from data.

**How embedding-based domain discovery works:**

1. Every session's task prompt gets encoded by the router's encoder (all-MiniLM-L6-v2) at collection time. The PoC benchmarks this at 11ms average — negligible overhead.

2. The 384-dimensional embedding vector is stored alongside the session telemetry.

3. As sessions accumulate, their embeddings form a point cloud in 384-dimensional vector space.

4. Natural clusters in this space ARE domains. They are groups of sessions that are semantically similar — not because someone labeled them, but because the math of language similarity grouped them.

5. The cluster centroid IS the leaf's routing centroid. No manual centroid generation needed.

6. The cluster's member sessions ARE the leaf's training data. No manual data curation needed.

The domain taxonomy is not designed. It is discovered. And it is discovered from exactly the same embedding space the router already uses, which means routing accuracy improves automatically — the centroids are computed from real user prompts rather than human-authored domain descriptions.
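Steps 5 and 6 above reduce to a few lines of NumPy. The sketch below is illustrative, not the production implementation: given per-session embeddings and HDBSCAN cluster labels (HDBSCAN uses `-1` for noise points), it derives one unit-normalized routing centroid per discovered cluster.

```python
import numpy as np

def leaf_centroids(embeddings, labels):
    """Compute one routing centroid per discovered cluster.

    embeddings: (n_sessions, 384) array of session embeddings.
    labels: cluster id per session; -1 is HDBSCAN's noise label and
    contributes to no leaf.
    """
    labels = np.asarray(labels)
    centroids = {}
    for cid in set(labels.tolist()):
        if cid == -1:
            continue  # noise points belong to no leaf
        members = embeddings[labels == cid]
        c = members.mean(axis=0)
        centroids[cid] = c / np.linalg.norm(c)  # unit-normalize for cosine routing
    return centroids
```

The member rows of each cluster double as that leaf's training set, which is exactly the "cluster membership defines the dataset" property from the table in Section 1.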
---

## 4. Clustering Algorithm

### Why HDBSCAN

The clustering algorithm must discover domains without being told how many exist. This rules out K-means (which requires pre-specifying K) and makes DBSCAN fragile (it requires a fixed epsilon distance threshold that is hard to tune across diverse domains with different densities).

**HDBSCAN** (Hierarchical Density-Based Spatial Clustering of Applications with Noise) is the right algorithm:

- Automatically discovers the number of clusters from data density
- Handles variable-density clusters — some domains will have thousands of sessions, others hundreds
- Naturally handles noise — sessions that do not belong to any clear domain are labeled as outliers rather than forced into a wrong cluster
- Produces a hierarchy of clusters at different granularities

**Parameters:**

```python
import hdbscan
import numpy as np

# The hdbscan library does not support metric='cosine' directly. L2-normalizing
# the embeddings first makes Euclidean distance a monotonic function of cosine
# distance, so density-based clustering behaves as if it used cosine.
embeddings = session_embeddings / np.linalg.norm(session_embeddings, axis=1, keepdims=True)

clusterer = hdbscan.HDBSCAN(
    min_cluster_size=200,            # minimum sessions to form a domain
    min_samples=50,                  # density threshold — core point requirement
    metric='euclidean',              # on unit vectors, equivalent to the router's cosine metric
    cluster_selection_method='eom',  # Excess of Mass for natural hierarchy
    prediction_data=True             # enable approximate prediction for new points
)

labels = clusterer.fit_predict(embeddings)
```

### Hierarchical Discovery

The tree depth emerges from data density. Popular areas get deeper trees; niche areas stay shallow.

**Level 1** — broad branches. Run HDBSCAN with `min_cluster_size=1000`. Discovers top-level categories: programming, science, operations, research, design, etc.

**Level 2** — domains. Within each Level 1 cluster, run HDBSCAN with `min_cluster_size=200`. Discovers specific domains: Python, React, PostgreSQL within the programming branch.

**Level 3** — specializations. Within large Level 2 clusters (1000+ sessions), an optional run with `min_cluster_size=100`. Discovers sub-specializations: async Python, Python testing, Python data pipelines within the Python domain.

### Hierarchical Routing

The router traverses the hierarchy from most specific to most general:

1. Embed the prompt → check Level 3 centroids first (most specific match)
2. If the best Level 3 match confidence < 0.25 → fall back to Level 2
3. If the best Level 2 match confidence < 0.20 → fall back to Level 1
4. If nothing matches well → route to the chassis with no adapter (general-purpose fallback)

This graceful degradation means the system always has an answer. Novel domains that have not yet clustered still get a reasonable response from the base chassis or nearest parent branch. The PoC already proved the chassis produces useful output without any leaf loaded.
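The four-step traversal above can be sketched as a single loop over levels, from most specific to most general. This is a minimal sketch: the 0.25 and 0.20 gates come from the steps above, but the Level 1 gate of 0.15 is an assumption (the document only says "if nothing matches well"), and the `route` function name is illustrative.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def route(prompt_vec, levels, thresholds=(0.25, 0.20, 0.15)):
    """Hierarchical routing with graceful degradation.

    levels: list of {leaf_id: centroid} dicts, ordered Level 3 -> Level 1.
    thresholds: minimum confidence to accept a match at each level
    (the Level 1 value 0.15 is an assumed gate, not from the doc).
    """
    for centroids, min_conf in zip(levels, thresholds):
        if not centroids:
            continue
        leaf, conf = max(
            ((lid, cosine(prompt_vec, c)) for lid, c in centroids.items()),
            key=lambda x: x[1],
        )
        if conf >= min_conf:
            return leaf, conf
    return "chassis", 0.0  # general-purpose fallback, no adapter loaded
```

Because the final branch always returns the bare chassis, the router can never fail to produce a destination, which is the graceful-degradation property described above.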
---

## 5. The Leaf Lifecycle

Leaves are not static artifacts. They are born, grow, mature, evolve, and sometimes die. Each leaf occupies exactly one lifecycle stage at any time.

### SEED — cluster detected, not enough data

A new cluster emerges from HDBSCAN with 200+ sessions. This is a signal: users are doing something coherent in this region of embedding space, but there is not yet enough data to train a specialized leaf.

- A centroid is computed and registered in the router
- No leaf is trained yet — sessions route to the nearest existing leaf or the parent branch
- Sessions continue accumulating in the seed cluster
- The router tracks that this is an imprecise match via confidence scores

**Metadata:** `{ "stage": "seed", "centroid": [...], "session_count": 247, "created": "2026-06-15", "growth_rate": "12/day", "origin": "data_driven" }`

### SPROUT — first training

The cluster hits the training threshold: 500 sessions for skill leaves, 300 for reasoning leaves.

- The first LoRA leaf is trained using the SFT pipeline (Stage 1 only — not enough preference pairs for DPO yet)
- The router activates the leaf with a confidence gate: if routing confidence < 0.3, fall back to the parent branch
- Evaluation runs against the parent leaf — if the sprout does not outperform the parent on domain-specific prompts, it stays in the sprout phase and collects more data
- The sprout must prove itself before earning full routing trust

**Metadata added:** `{ "stage": "sprout", "first_trained": "2026-07-01", "eval_vs_parent": "+8% keyword density", "confidence_gate": 0.3, "training_method": "sft_only" }`

### MATURE — production quality

The leaf has been trained on 500+ sessions AND retrained at least once with fresh data. Evaluation confirms it outperforms the parent leaf on its domain.

- DPO stage applied — enough TIER_A vs TIER_C pairs now exist for preference optimization
- Confidence gate removed — full routing trust
- The leaf enters the gossip propagation pool — other nodes in the P2P mesh can request it
- Retraining is triggered periodically as new sessions accumulate (every 500 new sessions or monthly, whichever comes first)

**Metadata added:** `{ "stage": "mature", "maturity_date": "2026-08-15", "retrain_count": 3, "training_method": "sft+dpo", "gossip_propagation_count": 47 }`

### PRUNE — dormant or obsolete

A leaf stops earning its place. Triggers:

- Not routed to in 30+ consecutive days
- New sessions for this cluster dropped below 10/month
- Evaluation on retrain shows regression vs the previous version

The leaf is **archived**, not deleted. Its centroid is removed from the active router. Its accumulated sessions are absorbed back into the parent cluster's training pool. If the domain revives — new sessions start clustering in that region again — the archived leaf can be restored and retrained rather than starting from scratch.

### SPLIT — specialization

A mature leaf's session embeddings, when re-clustered internally, reveal 2+ distinct sub-clusters with `min_cluster_size=100` each.

Example: the "Python" leaf splits into "Python backend/API," "Python data processing," and "Python testing." Each child gets its own training run on its sub-cluster's sessions. The parent leaf remains as a fallback. Child leaf centroids are checked first during routing; if confidence is low, the router falls back to the parent.

### MERGE — consolidation

Two adjacent leaves have centroid cosine similarity > 0.92, AND neither has enough differentiated training data to justify separate existence (combined sessions < 800).

Example: "Express.js middleware" and "Node.js API development" merge into a unified "Node.js backend" leaf. Sessions from both clusters combine into one training set. Both old centroids are replaced by the merged centroid.
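The merge criterion is mechanical enough to sketch directly. The function names and leaf-record shape below are illustrative, and the session-count-weighted combination in `merged_centroid` is one plausible way to produce the merged centroid; the document does not specify the exact combination rule.

```python
import numpy as np

SIM_THRESHOLD = 0.92   # centroid cosine similarity required to consider merging
MIN_COMBINED = 800     # below this combined count, neither leaf justifies itself

def should_merge(leaf_a, leaf_b):
    """leaf_x: dict with 'centroid' (unit vector) and 'session_count'."""
    sim = float(np.dot(leaf_a["centroid"], leaf_b["centroid"]))
    combined = leaf_a["session_count"] + leaf_b["session_count"]
    return sim > SIM_THRESHOLD and combined < MIN_COMBINED

def merged_centroid(leaf_a, leaf_b):
    # Session-count-weighted mean of the two centroids, re-normalized
    # (an assumed rule; the doc only says the old centroids are replaced).
    c = (leaf_a["session_count"] * leaf_a["centroid"]
         + leaf_b["session_count"] * leaf_b["centroid"])
    return c / np.linalg.norm(c)
```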
---

## 6. The Lifecycle Engine

A periodic process — daily by default, or triggered when the new-session count exceeds a threshold — manages the entire tree.

```
┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐
│  EMBED   │───▶│ CLUSTER  │───▶│  DETECT  │───▶│  TRAIN   │───▶│ EVALUATE │───▶│  DEPLOY  │
│          │    │          │    │          │    │          │    │          │    │          │
│ Encode   │    │ HDBSCAN  │    │ Compare  │    │ SFT/DPO  │    │ Bench vs │    │ Update   │
│ new      │    │ on full  │    │ clusters │    │ for new/ │    │ parent & │    │ router,  │
│ sessions │    │ embedding│    │ against  │    │ changed  │    │ previous │    │ push S3, │
│          │    │ space    │    │ known    │    │ leaves   │    │ version  │    │ notify   │
│          │    │          │    │ leaves   │    │          │    │          │    │ mesh     │
└──────────┘    └──────────┘    └──────────┘    └──────────┘    └──────────┘    └──────────┘
```

**Step 1 — EMBED:** Encode any new sessions that do not have embeddings yet, using the same all-MiniLM-L6-v2 encoder the router uses, producing 384-dim vectors. At 11ms per embedding, 1000 new sessions take ~11 seconds.

**Step 2 — CLUSTER:** Run HDBSCAN on the full embedding space (or an incremental update at scale). Output: cluster labels, centroids, hierarchy, noise points.

**Step 3 — DETECT:** Compare current clusters against known leaves.

| Condition | Action |
|-----------|--------|
| New cluster, no matching leaf | Create SEED |
| Existing SEED hit training threshold | Promote to SPROUT, trigger training |
| Existing SPROUT retrained + outperforms parent | Promote to MATURE |
| Existing MATURE not routed to in 30 days | PRUNE |
| Existing MATURE showing internal sub-clusters | Evaluate SPLIT |
| Two leaves with cosine similarity > 0.92 | Evaluate MERGE |
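The detection table translates almost directly into a dispatch function. This is a hedged sketch: the leaf record is a hypothetical dict shaped like the metadata examples in Section 5, field names such as `days_since_routed`, `beats_parent`, and `sub_cluster_count` are illustrative, and the 500-session threshold is the skill-leaf value (reasoning leaves would use 300).

```python
def detect_transition(leaf):
    """Map a leaf's current state onto a lifecycle action.

    `leaf` is a dict shaped like the Section 5 metadata examples;
    the field names here are illustrative, not a fixed schema.
    """
    stage = leaf["stage"]
    if stage == "seed" and leaf["session_count"] >= 500:
        return "promote_to_sprout"      # triggers the first SFT training run
    if stage == "sprout" and leaf.get("retrain_count", 0) >= 1 and leaf.get("beats_parent"):
        return "promote_to_mature"      # confidence gate removed downstream
    if stage == "mature":
        if leaf.get("days_since_routed", 0) > 30:
            return "prune"              # archive the leaf, never delete it
        if leaf.get("sub_cluster_count", 0) >= 2:
            return "evaluate_split"
    return "no_change"
```

In the real engine this would run per-leaf on every cycle, with the MERGE check handled separately since it compares pairs of leaves rather than a single record.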
**Step 4 — TRAIN:** Run SFT (and DPO if applicable) for any leaves that need training or retraining. Uses the same five-step pipeline from the AWS training plan, but triggered automatically rather than manually.

**Step 5 — EVALUATE:** Benchmark new/retrained leaves against the parent leaf and the previous version. A leaf that regresses stays at its current stage or gets pruned.

**Step 6 — DEPLOY:** Update router centroids, push new leaf bundles to the S3 registry, notify the P2P mesh via gossip.

This is the "gossip state" from the vision document: "During low-utilization periods (overnight), the network enters a gossip state. Nodes share localized error rates and successful tool-call trajectories. The network algorithmically prunes dormant leaves and propagates highly successful LoRA weights across the mesh."

The lifecycle engine IS the gossip protocol's brain.

---

## 7. Telemetry Changes

The current telemetry pipeline captures envelope records with trajectory logs and SESSION_CLOSE outcomes. Dynamic Leaf Architecture requires two additions.

### Add: Session Embedding

Computed at collection time. 11ms average, negligible overhead.

```json
"session_embedding": {
  "model": "all-MiniLM-L6-v2",
  "vector": [0.0234, -0.0891, 0.1247, 0.0562, "... 384 dimensions"],
  "source_text": "Write a Python decorator that caches function results with TTL expiry"
}
```

The embedding is computed from the first 512 characters of the task prompt. The clustering layer derives domains from embeddings — the collection layer does not need to classify anything.

### Add: Routing Feedback

Added to SESSION_CLOSE records. This creates the feedback loop: we know which leaf handled each session and can measure per-leaf quality.

```json
"routing": {
  "leaf_id": "python_backend_v3",
  "leaf_lifecycle_stage": "mature",
  "routing_confidence": 0.42,
  "fallback_used": false,
  "parent_leaf_id": "python_v2"
}
```

### Remove: Fixed Domain Tags

The AWS training plan references domain tagging with fixed classifications. This is replaced by embedding-based clustering. The telemetry pipeline no longer needs to assign `domain_tags` at collection time. Domains are discovered downstream by the lifecycle engine.

---

## 8. Cold Start Strategy

On day one there is no data to cluster. The tree needs starter leaves.

**Option A — Synthetic seeding:** Use an LLM to generate 100 diverse prompts per broad area. Embed them to create initial cluster seeds. Train lightweight starter leaves on synthetic data. As real data flows in, starters get retrained and eventually replaced.

**Option B — Pre-computed centroids (current PoC approach):** Use the 10 domain descriptions from the PoC as initial centroids. No trained leaves yet — route to chassis + system prompt. As data accumulates per centroid region, leaves spawn naturally.

**Option C — Hybrid (recommended):** Start with Option B's pre-computed centroids for immediate routing. The system works from day one using the PoC's proven centroids. Run HDBSCAN after 1000 total sessions. Replace the pre-computed centroids with data-driven clusters. From that point forward, the tree is fully emergent.

The starter centroids are scaffolding. They are explicitly temporary. The real tree grows from real data.

```json
{
  "leaf_id": "python_starter",
  "origin": "synthetic",
  "stage": "sprout",
  "note": "Pre-computed centroid from PoC. Will be replaced by data-driven cluster."
}
```

vs.

```json
{
  "leaf_id": "python_backend_cluster_47",
  "origin": "data_driven",
  "stage": "mature",
  "session_count": 4200,
  "cluster_id": 47
}
```

The `origin` field lets the system know which leaves are scaffolding and which grew from real usage. Once enough data-driven leaves exist, all synthetic-origin leaves are deprecated.

---

## 9. Scaling Considerations

| Session Count | Expected Clusters | HDBSCAN Time | Recommended Approach |
|---------------|-------------------|--------------|---------------------|
| 1,000 | 5-10 | < 1 second | Full re-cluster on every engine run |
| 10,000 | 15-30 | 2-5 seconds | Full re-cluster daily |
| 100,000 | 50-100 | 1-3 minutes | Full re-cluster weekly, incremental daily |
| 1,000,000 | 200-500 | 10-30 minutes | Full re-cluster monthly, FAISS approximate nearest neighbor for daily assignment |

At 10,000 sessions, HDBSCAN on 384-dimensional vectors takes seconds. The lifecycle engine runs comfortably as a daily cron job on the existing g4dn.xlarge training instance.

At 100,000 sessions, full HDBSCAN still completes in minutes. The engine switches to an incremental model: new sessions are assigned to the nearest existing centroid daily (simple cosine similarity, sub-second), and the full HDBSCAN re-cluster runs weekly to detect new clusters, splits, and merges.
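The daily incremental pass described above is a batched nearest-centroid lookup. A minimal NumPy sketch (FAISS would replace the matrix product at the million-session scale): the `noise_threshold` of 0.15 is an assumption, chosen so that poor matches stay unassigned until the next full re-cluster picks them up.

```python
import numpy as np

def assign_incremental(new_embeddings, centroids, noise_threshold=0.15):
    """Assign each new session to the nearest leaf centroid by cosine similarity.

    centroids: (k, d) array of unit-normalized leaf centroids.
    Returns (labels, similarities); label -1 marks sessions left as noise
    for the next full HDBSCAN re-cluster (threshold value is an assumption).
    """
    X = new_embeddings / np.linalg.norm(new_embeddings, axis=1, keepdims=True)
    sims = X @ centroids.T            # cosine similarity (centroids are unit-norm)
    labels = sims.argmax(axis=1)
    best = sims.max(axis=1)
    labels[best < noise_threshold] = -1
    return labels, best
```

Because this is a single matrix product, a day's worth of new sessions against a few hundred centroids runs in well under a second, consistent with the table above.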
At 1,000,000 sessions — the scale implied by the vision document's "1,000 node" network — full HDBSCAN becomes expensive. The system switches to FAISS (Facebook AI Similarity Search) for approximate nearest neighbor assignment on daily runs, with full HDBSCAN monthly to maintain cluster accuracy. The tree at this scale has 200-500 leaves across 3 levels of hierarchy. This is a self-organizing intelligence network.

---

## 10. What This Means for the Training Pipeline

The AWS training plan describes a manual pipeline: a human defines domains, curates training data, triggers training, evaluates, and deploys. Dynamic Leaf Architecture automates all of it.

| Step | Before (fixed domains) | After (dynamic leaves) |
|------|----------------------|----------------------|
| Domain definition | Human writes description | Lifecycle engine discovers from embedding clusters |
| Training data curation | Human tags and filters sessions | Cluster membership defines the dataset automatically |
| Training trigger | Human runs scripts | Automated when cluster hits threshold |
| Evaluation | Human reviews benchmark output | Automated: benchmark against parent leaf |
| Deployment | Human uploads to S3 | Automated: push to S3, update router, notify mesh |
| Human role | Operate everything | Review lifecycle dashboard, approve splits/merges, set global thresholds |

**New infrastructure required:**

- **Embedding store:** An indexed structure (S3 with a manifest, or a lightweight vector database like ChromaDB) for session embeddings, queryable by cluster assignment
- **Clustering service:** Periodic HDBSCAN job — runs on the training GPU instance alongside training jobs, or as a CPU-only job, since clustering does not require a GPU
- **Lifecycle engine:** Orchestrator that manages seed/sprout/mature/prune/split/merge state transitions and triggers the training pipeline
- **Dashboard:** Visibility into the tree — how many leaves at each stage, which are growing, which are being pruned, routing confidence distributions, cluster evolution over time

The five training scripts (`01_setup_and_preprocess.py` through `05_package_leaf.py`) remain unchanged. What changes is how they get invoked: the lifecycle engine calls them instead of a human. The preprocessing step pulls from cluster-assigned sessions instead of domain-tagged sessions.

---

## 11. The Tree Visualization

### Early Stage — 1,000 sessions, ~2 months in

```
Hummingbird Tree
├── programming (starter centroid)
│   ├── python_general [SPROUT, 280 sessions, confidence gate: 0.3]
│   └── javascript_web [SEED, 150 sessions, waiting for 500]
├── operations (starter centroid)
│   └── docker_k8s [SEED, 90 sessions]
├── research (starter centroid, no children yet)
└── 340 unassigned sessions (noise — waiting for clusters to form)
```

Most sessions are still routing to starter centroids. The first data-driven leaves are appearing as seeds and sprouts. The tree is mostly scaffolding.

### Growth Stage — 50,000 sessions, ~6 months in

```
Hummingbird Tree
├── Backend Programming
│   ├── Python Backend [MATURE, 4,200 sessions]
│   │   ├── Python Async/API [SPROUT, 800 sessions]
│   │   └── Python Data Processing [SPROUT, 650 sessions]
│   ├── Rust Systems [MATURE, 1,800 sessions]
│   └── Go Services [SEED, 280 sessions]
├── Frontend Development
│   ├── React/Next.js [MATURE, 3,500 sessions]
│   ├── Vue/Nuxt [SPROUT, 420 sessions]
│   └── CSS/Design Systems [SEED, 190 sessions]
├── Data & ML
│   ├── ML Training Pipelines [MATURE, 2,100 sessions]
│   ├── Data Engineering [SPROUT, 680 sessions]
│   └── LLM/Prompt Engineering [MATURE, 1,900 sessions]
├── Infrastructure
│   ├── Kubernetes [MATURE, 2,800 sessions]
│   ├── AWS/Cloud [MATURE, 1,500 sessions]
│   └── CI/CD Pipelines [SPROUT, 550 sessions]
├── Research & Analysis
│   ├── Technical Research [MATURE, 1,200 sessions]
│   ├── Strategy/Architecture [SPROUT, 400 sessions]
│   └── Scientific Research [SEED, 220 sessions]
├── Specialized Domains
│   ├── Bioinformatics [SEED, 180 sessions]
│   ├── Legal/Compliance [SEED, 150 sessions]
│   └── Finance/Trading [SEED, 210 sessions]
└── 8,200 unassigned/noise sessions
```

This tree was not designed. It grew from what 50,000 real users actually did. The "Specialized Domains" branch appeared because enough bioinformatics researchers, legal analysts, and finance developers used the network to form distinct embedding clusters. Nobody decided those should exist. The data decided.

The starter centroids are gone. Every leaf is data-driven. The tree has three levels of hierarchy where data density supports it (Python splitting into async and data processing), and stays flat where it does not (Go Services is still a seed).

### Scale Stage — 500,000 sessions, ~18 months in

The tree has 200+ leaves across 3 levels. Mature leaves are being propagated across the P2P mesh via gossip. Dormant leaves (old framework-specific leaves that lost users) are being pruned. Popular domains are splitting into ever-finer specializations. The tree is a living map of what the network's users actually do.

---

## 12. Build Plan — Phased Implementation

### Phase 1: Embedding Collection (Weeks 1-2)

**Goal:** Start collecting the raw material for clustering without changing any existing behavior.

- Add a `session_embedding` field to the telemetry pipeline
- Embed every session at collection time using all-MiniLM-L6-v2 (11ms overhead)
- Store embeddings alongside JSONL telemetry in S3
- Add a `routing` feedback field to SESSION_CLOSE records
- Keep existing fixed centroids for routing — fully backward compatible

**Dependencies:** None. This is purely additive.
**Effort:** ~3 days of telemetry pipeline work.

### Phase 2: Clustering Pipeline (Weeks 3-4)

**Goal:** Prove that meaningful clusters emerge from real session embeddings.

- Install HDBSCAN on the training instance (`pip install hdbscan`)
- Build a clustering script that loads embeddings from S3, runs HDBSCAN, and outputs cluster definitions (centroid, member session IDs, hierarchy level)
- Build a visualization script using UMAP dimensionality reduction for 2D cluster plots
- Run on accumulated embeddings and validate that clusters align with known domains
- Store cluster assignments and centroids in a structured manifest

**Dependencies:** Phase 1 embeddings must be accumulating.
**Effort:** ~5 days. Can start as soon as Phase 1 has 1000+ embedded sessions.

### Phase 3: Lifecycle Engine (Weeks 5-8)

**Goal:** Automate the seed/sprout/mature/prune/split/merge state machine.

- Build the lifecycle engine as a Python orchestrator script
- Implement stage detection logic (new cluster? threshold hit? dormant? sub-clusters?)
- Integrate with the training pipeline — auto-trigger `01_setup_and_preprocess.py` through `05_package_leaf.py` when a sprout threshold is reached
- Integrate with the router — auto-update centroids when leaves change
- Integrate with the S3 leaf registry — auto-deploy trained leaves
- Build a lightweight CLI dashboard showing tree state, leaf stages, session counts, and growth rates

**Dependencies:** Phase 2 clustering pipeline.
**Effort:** ~15 days. This is the core engineering work.

### Phase 4: Transition from Fixed to Dynamic (Weeks 8-10)

**Goal:** Replace the hardcoded domain taxonomy with data-driven clusters.

- Run clustering on all accumulated data
- Compare data-driven clusters against fixed domains — validate coverage
- Replace fixed centroids with data-driven centroids in the router
- Retrain leaves on cluster-derived training sets
- Deprecate the fixed domain taxonomy
- Monitor routing accuracy during the transition, with rollback capability if accuracy drops

**Dependencies:** Phase 3 lifecycle engine. Sufficient accumulated data (target: 5000+ embedded sessions).
**Effort:** ~8 days. Includes careful validation and rollback planning.
**Can overlap with:** Phase 3 final integration testing.

### Phase 5: Gossip Integration (Weeks 10-12)

**Goal:** Connect the lifecycle engine to the P2P mesh.

- The lifecycle engine publishes leaf state changes to the gossip protocol
- Mature leaves propagate to nodes that request them
- Pruned leaves are removed from node caches
- Split/merge events trigger mesh-wide centroid updates
- Router centroid sync becomes automatic — nodes pull updated centroid manifests periodically

**Dependencies:** Phase 4 must be stable. The P2P mesh gossip protocol must be operational.
**Effort:** ~8 days.

### Parallel Tracks

Phases 1-2 are sequential. Phases 3-4 overlap. Phase 5 can begin development in parallel with Phase 4 validation. Total timeline: approximately 12 weeks from start to full gossip integration.

---

## 13. Summary

Fixed domains are training wheels. Dynamic Leaf Architecture is the real system.

The embedding is the domain. HDBSCAN discovers structure in the embedding space. The lifecycle engine manages leaf evolution through seed, sprout, mature, prune, split, and merge stages. The tree grows from usage. More users produce more data. More data reveals more clusters. More clusters spawn more leaves. More leaves improve routing accuracy. Better routing produces better output. Better output attracts more users.

This is the self-improving intelligence loop that makes Hummingbird fundamentally different from static models. The PoC proved the router works. The AWS plan proved the training pipeline works. This document closes the gap between a manually managed set of 16 fixed leaves and a self-organizing tree of hundreds.

The tree is not designed. It is grown. Every session makes it smarter, whether or not a human is watching.