groove-dev 0.27.70 → 0.27.71

Files changed (41)
  1. package/CLAUDE.md +0 -7
  2. package/MOE_TRAINING_PIPELINE.md +720 -0
  3. package/node_modules/@groove-dev/cli/package.json +1 -1
  4. package/node_modules/@groove-dev/daemon/package.json +1 -1
  5. package/node_modules/@groove-dev/daemon/src/api.js +272 -2
  6. package/node_modules/@groove-dev/daemon/src/index.js +3 -0
  7. package/node_modules/@groove-dev/daemon/src/providers/base.js +8 -0
  8. package/node_modules/@groove-dev/daemon/src/providers/claude-code.js +52 -0
  9. package/node_modules/@groove-dev/daemon/src/providers/codex.js +15 -0
  10. package/node_modules/@groove-dev/daemon/src/providers/gemini.js +14 -0
  11. package/node_modules/@groove-dev/daemon/src/providers/index.js +36 -0
  12. package/node_modules/@groove-dev/gui/dist/assets/index-74E3YTkT.css +1 -0
  13. package/node_modules/@groove-dev/gui/dist/assets/{index-D5BpdcWS.js → index-BK6tvmxx.js} +1736 -1735
  14. package/node_modules/@groove-dev/gui/dist/index.html +2 -2
  15. package/node_modules/@groove-dev/gui/package.json +1 -1
  16. package/node_modules/@groove-dev/gui/src/components/editor/code-editor.jsx +5 -5
  17. package/node_modules/@groove-dev/gui/src/components/editor/editor-tabs.jsx +4 -4
  18. package/node_modules/@groove-dev/gui/src/components/settings/ProviderSetupWizard.jsx +480 -0
  19. package/node_modules/@groove-dev/gui/src/stores/groove.js +107 -2
  20. package/node_modules/@groove-dev/gui/src/views/settings.jsx +258 -84
  21. package/package.json +1 -1
  22. package/packages/cli/package.json +1 -1
  23. package/packages/daemon/package.json +1 -1
  24. package/packages/daemon/src/api.js +272 -2
  25. package/packages/daemon/src/index.js +3 -0
  26. package/packages/daemon/src/providers/base.js +8 -0
  27. package/packages/daemon/src/providers/claude-code.js +52 -0
  28. package/packages/daemon/src/providers/codex.js +15 -0
  29. package/packages/daemon/src/providers/gemini.js +14 -0
  30. package/packages/daemon/src/providers/index.js +36 -0
  31. package/packages/gui/dist/assets/index-74E3YTkT.css +1 -0
  32. package/packages/gui/dist/assets/{index-D5BpdcWS.js → index-BK6tvmxx.js} +1736 -1735
  33. package/packages/gui/dist/index.html +2 -2
  34. package/packages/gui/package.json +1 -1
  35. package/packages/gui/src/components/editor/code-editor.jsx +5 -5
  36. package/packages/gui/src/components/editor/editor-tabs.jsx +4 -4
  37. package/packages/gui/src/components/settings/ProviderSetupWizard.jsx +480 -0
  38. package/packages/gui/src/stores/groove.js +107 -2
  39. package/packages/gui/src/views/settings.jsx +258 -84
  40. package/node_modules/@groove-dev/gui/dist/assets/index-oQ0ejlfH.css +0 -1
  41. package/packages/gui/dist/assets/index-oQ0ejlfH.css +0 -1
@@ -0,0 +1,720 @@
1
+ # MoE Training Data Pipeline — Daemon Team Integration Guide
2
+
3
+ Handoff spec from the Network Team to the Daemon Team.
4
+ Last updated: 2026-04-23 | Target: HN launch April 26, 2026
5
+
6
+ ---
7
+
8
+ ## 1. Overview
9
+
10
+ The MoE training pipeline captures opted-in user workflow data (prompts,
11
+ completions, conversations) and stores it as PII-scrubbed JSONL for future
12
+ MoE expert training. The pipeline has two halves:
13
+
14
+ **Network team built (DONE):**
15
+ - Consent management (SQLite-backed, GDPR-style)
16
+ - PII scrubber (regex-based, 13 pattern categories)
17
+ - Domain tagger (journalism, code, research, planning, general)
18
+ - Training corpus storage (daily-rotated JSONL)
19
+ - Intake API and CaptureSession — the entry points the daemon calls
20
+ - Corpus statistics and reporting
21
+
22
+ **Daemon team builds (THIS DOC):**
23
+ - User-facing opt-in toggle in the Groove app
24
+ - Data capture hooks at prompt/completion/session boundaries
25
+ - Background capture queue (non-blocking)
26
+ - Data management UI (view stats, download, delete)
27
+
28
+ The daemon team does NOT need to implement PII scrubbing, consent
29
+ checking, domain classification, or storage. The network team's API
30
+ handles all of that internally. The daemon team's job is to wire up
31
+ the capture points and build the user-facing opt-in experience.
32
+
33
+ **Timeline:** Must be ready for HN launch on April 26, 2026. New users
34
+ who opt in should immediately start contributing training data.
35
+
36
+ ---
37
+
38
+ ## 2. What the Network Team Built (Backend API)
39
+
40
+ All source code lives in `moe-team/src/training/`. The daemon team
41
+ interacts with two main classes: `TrainingDataIntake` (simple per-call
42
+ API) and `CaptureSession` (session-oriented wrapper with lifecycle).
43
+
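Because `moe-team` contains a hyphen, it cannot appear in a plain Python `import` statement. The sketch below shows one way to load the training modules by their dotted string path instead. It assumes the directory that contains `moe-team/` is on `sys.path` and that the tree is importable as packages; `load_training_attr` is a hypothetical helper, not part of the network team's API.

```python
import importlib
from typing import Any


def load_training_attr(name: str, module: str = "moe-team.src.training") -> Any:
    """Import `module` by its dotted string path and return attribute `name`.

    importlib accepts path components that are not valid identifiers
    (such as "moe-team"), which the `import` statement cannot express.
    """
    return getattr(importlib.import_module(module), name)
```

For example, `ConsentManager = load_training_attr("ConsentManager")` would bind the class under the assumptions above.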
44
+ ### 2.1 TrainingDataIntake — Single Entry Point
45
+
46
+ ```python
47
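+ import importlib  # "moe-team" contains a hyphen, so a plain import statement is a SyntaxError
+ TrainingDataIntake = importlib.import_module("moe-team.src.training.intake").TrainingDataIntake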
48
+ ```
49
+
50
+ Constructor:
51
+
52
+ ```python
53
+ class TrainingDataIntake:
54
+     def __init__(
55
+         self,
56
+         consent_manager: ConsentManager,
57
+         corpus: TrainingCorpus,
58
+         scrubber: PIIScrubber,
59
+         tagger: DomainTagger,
60
+     ) -> None
61
+ ```
62
+
63
+ #### submit()
64
+
65
+ ```python
66
+ def submit(
67
+     self,
68
+     user_id: str,
69
+     session_id: str,
70
+     content: str,
71
+     content_type: str,
72
+     metadata: dict[str, Any] | None = None,
73
+ ) -> str | None
74
+ ```
75
+
76
+ Returns `record_id` (hex UUID) on success, `None` if user is not opted in.
77
+
78
+ What happens inside submit():
79
+ 1. Checks `consent_manager.is_opted_in(user_id)` — returns None if not
80
+ 2. Runs `scrubber.scrub(content)` — strips all PII
81
+ 3. Runs `tagger.tag(scrubbed_content)` — classifies domain
82
+ 4. Looks up current `consent_version` from consent history
83
+ 5. Creates a `TrainingRecord` and writes it to JSONL via `corpus.add_record()`
84
+
85
+ The daemon team does NOT need to scrub PII or check consent — submit()
86
+ handles both internally.
87
+
88
+ #### submit_batch()
89
+
90
+ ```python
91
+ def submit_batch(
92
+     self,
93
+     records: list[dict[str, Any]],
94
+ ) -> dict[str, Any]
95
+ # Returns: {"accepted": int, "rejected": int, "record_ids": list[str]}
96
+ ```
97
+
98
+ Each record in the list must have keys: `user_id`, `session_id`,
99
+ `content`, `content_type`, and optionally `metadata`.
100
+
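As a rough illustration of the record shape `submit_batch()` expects, the helper below builds one dict per record with the four required keys and adds `metadata` only when it is supplied. `build_batch_record` and `REQUIRED_KEYS` are hypothetical names introduced for this sketch; only the key names come from this guide.

```python
from typing import Any, Optional

# Keys this guide lists as required for each submit_batch() record.
REQUIRED_KEYS = ("user_id", "session_id", "content", "content_type")


def build_batch_record(
    user_id: str,
    session_id: str,
    content: str,
    content_type: str,
    metadata: Optional[dict[str, Any]] = None,
) -> dict[str, Any]:
    """Return a dict shaped like one submit_batch() record."""
    record: dict[str, Any] = {
        "user_id": user_id,
        "session_id": session_id,
        "content": content,
        "content_type": content_type,
    }
    if metadata is not None:
        record["metadata"] = metadata  # optional key
    return record
```

A list of such dicts could then be passed as `records`, and the returned `accepted` and `rejected` counts compared against the list length.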
101
+ #### delete_user()
102
+
103
+ ```python
104
+ def delete_user(self, user_id: str) -> dict[str, Any]
105
+ # Returns: {"consent_revoked": bool, "records_deleted": int}
106
+ ```
107
+
108
+ Revokes consent AND deletes all stored training data for the user.
109
+ Call this when a user opts out and chooses to delete their data.
110
+
111
+ ### 2.2 CaptureSession — Session-Oriented Wrapper
112
+
113
+ ```python
114
+ import importlib
+ CaptureSession = importlib.import_module("moe-team.src.training").CaptureSession
115
+ ```
116
+
117
+ Designed for per-session capture with start/end lifecycle and built-in
118
+ stats tracking.
119
+
120
+ ```python
121
+ class CaptureSession:
122
+     def __init__(
123
+         self,
124
+         user_id: str,
125
+         session_id: str,
126
+         consent_manager: ConsentManager,
127
+         corpus: TrainingCorpus,
128
+         scrubber: PIIScrubber,
129
+         tagger: DomainTagger,
130
+     ) -> None
131
+
132
+     def start(self) -> None
133
+     # Raises PermissionError if user not opted in
134
+
135
+     def record_prompt(self, text: str, metadata: dict[str, Any] | None = None) -> None
136
+     def record_completion(self, text: str, metadata: dict[str, Any] | None = None) -> None
137
+     def record_conversation(self, messages: list[dict[str, Any]], metadata: dict[str, Any] | None = None) -> None
138
+
139
+     def end(self) -> dict[str, Any]
140
+     # Returns: {"session_id", "records_captured", "bytes_captured", "domains", "duration_seconds"}
141
+
142
+     def is_active(self) -> bool
143
+     # Re-checks consent on every call; stops if user revokes mid-session
144
+ ```
145
+
146
+ CaptureSession is the recommended API for the daemon team. It handles
147
+ consent re-checking on every capture call, so if a user revokes consent
148
+ mid-session, capture stops immediately.
149
+
150
+ ### 2.3 ConsentManager — Consent Operations
151
+
152
+ ```python
153
+ import importlib
+ ConsentManager = importlib.import_module("moe-team.src.training").ConsentManager
154
+ ```
155
+
156
+ ```python
157
+ class ConsentManager:
158
+     def __init__(self, db_path: str = "~/.groove/consent.db") -> None
159
+
160
+     def record_consent(
161
+         self,
162
+         user_id: str,
163
+         opted_in: bool,
164
+         consent_version: str,
165
+         metadata: dict[str, Any] | None = None,
166
+     ) -> None
167
+
168
+     def is_opted_in(self, user_id: str) -> bool
169
+
170
+     def revoke_consent(self, user_id: str) -> bool
171
+     # Returns False if user had no consent records
172
+
173
+     def request_deletion(self, user_id: str) -> dict[str, Any]
174
+     # Returns: {"user_id", "consent_records", "status": "pending"}
175
+
176
+     def get_opted_in_count(self) -> int
177
+
178
+     def get_consent_history(self, user_id: str) -> list[ConsentRecord]
179
+ ```
180
+
181
+ Key behavior of `is_opted_in()`:
182
+ - Returns `False` if no consent record exists (default opt-out)
183
+ - Returns `False` if the latest consent version doesn't match
184
+ `CURRENT_CONSENT_VERSION` (currently "1.0") — this forces re-consent
185
+ when terms change
186
+
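The two rules above can be modelled as a small pure function. This is only a sketch of the described behavior, not the network team's implementation: `ConsentEntry` is a hypothetical stand-in for `ConsentRecord` (whose real fields are not shown in this guide), and the history is assumed to be ordered oldest first.

```python
from dataclasses import dataclass

CURRENT_CONSENT_VERSION = "1.0"  # value stated in this guide


@dataclass
class ConsentEntry:
    """Hypothetical stand-in for ConsentRecord; field names are assumptions."""
    opted_in: bool
    consent_version: str


def is_opted_in(history: list[ConsentEntry],
                current_version: str = CURRENT_CONSENT_VERSION) -> bool:
    """Return True only if the latest entry opts in under the current terms."""
    if not history:
        return False  # default opt-out: no consent record exists
    latest = history[-1]
    # A stale consent version counts as not opted in, forcing re-consent.
    return latest.opted_in and latest.consent_version == current_version
```

Under this model, a user who consented to version "0.9" is treated as opted out until a new record for "1.0" is written.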
187
+ ### 2.4 CorpusStats — Monitoring
188
+
189
+ ```python
190
+ import importlib
+ CorpusStats = importlib.import_module("moe-team.src.training").CorpusStats
191
+
192
+ class CorpusStats:
193
+     def __init__(self, corpus: TrainingCorpus, consent_manager: ConsentManager) -> None
194
+
195
+     def summary(self) -> dict[str, Any]
196
+     # Returns: {"total_records", "storage_size_mb", "opted_in_users", "domains"}
197
+
198
+     def daily_growth(self, days: int = 7) -> list[dict[str, Any]]
199
+     # Returns: [{"date": "2026-04-26", "records": 142}, ...]
200
+
201
+     def domain_breakdown(self) -> dict[str, Any]
202
+     # Returns: {"journalism": {"count": 50, "percentage": 33.3}, ...}
203
+
204
+     def print_report(self) -> None
205
+     # Prints formatted report to stdout
206
+ ```
207
+
208
+ ---
209
+
210
+ ## 3. What the Daemon Team Needs to Build
211
+
212
+ ### 3.1 User-Facing Opt-In
213
+
214
+ Add a settings toggle in the Groove app/IDE:
215
+
216
+ **Label:** "Share usage data to improve Groove"
217
+ **Default state:** OFF. No data collected without explicit user action.
218
+
219
+ **When toggled ON:**
220
+
221
+ 1. Show a clear disclosure dialog before enabling. The disclosure must state:
222
+    - What data is collected: prompts, completions, workflow metadata
223
+    - How it's used: training MoE expert models to improve Groove
224
+    - PII is automatically scrubbed before storage (emails, phone numbers,
225
+      API keys, file paths, etc. are all replaced with placeholders)
226
+    - How to opt out: toggle the setting OFF at any time
227
+    - How to delete data: option available in settings to delete all
228
+      previously collected data
229
+
230
+ 2. Generate or load a persistent `user_id`:
231
+    - Store at `~/.groove/user_id`
232
+    - Generate as a random UUID (e.g., `uuid.uuid4().hex`)
233
+    - Generated once per install, reused across sessions
234
+    - NOT tied to identity: no hardware IDs, no email, no name, no IP
235
+
236
+ 3. Record consent:
237
+ ```python
238
+ consent_manager.record_consent(
239
+ user_id=user_id,
240
+ opted_in=True,
241
+ consent_version="1.0",
242
+ )
243
+ ```
244
+
245
+ **When toggled OFF:**
246
+
247
+ 1. Revoke consent:
248
+ ```python
249
+ consent_manager.revoke_consent(user_id)
250
+ ```
251
+
252
+ 2. Stop all capture immediately. Any active CaptureSession will
253
+ self-deactivate on the next `is_active()` check.
254
+
255
+ 3. Offer the option to delete previously collected data:
256
+ ```python
257
+ intake.delete_user(user_id)
258
+ # This revokes consent AND deletes all JSONL records for the user
259
+ ```
260
+
261
+ ### 3.2 Data Capture Points
262
+
263
+ Hook capture at these points in the daemon/app:
264
+
265
+ #### PROMPTS
266
+
267
+ When user submits a prompt for inference, capture the prompt text.
268
+
269
+ ```python
270
+ session.record_prompt(prompt_text, metadata={
271
+ "model_name": model_name, # e.g., "qwen3-moe-a3b"
272
+ "timestamp": time.time(),
273
+ "workflow_type": workflow_type, # e.g., "journalism", "code_editor"
274
+ })
275
+ ```
276
+
277
+ Or with TrainingDataIntake:
278
+ ```python
279
+ intake.submit(user_id, session_id, prompt_text, "prompt", metadata={
280
+ "model_name": model_name,
281
+ "timestamp": time.time(),
282
+ "workflow_type": workflow_type,
283
+ })
284
+ ```
285
+
286
+ #### COMPLETIONS
287
+
288
+ When the model returns a completion, capture the output text.
289
+
290
+ ```python
291
+ session.record_completion(completion_text, metadata={
292
+ "model_name": model_name,
293
+ "token_count": token_count,
294
+ "latency_ms": latency_ms,
295
+ "finish_reason": finish_reason, # e.g., "stop", "length", "eos"
296
+ })
297
+ ```
298
+
299
+ Or with TrainingDataIntake:
300
+ ```python
301
+ intake.submit(user_id, session_id, completion_text, "completion", metadata={
302
+ "model_name": model_name,
303
+ "token_count": token_count,
304
+ "latency_ms": latency_ms,
305
+ "finish_reason": finish_reason,
306
+ })
307
+ ```
308
+
309
+ #### CONVERSATIONS
310
+
311
+ For multi-turn sessions, capture the full conversation at session end.
312
+
313
+ ```python
314
+ messages = [
315
+ {"role": "user", "content": "..."},
316
+ {"role": "assistant", "content": "..."},
317
+ {"role": "user", "content": "..."},
318
+ {"role": "assistant", "content": "..."},
319
+ ]
320
+ session.record_conversation(messages, metadata={
321
+ "turn_count": len(messages),
322
+ "total_tokens": total_tokens,
323
+ "session_duration_s": duration_seconds,
324
+ })
325
+ ```
326
+
327
+ Or with TrainingDataIntake:
328
+ ```python
329
+ import json
330
+ intake.submit(user_id, session_id, json.dumps(messages), "conversation", metadata={
331
+ "turn_count": len(messages),
332
+ "total_tokens": total_tokens,
333
+ "session_duration_s": duration_seconds,
334
+ })
335
+ ```
336
+
337
+ #### WORKFLOW METADATA
338
+
339
+ Capture what type of work the user is doing (journalism, coding,
340
+ research, planning). The domain tagger classifies content automatically
341
+ based on keywords, but explicit `workflow_type` in metadata helps
342
+ validate and improve the tagger.
343
+
344
+ If the app knows the user is in "journalist mode" or "code editor",
345
+ pass that as metadata on every submit call:
346
+
347
+ ```python
348
+ metadata={"workflow_type": "journalism"}
349
+ ```
350
+
351
+ ### 3.3 Capture Requirements
352
+
353
+ **ZERO OVERHEAD when opted out:**
354
+ Don't import `moe-team.src.training` modules if the user hasn't opted in.
355
+ Check a local flag file or config value first. Recommended pattern:
356
+
357
+ ```python
358
+ import os
359
+
360
+ def is_capture_enabled() -> bool:
361
+     """Check local flag before touching any training modules."""
362
+     user_id_path = os.path.expanduser("~/.groove/user_id")
363
+     if not os.path.exists(user_id_path):
364
+         return False
365
+     # Only load the consent manager if user_id exists ("moe-team" is hyphenated, hence importlib)
366
+     import importlib
367
+     consent = importlib.import_module("moe-team.src.training").ConsentManager()
368
+     with open(user_id_path) as f:
369
+         user_id = f.read().strip()
370
+     return consent.is_opted_in(user_id)
371
+ ```
372
+
373
+ **NON-BLOCKING:**
374
+ Capture calls must not slow down inference. The `intake.submit()` and
375
+ `CaptureSession.record_*()` calls do synchronous I/O (SQLite reads +
376
+ JSONL file writes). Do NOT call them in the inference hot path.
377
+
378
+ Recommended pattern — background queue:
379
+
380
+ ```python
381
+ import queue
382
+ import threading
383
+
384
+ capture_queue = queue.Queue(maxsize=10000)
385
+
386
+ def capture_worker(intake, user_id):
387
+     """Background thread that drains the capture queue."""
388
+     while True:
389
+         item = capture_queue.get()
390
+         if item is None:
391
+             break
392
+         try:
393
+             intake.submit(
394
+                 user_id=user_id,
395
+                 session_id=item["session_id"],
396
+                 content=item["content"],
397
+                 content_type=item["content_type"],
398
+                 metadata=item.get("metadata"),
399
+             )
400
+         except Exception:
401
+             pass  # fail silent: never crash over capture
402
+         capture_queue.task_done()
403
+
404
+ # Start worker thread at app startup (only if opted in)
405
+ worker = threading.Thread(target=capture_worker, args=(intake, user_id), daemon=True)
406
+ worker.start()
407
+
408
+ # In the inference path — non-blocking enqueue
409
+ def on_prompt(session_id, prompt_text, model_name):
410
+     try:
411
+         capture_queue.put_nowait({
412
+             "session_id": session_id,
413
+             "content": prompt_text,
414
+             "content_type": "prompt",
415
+             "metadata": {"model_name": model_name},
416
+         })
417
+     except queue.Full:
418
+         pass  # drop if queue is full: never block inference
419
+ ```
420
+
421
+ If using asyncio:
422
+
423
+ ```python
424
+ import asyncio
425
+
426
+ async def capture_submit(intake, user_id, session_id, content, content_type, metadata):
427
+     loop = asyncio.get_running_loop()  # preferred inside a coroutine
428
+     await loop.run_in_executor(None, intake.submit, user_id, session_id, content, content_type, metadata)
429
+ ```
430
+
431
+ **FAIL SILENT:**
432
+ If capture fails (disk full, permission error, SQLite locked), log a
433
+ warning and continue. Never crash the app over training data capture.
434
+ Wrap every capture call in a try/except that swallows all exceptions.
435
+
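One way to apply the fail-silent rule uniformly is a small wrapper that logs a warning and swallows the error. This is a sketch only: `safe_capture` and the logger name `groove.capture` are hypothetical, and the wrapped callable could be `intake.submit` or any `CaptureSession.record_*()` method.

```python
import logging
from typing import Any, Callable

logger = logging.getLogger("groove.capture")  # hypothetical logger name


def safe_capture(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
    """Call a capture function; on any exception, log a warning and return None."""
    try:
        return fn(*args, **kwargs)
    except Exception:
        # Capture must never crash the app: record the failure and move on.
        logger.warning("training capture failed; continuing", exc_info=True)
        return None
```

Catching `Exception` rather than using a bare `except` leaves `KeyboardInterrupt` and `SystemExit` free to propagate, so the app can still shut down normally.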
436
+ ### 3.4 User Data Management UI
437
+
438
+ Provide a way for users to:
439
+
440
+ **View contribution stats:**
441
+ ```python
442
+ import importlib
+ training = importlib.import_module("moe-team.src.training")
443
+
444
+ corpus = training.TrainingCorpus()
445
+ consent = training.ConsentManager()
446
+ stats = training.CorpusStats(corpus, consent)
447
+
448
+ summary = stats.summary()
449
+ # {"total_records": 1423, "storage_size_mb": 2.34, "opted_in_users": 1, "domains": {...}}
450
+
451
+ growth = stats.daily_growth(days=7)
452
+ # [{"date": "2026-04-26", "records": 142}, {"date": "2026-04-27", "records": 203}, ...]
453
+ ```
454
+
455
+ **Download their data:**
456
+ ```python
457
+ corpus = TrainingCorpus()
458
+ count = corpus.export_jsonl(
459
+ output_path="/tmp/my_groove_data.jsonl",
460
+ domain=None, # all domains, or filter by "journalism", "code", etc.
461
+ after=None, # all time, or filter by timestamp
462
+ )
463
+ # count = number of records exported
464
+ ```
465
+
466
+ Note: `export_jsonl()` exports ALL users' data. To filter by user_id,
467
+ the daemon team should read the JSONL and filter client-side, or the
468
+ network team can add a user_id filter (file a request).
469
+
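Until a server-side filter exists, client-side filtering of an exported file could look like the sketch below. It assumes each non-empty line is a JSON object with a `user_id` field, as in the TrainingRecord schema; `filter_jsonl_by_user` is a hypothetical helper name.

```python
import json


def filter_jsonl_by_user(src_path: str, dst_path: str, user_id: str) -> int:
    """Copy only the records belonging to `user_id`; return how many were kept."""
    kept = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if json.loads(line).get("user_id") == user_id:
                dst.write(line + "\n")
                kept += 1
    return kept
```

Since the unfiltered export contains other users' records, it would be prudent to write it to a private temporary location and remove it once the filtered copy has been produced.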
470
+ **Delete their data:**
471
+ ```python
472
+ intake.delete_user(user_id)
473
+ # Revokes consent AND deletes all JSONL records for the user
474
+ ```
475
+
476
+ This can be a settings panel, CLI command (`groove data --stats`,
477
+ `groove data --export`, `groove data --delete`), or both.
478
+
479
+ ---
480
+
481
+ ## 4. Data Format Reference
482
+
483
+ ### TrainingRecord Schema
484
+
485
+ Each record stored in the JSONL corpus has these fields:
486
+
487
+ | Field | Type | Description |
488
+ |-------------------|-------------------|----------------------------------------------------------|
489
+ | `record_id` | `str` | Hex UUID, unique per record |
490
+ | `user_id` | `str` | Anonymized install ID (random UUID from ~/.groove/user_id)|
491
+ | `session_id` | `str` | Groups records from the same inference session |
492
+ | `timestamp` | `float` | Unix timestamp (time.time()) |
493
+ | `domain` | `str` | Auto-tagged: journalism, code, research, planning, general|
494
+ | `content_type` | `str` | prompt, completion, conversation, workflow |
495
+ | `content` | `str` | PII-scrubbed text |
496
+ | `metadata` | `dict[str, Any]` | Model name, token count, latency, workflow_type, etc. |
497
+ | `consent_version` | `str` | Version of consent terms (currently "1.0") |
498
+
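When testing the integration, a quick structural check of corpus lines against the table above may be useful. The sketch below only verifies that the nine documented field names are present; `SCHEMA_FIELDS` and `missing_fields` are hypothetical names, and no type validation is attempted.

```python
import json

# Field names taken from the TrainingRecord table in this guide.
SCHEMA_FIELDS = {
    "record_id", "user_id", "session_id", "timestamp", "domain",
    "content_type", "content", "metadata", "consent_version",
}


def missing_fields(jsonl_line: str) -> list[str]:
    """Return the documented field names absent from one JSONL record, sorted."""
    record = json.loads(jsonl_line)
    return sorted(SCHEMA_FIELDS - set(record))
```

An empty list means every documented field is present on that line.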
499
+ ### Storage Location
500
+
501
+ JSONL files at `~/.groove/training_data/training_YYYY-MM-DD.jsonl`
502
+
503
+ One JSON object per line, daily rotation. Standard format for LLM
504
+ training pipelines.
505
+
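The daily file name can be derived from a date as sketched below. `corpus_path_for` is a hypothetical helper; whether `TrainingCorpus` rotates on local time or UTC is not stated in this guide, so the caller chooses the date.

```python
import os
from datetime import date


def corpus_path_for(day: date, root: str = "~/.groove/training_data") -> str:
    """Build the path of the JSONL file for `day` using the documented pattern."""
    # date.isoformat() yields YYYY-MM-DD, matching training_YYYY-MM-DD.jsonl
    return os.path.join(os.path.expanduser(root), f"training_{day.isoformat()}.jsonl")
```

This can help a data management UI locate the file for a given day without scanning the directory.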
506
+ ### Example JSONL Line
507
+
508
+ ```json
509
+ {"record_id":"a1b2c3d4e5f6","user_id":"f8e7d6c5b4a3","session_id":"sess_001","timestamp":1745625600.0,"domain":"journalism","content_type":"prompt","content":"Write a lead paragraph about the city council vote on [EMAIL] proposed budget","metadata":{"model_name":"qwen3-moe-a3b","workflow_type":"journalism"},"consent_version":"1.0"}
510
+ ```
511
+
512
+ ---
513
+
514
+ ## 5. Content Types and What to Capture
515
+
516
+ | content_type | When to capture | What to include |
517
+ |----------------|----------------------------------------|----------------------------------------|
518
+ | `prompt` | User submits text for inference | The raw prompt text |
519
+ | `completion` | Model returns generated text | The generated output |
520
+ | `conversation` | Multi-turn session ends | Full message history as JSON |
521
+ | `workflow` | User completes a workflow/task | Summary of the workflow steps |
522
+
523
+ Notes:
524
+ - `prompt` and `completion` are captured in real-time during inference
525
+ - `conversation` is captured once at session end — the full message array
526
+ - `workflow` is optional — capture if the app has a concept of discrete
527
+ tasks or workflows (e.g., "user finished writing an article")
528
+
529
+ ---
530
+
531
+ ## 6. Integration Checklist
532
+
533
+ - [ ] Generate persistent `user_id` (random UUID) at `~/.groove/user_id`
534
+ - [ ] Add opt-in toggle to settings UI (default: OFF)
535
+ - [ ] Show data collection disclosure dialog on first opt-in
536
+ - [ ] Call `consent_manager.record_consent(user_id, opted_in=True, consent_version="1.0")` on opt-in
537
+ - [ ] Call `consent_manager.revoke_consent(user_id)` on opt-out
538
+ - [ ] Hook capture at prompt submission point
539
+ - [ ] Hook capture at completion receipt point
540
+ - [ ] Hook capture at session end (conversation type)
541
+ - [ ] Implement background capture queue (non-blocking)
542
+ - [ ] Guard all capture behind opted-in check (zero overhead when off)
543
+ - [ ] Add data management UI (view stats, download, delete)
544
+ - [ ] Test: opt-in -> capture -> opt-out -> capture stops
545
+ - [ ] Test: data deletion removes all user records
546
+ - [ ] Test: no PII leaks through (the intake API scrubs, but verify)
547
+
548
+ ---
549
+
550
+ ## 7. Quick Start Code Example
551
+
552
+ Minimal end-to-end integration using CaptureSession:
553
+
554
+ ```python
555
+ import os
556
+ import uuid
557
+
558
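+ import importlib  # "moe-team" contains a hyphen, so it cannot be used in a plain import statement
559
+ training = importlib.import_module("moe-team.src.training")
560
+ CaptureSession = training.CaptureSession
561
+ ConsentManager = training.ConsentManager
562
+ TrainingCorpus = training.TrainingCorpus
563
+ PIIScrubber = training.PIIScrubber
564
+ DomainTagger = training.DomainTagger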
565
+
566
+ # --- One-time setup (app startup) ---
567
+
568
+ USER_ID_PATH = os.path.expanduser("~/.groove/user_id")
569
+
570
+ def get_or_create_user_id() -> str:
571
+     if os.path.exists(USER_ID_PATH):
572
+         with open(USER_ID_PATH) as f:
573
+             return f.read().strip()
574
+     uid = uuid.uuid4().hex
575
+     os.makedirs(os.path.dirname(USER_ID_PATH), exist_ok=True)
576
+     with open(USER_ID_PATH, "w") as f:
577
+         f.write(uid)
578
+     return uid
579
+
580
+ user_id = get_or_create_user_id()
581
+ consent = ConsentManager() # db at ~/.groove/consent.db
582
+ corpus = TrainingCorpus() # JSONL at ~/.groove/training_data/
583
+ scrubber = PIIScrubber()
584
+ tagger = DomainTagger()
585
+
586
+ # --- User opts in (settings toggle) ---
587
+
588
+ consent.record_consent(user_id, opted_in=True, consent_version="1.0")
589
+
590
+ # --- Per-session capture ---
591
+
592
+ session_id = uuid.uuid4().hex
593
+ capture = CaptureSession(user_id, session_id, consent, corpus, scrubber, tagger)
594
+ capture.start()
595
+
596
+ # User sends a prompt
597
+ capture.record_prompt("Summarize the city council meeting notes", metadata={
598
+ "model_name": "qwen3-moe-a3b",
599
+ })
600
+
601
+ # Model returns a completion
602
+ capture.record_completion("The city council voted 7-2 to approve...", metadata={
603
+ "model_name": "qwen3-moe-a3b",
604
+ "token_count": 142,
605
+ "latency_ms": 820,
606
+ })
607
+
608
+ # Session ends — capture full conversation
609
+ capture.record_conversation([
610
+ {"role": "user", "content": "Summarize the city council meeting notes"},
611
+ {"role": "assistant", "content": "The city council voted 7-2 to approve..."},
612
+ ], metadata={"turn_count": 2, "total_tokens": 168})
613
+
614
+ summary = capture.end()
615
+ # {"session_id": "...", "records_captured": 3, "bytes_captured": 312, ...}
616
+
617
+ # --- User opts out and deletes data ---
618
+
619
+ import importlib
+ TrainingDataIntake = importlib.import_module("moe-team.src.training.intake").TrainingDataIntake
620
+
621
+ intake = TrainingDataIntake(consent, corpus, scrubber, tagger)
622
+ result = intake.delete_user(user_id)
623
+ # {"consent_revoked": True, "records_deleted": 3}
624
+ ```
625
+
626
+ Alternative using TrainingDataIntake directly (simpler, no session lifecycle):
627
+
628
+ ```python
629
+ import importlib
+ TrainingDataIntake = importlib.import_module("moe-team.src.training.intake").TrainingDataIntake
630
+
631
+ intake = TrainingDataIntake(consent, corpus, scrubber, tagger)
632
+
633
+ # Submit individual records
634
+ intake.submit(user_id, session_id, prompt_text, "prompt", {"model_name": "qwen3-moe-a3b"})
635
+ intake.submit(user_id, session_id, completion_text, "completion", {"token_count": 142})
636
+
637
+ # User opts out — revokes consent + deletes all data
638
+ intake.delete_user(user_id)
639
+ ```
640
+
641
+ ---
642
+
643
+ ## 8. PII Scrubbing — What the Daemon Team Does NOT Need to Do
644
+
645
+ The intake API and CaptureSession scrub ALL content before storage.
646
+ The daemon team does NOT need to scrub PII. Just pass raw text.
647
+
648
+ The scrubber (`PIIScrubber` in `moe-team/src/training/scrubber.py`)
649
+ handles these PII categories:
650
+
651
+ | PII Type | Replacement | Example |
652
+ |--------------------|------------------|--------------------------------------------|
653
+ | Email addresses | `[EMAIL]` | `user@example.com` -> `[EMAIL]` |
654
+ | Phone numbers | `[PHONE]` | `(555) 123-4567` -> `[PHONE]` |
655
+ | IPv4 addresses | `[IP]` | `192.168.1.1` -> `[IP]` |
656
+ | IPv6 addresses | `[IP]` | `2001:db8::1` -> `[IP]` |
657
+ | SSNs | `[SSN]` | `123-45-6789` -> `[SSN]` |
658
+ | Credit cards | `[CREDIT_CARD]` | `4111-1111-1111-1111` -> `[CREDIT_CARD]` |
659
+ | AWS access keys | `[AWS_KEY]` | `AKIAIOSFODNN7EXAMPLE` -> `[AWS_KEY]` |
660
+ | Private keys | `[PRIVATE_KEY]` | `-----BEGIN RSA PRIVATE KEY-----...` |
661
+ | Bearer tokens | `[API_KEY]` | `Bearer eyJhbGc...` -> `[API_KEY]` |
662
+ | sk_/pk_ API keys | `[API_KEY]` | `sk-abc123...` -> `[API_KEY]` |
663
+ | Long hex strings | `[API_KEY]` | 40+ char hex strings -> `[API_KEY]` |
664
+ | User file paths | `[FILE_PATH]` | `/Users/john/docs/...` -> `[FILE_PATH]` |
665
+ | URLs with tokens | `[REDACTED_URL]` | `https://...?token=abc` -> `[REDACTED_URL]`|
666
+
667
+ Credit card detection includes Luhn checksum validation to reduce false
668
+ positives.
669
+
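For reference, the standard Luhn check can be sketched as follows. This is the generic algorithm, not necessarily the code in `scrubber.py`; `luhn_valid` is a hypothetical name.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]  # ignore dashes and spaces
    if not digits:
        return False
    total = 0
    # Walk right to left, doubling every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9  # same as summing the two digits of the product
        total += d
    return total % 10 == 0
```

With this check, `4111-1111-1111-1111` passes while a random 16-digit number usually fails, which is why the checksum helps reduce false positives.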
670
+ If the daemon team discovers PII types the scrubber misses, report to
671
+ the network team to add patterns to `scrubber.py`.
672
+
673
+ ---
674
+
675
+ ## 9. Privacy Principles
676
+
677
+ 1. **Default opt-out:** No data collection without explicit user action.
678
+ The toggle defaults to OFF. `is_opted_in()` returns `False` when no
679
+ consent record exists.
680
+
681
+ 2. **Anonymous user_id:** The `user_id` is a random UUID generated once
682
+ per install. It is NOT derived from any personal information — no
683
+ hardware IDs, no email, no name, no IP address.
684
+
685
+ 3. **PII scrubbed before storage:** All content passes through the
686
+ PIIScrubber before hitting disk. The scrubber replaces 13 categories
687
+ of PII with placeholder tokens.
688
+
689
+ 4. **User can delete all data at any time:** `intake.delete_user(user_id)`
690
+ removes all JSONL records and revokes consent. The consent database
691
+ also supports `request_deletion()` for formal deletion requests.
692
+
693
+ 5. **Consent is versioned:** `is_opted_in()` checks that the user's
694
+ consent version matches `CURRENT_CONSENT_VERSION` (currently "1.0").
695
+ If terms change and the version is bumped, all users must re-consent.
696
+ Old consent is treated as not-opted-in until re-consented.
697
+
698
+ 6. **Training data stays local:** Data is stored on the machine running
699
+ the Groove service (`~/.groove/training_data/`). No data leaves the
700
+ machine without explicit export. Future network-level aggregation
701
+ will be a separate opt-in.
702
+
703
+ 7. **Mid-session revocation:** If a user revokes consent during an active
704
+ session, `CaptureSession.is_active()` detects this on the next call
705
+ and stops capture immediately. No buffered data is flushed.
706
+
707
+ ---
708
+
709
+ ## 10. File Reference
710
+
711
+ | File | Description |
712
+ |------|-------------|
713
+ | `moe-team/src/training/__init__.py` | Package exports: CaptureSession, ConsentManager, ConsentRecord, CorpusStats, DomainTagger, PIIScrubber, TrainingCorpus, TrainingRecord |
714
+ | `moe-team/src/training/intake.py` | TrainingDataIntake — simple submit/batch/delete API |
715
+ | `moe-team/src/training/consent.py` | ConsentManager — SQLite consent storage, versioned consent |
716
+ | `moe-team/src/training/scrubber.py` | PIIScrubber — 13 compiled regex patterns for PII removal |
717
+ | `moe-team/src/training/domain_tagger.py` | DomainTagger — keyword-based domain classification |
718
+ | `moe-team/src/training/corpus.py` | TrainingCorpus — JSONL storage with daily rotation |
719
+ | `moe-team/src/training/capture.py` | CaptureSession — session-oriented capture with lifecycle |
720
+ | `moe-team/src/training/stats.py` | CorpusStats — summary, daily growth, domain breakdown |