groove-dev 0.27.70 → 0.27.72
- package/CLAUDE.md +0 -7
- package/MOE_TRAINING_PIPELINE.md +720 -0
- package/node_modules/@groove-dev/cli/package.json +1 -1
- package/node_modules/@groove-dev/daemon/package.json +1 -1
- package/node_modules/@groove-dev/daemon/src/api.js +299 -21
- package/node_modules/@groove-dev/daemon/src/index.js +3 -0
- package/node_modules/@groove-dev/daemon/src/providers/base.js +8 -0
- package/node_modules/@groove-dev/daemon/src/providers/claude-code.js +54 -0
- package/node_modules/@groove-dev/daemon/src/providers/codex.js +16 -0
- package/node_modules/@groove-dev/daemon/src/providers/gemini.js +14 -0
- package/node_modules/@groove-dev/daemon/src/providers/index.js +36 -0
- package/node_modules/@groove-dev/gui/dist/assets/index-74E3YTkT.css +1 -0
- package/node_modules/@groove-dev/gui/dist/assets/{index-D5BpdcWS.js → index-CHSXqfwy.js} +1736 -1735
- package/node_modules/@groove-dev/gui/dist/index.html +2 -2
- package/node_modules/@groove-dev/gui/package.json +1 -1
- package/node_modules/@groove-dev/gui/src/components/editor/code-editor.jsx +5 -5
- package/node_modules/@groove-dev/gui/src/components/editor/editor-tabs.jsx +4 -4
- package/node_modules/@groove-dev/gui/src/components/editor/file-tree.jsx +11 -0
- package/node_modules/@groove-dev/gui/src/components/settings/ProviderSetupWizard.jsx +480 -0
- package/node_modules/@groove-dev/gui/src/stores/groove.js +112 -4
- package/node_modules/@groove-dev/gui/src/views/editor.jsx +10 -2
- package/node_modules/@groove-dev/gui/src/views/settings.jsx +258 -84
- package/package.json +1 -1
- package/packages/cli/package.json +1 -1
- package/packages/daemon/package.json +1 -1
- package/packages/daemon/src/api.js +299 -21
- package/packages/daemon/src/index.js +3 -0
- package/packages/daemon/src/providers/base.js +8 -0
- package/packages/daemon/src/providers/claude-code.js +54 -0
- package/packages/daemon/src/providers/codex.js +16 -0
- package/packages/daemon/src/providers/gemini.js +14 -0
- package/packages/daemon/src/providers/index.js +36 -0
- package/packages/gui/dist/assets/index-74E3YTkT.css +1 -0
- package/packages/gui/dist/assets/{index-D5BpdcWS.js → index-CHSXqfwy.js} +1736 -1735
- package/packages/gui/dist/index.html +2 -2
- package/packages/gui/package.json +1 -1
- package/packages/gui/src/components/editor/code-editor.jsx +5 -5
- package/packages/gui/src/components/editor/editor-tabs.jsx +4 -4
- package/packages/gui/src/components/editor/file-tree.jsx +11 -0
- package/packages/gui/src/components/settings/ProviderSetupWizard.jsx +480 -0
- package/packages/gui/src/stores/groove.js +112 -4
- package/packages/gui/src/views/editor.jsx +10 -2
- package/packages/gui/src/views/settings.jsx +258 -84
- package/node_modules/@groove-dev/gui/dist/assets/index-oQ0ejlfH.css +0 -1
- package/packages/gui/dist/assets/index-oQ0ejlfH.css +0 -1
@@ -0,0 +1,720 @@
# MoE Training Data Pipeline: Daemon Team Integration Guide

Handoff spec from the Network Team to the Daemon Team.
Last updated: 2026-04-23 | Target: HN launch April 26, 2026

---

## 1. Overview

The MoE training pipeline captures opted-in user workflow data (prompts,
completions, conversations) and stores it as PII-scrubbed JSONL for future
MoE expert training. The pipeline has two halves:

**Network team built (DONE):**
- Consent management (SQLite-backed, GDPR-style)
- PII scrubber (regex-based, 13 pattern categories)
- Domain tagger (journalism, code, research, planning, general)
- Training corpus storage (daily-rotated JSONL)
- Intake API and CaptureSession (the entry points the daemon calls)
- Corpus statistics and reporting

**Daemon team builds (THIS DOC):**
- User-facing opt-in toggle in the Groove app
- Data capture hooks at prompt/completion/session boundaries
- Background capture queue (non-blocking)
- Data management UI (view stats, download, delete)

The daemon team does NOT need to implement PII scrubbing, consent
checking, domain classification, or storage. The network team's API
handles all of that internally. The daemon team's job is to wire up
the capture points and build the user-facing opt-in experience.

**Timeline:** Must be ready for HN launch on April 26, 2026. New users
who opt in should immediately start contributing training data.

---

## 2. What the Network Team Built (Backend API)

All source code lives in `moe-team/src/training/`. The daemon team
interacts with two main classes: `TrainingDataIntake` (simple per-call
API) and `CaptureSession` (session-oriented wrapper with lifecycle).

### 2.1 TrainingDataIntake: Single Entry Point

```python
from moe-team.src.training.intake import TrainingDataIntake
```
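
The import path above contains a hyphen (`moe-team`), which a Python `import` statement cannot parse. If the repository directory really is named `moe-team` and its parent directory is on `sys.path` (both are assumptions; this guide does not describe the packaging), `importlib.import_module` accepts the dotted name as a string. The sketch below shows the mechanism against a throwaway package tree, not the real code; `load_attr` is a hypothetical helper.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def load_attr(dotted_module: str, attr: str):
    """Import a module by string name (hyphens allowed) and return one attribute."""
    module = importlib.import_module(dotted_module)
    return getattr(module, attr)


# Build a stand-in tree: <tmp>/moe-team/src/training/intake.py
_root = Path(tempfile.mkdtemp())
_pkg = _root / "moe-team" / "src" / "training"
_pkg.mkdir(parents=True)
for _dir in (_root / "moe-team", _root / "moe-team" / "src", _pkg):
    (_dir / "__init__.py").write_text("")
(_pkg / "intake.py").write_text("class TrainingDataIntake:\n    pass\n")

# Make the parent of "moe-team" importable, then load by string name.
sys.path.insert(0, str(_root))
importlib.invalidate_caches()
TrainingDataIntake = load_attr("moe-team.src.training.intake", "TrainingDataIntake")
```

If the package is actually installed under an importable name (for example with an underscore), a plain `from ... import ...` is the simpler choice.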

Constructor:

```python
class TrainingDataIntake:
    def __init__(
        self,
        consent_manager: ConsentManager,
        corpus: TrainingCorpus,
        scrubber: PIIScrubber,
        tagger: DomainTagger,
    ) -> None
```

#### submit()

```python
def submit(
    self,
    user_id: str,
    session_id: str,
    content: str,
    content_type: str,
    metadata: dict[str, Any] | None = None,
) -> str | None
```

Returns `record_id` (hex UUID) on success, `None` if user is not opted in.

What happens inside submit():
1. Checks `consent_manager.is_opted_in(user_id)`; returns `None` if not opted in
2. Runs `scrubber.scrub(content)` to strip all PII
3. Runs `tagger.tag(scrubbed_content)` to classify the domain
4. Looks up current `consent_version` from consent history
5. Creates a `TrainingRecord` and writes it to JSONL via `corpus.add_record()`

The daemon team does NOT need to scrub PII or check consent; submit()
handles both internally.
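
The five steps above can be sketched as a plain function. This is an illustration of the described flow, not the network team's implementation: `submit_sketch` and its callable parameters are hypothetical stand-ins for the consent manager, scrubber, tagger, and corpus, and the consent version is hard-coded here rather than read from consent history.

```python
from __future__ import annotations

import re
import time
import uuid
from typing import Any, Callable

CURRENT_CONSENT_VERSION = "1.0"  # value quoted in this guide


def submit_sketch(
    user_id: str,
    session_id: str,
    content: str,
    content_type: str,
    metadata: dict[str, Any] | None,
    *,
    is_opted_in: Callable[[str], bool],
    scrub: Callable[[str], str],
    tag: Callable[[str], str],
    add_record: Callable[[dict[str, Any]], None],
) -> str | None:
    if not is_opted_in(user_id):  # step 1: consent gate
        return None
    scrubbed = scrub(content)  # step 2: PII scrubbing
    domain = tag(scrubbed)  # step 3: domain tagging on scrubbed text
    record = {
        "record_id": uuid.uuid4().hex,
        "user_id": user_id,
        "session_id": session_id,
        "timestamp": time.time(),
        "domain": domain,
        "content_type": content_type,
        "content": scrubbed,
        "metadata": metadata or {},
        # step 4: the real code looks this up from consent history
        "consent_version": CURRENT_CONSENT_VERSION,
    }
    add_record(record)  # step 5: persist
    return record["record_id"]


# Demo with toy stand-ins for each component.
stored: list[dict[str, Any]] = []
toy = dict(
    is_opted_in=lambda uid: uid == "u1",
    scrub=lambda text: re.sub(r"\S+@\S+", "[EMAIL]", text),
    tag=lambda text: "general",
    add_record=stored.append,
)
accepted_id = submit_sketch("u1", "s1", "mail me at a@b.co", "prompt", None, **toy)
rejected_id = submit_sketch("u2", "s1", "hello", "prompt", None, **toy)
```

The ordering matters: the consent check comes first so nothing is processed for opted-out users, and tagging runs on the scrubbed text so the tagger never sees raw PII.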

#### submit_batch()

```python
def submit_batch(
    self,
    records: list[dict[str, Any]],
) -> dict[str, Any]
# Returns: {"accepted": int, "rejected": int, "record_ids": list[str]}
```

Each record in the list must have keys: `user_id`, `session_id`,
`content`, `content_type`, and optionally `metadata`.
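
This guide does not say how `submit_batch()` treats a record that is missing a required key, so a daemon-side pre-check is a cheap precaution. The helper below is a hypothetical sketch (`split_valid_records` is not part of the network team's API); it only verifies the key list stated above.

```python
from __future__ import annotations

from typing import Any

REQUIRED_KEYS = ("user_id", "session_id", "content", "content_type")


def split_valid_records(
    records: list[dict[str, Any]],
) -> tuple[list[dict[str, Any]], list[int]]:
    """Return (well-formed records, indexes of malformed records)."""
    valid: list[dict[str, Any]] = []
    bad_indexes: list[int] = []
    for index, record in enumerate(records):
        # Every required key must be present and hold a string.
        if all(isinstance(record.get(key), str) for key in REQUIRED_KEYS):
            valid.append(record)
        else:
            bad_indexes.append(index)
    return valid, bad_indexes


batch = [
    {"user_id": "u1", "session_id": "s1", "content": "hi", "content_type": "prompt"},
    {"user_id": "u1", "session_id": "s1", "content": "no type"},
    {"user_id": "u1", "session_id": "s1", "content": "ok", "content_type": "completion",
     "metadata": {"token_count": 3}},
]
valid_records, bad_indexes = split_valid_records(batch)
```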

#### delete_user()

```python
def delete_user(self, user_id: str) -> dict[str, Any]
# Returns: {"consent_revoked": bool, "records_deleted": int}
```

Revokes consent AND deletes all stored training data for the user.
Call this when a user opts out and chooses to delete their data.

### 2.2 CaptureSession: Session-Oriented Wrapper

```python
from moe-team.src.training import CaptureSession
```

Designed for per-session capture with start/end lifecycle and built-in
stats tracking.

```python
class CaptureSession:
    def __init__(
        self,
        user_id: str,
        session_id: str,
        consent_manager: ConsentManager,
        corpus: TrainingCorpus,
        scrubber: PIIScrubber,
        tagger: DomainTagger,
    ) -> None

    def start(self) -> None
        # Raises PermissionError if user not opted in

    def record_prompt(self, text: str, metadata: dict[str, Any] | None = None) -> None
    def record_completion(self, text: str, metadata: dict[str, Any] | None = None) -> None
    def record_conversation(self, messages: list[dict[str, Any]], metadata: dict[str, Any] | None = None) -> None

    def end(self) -> dict[str, Any]
        # Returns: {"session_id", "records_captured", "bytes_captured", "domains", "duration_seconds"}

    def is_active(self) -> bool
        # Re-checks consent on every call; stops if user revokes mid-session
```

CaptureSession is the recommended API for the daemon team. It handles
consent re-checking on every capture call, so if a user revokes consent
mid-session, capture stops immediately.
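
One way to keep the `start()`/`end()` lifecycle tidy on the daemon side is a small context manager. The sketch below is an assumption about how a caller might wrap a session, not part of the network team's API: `capture_scope` and `FakeSession` are hypothetical, and `FakeSession` only mimics the method names listed above.

```python
from contextlib import contextmanager


@contextmanager
def capture_scope(session):
    """Yield the session if start() succeeds, else None; always end() a started session."""
    try:
        session.start()
        started = True
    except PermissionError:
        started = False  # user not opted in: caller skips capture
    if not started:
        yield None
        return
    try:
        yield session
    finally:
        session.end()


class FakeSession:
    """Stand-in exposing the CaptureSession method names (not the real class)."""

    def __init__(self, opted_in: bool) -> None:
        self.opted_in = opted_in
        self.prompts: list[str] = []
        self.ended = False

    def start(self) -> None:
        if not self.opted_in:
            raise PermissionError("user not opted in")

    def is_active(self) -> bool:
        return self.opted_in

    def record_prompt(self, text, metadata=None) -> None:
        self.prompts.append(text)

    def end(self) -> dict:
        self.ended = True
        return {"records_captured": len(self.prompts)}


def run_session(session, prompts) -> None:
    with capture_scope(session) as active:
        for text in prompts:
            # Re-check before every capture so a mid-session revoke takes effect.
            if active is not None and active.is_active():
                active.record_prompt(text)


opted_in_session = FakeSession(opted_in=True)
run_session(opted_in_session, ["first", "second"])
opted_out_session = FakeSession(opted_in=False)
run_session(opted_out_session, ["ignored"])
```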

### 2.3 ConsentManager: Consent Operations

```python
from moe-team.src.training import ConsentManager
```

```python
class ConsentManager:
    def __init__(self, db_path: str = "~/.groove/consent.db") -> None

    def record_consent(
        self,
        user_id: str,
        opted_in: bool,
        consent_version: str,
        metadata: dict[str, Any] | None = None,
    ) -> None

    def is_opted_in(self, user_id: str) -> bool

    def revoke_consent(self, user_id: str) -> bool
        # Returns False if user had no consent records

    def request_deletion(self, user_id: str) -> dict[str, Any]
        # Returns: {"user_id", "consent_records", "status": "pending"}

    def get_opted_in_count(self) -> int

    def get_consent_history(self, user_id: str) -> list[ConsentRecord]
```

Key behavior of `is_opted_in()`:
- Returns `False` if no consent record exists (default opt-out)
- Returns `False` if the latest consent version doesn't match
  `CURRENT_CONSENT_VERSION` (currently "1.0"); this forces re-consent
  when terms change
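
The two rules above (default opt-out, and version mismatch counting as opted out) can be illustrated with an in-memory SQLite table. The schema and function names here are assumptions made for the sketch; the real `consent.py` layout is not shown in this guide.

```python
import sqlite3
import time

CURRENT_CONSENT_VERSION = "1.0"  # value quoted in this guide


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    # Hypothetical append-only consent log; the newest row per user wins.
    conn.execute(
        "CREATE TABLE consent ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, user_id TEXT NOT NULL, "
        "opted_in INTEGER NOT NULL, consent_version TEXT NOT NULL, recorded_at REAL NOT NULL)"
    )
    return conn


def record_consent(conn, user_id: str, opted_in: bool, consent_version: str) -> None:
    conn.execute(
        "INSERT INTO consent (user_id, opted_in, consent_version, recorded_at) "
        "VALUES (?, ?, ?, ?)",
        (user_id, int(opted_in), consent_version, time.time()),
    )
    conn.commit()


def is_opted_in(conn, user_id: str) -> bool:
    row = conn.execute(
        "SELECT opted_in, consent_version FROM consent "
        "WHERE user_id = ? ORDER BY id DESC LIMIT 1",
        (user_id,),
    ).fetchone()
    if row is None:
        return False  # no record: default opt-out
    opted_in, version = row
    # Consent given under older terms does not count.
    return bool(opted_in) and version == CURRENT_CONSENT_VERSION


conn = make_db()
before = is_opted_in(conn, "u1")
record_consent(conn, "u1", True, "1.0")
after_opt_in = is_opted_in(conn, "u1")
record_consent(conn, "u2", True, "0.9")  # consent under stale terms
stale = is_opted_in(conn, "u2")
record_consent(conn, "u1", False, "1.0")  # revoke
after_revoke = is_opted_in(conn, "u1")
```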

### 2.4 CorpusStats: Monitoring

```python
from moe-team.src.training import CorpusStats

class CorpusStats:
    def __init__(self, corpus: TrainingCorpus, consent_manager: ConsentManager) -> None

    def summary(self) -> dict[str, Any]
        # Returns: {"total_records", "storage_size_mb", "opted_in_users", "domains"}

    def daily_growth(self, days: int = 7) -> list[dict[str, Any]]
        # Returns: [{"date": "2026-04-26", "records": 142}, ...]

    def domain_breakdown(self) -> dict[str, Any]
        # Returns: {"journalism": {"count": 50, "percentage": 33.3}, ...}

    def print_report(self) -> None
        # Prints formatted report to stdout
```
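
For a settings panel that needs per-user numbers, the same shape as `domain_breakdown()` can be computed client-side from already-loaded records. This is a sketch, not the `CorpusStats` implementation: `domain_breakdown_for` is a hypothetical helper, and rounding to one decimal place is a guess based on the `33.3` in the example return value above.

```python
from collections import Counter


def domain_breakdown_for(records):
    """Count records per domain and express each count as a percentage of the total."""
    counts = Counter(record.get("domain", "general") for record in records)
    total = sum(counts.values())
    return {
        domain: {"count": count, "percentage": round(100.0 * count / total, 1)}
        for domain, count in counts.items()
    }


sample = [
    {"domain": "journalism"},
    {"domain": "code"},
    {"domain": "code"},
]
breakdown = domain_breakdown_for(sample)
empty_breakdown = domain_breakdown_for([])
```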

---

## 3. What the Daemon Team Needs to Build

### 3.1 User-Facing Opt-In

Add a settings toggle in the Groove app/IDE:

**Label:** "Share usage data to improve Groove"
**Default state:** OFF. No data collected without explicit user action.

**When toggled ON:**

1. Show a clear disclosure dialog before enabling. The disclosure must state:
   - What data is collected: prompts, completions, workflow metadata
   - How it's used: training MoE expert models to improve Groove
   - PII is automatically scrubbed before storage (emails, phone numbers,
     API keys, file paths, etc. are all replaced with placeholders)
   - How to opt out: toggle the setting OFF at any time
   - How to delete data: option available in settings to delete all
     previously collected data

2. Generate or load a persistent `user_id`:
   - Store at `~/.groove/user_id`
   - Generate as a random UUID (e.g., `uuid.uuid4().hex`)
   - Generated once per install, reused across sessions
   - NOT tied to identity: no hardware IDs, no email, no name, no IP

3. Record consent:
   ```python
   consent_manager.record_consent(
       user_id=user_id,
       opted_in=True,
       consent_version="1.0",
   )
   ```

**When toggled OFF:**

1. Revoke consent:
   ```python
   consent_manager.revoke_consent(user_id)
   ```

2. Stop all capture immediately. Any active CaptureSession will
   self-deactivate on the next `is_active()` check.

3. Offer the option to delete previously collected data:
   ```python
   intake.delete_user(user_id)
   # This revokes consent AND deletes all JSONL records for the user
   ```

### 3.2 Data Capture Points

Hook capture at these points in the daemon/app:

#### PROMPTS

When user submits a prompt for inference, capture the prompt text.

```python
session.record_prompt(prompt_text, metadata={
    "model_name": model_name,        # e.g., "qwen3-moe-a3b"
    "timestamp": time.time(),
    "workflow_type": workflow_type,  # e.g., "journalism", "code_editor"
})
```

Or with TrainingDataIntake:
```python
intake.submit(user_id, session_id, prompt_text, "prompt", metadata={
    "model_name": model_name,
    "timestamp": time.time(),
    "workflow_type": workflow_type,
})
```

#### COMPLETIONS

When the model returns a completion, capture the output text.

```python
session.record_completion(completion_text, metadata={
    "model_name": model_name,
    "token_count": token_count,
    "latency_ms": latency_ms,
    "finish_reason": finish_reason,  # e.g., "stop", "length", "eos"
})
```

Or with TrainingDataIntake:
```python
intake.submit(user_id, session_id, completion_text, "completion", metadata={
    "model_name": model_name,
    "token_count": token_count,
    "latency_ms": latency_ms,
    "finish_reason": finish_reason,
})
```

#### CONVERSATIONS

For multi-turn sessions, capture the full conversation at session end.

```python
messages = [
    {"role": "user", "content": "..."},
    {"role": "assistant", "content": "..."},
    {"role": "user", "content": "..."},
    {"role": "assistant", "content": "..."},
]
session.record_conversation(messages, metadata={
    "turn_count": len(messages),
    "total_tokens": total_tokens,
    "session_duration_s": duration_seconds,
})
```

Or with TrainingDataIntake:
```python
import json
intake.submit(user_id, session_id, json.dumps(messages), "conversation", metadata={
    "turn_count": len(messages),
    "total_tokens": total_tokens,
    "session_duration_s": duration_seconds,
})
```

#### WORKFLOW METADATA

Capture what type of work the user is doing (journalism, coding,
research, planning). The domain tagger classifies content automatically
based on keywords, but explicit `workflow_type` in metadata helps
validate and improve the tagger.

If the app knows the user is in "journalist mode" or "code editor",
pass that as metadata on every submit call:

```python
metadata={"workflow_type": "journalism"}
```

### 3.3 Capture Requirements

**ZERO OVERHEAD when opted out:**
Don't import `moe-team.src.training` modules if the user hasn't opted in.
Check a local flag file or config value first. Recommended pattern:

```python
import os

def is_capture_enabled() -> bool:
    """Check local flag before touching any training modules."""
    user_id_path = os.path.expanduser("~/.groove/user_id")
    if not os.path.exists(user_id_path):
        return False
    # Only import consent manager if user_id exists
    from moe-team.src.training import ConsentManager
    consent = ConsentManager()
    with open(user_id_path) as f:
        user_id = f.read().strip()
    return consent.is_opted_in(user_id)
```

**NON-BLOCKING:**
Capture calls must not slow down inference. The `intake.submit()` and
`CaptureSession.record_*()` calls do synchronous I/O (SQLite reads +
JSONL file writes). Do NOT call them in the inference hot path.

Recommended pattern: a background queue.

```python
import logging
import queue
import threading

logger = logging.getLogger(__name__)
capture_queue = queue.Queue(maxsize=10000)

def capture_worker(intake, user_id):
    """Background thread that drains the capture queue."""
    while True:
        item = capture_queue.get()
        try:
            if item is None:  # sentinel: shut the worker down
                break
            intake.submit(
                user_id=user_id,
                session_id=item["session_id"],
                content=item["content"],
                content_type=item["content_type"],
                metadata=item.get("metadata"),
            )
        except Exception:
            # fail silent: log a warning, never crash over capture
            logger.warning("training capture failed", exc_info=True)
        finally:
            # always mark the item done so capture_queue.join() cannot hang
            capture_queue.task_done()

# Start worker thread at app startup (only if opted in)
worker = threading.Thread(target=capture_worker, args=(intake, user_id), daemon=True)
worker.start()

# In the inference path: non-blocking enqueue
def on_prompt(session_id, prompt_text, model_name):
    try:
        capture_queue.put_nowait({
            "session_id": session_id,
            "content": prompt_text,
            "content_type": "prompt",
            "metadata": {"model_name": model_name},
        })
    except queue.Full:
        pass  # drop if queue is full; never block inference
```

If using asyncio:

```python
import asyncio

async def capture_submit(intake, user_id, session_id, content, content_type, metadata):
    # get_running_loop() is the supported way to get the loop inside a coroutine
    loop = asyncio.get_running_loop()
    await loop.run_in_executor(None, intake.submit, user_id, session_id, content, content_type, metadata)
```

**FAIL SILENT:**
If capture fails (disk full, permission error, SQLite locked), log a
warning and continue. Never crash the app over training data capture.
Wrap every capture call in a try/except that swallows all exceptions.
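
A decorator is one compact way to apply that rule to every capture call. The sketch below is illustrative: `fail_silent` and the logger name `groove.capture` are hypothetical, and the decorated functions stand in for real capture hooks.

```python
import functools
import logging

logger = logging.getLogger("groove.capture")  # logger name is illustrative


def fail_silent(func):
    """Log a warning and return None instead of propagating capture errors."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception:
            logger.warning("training capture failed in %s", func.__name__, exc_info=True)
            return None

    return wrapper


@fail_silent
def capture_that_fails(text: str) -> str:
    raise OSError("disk full")  # simulated storage failure


@fail_silent
def capture_that_works(text: str) -> str:
    return f"captured:{text}"


failed_result = capture_that_fails("hello")
ok_result = capture_that_works("hello")
```

Catching `Exception` rather than using a bare `except` leaves `KeyboardInterrupt` and `SystemExit` alone, so the app can still shut down normally.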

### 3.4 User Data Management UI

Provide a way for users to:

**View contribution stats:**
```python
from moe-team.src.training import CorpusStats, TrainingCorpus, ConsentManager

corpus = TrainingCorpus()
consent = ConsentManager()
stats = CorpusStats(corpus, consent)

summary = stats.summary()
# {"total_records": 1423, "storage_size_mb": 2.34, "opted_in_users": 1, "domains": {...}}

growth = stats.daily_growth(days=7)
# [{"date": "2026-04-26", "records": 142}, {"date": "2026-04-27", "records": 203}, ...]
```

**Download their data:**
```python
corpus = TrainingCorpus()
count = corpus.export_jsonl(
    output_path="/tmp/my_groove_data.jsonl",
    domain=None,  # all domains, or filter by "journalism", "code", etc.
    after=None,   # all time, or filter by timestamp
)
# count = number of records exported
```

Note: `export_jsonl()` exports ALL users' data. To filter by user_id,
the daemon team should read the JSONL and filter client-side, or the
network team can add a user_id filter (file a request).
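
Client-side filtering can be a single streaming pass over the exported file. The sketch below assumes only the record layout documented in this guide (one JSON object per line with a `user_id` field); `filter_export_by_user` is a hypothetical helper, and the demo runs against a throwaway file rather than a real export.

```python
import json
import tempfile
from pathlib import Path


def filter_export_by_user(export_path, output_path, user_id: str) -> int:
    """Copy only the lines belonging to user_id; return how many were written."""
    written = 0
    with open(export_path, encoding="utf-8") as src, \
            open(output_path, "w", encoding="utf-8") as dst:
        for line in src:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip a corrupt line rather than abort the download
            if record.get("user_id") == user_id:
                dst.write(json.dumps(record) + "\n")
                written += 1
    return written


# Demo with a throwaway export file containing two users.
_tmp = Path(tempfile.mkdtemp())
_export = _tmp / "all_users.jsonl"
_rows = [
    {"user_id": "aaa", "content": "one"},
    {"user_id": "bbb", "content": "two"},
    {"user_id": "aaa", "content": "three"},
]
_export.write_text("\n".join(json.dumps(r) for r in _rows) + "\n", encoding="utf-8")
_mine = _tmp / "my_groove_data.jsonl"
exported = filter_export_by_user(_export, _mine, "aaa")
mine_contents = [json.loads(l)["content"] for l in _mine.read_text(encoding="utf-8").splitlines()]
```

Streaming line by line keeps memory flat even for a large corpus, and the intermediate all-users export should be deleted once the filtered copy is written.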

**Delete their data:**
```python
intake.delete_user(user_id)
# Revokes consent AND deletes all JSONL records for the user
```

This can be a settings panel, CLI command (`groove data --stats`,
`groove data --export`, `groove data --delete`), or both.

---

## 4. Data Format Reference

### TrainingRecord Schema

Each record stored in the JSONL corpus has these fields:

| Field             | Type             | Description                                                  |
|-------------------|------------------|--------------------------------------------------------------|
| `record_id`       | `str`            | Hex UUID, unique per record                                  |
| `user_id`         | `str`            | Anonymized install ID (random UUID from `~/.groove/user_id`) |
| `session_id`      | `str`            | Groups records from the same inference session               |
| `timestamp`       | `float`          | Unix timestamp (`time.time()`)                               |
| `domain`          | `str`            | Auto-tagged: journalism, code, research, planning, general   |
| `content_type`    | `str`            | prompt, completion, conversation, workflow                   |
| `content`         | `str`            | PII-scrubbed text                                            |
| `metadata`        | `dict[str, Any]` | Model name, token count, latency, workflow_type, etc.        |
| `consent_version` | `str`            | Version of consent terms (currently "1.0")                   |

### Storage Location

JSONL files at `~/.groove/training_data/training_YYYY-MM-DD.jsonl`

One JSON object per line, daily rotation. Standard format for LLM
training pipelines.

### Example JSONL Line

```json
{"record_id":"a1b2c3d4e5f6","user_id":"f8e7d6c5b4a3","session_id":"sess_001","timestamp":1745625600.0,"domain":"journalism","content_type":"prompt","content":"Write a lead paragraph about the city council vote on [EMAIL] proposed budget","metadata":{"model_name":"qwen3-moe-a3b","workflow_type":"journalism"},"consent_version":"1.0"}
```
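
When the daemon reads corpus lines back (for example to show them in the data management UI), a typed view makes field access explicit. The dataclass below mirrors the schema table in this section; `TrainingRecordView` and `parse_record` are hypothetical names, not the network team's `TrainingRecord` class.

```python
from __future__ import annotations

import json
from dataclasses import dataclass, field
from typing import Any


@dataclass
class TrainingRecordView:
    """Read-only view of one JSONL line, with the fields from the schema table."""

    record_id: str
    user_id: str
    session_id: str
    timestamp: float
    domain: str
    content_type: str
    content: str
    consent_version: str
    metadata: dict[str, Any] = field(default_factory=dict)


def parse_record(line: str) -> TrainingRecordView:
    # Raises TypeError if the line has missing or unexpected fields.
    return TrainingRecordView(**json.loads(line))


example_line = (
    '{"record_id":"a1b2c3d4e5f6","user_id":"f8e7d6c5b4a3","session_id":"sess_001",'
    '"timestamp":1745625600.0,"domain":"journalism","content_type":"prompt",'
    '"content":"Write a lead paragraph about the city council vote on [EMAIL] proposed budget",'
    '"metadata":{"model_name":"qwen3-moe-a3b","workflow_type":"journalism"},'
    '"consent_version":"1.0"}'
)
record = parse_record(example_line)
```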

---

## 5. Content Types and What to Capture

| content_type   | When to capture                 | What to include               |
|----------------|---------------------------------|-------------------------------|
| `prompt`       | User submits text for inference | The raw prompt text           |
| `completion`   | Model returns generated text    | The generated output          |
| `conversation` | Multi-turn session ends         | Full message history as JSON  |
| `workflow`     | User completes a workflow/task  | Summary of the workflow steps |

Notes:
- `prompt` and `completion` are captured in real-time during inference
- `conversation` is captured once at session end, as the full message array
- `workflow` is optional; capture it if the app has a concept of discrete
  tasks or workflows (e.g., "user finished writing an article")

---

## 6. Integration Checklist

- [ ] Generate persistent `user_id` (random UUID) at `~/.groove/user_id`
- [ ] Add opt-in toggle to settings UI (default: OFF)
- [ ] Show data collection disclosure dialog on first opt-in
- [ ] Call `consent_manager.record_consent(user_id, opted_in=True, consent_version="1.0")` on opt-in
- [ ] Call `consent_manager.revoke_consent(user_id)` on opt-out
- [ ] Hook capture at prompt submission point
- [ ] Hook capture at completion receipt point
- [ ] Hook capture at session end (conversation type)
- [ ] Implement background capture queue (non-blocking)
- [ ] Guard all capture behind opted-in check (zero overhead when off)
- [ ] Add data management UI (view stats, download, delete)
- [ ] Test: opt-in -> capture -> opt-out -> capture stops
- [ ] Test: data deletion removes all user records
- [ ] Test: no PII leaks through (the intake API scrubs, but verify)

---

## 7. Quick Start Code Example

Minimal end-to-end integration using CaptureSession:

```python
import os
import uuid

from moe-team.src.training import (
    CaptureSession,
    ConsentManager,
    TrainingCorpus,
    PIIScrubber,
    DomainTagger,
)

# --- One-time setup (app startup) ---

USER_ID_PATH = os.path.expanduser("~/.groove/user_id")

def get_or_create_user_id() -> str:
    if os.path.exists(USER_ID_PATH):
        with open(USER_ID_PATH) as f:
            return f.read().strip()
    uid = uuid.uuid4().hex
    os.makedirs(os.path.dirname(USER_ID_PATH), exist_ok=True)
    with open(USER_ID_PATH, "w") as f:
        f.write(uid)
    return uid

user_id = get_or_create_user_id()
consent = ConsentManager()   # db at ~/.groove/consent.db
corpus = TrainingCorpus()    # JSONL at ~/.groove/training_data/
scrubber = PIIScrubber()
tagger = DomainTagger()

# --- User opts in (settings toggle) ---

consent.record_consent(user_id, opted_in=True, consent_version="1.0")

# --- Per-session capture ---

session_id = uuid.uuid4().hex
capture = CaptureSession(user_id, session_id, consent, corpus, scrubber, tagger)
capture.start()

# User sends a prompt
capture.record_prompt("Summarize the city council meeting notes", metadata={
    "model_name": "qwen3-moe-a3b",
})

# Model returns a completion
capture.record_completion("The city council voted 7-2 to approve...", metadata={
    "model_name": "qwen3-moe-a3b",
    "token_count": 142,
    "latency_ms": 820,
})

# Session ends: capture full conversation
capture.record_conversation([
    {"role": "user", "content": "Summarize the city council meeting notes"},
    {"role": "assistant", "content": "The city council voted 7-2 to approve..."},
], metadata={"turn_count": 2, "total_tokens": 168})

summary = capture.end()
# {"session_id": "...", "records_captured": 3, "bytes_captured": 312, ...}

# --- User opts out and deletes data ---

from moe-team.src.training.intake import TrainingDataIntake

intake = TrainingDataIntake(consent, corpus, scrubber, tagger)
result = intake.delete_user(user_id)
# {"consent_revoked": True, "records_deleted": 3}
```

Alternative using TrainingDataIntake directly (simpler, no session lifecycle):

```python
from moe-team.src.training.intake import TrainingDataIntake

intake = TrainingDataIntake(consent, corpus, scrubber, tagger)

# Submit individual records
intake.submit(user_id, session_id, prompt_text, "prompt", {"model_name": "qwen3-moe-a3b"})
intake.submit(user_id, session_id, completion_text, "completion", {"token_count": 142})

# User opts out: revokes consent + deletes all data
intake.delete_user(user_id)
```

---

## 8. PII Scrubbing: What the Daemon Team Does NOT Need to Do

The intake API and CaptureSession scrub ALL content before storage.
The daemon team does NOT need to scrub PII. Just pass raw text.

The scrubber (`PIIScrubber` in `moe-team/src/training/scrubber.py`)
handles these PII categories:

| PII Type         | Replacement      | Example                                     |
|------------------|------------------|---------------------------------------------|
| Email addresses  | `[EMAIL]`        | `user@example.com` -> `[EMAIL]`             |
| Phone numbers    | `[PHONE]`        | `(555) 123-4567` -> `[PHONE]`               |
| IPv4 addresses   | `[IP]`           | `192.168.1.1` -> `[IP]`                     |
| IPv6 addresses   | `[IP]`           | `2001:db8::1` -> `[IP]`                     |
| SSNs             | `[SSN]`          | `123-45-6789` -> `[SSN]`                    |
| Credit cards     | `[CREDIT_CARD]`  | `4111-1111-1111-1111` -> `[CREDIT_CARD]`    |
| AWS access keys  | `[AWS_KEY]`      | `AKIAIOSFODNN7EXAMPLE` -> `[AWS_KEY]`       |
| Private keys     | `[PRIVATE_KEY]`  | `-----BEGIN RSA PRIVATE KEY-----...`        |
| Bearer tokens    | `[API_KEY]`      | `Bearer eyJhbGc...` -> `[API_KEY]`          |
| sk_/pk_ API keys | `[API_KEY]`      | `sk-abc123...` -> `[API_KEY]`               |
| Long hex strings | `[API_KEY]`      | 40+ char hex strings -> `[API_KEY]`         |
| User file paths  | `[FILE_PATH]`    | `/Users/john/docs/...` -> `[FILE_PATH]`     |
| URLs with tokens | `[REDACTED_URL]` | `https://...?token=abc` -> `[REDACTED_URL]` |

Credit card detection includes Luhn checksum validation to reduce false
positives.
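
For reference when writing the "no PII leaks" test, the Luhn check itself is short. This is the standard algorithm, not a copy of `scrubber.py`; `luhn_valid` is a hypothetical name, and the 13-digit minimum length is an assumption about what counts as a card-like number.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 13:  # assumed minimum length for a card-like number
        return False
    total = 0
    # Walk right to left, doubling every second digit.
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as summing the two digits of the product
        total += digit
    return total % 10 == 0


test_card_ok = luhn_valid("4111-1111-1111-1111")
test_card_bad = luhn_valid("4111-1111-1111-1112")
too_short = luhn_valid("1234")
```

A random 16-digit number passes this check only about one time in ten, which is why it cuts down false positives on order numbers and similar identifiers.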

If the daemon team discovers PII types the scrubber misses, report to
the network team to add patterns to `scrubber.py`.

---

## 9. Privacy Principles

1. **Default opt-out:** No data collection without explicit user action.
   The toggle defaults to OFF. `is_opted_in()` returns `False` when no
   consent record exists.

2. **Anonymous user_id:** The `user_id` is a random UUID generated once
   per install. It is NOT derived from any personal information: no
   hardware IDs, no email, no name, no IP address.

3. **PII scrubbed before storage:** All content passes through the
   PIIScrubber before hitting disk. The scrubber replaces 13 categories
   of PII with placeholder tokens.

4. **User can delete all data at any time:** `intake.delete_user(user_id)`
   removes all JSONL records and revokes consent. The consent database
   also supports `request_deletion()` for formal deletion requests.

5. **Consent is versioned:** `is_opted_in()` checks that the user's
   consent version matches `CURRENT_CONSENT_VERSION` (currently "1.0").
   If terms change and the version is bumped, all users must re-consent.
   Old consent is treated as not-opted-in until re-consented.

6. **Training data stays local:** Data is stored on the machine running
   the Groove service (`~/.groove/training_data/`). No data leaves the
   machine without explicit export. Future network-level aggregation
   will be a separate opt-in.

7. **Mid-session revocation:** If a user revokes consent during an active
   session, `CaptureSession.is_active()` detects this on the next call
   and stops capture immediately. No buffered data is flushed.

---

## 10. File Reference

| File | Description |
|------|-------------|
| `moe-team/src/training/__init__.py` | Package exports: CaptureSession, ConsentManager, ConsentRecord, CorpusStats, DomainTagger, PIIScrubber, TrainingCorpus, TrainingRecord |
| `moe-team/src/training/intake.py` | TrainingDataIntake: simple submit/batch/delete API |
| `moe-team/src/training/consent.py` | ConsentManager: SQLite consent storage, versioned consent |
| `moe-team/src/training/scrubber.py` | PIIScrubber: 13 compiled regex patterns for PII removal |
| `moe-team/src/training/domain_tagger.py` | DomainTagger: keyword-based domain classification |
| `moe-team/src/training/corpus.py` | TrainingCorpus: JSONL storage with daily rotation |
| `moe-team/src/training/capture.py` | CaptureSession: session-oriented capture with lifecycle |
| `moe-team/src/training/stats.py` | CorpusStats: summary, daily growth, domain breakdown |