loki-mode 7.18.1 → 7.18.3

@@ -0,0 +1,527 @@
1
+ # Crash Reporting and Auto-Fix Pipeline -- Implementation Plan
2
+
3
+ Status: DESIGN ONLY. No code, no version bumps, no commits in this pass.
4
+ Author role: Architect. Date: 2026-06-06. Target repo: asklokesh/loki-mode.
5
+
6
+ This plan designs a disclosed, anonymous, frictionless crash-reporting plus
7
+ auto-fix pipeline. It is grounded in the existing codebase; every file path
8
+ and function name below was verified by reading the repo. Where a thing does
9
+ NOT yet exist, it is marked MUST ADD. Where the spec wording and the best
10
+ engineering choice diverge, the deviation is called out explicitly.
11
+
12
+ --------------------------------------------------------------------------------
13
+ ## 1. What already exists (verified, not assumed)
14
+ --------------------------------------------------------------------------------
15
+
16
+ ### 1a. Telemetry already ships today -- and is currently UNDISCLOSED on first run
17
+ This is the headline finding. Loki Mode already collects anonymous usage data
18
+ via PostHog. There is no first-run disclosure line in the code today.
19
+
20
+ - `autonomy/telemetry.sh` -- bash PostHog client. Hardcoded ingest key
21
+ `phc_ya0vGBru41AJWtGNfZZ8H9W4yjoZy4KON0nnayS7s87`, host
22
+ `https://us.i.posthog.com`, path `/capture/`. Fire-and-forget curl with
23
+ `--max-time 3`. Distinct id persisted at `~/.loki-telemetry-id`.
24
+ Gated only by `LOKI_TELEMETRY_DISABLED=true` and `DO_NOT_TRACK=1`
25
+ (`_loki_telemetry_enabled`, line 9). Sourced by `autonomy/run.sh:648-651`.
26
+ Events fired: `session_start` (run.sh:13310), `session_end` (run.sh:13410).
27
+ - `dashboard/telemetry.py` -- Python equivalent. Same key/host, same opt-out
28
+ vars, `send_telemetry()` on a daemon thread. Called from
29
+ `dashboard/server.py:755` (`dashboard_start`).
30
+ - `bin/postinstall.js:182-209` -- npm install-time event to the same PostHog
31
+ host/key, same opt-out vars.
32
+ - `docs/WELCOME-OPENER-PLAN.md` -- an EXISTING (unimplemented in code) plan for
33
+ a first-run welcome that reuses this same PostHog contract and an opt-in form.
34
+ No sentinel/welcome is wired in code yet; only the plan doc exists.
35
+
36
+ Consequence for this feature: the honesty invariant must cover the PostHog path
37
+ too. PRIVACY.md and the first-run line cannot describe only the new crash
38
+ pipeline while `session_start` / `session_end` / install events fire silently.
39
+ The opt-out must be UNIFIED so one switch disables both PostHog usage telemetry
40
+ and crash reporting. See section 6.
41
+
42
+ ### 1b. There is a production-grade redactor already -- the keystone of this plan
43
+ - `autonomy/lib/proof_redact.py` -- the single security chokepoint for the
44
+ proof-of-run feature. Verified contents:
45
+ - `RULES_VERSION = "1.0"` (frozen, bump-on-behavior-change).
46
+ - `redact_value(s)` -- pure function, redacts one string.
47
+ - `redact_tree(obj)` -- recurses dict keys+values, lists, nested structures;
48
+ returns `(new_obj, total_redactions_count)`.
49
+ - Ordered, ReDoS-hardened patterns: Anthropic `sk-ant-`, GitHub `gh[pousr]_`,
50
+ Slack `xox[baprs]-`, AWS `AKIA...` + typed secret assign, JWT `eyJ...`,
51
+ Google `AI...`, generic OpenAI `sk-`, Bearer (keeps scheme), PEM PRIVATE KEY
52
+ blocks (dropped whole), `_ENV_ASSIGN` secret-keyed assignments (bare / JSON /
53
+ YAML quoted), `_URI_CREDENTIAL` (scheme://user:PASS@host), and
54
+ `_UNIX_HOME` `/(Users|home)/<name>` + `_WIN_HOME` `C:\Users\<name>` ->
55
+ `~`, with optional `set_context(home, repo_root)` for repo-relative paths.
56
+ - Parity model is ALREADY a shared Python module, not a TS port. `loki-ts`
57
+ reaches the redactor by shelling out: `loki-ts/src/runner/proof.ts:27` resolves
58
+ `autonomy/lib/proof-generator.py` via `findPython3` (`loki-ts/src/util/python.ts`)
59
+ and bash calls `"$SCRIPT_DIR/lib/proof-generator.py"`. Both routes call the
60
+ SAME python, so redaction can never drift between routes.
61
+ - Fail-closed precedent: `loki-ts/src/commands/proof.ts:216-227` refuses to
62
+ publish unless `redaction.applied` is confirmed -- "never publish an
63
+ unredacted artifact." This plan adopts the identical posture.
64
+ - An older bash inline privacy guard exists around `autonomy/run.sh:9047`
65
+ (referenced in `proof_redact.py` comments as the source the ENV-assign rule
66
+ mirrors).
67
+
68
+ ### 1c. Event bus surface
69
+ - `events/bus.py`, `events/bus.ts`, `events/emit.sh`. TS `EventType` enum
70
+ (bus.ts:12): `state | memory | task | metric | error | session | command |
71
+ user`. `EventSource`: `cli | api | vscode | mcp | skill | hook | dashboard |
72
+ memory | runner`. Exports include `emitErrorEvent` (bus.ts:467).
73
+ - `emit.sh` exposes `safe_append_event_jsonl()` (flock or mkdir-mutex serialized
74
+ append to `.loki/events.jsonl`), sourceable with `LOKI_EMIT_LIB_ONLY=1`.
75
+ - `autonomy/run.sh:1138` defines `emit_event_json()`; `emit_event_pending` is
76
+ used at the iteration-complete site.
77
+
78
+ ### 1d. Metrics / KPI collector
79
+ - `loki-ts/src/metrics/kpis.ts` and `loki-ts/src/metrics/trust.ts` exist, with
80
+ command parity in `loki-ts/src/commands/kpis.ts`, `stats.ts`, `trust.ts`.
81
+ - `.loki/metrics/` holds efficiency + reward data. `autonomy/context-tracker.py`
82
+ exists. No crash-specific collector exists. MUST ADD.
83
+
84
+ ### 1e. The doctor command
85
+ - bash `cmd_doctor` / `cmd_doctor_json` (`autonomy/loki`, per header in
86
+ `loki-ts/src/commands/doctor.ts:1` referencing bash line 6216 / 6534).
87
+ - TS port: `loki-ts/src/commands/doctor.ts`. Good surface to add a "telemetry:
88
+ on/off, crash buffer: N pending" line later.
89
+
90
+ ### 1f. Naming and dispatch collisions (already resolved below)
91
+ - `loki report` is TAKEN: `cmd_report` (`autonomy/loki:25091`) is a SESSION
92
+ report generator (text/markdown/html). The manual crash submit command must
93
+ NOT reuse `report`. Decision: use `loki crash` with subcommands
94
+ (`loki crash` = show pending, `loki crash submit`, `loki crash show <id>`).
95
+ - `loki telemetry` is TAKEN: `cmd_telemetry` (`autonomy/loki:17946`) is the
96
+ OTEL tracing config (`status` / `enable`), dispatched `telemetry|otel`
97
+ (`autonomy/loki:13437`). Decision: ADD `off` / `on` / `status` (extended)
98
+ subcommands to the EXISTING `cmd_telemetry`. The spec-mandated
99
+ `loki telemetry off` thus lives inside the existing command and drives the
100
+ unified opt-out. Do not create a second `telemetry` command.
101
+
102
+ ### 1g. Capture hook points (verified; mostly MUST ADD)
103
+ - TS: `loki-ts/src/cli.ts` has `process.on("SIGINT", ...)` (line 224),
104
+ `process.on("SIGTERM", ...)` (line 225), and a single terminal
105
+ `process.exit(code)` (line 228). There is NO `uncaughtException` /
106
+ `unhandledRejection` handler. MUST ADD both, plus a wrapper around the
107
+ terminal exit to capture nonzero exits.
108
+ - bash: traps are EXIT/INT/TERM cleanups only (run.sh:186, 199, 2891-2892,
109
+ 12734, 12843, 12914, 13192; loki:6160). There is a natural capture point at
110
+ the iteration-complete block (run.sh ~11968-11989) where `$exit_code` is
111
+ known and `status=...error` is already emitted, and `auto_capture_episode`
112
+ (run.sh:12206) already records per-iteration outcome. MUST ADD an ERR/EXIT
113
+ crash hook in `main()` (run.sh:12913) and a friction hook at the existing
114
+ retry/rate-limit/gate sites.
115
+
116
+ ### 1h. Issue-mode (auto-fix trigger primitive) already exists
117
+ - `gh issue ...` plumbing in `autonomy/run.sh` (create at 2200, comment at
118
+ 2078/2087, close at 2092, list at 1828). The product statement that
119
+ `loki start owner/repo#123` runs in issue-mode is consistent with this
120
+ surface; the auto-fix loop reuses it rather than inventing a runner.
121
+
122
+ ### 1i. Release mechanics (do not reconstruct from memory)
123
+ - `scripts/release.sh` is the canonical bump tool. It bumps `VERSION`,
124
+ `package.json`, `vscode-extension/package.json` directly (release.sh:209-211),
125
+ then runs `scripts/update-changelog.sh`. The "14 version files" figure is the
126
+ full release process across docs/wiki/mcp/dashboard `__init__.py`/SKILL.md etc.
127
+ Each phase below says "follow the standard release bump + CHANGELOG"; it does
128
+ NOT enumerate the 14 files from memory. Use the canonical scripts.
129
+ - Gate: `bash scripts/local-ci.sh` must pass (the bun-parity matrix is at
130
+ local-ci.sh:250). 3-reviewer council (2 Opus + 1 Sonnet) unanimous.
131
+
132
+ --------------------------------------------------------------------------------
133
+ ## 2. Architecture (ASCII)
134
+ --------------------------------------------------------------------------------
135
+
136
+ CLIENT (bash route OR Bun route -- identical behavior via shared python)
137
+ +--------------------------------------------------------------+
138
+ | capture hook |
139
+ | - TS: uncaughtException / unhandledRejection / nonzero exit | MUST ADD
140
+ | - bash: ERR/EXIT trap in main(); iteration-complete error | MUST ADD
141
+ | - provider invocation failure; friction (retry/ratelimit/gate)|
142
+ +----------------------------+---------------------------------+
143
+ | raw context (in-process only)
144
+ v
145
+ +--------------------------------------------------------------+
146
+ | SHARED SCRUBBER autonomy/lib/crash_redact.py | MUST ADD
147
+ | imports proof_redact.redact_tree (1b) + crash allow/deny |
148
+ | -> emits WHITELIST-ONLY dict + stable fingerprint |
149
+ | FAIL CLOSED: if python3 missing -> write local, NO egress |
150
+ +----------------------------+---------------------------------+
151
+ |
152
+ +----------------+-----------------+
153
+ v v
154
+ +-----------------------+ +-----------------------------+
155
+ | LOCAL SELF-INSPECT | | OUTBOUND QUEUE (later phase) |
156
+ | .loki/crash/<id>.json | | .loki/crash/outbox/*.json |
157
+ | exactly what would be | | drained by `loki crash |
158
+ | sent (Phase 0 proof) | | submit` / background flush |
159
+ +-----------------------+ +--------------+--------------+
160
+ | HTTPS POST (Phase 1+)
161
+ v
162
+ +--------------------------------------------------------------+
163
+ | INGESTION BACKEND (FastAPI, reuse dashboard/ python stack) | MUST ADD
164
+ | POST /v1/crash (anon, rate-limited, no client write token) |
165
+ | 1. SECOND server-side scrub: import crash_redact.redact_tree |
166
+ | 2. validate against whitelist schema; reject unknown fields |
167
+ | 3. fingerprint -> dedup store (sqlite/KV) |
168
+ | 4. holds GitHub App / PAT token (never on clients) |
169
+ +----------------------------+---------------------------------+
170
+ | novel fingerprint -> create
171
+ | known fingerprint -> bump counter
172
+ v
173
+ +--------------------------------------------------------------+
174
+ | PRIVATE TRIAGE REPO asklokesh/loki-telemetry (raw intake) |
175
+ | one issue per novel fingerprint + occurrence counter |
176
+ +----------------------------+---------------------------------+
177
+ | human or rule confirms "real bug"
178
+ v PROMOTION (sanitized title/body only)
179
+ +--------------------------------------------------------------+
180
+ | AUTO-FIX AGENT loki start asklokesh/loki-telemetry#<n> |
181
+ | reproduce -> fix -> bash scripts/local-ci.sh -> open PR |
182
+ +----------------------------+---------------------------------+
183
+ | PR targets PUBLIC repo, sanitized desc
184
+ v
185
+ +--------------------------------------------------------------+
186
+ | PUBLIC REPO github.com/asklokesh/loki-mode |
187
+ | auto-created PR, NOT auto-merged. council + local-ci gate. |
188
+ | human merge approval (CLAUDE.md). Public issue mirrors the |
189
+ | promise shown in the first-run line. |
190
+ +--------------------------------------------------------------+
191
+
192
+ --------------------------------------------------------------------------------
193
+ ## 3. Phased ship plan (smallest-first; each phase = one PATCH, shippable in a day)
194
+ --------------------------------------------------------------------------------
195
+
196
+ ### Phase 0 -- LOCAL ONLY: capture + scrub + .loki/crash/ + `loki crash` (NO egress)
197
+ Goal: prove the capture+scrub layer with ZERO backend and ZERO network egress.
198
+ Resolves the spec's apparent tension ("manual-submit" vs "no egress"): the
199
+ manual command writes the scrubbed artifact locally and shows the user exactly
200
+ what WOULD be sent; it can optionally open a prefilled GitHub issue URL the user
201
+ submits by hand. No backend POST exists yet.
202
+
203
+ Behavior:
204
+ - On a captured crash/friction event, write the scrubbed whitelist payload to
205
+ `.loki/crash/<fingerprint>-<ts>.json`. Never write unscrubbed data anywhere.
206
+ - `loki crash` lists pending local reports; `loki crash show <id>` prints one;
207
+ `loki crash submit` (Phase 0) prints the payload and a prefilled
208
+ `github.com/asklokesh/loki-mode/issues/new?...` URL for manual submission.
209
+ - FAIL CLOSED: if python3 is unavailable, capture sends nothing and writes no
209
+ report; the local file is written only if the scrub ran.
211
+
212
+ Files to ADD:
213
+ - `autonomy/lib/crash_redact.py` -- shared scrubber + fingerprint (section 5).
214
+ Imports `proof_redact.redact_tree` / `redact_value`.
215
+ - `autonomy/lib/crash_capture.py` -- builds the raw context dict, calls
216
+ crash_redact, writes `.loki/crash/...`. Pure-ish, no network in Phase 0.
217
+ - `autonomy/crash.sh` -- bash hook helpers: `loki_crash_capture` (sourced by
218
+ run.sh), `loki_crash_friction`. Calls python3 crash_capture.
219
+ - `loki-ts/src/runner/crash.ts` -- TS hook: registers `uncaughtException` /
220
+ `unhandledRejection` in cli.ts and on nonzero exit; shells to crash_capture.py
221
+ via `findPython3` (mirrors proof.ts:19,27).
222
+ - `loki-ts/src/commands/crash.ts` -- `loki crash` command (Bun route).
223
+ - `docs/PRIVACY.md` -- honest disclosure doc (ships in Phase 0).
224
+
225
+ Files to MODIFY:
226
+ - `autonomy/run.sh` -- source `crash.sh` near telemetry source (648-651); add
227
+ ERR/EXIT crash hook in `main()` (12913); call `loki_crash_capture` at the
228
+ iteration-complete error branch (~11968-11989) and `loki_crash_friction` at
229
+ existing retry/rate-limit sites.
230
+ - `autonomy/loki` -- add `crash)` to dispatch (near report at 13472); add
231
+ `cmd_crash`.
232
+ - `loki-ts/src/cli.ts` -- register crash handlers; wrap terminal `process.exit`
233
+ (228); route `crash` command.
234
+ - `autonomy/telemetry.sh` + `dashboard/telemetry.py` + `bin/postinstall.js` --
235
+ NO behavior change yet, but add a code comment pointer to the unified opt-out
236
+ (full unification lands in section 6, can be Phase 0 since it is local).
237
+
238
+ New tests:
239
+ - `tests/crash/test_crash_redact.py` -- golden vectors: every secret class from
240
+ proof_redact PLUS new crash fields; assert WHITELIST-only output and stable
241
+ fingerprint across two synthetic machines (different home paths -> same hash).
242
+ - `tests/crash/test_crash_redact_negative.py` -- ReDoS / huge-stack guard;
243
+ assert no `/Users/`, no env values, no emails/IPs survive.
244
+ - `loki-ts/tests/commands/crash.test.ts` -- `loki crash` lists/show/submit.
245
+ - Add a bun-parity entry so `loki crash --help` and `loki crash show` match
246
+ byte-for-byte across routes (local-ci.sh:250 matrix).
247
+
248
+ CHANGELOG honest "NOT tested" disclosure (Phase 0):
249
+ - Tested: client-side scrub on golden vectors; local artifact write;
250
+ `loki crash` list/show/submit; fingerprint stability.
251
+ - NOT tested: network egress (none exists in Phase 0); backend dedup;
252
+ cross-machine real-world fingerprint collisions beyond synthetic fixtures;
253
+ auto-fix loop.
254
+
255
+ ### Phase 1 -- BACKEND ingest with SECOND scrub (egress behind unified opt-out)
256
+ Goal: add the ingestion backend and turn on disclosed, default-on egress (opt-out per section 6).
257
+ - Stand up FastAPI backend (section 7) reusing the dashboard Python stack so it
258
+ can `import crash_redact` for the mandated second scrub.
259
+ - Client gains a background flush of `.loki/crash/outbox/*.json` to `POST
260
+ /v1/crash`, gated by the unified opt-out (section 6) and rate-limited.
261
+ - Still NO server-side issue creation in this phase: the backend only stores
261
+ reports, or at most creates issues in the PRIVATE triage repo.
263
+
264
+ Files to ADD: `dashboard/crash_ingest.py` (or `web-app/` route -- see 7 for the
265
+ host decision), `dashboard/crash_store.py` (dedup store), backend tests.
266
+ Files to MODIFY: `autonomy/crash.sh`, `loki-ts/src/runner/crash.ts` (add flush),
267
+ `loki-ts/src/commands/crash.ts` + `cmd_crash` (add `submit` real POST).
268
+ CHANGELOG NOT tested: production GitHub token custody under load; abuse/spam at
269
+ scale; promotion path (not yet built).
270
+
271
+ ### Phase 2 -- DEDUP + fingerprint store + PRIVATE issue creation + counter
272
+ - Backend creates one private issue per novel fingerprint; bumps an occurrence
273
+ counter comment on repeats. GitHub token held server-side only.
274
+ Files: extend `dashboard/crash_store.py`, add `dashboard/crash_github.py`.
275
+ CHANGELOG NOT tested: promotion to public repo; auto-fix.
276
+
277
+ ### Phase 3 -- PROMOTION path (private -> public, sanitized) + AUTO-FIX loop
278
+ - Confirmed bugs promoted to public `asklokesh/loki-mode` with sanitized
279
+ title/body. Auto-fix agent runs `loki start asklokesh/loki-telemetry#<n>`,
280
+ fixes, runs local-ci, opens a PR to PUBLIC repo. NOT auto-merged.
281
+ Files: `dashboard/crash_promote.py`, `scripts/crash-autofix.sh` (or a routine).
282
+ CHANGELOG NOT tested: human-merge gate behavior in the wild; regression rate of
283
+ auto-fixes (council + local-ci gate is the guard, see section 8).
284
+
285
+ --------------------------------------------------------------------------------
286
+ ## 4. (folded into section 3 per-phase: files + tests + CHANGELOG disclosure)
287
+ --------------------------------------------------------------------------------
288
+
289
+ --------------------------------------------------------------------------------
290
+ ## 5. The scrubber spec (allow/deny, regex, fingerprint)
291
+ --------------------------------------------------------------------------------
292
+
293
+ ### 5a. Deliberate deviation from spec wording (state plainly)
294
+ The spec asks for "the scrubber as a testable pure function in BOTH routes,"
295
+ "designed identically for bash and TS." The chosen design is ONE shared Python
296
+ module, `autonomy/lib/crash_redact.py`, that BOTH routes call -- bash via
297
+ python3, Bun via `findPython3` (exactly the `proof_redact.py` /
298
+ `proof-generator.py` precedent, verified at proof.ts:27).
299
+ - Why this is better: it makes drift between routes impossible, and the SAME
300
+ module is importable by the FastAPI backend for the mandated SECOND scrub.
301
+ One module, three call sites: bash client, Bun client, backend.
302
+ - Spirit of the requirement is met: `redact_value` / `redact_tree` ARE pure
303
+ functions, covered by shared golden-vector tests.
304
+ - Contingency if the council demands strict per-route code: port the rules to
305
+ `loki-ts/src/util/crash_redact.ts` and test it against the IDENTICAL fixture
306
+ set as the Python module (same golden vectors), so parity is proven by tests.
307
+
308
+ ### 5b. FAIL CLOSED
309
+ If python3 is unavailable, the no-leak guarantee cannot be enforced, so NO
310
+ egress happens. The client may write a local note that capture was skipped, but
311
+ must never POST. This mirrors proof.ts:216-227 ("never publish an unredacted
312
+ artifact").
313
+
314
+ ### 5c. WHITELIST-only emit (deny-by-default)
315
+ The payload that leaves the machine contains ONLY these fields. Anything not on
316
+ this list is dropped, not redacted:
317
+ - os (uname -s), arch (uname -m)
318
+ - loki_version (from VERSION)
319
+ - runtime: node_version and/or bun_version
320
+ - error_class (e.g. TypeError, ENOENT, NonZeroExit)
321
+ - stack_signature: list of top N (default 5) normalized frame signatures
322
+ (function/symbol names only; file paths, line numbers, columns stripped)
323
+ - rarv_phase (REASON/ACT/REVIEW/VERIFY/CLOSE/iteration)
324
+ - exit_code
325
+ - friction_kind (retry_loop | rate_limit_loop | gate_failure) when applicable
326
+ - project_id_hash (section 5e)
327
+ - fingerprint (section 5f)
328
+ - rules_version (from crash_redact) and redactions_count
329
+ - captured_at (UTC, second precision)
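The deny-by-default emit above can be sketched as a plain key filter. This is a minimal illustration, not the planned `crash_redact.py`: the function name `whitelist_only` is hypothetical, and flattening `runtime` into `node_version` / `bun_version` keys is an assumption about the payload shape.

```python
from typing import Any, Dict

# Field names copied from the whitelist in 5c. Treating "runtime" as two flat
# keys (node_version, bun_version) is an assumption made for this sketch.
ALLOWED_FIELDS = frozenset({
    "os", "arch", "loki_version", "node_version", "bun_version",
    "error_class", "stack_signature", "rarv_phase", "exit_code",
    "friction_kind", "project_id_hash", "fingerprint",
    "rules_version", "redactions_count", "captured_at",
})


def whitelist_only(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new dict with only whitelisted keys; everything else is dropped."""
    return {key: value for key, value in raw.items() if key in ALLOWED_FIELDS}
```

Because unknown keys are dropped rather than redacted, a field such as `prompt` or `cwd` never reaches the payload even if no deny rule knows about it.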
330
+
331
+ ### 5d. Deny rules (reuse proof_redact, plus crash additions)
332
+ crash_redact imports and applies `proof_redact.redact_tree` first (all rules in
333
+ 1b: keys, Bearer, PEM, env-assign, URI creds, /Users//home/ -> ~, Windows
334
+ home). Then crash-specific additions BEFORE whitelisting:
335
+ - emails: `[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}` -> [REDACTED:EMAIL]
336
+ - IPv4: `\b(?:\d{1,3}\.){3}\d{1,3}\b` -> [REDACTED:IP]
337
+ - IPv6: standard colon-hex form -> [REDACTED:IP]
338
+ - repo names: any `owner/repo` derived from the local git remote and any value
339
+ matching the configured public/private repo names -> [REDACTED:REPO]
340
+ - prompt/PRD/code/file-content fields: never whitelisted, so dropped by 5c.
341
+ Because emit is whitelist-only, free-text fields (briefs, prompts, diffs) can
342
+ never reach the payload even if a deny rule missed them.
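A minimal sketch of the email and IPv4 deny rules listed above, using the two patterns exactly as written in 5d. The function name `redact_crash_text` and the decision to return a redaction count alongside the text are assumptions for illustration; IPv6 and repo-name rules are omitted because the plan does not give their patterns.

```python
import re
from typing import Tuple

# Patterns copied from the 5d list above.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def redact_crash_text(text: str) -> Tuple[str, int]:
    """Replace emails and IPv4 addresses; return (new_text, redaction_count)."""
    text, email_hits = EMAIL_RE.subn("[REDACTED:EMAIL]", text)
    text, ip_hits = IPV4_RE.subn("[REDACTED:IP]", text)
    return text, email_hits + ip_hits
```

Note that the IPv4 pattern also matches any four-part dotted number (for example a four-component version string), which errs on the side of over-redaction.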
343
+
344
+ ### 5e. Hashed non-reversible project id
345
+ - Do NOT hash the local filesystem path (it contains `/Users/<name>/`, which is
345
+ guessable from a hash and is stripped by the scrubber anyway).
347
+ - Hash the git remote origin URL (normalized: strip scheme, `.git` suffix,
348
+ trailing slash, lowercase host).
349
+ - Use SHA-256, UNSALTED. Tradeoff stated explicitly: unsalted gives cross-user
350
+ dedup (two users hitting the same bug in the same public repo collapse to one
351
+ triage issue, which is the whole point of the occurrence counter); a per-user
352
+ salt would kill that dedup. Unsalted is dictionary-attackable for known public
353
+ repos, but the project id reveals only "which public repo," which is already
354
+ public, so the privacy cost is acceptable. Private-repo origins still hash to
355
+ an opaque value with no path/name leakage.
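The normalization and hashing described above might look like the following sketch. The names `normalize_remote` and `project_id_hash` are hypothetical, and two steps go beyond the list in 5e and are assumptions: stripping `user@` userinfo, and treating the scp-like `host:owner/repo` form as equivalent to `host/owner/repo`. Explicit ports are not handled.

```python
import hashlib
import re


def normalize_remote(url: str) -> str:
    """Normalize a git remote URL: no scheme, no .git suffix, lowercase host."""
    u = url.strip()
    u = re.sub(r"^[A-Za-z][A-Za-z0-9+.-]*://", "", u)  # strip scheme
    u = re.sub(r"^[^@/]+@", "", u)  # strip userinfo (assumption, not in 5e)
    u = u.rstrip("/")
    if u.endswith(".git"):
        u = u[:-4]
    u = u.rstrip("/")
    # Split host from path on the first "/" or ":" (scp-like form, assumption).
    match = re.match(r"^([^/:]+)[:/](.*)$", u)
    if not match:
        return u.lower()
    return match.group(1).lower() + "/" + match.group(2)


def project_id_hash(url: str) -> str:
    """Unsalted SHA-256 of the normalized remote, as hex."""
    return hashlib.sha256(normalize_remote(url).encode("utf-8")).hexdigest()
```

With this shape, an HTTPS clone and an SSH clone of the same repository collapse to the same hash, which is what the cross-user dedup argument relies on.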
356
+
357
+ ### 5f. Fingerprint (dedup key)
358
+ - Computed in crash_redact AFTER scrub, on the REDACTED data, so client and
359
+ backend derive the identical value.
360
+ - `fingerprint = sha256(error_class + "\n" + "\n".join(top_N_stack_signatures))`
361
+ where each stack signature is the symbol/function name only, with file paths,
362
+ line numbers, columns, and addresses stripped (or it would differ per machine).
363
+ - N defaults to 5; configurable constant in the module. Same hash function and N
364
+ on client and server -> stable cross-machine dedup.
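As an illustration of the formula above, the sketch below computes the fingerprint from an error class and raw frames. The frame normalizer is hypothetical: it assumes V8-style `at name (file:line:col)` frames and keeps only the symbol name, while the hash construction itself follows the expression given in 5f.

```python
import hashlib
import re
from typing import List

TOP_N = 5  # default N from 5f

# Hypothetical normalizer for V8-style frames: "at [async] [new] name (...)".
_FRAME_RE = re.compile(r"^\s*at\s+(?:async\s+)?(?:new\s+)?([^\s(]+)")


def frame_signature(frame: str) -> str:
    """Return the symbol name only; never a path, line, or column."""
    match = _FRAME_RE.match(frame)
    if not match:
        return "<anonymous>"
    name = match.group(1)
    # A frame without a symbol has its location where the name would be.
    if "/" in name or "\\" in name or ":" in name:
        return "<anonymous>"
    return name


def fingerprint(error_class: str, frames: List[str], n: int = TOP_N) -> str:
    """sha256(error_class + "\\n" + "\\n".join(top N signatures)) as hex."""
    signatures = [frame_signature(frame) for frame in frames[:n]]
    material = error_class + "\n" + "\n".join(signatures)
    return hashlib.sha256(material.encode("utf-8")).hexdigest()
```

Two stacks that differ only in home directory, line, and column produce the same value, which is the property the synthetic two-machine test in Phase 0 is meant to assert.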
365
+
366
+ --------------------------------------------------------------------------------
367
+ ## 6. First-run disclosure UX + unified opt-out
368
+ --------------------------------------------------------------------------------
369
+
370
+ ### 6a. Copy (no emojis, no em dashes -- both are banned)
371
+ Shown once, on the first run, before any egress:
372
+
373
+ Loki Mode auto-creates the issues you hit at github.com/asklokesh/loki-mode
374
+ and tries to auto-resolve them. If it cannot, we encourage you to open an
375
+ issue for anything causing hesitation.
376
+ We send anonymous diagnostics only (os, arch, version, error type, sanitized
377
+ stack signatures). Never your code, prompts, paths, keys, or repo names.
378
+ See docs/PRIVACY.md. Turn this off anytime with: loki telemetry off
379
+
380
+ ### 6b. Where the one-time flag lives (separate from opt-out)
381
+ - Disclosure sentinel: a `DISCLOSURE_SHOWN=true` key in `~/.loki/config` (the
382
+ same global config already read at `autonomy/run.sh:643`). Shown once
383
+ regardless of enable/disable state. Never re-shown after opt-out.
384
+ - Reuse `~/.loki/config`; do NOT invent a new sentinel file.
385
+
386
+ ### 6c. Unified opt-out (gates BOTH PostHog usage telemetry AND crash reporting)
387
+ Map all switches to the SAME persisted key, and keep honoring the community
388
+ standard:
389
+ - `LOKI_TELEMETRY=off` (new, spec-mandated) -> treated as disabled.
390
+ - `loki telemetry off` (new subcommand on existing `cmd_telemetry`,
391
+ autonomy/loki:17946) -> writes `TELEMETRY_DISABLED=true` to `~/.loki/config`.
392
+ - `loki telemetry on` -> removes the key or sets it to false; `loki telemetry status` shows
393
+ BOTH OTEL state (existing) and collection state (new) + pending crash count.
394
+ - Existing `LOKI_TELEMETRY_DISABLED=true` and `DO_NOT_TRACK=1` -> still honored.
395
+ - The unified check (a single helper, e.g. `loki_collection_enabled` in
396
+ crash.sh and a TS mirror) must be consulted by: `autonomy/telemetry.sh`
397
+ (`_loki_telemetry_enabled`), `dashboard/telemetry.py` (`_is_enabled`),
398
+ `bin/postinstall.js`, AND the new crash flush. Otherwise the disclosure is a
399
+ lie about PostHog events that keep firing.
400
+ - Never re-prompt once disabled (the sentinel is independent and never re-shown).
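The unified check above can be sketched as one pure function over the environment and the parsed `~/.loki/config`. The name `collection_enabled`, the dict-shaped config argument, and the case-insensitive comparison are assumptions; the switch names and values come from the list in 6c.

```python
from typing import Mapping


def collection_enabled(env: Mapping[str, str], config: Mapping[str, str]) -> bool:
    """Return False if any opt-out switch from 6c is set, else True."""
    if env.get("LOKI_TELEMETRY", "").lower() == "off":
        return False
    if env.get("LOKI_TELEMETRY_DISABLED", "").lower() == "true":
        return False
    if env.get("DO_NOT_TRACK", "") == "1":
        return False
    # Persisted by `loki telemetry off`; config parsing is out of scope here.
    if config.get("TELEMETRY_DISABLED", "").lower() == "true":
        return False
    return True
```

A caller would pass `os.environ` and the parsed config, and every sender (PostHog usage events and the crash flush) would consult the same function before any network call.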
401
+
402
+ --------------------------------------------------------------------------------
403
+ ## 7. Backend design
404
+ --------------------------------------------------------------------------------
405
+
406
+ ### 7a. Host recommendation: reuse the FastAPI Python stack (dashboard/)
407
+ Recommendation: a small FastAPI service, deployed separately from the local
408
+ dashboard (dashboard/server.py runs on port 57374 locally; the ingest service
409
+ is a hosted deployment, e.g. on the existing `web-app/` / `deploy/` infra).
410
+ Justification (strongest single argument): the scrubber is Python, so the
411
+ backend can `import crash_redact` and run the EXACT SAME second scrub the client
412
+ ran -- no reimplementation, no drift, identical RULES_VERSION. A serverless
413
+ function in another language would force a second, divergent scrubber, which is
414
+ the primary data-leak risk. Reusing Python keeps one source of truth.
415
+ - Alternative considered: serverless (e.g. a single function). Rejected as the
416
+ default precisely because it tends toward a re-implemented scrubber; acceptable
417
+ ONLY if it runs the same Python module.
418
+
419
+ ### 7b. Endpoints
420
+ - `POST /v1/crash` -- accept one scrubbed report. Returns 202 always (privacy:
421
+ no confirmation that reveals dedup state to the client).
422
+ - `GET /healthz` -- liveness.
423
+ - (internal/admin, authn-gated) `GET /v1/crash/stats`, `POST /v1/crash/promote`.
424
+
425
+ ### 7c. Auth model (clients carry NO write secret)
426
+ - Clients POST UNAUTHENTICATED but heavily constrained:
427
+ - Strict rate limiting per source IP and per project_id_hash.
428
+ - Body size cap; whitelist-schema validation; reject unknown fields.
429
+ - Server-side second scrub regardless of client claims.
430
+ - Optionally a PUBLIC anon ingest key (like the PostHog public key already in
431
+ the repo) for coarse routing/quota -- it is not a secret and grants no write
432
+ access to GitHub. The GitHub write token is NEVER on clients.
433
+ - Rationale: any secret shipped in the client is exfiltratable (the repo already
434
+ treats `phc_...` as public for this reason). So the trust boundary is: clients
435
+ can only enqueue scrubbed diagnostics; only the server can write to GitHub.
436
+
437
+ ### 7d. Dedup store
438
+ - Start with SQLite (file-backed) keyed by `fingerprint`: columns fingerprint,
439
+ first_seen, last_seen, occurrence_count, private_issue_number, status
440
+ (new|confirmed|promoted|fixed). Trivially swappable for a hosted KV later.
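A minimal sketch of the SQLite dedup store described above, using only the standard library. The table name `crash_fingerprints` and the helper names are hypothetical; the columns and status values are the ones listed in 7d.

```python
import sqlite3
from typing import Tuple

SCHEMA = """
CREATE TABLE IF NOT EXISTS crash_fingerprints (
    fingerprint TEXT PRIMARY KEY,
    first_seen TEXT NOT NULL,
    last_seen TEXT NOT NULL,
    occurrence_count INTEGER NOT NULL DEFAULT 1,
    private_issue_number INTEGER,
    status TEXT NOT NULL DEFAULT 'new'
        CHECK (status IN ('new', 'confirmed', 'promoted', 'fixed'))
)
"""


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the store and ensure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def record_occurrence(conn: sqlite3.Connection, fingerprint: str,
                      seen_at: str) -> Tuple[bool, int]:
    """Insert a novel fingerprint or bump its counter; return (is_novel, count)."""
    with conn:  # commits on success, rolls back on exception
        row = conn.execute(
            "SELECT occurrence_count FROM crash_fingerprints WHERE fingerprint = ?",
            (fingerprint,),
        ).fetchone()
        if row is None:
            conn.execute(
                "INSERT INTO crash_fingerprints (fingerprint, first_seen, last_seen) "
                "VALUES (?, ?, ?)",
                (fingerprint, seen_at, seen_at),
            )
            return True, 1
        count = row[0] + 1
        conn.execute(
            "UPDATE crash_fingerprints SET last_seen = ?, occurrence_count = ? "
            "WHERE fingerprint = ?",
            (seen_at, count, fingerprint),
        )
        return False, count
```

The `is_novel` flag is what would decide between creating a private triage issue and bumping the occurrence counter.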
441
+
442
+ ### 7e. GitHub token custody
443
+ - A GitHub App installation token or a fine-grained PAT, scoped to the PRIVATE
444
+ triage repo (issues:write) and the PUBLIC repo (pull-requests:write for the
445
+ auto-fix PR). Stored in the backend secret store / env, never returned to
446
+ clients, never logged. Token rotation documented in PRIVACY.md ops notes.
447
+
448
+ ### 7f. Second server-side scrub (mandatory)
449
+ - On ingest, before any storage or issue creation: `crash_redact.redact_tree`
450
+ the entire body, then validate against the whitelist schema and DROP unknown
451
+ fields. Record server `redactions_count`; if it is > 0 on a payload the client
452
+ claimed was clean, log a scrubber-miss metric (a client-rule gap to fix).
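The ingest step above could be sketched as follows. The redactor is passed in as a callable so the sketch stays self-contained; it is assumed to have the `(new_obj, total_redactions_count)` return shape described for `redact_tree` in 1b. The function name `second_scrub` and the boolean scrubber-miss flag are hypothetical.

```python
from typing import Any, Callable, Dict, FrozenSet, Tuple

# Assumed shape of redact_tree per 1b: returns (new_obj, redactions_count).
RedactTree = Callable[[Any], Tuple[Any, int]]


def second_scrub(body: Dict[str, Any], redact_tree: RedactTree,
                 allowed_fields: FrozenSet[str]) -> Tuple[Dict[str, Any], bool]:
    """Scrub the whole body, drop unknown fields, and flag a scrubber miss."""
    scrubbed, server_redactions = redact_tree(body)
    if not isinstance(scrubbed, dict):
        raise ValueError("crash report body must be a JSON object")
    clean = {key: value for key, value in scrubbed.items() if key in allowed_fields}
    # Clients send already-scrubbed payloads, so any server-side hit is a miss.
    scrubber_miss = server_redactions > 0
    return clean, scrubber_miss
```

Running the scrub before the whitelist filter means the miss metric also counts secrets found in fields that would have been dropped anyway, which gives the fullest picture of client-rule gaps.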
453
+
454
+ --------------------------------------------------------------------------------
455
+ ## 8. The auto-fix loop
456
+ --------------------------------------------------------------------------------
457
+
458
+ Trigger and flow:
459
+ 1. A novel fingerprint creates a PRIVATE triage issue
460
+ (asklokesh/loki-telemetry). A repeat bumps an occurrence-counter comment.
461
+ 2. Confirmation gate: a bug is PROMOTED only when confirmed (rule-based on
462
+ occurrence threshold + reproducibility, or a maintainer label). Promotion
463
+ creates/links a sanitized PUBLIC issue (title/body scrubbed; no triage-repo
464
+ internals leak to the public repo).
465
+ 3. Auto-fix run: the backend (or a scheduled routine) invokes
466
+ `loki start asklokesh/loki-telemetry#<n>` in issue-mode, reusing the existing
467
+ issue plumbing (gh create/comment/close in run.sh: 2200/2078/2092). The agent
468
+ reproduces, fixes, and runs `bash scripts/local-ci.sh`.
469
+ 4. PR: the resulting PR TARGETS the PUBLIC repo with a SANITIZED description
470
+ (links the public issue, never the raw triage payload). The PR is
471
+ auto-created but NOT auto-merged.
472
+
473
+ Guards that prevent a bad auto-fix from shipping:
474
+ - local-ci gate: PR cannot be opened unless `bash scripts/local-ci.sh` passes
475
+ (the same 42/42 + bun-parity matrix at local-ci.sh:250) on the fix branch.
476
+ - council: the standard 3-reviewer council (2 Opus + 1 Sonnet) unanimous, per
477
+ CLAUDE.md, applies to the auto-fix PR like any other.
478
+ - NO auto-merge: human merge approval still gates (CLAUDE.md). The pipeline ends
479
+ at "PR open + green + council-approved."
480
+ - Idempotency: one open auto-fix PR per fingerprint; the dedup store records
481
+ private_issue_number and PR linkage to avoid duplicate PR storms.
482
+ - Sanitization at the boundary: the promotion step re-runs crash_redact on any
483
+ text copied from private -> public, so the public PR/issue can never carry
484
+ raw triage content.
485
+
486
+ --------------------------------------------------------------------------------
487
+ ## 9. Risks (named + mitigated)
488
+ --------------------------------------------------------------------------------
489
+
490
+ | Risk | Mitigation |
491
+ | --- | --- |
492
+ | Scrubber miss -> data leak | Whitelist-ONLY emit (deny by default, 5c) so unlisted fields never ship even if a regex misses; reuse hardened proof_redact (1b); SECOND server-side scrub via the same module (7f); fail closed if python3 missing (5b); golden-vector + negative tests. |
493
+ | Undisclosed existing PostHog telemetry | Unified opt-out (6c) gates PostHog AND crash; PRIVACY.md + first-run line describe BOTH; honesty invariant covers existing events. |
494
+ | Backend abuse / spam | Unauthenticated-but-rate-limited (per IP + project_id_hash), body-size cap, whitelist-schema validation, 202-always (no oracle), occurrence counter collapses floods into one issue. |
495
+ | Auto-fix ships a regression | local-ci gate (42/42 + bun-parity) BEFORE PR; unanimous council; NO auto-merge; one PR per fingerprint; human merge gate per CLAUDE.md. |
496
+ | GitHub token exfiltration | Token only on backend (7c/7e); clients carry no write secret; only a public anon ingest key at most. |
497
+ | GDPR / CCPA compliance | Anonymous-by-design (no PII in whitelist; emails/IPs denied); disclosed default-on with friction-free persistent opt-out (LOKI_TELEMETRY=off, loki telemetry off, DO_NOT_TRACK=1); project_id_hash is non-reversible; PRIVACY.md documents data categories, retention, opt-out, and deletion-by-fingerprint on request. Default-on is defensible only WITH the disclosure; covert would not be. |
498
+ | Dual-route parity burden | Single shared Python scrubber called by both routes (5a) eliminates redaction drift; bun-parity matrix entries for `loki crash` (local-ci.sh:250); commands ported in both `autonomy/loki` and `loki-ts/src/commands/`. |
499
+ | Over-reporting normal operation | Conservative capture: only uncaught-exception, nonzero-exit, provider-failure, and explicit friction signals (retry loop, rate-limit loop, gate failure), not routine retries; thresholds must be crossed before a friction event fires. |
500
+ | python3 absence breaks capture | Fail closed: local write only if scrub ran, never egress without scrub (5b). |
501
+ | Fingerprint instability across machines | Hash computed post-scrub on path/line-stripped frame signatures (5f); synthetic two-machine test in Phase 0. |
502
+
503
+ --------------------------------------------------------------------------------
504
+ ## 10. Non-goals (explicit)
505
+ --------------------------------------------------------------------------------
506
+
507
+ - NOT auto-merging auto-fix PRs. Human merge approval always gates.
508
+ - NOT collecting any PII, code, prompts, PRDs, file contents, paths, or repo
509
+ names. Whitelist-only.
510
+ - NOT replacing the existing OTEL `cmd_telemetry` tracing feature; this adds
511
+ subcommands and a separate crash pipeline.
512
+ - NOT replacing the existing PostHog usage telemetry; this UNIFIES its opt-out
513
+ and discloses it, but does not rip it out.
514
+ - NOT building a real-time crash dashboard UI in this plan (the local dashboard
515
+ may surface a count later; out of scope here).
516
+ - NOT a public bug bounty or external contributor intake flow.
517
+ - NOT cross-product telemetry beyond loki-mode.
518
+ - NOT shipping egress in Phase 0 (local-only proof first).
519
+
520
+ --------------------------------------------------------------------------------
521
+ ## Critical Files for Implementation
522
+ --------------------------------------------------------------------------------
523
+ - /Users/lokesh/git/loki-mode/autonomy/lib/proof_redact.py (reuse / import; the keystone redactor)
524
+ - /Users/lokesh/git/loki-mode/autonomy/run.sh (bash capture hooks + telemetry sourcing)
525
+ - /Users/lokesh/git/loki-mode/loki-ts/src/cli.ts (TS uncaughtException/unhandledRejection/exit hook + command routing)
526
+ - /Users/lokesh/git/loki-mode/autonomy/loki (cmd_crash + telemetry off/on subcommands + dispatch)
527
+ - /Users/lokesh/git/loki-mode/dashboard/server.py (FastAPI host to extend for the ingest backend + second scrub)
@@ -2,7 +2,7 @@
2
2
 
3
3
  The flagship product of [Autonomi](https://www.autonomi.dev/). Complete installation instructions for all platforms and use cases.
4
4
 
5
- **Version:** v7.18.1
5
+ **Version:** v7.18.3
6
6
 
7
7
  ---
8
8