loki-mode 7.19.0 → 7.19.2

@@ -0,0 +1,462 @@
1
+ # Verified Completion Plan (v7.19.1, MINOR)
2
+
3
+ Status: DESIGN ONLY. No implementation, no version bump, no commit.
4
+ Author: Architect (Loki Mode "verified completion" release)
5
+ Target version: 7.19.1 (current VERSION is 7.19.0)
6
+
7
+ ---
8
+
9
+ ## 1. Goal and threat model
10
+
11
+ Loki must PROVE a run is actually done before the completion council lets the run
12
+ STOP. It must refuse to accept a fabricated "done." Concretely, block the
13
+ completion-approval path unless there is REAL on-disk evidence that:
14
+
15
+ - (a) **files actually changed** -- a nonzero git diff between the run-start SHA
16
+ and HEAD, AND
17
+ - (b) **tests actually passed** -- a green test-results signal (where a test
18
+ suite exists).
19
+
20
+ This attacks the #2 documented user trust-killer: agents claiming "done" when
21
+ nothing shipped.
22
+
23
+ Default-on, opt-out via `LOKI_EVIDENCE_GATE=0`, because a false block would stop
24
+ a legitimate completion (high cost). The design is built around the principle:
25
+ **block only on positive evidence of fabrication; treat inconclusive as
26
+ pass-through.**
27
+
28
+ ### Honest limit (state plainly)
29
+
30
+ This gate proves "something changed and the test suite is green." It does NOT
31
+ prove PRD-semantic correctness -- it cannot tell whether the right thing was
32
+ built, only that *a* thing was built and tests pass. Semantic judgment stays
33
+ with the council votes and the Devil's Advocate. The evidence gate is a cheap,
34
+ deterministic floor under the expensive, fallible LLM votes -- not a replacement
35
+ for them.
36
+
37
+ ---
38
+
39
+ ## 2. Verified terrain (re-grepped; line numbers drift)
40
+
41
+ Confirmed by reading source, not assumed:
42
+
43
+ - **Bash is the live council.** `autonomy/run.sh` sources
44
+ `autonomy/completion-council.sh` (run.sh:617-619) and calls
45
+ `council_should_stop` (run.sh:12382). `loki-ts/src/runner/council.ts` is a
46
+ port slice ("second slice of completion-council.sh port", council.ts:1-9) and
47
+ is NOT wired into the runtime loop. **No Bun/TS change is needed for this
48
+ release** (see Section 7).
49
+
50
+ - **`council_evaluate`** (completion-council.sh ~1511-1591): Phase 1
51
+ `council_reverify_checklist` (~1519); Phase 2 `council_checklist_gate`
52
+ (~1522, HARD gate, `return 1` blocks STOP); Phase 3 aggregate votes; Phase 4
53
+ unanimous + Devil's Advocate. Returns 0 = COMPLETE/STOP, 1 = CONTINUE.
54
+
55
+ - **`council_checklist_gate`** (~804-894): reads
56
+ `.loki/checklist/verification-results.json` and `.loki/checklist/waivers.json`;
57
+ `return 0` = pass (no file => no gate, backwards compatible), `return 1` =
58
+ block; on block writes `$COUNCIL_STATE_DIR/gate-block.json` (atomic
59
+ temp+mv); on pass removes a stale `gate-block.json`. **This is the pattern we
60
+ clone.** `COUNCIL_STATE_DIR` = `<loki_dir>/council` (set at
61
+ completion-council.sh:119).
62
+
63
+ - **`council_should_stop`** (~1811-1914): the only place that writes the
64
+ `COMPLETED` marker on real approval is the `if council_evaluate` branch
65
+ (completion-council.sh:1863). The two force-stop safety valves -- stagnation
66
+ (1899-1903) and done-signal (1907-1911, "agent keeps saying done") -- `return 0`
67
+ but do **NOT** write `COMPLETED`. They are *give-up / resource-protection*
68
+ exits, not *approved-done* claims. (Verified: the only `COMPLETED` writes are
69
+ completion-council.sh:1863 and run.sh:12773; neither valve writes it.)
70
+
71
+ - **Second approval path (force-review).** `run.sh:12762-12784` handles a
72
+ dashboard-triggered `COUNCIL_REVIEW_REQUESTED` signal. It already calls
73
+ `council_checklist_gate` before approving (run.sh:12766) and writes `COMPLETED`
74
+ directly (run.sh:12773), bypassing `council_evaluate`. **This is a second
75
+ insertion point** the gate must cover for parity (see Section 4-design.4, insertion point B).
76
+
77
+ - **Per-iteration SHA exists; run-start SHA does NOT.** run.sh:11560 captures
78
+ `_LOKI_ITER_START_SHA=$(git rev-parse HEAD)` per attempt; there is no run-wide
79
+ baseline. We must capture one (Section 4-design.1).
80
+
81
+ - **`_git_diffstat` is NOT a bash helper.** It is a Python function inside
82
+ `autonomy/lib/proof-generator.py:199`, reading `_LOKI_ITER_START_SHA` and
83
+ diffing `base..HEAD` (committed only). It is not callable from bash. (See
84
+ Section 3 deviation note for the mechanism we actually use.)
85
+
86
+ - **The authoritative green-test signal is `.loki/quality/test-results.json`,
87
+ NOT `verification-results.json`.** Written by `enforce_test_coverage`
88
+ (run.sh:6220-6396), shape:
89
+ `{"timestamp","runner","pass":true|false,"min_coverage","summary"}`, with the
90
+ special **no-suite** case `{"runner":"none","pass":true,"summary":"No test
91
+ runner detected"}` (run.sh:6373-6379). `enforce_test_coverage` runs earlier in
92
+ the same iteration (run.sh:12231, gated by `PHASE_UNIT_TESTS`, default true),
93
+ before the council check (run.sh:12382), so this file is reasonably fresh.
94
+ (See Section 3 deviation note for why we do NOT use verification-results.json
95
+ for the test signal.)
96
+
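The file shape described above reduces to four outcomes for the test dimension. The following is a minimal sketch, not the shipped code: the helper name `classify_test_results` and its output labels (`green`, `red`, `none`, `inconclusive`) are assumptions made for illustration. The path is handed to `python3` through the `_TR_FILE` environment variable rather than interpolated into the script.

```shell
# Hypothetical helper (name and labels are illustrative, not from run.sh).
# Prints one of: green | red | none | inconclusive
# $1 = path to a test-results.json file
classify_test_results() {
    _TR_FILE="$1" python3 -c '
import json, os
try:
    with open(os.environ["_TR_FILE"]) as fh:
        data = json.load(fh)
    runner = data.get("runner")
    passed = data.get("pass")
except Exception:
    # Missing, unreadable, or unparseable file: cannot judge the test dimension.
    print("inconclusive")
    raise SystemExit(0)
if runner == "none":
    print("none")            # explicit no-suite record
elif isinstance(runner, str) and passed is True:
    print("green")           # a runner ran and passed
elif isinstance(runner, str) and passed is False:
    print("red")             # a runner ran and failed
else:
    print("inconclusive")    # unexpected shape: do not guess
'
}
```

Under these assumptions only `red` would count as a failure of the test dimension; `none` and `inconclusive` both pass through.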
97
+ ---
98
+
99
+ ## 3. Deviations from the task's stated terrain (flagged, not papered over)
100
+
101
+ The task instructed reuse of two things that, on reading source, do not work as
102
+ described. Stating both:
103
+
104
+ ### Deviation A -- test-green source
105
+
106
+ Task said: read tests-green from `.loki/checklist/verification-results.json`,
107
+ "reuse the existing parse." **Insufficient.** That file (written by
108
+ checklist-verify.py:336-358) stores per item only `id`, `title`, `priority`,
109
+ `status` (`verified|failing|pending`). It does NOT store the check `type`, so it
110
+ cannot distinguish a real test (`tests_pass`/`command` check) from a
111
+ `file_exists` check. Using it would conflate "a file exists" with "tests pass."
112
+
113
+ **Decision:** use `.loki/quality/test-results.json` (`runner`/`pass`) as the
114
+ authoritative green-test signal. It records the actual runner that ran and a
115
+ boolean pass, and explicitly encodes the no-suite case (`runner:"none"`). This is
116
+ still "reuse existing on-disk evidence" -- just the correct file.
117
+
118
+ ### Deviation B -- git diff helper
119
+
120
+ Task said: reuse `_git_diffstat`. **Not callable from bash** (it is a Python
121
+ function in proof-generator.py keyed on `_LOKI_ITER_START_SHA`, the per-iteration
122
+ baseline, and counts only committed changes).
123
+
124
+ **Decision:** the gate computes the diff inline with
125
+ `git diff --name-only <start-sha> HEAD`, matching the existing council convergence
126
+ convention already in completion-council.sh (`git diff --stat HEAD` at ~165,
127
+ `git diff --name-only HEAD` at ~208). We diff against the **run-start** SHA, not
128
+ the per-iteration SHA, because Loki normally auto-commits per iteration -- by the
129
+ time the council runs, the working tree is usually clean and `git diff HEAD` is
130
+ empty even on a productive run. The run-start baseline is the only SHA that
131
+ answers "did this run ship anything." Committed changes (`<start-sha>..HEAD`) are
132
+ the primary evidence post-auto-commit; because auto-commit is not guaranteed, the
133
+ gate also counts uncommitted and untracked changes (Section 4-design.2).
134
+
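The difference between the two baselines is easy to reproduce in a throwaway repository. This is a minimal sketch under stated assumptions: `committed_paths_since` is a hypothetical helper name, not something defined in run.sh or completion-council.sh.

```shell
# Hypothetical helper: list paths changed between a baseline SHA and HEAD.
# $1 = repository directory, $2 = baseline (run-start) SHA
committed_paths_since() {
    git -C "$1" diff --name-only "$2" HEAD
}
```

After a commit lands, `git -C <repo> diff --name-only HEAD` prints nothing because the working tree matches HEAD, while `committed_paths_since <repo> <start-sha>` still lists the files added since the baseline.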
135
+ ---
136
+
137
+ ## 4. Design
138
+
139
+ ### 4-design.1 -- Capture the run-start SHA (fresh-run aware)
140
+
141
+ Add, in `run_autonomous()` (run.sh ~11412, AFTER `load_state`, BEFORE the
142
+ `while [ $retry -lt $MAX_RETRIES ]` loop at run.sh:11495), a capture that
143
+ persists to `.loki/state/start-sha`.
144
+
145
+ **Critical lifecycle rule (do NOT "set if absent" alone).** A naive "set only if
146
+ the file is missing" makes the gate toothless on any repo Loki has run before:
147
+ the stale baseline from the *first* run persists, so on every later run
148
+ `base..HEAD` shows the entire prior history => nonzero diff => gate passes
149
+ trivially even if the new run shipped nothing. The baseline must be (re)captured
150
+ on a **fresh run** and preserved only on a **genuine resume**.
151
+
152
+ The fresh-vs-resume signal already exists: `load_state` (run.sh:9790-9834)
153
+ restores `ITERATION_COUNT` from `.loki/autonomy-state.json`. After `load_state`:
154
+
155
+ - `ITERATION_COUNT == 0` => fresh run (new invocation, or state was
156
+ reset/corrupted) => **recapture** start-sha (overwrite).
157
+ - `ITERATION_COUNT > 0` => genuine resume of an in-flight run => **keep** the
158
+ existing start-sha (do not move the baseline mid-run).
159
+
160
+ This also naturally handles the "previously COMPLETED, now re-run" case: a
161
+ re-run after completion starts a fresh invocation with `ITERATION_COUNT == 0`
162
+ (the prior `.loki/COMPLETED` is removed on the reset path at run.sh:3186), so the
163
+ baseline is recaptured at HEAD-of-this-run.
164
+
165
+ Capture details:
166
+ - `git rev-parse HEAD` in `${TARGET_DIR:-.}`; on non-git or failure, write an
167
+ empty file, which the gate treats as inconclusive => pass-through.
168
+ - Export `_LOKI_RUN_START_SHA` for the current process so the gate reads it
169
+ without a file round-trip; the file is the durable source of truth across
170
+ resumes.
171
+
172
+ Pseudocode (illustrative, not final code):
173
+
174
+ ```
175
+ local _start_sha_file=".loki/state/start-sha"
176
+ mkdir -p ".loki/state"
177
+ if [ "${ITERATION_COUNT:-0}" -eq 0 ] || [ ! -s "$_start_sha_file" ]; then
178
+ # Fresh run (or no baseline yet): (re)capture HEAD; --verify keeps output empty on an unborn HEAD.
179
+ (cd "${TARGET_DIR:-.}" && git rev-parse --verify --quiet HEAD 2>/dev/null) > "$_start_sha_file" 2>/dev/null || true
180
+ fi
181
+ # else: genuine resume (ITERATION_COUNT > 0) -- keep the existing baseline.
182
+ _LOKI_RUN_START_SHA="$(cat "$_start_sha_file" 2>/dev/null || echo "")"
183
+ export _LOKI_RUN_START_SHA
184
+ ```
185
+
186
+ Edge case: a brand-new repo with zero commits has no HEAD => empty SHA =>
187
+ inconclusive => pass-through (do not block a legit first-commit run on a baseline
188
+ we never had).
189
+
190
+ ### 4-design.2 -- `council_evidence_gate` (cloned from `council_checklist_gate`)
191
+
192
+ New function in completion-council.sh, placed immediately after
193
+ `council_checklist_gate` (after ~894). Contract identical to the checklist gate:
194
+
195
+ - `return 0` => gate passes (OK to complete).
196
+ - `return 1` => gate blocks (treated by callers as CONTINUE / block-stop).
197
+
198
+ Behavior:
199
+
200
+ 1. **Knob first (exact-as-today when off).** If `LOKI_EVIDENCE_GATE` (default 1)
201
+ is `0`, `return 0` immediately -- before any file read or write -- so behavior
202
+ is byte-for-byte today's behavior.
203
+
204
+ ```
205
+ [ "${LOKI_EVIDENCE_GATE:-1}" = "0" ] && return 0
206
+ ```
207
+
208
+ 2. **Evidence check (a) -- nonzero diff vs run-start SHA (committed UNION working tree).**
209
+ - Resolve base: `_LOKI_RUN_START_SHA`, else `cat .loki/state/start-sha`.
210
+ - If no git repo (`git rev-parse --is-inside-work-tree` fails) => **inconclusive
211
+ => pass-through** (cannot prove fabrication).
212
+ - **Do NOT count committed-only.** Loki's per-iteration auto-commit is NOT
213
+ guaranteed -- run.sh itself guards for this ("Also include unstaged changes
214
+ (in case auto-commit didn't run)", run.sh:9392). A dirty working tree full
215
+ of real edits is legitimate work, not fabrication; committed-only would
216
+ false-block it. Count the UNION of four sources, block only when ALL are
217
+ empty:
218
+ - committed since baseline: `git diff --name-only <base> HEAD`
219
+ (when base is empty/invalid, fall back to `git diff --name-only HEAD`,
220
+ mirroring proof-generator.py's own shallow/first-commit fallback),
221
+ - unstaged (working tree vs HEAD, tracked files): `git diff --name-only HEAD`,
222
+ - staged: `git diff --cached --name-only`,
223
+ - untracked new files: `git ls-files --others --exclude-standard`. A
224
+ greenfield first run creates brand-new files that are not yet committed,
225
+ staged, or visible to `git diff HEAD`; without this fourth source the
226
+ union would be empty and the gate would false-block legitimate new work.
227
+ `--exclude-standard` respects .gitignore so build artifacts and
228
+ node_modules do not count as evidence.
229
+ - The union EXCLUDES any path under `.loki/` (Loki's own runtime state). The
230
+ gate's own inputs live there (`.loki/quality/test-results.json` is always
231
+ present at gate time) and several `.loki/*` files are not gitignored, so
232
+ counting them would make the gate toothless: the union would never be empty.
233
+ Loki's runtime state is not project work / completion evidence.
234
+ - If the union of changed paths is empty => **DIFF EVIDENCE FAILS** (nothing
235
+ shipped anywhere). This strictly reduces false-blocks and cannot let
236
+ fabrication through: if nothing was built, all four sources are empty.
237
+
238
+ 3. **Evidence check (b) -- tests green.**
239
+ - Read `.loki/quality/test-results.json`. Missing or unparseable =>
240
+ **inconclusive => pass-through** for the test dimension (mirrors checklist
241
+ gate's "no file = no gate").
242
+ - `runner == "none"` => **no suite exists => pass-through** for the test
243
+ dimension (the no-suite case is explicitly legitimate).
244
+ - `runner != "none"` AND `pass == false` => **TEST EVIDENCE FAILS**
245
+ (a runner ran and was red).
246
+ - `runner != "none"` AND `pass == true` => test evidence GREEN.
247
+
248
+ 4. **Block decision (truth table, Section 5).** The gate `return 1` (blocks) iff:
249
+ - DIFF EVIDENCE FAILS (empty diff vs run-start, where git+base were
250
+ available), OR
251
+ - TEST EVIDENCE FAILS (a runner actually ran and was red).
252
+ Otherwise `return 0`.
253
+
254
+ 5. **On block:** write `$COUNCIL_STATE_DIR/evidence-block.json` (atomic
255
+ temp+mv, mirroring gate-block.json at ~858-885) with the reason(s), then
256
+ `return 1`. Schema (Section 6).
257
+
258
+ 6. **On pass:** if `$COUNCIL_STATE_DIR/evidence-block.json` exists, `rm -f` it
259
+ (mirrors checklist gate cleanup at ~890-892), then `return 0`. Stale block
260
+ reports must not linger and mislead the dashboard.
261
+
262
+ Implementation note: like `council_checklist_gate`, do the JSON parse in an
263
+ inline `python3 -c` with the file paths passed via env (`_TR_FILE`,
264
+ `_START_SHA`) -- never string-interpolated into the script -- matching the
265
+ existing safe-parse convention.
266
+
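The four-source union from step 2 can be sketched as a single pipeline. This is an illustration only: the helper name `evidence_changed_paths` and its argument order are assumptions, and the real gate may structure the calls differently.

```shell
# Hypothetical helper: print the de-duplicated union of changed paths,
# excluding Loki's own runtime state under .loki/.
# $1 = repository directory, $2 = run-start SHA (may be empty)
evidence_changed_paths() {
    {
        # Committed since baseline; skipped when no baseline is known, in which
        # case the `git diff HEAD` source below acts as the fallback.
        if [ -n "$2" ]; then
            git -C "$1" diff --name-only "$2" HEAD 2>/dev/null
        fi
        git -C "$1" diff --name-only HEAD 2>/dev/null            # working tree vs HEAD
        git -C "$1" diff --cached --name-only 2>/dev/null        # staged
        git -C "$1" ls-files --others --exclude-standard 2>/dev/null  # untracked, honoring .gitignore
    } | grep -v '^\.loki/' | sort -u
}
```

An empty result is the only outcome that would count as a diff-evidence failure; any surviving path, committed or not, is treated as work.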
267
+ ### 4-design.3 -- Insertion point A: `council_evaluate`
268
+
269
+ In `council_evaluate`, immediately AFTER the checklist gate block
270
+ (completion-council.sh:1522-1525) and BEFORE the threshold/aggregate computation
271
+ (~1527):
272
+
273
+ ```
274
+ # Phase 2.5 (v7.19.1): evidence hard gate -- block completion unless there is
275
+ # real evidence that files changed AND tests are green.
276
+ if ! council_evidence_gate; then
277
+ log_info "[Council] Completion blocked by evidence hard gate"
278
+ return 1 # CONTINUE - cannot complete without real evidence
279
+ fi
280
+ ```
281
+
282
+ This makes it a hard pre-vote gate, sequenced exactly like the checklist gate:
283
+ checklist gate -> evidence gate -> votes -> DA. Members never vote when the
284
+ evidence gate blocks, so no LLM cost is spent rubber-stamping a fabricated done.
285
+
286
+ ### 4-design.4 -- Insertion point B: force-review path (parity)
287
+
288
+ The dashboard force-review path (run.sh:12762-12784) is a second approval path
289
+ that writes `COMPLETED` (run.sh:12773) and already gates on
290
+ `council_checklist_gate` (run.sh:12766). Add the evidence gate alongside it so
291
+ the two approval paths are symmetric:
292
+
293
+ ```
294
+ if type council_checklist_gate &>/dev/null && ! council_checklist_gate; then
295
+ log_info "Council force-review: blocked by checklist hard gate"
296
+ elif type council_evidence_gate &>/dev/null && ! council_evidence_gate; then
297
+ log_info "Council force-review: blocked by evidence hard gate"
298
+ elif type council_vote &>/dev/null && council_vote; then
299
+ ... existing approval (writes COMPLETED) ...
300
+ ```
301
+
302
+ Without insertion B, a user could click "force review" on the dashboard to
303
+ bypass the evidence gate entirely.
304
+
305
+ ---
306
+
307
+ ## 5. What it blocks / what it must NOT falsely block (truth table)
308
+
309
+ | Scenario | Diff (start..HEAD) | test-results.json | Gate | Rationale |
310
+ |---|---|---|---|---|
311
+ | Legit completion: real changes + green tests | nonzero | runner=X, pass=true | PASS | the happy path |
312
+ | Greenfield first run: only untracked new files (no commit/stage yet) | nonzero (untracked) | any non-red | PASS | new files are real work; counted via `git ls-files --others --exclude-standard` |
313
+ | Fabricated "done", nothing built (not even untracked) | empty | any | **BLOCK** | nothing shipped anywhere |
314
+ | Real changes but tests red | nonzero | runner=X, pass=false | **BLOCK** | a runner ran and failed |
315
+ | Docs-only change, no test suite | nonzero (docs files) | runner=none, pass=true | PASS | nonzero diff; no suite to fail |
316
+ | Project with no test suite, real code | nonzero | runner=none, pass=true | PASS | code shipped; tests not expected |
317
+ | No git repo | inconclusive | any non-red | PASS | cannot prove fabrication |
318
+ | Empty/missing run-start SHA (new repo, zero commits) | inconclusive | any non-red | PASS | never had a baseline |
319
+ | test-results.json missing/unparseable | nonzero | inconclusive | PASS | mirror "no file = no gate" |
320
+ | `LOKI_EVIDENCE_GATE=0` | n/a | n/a | PASS (no read/write) | exactly today's behavior |
321
+
322
+ The only two BLOCK rows are positive fabrication evidence: empty diff, or a
323
+ runner that actually ran and was red. Everything inconclusive passes through.
324
+
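The truth table collapses to two blocking conditions. A minimal sketch of that decision, assuming the diff and test dimensions have already been reduced to labels; the function name `evidence_decision` and the label vocabulary are hypothetical.

```shell
# Hypothetical helper mirroring the truth table.
# $1 = diff status: nonzero | empty | inconclusive
# $2 = test status: green | red | none | inconclusive
# Returns 0 (PASS) or 1 (BLOCK), the same contract as the gate itself.
evidence_decision() {
    # Knob first: when the gate is off, nothing else is consulted.
    [ "${LOKI_EVIDENCE_GATE:-1}" = "0" ] && return 0
    # Positive evidence of fabrication: nothing changed anywhere.
    [ "$1" = "empty" ] && return 1
    # Positive evidence of failure: a runner actually ran and was red.
    [ "$2" = "red" ] && return 1
    # Everything else, including every inconclusive case, passes through.
    return 0
}
```

For example, `evidence_decision inconclusive none` returns 0 (no git repo, no test suite), while `evidence_decision nonzero red` returns 1.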
325
+ ---
326
+
327
+ ## 6. evidence-block.json schema (mirror gate-block.json)
328
+
329
+ Written to `$COUNCIL_STATE_DIR/evidence-block.json` (`<loki_dir>/council/`),
330
+ atomic temp+mv, so the dashboard and handoff can surface *why* a run did not
331
+ complete:
332
+
333
+ ```json
334
+ {
335
+   "status": "blocked",
336
+   "blocked": true,
337
+   "blocked_at": "2026-06-07T00:00:00Z",
338
+   "iteration": 12,
339
+   "reason": "empty_diff",
340
+   "checks": {
341
+     "diff": {"ok": false, "base_sha": "abc123", "files_changed": 0, "sources": "committed|unstaged|staged|untracked union empty"},
342
+     "tests": {"ok": true, "runner": "pytest", "pass": true}
343
+   },
344
+   "failures": ["empty git diff vs run-start SHA (nothing shipped)"]
345
+ }
346
+ ```
347
+
348
+ `reason` is one of `empty_diff`, `tests_red`, or `empty_diff_and_tests_red`.
349
+ `failures` is a short human-readable list (cap 5, like gate-block). On gate pass
350
+ the file is removed.
351
+
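The `reason` values and the atomic temp+mv write can be sketched as follows. Both helper names are hypothetical, and the report written here is a reduced subset of the schema (it omits `checks` and `failures`) to keep the example short.

```shell
# Hypothetical helper: map two failure flags ("1" = failed, "0" = ok) to `reason`.
# $1 = diff failed?  $2 = tests failed?
evidence_block_reason() {
    case "$1$2" in
        11) echo "empty_diff_and_tests_red" ;;
        10) echo "empty_diff" ;;
        01) echo "tests_red" ;;
        *)  return 1 ;;   # nothing failed: no block report should be written
    esac
}

# Hypothetical helper: write a reduced block report atomically.
# $1 = council state dir, $2 = reason, $3 = iteration (integer)
write_evidence_block() {
    mkdir -p "$1" || return 1
    # Create the temp file inside the target directory so mv is a
    # same-filesystem rename and readers never see a half-written file.
    _tmp="$(mktemp "$1/evidence-block.json.XXXXXX")" || return 1
    printf '{"status":"blocked","blocked":true,"blocked_at":"%s","iteration":%d,"reason":"%s"}\n' \
        "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$3" "$2" > "$_tmp" \
        && mv "$_tmp" "$1/evidence-block.json"
}
```

A caller would typically combine them as `write_evidence_block "$COUNCIL_STATE_DIR" "$(evidence_block_reason 1 0)" "$ITERATION_COUNT"`.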
352
+ ---
353
+
354
+ ## 7. Dual-route (Bun/TS) parity
355
+
356
+ **No Bun change needed.** The live runtime is the bash council (run.sh sources
357
+ completion-council.sh and calls council_should_stop). `loki-ts` council.ts is an
358
+ unwired port slice. State this in the CHANGELOG NOT-tested section so a future
359
+ TS port author knows to mirror `council_evidence_gate` when the TS council is
360
+ made live. Adding a TS stub now would be dead code with no runtime to exercise
361
+ it.
362
+
363
+ ---
364
+
365
+ ## 8. Tests
366
+
367
+ New shell test `tests/test-evidence-gate.sh`, following the source-extraction
368
+ pattern of `tests/test-pytest-gate-timeout.sh` (awk-extract the function from
369
+ completion-council.sh into a minimal harness with stubbed `log_*`,
370
+ `COUNCIL_STATE_DIR`, and `ITERATION_COUNT`) OR an end-to-end fixture that sets up
371
+ a throwaway git repo + `.loki/quality/test-results.json` and calls the sourced
372
+ function directly. Cases:
373
+
374
+ 1. **Empty diff -> blocked.** Repo with run-start SHA == HEAD (no commits since),
375
+ test-results green => `council_evidence_gate` returns 1; `evidence-block.json`
376
+ written with `reason: empty_diff`.
377
+ 2. **Real diff + green tests -> allowed.** Commit a change after start-sha,
378
+ `runner=pytest,pass=true` => returns 0; no `evidence-block.json` (and a
379
+ pre-existing one is removed).
380
+ 3. **Real diff + red tests -> blocked.** `runner=pytest,pass=false` => returns 1;
381
+ `reason: tests_red`.
382
+ 4. **No-test project -> not falsely blocked.** Real diff,
383
+ `runner=none,pass=true` => returns 0.
384
+ 5. **No git repo -> not falsely blocked.** Run in a non-git dir => returns 0.
385
+ 6. **Knob off -> behaves as before.** `LOKI_EVIDENCE_GATE=0` with an empty diff =>
386
+ returns 0 and writes NO file.
387
+ 7. **Stale block cleanup.** Pre-create `evidence-block.json`, then call with
388
+ passing evidence => file removed, returns 0.
389
+ 8. **Repeat-run baseline recapture.** Simulate a completed prior run (stale
390
+ `.loki/state/start-sha` pointing at an old SHA), invoke `run_autonomous`
391
+ with `ITERATION_COUNT == 0`, commit nothing new => start-sha is recaptured at
392
+ current HEAD and the empty-diff path blocks (proves the gate is not toothless
393
+ on run 2+). With `ITERATION_COUNT > 0` (resume), the baseline is preserved.
394
+
395
+ Register in `tests/run-all-tests.sh`. Skip gracefully (exit 0 with SKIP) when
396
+ `git` or `python3` is unavailable, matching existing test conventions.
397
+
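A possible skeleton for the skip and assertion plumbing of such a test is sketched here. The helper names `require_tools` and `assert_rc` are assumptions for illustration and are not taken from the existing test suite.

```shell
# Hypothetical helper: succeed only when every named tool is on PATH.
require_tools() {
    for _tool in "$@"; do
        if ! command -v "$_tool" >/dev/null 2>&1; then
            echo "SKIP: $_tool not available"
            return 1
        fi
    done
    return 0
}

# Hypothetical helper: run a command and compare its exit code.
# $1 = expected exit code, $2 = label, remaining args = command to run
assert_rc() {
    _want="$1"; _label="$2"; shift 2
    _got=0
    "$@" || _got=$?
    if [ "$_got" -eq "$_want" ]; then
        echo "PASS: $_label"
    else
        echo "FAIL: $_label (want $_want, got $_got)"
        return 1
    fi
}
```

A test file built on this would start with `require_tools git python3 || exit 0` and then express each case as, for example, `assert_rc 1 "empty diff blocks" council_evidence_gate`.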
398
+ ---
399
+
400
+ ## 9. CHANGELOG entry (honest) + NOT-tested
401
+
402
+ Under `## [7.19.1] - <date>`:
403
+
404
+ ```
405
+ ### Added
406
+ - Verified completion / evidence hard gate (default-on, opt out with
407
+ `LOKI_EVIDENCE_GATE=0`). The completion council now refuses to approve STOP
408
+ unless there is real on-disk evidence that the run actually shipped: a nonzero
409
+ git diff between the run-start SHA (newly captured to `.loki/state/start-sha`)
410
+ and HEAD, AND a green test signal from `.loki/quality/test-results.json` where
411
+ a test suite exists. Cloned from the existing `council_checklist_gate` pattern
412
+ and slotted into `council_evaluate` right after the checklist gate, plus the
413
+ dashboard force-review approval path for parity. On block it writes
414
+ `.loki/council/evidence-block.json` (mirroring gate-block.json) so the
415
+ dashboard/handoff can surface why. Attacks the "agent claims done when nothing
416
+ shipped" trust-killer. Blocks only on positive fabrication evidence (empty
417
+ diff, or a runner that actually ran and was red); every inconclusive case (no
418
+ git repo, no baseline, missing/unparseable test results, no test suite,
419
+ docs-only changes) passes through so a legitimate completion is never falsely
420
+ stopped.
421
+
422
+ ### Honest limits / NOT tested
423
+ - This gate proves "something changed and tests are green," NOT PRD-semantic
424
+ correctness. Semantic judgment remains with the council votes + Devil's
425
+ Advocate.
426
+ - The force-stop safety valves in `council_should_stop` (stagnation, repeated
427
+ done-signals) are deliberately NOT gated: they are resource-protection exits
428
+ that do NOT write the `COMPLETED` marker, so they cannot launder a fabricated
429
+ "done" as an approved completion.
430
+ - Bun/TS council (`loki-ts`) is an unwired port slice; the live runtime is bash.
431
+ No TS change shipped. A future TS port must mirror `council_evidence_gate`.
432
+ - Not exercised on Windows; relies on POSIX `git`/`python3`.
433
+ - Not exercised against shallow clones beyond the `git diff HEAD` fallback path.
434
+ ```
435
+
436
+ ---
437
+
438
+ ## 10. Risks
439
+
440
+ | Risk | Likelihood | Impact | Mitigation |
441
+ |---|---|---|---|
442
+ | False block stops a legit run | Med | High | Block only on positive fabrication evidence; all inconclusive => pass; opt-out `LOKI_EVIDENCE_GATE=0`; evidence-block.json explains why so user can act fast |
443
+ | Run-start SHA never captured (no git / zero-commit repo) | Med | Med | Empty/missing baseline => inconclusive => pass-through (never block) |
444
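+ | Run-start SHA reset on resume/pause | Med | High (would zero the diff window) | Keep the existing `.loki/state/start-sha` on a genuine resume (`ITERATION_COUNT > 0`); recapture only on a fresh run (`ITERATION_COUNT == 0`) or when the file is missing or empty |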
445
+ | test-results.json stale (PHASE_UNIT_TESTS off, or gate skipped) | Med | Med | Missing/unparseable => inconclusive => pass; `enforce_test_coverage` runs before the council check in the same iteration, so the file is normally fresh; the gate does not itself detect staleness, so note the freshness dependency |
446
+ | Cannot detect "tests were expected" reliably | Med | Med | Use `runner` field: `none` = no suite (pass), non-`none` = a runner ran (its `pass` bool is authoritative). Do NOT infer test expectation from checklist results.json |
447
+ | Force-review path bypasses gate | High (without fix) | High | Insertion point B adds evidence gate to run.sh:12762-12784 alongside checklist gate |
448
+ | Interaction with checklist gate | Low | Low | Sequenced strictly after it; independent `return 1` semantics; each writes its own block file |
449
+ | Interaction with Devil's Advocate | Low | Low | Gate runs pre-vote; DA only runs on unanimous COMPLETE, which the gate can prevent from being reached. DA's own skeptical test/diff checks remain as a second layer |
450
+ | Diff against unreachable base (shallow) | Low | Low | Fall back to `git diff --name-only HEAD` (mirrors proof-generator.py); if that also fails => inconclusive => pass |
451
+ | Auto-commit makes `git diff HEAD` empty | High | High (would false-block) | Diff = UNION of `base..HEAD` + unstaged + staged + untracked; block only when all are empty. Run-start baseline (not HEAD working tree) is what makes committed changes count post-auto-commit |
452
+ | Stale baseline on repeat runs (gate no-ops on run 2+) | High (without fix) | High (defeats feature) | Recapture start-sha when `ITERATION_COUNT == 0` (fresh run); keep only on genuine resume (`ITERATION_COUNT > 0`) |
453
+
454
+ ---
455
+
456
+ ## Critical files for implementation
457
+
458
+ - /Users/lokesh/git/loki-mode/autonomy/completion-council.sh (clone council_checklist_gate -> council_evidence_gate; insertion A in council_evaluate)
459
+ - /Users/lokesh/git/loki-mode/autonomy/run.sh (run-start SHA capture in run_autonomous; insertion B in force-review path)
460
+ - /Users/lokesh/git/loki-mode/tests/test-evidence-gate.sh (new test, pattern from test-pytest-gate-timeout.sh)
461
+ - /Users/lokesh/git/loki-mode/CHANGELOG.md (honest entry + NOT-tested)
462
+ - /Users/lokesh/git/loki-mode/tests/run-all-tests.sh (register new test)
@@ -1,5 +1,5 @@
1
1
  // @bun
2
- var f8=Object.defineProperty;var u8=($)=>$;function c8($,Q){this[$]=u8.bind(null,Q)}var g=($,Q)=>{for(var K in Q)f8($,K,{get:Q[K],enumerable:!0,configurable:!0,set:c8.bind(Q,K)})};var k=($,Q)=>()=>($&&(Q=$($=0)),Q);var X1=import.meta.require;var F$={};g(F$,{lokiDir:()=>P,homeLokiDir:()=>o1,findRepoRootForVersion:()=>d1,REPO_ROOT:()=>f});import{resolve as n,dirname as l1}from"path";import{fileURLToPath as p8}from"url";import{existsSync as L1}from"fs";import{homedir as l8}from"os";function d8(){let $=j$;for(let Q=0;Q<6;Q++){if(L1(n($,"VERSION"))&&L1(n($,"autonomy/run.sh")))return $;let K=l1($);if(K===$)break;$=K}return n(j$,"..","..","..")}function d1($){let Q=$;for(let K=0;K<6;K++){if(L1(n(Q,"VERSION"))&&L1(n(Q,"autonomy/run.sh")))return Q;let Z=l1(Q);if(Z===Q)break;Q=Z}return n($,"..","..","..")}function P(){return process.env.LOKI_DIR??n(process.cwd(),".loki")}function o1(){return n(l8(),".loki")}var j$,f;var y=k(()=>{j$=l1(p8(import.meta.url));f=d8()});import{readFileSync as o8}from"fs";import{resolve as n8,dirname as a8}from"path";import{fileURLToPath as s8}from"url";function k1(){if($1!==null)return $1;let $="7.19.0";if(typeof $==="string"&&$.length>0)return $1=$,$1;try{let Q=a8(s8(import.meta.url)),K=d1(Q);$1=o8(n8(K,"VERSION"),"utf-8").trim()}catch{$1="unknown"}return $1}var $1=null;var n1=k(()=>{y()});var E$={};g(E$,{runOrThrow:()=>t8,run:()=>j,commandVersion:()=>i8,commandExists:()=>v,ShellError:()=>a1});async function j($,Q={}){let K=Bun.spawn({cmd:[...$],stdout:"pipe",stderr:"pipe",env:Q.env?{...process.env,...Q.env}:process.env,cwd:Q.cwd}),Z,z;if(Q.timeoutMs&&Q.timeoutMs>0)Z=setTimeout(()=>{try{K.kill("SIGTERM")}catch{}z=setTimeout(()=>{try{K.kill("SIGKILL")}catch{}},2000)},Q.timeoutMs);try{let[H,X,q]=await Promise.all([new Response(K.stdout).text(),new Response(K.stderr).text(),K.exited]);return{stdout:H,stderr:X,exitCode:q}}finally{if(Z)clearTimeout(Z);if(z)clearTimeout(z)}}async function t8($,Q={}){let K=await j($,Q);if(K.exitCode!==0)throw new 
a1(`command failed (${K.exitCode}): ${$.join(" ")}`,K.exitCode,K.stdout,K.stderr);return K}async function v($){let Q=r8($),K=await j(["sh","-c",`command -v ${Q}`],{timeoutMs:5000});if(K.exitCode===0)return K.stdout.trim()||null;return null}function r8($){if(!/^[A-Za-z0-9._/-]+$/.test($))throw Error(`refused to shell-escape suspect token: ${$}`);return $}async function i8($,Q="--version"){if(!await v($))return null;let Z=await j([$,Q],{timeoutMs:5000});if(Z.exitCode!==0)return null;return((Z.stdout||Z.stderr).split(/\r?\n/)[0]?.trim()??"")||null}var a1;var d=k(()=>{a1=class a1 extends Error{message;exitCode;stdout;stderr;constructor($,Q,K,Z){super($);this.message=$;this.exitCode=Q;this.stdout=K;this.stderr=Z;this.name="ShellError"}}});function a($){return e8?"":$}var e8,T,N,w,ZK,_,R,h,J;var c=k(()=>{e8=(process.env.NO_COLOR??"").length>0;T=a("\x1B[0;31m"),N=a("\x1B[0;32m"),w=a("\x1B[1;33m"),ZK=a("\x1B[0;34m"),_=a("\x1B[0;36m"),R=a("\x1B[1m"),h=a("\x1B[2m"),J=a("\x1B[0m")});import{existsSync as U7}from"fs";async function Q1(){if(B1!==void 0)return B1;let $="/opt/homebrew/bin/python3.12";if(U7($))return B1=$,$;let Q=await v("python3.12");if(Q)return B1=Q,Q;let K=await v("python3");return B1=K,K}async function K1($,Q={}){let K=await Q1();if(!K)return{stdout:"",stderr:"python3 not found",exitCode:127};return j([K,"-c",$],Q)}var B1;var H1=k(()=>{d()});var d$={};g(d$,{runStatus:()=>N7});import{existsSync as b,readFileSync as q1,readdirSync as v$,statSync as f$}from"fs";import{resolve as D,basename as P7}from"path";import{homedir as L7}from"os";async function j7(){if(await v("jq"))return!0;return process.stdout.write(`${T}Error: jq is required but not installed.${J}
2
+ var f8=Object.defineProperty;var u8=($)=>$;function c8($,Q){this[$]=u8.bind(null,Q)}var g=($,Q)=>{for(var K in Q)f8($,K,{get:Q[K],enumerable:!0,configurable:!0,set:c8.bind(Q,K)})};var k=($,Q)=>()=>($&&(Q=$($=0)),Q);var X1=import.meta.require;var F$={};g(F$,{lokiDir:()=>P,homeLokiDir:()=>o1,findRepoRootForVersion:()=>d1,REPO_ROOT:()=>f});import{resolve as n,dirname as l1}from"path";import{fileURLToPath as p8}from"url";import{existsSync as L1}from"fs";import{homedir as l8}from"os";function d8(){let $=j$;for(let Q=0;Q<6;Q++){if(L1(n($,"VERSION"))&&L1(n($,"autonomy/run.sh")))return $;let K=l1($);if(K===$)break;$=K}return n(j$,"..","..","..")}function d1($){let Q=$;for(let K=0;K<6;K++){if(L1(n(Q,"VERSION"))&&L1(n(Q,"autonomy/run.sh")))return Q;let Z=l1(Q);if(Z===Q)break;Q=Z}return n($,"..","..","..")}function P(){return process.env.LOKI_DIR??n(process.cwd(),".loki")}function o1(){return n(l8(),".loki")}var j$,f;var y=k(()=>{j$=l1(p8(import.meta.url));f=d8()});import{readFileSync as o8}from"fs";import{resolve as n8,dirname as a8}from"path";import{fileURLToPath as s8}from"url";function k1(){if($1!==null)return $1;let $="7.19.2";if(typeof $==="string"&&$.length>0)return $1=$,$1;try{let Q=a8(s8(import.meta.url)),K=d1(Q);$1=o8(n8(K,"VERSION"),"utf-8").trim()}catch{$1="unknown"}return $1}var $1=null;var n1=k(()=>{y()});var E$={};g(E$,{runOrThrow:()=>t8,run:()=>j,commandVersion:()=>i8,commandExists:()=>v,ShellError:()=>a1});async function j($,Q={}){let K=Bun.spawn({cmd:[...$],stdout:"pipe",stderr:"pipe",env:Q.env?{...process.env,...Q.env}:process.env,cwd:Q.cwd}),Z,z;if(Q.timeoutMs&&Q.timeoutMs>0)Z=setTimeout(()=>{try{K.kill("SIGTERM")}catch{}z=setTimeout(()=>{try{K.kill("SIGKILL")}catch{}},2000)},Q.timeoutMs);try{let[H,X,q]=await Promise.all([new Response(K.stdout).text(),new Response(K.stderr).text(),K.exited]);return{stdout:H,stderr:X,exitCode:q}}finally{if(Z)clearTimeout(Z);if(z)clearTimeout(z)}}async function t8($,Q={}){let K=await j($,Q);if(K.exitCode!==0)throw new 
a1(`command failed (${K.exitCode}): ${$.join(" ")}`,K.exitCode,K.stdout,K.stderr);return K}async function v($){let Q=r8($),K=await j(["sh","-c",`command -v ${Q}`],{timeoutMs:5000});if(K.exitCode===0)return K.stdout.trim()||null;return null}function r8($){if(!/^[A-Za-z0-9._/-]+$/.test($))throw Error(`refused to shell-escape suspect token: ${$}`);return $}async function i8($,Q="--version"){if(!await v($))return null;let Z=await j([$,Q],{timeoutMs:5000});if(Z.exitCode!==0)return null;return((Z.stdout||Z.stderr).split(/\r?\n/)[0]?.trim()??"")||null}var a1;var d=k(()=>{a1=class a1 extends Error{message;exitCode;stdout;stderr;constructor($,Q,K,Z){super($);this.message=$;this.exitCode=Q;this.stdout=K;this.stderr=Z;this.name="ShellError"}}});function a($){return e8?"":$}var e8,T,N,w,ZK,_,R,h,J;var c=k(()=>{e8=(process.env.NO_COLOR??"").length>0;T=a("\x1B[0;31m"),N=a("\x1B[0;32m"),w=a("\x1B[1;33m"),ZK=a("\x1B[0;34m"),_=a("\x1B[0;36m"),R=a("\x1B[1m"),h=a("\x1B[2m"),J=a("\x1B[0m")});import{existsSync as U7}from"fs";async function Q1(){if(B1!==void 0)return B1;let $="/opt/homebrew/bin/python3.12";if(U7($))return B1=$,$;let Q=await v("python3.12");if(Q)return B1=Q,Q;let K=await v("python3");return B1=K,K}async function K1($,Q={}){let K=await Q1();if(!K)return{stdout:"",stderr:"python3 not found",exitCode:127};return j([K,"-c",$],Q)}var B1;var H1=k(()=>{d()});var d$={};g(d$,{runStatus:()=>N7});import{existsSync as b,readFileSync as q1,readdirSync as v$,statSync as f$}from"fs";import{resolve as D,basename as P7}from"path";import{homedir as L7}from"os";async function j7(){if(await v("jq"))return!0;return process.stdout.write(`${T}Error: jq is required but not installed.${J}
3
3
  `),process.stdout.write(`Install with:
4
4
  `),process.stdout.write(` brew install jq (macOS)
5
5
  `),process.stdout.write(` apt install jq (Debian/Ubuntu)
@@ -787,4 +787,4 @@ Set LOKI_LEGACY_BASH=1 to force the bash CLI for every command.
787
787
  `),2}default:return process.stderr.write(`Unknown command: ${Q}
788
788
  `),process.stderr.write(v8),2}}g$();process.on("SIGINT",()=>process.exit(130));process.on("SIGTERM",()=>process.exit(143));var p3=await c3(Bun.argv.slice(2));process.exit(p3);
789
789
 
790
- //# debugId=5322302C43EF9AB364756E2164756E21
790
+ //# debugId=A2F8B15FD75062F064756E2164756E21
package/mcp/__init__.py CHANGED
@@ -57,4 +57,4 @@ try:
57
57
  except ImportError:
58
58
  __all__ = ['mcp']
59
59
 
60
- __version__ = '7.19.0'
60
+ __version__ = '7.19.2'
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "loki-mode",
3
- "version": "7.19.0",
3
+ "version": "7.19.2",
4
4
  "description": "Loki Mode by Autonomi. Autonomous spec-to-product system: takes a PRD, GitHub issue, OpenAPI/JSON/YAML, or one-line brief to a deployed app via the RARV-C closure loop with 11 quality gates. Provider-agnostic (Claude Code, OpenAI Codex, Cline, Aider).",
5
5
  "keywords": [
6
6
  "agent",
@@ -110,6 +110,36 @@ LOKI_HANDOFF_MD=1 # write a structured handoff doc to
110
110
  Optional: `LOKI_AUTO_LEARNINGS_EPISODE=1` also writes the learning into
111
111
  the Python episodic memory layer via `memory.engine.save_episode`.
112
112
 
113
+ ## Verified-completion evidence gate (v7.19.1, default-on)
114
+
115
+ The completion council will not accept a "done" claim without evidence. Before
116
+ completion is honored (on BOTH the council path AND the default
117
+ completion-promise route), `council_evidence_gate` requires:
118
+
119
+ - a nonzero git diff vs the run-start SHA (something was actually shipped), AND
120
+ - green tests (`.loki/quality/test-results.json` shows the runner passed).
121
+
122
+ The diff is the union of committed, staged, unstaged, and untracked changes
123
+ (`--exclude-standard`, so gitignored artifacts do not count), with `.loki/`
124
+ runtime state excluded. Inconclusive cases (no git repo, no baseline, no
125
+ test-results file, `runner=none`) pass through and never false-block a
126
+ legitimate first run.
127
+
128
+ ```bash
129
+ LOKI_EVIDENCE_GATE=0 # opt out: completion is honored without the
130
+ # evidence check (byte-identical to pre-v7.19.1).
131
+ # Default is on (1).
132
+ ```
133
+
134
+ When the gate blocks, it prints the reason and this opt-out to the terminal,
135
+ writes `.loki/council/evidence-block.json`, and surfaces in the dashboard
136
+ (`/api/council/gate` -> `evidence`; the Quality Gates panel shows a banner). A
137
+ persistent block keeps iterating only up to `MAX_ITERATIONS`, then stops
138
+ cleanly; it cannot hang. Honest limit: this proves something-changed-and-tests-
139
+ pass, not PRD-semantic correctness (the council vote is the semantic check).
140
+ The common false-block is a project whose tests were ALREADY failing (red) before the run; the
141
+ one-step opt-out is the escape hatch.
142
+
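The decision rule described in this section can be sketched as a pure function. This is a minimal illustration in Python, not the shipped `council_evidence_gate` (which runs in the shell runner). The function name `evidence_verdict`, its parameters, and the `passed` key are assumptions made for the example; only the `runner=none` case and the pass-through behaviour come from the text above.

```python
from typing import Optional, Tuple


def evidence_verdict(
    changed_files: Optional[int],
    test_results: Optional[dict],
) -> Tuple[bool, str]:
    """Return (allow_completion, reason).

    changed_files: count of changed paths vs the run-start SHA, or None when
        the diff is inconclusive (no git repo, no baseline SHA).
    test_results: parsed test-results data, or None when the file is missing.
        The "passed" key is a hypothetical field name for this sketch.
    """
    # Block only on positive evidence: a conclusive, empty diff.
    if changed_files is not None and changed_files == 0:
        return False, "no changes since run start"

    # No results file or no detected runner is inconclusive: pass through.
    if test_results is None or test_results.get("runner") == "none":
        return True, "tests inconclusive; pass-through"

    passed = test_results.get("passed")
    if passed is None:
        # Unknown outcome is also treated as inconclusive in this sketch.
        return True, "tests inconclusive; pass-through"
    if passed:
        return True, "diff present and tests green"
    return False, "tests not green"
```

The design choice worth noting is that every `None` input resolves to "allow": the gate only refuses when it has a conclusive empty diff or a conclusive red test signal.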
113
143
  **Override-judge knobs (v7.5.4+):**
114
144
 
115
145
  ```bash
@@ -172,6 +202,91 @@ crash via the primitive's `finally` cleanup.
172
202
 
173
203
  ---
174
204
 
205
+ ## Uncertainty-gated escalation (v7.19.2, default-on)
206
+
207
+ When Loki is likely stuck or thrashing, it escalates proactively to the human
208
+ via the existing PAUSE + notification + handoff machinery, rather than silently
209
+ burning iterations until max-iterations. No new metacognition: the system
210
+ reuses three proxy signals that already exist and escalates only when at least
211
+ two of the three co-occur for N consecutive rounds.
212
+
213
+ ### Trigger condition
214
+
215
+ Three proxy signals are evaluated each iteration:
216
+
217
+ - **Proxy 1 (no-change counter):** `consecutive_no_change` in council state.json
218
+ reaches `LOKI_UNCERTAINTY_NOCHANGE_MIN` (default: `COUNCIL_STAGNATION_LIMIT - 1`,
219
+ i.e. one below the circuit-breaker limit so escalation fires before the
220
+ breaker ends the run).
221
+ - **Proxy 2 (diff-hash oscillation):** the current iteration's combined diff
222
+ hash matches a hash seen 2+ rounds back in a bounded ring buffer (A -> B -> A
223
+ pattern). Detects oscillation/revert cycling; does not fire on the trivial
224
+ immediate-repeat case which proxy 1 already covers.
225
+ - **Proxy 3 (persistent council split):** the last `LOKI_UNCERTAINTY_SPLIT_ROUNDS`
226
+ consecutive council verdicts are all REJECTED-with-at-least-one-approver
227
+ (split verdict). This signal is stale between council votes but fresh whenever
228
+ proxy 1 is hot, because a hot proxy 1 forces a circuit-breaker vote that refreshes verdicts.
229
+
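Proxy 2 can be illustrated with a small bounded history. The sketch below is in Python and is an assumption-laden illustration, not the actual implementation: the class name `OscillationDetector`, the default capacity of 8, and the reading of "immediate repeat" as "equal to the previous round's hash" are all choices made for this example.

```python
from collections import deque


class OscillationDetector:
    """Flags a diff hash that recurs at distance >= 2 (A -> B -> A)."""

    def __init__(self, capacity: int = 8) -> None:
        # capacity is an illustrative value, not a documented default.
        self._history = deque(maxlen=capacity)

    def observe(self, diff_hash: str) -> bool:
        """Record this round's hash; return True when proxy 2 is hot."""
        past = list(self._history)
        previous = past[-1] if past else None
        older = past[:-1]  # everything except the immediately preceding round
        # Hot only if the hash was seen 2+ rounds back AND the tree changed in
        # between; an unchanged tree (immediate repeat) is left to proxy 1.
        hot = diff_hash != previous and diff_hash in older
        self._history.append(diff_hash)
        return hot
```

Feeding the hashes `A`, `B`, `A` marks the third round hot, while `A`, `A`, `A` never does, which matches the stated intent of leaving the no-change case to proxy 1.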
230
+ Escalation fires when `hot_count >= 2` (at least two proxies hot simultaneously)
231
+ for `LOKI_UNCERTAINTY_ROUNDS` consecutive rounds AND the episode has not already
232
+ been escalated (one escalation per stuck-episode, with re-arm when co-occurrence
233
+ clears).
234
+
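The co-occurrence rule, including the one-escalation-per-episode latch and its re-arm, can be sketched as a tiny state machine. This is a hedged Python illustration; `EscalationState` and `step` are hypothetical names, and `rounds` stands in for `LOKI_UNCERTAINTY_ROUNDS`.

```python
from dataclasses import dataclass


@dataclass
class EscalationState:
    streak: int = 0          # consecutive rounds with >= 2 hot proxies
    escalated: bool = False  # True once the current stuck-episode has fired


def step(state: EscalationState, p1: bool, p2: bool, p3: bool,
         rounds: int = 2) -> bool:
    """Advance one iteration; return True if escalation fires this round."""
    hot_count = sum((p1, p2, p3))
    if hot_count < 2:
        # Co-occurrence cleared: end the episode and re-arm the latch.
        state.streak = 0
        state.escalated = False
        return False
    state.streak += 1
    if state.streak >= rounds and not state.escalated:
        state.escalated = True
        return True
    return False
```

With `rounds=2`, four consecutive rounds of two hot proxies fire exactly once (on the second round); a single cooler round re-arms the latch so a later stuck-episode can escalate again.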
235
+ ### Action
236
+
237
+ When the trigger condition is met, the run.sh action block:
238
+
239
+ 1. Prints a loud terminal line with the opt-out env var.
240
+ 2. Calls `write_structured_handoff "uncertainty_escalation"` (saves
241
+ `.loki/memory/handoffs/<ts>.json` and `.md`).
242
+ 3. Calls `notify_intervention_needed` with a structured reason string.
243
+ 4. Writes a `.loki/signals/UNCERTAINTY_ESCALATION` marker file.
244
+ 5. Touches `.loki/PAUSE`.
245
+
246
+ ### Knobs
247
+
248
+ ```bash
249
+ LOKI_UNCERTAINTY_ESCALATION=0 # Disable entirely. Byte-identical when off:
250
+ # zero reads, zero writes, no state file.
251
+ # Default: 1 (enabled). Toggle value is 0/1,
252
+ # not false/true.
253
+ LOKI_UNCERTAINTY_ROUNDS=2 # Consecutive co-occurrence rounds required.
254
+ # Recommended range 2-3. Default: 2.
255
+ LOKI_UNCERTAINTY_NOCHANGE_MIN=N # Proxy 1 threshold. Unset = auto-computed as
256
+ # COUNCIL_STAGNATION_LIMIT - 1 (floored at 1).
257
+ LOKI_UNCERTAINTY_SPLIT_ROUNDS=2 # Proxy 3 trailing split-round run length.
258
+ # Default: 2.
259
+ ```
260
+
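How the knobs resolve to effective values can be sketched as follows. This is an illustrative Python helper, not code from the package: `resolve_uncertainty_knobs` is a hypothetical name, and `stagnation_limit` is passed in explicitly because the default of `COUNCIL_STAGNATION_LIMIT` is not stated here.

```python
from typing import Mapping


def resolve_uncertainty_knobs(env: Mapping[str, str],
                              stagnation_limit: int) -> dict:
    """Resolve the escalation knobs from an environment-like mapping."""
    raw_min = env.get("LOKI_UNCERTAINTY_NOCHANGE_MIN", "")
    if raw_min.strip():
        nochange_min = int(raw_min)
    else:
        # Unset: one below the circuit-breaker limit, floored at 1.
        nochange_min = max(1, stagnation_limit - 1)
    return {
        # The toggle is 0/1; only the literal "0" disables in this sketch.
        "enabled": env.get("LOKI_UNCERTAINTY_ESCALATION", "1") != "0",
        "rounds": int(env.get("LOKI_UNCERTAINTY_ROUNDS", "2")),
        "nochange_min": nochange_min,
        "split_rounds": int(env.get("LOKI_UNCERTAINTY_SPLIT_ROUNDS", "2")),
    }
```

For example, with an empty environment and a stagnation limit of 5, the proxy 1 threshold resolves to 4; with a limit of 1 it is floored at 1 rather than dropping to 0.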
261
+ Configurable via `config.yaml` under `completion.uncertainty.*` (see
262
+ `autonomy/config.example.yaml`).
263
+
264
+ ### Honest limits
265
+
266
+ - **Perpetual-mode = notify-only by default.** `AUTONOMY_MODE` defaults to
267
+ `perpetual`. In perpetual mode the existing consumer (`check_human_intervention`)
268
+ auto-clears PAUSE and continues. Escalation therefore degrades to a notification
269
+ plus a handoff document; it does NOT halt the run. The terminal prints an explicit
270
+ warning at the escalation site: "Perpetual mode: PAUSE will be auto-cleared; this
271
+ is notify-only and will NOT halt the run."
272
+ - **Proxy 2 is count-blind by origin.** It approximates oscillation with
273
+ diff-hash recurrence-at-distance; it cannot distinguish a genuine revert from
274
+ a coincidental identical tree state, and misses oscillation where the hash
275
+ differs every round.
276
+ - **Proxy 3 is stale between council votes.** Verdicts are only appended when the
277
+ council actually votes (every `COUNCIL_CHECK_INTERVAL` or circuit-forced). In
278
+ practice p3 is always fresh in the regime that matters (proxy 1 hot forces a
279
+ vote), but it may lag by up to `COUNCIL_CHECK_INTERVAL` iterations otherwise.
280
+ - **These are heuristics, not true metacognition.** The system does not know it
281
+ is stuck; it infers stuckness from three correlated symptoms. A legitimately
282
+ hard refactor that produces no net diff for several rounds while the council
283
+ remains split can false-fire. Requiring >=2 co-occurring for N rounds reduces
284
+ but does not eliminate false fires. The cost of a false fire is bounded: one
285
+ notification + one handoff + one PAUSE (auto-cleared in perpetual), opt-out
286
+ at the site.
287
+
288
+ ---
289
+
175
290
  ## Guardrails Execution Modes
176
291
 
177
292
  - **Blocking**: Guardrail completes before agent starts (use for expensive operations)