loki-mode 7.5.17 → 7.5.28

Files changed (47)
  1. package/README.md +10 -9
  2. package/SKILL.md +14 -14
  3. package/VERSION +1 -1
  4. package/autonomy/completion-council.sh +26 -3
  5. package/autonomy/lib/claude-flags.sh +132 -0
  6. package/autonomy/lib/mcp-config.sh +160 -0
  7. package/autonomy/lib/project-graph.sh +685 -0
  8. package/autonomy/lib/voter-agents.sh +356 -0
  9. package/autonomy/loki +108 -111
  10. package/autonomy/run.sh +95 -186
  11. package/bin/loki +12 -1
  12. package/dashboard/__init__.py +1 -1
  13. package/dashboard/requirements.txt +13 -8
  14. package/dashboard/server.py +33 -15
  15. package/dashboard/static/index.html +298 -299
  16. package/docs/INSTALLATION.md +54 -21
  17. package/docs/retrospectives/v7.5.15-fleet-postmortem.md +325 -0
  18. package/docs/retrospectives/v7.5.15-honesty-audit.md +136 -0
  19. package/docs/retrospectives/v7.5.15-llm-failure-modes.md +49 -0
  20. package/loki-ts/data/finding-schema.json +74 -0
  21. package/loki-ts/data/model-pricing.json +12 -0
  22. package/loki-ts/dist/loki.js +198 -172
  23. package/mcp/__init__.py +1 -1
  24. package/mcp/lsp_proxy.py +713 -0
  25. package/mcp/requirements.txt +9 -3
  26. package/mcp/tests/__init__.py +0 -0
  27. package/mcp/tests/test_lsp_proxy.py +377 -0
  28. package/memory/app_graph.py +153 -0
  29. package/memory/storage.py +6 -1
  30. package/memory/tests/test_app_graph.py +134 -0
  31. package/package.json +4 -3
  32. package/providers/claude.sh +115 -4
  33. package/providers/codex.sh +2 -2
  34. package/providers/loader.sh +4 -4
  35. package/providers/model_catalog.json +0 -9
  36. package/providers/models.sh +1 -2
  37. package/references/multi-provider.md +26 -35
  38. package/references/prompt-repetition.md +1 -1
  39. package/references/quality-control.md +1 -1
  40. package/skills/00-index.md +3 -3
  41. package/skills/model-selection.md +11 -14
  42. package/skills/providers.md +17 -57
  43. package/skills/quality-gates.md +2 -2
  44. package/skills/troubleshooting.md +1 -1
  45. package/src/integrations/github/action-handler.js +3 -2
  46. package/src/protocols/tools/start-project.js +1 -1
  47. package/providers/gemini.sh +0 -343
package/docs/retrospectives/v7.5.15-llm-failure-modes.md
@@ -0,0 +1,49 @@
+ # v7.5.15 LLM Failure Modes Catalogue
+
+ This table catalogues LLM failure modes that occurred (or were explicitly
+ considered and ruled out) during the v7.5.15 8-agent fleet session. "Failure
+ mode" means a systematic LLM behavioral error, not a human process mistake.
+ Each row covers one category. If a category did not materialize in this
+ session, it is marked NOT OBSERVED so the reader can confirm it was
+ considered, not overlooked. All session moments and file references are
+ specific: no generalities.
+
+ | Failure mode | Where it appeared | What caught it | Cost if missed |
+ |---|---|---|---|
+ | Confident hallucination (agent claims a file/function exists that doesn't) | NOT OBSERVED in this session | n/a | n/a |
+ | Stale context (agent acts on memory of prior state instead of current) | Dev5's `autonomy/loki` worktree was branched from pre-merge `HEAD` (`2ce36624`), i.e., prior to Dev3+Dev4's merges. The integrator treated the worktree file as current state and bulk-`cp`'d it over the integrated tree. This is the integrator exhibiting stale-context behavior. | `grep -c "init-rules" autonomy/loki` returned 0 instead of 4. Immediate recovery via `git checkout HEAD -- autonomy/loki` + surgical `Edit` of only Dev5's two help blocks. | Dev3's `init-rules` subcommand (9/9 tests) and Dev4's `cmd_doctor_json` sentrux field (13/13 tests) would have shipped silently regressed. Two advertised features gone from the binary. |
+ | Scope creep (agent does more than asked, breaking adjacent code) | NOT OBSERVED in this session. All 8 dev agents touched only their assigned functions/files. Dev7 explicitly disclaimed out-of-scope sites (`cross_project.py:101`) rather than patching them. | n/a | n/a |
+ | Silent test rot (tests written but not wired to runner) | 7 of 8 new test suites (Dev1 through Dev7) were not registered in `tests/run-all-tests.sh`. Only Dev8's `test-ci-sentrux-coverage.sh` was wired. R1, R2, R3 all ran the tests directly and confirmed PASS without asking whether CI would ever invoke them. | Devil's Advocate reviewer asked "will these tests run tomorrow?" and checked `tests/run-all-tests.sh` directly. Caught pre-commit. Fixed by adding 5 bash entries + 2 pytest wrapper scripts at `tests/run-all-tests.sh:85-99`. | 7 tests silently orphaned. Future regressions in sentrux wire-in, dashboard endpoint, init-rules, doctor JSON parity, dashboard nav UAT, pytest timeout, and episode resilience paths would go undetected until a manual invocation. |
+ | Cross-agent state assumption (Dev N assumes Dev M's branch is in some state without verifying) | Integrator assumed Dev5's worktree file was safe to `cp` over the integrated main tree. The assumption was that worktree files reflected the post-merge state; they did not (Dev5's branch point was `2ce36624`, before Dev3 + Dev4 merges). See `v7.5.15-fleet-postmortem.md` section 1. | Structural checksum: `grep -c "init-rules" autonomy/loki` == 0 (expected 4). Caught immediately after the bad copy. REPEATABILITY: LOW (manual, not codified -- see fleet-postmortem section 1 recommendation 3 for the fix). | Same as stale context above: two shipped features silently deleted from the binary. |
+ | Optimistic verification (agent says "tested" without running test) | R3 (first dispatch) returned "Still running. Let me wait for monitor." as its entire output. R3 had not run any of the 6 cross-cutting integration checks assigned to it, but its status was "completed". The agent used language implying work was in progress when no work was done. | Integrator read the result body. 7-word fragment clearly not a structured 6-risk verdict. Re-spawned R3 with a prompt containing literal `bash` blocks and "DO NOT wait for any monitor." R3-retry completed in 61s. See `v7.5.15-fleet-postmortem.md` section 2. | Integration safety review would have been missing from the record. Subtle cross-file interactions between Dev1+Dev6's `autonomy/run.sh` edits and Dev3+Dev4's `autonomy/loki` edits would have gone unchecked. A previously undetected conflict could have shipped. |
+ | Path-prefix confusion (rm -rf interpreted as dangerous when it isn't) | `validate-bash` hook matched the pattern `rm -rf /Users/lokesh/git/loki-mode/.sentrux` against its dangerous-rm guard (`rm -rf /` prefix check). The path is a subdirectory of the project, not the root, but the string match fired a false positive. Hook returned BLOCKED. | The hook itself caught and surfaced the block explicitly -- no silent data loss. Workaround: rewrite as relative path `rm -rf ./.sentrux`. Documented in session notes from the validate-bash hook output. | If the hook had silently suppressed the error and continued, a subtly wrong rm invocation could have run. Conversely, if the integrator had bypassed the hook without understanding why, it could have established a precedent of ignoring hook blocks. |
+ | Exit code over-trust (relying on subprocess exit code that lies) | Pre-existing MCP test failure: `python3 -c "import mcp"` exits non-zero on this Mac because `pip install mcp` was never run. The exit code does not represent a regression introduced by v7.5.15 -- it is an environmental gap that predates the session. If taken at face value it could be misread as a newly broken test. | Verified pre-existing by reproducing on `origin/main` before the release commit. CHANGELOG v7.5.15 documents it honestly as "single failure is pre-existing pip install mcp env gap". 24/25 PASS is the accurate count. | Misfiled as a release regression: would have blocked the release or, if overridden, set a precedent of shipping with unacknowledged test failures. |
+ | Bash vs zsh source-file scope leaking | session-reported, unverified by audit. Reported behavior: `command -v sentrux` returned empty under zsh mid-session even though the binary was on PATH. Plausible mechanism: zsh PATH hash cache had not refreshed to see the binary added during the session. No file:line cite, no grep output, no git ref was captured that would allow an auditor to reproduce or confirm this independently. | session-reported, unverified by audit. Reported workaround: switching to `bash -c` for the test invocation forces a fresh hash evaluation and the invocation succeeded. No codified guardrail exists for this behavior. | If the zsh/bash inconsistency had been attributed to a code bug rather than a shell hash quirk, time would have been spent diagnosing a non-existent defect in the sentrux integration code. Alternatively, the test could have been skipped and a real future failure masked. |
+ | Tool hook noise drowning real failures | First `bash tests/run-all-tests.sh` invocation was piped through `tail -15`. The tail window captured only 15 lines of output, discarding all per-test PASS/FAIL lines and the full run summary. The visible output showed only the last few lines of the final test's output -- no actionable signal. | Integrator noticed the output was truncated (no "Tests Run:" summary line) and re-ran without the `tail` pipe. Second run produced full per-test output: 24/25 PASS with the one failing test identified by name. | If the truncated output had been accepted as the verification record, the integrator would have had no per-test attribution. Any failure that landed outside the last 15 lines would have been invisible. The release would have shipped without a verified pass/fail count per test. |
+ | (beyond user's 10) Build artifact drift after source-only merge | The bun-parity matrix in `local-ci.sh` ran both bash and Bun routes through `diff -q` and surfaced a 5/13 failure caused by stale `loki-ts/dist/loki.js`. The Bun dist had not been rebuilt after source changes were merged, so the Bun route diverged from the bash route at 5 commands. This is a distinct failure mode from stale context: the source was current, but the compiled artifact was not. | `bash scripts/local-ci.sh` bun-parity matrix caught the 5/13 diff before the commit. Fix was `bun run build` followed by re-staging the updated dist file. | If the stale dist had shipped, 5 Bun-route commands would have exhibited the pre-merge behavior while the bash route exhibited the post-merge behavior. Users on the Bun route would have silently received the older behavior with no error. |
+
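The "silent test rot" row describes a check that a reviewer performed by hand. A minimal sketch of how that check might be automated follows. It assumes test scripts match a `test-*.sh` pattern and that the runner mentions each script by file name; both conventions are assumptions for illustration. `find_unregistered_tests`, `demo`, and `test-init-rules.sh` are hypothetical names and are not taken from the package.

```python
import re
import tempfile
from pathlib import Path


def find_unregistered_tests(tests_dir: Path, runner: Path, pattern: str = "test-*.sh") -> list[str]:
    """Return names of test files in tests_dir that the runner script never mentions."""
    runner_text = runner.read_text(encoding="utf-8")
    missing = []
    for test_file in sorted(tests_dir.glob(pattern)):
        # Match the whole file name so "test-a.sh" is not counted as registered
        # merely because "my-test-a.sh" or "test-a.sh.bak" appears in the runner.
        name = re.escape(test_file.name)
        if not re.search(rf"(?<![\w.-]){name}(?![\w.-])", runner_text):
            missing.append(test_file.name)
    return missing


def demo() -> list[str]:
    """Build a throwaway tests directory with one wired and one orphaned script."""
    with tempfile.TemporaryDirectory() as tmp:
        tests = Path(tmp)
        (tests / "test-ci-sentrux-coverage.sh").write_text("#!/bin/bash\n")
        (tests / "test-init-rules.sh").write_text("#!/bin/bash\n")  # hypothetical name
        runner = tests / "run-all-tests.sh"
        runner.write_text('bash "tests/test-ci-sentrux-coverage.sh"\n')
        return find_unregistered_tests(tests, runner)


if __name__ == "__main__":
    for name in demo():
        print(f"not registered in runner: {name}")
```

A check of this kind only proves that a file name appears in the runner, not that the runner actually executes it, so it narrows the gap rather than closing it.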
+ ## What went right
+
+ The failure modes above were caught -- none shipped to users -- because
+ several guardrails worked exactly as intended. The `validate-bash` hook
+ surfaced the false-positive rm block explicitly rather than silently passing
+ or silently failing. The post-merge structural grep checksum (`grep -c
+ "init-rules"`) caught the bad-copy regression in seconds. The Devil's
+ Advocate role, operating with an explicitly non-domain mandate separate from
+ R1/R2/R3, asked the meta-question about test runner wiring that three
+ domain-specific reviewers each had reason not to ask. The `local-ci`
+ pre-push gate (`bash scripts/local-ci.sh`, 21/21 PASS) caught the stale
+ `loki-ts/dist/loki.js` artifact before the commit: the bun-parity matrix
+ ran both bash and Bun routes through `diff -q` and surfaced the 5/13
+ failure that triggered a `bun run build` before staging. The integrator's
+ discipline of reading reviewer output rather than trusting "completed"
+ status caught R3's fragment response before it was treated as a verdict.
+ Taken together: grep-based checksums, the DA contrarian role, local-ci
+ pre-push, the bun-parity matrix, and the validate-bash hook are the
+ codified mechanisms that kept this session's failure modes from becoming
+ user-facing defects. Note: the integrator reading R3's result body before
+ accepting its status as a verdict was human attentiveness, not a repeatable
+ guardrail. It is not listed here as a mechanism; it is listed as the action
+ that exposed the gap. The gap itself (agents reporting "completed" without
+ producing required output) is a process deficiency without a codified fix
+ as of this session.
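The closing gap (a "completed" status with no usable output) could in principle be narrowed by inspecting the result body before accepting it. The sketch below shows one possible shape. It assumes reviewers are asked to emit one `Risk N: PASS` or `Risk N: FAIL` line per assigned check; that line format and the names `missing_verdicts` and `accept_result` are hypothetical and were not used in the session.

```python
import re

# Hypothetical verdict line format: "Risk 3: PASS" or "Risk 3: FAIL".
VERDICT_LINE = re.compile(r"^\s*risk\s+(\d+)\s*:\s*(PASS|FAIL)\b", re.IGNORECASE | re.MULTILINE)


def missing_verdicts(body: str, expected_risks: int) -> list[int]:
    """Return the risk numbers (1..expected_risks) that have no verdict line in body."""
    seen = {int(match.group(1)) for match in VERDICT_LINE.finditer(body)}
    return [n for n in range(1, expected_risks + 1) if n not in seen]


def accept_result(status: str, body: str, expected_risks: int) -> bool:
    """Accept a reviewer result only if it is completed AND covers every assigned risk."""
    return status == "completed" and not missing_verdicts(body, expected_risks)


if __name__ == "__main__":
    fragment = "Still running. Let me wait for monitor."
    print(accept_result("completed", fragment, 6))   # status alone is not enough
    print(missing_verdicts(fragment, 6))
```

The design choice here is to treat the status field as necessary but not sufficient: the body has to carry a verdict for every assigned check before the result counts.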
package/loki-ts/data/finding-schema.json
@@ -0,0 +1,74 @@
+ {
+   "$schema": "http://json-schema.org/draft-07/schema#",
+   "$id": "https://loki-mode.dev/schemas/finding.json",
+   "title": "Council Multi-Finding Response",
+   "description": "Phase C (v7.5.20) JSON contract Claude Code must emit when invoked with --agents <voter-set> and --json-schema. Top-level is an object with a 'findings' array, one entry per dispatched voter. Each finding describes a single voter's verdict on the current iteration. NOTE on reconciliation with the existing AgentVerdict type (loki-ts/src/runner/council.ts:128-133): the schema field 'vote' maps onto AgentVerdict.verdict; the schema field 'confidence' is required at emission time but is NOT carried onto AgentVerdict (validateFinding drops it). Optional 'severity' and 'suggested_action' top-level per-finding fields are likewise dropped from AgentVerdict. The schema is the authoritative shape Claude must produce; AgentVerdict is the simplified shape internal council logic consumes.",
+   "type": "object",
+   "additionalProperties": false,
+   "required": ["findings"],
+   "properties": {
+     "findings": {
+       "type": "array",
+       "minItems": 1,
+       "items": {
+         "type": "object",
+         "additionalProperties": false,
+         "required": ["role", "vote", "reason", "confidence"],
+         "properties": {
+           "role": {
+             "type": "string",
+             "minLength": 1,
+             "maxLength": 128,
+             "description": "Voter slug, e.g. 'requirements-verifier', 'test-auditor', 'convergence-voter', 'devils-advocate'."
+           },
+           "vote": {
+             "type": "string",
+             "enum": ["APPROVE", "REJECT", "CANNOT_VALIDATE"],
+             "description": "Voter's verdict. Maps onto AgentVerdict.verdict on the consumer side."
+           },
+           "reason": {
+             "type": "string",
+             "minLength": 1,
+             "maxLength": 4000,
+             "description": "Free-text explanation of the vote. Surfaced verbatim in council transcripts."
+           },
+           "confidence": {
+             "type": "number",
+             "minimum": 0,
+             "maximum": 1,
+             "description": "Voter's self-reported confidence in the vote, 0..1. Used by future weighting passes; not carried onto AgentVerdict in v7.5.20."
+           },
+           "severity": {
+             "type": "string",
+             "enum": ["CRITICAL", "HIGH", "MEDIUM", "LOW"],
+             "description": "Top-level severity hint for the finding. Optional; per-issue severities live under 'issues'."
+           },
+           "suggested_action": {
+             "type": "string",
+             "maxLength": 2000,
+             "description": "Short remediation hint. Optional; surfaced in transcripts when present."
+           },
+           "issues": {
+             "type": "array",
+             "items": {
+               "type": "object",
+               "additionalProperties": false,
+               "required": ["severity", "description"],
+               "properties": {
+                 "severity": {
+                   "type": "string",
+                   "enum": ["CRITICAL", "HIGH", "MEDIUM", "LOW"]
+                 },
+                 "description": {
+                   "type": "string",
+                   "minLength": 1,
+                   "maxLength": 2000
+                 }
+               }
+             }
+           }
+         }
+       }
+     }
+   }
+ }
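To make the constraints in this schema concrete, here is a hand-written, standard-library-only sketch that checks a payload against the same rules (required fields, enums, length and range limits, no extra properties). It is a Python approximation written for illustration, not the package's `validateFinding`; the function names and the sample payloads are hypothetical.

```python
VOTES = {"APPROVE", "REJECT", "CANNOT_VALIDATE"}
SEVERITIES = {"CRITICAL", "HIGH", "MEDIUM", "LOW"}
REQUIRED = ("role", "vote", "reason", "confidence")
OPTIONAL = ("severity", "suggested_action", "issues")


def _text(value, min_len, max_len):
    """True when value is a string whose length lies in [min_len, max_len]."""
    return isinstance(value, str) and min_len <= len(value) <= max_len


def _enum(value, allowed):
    """True when value is a string that is one of the allowed enum members."""
    return isinstance(value, str) and value in allowed


def issue_errors(issue):
    if not isinstance(issue, dict):
        return ["issue is not an object"]
    errors = [f"missing {k}" for k in ("severity", "description") if k not in issue]
    errors += [f"unexpected field {k}" for k in sorted(set(issue) - {"severity", "description"})]
    if "severity" in issue and not _enum(issue["severity"], SEVERITIES):
        errors.append("severity not in enum")
    if "description" in issue and not _text(issue["description"], 1, 2000):
        errors.append("description must be 1..2000 characters")
    return errors


def finding_errors(finding):
    if not isinstance(finding, dict):
        return ["finding is not an object"]
    errors = [f"missing {k}" for k in REQUIRED if k not in finding]
    errors += [f"unexpected field {k}" for k in sorted(set(finding) - set(REQUIRED + OPTIONAL))]
    if "role" in finding and not _text(finding["role"], 1, 128):
        errors.append("role must be 1..128 characters")
    if "vote" in finding and not _enum(finding["vote"], VOTES):
        errors.append("vote not in enum")
    if "reason" in finding and not _text(finding["reason"], 1, 4000):
        errors.append("reason must be 1..4000 characters")
    if "confidence" in finding:
        c = finding["confidence"]
        # bool is a subclass of int in Python, but JSON true/false are not numbers.
        if isinstance(c, bool) or not isinstance(c, (int, float)) or not 0 <= c <= 1:
            errors.append("confidence must be a number in 0..1")
    if "severity" in finding and not _enum(finding["severity"], SEVERITIES):
        errors.append("severity not in enum")
    if "suggested_action" in finding and not _text(finding["suggested_action"], 0, 2000):
        errors.append("suggested_action must be at most 2000 characters")
    if "issues" in finding:
        if not isinstance(finding["issues"], list):
            errors.append("issues must be an array")
        else:
            for i, issue in enumerate(finding["issues"]):
                errors += [f"issues[{i}]: {e}" for e in issue_errors(issue)]
    return errors


def response_errors(payload):
    """Return a list of human-readable violations; an empty list means the payload conforms."""
    if not isinstance(payload, dict):
        return ["top level is not an object"]
    errors = [f"unexpected top-level field {k}" for k in sorted(set(payload) - {"findings"})]
    findings = payload.get("findings")
    if not isinstance(findings, list) or len(findings) < 1:
        return errors + ["findings must be a non-empty array"]
    for i, finding in enumerate(findings):
        errors += [f"findings[{i}]: {e}" for e in finding_errors(finding)]
    return errors


if __name__ == "__main__":
    # Sample payload invented for this sketch.
    sample = {"findings": [{"role": "test-auditor", "vote": "APPROVE",
                            "reason": "Sample reason text.", "confidence": 0.9}]}
    print(response_errors(sample))
```

In practice a general-purpose JSON Schema validator would be the more usual choice; the hand-written form is used here only so the individual rules are visible.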
package/loki-ts/data/model-pricing.json
@@ -0,0 +1,12 @@
+ {
+   "$schema_version": 1,
+   "_comment": "Rolling pricing table consumed by loki-ts/src/runner/budget.ts. Update this file when Anthropic / OpenAI / others publish new prices; no code change required. Pricing is USD per 1 million tokens. Aliases (opus/sonnet/haiku) point to the latest model of that family per providers/model_catalog.json.",
+   "_updated": "2026-05-22",
+   "_source": "https://www.anthropic.com/pricing + provider docs",
+   "pricing": {
+     "opus": { "input": 5.0, "output": 25.0 },
+     "sonnet": { "input": 3.0, "output": 15.0 },
+     "haiku": { "input": 1.0, "output": 5.0 },
+     "gpt-5.3-codex": { "input": 1.5, "output": 12.0 }
+   }
+ }
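Because the table is priced in USD per 1 million tokens, a cost estimate is a weighted sum divided by 1,000,000. The sketch below illustrates that arithmetic with the four entries copied from the file above. `estimate_cost_usd` is a hypothetical helper and makes no claim about how `loki-ts/src/runner/budget.ts` is actually implemented.

```python
# Rates copied from the "pricing" object in the diff: USD per 1 million tokens.
PRICING = {
    "opus": {"input": 5.0, "output": 25.0},
    "sonnet": {"input": 3.0, "output": 15.0},
    "haiku": {"input": 1.0, "output": 5.0},
    "gpt-5.3-codex": {"input": 1.5, "output": 12.0},
}


def estimate_cost_usd(model: str, input_tokens: int, output_tokens: int, pricing=PRICING) -> float:
    """Estimate the USD cost of one call from token counts and per-million rates."""
    if model not in pricing:
        raise KeyError(f"no pricing entry for model: {model}")
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    rates = pricing[model]
    # Sum the per-million weighted totals first, then scale once.
    return (input_tokens * rates["input"] + output_tokens * rates["output"]) / 1_000_000


if __name__ == "__main__":
    cost = estimate_cost_usd("sonnet", 200_000, 50_000)
    print(f"{cost:.2f} USD")
```

For example, 200,000 input tokens and 50,000 output tokens on `sonnet` come to 0.60 + 0.75 = 1.35 USD at the listed rates.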