loki-mode 7.26.0 → 7.28.0
- package/README.md +15 -13
- package/SKILL.md +11 -2
- package/VERSION +1 -1
- package/autonomy/completion-council.sh +310 -6
- package/autonomy/context-tracker.py +32 -7
- package/autonomy/grill.sh +321 -0
- package/autonomy/lib/trust_metrics.py +636 -0
- package/autonomy/loki +142 -0
- package/autonomy/prd-checklist.sh +248 -14
- package/autonomy/run.sh +283 -32
- package/autonomy/spec.sh +646 -0
- package/autonomy/verify.sh +1130 -0
- package/dashboard/__init__.py +1 -1
- package/dashboard/static/index.html +1 -1
- package/docs/COMPARISON.md +9 -9
- package/docs/COMPETITIVE-ANALYSIS.md +18 -37
- package/docs/INSTALLATION.md +1 -1
- package/docs/auto-claude-comparison.md +9 -6
- package/docs/certification/01-core-concepts/lesson.md +3 -3
- package/docs/competitive/emergence-others-analysis.md +1 -1
- package/docs/competitive/replit-lovable-analysis.md +1 -1
- package/docs/cursor-comparison.md +1 -1
- package/docs/prd-purple-lab-platform.md +1 -1
- package/docs/show-hn-post.md +2 -2
- package/loki-ts/dist/loki.js +2 -2
- package/mcp/__init__.py +1 -1
- package/package.json +2 -1
- package/providers/codex.sh +3 -2
- package/references/agent-types.md +9 -9
- package/references/agents.md +8 -8
- package/references/business-ops.md +1 -1
- package/references/competitive-analysis.md +1 -1
- package/skills/agents.md +3 -3
- package/skills/providers.md +3 -3
- package/skills/quality-gates.md +46 -0
package/README.md
CHANGED
|
@@ -18,7 +18,7 @@
|
|
|
18
18
|
|
|
19
19
|
---
|
|
20
20
|
|
|
21
|
-
> **How it works:** Drop a spec -- a PRD, GitHub issue, OpenAPI/JSON/YAML, or one-line brief. Loki Mode classifies complexity (`run.sh:detect_complexity()`), assembles an agent team from 41 specialized
|
|
21
|
+
> **How it works:** Drop a spec -- a PRD, GitHub issue, OpenAPI/JSON/YAML, or one-line brief. Loki Mode classifies complexity (`run.sh:detect_complexity()`), assembles an agent team from 41 specialized agent roles across 8 domains - prompt-defined specifications the orchestrator adopts per phase, with parallel review (blind council) and optional worktree streams on Claude Code, sequential on other providers - and runs autonomous RARV cycles (Reason - Act - Reflect - Verify, see `run.sh:run_autonomous()`) with 11 quality gates (see `skills/quality-gates.md`). Code is not "done" until it passes automated verification. Output is a Git repo with source, tests, configs, and audit logs.
|
|
22
22
|
|
|
23
23
|
---
|
|
24
24
|
|
|
@@ -26,6 +26,8 @@
|
|
|
26
26
|
|
|
27
27
|
- **Spec-driven, autonomous, with a built-in trust layer** -- Hand Loki a spec, walk away, come back to working code with tests. The full RARV-C closure loop (Reason - Act - Reflect - Verify - Close) runs until the work is actually done, not just attempted. The verified-completion evidence gate (`skills/quality-gates.md`) refuses any "done" claim on an empty git diff against the run-start commit, and blocks completion when tests run red, so "complete" means proven, not promised.
|
|
28
28
|
- **Production quality built in** -- 11 quality gates (`skills/quality-gates.md`), blind 3-reviewer code review (`run.sh:run_code_review()`), anti-sycophancy checks
|
|
29
|
+
- **Standalone verification: `loki verify`** -- Run Loki's deterministic gates (build, tests, static analysis, secret scan, dependency audit) against any branch or PR diff, including code written by other agents or humans. CI-ready exit codes (0 VERIFIED, 1 CONCERNS, 2 BLOCKED), machine-readable evidence at `.loki/verify/evidence.json`. Inconclusive evidence is never reported as VERIFIED (v7.27.0).
|
|
30
|
+
- **Living spec and pre-build interrogation** -- `loki spec` locks a spec and detects drift deterministically (`spec.lock`, `drift-report.json`, and a `SPEC_DRIFT` finding in `loki verify` with CI exit codes), so you can tell when the build diverges from what was agreed. `loki grill` runs a Devil's-Advocate interrogation of the spec before you build, surfacing gaps and contradictions early (v7.28.0).
|
|
29
31
|
- **Live App Preview** -- The dashboard embeds the locally-running app in an iframe so you can interact with it immediately during a build. Use `loki preview` (alias `loki open`) to print the URL and open it in your browser. Local-first: no hosted service, no vendor lock (v7.24.0).
|
|
30
32
|
- **Compose-first fullstack** -- When a spec needs more than one service (web + database + cache) Loki generates a 12-factor `docker-compose.yml` with healthchecks, `depends_on` wiring, env-var config, and a `.env.example`. The Live App Preview surfaces the web service URL (not a database port), and health reflects the web service's Docker healthcheck so a crashed app shows as crashed even when the database stays up. Single-service apps stay on a plain run command. All local-first, no hosted service (v7.26.0).
|
|
31
33
|
- **Intelligent `loki start`** -- For interactive foreground runs the dashboard auto-opens in the browser (cross-platform; skipped in CI, SSH-without-TTY, and piped runs; opt out with `LOKI_NO_AUTO_OPEN=1`). The completion summary shows "Your app is live at <url>" so you know exactly where to try what Loki just built. The autonomous loop passes Claude Code's `--effort`, `--max-budget-usd`, and `--fallback-model` on every iteration (each gated on CLI support and individual opt-out env vars) for better long-run unattended execution (v7.25.0).
|
|
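The `loki verify` bullet in the hunk above names three CI exit codes (0 VERIFIED, 1 CONCERNS, 2 BLOCKED). A minimal sketch of how a CI wrapper might interpret them follows. The names `VerifyStatus`, `interpret_exit_code`, `should_fail_ci`, and the `fail_on_concerns` policy switch are hypothetical and not part of Loki Mode; only the three code-to-status pairs come from the README text.

```python
from enum import IntEnum


class VerifyStatus(IntEnum):
    """Exit codes as listed in the README bullet for `loki verify`."""
    VERIFIED = 0
    CONCERNS = 1
    BLOCKED = 2


def interpret_exit_code(code: int) -> str:
    """Map a process exit code to its status name, or UNKNOWN if unlisted."""
    try:
        return VerifyStatus(code).name
    except ValueError:
        return "UNKNOWN"


def should_fail_ci(code: int, fail_on_concerns: bool = False) -> bool:
    """Hypothetical CI policy: BLOCKED and unknown codes always fail the job;
    CONCERNS fails only when the caller opts in."""
    status = interpret_exit_code(code)
    if status == "VERIFIED":
        return False
    if status == "CONCERNS":
        return fail_on_concerns
    return True
```

Treating unlisted codes as failures is a deliberately conservative choice in this sketch, in the same spirit as the README's statement that inconclusive evidence is never reported as VERIFIED.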
@@ -47,7 +49,7 @@ Loki drives a coding agent CLI and orchestrates real builds, so it needs a few t
|
|
|
47
49
|
|
|
48
50
|
Required:
|
|
49
51
|
|
|
50
|
-
- An agent provider CLI
|
|
52
|
+
- An agent provider CLI: [Claude Code](https://docs.claude.com/en/docs/claude-code) (`claude`, Tier 1, recommended and E2E-verified - the provider Loki Mode is built for). Codex, Cline, and Aider are supported as experimental providers (wiring in place; not yet E2E-verified by us).
|
|
51
53
|
- Python 3.10+ (`python3`) for the dashboard, memory system, and orchestration helpers.
|
|
52
54
|
- Git 2.x (`git`) for checkpoints and worktrees.
|
|
53
55
|
- `curl` for installation and network calls.
|
|
@@ -87,9 +89,9 @@ loki quick "build a landing page with a signup form"
|
|
|
87
89
|
|
|
88
90
|
| Method | Command | Notes |
|
|
89
91
|
|--------|---------|-------|
|
|
90
|
-
| **Bun (recommended)** | `bun install -g loki-mode` | Fastest
|
|
92
|
+
| **Bun (recommended)** | `bun install -g loki-mode` | Fastest startup for CLI commands. |
|
|
91
93
|
| **Homebrew** | `brew tap asklokesh/tap && brew install loki-mode` | Auto-installs Bun as a dep |
|
|
92
|
-
| **Docker** | `docker pull asklokesh/loki-mode:7.
|
|
94
|
+
| **Docker** | `docker pull asklokesh/loki-mode:7.28.0 && docker run --rm asklokesh/loki-mode:7.28.0 start prd.md` | Bun pre-installed in image |
|
|
93
95
|
| **npm (compat)** | `npm install -g loki-mode` | Works without Bun (bash fallback). Migrate any time with `loki self-update --to bun`. |
|
|
94
96
|
|
|
95
97
|
**Upgrading:**
|
|
@@ -108,7 +110,7 @@ See the [Installation Guide](docs/INSTALLATION.md) for the long form.
|
|
|
108
110
|
|
|
109
111
|
## Runtime Architecture
|
|
110
112
|
|
|
111
|
-
Loki Mode
|
|
113
|
+
Loki Mode runs a dual runtime by deliberate design: the battle-tested Bash engine is the stable core (the autonomous loop, quality gates, and completion council stay on it; it receives bug fixes and hardening), and new product surfaces are built TypeScript/Bun-first as modules that wrap the engine rather than reimplement it. An earlier plan to make v8 Bun-only has been superseded by this stable-engine approach: rewriting the verified trust layer would risk the exact guarantees this product exists to provide, for no capability gain. Bash support is not going away.
|
|
112
114
|
|
|
113
115
|
**What ships today:**
|
|
114
116
|
|
|
@@ -149,7 +151,7 @@ The next major release sunsets the Bash runtime entirely. There is no firm calen
|
|
|
149
151
|
| Method | Command |
|
|
150
152
|
|--------|---------|
|
|
151
153
|
| **Homebrew** | `brew tap asklokesh/tap && brew install loki-mode` |
|
|
152
|
-
| **Docker** | `docker pull asklokesh/loki-mode:7.
|
|
154
|
+
| **Docker** | `docker pull asklokesh/loki-mode:7.28.0` |
|
|
153
155
|
| **Inside Claude Code** | `claude --dangerously-skip-permissions` then type "Loki Mode" |
|
|
154
156
|
| **Git clone** | `git clone https://github.com/asklokesh/loki-mode.git` |
|
|
155
157
|
|
|
@@ -227,8 +229,8 @@ Every iteration: **Reason** (read state) - **Act** (execute, commit) - **Reflect
|
|
|
227
229
|
</td>
|
|
228
230
|
<td width="33%" valign="top">
|
|
229
231
|
|
|
230
|
-
### 41 Agent
|
|
231
|
-
8
|
|
232
|
+
### 41 Agent Roles
|
|
233
|
+
8 domains: engineering, operations, business, data, product, growth, review, orchestration. These are prompt-defined role specifications the orchestrator adopts per phase, auto-composed by PRD complexity; parallelism comes from the blind review council, the adversarial reviewer, and optional git-worktree streams on Claude Code, sequential on other providers.
|
|
232
234
|
|
|
233
235
|
[Agent Types](references/agent-types.md)
|
|
234
236
|
|
|
@@ -331,14 +333,14 @@ Loki's autonomy and quality loop are the product; the underlying coding CLI is s
|
|
|
331
333
|
|
|
332
334
|
| Provider | Status | Autonomous Flag | Parallel Agents | Install |
|
|
333
335
|
|----------|--------|:-:|:-:|---------|
|
|
334
|
-
| **Claude Code** | Active (Tier 1) | `--dangerously-skip-permissions` | Yes (10+) | `npm i -g @anthropic-ai/claude-code` |
|
|
335
|
-
| **Codex CLI** |
|
|
336
|
-
| **Cline CLI** |
|
|
337
|
-
| **Aider** |
|
|
336
|
+
| **Claude Code** | Active (Tier 1, E2E-verified) | `--dangerously-skip-permissions` | Yes (10+) | `npm i -g @anthropic-ai/claude-code` |
|
|
337
|
+
| **Codex CLI** | Experimental (Tier 3) | `--full-auto --skip-git-repo-check` | Sequential | `npm i -g @openai/codex` |
|
|
338
|
+
| **Cline CLI** | Experimental (Tier 2) | `-y` | Sequential | `npm i -g @anthropic-ai/cline` |
|
|
339
|
+
| **Aider** | Experimental (Tier 3) | `--yes-always` | Sequential | `pip install aider-chat` |
|
|
338
340
|
| **Google Gemini CLI** | DEPRECATED v7.5.18 | -- | -- | Upstream deprecated; runtime removed. `LOKI_PROVIDER=gemini` exits with migration message. |
|
|
339
341
|
| **Anthropic Antigravity CLI** | Coming soon | -- | -- | Integration planned. |
|
|
340
342
|
|
|
341
|
-
Claude gets full features (subagents, parallelization, MCP, Task tool).
|
|
343
|
+
Status legend: "E2E-verified" means we run real spec-to-code builds on it ourselves. Claude Code is the primary, fully supported provider and the one Loki Mode is built for; it gets full features (subagents, parallelization, MCP, Task tool). "Experimental" means the wiring is in place but we have not produced an end-to-end verified build ourselves; treat as community-tested. Experimental providers run sequentially. Auto-failover switches providers when rate-limited. See [Provider Guide](skills/providers.md).
|
|
342
344
|
|
|
343
345
|
---
|
|
344
346
|
|
package/SKILL.md
CHANGED
|
@@ -3,7 +3,7 @@ name: loki-mode
|
|
|
3
3
|
description: Autonomous spec-driven build system with a built-in trust layer. It does not call work done until it is verified (RARV-C closure loop, 11 quality gates, completion council, verified-completion evidence gate). Triggers on "Loki Mode". Takes a spec (PRD, GitHub issue, OpenAPI doc, etc.) to deployed product with minimal human intervention. Provider-agnostic. Requires --dangerously-skip-permissions flag.
|
|
4
4
|
---
|
|
5
5
|
|
|
6
|
-
# Loki Mode v7.
|
|
6
|
+
# Loki Mode v7.28.0
|
|
7
7
|
|
|
8
8
|
**You are an autonomous agent. You make decisions. You do not ask questions. You do not stop.**
|
|
9
9
|
|
|
@@ -335,6 +335,15 @@ See `references/core-workflow.md` for the full RARV-C contract.
|
|
|
335
335
|
|
|
336
336
|
---
|
|
337
337
|
|
|
338
|
+
## Trust-layer additions (v7.28.0)
|
|
339
|
+
|
|
340
|
+
Two completion-trust features extend the verification gates. Full details in `skills/quality-gates.md`.
|
|
341
|
+
|
|
342
|
+
- **Held-out spec evals:** ~25% of checklist items (deterministic `sha256(id)` order, `N >= 4`) are reserved into `.loki/checklist/held-out.json` and excluded from the build prompt feed; the completion council blocks if a held-out item fails. Opt out with `LOKI_HELDOUT_GATE=0`. Honest limit: this guards the prompt feed, not a sandbox; the reservation file is on disk and an agent with filesystem access can read it.
|
|
343
|
+
- **Inconclusive-baseline disclosure:** when the evidence gate cannot establish a diff baseline (`no_git_repo` / `no_run_start_sha`) it writes `.loki/state/evidence-inconclusive.json` and `COMPLETION.txt` carries an honest "not independently verified" line. It never blocks non-git projects; red tests still block.
|
|
344
|
+
|
|
345
|
+
---
|
|
346
|
+
|
|
338
347
|
## Concurrency and Security Hardening (v7.5.7 - v7.5.13)
|
|
339
348
|
|
|
340
349
|
Three back-to-back patches closed cross-process and security gaps. No user-facing behavior change on the default flow; verify via the cited paths.
|
|
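The "Held-out spec evals" bullet in the SKILL.md hunk above describes reserving roughly 25% of checklist items in deterministic `sha256(id)` order, and only when there are at least 4 items. The sketch below shows one way such a selection could work. It is an illustration under assumptions, not the package's implementation: the function name `select_held_out`, the rounding rule (floor, with a minimum of one item), and the de-duplication step are all hypothetical.

```python
import hashlib


def select_held_out(item_ids, fraction=0.25, min_items=4):
    """Pick a deterministic subset of checklist ids to hold out.

    Assumptions (not confirmed by the diff): ids are strings, the count is
    floor(N * fraction) with a floor of 1, and fewer than `min_items`
    items means nothing is reserved.
    """
    ids = list(dict.fromkeys(item_ids))  # de-duplicate, keep first occurrence
    if len(ids) < min_items:
        return []
    # Ordering by the SHA-256 digest of each id makes the choice independent
    # of the order in which the checklist happened to be generated.
    ordered = sorted(
        ids, key=lambda i: hashlib.sha256(i.encode("utf-8")).hexdigest()
    )
    count = max(1, int(len(ids) * fraction))
    return ordered[:count]
```

Because the order depends only on the ids themselves, regenerating the same checklist yields the same reservation, which is the property the "deterministic" wording in the bullet suggests.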
@@ -383,4 +392,4 @@ See `CHANGELOG.md` entries [7.5.7], [7.5.8], [7.5.13] for the per-fix list and r
|
|
|
383
392
|
|
|
384
393
|
---
|
|
385
394
|
|
|
386
|
-
**v7.
|
|
395
|
+
**v7.28.0 | [Autonomi](https://www.autonomi.dev/) flagship product | ~260 lines core**
|
package/VERSION
CHANGED
|
@@ -1 +1 @@
|
|
|
1
|
-
7.
|
|
1
|
+
7.28.0
|
|
package/autonomy/completion-council.sh
CHANGED
|
@@ -752,6 +752,18 @@ with open(state_file, 'w') as f:
|
|
|
752
752
|
"threshold=$effective_threshold" \
|
|
753
753
|
"result=$([ $approve_count -ge $effective_threshold ] && echo 'APPROVED' || echo 'REJECTED')" 2>/dev/null || true
|
|
754
754
|
|
|
755
|
+
# Trust-metrics: durable per-vote record for the council rejection / split
|
|
756
|
+
# rate. The council state.json verdicts[] array is per-run only; this log is
|
|
757
|
+
# the cross-run corpus. Additive, best-effort, stdout-silent.
|
|
758
|
+
if type record_trust_event_bash &>/dev/null; then
|
|
759
|
+
record_trust_event_bash "council_vote" \
|
|
760
|
+
"approve=$approve_count" \
|
|
761
|
+
"reject=$reject_count" \
|
|
762
|
+
"threshold=$effective_threshold" \
|
|
763
|
+
"result=$([ $approve_count -ge $effective_threshold ] && echo 'APPROVED' || echo 'REJECTED')" \
|
|
764
|
+
>/dev/null 2>&1 || true
|
|
765
|
+
fi
|
|
766
|
+
|
|
755
767
|
# Write transcript for this council round (Path A: council_vote path)
|
|
756
768
|
local _ct_outcome
|
|
757
769
|
_ct_outcome=$([ $approve_count -ge $effective_threshold ] && echo "APPROVED" || echo "REJECTED")
|
|
@@ -1086,19 +1098,24 @@ council_reverify_checklist() {
|
|
|
1086
1098
|
council_checklist_gate() {
|
|
1087
1099
|
local results_file=".loki/checklist/verification-results.json"
|
|
1088
1100
|
local waivers_file=".loki/checklist/waivers.json"
|
|
1101
|
+
local heldout_file=".loki/checklist/held-out.json"
|
|
1089
1102
|
|
|
1090
1103
|
# No checklist = no gate (backwards compatible)
|
|
1091
1104
|
if [ ! -f "$results_file" ]; then
|
|
1092
1105
|
return 0
|
|
1093
1106
|
fi
|
|
1094
1107
|
|
|
1095
|
-
# Check for critical failures, excluding waived items
|
|
1108
|
+
# Check for critical failures, excluding waived AND held-out items. Held-out
|
|
1109
|
+
# items (v7.28.0) must NOT block here: they are evaluated separately by
|
|
1110
|
+
# council_heldout_gate at the ship gate, and surfacing them in this gate's
|
|
1111
|
+
# block report would leak their identity back into the build loop.
|
|
1096
1112
|
local gate_result
|
|
1097
|
-
gate_result=$(_RESULTS_FILE="$results_file" _WAIVERS_FILE="$waivers_file" python3 -c "
|
|
1113
|
+
gate_result=$(_RESULTS_FILE="$results_file" _WAIVERS_FILE="$waivers_file" _HELDOUT_FILE="$heldout_file" python3 -c "
|
|
1098
1114
|
import json, sys, os
|
|
1099
1115
|
|
|
1100
1116
|
results_file = os.environ['_RESULTS_FILE']
|
|
1101
1117
|
waivers_file = os.environ.get('_WAIVERS_FILE', '')
|
|
1118
|
+
heldout_file = os.environ.get('_HELDOUT_FILE', '')
|
|
1102
1119
|
|
|
1103
1120
|
try:
|
|
1104
1121
|
with open(results_file) as f:
|
|
@@ -1117,12 +1134,22 @@ if waivers_file and os.path.exists(waivers_file):
|
|
|
1117
1134
|
except (json.JSONDecodeError, KeyError):
|
|
1118
1135
|
pass
|
|
1119
1136
|
|
|
1120
|
-
#
|
|
1137
|
+
# Load held-out item ids (excluded from this gate)
|
|
1138
|
+
heldout_ids = set()
|
|
1139
|
+
if heldout_file and os.path.exists(heldout_file):
|
|
1140
|
+
try:
|
|
1141
|
+
with open(heldout_file) as f:
|
|
1142
|
+
heldout_ids = set(json.load(f).get('held_out', []))
|
|
1143
|
+
except (json.JSONDecodeError, KeyError):
|
|
1144
|
+
pass
|
|
1145
|
+
|
|
1146
|
+
# Find critical failures not waived and not held-out
|
|
1121
1147
|
critical_failures = []
|
|
1122
1148
|
for cat in results.get('categories', []):
|
|
1123
1149
|
for item in cat.get('items', []):
|
|
1124
1150
|
if item.get('priority') == 'critical' and item.get('status') == 'failing':
|
|
1125
|
-
|
|
1151
|
+
iid = item.get('id')
|
|
1152
|
+
if iid not in waived_ids and iid not in heldout_ids:
|
|
1126
1153
|
critical_failures.append(item.get('title', item.get('id', 'unknown')))
|
|
1127
1154
|
|
|
1128
1155
|
if critical_failures:
|
|
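The filter in the `council_checklist_gate` hunk above (critical and failing, not waived, not held-out) can be restated as a small pure function, which makes it easier to test in isolation. This is a sketch that mirrors the embedded Python shown in the diff; the function name `critical_failures` and its signature are hypothetical, and the `categories` / `items` / `priority` / `status` field names are taken from the hunk.

```python
def critical_failures(results, waived_ids=(), heldout_ids=()):
    """Return titles of critical failing items that should block this gate.

    Waived items are skipped (operator override). Held-out items are skipped
    too, since the diff's comments say they are evaluated separately and must
    not be surfaced in this gate's block report.
    """
    skip = set(waived_ids) | set(heldout_ids)
    failures = []
    for cat in results.get("categories", []):
        for item in cat.get("items", []):
            if item.get("priority") != "critical":
                continue
            if item.get("status") != "failing":
                continue
            if item.get("id") in skip:
                continue
            # Fall back to the id, then to a placeholder, as the hunk does.
            failures.append(item.get("title", item.get("id", "unknown")))
    return failures
```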
@@ -1175,6 +1202,221 @@ GATE_EOF
|
|
|
1175
1202
|
return 0
|
|
1176
1203
|
}
|
|
1177
1204
|
|
|
1205
|
+
#===============================================================================
|
|
1206
|
+
# Council Held-out Spec Eval Gate (v7.28.0) - anti-reward-hacking
|
|
1207
|
+
#===============================================================================
|
|
1208
|
+
# Held-out checklist items are reserved at PRD-checklist generation time and are
|
|
1209
|
+
# excluded from the prompt feed the build loop sees (checklist_summary, the build
|
|
1210
|
+
# prompt, and council_checklist_gate). The completion council evaluates them only
|
|
1211
|
+
# here, at the ship gate. Scope of the guarantee: this protects the prompt feed,
|
|
1212
|
+
# not a sandbox. .loki/checklist/held-out.json is plain on-disk JSON, so a
|
|
1213
|
+
# non-cooperative agent with filesystem tools can read the reservation directly;
|
|
1214
|
+
# the protection is against feeding held-out items to the loop, not isolation.
|
|
1215
|
+
# The gate uses the SAME verification machinery the
|
|
1216
|
+
# checklist already uses: council_reverify_checklist re-runs checklist-verify.py
|
|
1217
|
+
# over the FULL checklist (including held-out items), so this gate just reads
|
|
1218
|
+
# the held-out items' freshly-computed statuses from verification-results.json.
|
|
1219
|
+
#
|
|
1220
|
+
# A held-out item with status 'failing' blocks completion exactly like the
|
|
1221
|
+
# evidence gate (return 1 = CONTINUE). Pending/inconclusive items pass through.
|
|
1222
|
+
# Default-on ONLY when held-out items exist; opt out with LOKI_HELDOUT_GATE=0
|
|
1223
|
+
# (byte-identical to prior behavior: no read, no write).
|
|
1224
|
+
council_heldout_gate() {
|
|
1225
|
+
# Knob first: opt-out is exact-as-today, before any file read or write.
|
|
1226
|
+
[ "${LOKI_HELDOUT_GATE:-1}" = "0" ] && return 0
|
|
1227
|
+
|
|
1228
|
+
local results_file=".loki/checklist/verification-results.json"
|
|
1229
|
+
local heldout_file=".loki/checklist/held-out.json"
|
|
1230
|
+
local waivers_file=".loki/checklist/waivers.json"
|
|
1231
|
+
|
|
1232
|
+
# No held-out reservation = no gate (default-off when nothing reserved).
|
|
1233
|
+
if [ ! -f "$heldout_file" ] || [ ! -f "$results_file" ]; then
|
|
1234
|
+
return 0
|
|
1235
|
+
fi
|
|
1236
|
+
|
|
1237
|
+
if [ -z "${COUNCIL_STATE_DIR:-}" ]; then
|
|
1238
|
+
COUNCIL_STATE_DIR="${TARGET_DIR:-.}/.loki/council"
|
|
1239
|
+
fi
|
|
1240
|
+
|
|
1241
|
+
# Evaluate held-out items against their freshly-verified statuses. Output is
|
|
1242
|
+
# a single line "<verdict> <pass> <fail>" where verdict is NONE (no held-out
|
|
1243
|
+
# items reserved, gate inert), STALE (ids reserved but ZERO matched current
|
|
1244
|
+
# items -> reservation orphaned by a checklist regeneration), PASS, or BLOCK.
|
|
1245
|
+
# The failing titles are NOT carried in this line (a checklist title may
|
|
1246
|
+
# contain ':' or '|'); they are read separately from the held-out JSON block
|
|
1247
|
+
# below in the BLOCK branch.
|
|
1248
|
+
local gate_result
|
|
1249
|
+
gate_result=$(_RESULTS_FILE="$results_file" _HELDOUT_FILE="$heldout_file" _WAIVERS_FILE="$waivers_file" python3 -c "
|
|
1250
|
+
import json, sys, os
|
|
1251
|
+
|
|
1252
|
+
results_file = os.environ['_RESULTS_FILE']
|
|
1253
|
+
heldout_file = os.environ['_HELDOUT_FILE']
|
|
1254
|
+
waivers_file = os.environ.get('_WAIVERS_FILE', '')
|
|
1255
|
+
|
|
1256
|
+
try:
|
|
1257
|
+
with open(results_file) as f:
|
|
1258
|
+
results = json.load(f)
|
|
1259
|
+
with open(heldout_file) as f:
|
|
1260
|
+
heldout_ids = set(json.load(f).get('held_out', []))
|
|
1261
|
+
except (json.JSONDecodeError, IOError, KeyError):
|
|
1262
|
+
print('NONE 0 0')
|
|
1263
|
+
sys.exit(0)
|
|
1264
|
+
|
|
1265
|
+
# No held-out items reserved (e.g. N<4): gate is inert. Emit NONE so the caller
|
|
1266
|
+
# skips the trust-event entirely (no no-op heldout_eval pollution per round).
|
|
1267
|
+
if not heldout_ids:
|
|
1268
|
+
print('NONE 0 0')
|
|
1269
|
+
sys.exit(0)
|
|
1270
|
+
|
|
1271
|
+
# Waived held-out items are not counted as failures (operator override path).
|
|
1272
|
+
waived_ids = set()
|
|
1273
|
+
if waivers_file and os.path.exists(waivers_file):
|
|
1274
|
+
try:
|
|
1275
|
+
with open(waivers_file) as f:
|
|
1276
|
+
waived_ids = {w['item_id'] for w in json.load(f).get('waivers', []) if w.get('active', True)}
|
|
1277
|
+
except (json.JSONDecodeError, KeyError):
|
|
1278
|
+
pass
|
|
1279
|
+
|
|
1280
|
+
# HIGH-1(b): track how many held-out ids actually matched a current item. If the
|
|
1281
|
+
# reservation lists ids but ZERO matched (orphaned after a checklist regen), the
|
|
1282
|
+
# gate must NOT report PASS (that reads as evaluated-and-passed). 'matched' is
|
|
1283
|
+
# distinct from passed/failed: an all-pending matched set legitimately yields
|
|
1284
|
+
# passed=0 failed=0 and must stay PASS/pass-through, not STALE.
|
|
1285
|
+
matched = 0
|
|
1286
|
+
passed = 0
|
|
1287
|
+
failed = 0
|
|
1288
|
+
for cat in results.get('categories', []):
|
|
1289
|
+
for item in cat.get('items', []):
|
|
1290
|
+
iid = item.get('id', '')
|
|
1291
|
+
if iid not in heldout_ids:
|
|
1292
|
+
continue
|
|
1293
|
+
matched += 1
|
|
1294
|
+
if iid in waived_ids:
|
|
1295
|
+
continue
|
|
1296
|
+
status = item.get('status')
|
|
1297
|
+
if status == 'verified':
|
|
1298
|
+
passed += 1
|
|
1299
|
+
elif status == 'failing':
|
|
1300
|
+
failed += 1
|
|
1301
|
+
# pending/inconclusive: pass-through (not counted as pass or fail block)
|
|
1302
|
+
|
|
1303
|
+
if matched == 0:
|
|
1304
|
+
# Reservation is stale: ids exist but none map to a current item. Selection-
|
|
1305
|
+
# side repair (checklist_select_heldout) fixes this next iteration; emit STALE
|
|
1306
|
+
# so this round is recorded honestly rather than as a silent PASS.
|
|
1307
|
+
print('STALE 0 0')
|
|
1308
|
+
sys.exit(0)
|
|
1309
|
+
|
|
1310
|
+
verdict = 'BLOCK' if failed > 0 else 'PASS'
|
|
1311
|
+
print('%s %d %d' % (verdict, passed, failed))
|
|
1312
|
+
" 2>/dev/null || echo "NONE 0 0")
|
|
1313
|
+
|
|
1314
|
+
local verdict pass_count fail_count
|
|
1315
|
+
read -r verdict pass_count fail_count <<< "$gate_result"
|
|
1316
|
+
[ -z "$verdict" ] && verdict="NONE"
|
|
1317
|
+
[ -z "$pass_count" ] && pass_count=0
|
|
1318
|
+
[ -z "$fail_count" ] && fail_count=0
|
|
1319
|
+
|
|
1320
|
+
# NONE: no held-out items reserved -> gate inert, no trust-event, no block.
|
|
1321
|
+
# LOW-5: still clear any stale block report so a prior BLOCK does not linger
|
|
1322
|
+
# after the reservation is emptied (matches the PASS branch cleanup).
|
|
1323
|
+
if [ "$verdict" = "NONE" ]; then
|
|
1324
|
+
if [ -n "${COUNCIL_STATE_DIR:-}" ] && [ -f "$COUNCIL_STATE_DIR/heldout-block.json" ]; then
|
|
1325
|
+
rm -f "$COUNCIL_STATE_DIR/heldout-block.json"
|
|
1326
|
+
fi
|
|
1327
|
+
return 0
|
|
1328
|
+
fi
|
|
1329
|
+
|
|
1330
|
+
# STALE: reservation orphaned by a checklist regeneration (ids reserved but
|
|
1331
|
+
# zero matched current items). Emit a STALE trust event so the round is not
|
|
1332
|
+
# silently counted as a pass, warn, clear any stale block file (LOW-5), and
|
|
1333
|
+
# return 0 (pass-through): blocking here would loop forever, and the
|
|
1334
|
+
# selection-side repair re-selects valid ids on the next iteration.
|
|
1335
|
+
if [ "$verdict" = "STALE" ]; then
|
|
1336
|
+
log_warn "[Council] Held-out reservation is stale (checklist regenerated; reserved ids match no current item). Selection will re-select next iteration; not treating this as an evaluated PASS."
|
|
1337
|
+
if type record_trust_event_bash &>/dev/null; then
|
|
1338
|
+
record_trust_event_bash "heldout_eval" \
|
|
1339
|
+
"verdict=STALE" \
|
|
1340
|
+
"pass=0" \
|
|
1341
|
+
"fail=0" \
|
|
1342
|
+
>/dev/null 2>&1 || true
|
|
1343
|
+
fi
|
|
1344
|
+
if [ -n "${COUNCIL_STATE_DIR:-}" ] && [ -f "$COUNCIL_STATE_DIR/heldout-block.json" ]; then
|
|
1345
|
+
rm -f "$COUNCIL_STATE_DIR/heldout-block.json"
|
|
1346
|
+
fi
|
|
1347
|
+
return 0
|
|
1348
|
+
fi
|
|
1349
|
+
|
|
1350
|
+
# Trust-metrics: durable per-evaluation record (pass/fail counts). Emitted
|
|
1351
|
+
# only when held-out items actually exist (verdict PASS or BLOCK).
|
|
1352
|
+
if type record_trust_event_bash &>/dev/null; then
|
|
1353
|
+
record_trust_event_bash "heldout_eval" \
|
|
1354
|
+
"verdict=$verdict" \
|
|
1355
|
+
"pass=$pass_count" \
|
|
1356
|
+
"fail=$fail_count" \
|
|
1357
|
+
>/dev/null 2>&1 || true
|
|
1358
|
+
fi
|
|
1359
|
+
|
|
1360
|
+
if [ "$verdict" = "BLOCK" ]; then
|
|
1361
|
+
# Read failing held-out titles directly from the data (colon/pipe-safe).
|
|
1362
|
+
local titles_json titles_display
|
|
1363
|
+
titles_json=$(_RESULTS_FILE="$results_file" _HELDOUT_FILE="$heldout_file" _WAIVERS_FILE="$waivers_file" python3 -c "
|
|
1364
|
+
import json, os
|
|
1365
|
+
results = json.load(open(os.environ['_RESULTS_FILE']))
|
|
1366
|
+
heldout_ids = set(json.load(open(os.environ['_HELDOUT_FILE'])).get('held_out', []))
|
|
1367
|
+
waived_ids = set()
|
|
1368
|
+
wf = os.environ.get('_WAIVERS_FILE', '')
|
|
1369
|
+
if wf and os.path.exists(wf):
|
|
1370
|
+
try:
|
|
1371
|
+
waived_ids = {w['item_id'] for w in json.load(open(wf)).get('waivers', []) if w.get('active', True)}
|
|
1372
|
+
except Exception:
|
|
1373
|
+
pass
|
|
1374
|
+
titles = []
|
|
1375
|
+
for cat in results.get('categories', []):
|
|
1376
|
+
for item in cat.get('items', []):
|
|
1377
|
+
iid = item.get('id', '')
|
|
1378
|
+
if iid in heldout_ids and iid not in waived_ids and item.get('status') == 'failing':
|
|
1379
|
+
titles.append(item.get('title', iid))
|
|
1380
|
+
print(json.dumps(titles[:5]))
|
|
1381
|
+
" 2>/dev/null || echo '[]')
|
|
1382
|
+
titles_display=$(_T="$titles_json" python3 -c "
|
|
1383
|
+
import json, os
|
|
1384
|
+
try:
|
|
1385
|
+
print(', '.join(json.loads(os.environ['_T'])))
|
|
1386
|
+
except Exception:
|
|
1387
|
+
print('')
|
|
1388
|
+
" 2>/dev/null || echo "")
|
|
1389
|
+
log_warn "[Council] Held-out gate BLOCKED: ${fail_count} held-out acceptance check(s) failing: ${titles_display}"
|
|
1390
|
+
log_warn "[Council] Held-out checks are hidden from the build loop and verified only at completion. To opt out: set LOKI_HELDOUT_GATE=0"
|
|
1391
|
+
|
|
1392
|
+
mkdir -p "$COUNCIL_STATE_DIR" 2>/dev/null || true
|
|
1393
|
+
local ho_file="$COUNCIL_STATE_DIR/heldout-block.json"
|
|
1394
|
+
local ho_tmp="${ho_file}.tmp"
|
|
1395
|
+
local timestamp
|
|
1396
|
+
timestamp=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
|
|
1397
|
+
cat > "$ho_tmp" << HELDOUT_EOF
|
|
1398
|
+
{
|
|
1399
|
+
"status": "blocked",
|
|
1400
|
+
"blocked": true,
|
|
1401
|
+
"blocked_at": "$timestamp",
|
|
1402
|
+
"iteration": ${ITERATION_COUNT:-0},
|
|
1403
|
+
"reason": "held_out_checks_failing",
|
|
1404
|
+
"passed": $pass_count,
|
|
1405
|
+
"failed": $fail_count,
|
|
1406
|
+
"failures": $titles_json
|
|
1407
|
+
}
|
|
1408
|
+
HELDOUT_EOF
|
|
1409
|
+
mv "$ho_tmp" "$ho_file"
|
|
1410
|
+
return 1
|
|
1411
|
+
fi
|
|
1412
|
+
|
|
1413
|
+
# Gate passes: remove any stale block report.
|
|
1414
|
+
if [ -f "$COUNCIL_STATE_DIR/heldout-block.json" ]; then
|
|
1415
|
+
rm -f "$COUNCIL_STATE_DIR/heldout-block.json"
|
|
1416
|
+
fi
|
|
1417
|
+
return 0
|
|
1418
|
+
}
|
|
1419
|
+
|
|
1178
1420
|
#===============================================================================
|
|
1179
1421
|
# Council Evidence Hard Gate (v7.19.1) - "verified completion"
|
|
1180
1422
|
#===============================================================================
|
|
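The four verdicts produced inside `council_heldout_gate` in the hunk above (NONE, STALE, PASS, BLOCK) can be summarized as one pure function. The sketch below follows the embedded Python in the diff step by step; the function name `heldout_verdict` and the tuple return shape are hypothetical conveniences, while the counting rules are those shown in the hunk.

```python
def heldout_verdict(results, heldout_ids, waived_ids=frozenset()):
    """Return (verdict, passed, failed) for the held-out items.

    NONE  - nothing reserved, gate inert
    STALE - ids reserved but none match a current checklist item
    PASS  - at least one match and no unwaived held-out item is failing
    BLOCK - at least one unwaived held-out item is failing
    """
    if not heldout_ids:
        return ("NONE", 0, 0)

    matched = passed = failed = 0
    for cat in results.get("categories", []):
        for item in cat.get("items", []):
            iid = item.get("id", "")
            if iid not in heldout_ids:
                continue
            matched += 1  # counted before the waiver check, as in the hunk
            if iid in waived_ids:
                continue
            status = item.get("status")
            if status == "verified":
                passed += 1
            elif status == "failing":
                failed += 1
            # pending / inconclusive: neither pass nor fail

    if matched == 0:
        return ("STALE", 0, 0)
    return ("BLOCK" if failed > 0 else "PASS", passed, failed)
```

Note the distinction the diff's comments draw: an all-pending matched set yields `("PASS", 0, 0)`, not STALE, because `matched` is tracked separately from the pass and fail counts.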
@@ -1212,13 +1454,20 @@ council_evidence_gate() {
|
|
|
1212
1454
|
# read, so none is tracked (avoids SC2034 dead-assignment).
|
|
1213
1455
|
local diff_fails="false"
|
|
1214
1456
|
local diff_files=0
|
|
1457
|
+
# v7.28.0: track WHY the diff baseline could not be established, so the
|
|
1458
|
+
# inconclusive case is surfaced honestly instead of passing through silently.
|
|
1459
|
+
# diff_inconclusive stays "false" on the conclusive branch below.
|
|
1460
|
+
local diff_inconclusive="false"
|
|
1461
|
+
local diff_inconclusive_reason=""
|
|
1215
1462
|
if ! git rev-parse --is-inside-work-tree >/dev/null 2>&1; then
|
|
1216
1463
|
# No git repo => cannot prove fabrication => inconclusive => pass-through.
|
|
1217
|
-
|
|
1464
|
+
diff_inconclusive="true"
|
|
1465
|
+
diff_inconclusive_reason="no_git_repo"
|
|
1218
1466
|
elif [ -z "$base_sha" ]; then
|
|
1219
1467
|
# No baseline captured (non-git/zero-commit run, or never set) =>
|
|
1220
1468
|
# inconclusive => pass-through. Never false-block a legit first run.
|
|
1221
|
-
|
|
1469
|
+
diff_inconclusive="true"
|
|
1470
|
+
diff_inconclusive_reason="no_run_start_sha"
|
|
1222
1471
|
else
|
|
1223
1472
|
# Count the UNION of three change sources (auto-commit is not guaranteed,
|
|
1224
1473
|
# so committed-only would false-block a dirty-but-real working tree):
|
|
@@ -1296,6 +1545,40 @@ else:
     # Missing test-results.json (the else of the -f check) likewise leaves
     # test_fails="false" => inconclusive => pass-through (no file = no gate).
 
+    # --- v7.28.0: inconclusive-baseline lifecycle -------------------------------
+    # When the gate cannot establish a diff baseline (no git repo, or no run-start
+    # SHA) it does NOT block (would break non-git projects), but completion is no
+    # longer independently verified. Record that fact durably so the completion
+    # summary can surface one honest line, and emit a trust-event. The record is
+    # about the DIFF baseline only, so it is written regardless of the test
+    # outcome. On any CONCLUSIVE baseline we remove a stale record.
+    local inconclusive_file="${TARGET_DIR:-.}/.loki/state/evidence-inconclusive.json"
+    if [ "$diff_inconclusive" = "true" ]; then
+        mkdir -p "${TARGET_DIR:-.}/.loki/state" 2>/dev/null || true
+        local inc_tmp="${inconclusive_file}.tmp"
+        local inc_ts
+        inc_ts=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
+        cat > "$inc_tmp" << INCONCLUSIVE_EOF
+{
+    "inconclusive": true,
+    "recorded_at": "$inc_ts",
+    "iteration": ${ITERATION_COUNT:-0},
+    "reason": "$diff_inconclusive_reason"
+}
+INCONCLUSIVE_EOF
+        mv "$inc_tmp" "$inconclusive_file" 2>/dev/null || rm -f "$inc_tmp" 2>/dev/null || true
+        if type record_trust_event_bash &>/dev/null; then
+            record_trust_event_bash "evidence_inconclusive" \
+                "reason=$diff_inconclusive_reason" \
+                >/dev/null 2>&1 || true
+        fi
+    else
+        # Conclusive baseline: clear any stale inconclusive record.
+        if [ -f "$inconclusive_file" ]; then
+            rm -f "$inconclusive_file"
+        fi
+    fi
+
     # --- Block decision: block iff DIFF FAILS or TEST FAILS ---
     if [ "$diff_fails" != "true" ] && [ "$test_fails" != "true" ]; then
         # Gate passes: remove any stale block report.
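The inconclusive-baseline lifecycle in the hunk above (write a record atomically when no diff baseline exists, delete it once a baseline is conclusive) can be sketched in Python. This is an illustrative analogue under stated assumptions, not the package's implementation: the function name `update_inconclusive_record` and its parameters are hypothetical, and only the file name, JSON keys, and timestamp format are taken from the diff.

```python
import json
import os
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def update_inconclusive_record(state_dir, diff_inconclusive, reason="", iteration=0):
    """Write or clear evidence-inconclusive.json (hypothetical helper).

    Returns the record path when a record was written, otherwise None.
    """
    state_dir = Path(state_dir)
    record = state_dir / "evidence-inconclusive.json"
    if not diff_inconclusive:
        # Conclusive baseline: drop any stale record from an earlier iteration.
        record.unlink(missing_ok=True)
        return None
    state_dir.mkdir(parents=True, exist_ok=True)
    payload = {
        "inconclusive": True,
        "recorded_at": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "iteration": iteration,
        "reason": reason,
    }
    # Write to a sibling temp file, then rename, so readers never see a
    # half-written record (same idea as the tmp + mv pair in the shell code).
    tmp = record.with_name(record.name + ".tmp")
    tmp.write_text(json.dumps(payload, indent=2))
    os.replace(tmp, record)
    return record


# Usage: an inconclusive iteration followed by a conclusive one.
with tempfile.TemporaryDirectory() as demo_dir:
    written = update_inconclusive_record(demo_dir, True, "no_run_start_sha", 1)
    print(written.name)
    update_inconclusive_record(demo_dir, False)
    print(written.exists())
```

One difference worth noting: `json.dumps` escapes quotes and backslashes in `reason`, whereas a heredoc interpolates the variable verbatim. That does not matter for a fixed literal such as `no_run_start_sha`, which is the only reason value shown in this part of the diff.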
@@ -1366,6 +1649,19 @@ print(json.dumps(items[:5]))
 }
 EVIDENCE_EOF
     mv "$ev_tmp" "$ev_file"
+
+    # Trust-metrics: durable per-block record. evidence-block.json is a single
+    # state file that is DELETED the moment the gate next passes, so it cannot
+    # be the cross-run corpus for the block rate. Append an event here, where a
+    # block is definitely happening. Additive, best-effort, stdout-silent.
+    if type record_trust_event_bash &>/dev/null; then
+        record_trust_event_bash "evidence_block" \
+            "reason=$reason" \
+            "diff_ok=$diff_ok" \
+            "tests_ok=$tests_ok" \
+            >/dev/null 2>&1 || true
+    fi
+
     return 1
 }
 
@@ -2000,6 +2296,14 @@ council_evaluate() {
         return 1  # CONTINUE - can't complete with critical failures
     fi
 
+    # v7.28.0: held-out spec eval gate - verify the hidden acceptance checks the
+    # build loop never saw. Runs after the visible-checklist gate, using the
+    # statuses council_reverify_checklist just recomputed over the full checklist.
+    if ! council_heldout_gate; then
+        log_info "[Council] Completion blocked by held-out spec eval gate"
+        return 1  # CONTINUE - cannot complete with failing held-out checks
+    fi
+
     # Phase 2.5 (v7.19.1): evidence hard gate - block completion unless there is
     # real evidence that files changed AND tests are green.
     if ! council_evidence_gate; then
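The ordering described in the hunk above (visible-checklist gate, then the held-out gate, then the evidence gate, with the first failure returning "continue") amounts to a short-circuiting gate chain. A minimal Python sketch of that control flow follows; `evaluate_completion` and the gate labels are hypothetical names for illustration, and the real gates are the shell functions shown in the diff.

```python
def evaluate_completion(gates):
    """Run gates in order and stop at the first failure.

    `gates` is a list of (name, callable) pairs; each callable returns True
    when the gate passes. Returns (complete, blocking_gate_name).
    """
    for name, gate in gates:
        if not gate():
            # Later gates are not evaluated once one blocks completion.
            return False, name
    return True, None


# Usage with stand-in gates; the lambdas are placeholders, not real checks.
result = evaluate_completion([
    ("checklist", lambda: True),
    ("heldout", lambda: False),
    ("evidence", lambda: True),
])
print(result)  # (False, 'heldout')
```

Placing the held-out gate before the evidence gate means a run that fails hidden acceptance checks keeps iterating without an evidence-block report being written for that pass.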
@@ -21,6 +21,7 @@ Usage:
 import argparse
 import json
 import os
+import re
 import sys
 from datetime import datetime, timezone
 from pathlib import Path
@@ -61,13 +62,31 @@ def get_pricing(provider):
     return PRICING_BY_PROVIDER.get(provider, PRICING_BY_PROVIDER["claude"])
 
 
-def
-    """
-
-
+def derive_naive_project_slug():
+    """Legacy slug rule: replace only '/' with '-'.
+
+    Kept for backward compatibility: stale sessions created before the
+    sanitization fix live under this naming. find_session_file falls back to
+    it when the correctly-sanitized slug dir does not exist.
+    """
+    cwd = os.path.realpath(os.getcwd())
     return "-" + cwd.lstrip("/").replace("/", "-")
 
 
+def derive_project_slug():
+    """Derive Claude's project slug from cwd.
+
+    Claude Code sanitizes EVERY non-alphanumeric character in the realpath to
+    '-' (rule: re.sub(r'[^a-zA-Z0-9]', '-', path)). The earlier implementation
+    replaced only '/', so any path with underscores, dots, or other special
+    characters produced a slug that did not match Claude's real session dir,
+    silently zeroing token/cost capture. realpath resolves symlinks (e.g.
+    /tmp -> /private/tmp) to match Claude's own keying.
+    """
+    cwd = os.path.realpath(os.getcwd())
+    return "-" + re.sub(r"[^a-zA-Z0-9]", "-", cwd.lstrip("/"))
+
+
 def find_session_file(provider, session_file_arg=None):
     """Find the most recently modified session file for the given provider.
 
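To make the difference between the two slug rules concrete, here is a small sketch that applies both to an explicit path instead of the current working directory. The function names `naive_slug` and `sanitized_slug` and the example path are hypothetical; the two expressions are the ones shown in the diff.

```python
import re


def naive_slug(path):
    """Legacy rule from the diff: only '/' becomes '-'."""
    return "-" + path.lstrip("/").replace("/", "-")


def sanitized_slug(path):
    """Rule from the diff: every non-alphanumeric character becomes '-'."""
    return "-" + re.sub(r"[^a-zA-Z0-9]", "-", path.lstrip("/"))


# Hypothetical project path containing an underscore and a dot.
path = "/Users/dev/my_app.v2"
print(naive_slug(path))      # -Users-dev-my_app.v2
print(sanitized_slug(path))  # -Users-dev-my-app-v2
```

For a path made only of letters, digits, and slashes the two rules agree, which is why the mismatch only appears for directories with underscores, dots, or similar characters.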
@@ -84,10 +103,16 @@ def find_session_file(provider, session_file_arg=None):
         return path if path.exists() else None
 
     if provider == "claude":
-
-        session_dir =
+        projects_root = Path.home() / ".claude" / "projects"
+        session_dir = projects_root / derive_project_slug()
+        # Backward compatibility: if the correctly-sanitized slug dir does not
+        # exist but a stale session under the old naive slug does, use it.
         if not session_dir.is_dir():
-
+            naive_dir = projects_root / derive_naive_project_slug()
+            if naive_dir.is_dir():
+                session_dir = naive_dir
+            else:
+                return None
         jsonl_files = sorted(session_dir.glob("*.jsonl"), key=lambda p: p.stat().st_mtime, reverse=True)
         return jsonl_files[0] if jsonl_files else None
 
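The lookup order in the hunk above (sanitized slug directory first, legacy directory as fallback, then the newest `*.jsonl` file) can be exercised in isolation. This sketch takes the projects root and both slugs as arguments so it can run against a temporary directory; `pick_session_dir` and `newest_jsonl` are hypothetical helper names, not functions from the package.

```python
from pathlib import Path


def pick_session_dir(projects_root, sanitized, naive):
    """Return the sanitized-slug dir if it exists, else the legacy one, else None."""
    root = Path(projects_root)
    for slug in (sanitized, naive):
        candidate = root / slug
        if candidate.is_dir():
            return candidate
    return None


def newest_jsonl(session_dir):
    """Return the most recently modified *.jsonl file in session_dir, or None."""
    files = sorted(
        Path(session_dir).glob("*.jsonl"),
        key=lambda p: p.stat().st_mtime,
        reverse=True,
    )
    return files[0] if files else None
```

Because the sanitized directory always wins when both exist, a stale legacy directory only supplies token and cost data until a session is created under the corrected slug.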