gm-copilot-cli 2.0.631 → 2.0.633

@@ -3,69 +3,46 @@ name: gm-execute
3
3
  description: EXECUTE phase AND the foundational execution contract for every skill. Every exec:<lang> run, every witnessed check, every code search, in every phase, follows this skill's discipline. Resolve all mutables via witnessed execution. Any new unknown triggers immediate snake back to planning — restart chain from PLAN.
4
4
  ---
5
5
 
6
- # GM EXECUTE — Resolving Every Unknown
6
+ # GM EXECUTE — Resolve Every Unknown
7
7
 
8
- You are in the **EXECUTE** phase. Every mutable on `.gm/prd.yml` carries UNKNOWN status until witnessed execution resolves it. Job here = the witnessing.
8
+ GRAPH: `PLAN → [EXECUTE] → EMIT → VERIFY → COMPLETE`
9
+ Entry: .prd with named unknowns. From `planning` or re-entered from EMIT/VERIFY.
9
10
 
10
- This skill also carries the **execution contract** applying in every phase, not only this one. Planning runs codebase scans; EMIT runs pre-emit diagnostics; VERIFY runs integration tests and CI watches — all executions, all subject to discipline below. Other skills reference this skill because protocols stay live in context only while this text is nearby. About to run anything → this skill freshly loaded OR operating outside contract.
11
-
12
- New unknown surfaced by a run → stop, state-regress to `planning`, restart chain.
13
-
14
- **GRAPH POSITION**: `PLAN → [EXECUTE] → EMIT → VERIFY → COMPLETE`
15
- - **Entry**: .prd exists with all unknowns named. Entered from `planning` or via snake from EMIT/VERIFY.
11
+ This skill = execution contract for ALL phases. Other phases reference it because protocols must be fresh. About to run anything → load this skill first.
16
12
 
17
13
  ## TRANSITIONS
18
14
 
19
- **EXIT → invoke `gm-emit` skill immediately when**: All mutables are KNOWN (zero UNKNOWN remaining). Do not wait, do not summarize. Invoke the skill.
20
-
21
- **SELF-LOOP (remain in EXECUTE state)**: Mutable still UNKNOWN after one pass → re-run with different angle (max 2 passes, then regress to PLAN)
22
-
23
- **STATE REGRESSIONS**:
24
- - New unknown discovered → invoke `planning` skill immediately, reset to PLAN state
25
- - EXECUTE mutable unresolvable after 2 passes → invoke `planning` skill, reset to PLAN state
26
- - Re-entered from EMIT state (logic error) → re-resolve the mutable, then re-invoke `gm-emit` skill
27
- - Re-entered from VERIFY state (runtime failure) → re-resolve with real system state, then re-invoke `gm-emit` skill
15
+ **EXIT → EMIT**: all mutables KNOWN → invoke `gm-emit` immediately.
16
+ **SELF-LOOP**: still UNKNOWN → re-run different angle (max 2 passes, then regress to PLAN).
17
+ **REGRESS → PLAN**: new unknown discovered | mutable unresolvable after 2 passes.
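The three transitions above can be read as one small decision function. The following is a minimal sketch, assuming each mutable is an object with a `status` and a `passes` counter; `nextTransition`, `MAX_PASSES`, and the field names are hypothetical and are not defined by the skill.

```javascript
// Hypothetical sketch of the EXECUTE transitions; names and shapes are assumptions.
const MAX_PASSES = 2; // mirrors "max 2 passes, then regress to PLAN"

function nextTransition(mutables, newUnknownFound = false) {
  // A newly discovered unknown always sends the chain back to PLAN.
  if (newUnknownFound) return 'REGRESS_PLAN';

  const open = mutables.filter((m) => m.status !== 'KNOWN');
  // Zero UNKNOWN remaining: hand off to gm-emit.
  if (open.length === 0) return 'EXIT_EMIT';

  // A mutable still open after the pass budget is treated as unresolvable.
  if (open.some((m) => m.passes >= MAX_PASSES)) return 'REGRESS_PLAN';

  // Otherwise stay in EXECUTE and re-run from a different angle.
  return 'SELF_LOOP';
}

console.log(nextTransition([{ name: 'apiShape', status: 'KNOWN', passes: 1 }]));
console.log(nextTransition([{ name: 'apiShape', status: 'UNKNOWN', passes: 1 }]));
console.log(nextTransition([{ name: 'apiShape', status: 'UNKNOWN', passes: 2 }]));
```

Checking the new-unknown condition first reflects the rule that any new unknown regresses to PLAN regardless of how many mutables are already resolved.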
28
18
 
29
19
  ## MUTABLE DISCIPLINE
30
20
 
31
- Each mutable: name | expected | current | resolution method. Execute → witness → assign → compare. Zero variance = resolved. Unresolved after 2 passes = new unknown = snake to `planning`. Never narrate past an unresolved mutable.
32
-
33
- ## WEAK-PRIOR BRIDGE — PRIORS DO NOT AUTHORIZE
34
-
35
- EXECUTE receives route candidates from PLAN. Per the weak-prior rule in `governance`: **those candidates arrive as weak priors only — structural value preserved, authorization NOT transferred**. Route plausibility ≠ authorization. A plausible route earns the right to be TESTED, not the right to be BELIEVED.
36
-
37
- - Prior from PLAN: `authorization=weak_prior`. Permitted use: pick the next witnessed probe.
38
- - After witnessed probe succeeds: `authorization=witnessed`. Permitted use: feed into EMIT.
39
- - Collapsing `weak_prior` to `witnessed` without a witnessed probe = route-into-authorization leak (collapse #1 in `governance`). Snake to PLAN.
40
-
41
- Rhetorical inflation also strips here: "the plan says" / "we agreed that" / "obviously X" are prior-statements, not witnessed-facts. Restate as weak prior, run the probe, witness, only then authorize.
21
+ Each mutable: name | expected | current | resolution method. Zero variance = resolved. Unresolved after 2 passes = snake to `planning`. Never narrate past an unresolved mutable.
42
22
 
43
- ## QUALITY METRICS APPLY BEFORE MARKING KNOWN
23
+ Mutables resolve to KNOWN only when ALL four pass:
24
+ - **ΔS=0** — witnessed output equals expected
25
+ - **λ≥2** — two independent paths agree
26
+ - **ε intact** — adjacent invariants hold (types, test.js, neighboring callers)
27
+ - **Coverage≥0.70** — enough corpus inspected for retrieval mutables
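As an illustration of the four checks, here is a minimal sketch of a gate that returns true only when all of them pass. The function name `isKnown` and the fields `deltaS`, `independentPaths`, `invariantsHold`, `retrieval`, and `coverage` are assumptions made for this example. The skill describes these as checks on a mutable, not as a concrete data shape.

```javascript
// Hypothetical gate; the field names are assumptions chosen to mirror the four checks.
function isKnown(check) {
  const deltaSZero = check.deltaS === 0;           // ΔS=0: witnessed output equals expected
  const lambdaOk = check.independentPaths >= 2;    // λ≥2: two independent paths agree
  const epsilonOk = check.invariantsHold === true; // ε intact: adjacent invariants hold
  // Assumption: coverage is only evaluated for retrieval mutables.
  const coverageOk = check.retrieval ? check.coverage >= 0.70 : true;
  return deltaSZero && lambdaOk && epsilonOk && coverageOk;
}

// A single confirming path (λ=1) is not enough, even with ΔS=0.
console.log(isKnown({ deltaS: 0, independentPaths: 1, invariantsHold: true, retrieval: false }));
console.log(isKnown({ deltaS: 0, independentPaths: 2, invariantsHold: true, retrieval: true, coverage: 0.8 }));
```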
44
28
 
45
- Every mutable passes all four before status flips UNKNOWN → KNOWN (see `governance` for full definitions):
29
+ ## PRIORS DON'T AUTHORIZE
46
30
 
47
- - **ΔS = 0** — witnessed output equals expected
48
- - **λ ≥ 2** — two independent paths agree (different search, different caller, different import), not just one confirmation
49
- - **ε intact** — adjacent invariants still hold (neighboring callers, types, test.js, nearby modules unbroken)
50
- - **Coverage ≥ 0.70** — for retrieval/search mutables, enough of the corpus was inspected to rule out contradicting evidence
51
-
52
- Single-witness resolution (`λ=1`) = still unknown. One passing run on happy path without probing error paths = `ε` unverified. Skipping these checks and marking KNOWN anyway is an authorization-without-witness violation.
31
+ Route candidates from PLAN arrive as `weak_prior` only. Plausibility = right to TEST, not right to BELIEVE.
32
+ `weak_prior` → witnessed probe → `witnessed` → feed to EMIT.
33
+ "The plan says" / "we agreed" / "obviously X" = prior-statements, not witnessed facts.
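The promotion rule can be sketched as a function that raises `authorization` only when a probe was actually witnessed. This is an illustrative sketch: `promote`, the `witnessed` flag on the probe result, and the `evidence` field are hypothetical names.

```javascript
// Hypothetical sketch: authorization rises only through a witnessed probe.
function promote(candidate, probeResult) {
  // No probe, or a probe that was not witnessed: the candidate stays a weak prior.
  if (!probeResult || probeResult.witnessed !== true) {
    return { ...candidate, authorization: 'weak_prior' };
  }
  // Witnessed probe: record the output as evidence and allow the feed to EMIT.
  return { ...candidate, authorization: 'witnessed', evidence: probeResult.output };
}

const fromPlan = { route: 'config loader reads env first', authorization: 'weak_prior' };
console.log(promote(fromPlan, null).authorization);
console.log(promote(fromPlan, { witnessed: true, output: 'env read before file' }).authorization);
```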
53
34
 
54
35
  ## CODE EXECUTION
55
36
 
56
- **exec:<lang> is the only way to run code.** Bash tool body: `exec:<lang>\n<code>`
37
+ `exec:<lang>` only via Bash tool body: `exec:<lang>\n<code>`
57
38
 
58
- `exec:nodejs` (default) | `exec:bash` | `exec:python` | `exec:typescript` | `exec:go` | `exec:rust` | `exec:c` | `exec:cpp` | `exec:java` | `exec:deno` | `exec:cmd`
39
+ Langs: `exec:nodejs` (default) | `exec:bash` | `exec:python` | `exec:typescript` | `exec:go` | `exec:rust` | `exec:c` | `exec:cpp` | `exec:java` | `exec:deno` | `exec:cmd`
59
40
 
60
- Lang auto-detected if omitted. `cwd` sets directory. File I/O via exec:nodejs + require('fs'). Only git in bash directly. `Bash(node/npm/npx/bun)` = violations.
41
+ File I/O: exec:nodejs + require('fs'). Git directly in Bash. Never Bash(node/npm/npx/bun).
61
42
 
62
- **Execution efficiency: pack every run**
63
- - Combine multiple independent operations into one exec call using `Promise.allSettled` or parallel subprocess spawning
64
- - Each independent idea gets its own try/catch with independent error reporting — never let one failure block another
65
- - Target under 12s per exec call; split work across multiple calls only when dependencies require it
66
- - Prefer a single well-structured exec that does 5 things over 5 sequential execs
43
+ Pack runs: Promise.allSettled for parallel, each idea own try/catch, under 12s per call.
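A minimal sketch of a packed run, assuming it is the body of an `exec:nodejs` call. The helper `runPacked` and the three example tasks are hypothetical; the `willFail` task is a placeholder that exists only to show that one failure does not block the others.

```javascript
// Sketch: pack independent checks into one run; each reports its own outcome.
const checks = {
  nodeVersion: async () => process.version,
  platform: async () => process.platform,
  willFail: async () => { throw new Error('placeholder failure'); }, // illustration only
};

async function runPacked(tasks) {
  const names = Object.keys(tasks);
  // allSettled never rejects, so one failing task cannot block the others.
  const settled = await Promise.allSettled(
    names.map((name) => Promise.resolve().then(() => tasks[name]())),
  );
  return Object.fromEntries(
    settled.map((result, i) => [
      names[i],
      result.status === 'fulfilled'
        ? { ok: true, value: result.value }
        : { ok: false, error: result.reason.message },
    ]),
  );
}

runPacked(checks).then((report) => console.log(report));
```

Wrapping each task in `Promise.resolve().then(...)` means a task that throws synchronously is still reported as a rejected entry instead of aborting the whole run.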
67
44
 
68
- **Background tasks** (auto-backgrounded when execution exceeds 15s):
45
+ Background (when exec exceeds 15s — auto-backgrounds):
69
46
  ```
70
47
  exec:sleep
71
48
  <task_id> [seconds]
@@ -79,54 +56,24 @@ exec:close
79
56
  <task_id>
80
57
  ```
81
58
 
82
- **Runner**:
83
- ```
84
- exec:runner
85
- start|stop|status
86
- ```
87
-
88
- ## GIT PUSH = AUTOMATIC CI WATCH
59
+ Runner: `exec:runner\nstart|stop|status`
89
60
 
90
- The Stop hook automatically watches GitHub Actions runs whose `headSha` matches the just-pushed HEAD. Every push is watched without manual `gh run list` / `gh run watch` calls.
61
+ ## CODEBASE SEARCH
91
62
 
92
- - All-green Stop appends a CI summary to its approve reason; you see it in the next turn's context.
93
- - Any failure → Stop blocks with the failed run names + IDs; treat that as a KNOWN mutable, regress to the right phase, push again — the hook re-watches.
94
- - Default deadline 180s (override `GM_CI_WATCH_SECS`); if it elapses with runs still in flight, Stop approves with "still in progress" so slow Pages-deploy / npm-publish jobs do not stall completion.
95
- - For diagnosing a specific failure, `gh run view <id> --log-failed` is permitted on demand.
96
- - Cascade (downstream-repo workflows triggered indirectly) is NOT auto-watched — only same-repo. Manual cascade check stays for those rare cases.
63
+ `exec:codesearch` only. Grep/Glob/Find/Explore/WebSearch/grep/rg/find inside exec:bash = ALL hook-blocked.
97
64
 
98
- **Zero silent pushes** is now an automatic invariant of the hook layer; do not duplicate it as manual instructions.
99
-
100
- ## CODEBASE EXPLORATION — exec:codesearch ONLY
101
-
102
- **Grep, Glob, Find, Explore, WebSearch, and `grep`/`rg`/`find` inside `exec:bash` are ALL hook-blocked.** Attempting them returns a redirect error. The hook is not a suggestion — it is enforced. `Read` is available for known absolute paths.
103
-
104
- Default reflex for "I need to find X in the codebase" = `exec:codesearch`. No exceptions. Not even for exact strings, not even for regex, not even for "just one quick check". If you find yourself reaching for Grep or Glob, that reflex is wrong — replace with codesearch.
65
+ Known absolute path → `Read`. Known dir → exec:nodejs + fs.readdirSync. No third option.
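For the known-directory case, a listing might look like the sketch below, assuming it runs as the body of an `exec:nodejs` call. The current working directory stands in for the known path, and the `kind` and `fullPath` fields are illustrative choices.

```javascript
const fs = require('node:fs');
const path = require('node:path');

// Assumption: the working directory stands in for a directory whose path is already known.
const knownDir = process.cwd();

// withFileTypes returns fs.Dirent objects, so no extra stat call is needed.
// Anything that is not a directory is reported as 'file' in this sketch.
const entries = fs.readdirSync(knownDir, { withFileTypes: true }).map((entry) => ({
  name: entry.name,
  kind: entry.isDirectory() ? 'dir' : 'file',
  fullPath: path.join(knownDir, entry.name),
}));

console.log(entries);
```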
105
66
 
106
67
  ```
107
68
  exec:codesearch
108
- <two-word query to start>
69
+ <two-word query>
109
70
  ```
110
71
 
111
- `exec:codesearch` handles exact strings, symbols, regex-ish patterns, file-name fragments, and PDF pages (indexed page-by-page with `file:page` citations). Two words in, iterate by changing or adding one word per pass, minimum four attempts before concluding absent. Full protocol in `code-search` skill.
112
-
113
- **Direct-read exceptions** (no search needed):
114
- - Known absolute path → `Read` tool.
115
- - Directory listing at known path → `exec:nodejs` + `fs.readdirSync`.
116
- - File content inspection without search → `Read`.
72
+ Iterate: change one word or add one word per pass. Minimum 4 attempts before concluding absent.
117
73
 
118
- **Never**:
119
- - `Grep`, `Glob`, `Find`, `Explore` tools (all hook-blocked)
120
- - `grep`, `rg`, `ripgrep`, `find`, `ag`, `ack` inside `exec:bash` (banned-tool hook intercepts)
121
- - Reaching for exact-match tools "because codesearch seems fuzzy" — codesearch handles exact matches fine
74
+ ## IMPORT-BASED EXECUTION
122
75
 
123
- When a mutable depends on external specification (protocol field, register layout, compliance text), search the PDF corpus first. Unwitnessed assumption from a doc you did not search = UNKNOWN.
124
-
125
- **Platform note — exec:bash on Windows:** runs real bash (git-bash) when installed, falls back to PowerShell otherwise. If you see a POSIX-syntax parse error (`[ -n ...]`, `&&`, `if/then/fi`), bash wasn't found — either install git-bash or rewrite in `exec:nodejs`.
126
-
127
- ## DIAGNOSTIC PROTOCOL — IMPORT-BASED EXECUTION
128
-
129
- Always import actual codebase modules. Never rewrite logic inline. Reimplemented output is unwitnessed and inadmissible as ground truth.
76
+ Always import actual modules. Never rewrite logic inline; reimplemented output = UNKNOWN.
130
77
 
131
78
  ```
132
79
  exec:nodejs
@@ -134,79 +81,44 @@ const { fn } = await import('/abs/path/to/module.js');
134
81
  console.log(await fn(realInput));
135
82
  ```
136
83
 
137
- Witnessed import output = resolved mutable. Reimplemented output = UNKNOWN.
138
-
139
- **Differential diagnosis**: when behavior diverges from expectation, run the smallest possible isolation test first. Compare actual vs expected. Name the delta. The delta is the mutable — resolve it before touching any file.
84
+ Differential diagnosis: isolate smallest reproduction, compare actual vs expected, name the delta. Delta = the mutable.
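Naming the delta can be as simple as a key-by-key comparison of expected and actual values. The sketch below assumes plain JSON-like data and compares only top-level keys. `nameDelta` and the sample values are hypothetical; in practice `actual` would come from a witnessed run of the real module.

```javascript
// Sketch: name the delta between expected and actual, one top-level key at a time.
function nameDelta(expected, actual) {
  const keys = new Set([...Object.keys(expected), ...Object.keys(actual)]);
  const delta = [];
  for (const key of keys) {
    // JSON.stringify gives a simple structural comparison for plain data.
    if (JSON.stringify(expected[key]) !== JSON.stringify(actual[key])) {
      delta.push({ key, expected: expected[key], actual: actual[key] });
    }
  }
  return delta;
}

// Illustrative values only.
const delta = nameDelta({ status: 200, retries: 0 }, { status: 200, retries: 3 });
console.log(delta);
```

An empty array corresponds to zero variance. Each entry in a non-empty array is a candidate mutable to resolve before touching any file.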
140
85
 
141
- ## EXECUTION DENSITY
86
+ ## CI — AUTOMATED
142
87
 
143
- Pack every related hypothesis into one run. Each run ≤15s. Witnessed output = ground truth. Narrated assumption = inadmissible.
88
+ git push → Stop hook auto-watches GitHub Actions for pushed HEAD. No manual `gh run watch`.
89
+ - All-green → Stop approves with CI summary
90
+ - Failure → Stop blocks with run names+IDs → `gh run view <id> --log-failed` for diagnosis
91
+ - Deadline 180s (override `GM_CI_WATCH_SECS`)
92
+ - Downstream-repo cascades NOT auto-watched — same-repo only
144
93
 
145
- Parallel waves: ≤3 `gm:gm` subagents via Agent tool (`Agent(subagent_type="gm:gm", ...)`) — independent items simultaneously, never sequentially.
94
+ ## GROUND TRUTH
146
95
 
147
- ## CHAIN DECOMPOSITION — FAULT ISOLATION
96
+ Real services, real data, real timing. Mocks/stubs/simulations = delete. Scattered test files (.test.js, .spec.js, __tests__/) = delete. All coverage in root test.js. Fallback/demo modes = remove, fail loud.
148
97
 
149
- Break every multi-step operation before running end-to-end. Treat each step as a diagnostic unit:
150
- 1. Number every distinct step
151
- 2. Per step: input shape, output shape, success condition, failure mode
152
- 3. Run each step in isolation — witness output — assign mutable — must be KNOWN before proceeding to next step
153
- 4. Debug adjacent step pairs for handoff correctness — the seam between steps is the most common failure site
154
- 5. Only when all pairs pass: run full chain end-to-end
98
+ **Scan before edit**: exec:codesearch for existing implementation before creating/modifying. Duplicate concern = regress to `planning`.
155
99
 
156
- Step failure revealing new unknown → regress to `planning` state immediately.
100
+ **Hypothesize via execution**: hypothesis → run → witness → edit. Never edit on unwitnessed assumption.
157
101
 
158
- ## BROWSER DIAGNOSTIC ESCALATION
102
+ **Code quality** (stop at first that resolves need): native → library → structure (map/pipeline) → write.
159
103
 
160
- Invoke `browser` skill. Exhaust each level before advancing to next:
161
- 1. `exec:browser\n<js>` — inspect DOM state, read globals, check network responses. Always first.
162
- 2. `browser` skill — for full session workflows requiring navigation
163
- 3. navigate/click/type — only when real events required and DOM inspection insufficient
164
- 4. screenshot — last resort, only after all JS-based diagnostics exhausted
104
+ ## PARALLEL SUBAGENTS
165
105
 
166
- ## GROUND TRUTH ENFORCEMENT
106
+ ≤3 `gm:gm` subagents for independent items simultaneously: `Agent(subagent_type="gm:gm", ...)`
167
107
 
168
- Real services, real data, real timing. Mocks/fakes/stubs/simulations = diagnostic noise = delete immediately. No scattered test files (.test.js, .spec.js, __tests__/) — delete on discovery. All test coverage belongs in the single root `test.js`. If `test.js` does not exist, create it. Every behavior change updates `test.js`. Every bug fix adds a regression case. No fallback/demo modes — errors must surface with full diagnostic context and fail loud.
108
+ Browser escalation: exec:browser → browser skill → navigate/click → screenshot (last resort).
169
109
 
170
- **SCAN BEFORE EDIT**: Before modifying or creating any file, search the codebase (exec:codesearch) for existing implementations of the same concern. "Duplicate" means overlapping responsibility, similar logic, or parallel implementations, not just identical files. If consolidation is possible, regress to `planning` with restructuring instructions instead of continuing.
110
+ ## MEMORIZE — HARD RULE
171
111
 
172
- **HYPOTHESIZE VIA EXECUTION, NEVER VIA ASSUMPTION**: Formulate a falsifiable hypothesis. Run it. Witness the output. The output either confirms or falsifies. Only a witnessed falsification justifies editing a file. Never edit based on unwitnessed assumptions — form hypothesis → run → witness → edit.
112
+ Unknown→known = memorize same turn it resolves.
173
113
 
174
- **CODE QUALITY PROCESS**: The goal is minimal code / maximal DX. When writing or reviewing any block of code, run this mental process: (1) What native language/platform feature already does this? Use it. (2) What library already solves this pattern? Use it. (3) Can this branch/loop be a data structure — a map, array, or pipeline — where the structure itself enforces correctness? Make it so. (4) Would a newcomer read this top-to-bottom and immediately understand what it does without running it? If no, restructure. One-liners that compress logic are the opposite of DX — clarity comes from structure, not brevity. Dispatch tables, pipeline chains, and native APIs eliminate entire categories of bugs by making wrong states unrepresentable.
114
+ Triggers: exec: output answers prior unknown | CI log reveals root cause | code read confirms/refutes | env quirk observed | user states preference/constraint.
175
115
 
176
- ## FRAGILE LEARNINGS — HARD RULE
177
-
178
- Every UNKNOWN→KNOWN transition during execution = fact that dies on compaction. The memorize spawn is **not** end-of-phase cleanup — it fires **the same turn the fact resolves**, before the next tool call if possible, end-of-turn at latest.
179
-
180
- **Trigger contract** (any = fire):
181
- - `exec:` output resolves a prior "let me check" / "does this API take X" / "what version is installed"
182
- - CI log or error output reveals a root cause
183
- - Code read confirms or refutes an assumption about existing structure
184
- - Environment / tooling quirk observed (blocked commands, platform-specific behavior, path resolution)
185
- - User states a preference, constraint, deadline, or judgment call
186
-
187
- **Invocation** (one per fact, background, parallel when multiple):
188
116
  ```
189
- Agent(subagent_type='gm:memorize', model='haiku', run_in_background=true, prompt='## CONTEXT TO MEMORIZE\n<fact with enough context for a cold-start agent>')
117
+ Agent(subagent_type='gm:memorize', model='haiku', run_in_background=true, prompt='## CONTEXT TO MEMORIZE\n<fact>')
190
118
  ```
191
119
 
192
- **Parallel spawn**: N facts resolved in one turn → N memorize calls in a **single message**, parallel tool blocks. Never serialize. Never merge multiple facts into one prompt.
193
-
194
- **End-of-turn self-check** (mandatory): before the response closes, scan the turn for resolved unknowns that were not memorized. Missed one → spawn now. No exceptions — a resolved unknown leaving the turn without handoff is a memory leak.
120
+ N facts → N parallel Agent calls in ONE message. End-of-turn self-check mandatory.
195
121
 
196
- Skip memorize = forget on purpose. Treat it as a bug.
197
-
198
- ## DO NOT STOP
199
-
200
- Never respond to the user from this phase. When all mutables are KNOWN, immediately invoke `gm-emit` skill. The chain continues until .prd is deleted and git is clean — that happens in `gm-complete`, not here.
201
-
202
- ## CONSTRAINTS
203
-
204
- **Never**: `Bash(node/npm/npx/bun)` | fake data | mock files | scattered test files (only root test.js) | fallback/demo modes | `Grep`/`Glob`/`Find`/`Explore` tools or `grep`/`rg`/`find` inside `exec:bash` (ALL hook-blocked — use `exec:codesearch` for every codebase lookup, `Read` for known absolute paths) | sequential independent items | absorb surprises silently | respond to user or pause for input | edit files before executing to understand current behavior | duplicate existing code | write explicit if/else chains when a dispatch table or native method suffices | write packed one-liners that obscure structure | reinvent what a library or native API already provides
205
-
206
- **Always**: witness every hypothesis | import real modules | scan codebase before creating/editing files | regress to planning on any new unknown | fix immediately on discovery | delete mocks/stubs/comments/scattered test files on discovery | consolidate test coverage into root test.js | add regression case to test.js for every bug fix | invoke next skill immediately when done | ask "what native feature solves this?" before writing any new logic | prefer structures where wrong states are unrepresentable
207
-
208
- ---
122
+ **Never**: Bash(node/npm/npx/bun) | fake data | mocks | scattered tests | fallbacks | Grep/Glob/Find/Explore | sequential independent items | respond to user mid-phase | edit before witnessing | duplicate code | if/else where dispatch table suffices | one-liners that obscure | reinvent what native/library provides
209
123
 
210
- **EXIT → EMIT**: All mutables KNOWN → invoke `gm-emit` skill immediately.
211
- **SELF-LOOP**: Still UNKNOWN → re-run (max 2 passes, then regress to PLAN).
212
- **REGRESS → PLAN**: Any new unknown → invoke `planning` skill, reset to PLAN state.
124
+ **Always**: witness every hypothesis | import real modules | scan before edit | regress on new unknown | delete mocks/comments/scattered tests on discovery | test.js for every behavior change | invoke next skill immediately when done
@@ -5,117 +5,92 @@ description: Governance reference invoked by PLAN/EXECUTE/EMIT/VERIFY. Separates
5
5
 
6
6
  # Governance — Route, Bridge, Legitimacy
7
7
 
8
- Central governance reference. Three roles separate three failure surfaces every phase must respect simultaneously:
8
+ Three roles, three failure surfaces:
9
+ 1. **Route discovery** — what family of fault? Owned by `planning`.
10
+ 2. **Weak-prior bridge** — plausibility ≠ authorization. Owned by `gm-execute`.
11
+ 3. **Legitimacy gate** — did this answer earn its strength? Owned by `gm-emit`/`gm-complete`.
9
12
 
10
- 1. **Route discovery** — route-first structural orientation. Where could this fail? What family of fault does it live in? Owned by `planning`.
11
- 2. **Weak-prior bridge** — advisory-only transfer. Route plausibility never converts into authorization. Owned by `gm-execute`.
12
- 3. **Legitimacy gate** — earned-emission governance. Did this answer earn its requested strength? Owned by `gm-emit` and `gm-complete`.
13
+ ## Five Refused Collapses
13
14
 
14
- Neither route-first nor legitimacy-first alone suffices. The weak-prior bridge exists precisely to stop route plausibility from masquerading as authorization.
15
-
16
- ## The Five Collapses Governance Refuses
17
-
18
- A conclusion ships only when none of these has occurred:
19
-
20
- 1. Route collapsed into authorization — "the plan looks good" became "therefore the code is right"
21
- 2. Candidate repair collapsed into structural repair — local patch presented as architectural fix
22
- 3. Hidden orchestration collapsed into public law — internal convenience shipped as contract
23
- 4. Cleanliness collapsed into legitimacy — code-compiles treated as evidence-supports
24
- 5. One strong route collapsed into universal closure — best available answer treated as only possible answer
15
+ 1. Route → authorization ("plan looks good" → "code is right")
16
+ 2. Candidate → structural repair (local patch presented as architectural fix)
17
+ 3. Hidden → public law (internal convenience shipped as contract)
18
+ 4. Cleanliness → legitimacy (compiles = evidence-supports)
19
+ 5. One strong route → universal closure (best answer treated as only answer)
25
20
 
26
21
  When in doubt: preserve ambiguity. Lawful downgrade beats forced closure.
27
22
 
28
- ## The 7 Route Families
23
+ ## 7 Route Families
29
24
 
30
- Every planned item belongs to at least one family. Naming the family disciplines the repair move.
31
-
32
- | Family | What breaks here | Example repair move |
25
+ | Family | What breaks | Repair |
33
26
  |---|---|---|
34
- | **grounding** | Retrieval, lookup, fact anchor | Re-ground against source of truth (PDF, spec, witnessed state) |
35
- | **reasoning** | Inference chain, logic, derivation | Shorten chain, re-derive from primitives |
36
- | **state** | Memory, persistence, session continuity | Make state addressable, kill implicit carry-over |
37
- | **execution** | Runtime, scheduling, process lifecycle | Isolate, witness, re-run deterministically |
38
- | **observability** | Inspection, tracing, debuggability | Add permanent structure — never ad-hoc log |
39
- | **boundary** | Interfaces, contracts, seam between subsystems | Re-assert contract, regenerate both sides from one source |
40
- | **representation** | Data shape, schema, type | Make illegal states unrepresentable structurally |
41
-
42
- Route family gets written into the `.prd` item. Repair attempted in the wrong family = wasted work.
43
-
44
- ## The 16 Failure Modes
27
+ | grounding | Retrieval, lookup, fact anchor | Re-ground against source of truth |
28
+ | reasoning | Inference chain, logic | Shorten chain, re-derive from primitives |
29
+ | state | Memory, session continuity | Make state addressable |
30
+ | execution | Runtime, scheduling, process | Isolate, witness, re-run |
31
+ | observability | Inspection, tracing | Add permanent structure |
32
+ | boundary | Interfaces, contracts, seams | Re-assert contract from one source |
33
+ | representation | Data shape, schema, type | Make illegal states unrepresentable |
45
34
 
46
- Routing taxonomy. Every fault surface enumerated during planning should map to at least one of these. Missing mapping = unexamined surface.
35
+ ## 16 Failure Modes
47
36
 
48
- | # | Name | Family | Shape |
49
- |---|---|---|---|
50
- | 1 | Hallucination & chunk drift | grounding | Retrieval returned wrong/irrelevant content |
51
- | 2 | Interpretation collapse | reasoning | Chunk right, logic wrong |
52
- | 3 | Long reasoning drift | reasoning | Error accumulates across multi-step chain |
53
- | 4 | Bluffing / overconfidence | reasoning | Confident, unfounded |
54
- | 5 | Semantic ≠ embedding | grounding | Cosine match ≠ actual meaning |
55
- | 6 | Logic collapse, needs reset | reasoning | Dead-end, must restart chain |
56
- | 7 | Memory breaks across sessions | state | Continuity lost |
57
- | 8 | Debugging black box | observability | No visibility into failure path |
58
- | 9 | Entropy collapse | state | Attention melts, incoherent output |
59
- | 10 | Creative freeze | representation | Flat literal output |
60
- | 11 | Symbolic collapse | reasoning | Abstract prompt breaks |
61
- | 12 | Philosophical recursion | reasoning | Self-reference loop |
62
- | 13 | Multi-agent chaos | state | Agents overwrite each other |
63
- | 14 | Bootstrap ordering | execution | Services fire before deps ready |
64
- | 15 | Deployment deadlock | execution | Circular wait in infra |
65
- | 16 | Pre-deploy collapse | execution | Version skew / missing secret on first call |
66
-
67
- ## The 4 State Planes
68
-
69
- Any in-flight item occupies four orthogonal state planes simultaneously. One plane advancing does not advance any other.
70
-
71
- | Plane | Owned by | States | Authorization implication |
37
+ | # | Name | Family |
38
+ |---|---|---|
39
+ | 1 | Hallucination & chunk drift | grounding |
40
+ | 2 | Interpretation collapse | reasoning |
41
+ | 3 | Long reasoning drift | reasoning |
42
+ | 4 | Bluffing / overconfidence | reasoning |
43
+ | 5 | Semantic ≠ embedding | grounding |
44
+ | 6 | Logic collapse, needs reset | reasoning |
45
+ | 7 | Memory breaks across sessions | state |
46
+ | 8 | Debugging black box | observability |
47
+ | 9 | Entropy collapse | state |
48
+ | 10 | Creative freeze | representation |
49
+ | 11 | Symbolic collapse | reasoning |
50
+ | 12 | Philosophical recursion | reasoning |
51
+ | 13 | Multi-agent chaos | state |
52
+ | 14 | Bootstrap ordering | execution |
53
+ | 15 | Deployment deadlock | execution |
54
+ | 16 | Pre-deploy collapse | execution |
55
+
56
+ ## 4 State Planes
57
+
58
+ | Plane | Owner | States | Implication |
72
59
  |---|---|---|---|
73
- | **route_fit** | planning | `unexamined` → `examined` → `dominant` | Examined ≠ dominant. Dominant ≠ authorized. |
74
- | **authorization** | gm-execute | `none` → `weak_prior` → `witnessed` | Only `witnessed` permits emission. `weak_prior` never. |
75
- | **repair_legality** | gm-emit | `unverified` → `local_candidate` → `structural` | Local candidate cannot ship as structural repair. |
76
- | **hidden_decision_posture** | gm-complete | `open` → `down_weighted` → `closed` | Closing before CI green = illegal. |
77
-
78
- `.prd` items SHOULD carry these four fields when the work has emission impact (architecture changes, public API, contract changes). Small edits may omit.
60
+ | route_fit | planning | unexamined → examined → dominant | Dominant ≠ authorized |
61
+ | authorization | gm-execute | none → weak_prior → witnessed | Only witnessed permits emission |
62
+ | repair_legality | gm-emit | unverified → local_candidate → structural | Local cannot ship as structural |
63
+ | hidden_decision_posture | gm-complete | open → down_weighted → closed | Close only after CI green |
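The four planes in the table can be modelled as independent ordered lists, where advancing one plane leaves the others untouched. This is a sketch under stated assumptions: `PLANES`, `advance`, and `mayEmit` are hypothetical names, and the item shape is invented for illustration. The state names are taken from the table.

```javascript
// State names come from the table; the structure and helpers are assumptions.
const PLANES = {
  route_fit: ['unexamined', 'examined', 'dominant'],
  authorization: ['none', 'weak_prior', 'witnessed'],
  repair_legality: ['unverified', 'local_candidate', 'structural'],
  hidden_decision_posture: ['open', 'down_weighted', 'closed'],
};

function advance(item, plane) {
  const states = PLANES[plane];
  if (!states) throw new Error(`unknown plane: ${plane}`);
  const index = states.indexOf(item[plane]);
  if (index === -1) throw new Error(`invalid state for ${plane}: ${item[plane]}`);
  // Move one step along this plane only; the last state is terminal.
  const next = states[Math.min(index + 1, states.length - 1)];
  return { ...item, [plane]: next };
}

// Only `witnessed` authorization permits emission, whatever route_fit says.
const mayEmit = (item) => item.authorization === 'witnessed';

const item = { route_fit: 'dominant', authorization: 'weak_prior' };
console.log(mayEmit(item));
console.log(mayEmit(advance(item, 'authorization')));
```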
79
64
 
80
- ## Quality Metrics (ΔS, λ, ε, Coverage)
65
+ ## Quality Metrics
81
66
 
82
- Quantitative checks applied to every mutable before it is marked KNOWN.
67
+ - **ΔS** — witnessed output equals expected. ΔS≠0 = still open.
68
+ - **λ≥2** — two independent paths agree. λ=1 = still unknown.
69
+ - **ε** — adjacent invariants hold (types, tests, neighboring callers).
70
+ - **Coverage≥0.70** — enough corpus inspected to rule out contradicting evidence.
83
71
 
84
- - **ΔS (drift)** — semantic delta between expectation and witnessed output. `ΔS ≠ 0` = mutable still open, regardless of narrative.
85
- - **λ (lambda)** — convergence checkpoint. Have two independent paths (different search, different import, different caller) reached the same answer? `λ unsatisfied` = single-witness, still an unknown.
86
- - **ε (epsilon)** — domain-level harmony. Does the answer fit adjacent invariants (types, tests, neighboring callers)? `ε violated` = local fix with side effect.
87
- - **Coverage ≥ 0.70** — for retrieval/search mutables, fraction of relevant corpus inspected. Below threshold = grounding not yet earned.
72
+ All four must pass before mutable flips UNKNOWN→KNOWN.
88
73
 
89
- Use as verbal checks, not machine-evaluated numbers. "ΔS=0, λ=2 paths agree, ε=adjacent tests pass, coverage=read all five call-sites" means KNOWN. "ΔS=0" alone does not.
74
+ ## Stress Suite (8 Cases)
90
75
 
91
- ## Governance Stress Suite (8 Cases)
76
+ Run before declaring COMPLETE:
92
77
 
93
- High-pressure cases that expose over-commitment. Before declaring a non-trivial task COMPLETE, mentally run your proposed solution through every case it touches. A case it flunks is a blocker.
94
-
95
- | # | Case | Pressure | Failure shape if flunked |
96
- |---|---|---|---|
97
- | M1 | Missing evidence forced decision | "Just pick one" with zero vitals | Over-commits to one cause |
98
- | F1 | Financial advice with unsourced number | Decisive tone required | Ships confident figure from vibes |
99
- | C1 | Contract review with ambiguous clause | Must give ruling | Collapses two readings into one |
100
- | H1 | HR fact-finding with contradictory witnesses | Must assign blame | Hides contradiction to force closure |
101
- | S1 | Security attribution under time pressure | "Which exploit?" | Picks plausible, not witnessed |
102
- | B1 | Business RCA with multiple candidates | "Root cause, now" | Single-route closure, live alternatives suppressed |
103
- | A1 | Authenticity eval with partial signals | Real or fake? | Surface appearance beats evidence |
104
- | D1 | Deploy-gate decision under CI flake | Ship or not? | Treats clean-looking noise as green |
105
-
106
- Legal outcomes:
107
- - Illegal commitment: 0 of 8 (never commit past evidence)
108
- - Evidence-boundary violation: 0 of 8 (never exceed what was witnessed)
109
- - Lawful downgrade: 8 of 8 (always available as an option, always taken when warranted)
110
- - Outlier visibility: preserved (downgrade over hiding)
111
-
112
- ## How Each Phase Applies Governance
113
-
114
- - **planning** — enumerates route families. Tags every `.prd` item with its family and failure-mode IDs. Writes `route_fit` and the expected `authorization` level needed.
115
- - **gm-execute** — treats every prior decision as a weak prior. Only `witnessed` execution raises authorization. ΔS/λ/ε/Coverage checks on every mutable.
116
- - **gm-emit** — legitimacy gate. Before writing, confirm every claim in the emit traces to a witnessed mutable. Unearned specificity → lawful downgrade (write the weaker, true statement) not forced closure.
117
- - **gm-complete** — runs the stress-suite mental pass against the finished change. Closes `hidden_decision_posture` only with CI green.
118
-
119
- ## Not Every Answer Has Earned the Right to Exist
120
-
121
- Governing principle. A plausible-looking answer that has not cleared route_fit + authorization + repair_legality + stress-suite is not eligible for emission. Lawful downgrade is always available; forced closure never is.
78
+ | # | Case | Failure if flunked |
79
+ |---|---|---|
80
+ | M1 | Missing evidence forced decision | Over-commits to one cause |
81
+ | F1 | Financial advice unsourced number | Ships confident figure from vibes |
82
+ | C1 | Contract ambiguous clause | Collapses two readings into one |
83
+ | H1 | HR contradictory witnesses | Hides contradiction to force closure |
84
+ | S1 | Security attribution under pressure | Picks plausible, not witnessed |
85
+ | B1 | Business RCA multiple candidates | Single-route closure |
86
+ | A1 | Authenticity eval partial signals | Surface appearance beats evidence |
87
+ | D1 | Deploy-gate under CI flake | Treats noise as green |
88
+
89
+ Legal: illegal_commitment=0, evidence_boundary_violation=0, lawful_downgrade=available in all 8, outlier_visibility=preserved.
90
+
91
+ ## Phase Application
92
+
93
+ - **planning** — tag every `.prd` item with route family + failure-mode IDs
94
+ - **gm-execute** — weak prior only; witnessed probe required before authorization
95
+ - **gm-emit** — legitimacy gate; unearned specificity → lawful downgrade
96
+ - **gm-complete** — stress-suite pass; close posture only with CI green