@kiwidata/grimoire 0.2.1 → 0.3.0
- package/AGENTS.md +16 -3
- package/package.json +1 -1
- package/skills/grimoire-apply/SKILL.md +62 -10
- package/skills/grimoire-audit/SKILL.md +12 -1
- package/skills/grimoire-bug/SKILL.md +3 -0
- package/skills/grimoire-verify/SKILL.md +19 -0
- package/skills/references/review-personas.md +30 -2
- package/templates/learnings.md +40 -0
package/AGENTS.md
CHANGED
|
@@ -61,6 +61,10 @@ After any failure, state what you observe before proposing a fix. One sentence:
|
|
|
61
61
|
|
|
62
62
|
This applies especially to test failures. "The test failed" is not a diagnosis. "The test expected `302` but got `200` because the redirect middleware isn't registered in the test client" is.
|
|
63
63
|
|
|
64
|
+
### Loop-level breaker (autonomous apply)
|
|
65
|
+
|
|
66
|
+
The attempt budget above is per-problem. Autonomous `grimoire-apply` adds a run-level circuit breaker and cross-section thrash detection on top of it — see `grimoire-apply` SKILL.md. Don't duplicate the per-problem rules there; the breaker is the loop-scale backstop, this protocol is the per-problem one.
|
|
67
|
+
|
|
64
68
|
## When to Use Grimoire
|
|
65
69
|
|
|
66
70
|
Use grimoire when the user's request involves:
|
|
@@ -178,7 +182,7 @@ If a task seems wrong or impossible during apply:
|
|
|
178
182
|
|
|
179
183
|
## Directory Structure
|
|
180
184
|
|
|
181
|
-
Features, decisions, constraints, and schema are edited **live on the feature branch** — `git diff` is the staging area. A change folder holds only the ephemeral coordination artifacts (manifest
|
|
185
|
+
Features, decisions, constraints, and schema are edited **live on the feature branch** — `git diff` is the staging area. A change folder holds only the ephemeral coordination artifacts (manifest, tasks, and the apply-maintained learnings file) and is removed at finalize; the PR diff and git history are the record. There is no proposed-copy tree and no archive tree.
|
|
182
186
|
|
|
183
187
|
```
|
|
184
188
|
project-root/
|
|
@@ -193,7 +197,8 @@ project-root/
|
|
|
193
197
|
│ └── changes/ # ephemeral per-change coordination — removed at finalize
|
|
194
198
|
│ └── <change-id>/
|
|
195
199
|
│ ├── manifest.md
|
|
196
|
-
│
|
|
200
|
+
│ ├── tasks.md
|
|
201
|
+
│ └── learnings.md # apply working memory: failure-mode notes + discovered facts
|
|
197
202
|
```
|
|
198
203
|
|
|
199
204
|
## Conventions
|
|
@@ -243,7 +248,15 @@ This is what makes `grimoire trace` work. Without it, the commit is invisible to
|
|
|
243
248
|
### Decision Numbering
|
|
244
249
|
- Sequential, zero-padded: `0001-`, `0002-`, etc.
|
|
245
250
|
- Never reuse numbers
|
|
246
|
-
|
|
251
|
+
|
|
252
|
+
### Decision Lifecycle
|
|
253
|
+
Status moves `proposed → accepted → (deprecated | superseded by NNNN)`:
|
|
254
|
+
- `proposed` — drafted, not yet adopted.
|
|
255
|
+
- `accepted` — in force; treated as a constraint by every stage.
|
|
256
|
+
- `deprecated` — no longer recommended, with no direct replacement (the need went away).
|
|
257
|
+
- `superseded by NNNN` — replaced by a newer decision.
|
|
258
|
+
|
|
259
|
+
Supersession is **two-way and explicit**: the superseding ADR back-links the one it replaces (in Context or Decision Drivers), and the superseded ADR keeps its number with status set to `superseded by NNNN`. This is the only home for the link — don't restate it elsewhere.
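The lifecycle above can be read as a small state machine. The following is a minimal sketch in Python, not part of grimoire: `can_transition` is a hypothetical helper, statuses are assumed to be stored as the plain strings shown in the list, and treating `deprecated` and `superseded by NNNN` as terminal is an assumption of this sketch.

```python
import re

# Matches the "superseded by NNNN" status with a zero-padded four-digit number.
SUPERSEDED = re.compile(r"^superseded by \d{4}$")


def can_transition(current: str, new: str) -> bool:
    """Return True if a decision may move from `current` to `new`.

    Hypothetical helper: encodes proposed -> accepted -> (deprecated | superseded by NNNN).
    """
    if current == "proposed":
        return new == "accepted"
    if current == "accepted":
        return new == "deprecated" or bool(SUPERSEDED.match(new))
    # Assumption: deprecated and superseded decisions do not move again.
    return False
```

A check like this could run over ADR front matter in review, flagging a status edit that skips `accepted` or drops the superseding number.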
|
|
247
260
|
|
|
248
261
|
### Step Definitions
|
|
249
262
|
Organize by **domain concept**, NOT by feature file. Check the project's existing test setup and match its BDD framework conventions. See the active skill's testing reference for ecosystem-specific patterns.
|
package/package.json
CHANGED
|
package/skills/grimoire-apply/SKILL.md
CHANGED
@@ -21,7 +21,7 @@ Implement tasks from a planned grimoire change using **test-first discipline at
|
|
|
21
21
|
|
|
22
22
|
Do NOT write a `.feature` scenario for a `unit-invariant` or `characterization` task — forcing Gherkin where a unit test is correct is the antipattern that fills feature files with slop. One right way: behavior → scenario, everything else → unit test.
|
|
23
23
|
|
|
24
|
-
**Artifacts are edited live on the feature branch.** Features, decisions, constraints, and schema are real files in `features/`, `.grimoire/decisions/`, `.grimoire/docs/`. There is no copy-into-change-folder and no promote step — `git diff` is the staging area. The change folder holds only ephemeral process scaffolding (`manifest.md`, `tasks.md`).
|
|
24
|
+
**Artifacts are edited live on the feature branch.** Features, decisions, constraints, and schema are real files in `features/`, `.grimoire/decisions/`, `.grimoire/docs/`. There is no copy-into-change-folder and no promote step — `git diff` is the staging area. The change folder holds only ephemeral process scaffolding (`manifest.md`, `tasks.md`, and the apply-maintained `learnings.md`).
|
|
25
25
|
|
|
26
26
|
## CRITICAL: Two Rules That Must Not Be Broken
|
|
27
27
|
|
|
@@ -84,16 +84,26 @@ If the user doesn't specify, default to review mode.
|
|
|
84
84
|
|
|
85
85
|
**Both modes:** Update `tasks.md` in real time as work progresses. Mark tasks `- [x]` the moment they pass. If a task is split, reordered, or new tasks are discovered during implementation, update `tasks.md` immediately so it always reflects the current state. The task list is the source of truth for progress — if the session is interrupted, the next agent should be able to read `tasks.md` and know exactly where to resume.
|
|
86
86
|
|
|
87
|
+
### Working Memory: `learnings.md`
|
|
88
|
+
|
|
89
|
+
Apply keeps one ephemeral file, `.grimoire/changes/<change-id>/learnings.md` (create it from `templates/learnings.md` the first time you need it). It is the loop's memory between attempts and sessions, and it is **removed at finalize** with the rest of the change folder — nothing in it reaches the repo. Two sections, two lifecycles:
|
|
90
|
+
|
|
91
|
+
- **Failure-mode notes** — transient. After a failed attempt, append one line: what you tried and why it failed. Before any retry, read this section so you don't repeat a dead end. Prune a task's notes the moment it goes green. Never promote them.
|
|
92
|
+
- **Discovered facts** — durable facts about the project learned while implementing (a build flag, a convention, an undocumented contract). Stage them here with their destination home; at finalize they are reconciled into that one home and cleared. Do **not** write them into `AGENTS.md`.
|
|
93
|
+
|
|
94
|
+
Subagents and fresh sessions read and append to this file the same way they use `tasks.md` — it is shared state on disk, not context-window memory.
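As a rough illustration of the failure-mode lifecycle (append after a failed attempt, read before a retry, prune on green), here is a minimal Python sketch. It assumes the one-line note format from `templates/learnings.md` (`- <task-id> · tried <approach> · failed: <why>`); the helper names are hypothetical and the notes are held as a list of lines rather than read from disk.

```python
SEP = " · "  # separator used by the note format in templates/learnings.md


def format_note(task_id: str, tried: str, failed: str) -> str:
    """Build one failure-mode line for a dead end."""
    return f"- {task_id}{SEP}tried {tried}{SEP}failed: {failed}"


def notes_for(lines: list[str], task_id: str) -> list[str]:
    """Notes to read before retrying `task_id`."""
    prefix = f"- {task_id}{SEP}"
    return [ln for ln in lines if ln.startswith(prefix)]


def prune(lines: list[str], task_id: str) -> list[str]:
    """Drop a task's notes once it goes green; other tasks' notes stay."""
    prefix = f"- {task_id}{SEP}"
    return [ln for ln in lines if not ln.startswith(prefix)]
```

Matching on the task id plus the separator keeps `2.2` from also matching `2.20`.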
|
|
95
|
+
|
|
87
96
|
### Stuck Detection & Recovery
|
|
88
97
|
|
|
89
98
|
**You MUST track failed attempts per task.** If a test won't go green, count your attempts:
|
|
90
99
|
|
|
100
|
+
- **Before any attempt past the first:** read the task's **failure-mode notes** in `learnings.md`. Do not repeat an approach already recorded there as failed.
|
|
91
101
|
- **Attempt 1:** Try the straightforward implementation from the task description.
|
|
92
|
-
- **Attempt 2:** If attempt 1 failed, re-read the error carefully
|
|
93
|
-
- **Attempt 3 (final):** If attempt 2 failed, try one more *fundamentally different* approach. If the same error recurs, the problem is likely not in your implementation.
|
|
102
|
+
- **Attempt 2:** If attempt 1 failed, append a failure-mode note (`<task-id> · tried … · failed: …`), re-read the error carefully, then try a *different* approach — not the same code with minor tweaks. State what you're doing differently and why.
|
|
103
|
+
- **Attempt 3 (final):** If attempt 2 failed, append the second dead end as a failure-mode note, then try one more *fundamentally different* approach. If the same error recurs, the problem is likely not in your implementation.
|
|
94
104
|
|
|
95
105
|
**After 3 failed attempts on a single task, STOP.** Do not continue. Instead:
|
|
96
|
-
1. Add a comment to `tasks.md` under the task: `<!-- BLOCKED: <summary
|
|
106
|
+
1. Add a comment to `tasks.md` under the task: `<!-- BLOCKED: <summary> -->` (the full trail is already in the failure-mode notes)
|
|
97
107
|
2. Present to the user:
|
|
98
108
|
- What the task requires
|
|
99
109
|
- What you tried (all 3 approaches, briefly)
|
|
@@ -115,10 +125,29 @@ If the user doesn't specify, default to review mode.
|
|
|
115
125
|
|
|
116
126
|
**Never silently retry the same approach.** If your implementation produced error X and you're about to write code that will produce error X again, stop and think about why. If you can't identify what would change the outcome, stop and ask.
|
|
117
127
|
|
|
128
|
+
### Circuit Breaker & Cross-Section Thrash (Autonomous Mode)
|
|
129
|
+
|
|
130
|
+
The per-task 3-attempt cap bounds a single task; it cannot see the *run* cycling. Autonomous mode adds a loop-level breaker the parent orchestrator checks **between sections**. Caps live under `llm.coding.limits` in `.grimoire/config.yaml`:
|
|
131
|
+
|
|
132
|
+
| Cap | Default | Kind |
|
|
133
|
+
|-----|---------|------|
|
|
134
|
+
| `max_sections_without_checkpoint` | 5 | followable — halt and checkpoint with the user |
|
|
135
|
+
| `consecutive_blocked` | 2 | followable — two BLOCKED sections in a row → halt |
|
|
136
|
+
| `max_cost_usd` | null (opt-in) | **soft** — self-reported; not harness-enforced in v1 |
|
|
137
|
+
| `max_wallclock_min` | null (opt-in) | **soft** — self-reported; not harness-enforced in v1 |
|
|
138
|
+
|
|
139
|
+
**Cross-section thrash detection:** halt the whole run — don't just retry locally — when the last two sections both ended BLOCKED, **or** when a section's failure-mode error class repeats the prior section's (read the failure-mode notes in `learnings.md` to compare). A failed attempt always leaves a note, so the thrash signal accumulates across sections; the breaker is the last resort once that signal shows the loop is stuck, not the first line of defense.
|
|
140
|
+
|
|
141
|
+
**On any trip:** stop, state the trip reason and a one-line diagnosis (what cycled, what was tried), and hand to the user. Do not continue past a tripped breaker.
|
|
142
|
+
|
|
143
|
+
> **Enforcement honesty:** the section and BLOCKED caps are orchestrator behavior the agent follows; the cost and wall-clock caps are *soft* — the agent self-reports against them and they are not enforced by the harness in v1. A hard, code-enforced breaker is a deferred follow-up.
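The between-sections check described above might look roughly like the sketch below. It is an illustration under stated assumptions, not the orchestrator's actual logic: `Limits`, `SectionResult`, `check_breaker`, and the `error_class` label are hypothetical names, the defaults are copied from the caps table, and the sketch assumes the section cap trips once that many sections have run since the last checkpoint.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Limits:
    # Defaults copied from the caps table; the cost and wall-clock caps are
    # left out because they are soft (self-reported) in v1.
    max_sections_without_checkpoint: int = 5
    consecutive_blocked: int = 2


@dataclass
class SectionResult:
    blocked: bool
    error_class: Optional[str] = None  # hypothetical label read from failure-mode notes


def check_breaker(history, since_checkpoint, limits=None):
    """Return a trip reason, or None if the run may continue."""
    limits = limits or Limits()
    # Assumption: trip once this many sections ran without a user checkpoint.
    if since_checkpoint >= limits.max_sections_without_checkpoint:
        return "max_sections_without_checkpoint"
    tail = history[-limits.consecutive_blocked:]
    if len(tail) == limits.consecutive_blocked and all(s.blocked for s in tail):
        return "consecutive_blocked"
    # Cross-section thrash: the same error class in two sections in a row.
    if len(history) >= 2:
        prev, last = history[-2], history[-1]
        if last.error_class is not None and last.error_class == prev.error_class:
            return "repeated_error_class"
    return None
```

Any non-`None` result corresponds to a trip: stop, state the reason with a one-line diagnosis, and hand to the user.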
|
|
144
|
+
|
|
118
145
|
### Session Management — MANDATORY Fresh Context Per Section
|
|
119
146
|
|
|
120
147
|
**Do NOT implement all tasks in a single conversation context.** Context accumulates across tasks and degrades output quality — the LLM starts hallucinating based on stale file contents it read 5 tasks ago. This is not a suggestion. Fresh context per task section is required.
|
|
121
148
|
|
|
149
|
+
**Size one task to one context.** The goal is not statelessness for its own sake — a task should be small enough that one coherent context carries it start to finish (stateful *within* a task), and context is reset *between* tasks. If a single task overflows its context mid-flight, that is a **smell that the task is too big** — split the spec, don't paper over it with a stateless restart loop. Fresh-context-per-section gives you the "reset between" half for free; keeping tasks small gives you the "continuity within" half.
|
|
150
|
+
|
|
122
151
|
Each task section in `tasks.md` has a `<!-- context: ... -->` block listing the exact files needed. This is the loading list for that section's fresh context.
|
|
123
152
|
|
|
124
153
|
#### Claude Code: Subagent Per Section
|
|
@@ -132,6 +161,12 @@ The parent agent is the **orchestrator only** — it does NOT implement tasks it
|
|
|
132
161
|
find section <N>, and implement all unchecked tasks in that section.
|
|
133
162
|
Follow the red-green BDD cycle for each task. Mark tasks [x] when done.
|
|
134
163
|
|
|
164
|
+
Use `.grimoire/changes/<change-id>/learnings.md` as working memory: read a
|
|
165
|
+
task's failure-mode notes before retrying it and don't repeat a recorded dead
|
|
166
|
+
end; append a failure-mode note after any failed attempt; prune them when the
|
|
167
|
+
task goes green; append durable project facts to Discovered facts with their
|
|
168
|
+
home (never to AGENTS.md). Never weaken or delete a test to force green.
|
|
169
|
+
|
|
135
170
|
Before writing any production code, read `../references/code-quality.md`,
|
|
136
171
|
`../references/testing-contracts.md`, and `../references/pattern-guard.md`.
|
|
137
172
|
Apply the code-quality rules WHILE you write (not after) — reuse before write,
|
|
@@ -255,11 +290,14 @@ Work through `tasks.md` sequentially. **Every task follows the same cycle: test
|
|
|
255
290
|
- Assertions check behavior, not just types or existence — "response status is 302 and redirect URL is /dashboard/" not "response is not None"
|
|
256
291
|
- If you wrote a test that would pass against a null/trivial implementation, strengthen it
|
|
257
292
|
10. **Code quality check:** Walk the seven-point checklist in `../references/code-quality.md` against every file you changed. Any fail → fix code, re-run tests, re-check. Do not mark `[x]` while a check fails.
|
|
258
|
-
11.
|
|
259
|
-
12.
|
|
293
|
+
11. **Reconcile working memory:** prune this task's failure-mode notes from `learnings.md` — it's green, they've served their purpose. If you learned a durable project fact while implementing (a build flag, a convention, an undocumented contract, an architectural constraint), append it to the **Discovered facts** section with its destination home — don't write it into `AGENTS.md` and don't leave it only in context.
|
|
294
|
+
12. Mark complete: `- [ ]` → `- [x]`
|
|
295
|
+
13. Move to next task
|
|
260
296
|
|
|
261
297
|
**This is strict red-green BDD.** A test that has never been red has never proven it can catch a failure. The red step is NOT a formality — it is the proof that the test works. If you skip it or the test passes immediately, you have a false positive that provides zero safety.
|
|
262
298
|
|
|
299
|
+
**Never game the gate (reward-hack guard).** When a test won't pass, fix the production code — never weaken or delete the test to force green. Deleting a test, loosening an assertion to match wrong output, narrowing what it checks, or skipping/`xfail`-ing it to get a green run is **stop-and-flag**, not a valid completion. The gate is the convergence signal; gaming it produces plausible-wrong code faster. If a test genuinely encodes the wrong expectation, that is a spec problem — STOP and go back to draft, don't quietly edit the test to pass.
|
|
300
|
+
|
|
263
301
|
**Step definition rules:**
|
|
264
302
|
- Organize by domain concept, not by feature file
|
|
265
303
|
- Shared steps go in the project's common step location (check existing test setup)
|
|
@@ -290,16 +328,30 @@ When all tests are green. Features, decisions, and constraints were edited live
|
|
|
290
328
|
2. Constraints (`.grimoire/docs/constraints.md`) were edited in place — nothing to move.
|
|
291
329
|
3. If the change has a `data.yml` (schema delta), apply its `add`/`modify`/`remove` entries to the live `.grimoire/docs/data/schema.yml` so the baseline schema stays current. `data.yml` is a migration-delta spec (ephemeral scaffolding carrying nullability/safety/ordering intent a raw diff wouldn't), not a copy of the schema — `schema.yml` is the live target; the delta is discarded with the change folder.
|
|
292
330
|
4. Refresh the project overview: run `grimoire docs`. It regenerates `.grimoire/docs/OVERVIEW.md` (the human entry point) from the now-current features, constraints, decisions, and schema — superseded decisions drop out automatically. This is the existing `docs` command, not a new one.
|
|
293
|
-
5.
|
|
331
|
+
5. Reconcile `learnings.md`: for each entry under **Discovered facts**, write it into the home it names — an area doc (`.grimoire/docs/<area>.md`), a decision, a constraint, or `schema.yml`. Confirm the routing with the user (it's correctable) and drop stale ones. Failure-mode notes are discarded, not promoted. This is the one place facts learned during apply enter the durable record — `AGENTS.md` is never the destination.
|
|
332
|
+
6. Remove the change directory `.grimoire/changes/<change-id>/`. Its `manifest.md` + `tasks.md` + `learnings.md` (+ any `data.yml`) and the `draft.md` design doc are ephemeral process scaffolding. `draft.md` was retained read-only through the pipeline as the agreed-design reference; this is its closing deletion.
|
|
333
|
+
|
|
334
|
+
**Guard — never delete uncommitted scaffolding.** `git log` only preserves what was committed. If `draft.md`/`tasks.md`/`manifest.md`/`learnings.md` were never committed (e.g. draft and plan ran without intermediate commits), deleting them now loses them permanently — there is no recovering an untracked file. Before removing the folder, verify it is in history:
|
|
335
|
+
```
|
|
336
|
+
git ls-files --error-unmatch .grimoire/changes/<change-id>/draft.md
|
|
337
|
+
```
|
|
338
|
+
If that errors (untracked), or `git status` shows uncommitted edits under the change folder, **commit the scaffolding first** (see step 8 — this becomes the first of two commits), then delete. If you cannot commit, STOP and tell the user rather than deleting.
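The guard above can be sketched as a small pre-delete check. This is a hedged Python illustration, not grimoire code: `scaffolding_is_committed` and `_git` are hypothetical names, and the sketch checks only `draft.md` with `git ls-files --error-unmatch` (as the text does) plus `git status --porcelain` scoped to the change folder.

```python
import subprocess


def _git(args):
    """Run a git command in the current repo and capture its output."""
    return subprocess.run(["git", *args], capture_output=True, text=True)


def scaffolding_is_committed(change_dir, run=_git):
    """True only if draft.md is tracked and nothing under change_dir is uncommitted."""
    tracked = run(["ls-files", "--error-unmatch", f"{change_dir}/draft.md"])
    if tracked.returncode != 0:
        return False  # untracked: deleting now would lose it for good
    # Staged, modified, or untracked paths under the folder all show up here.
    status = run(["status", "--porcelain", "--", change_dir])
    return status.returncode == 0 and status.stdout.strip() == ""
```

The two checks complement each other: `ls-files` passes for a file that is staged but never committed, and the porcelain status catches that case. When the function returns `False`, commit the scaffolding first or stop and tell the user.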
|
|
339
|
+
|
|
340
|
+
The durable record is the branch, the PR, and `git log` — linked by the `Change: <change-id>` trailer; once committed, git history preserves `draft.md` if ever needed. **There is no archive tree** (don't reinvent git history).
|
|
294
341
|
|
|
295
342
|
### 8. Commit
|
|
296
343
|
|
|
297
|
-
|
|
344
|
+
The commit captures the finished state — accepted decisions, live artifacts, cleared scaffolding — not mid-flight change artifacts.
|
|
345
|
+
|
|
346
|
+
**Order depends on whether the scaffolding is already in history (see step 6's guard):**
|
|
347
|
+
|
|
348
|
+
- **Scaffolding already committed** (draft/plan committed earlier, the normal case): finalize fully — including the folder removal — then make one commit capturing the accepted state and the deletion.
|
|
349
|
+
- **Scaffolding NOT yet committed** (this is the change's first commit): you cannot delete-then-commit, or the scaffolding is lost forever. Make **two commits**: (1) commit the implementation, live artifacts, and the still-present change folder so history preserves `draft.md`/`tasks.md`; (2) remove the folder and commit the deletion. Both carry the `Change: <change-id>` trailer.
|
|
298
350
|
|
|
299
|
-
Stage the live artifacts and the scaffolding removal:
|
|
351
|
+
Stage the live artifacts (and, in the single-commit case, the scaffolding removal):
|
|
300
352
|
```
|
|
301
353
|
git add features/ .grimoire/decisions/ .grimoire/docs/ src/ tests/
|
|
302
|
-
git add -u # picks up the removed change directory
|
|
354
|
+
git add -u # picks up the removed change directory (single-commit case)
|
|
303
355
|
```
|
|
304
356
|
|
|
305
357
|
Then commit using `/grimoire:commit` (reads change context for the message) or write a manual message following `AGENTS.md` commit trailer conventions:
|
|
package/skills/grimoire-audit/SKILL.md
CHANGED
@@ -107,6 +107,9 @@ For confirmed items, create a grimoire change:
|
|
|
107
107
|
Group related items into single changes — don't create one change per discovery.
|
|
108
108
|
|
|
109
109
|
### 6. Dead Feature Detection
|
|
110
|
+
|
|
111
|
+
**Detection is deterministic.** Every dead/stale finding cites exact `file:line` (or ADR id) evidence from a reproducible check — codebase-memory-mcp graph queries (`search_graph` / `get_architecture`) per [0029]/[0030], with `grep` / `git blame` only where the graph has no answer (e.g. `@skip` age). The same commit yields the same findings. The LLM summarizes and interviews; it does not score the codebase by impression.
|
|
112
|
+
|
|
110
113
|
Check for documented features and decisions that may no longer be accurate:
|
|
111
114
|
|
|
112
115
|
**Dead features** — feature files that describe behavior the code no longer implements:
|
|
@@ -137,7 +140,15 @@ After the interview, summarize:
|
|
|
137
140
|
- How many decisions are documented vs. undocumented
|
|
138
141
|
- How many decisions are stale
|
|
139
142
|
- How many conventions files drifted vs. up-to-date
|
|
140
|
-
|
|
143
|
+
|
|
144
|
+
Then emit a **Top Actions** list — most-risk first, each with the exact path and the single next move. The ranking comes from the deterministic checks (§6), not impression, so the same commit yields the same list:
|
|
145
|
+
|
|
146
|
+
```markdown
|
|
147
|
+
## Top Actions
|
|
148
|
+
1. `features/billing/invoice.feature` — dead (InvoiceView deleted ~3mo ago); create a removal change.
|
|
149
|
+
2. `.grimoire/decisions/0007-search-backend.md` — stale (library no longer in deps); deprecate or update.
|
|
150
|
+
3. `.grimoire/docs/conventions/api.md` — drifted (views moved to `src/api/handlers/`); refresh.
|
|
151
|
+
```
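One way to keep the Top Actions list reproducible is to sort findings on a fixed key, so the same inputs always give the same order. The Python sketch below is illustrative only: `Finding`, `rank`, and the `RISK_ORDER` ranking (dead before stale before drifted) are assumptions inferred from the example list, not a ranking the skill defines.

```python
from dataclasses import dataclass

# Assumed risk order, inferred only from the example list above.
RISK_ORDER = {"dead": 0, "stale": 1, "drifted": 2}


@dataclass(frozen=True)
class Finding:
    path: str
    kind: str      # "dead" | "stale" | "drifted"
    evidence: str  # exact file:line or ADR id from a reproducible check


def rank(findings):
    """Order findings most-risk first; ties break on path so output is stable."""
    return sorted(findings, key=lambda f: (RISK_ORDER[f.kind], f.path))
```

Because the key uses only the finding's kind and path, shuffling the input does not change the emitted list.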
|
|
141
152
|
|
|
142
153
|
## Important
|
|
143
154
|
- This is a COLLABORATIVE process, not a dump. Interview, don't lecture.
|
|
package/skills/grimoire-bug/SKILL.md
CHANGED
@@ -55,6 +55,8 @@ Before touching any production code:
|
|
|
55
55
|
2. Run it — **it MUST FAIL**, reproducing the bug
|
|
56
56
|
3. If it passes, your test doesn't actually reproduce the bug. Fix the test until it fails for the right reason.
|
|
57
57
|
|
|
58
|
+
**Name it after the bug.** This repro test stays as the permanent regression test — name it so the bug is obvious (`test_password_reset_special_chars`; scenario "Password reset with plus-sign email"). One bug → one named regression test. This is how the same bug doesn't come back: a future change that reintroduces it goes red on a test that names the defect.
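A named regression test might look like the short Python sketch below. Everything about the bug here is hypothetical: `reset_token_email` is an invented function under test and the plus-sign defect is made up for illustration; only the naming pattern comes from the text.

```python
def reset_token_email(raw_email: str) -> str:
    """Hypothetical function under test: normalizes the address a reset link goes to."""
    local, _, domain = raw_email.strip().partition("@")
    return f"{local}@{domain.lower()}"


def test_password_reset_special_chars():
    # Regression for a (hypothetical) bug: a plus sign in the local part was dropped.
    assert reset_token_email("user+tag@Example.com") == "user+tag@example.com"
```

The test name and the one-line comment carry the defect, so a future red run points straight at what came back.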
|
|
59
|
+
|
|
58
60
|
This is non-negotiable. A bug fix without a reproduction test is a guess that might work. A failing test is proof you understand the problem.
|
|
59
61
|
|
|
60
62
|
### 4. Document the Bug
|
|
@@ -193,6 +195,7 @@ Report to the user:
|
|
|
193
195
|
|
|
194
196
|
## Important
|
|
195
197
|
- **Reproduce before you fix.** No exceptions. If you can't reproduce it, you don't understand it, and your fix is a guess.
|
|
198
|
+
- **The test is the source of truth, not your self-review.** When the same agent writes a fix and then reviews it, the same wrong assumption rides into both steps — "looks correct" is not evidence. The red→green of the named regression test (and the configured suites) is the proof. Don't declare a bug fixed on a code re-read; declare it fixed when the mechanical gate flips and stays green.
|
|
196
199
|
- **Small fixes only.** If the bug fix requires significant architectural changes, it's not a bug fix — route to `grimoire-draft` for a proper change.
|
|
197
200
|
- **Don't over-document.** The test is the documentation. A one-line comment in the test explaining the bug is enough. Don't create tracking files, bug reports, or manifests for a bug fix.
|
|
198
201
|
- **The feature file is truth.** If a scenario describes behavior the user now says is wrong, that's a spec change, not a bug. Handle it through `grimoire-draft`.
|
|
package/skills/grimoire-verify/SKILL.md
CHANGED
@@ -118,9 +118,27 @@ For each step definition:
|
|
|
118
118
|
- **[warning]** `test_auth.py:58` — step "Then user should exist" only asserts `is not None` — check the actual user properties
|
|
119
119
|
```
|
|
120
120
|
|
|
121
|
+
**Regression coverage:** When verifying a bug fix, confirm the fix ships with a regression test **named after the bug** (see `grimoire-bug`). A bug fix with no test that goes red-without-the-fix and pins the defect → WARNING — the bug can silently return.
|
|
122
|
+
|
|
121
123
|
If `grimoire test-quality` CLI command is available, suggest running it for a comprehensive analysis.
|
|
122
124
|
To run tests directly: use `config.tools.bdd_test` for BDD and `config.tools.unit_test` for unit tests.
|
|
123
125
|
|
|
126
|
+
### 3.E Behavioral Verification *(optional — user-facing changes only)*
|
|
127
|
+
|
|
128
|
+
Sections 3.B–3.D verify statically (code exists, asserts, follows decisions) and run the configured suites. They do **not** drive the running app. When the change is user-facing and the app can be run, add a behavioral pass; otherwise skip and say so. This mode adds **no mandatory dependency** — if there's no way to drive the app, mark it INCONCLUSIVE and rely on 3.A–3.D.
|
|
129
|
+
|
|
130
|
+
**Read-only by default.** Read-only navigation/inspection needs no opt-in. Any state-changing action requires explicit user opt-in **and** a non-production target (local/staging URL, seeded creds). Never run mutations against production.
|
|
131
|
+
|
|
132
|
+
**Verdict.** Every behavioral pass ends in exactly one:
|
|
133
|
+
- **SHIP** — behavior matches the spec; no material issues.
|
|
134
|
+
- **SHIP WITH FIXES** — works, with the non-blocking issues listed.
|
|
135
|
+
- **DO NOT SHIP** — a scenario's promised outcome does not hold.
|
|
136
|
+
- **INCONCLUSIVE** — could not verify (no baseline, app wouldn't run, tooling absent).
|
|
137
|
+
|
|
138
|
+
**No baseline ⇒ INCONCLUSIVE, never a silent PASS.** Same rule as §3.C2: without a reference state you cannot claim behavior is correct. Report INCONCLUSIVE and fall back to static verification — do not dress up "I couldn't check" as a pass.
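The verdict rules above reduce to a short decision order, sketched here in Python. The function name and its boolean inputs are hypothetical; the point of the sketch is only that the "could not verify" check comes first, so a missing baseline can never fall through to a pass.

```python
def behavioral_verdict(has_baseline: bool, app_ran: bool,
                       scenarios_hold: bool, minor_issues: int) -> str:
    """Pick exactly one verdict for a behavioral pass."""
    # Could not verify: report that, never a silent pass.
    if not (has_baseline and app_ran):
        return "INCONCLUSIVE"
    # A promised outcome does not hold.
    if not scenarios_hold:
        return "DO NOT SHIP"
    return "SHIP WITH FIXES" if minor_issues else "SHIP"
```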
|
|
139
|
+
|
|
140
|
+
**Click-path final-state check.** For each touchpoint the change affects, build a side-effect map — `action → {state it sets, state it resets}` — then trace the sequence and ask: *is the FINAL state what the label/spec promises?* This catches the silent-undo class (action B resets what action A just set) that static reading and single-assert tests miss.
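A side-effect map and trace can be as small as the Python sketch below. The action names and state keys are invented for illustration, and the sketch assumes each action applies its sets before its resets.

```python
def final_state(side_effects, sequence, initial=None):
    """Trace a click path through a side-effect map and return the end state."""
    state = dict(initial or {})
    for action in sequence:
        effect = side_effects[action]
        state.update(effect.get("sets", {}))
        for key in effect.get("resets", ()):
            state.pop(key, None)
    return state


# Hypothetical touchpoints: changing the sort silently clears the filter.
SIDE_EFFECTS = {
    "apply_filter": {"sets": {"filter": "overdue"}},
    "change_sort": {"sets": {"sort": "date"}, "resets": ["filter"]},
}
```

Tracing `["apply_filter", "change_sort"]` ends with no `filter` key, which is the silent-undo case: the UI may still promise a filtered list while the final state no longer has one.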
|
|
141
|
+
|
|
124
142
|
### 4. Security Compliance Verification
|
|
125
143
|
|
|
126
144
|
Verify that security guidance from plan and review stages was followed in implementation. Read `../references/security-compliance.md` for the full checklist.
|
|
@@ -222,6 +240,7 @@ Produce a structured report:
|
|
|
222
240
|
- Scenarios verified: X
|
|
223
241
|
- Decisions verified: X
|
|
224
242
|
- Security checks: X passed, X failed
|
|
243
|
+
- Behavioral verdict: <SHIP | SHIP WITH FIXES | DO NOT SHIP | INCONCLUSIVE | n/a (static only)>
|
|
225
244
|
- Issues found: X critical, X warnings, X suggestions
|
|
226
245
|
|
|
227
246
|
## Critical Issues
|
|
package/skills/references/review-personas.md
CHANGED
@@ -116,6 +116,31 @@ Severity inflation patterns to avoid:
|
|
|
116
116
|
- "Untested edge case" when no scenario in the briefing covers it → not a blocker.
|
|
117
117
|
- "Missing observability" on a level 1-2 change → suggestion, never blocker.
|
|
118
118
|
|
|
119
|
+
## 2c. Pre-Report Gate *(diff-review personas: Senior Engineer code-level, Security code-level scan, Code Style)*
|
|
120
|
+
|
|
121
|
+
The materiality gate (§2) asks "does this matter to *this* project". The Pre-Report Gate asks the prior question: "is this *even a real issue*". Both apply; this one runs first on code-level findings. Before writing any finding, answer four questions:
|
|
122
|
+
|
|
123
|
+
1. **Exact line** — can you cite the precise `file:line` the finding lives at?
|
|
124
|
+
2. **Concrete failure mode** — can you state input → state → bad outcome? Not "could be unsafe" — the actual trigger and consequence.
|
|
125
|
+
3. **Context read** — have you read the callers, imports, and tests around the line, not just the hunk? Trace the type and the caller before claiming a flaw.
|
|
126
|
+
4. **Severity defensible** — would the §2b severity survive the Contrarian?
|
|
127
|
+
|
|
128
|
+
Any "no / unsure" → downgrade or drop. A **blocker** additionally requires the offending snippet, the failure scenario, and **why existing guards don't already catch it** (neighbor code, framework default, narrowing on the prior line). If you can't write that, it is not a blocker.
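The four questions and the extra blocker requirement can be expressed as a small filter. This Python sketch is one possible reading, with several assumptions called out: the names are hypothetical, the severity ladder `blocker > warning > suggestion` is assumed, and the text leaves "downgrade or drop" to judgment, whereas the sketch drops when questions 1 to 3 fail and downgrades when only the severity is not defensible.

```python
from dataclasses import dataclass

SEVERITIES = ["blocker", "warning", "suggestion"]  # assumed ladder, highest first


@dataclass
class GateAnswers:
    exact_line: bool
    concrete_failure_mode: bool
    context_read: bool
    severity_defensible: bool
    # Snippet + failure scenario + why existing guards don't catch it.
    blocker_evidence: bool = False


def pre_report_gate(severity, answers):
    """Return the severity to file the finding at, or None to drop it."""
    # Assumption: without line, failure mode, and context it is not a real issue.
    if not (answers.exact_line and answers.concrete_failure_mode and answers.context_read):
        return None
    if severity == "blocker" and not answers.blocker_evidence:
        severity = "warning"
    if not answers.severity_defensible:
        idx = SEVERITIES.index(severity)
        severity = SEVERITIES[min(idx + 1, len(SEVERITIES) - 1)]
    return severity
```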
|
|
129
|
+
|
|
130
|
+
## 2d. Common False Positives — skip these
|
|
131
|
+
|
|
132
|
+
Recurring LLM mis-flags. Each has a disqualifying condition — check it before filing. The fix is always *trace it*, not *pattern-match the syntax*.
|
|
133
|
+
|
|
134
|
+
- **"Possible null deref"** when the preceding line narrows the type (`if (!x) return`, early-return, `?.` already guarding) → trace the type flow; drop.
|
|
135
|
+
- **"N+1 query"** on a fixed-cardinality loop (known small constant) or a DataLoader / batched path → not N+1; drop.
|
|
136
|
+
- **"Missing await"** on an intentionally detached call (`void promise`, fire-and-forget with a comment, a queued job) → check for `void` / comment first; drop.
|
|
137
|
+
- **"Unhandled promise rejection"** on a promise that is `.catch`-chained or `await`ed in a `try` → trace the chain; drop.
|
|
138
|
+
- **"Math.random() is insecure"** in a non-crypto context (jitter, sampling, test data, cache-bust) → security theater; drop. Flag only on tokens/keys/IDs.
|
|
139
|
+
- **"Missing input validation"** when a traced caller already validates at the boundary → trace one caller; internal code may trust it (errors-at-the-boundary). Drop or route as a boundary note.
|
|
140
|
+
- **"Magic number / no constant"**, **"add a comment"**, **"could be more generic"** with no project anchor → style preference; drop (or §4.6 suggestion at most).
|
|
141
|
+
|
|
142
|
+
Closing test for any code-level finding: **would a senior engineer on this team actually change this in review?** If no, skip.
|
|
143
|
+
|
|
119
144
|
---
|
|
120
145
|
|
|
121
146
|
## 3. Complexity Gating
|
|
@@ -170,12 +195,13 @@ Evaluate:
|
|
|
170
195
|
|
|
171
196
|
### 4.2 Senior Engineer
|
|
172
197
|
|
|
173
|
-
Treat accepted decisions as constraints — cite ADR ID before suggesting an override.
|
|
198
|
+
Treat accepted decisions as constraints — cite ADR ID before suggesting an override. On PR/pre-commit, run the Pre-Report Gate (§2c) and check §2d before filing any code-level finding.
|
|
174
199
|
|
|
175
200
|
Evaluate:
|
|
176
201
|
- **Build vs Buy** *(design only)*: Was prior art research thorough? If a well-maintained library exists that the manifest doesn't mention, **blocker**.
|
|
177
202
|
- **Simplicity (YAGNI ladder)**: Walk `../references/principles.md` §4 in order — could it not exist (YAGNI), does the stdlib do it, does a native platform feature cover it, does an installed dep solve it, is it one line? Flag the first rung the code skipped: unnecessary abstraction, indirection, premature generalization, config-driven where a direct call would do. Abstract on the third real use, not the first (**Rule of Three**) — two copies is not yet a pattern. Every finding **names the concrete replacement** (`stdlib: 27-line validator → "@" in email, 1 line`), not just "this seems complex" — a finding the author can't act on is noise.
|
|
178
203
|
- **Architecture**: Decisions sensible for this codebase? Will this paint us into a corner?
|
|
204
|
+
- **Unrecorded decision**: Does the change make an architectural or technology choice — new dependency, pattern, module boundary, NFR target — with no ADR recorded? An architectural decision without a decision record → finding; route to an ADR via `grimoire-draft`, and check it satisfies the capability-surface rule (ADR-0036). Apply the novelty gate (`grimoire-audit` §3) — don't flag default tooling picks.
|
|
179
205
|
- **Conventions** *(PR/pre-commit)*: Does new code match file layout, naming, and patterns already in the touched areas? Check `.grimoire/docs/<area>.md` if present.
|
|
180
206
|
- **Reuse / reinvention**: Existing utilities re-implemented (`grep` similar names; area-doc reusable lists), or stdlib / native-platform / installed-dep functionality hand-rolled (principles.md §3 — don't reinvent the wheel). Name what already does the job.
|
|
181
207
|
- **Dead code** *(PR/pre-commit)*: Functions added but not called, imports unused, commented-out code, stubs with no implementation.
|
|
@@ -221,6 +247,8 @@ Most reviews: only **Data disclosure** + **Linking/Identifying** apply — skip
|
|
|
221
247
|
|
|
222
248
|
#### Code-level scan *(PR/pre-commit only)*
|
|
223
249
|
|
|
250
|
+
Run the Pre-Report Gate (§2c) and check §2d before filing — security findings draw the most reflexive false positives (theoretical injection on validated input, `Math.random()` outside crypto).
|
|
251
|
+
|
|
224
252
|
- **Secrets**: Grep diff for hardcoded keys, tokens, passwords, cloud credentials, JWT secrets. Any hit = **blocker**.
|
|
225
253
|
- **Injection**: Raw SQL with string concatenation, shell-exec with user input, `eval`/`exec`, unsafe deserialization. Tag OWASP + CWE.
|
|
226
254
|
- **Input validation**: New endpoints without schema validation, file uploads without size/type limits, path params used directly in filesystem calls.
|
|
@@ -288,7 +316,7 @@ Verify the diff matches the project's code-style and comment standards. This is
|
|
|
288
316
|
4. Lint/format config in repo root: `.editorconfig`, `eslint.config.*`, `.prettierrc*`, `pyproject.toml` (ruff/black), `.rubocop.yml`, `rustfmt.toml`, `.golangci.yml`, etc.
|
|
289
317
|
5. **Neighboring files** in the touched directories — derive convention from what already exists when no config exists
|
|
290
318
|
|
|
291
|
-
If none of the above pin a rule, **don't invent one**. Style preferences without a project anchor are dropped.
|
|
319
|
+
If none of the above pin a rule, **don't invent one**. Style preferences without a project anchor are dropped. The Pre-Report Gate (§2c) and §2d apply here too — most style nits without a config anchor are §2d false positives.
|
|
292
320
|
|
|
293
321
|
#### Evaluate
|
|
294
322
|
|
|
package/templates/learnings.md
ADDED
@@ -0,0 +1,40 @@
|
|
|
1
|
+
# Learnings — <change-id>
|
|
2
|
+
|
|
3
|
+
<!--
|
|
4
|
+
Ephemeral working memory for this change. Lives only in
|
|
5
|
+
`.grimoire/changes/<change-id>/` and is **removed at finalize** with the rest
|
|
6
|
+
of the scaffolding — nothing here persists to the repo. Re-read it at the start
|
|
7
|
+
of every task section and before every retry.
|
|
8
|
+
|
|
9
|
+
Two sections, two lifecycles. Keep them separate; never write either into
|
|
10
|
+
`AGENTS.md`.
|
|
11
|
+
-->
|
|
12
|
+
|
|
13
|
+
## Failure-mode notes
|
|
14
|
+
|
|
15
|
+
<!--
|
|
16
|
+
Transient. One line per dead end: what was tried and why it failed, so the next
|
|
17
|
+
attempt does not repeat it. This is the antidote to thrashing — a stuck retry
|
|
18
|
+
MUST read this section first. Pruned per task: delete a task's entries the
|
|
19
|
+
moment that task goes green. Never promoted anywhere.
|
|
20
|
+
-->
|
|
21
|
+
|
|
22
|
+
Format: `- <task-id> · tried <approach> · failed: <observed error / why>`
|
|
23
|
+
|
|
24
|
+
- 2.2 · tried mocking the client wrapper · failed: mock satisfied an assertion prod code never reaches — mock at the HTTP boundary instead
|
|
25
|
+
|
|
26
|
+
## Discovered facts
|
|
27
|
+
|
|
28
|
+
<!--
|
|
29
|
+
Durable facts about the project learned while implementing — a build flag, a
|
|
30
|
+
convention, an undocumented contract, an architectural constraint. Staged here
|
|
31
|
+
only until reconciled into the one home that owns that fact at finalize, then
|
|
32
|
+
cleared. Recording the destination home makes reconciliation mechanical and
|
|
33
|
+
lets the user correct the routing — that reconciliation is what keeps the fact
|
|
34
|
+
from going stale, because it then lives where the project's own changes keep it
|
|
35
|
+
honest.
|
|
36
|
+
-->
|
|
37
|
+
|
|
38
|
+
Format: `- fact: <what was learned> → home: <area doc | decision | constraint | schema | feature>`
|
|
39
|
+
|
|
40
|
+
- fact: the bdd suite needs `TZ=UTC` or time-based scenarios flake → home: `.grimoire/docs/<area>.md`
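Since each discovered fact records its destination home, reconciliation at finalize can start from a mechanical parse. The Python sketch below assumes the `- fact: … → home: …` line format shown above; `parse_fact` is a hypothetical helper, not a grimoire function.

```python
def parse_fact(line: str) -> tuple[str, str]:
    """Split one Discovered-facts line into (fact, home)."""
    prefix = "- fact:"
    if not line.startswith(prefix):
        raise ValueError(f"not a discovered-fact line: {line!r}")
    body = line[len(prefix):]
    # Split on the last marker so an arrow inside the fact text is kept.
    fact, marker, home = body.rpartition("→ home:")
    if not marker:
        raise ValueError(f"missing destination home: {line!r}")
    return fact.strip(), home.strip().strip("`")
```

A line with no home raises instead of being routed by guesswork, which leaves the routing decision with the user.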
|