@kiwidata/grimoire 0.2.2 → 0.3.0

package/AGENTS.md CHANGED
@@ -248,7 +248,15 @@ This is what makes `grimoire trace` work. Without it, the commit is invisible to
  ### Decision Numbering
  - Sequential, zero-padded: `0001-`, `0002-`, etc.
  - Never reuse numbers
- - Superseded decisions keep their number, status updated to `superseded by NNNN`
+
+ ### Decision Lifecycle
+ Status moves `proposed → accepted → (deprecated | superseded by NNNN)`:
+ - `proposed` — drafted, not yet adopted.
+ - `accepted` — in force; treated as a constraint by every stage.
+ - `deprecated` — no longer recommended, with no direct replacement (the need went away).
+ - `superseded by NNNN` — replaced by a newer decision.
+
+ Supersession is **two-way and explicit**: the superseding ADR back-links the one it replaces (in Context or Decision Drivers), and the superseded ADR keeps its number with status set to `superseded by NNNN`. This is the only home for the link — don't restate it elsewhere.
 
  ### Step Definitions
  Organize by **domain concept**, NOT by feature file. Check the project's existing test setup and match its BDD framework conventions. See the active skill's testing reference for ecosystem-specific patterns.
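Note on the Decision Lifecycle hunk above: the status rule can be read as a small state machine plus a two-way link check. The sketch below is one possible reading, not grimoire's implementation. The names `ALLOWED`, `can_move`, and `supersession_is_two_way` are hypothetical, the transition table takes the arrow notation strictly (so `proposed` cannot jump straight to `deprecated`), and the back-link check is simplified to "the new ADR's text mentions the old number" rather than inspecting the Context or Decision Drivers sections.

```python
import re

# Allowed transitions under a strict reading of
# `proposed -> accepted -> (deprecated | superseded by NNNN)`.
# "superseded" stands for any status of the form "superseded by NNNN".
ALLOWED = {
    "proposed": {"accepted"},
    "accepted": {"deprecated", "superseded"},
    "deprecated": set(),
    "superseded": set(),
}

SUPERSEDED_RE = re.compile(r"^superseded by (\d{4})$")


def kind(status: str) -> str:
    """Collapse 'superseded by NNNN' to 'superseded'; pass other statuses through."""
    return "superseded" if SUPERSEDED_RE.match(status) else status


def can_move(current: str, new: str) -> bool:
    """True if the (assumed) lifecycle permits moving from `current` to `new`."""
    return kind(new) in ALLOWED.get(kind(current), set())


def supersession_is_two_way(old_id: str, old_status: str, new_id: str, new_body: str) -> bool:
    """Check both directions of the link.

    Forward: the old ADR's status names the new ADR's number.
    Back: the new ADR's text mentions the old ADR's number (simplified check).
    """
    match = SUPERSEDED_RE.match(old_status)
    return bool(match) and match.group(1) == new_id and old_id in new_body
```

For example, `can_move("accepted", "superseded by 0012")` holds, while `can_move("deprecated", "accepted")` does not, since both terminal statuses have no outgoing transitions in this reading.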
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "@kiwidata/grimoire",
- "version": "0.2.2",
+ "version": "0.3.0",
  "description": "Gherkin + MADR spec-driven development for AI coding assistants",
  "type": "module",
  "bin": {
@@ -107,6 +107,9 @@ For confirmed items, create a grimoire change:
  Group related items into single changes — don't create one change per discovery.
 
  ### 6. Dead Feature Detection
+
+ **Detection is deterministic.** Every dead/stale finding cites exact `file:line` (or ADR id) evidence from a reproducible check — codebase-memory-mcp graph queries (`search_graph` / `get_architecture`) per [0029]/[0030], with `grep` / `git blame` only where the graph has no answer (e.g. `@skip` age). The same commit yields the same findings. The LLM summarizes and interviews; it does not score the codebase by impression.
+
  Check for documented features and decisions that may no longer be accurate:
 
  **Dead features** — feature files that describe behavior the code no longer implements:
@@ -137,7 +140,15 @@ After the interview, summarize:
  - How many decisions are documented vs. undocumented
  - How many decisions are stale
  - How many conventions files drifted vs. up-to-date
- - Suggest which areas to address first (highest risk / most complex / most frequently changed)
+
+ Then emit a **Top Actions** list — most-risk first, each with the exact path and the single next move. The ranking comes from the deterministic checks (§6), not impression, so the same commit yields the same list:
+
+ ```markdown
+ ## Top Actions
+ 1. `features/billing/invoice.feature` — dead (InvoiceView deleted ~3mo ago); create a removal change.
+ 2. `.grimoire/decisions/0007-search-backend.md` — stale (library no longer in deps); deprecate or update.
+ 3. `.grimoire/docs/conventions/api.md` — drifted (views moved to `src/api/handlers/`); refresh.
+ ```
 
  ## Important
  - This is a COLLABORATIVE process, not a dump. Interview, don't lecture.
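Note on the Top Actions hunk above: "the same commit yields the same list" implies that the ranking is a pure function of the findings, with a total ordering that does not depend on the order in which checks happened to run. A minimal sketch of that property follows. The `Finding` type, the `RISK` order (dead before stale before drifted), and the path tie-break are assumptions made for illustration. The rendered line uses a colon separator instead of the dash shown in the diff.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    path: str       # exact path the finding points at
    kind: str       # "dead", "stale", or "drifted"
    evidence: str   # short evidence string from a reproducible check
    next_move: str  # the single next action


# Hypothetical risk order: lower number means more risk.
RISK = {"dead": 0, "stale": 1, "drifted": 2}


def top_actions(findings: list[Finding]) -> str:
    """Render a Top Actions list whose order is independent of input order."""
    # Sorting on (risk, path) gives a total order, so ties cannot reshuffle.
    ranked = sorted(findings, key=lambda f: (RISK[f.kind], f.path))
    lines = ["## Top Actions"]
    for i, f in enumerate(ranked, start=1):
        lines.append(f"{i}. `{f.path}`: {f.kind} ({f.evidence}); {f.next_move}.")
    return "\n".join(lines)
```

Feeding the same findings in any order produces the same text, which is the reproducibility the hunk asks for. The scoring inputs themselves would still have to come from the deterministic checks.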
@@ -55,6 +55,8 @@ Before touching any production code:
  2. Run it — **it MUST FAIL**, reproducing the bug
  3. If it passes, your test doesn't actually reproduce the bug. Fix the test until it fails for the right reason.
 
+ **Name it after the bug.** This repro test stays as the permanent regression test — name it so the bug is obvious (`test_password_reset_special_chars`; scenario "Password reset with plus-sign email"). One bug → one named regression test. This is how the same bug doesn't come back: a future change that reintroduces it goes red on a test that names the defect.
+
  This is non-negotiable. A bug fix without a reproduction test is a guess that might work. A failing test is proof you understand the problem.
 
  ### 4. Document the Bug
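Note on the "Name it after the bug" hunk above: a named regression test could look like the sketch below. Only the test name comes from the diff. The function `build_reset_link`, the URL, and the bug itself (a raw `+` in a query string being decoded as a space) are invented for illustration.

```python
from urllib.parse import parse_qs, quote, urlsplit


def build_reset_link(email: str) -> str:
    """Hypothetical fixed function: percent-encode the email for the query string.

    The imagined bug interpolated the raw email, so "+" was decoded as a space.
    """
    return f"https://example.test/reset?email={quote(email, safe='')}"


def test_password_reset_special_chars():
    # Regression: plus-sign emails were mangled in the password reset link.
    email = "ana+billing@example.com"
    link = build_reset_link(email)
    decoded = parse_qs(urlsplit(link).query)["email"][0]
    assert decoded == email
```

Reverting the `quote` call makes this test go red, and the test name points straight at the defect, which is the property the hunk describes.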
@@ -193,6 +195,7 @@ Report to the user:
 
  ## Important
  - **Reproduce before you fix.** No exceptions. If you can't reproduce it, you don't understand it, and your fix is a guess.
+ - **The test is the source of truth, not your self-review.** When the same agent writes a fix and then reviews it, the same wrong assumption rides into both steps — "looks correct" is not evidence. The red→green of the named regression test (and the configured suites) is the proof. Don't declare a bug fixed on a code re-read; declare it fixed when the mechanical gate flips and stays green.
  - **Small fixes only.** If the bug fix requires significant architectural changes, it's not a bug fix — route to `grimoire-draft` for a proper change.
  - **Don't over-document.** The test is the documentation. A one-line comment in the test explaining the bug is enough. Don't create tracking files, bug reports, or manifests for a bug fix.
  - **The feature file is truth.** If a scenario describes behavior the user now says is wrong, that's a spec change, not a bug. Handle it through `grimoire-draft`.
@@ -118,9 +118,27 @@ For each step definition:
  - **[warning]** `test_auth.py:58` — step "Then user should exist" only asserts `is not None` — check the actual user properties
  ```
 
+ **Regression coverage:** When verifying a bug fix, confirm the fix ships with a regression test **named after the bug** (see `grimoire-bug`). A bug fix with no test that goes red-without-the-fix and pins the defect → WARNING — the bug can silently return.
+
  If `grimoire test-quality` CLI command is available, suggest running it for a comprehensive analysis.
  To run tests directly: use `config.tools.bdd_test` for BDD and `config.tools.unit_test` for unit tests.
 
+ ### 3.E Behavioral Verification *(optional — user-facing changes only)*
+
+ Sections 3.B–3.D verify statically (code exists, asserts, follows decisions) and run the configured suites. They do **not** drive the running app. When the change is user-facing and the app can be run, add a behavioral pass; otherwise skip and say so. This mode adds **no mandatory dependency** — if there's no way to drive the app, mark it INCONCLUSIVE and rely on 3.A–3.D.
+
+ **Read-only by default.** Read-only navigation/inspection needs no opt-in. Any state-changing action requires explicit user opt-in **and** a non-production target (local/staging URL, seeded creds). Never run mutations against production.
+
+ **Verdict.** Every behavioral pass ends in exactly one:
+ - **SHIP** — behavior matches the spec; no material issues.
+ - **SHIP WITH FIXES** — works, with the non-blocking issues listed.
+ - **DO NOT SHIP** — a scenario's promised outcome does not hold.
+ - **INCONCLUSIVE** — could not verify (no baseline, app wouldn't run, tooling absent).
+
+ **No baseline ⇒ INCONCLUSIVE, never a silent PASS.** Same rule as §3.C2: without a reference state you cannot claim behavior is correct. Report INCONCLUSIVE and fall back to static verification — do not dress up "I couldn't check" as a pass.
+
+ **Click-path final-state check.** For each touchpoint the change affects, build a side-effect map — `action → {state it sets, state it resets}` — then trace the sequence and ask: *is the FINAL state what the label/spec promises?* This catches the silent-undo class (action B resets what action A just set) that static reading and single-assert tests miss.
+
  ### 4. Security Compliance Verification
 
  Verify that security guidance from plan and review stages was followed in implementation. Read `../references/security-compliance.md` for the full checklist.
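Note on the click-path final-state check in the hunk above: the side-effect map can be traced mechanically. The sketch below assumes state is a set of boolean flags and that each action lists the flags it sets and the flags it resets. The names `SideEffects`, `final_state`, and `silent_undos`, and the example flags in the usage note, are hypothetical.

```python
# action -> (flags it sets, flags it resets)
SideEffects = dict[str, tuple[set[str], set[str]]]


def final_state(side_effects: SideEffects, path: list[str]) -> set[str]:
    """Replay the click path and return the flags that are on at the end."""
    state: set[str] = set()
    for action in path:
        sets, resets = side_effects[action]
        state |= sets
        state -= resets
    return state


def silent_undos(side_effects: SideEffects, path: list[str], promised: set[str]) -> set[str]:
    """Promised flags that are missing from the final state."""
    return promised - final_state(side_effects, path)
```

With a map such as `{"enable_filter": ({"filter_on"}, set()), "change_tab": (set(), {"filter_on"})}`, the path `["enable_filter", "change_tab"]` leaves `filter_on` unset even though the first action set it. A test that asserts only after `enable_filter` would miss this.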
@@ -222,6 +240,7 @@ Produce a structured report:
  - Scenarios verified: X
  - Decisions verified: X
  - Security checks: X passed, X failed
+ - Behavioral verdict: <SHIP | SHIP WITH FIXES | DO NOT SHIP | INCONCLUSIVE | n/a (static only)>
  - Issues found: X critical, X warnings, X suggestions
 
  ## Critical Issues
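Note on the behavioral verdict line added above: the four verdicts and the "no baseline means INCONCLUSIVE" rule can be expressed as a small decision function. This is a sketch under assumptions. The inputs (`has_baseline`, `app_ran`, `failed_scenarios`, `minor_issues`) are hypothetical names, and a real pass would weigh more than two counters.

```python
from enum import Enum
from typing import Optional


class Verdict(Enum):
    SHIP = "SHIP"
    SHIP_WITH_FIXES = "SHIP WITH FIXES"
    DO_NOT_SHIP = "DO NOT SHIP"
    INCONCLUSIVE = "INCONCLUSIVE"


def behavioral_verdict(has_baseline: bool, app_ran: bool,
                       failed_scenarios: int, minor_issues: int) -> Verdict:
    """Pick exactly one verdict; an unverifiable run never counts as a pass."""
    if not (has_baseline and app_ran):
        return Verdict.INCONCLUSIVE
    if failed_scenarios > 0:
        return Verdict.DO_NOT_SHIP
    if minor_issues > 0:
        return Verdict.SHIP_WITH_FIXES
    return Verdict.SHIP


def report_line(verdict: Optional[Verdict]) -> str:
    """Format the report entry; None means no behavioral pass was attempted."""
    label = verdict.value if verdict is not None else "n/a (static only)"
    return f"- Behavioral verdict: {label}"
```

Checking the baseline first is the design point: without it, zero observed failures would otherwise fall through to SHIP.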
@@ -116,6 +116,31 @@ Severity inflation patterns to avoid:
  - "Untested edge case" when no scenario in the briefing covers it → not a blocker.
  - "Missing observability" on a level 1-2 change → suggestion, never blocker.
 
+ ## 2c. Pre-Report Gate *(diff-review personas: Senior Engineer code-level, Security code-level scan, Code Style)*
+
+ The materiality gate (§2) asks "does this matter to *this* project". The Pre-Report Gate asks the prior question: "is this *even a real issue*". Both apply; this one runs first on code-level findings. Before writing any finding, answer four questions:
+
+ 1. **Exact line** — can you cite the precise `file:line` the finding lives at?
+ 2. **Concrete failure mode** — can you state input → state → bad outcome? Not "could be unsafe" — the actual trigger and consequence.
+ 3. **Context read** — have you read the callers, imports, and tests around the line, not just the hunk? Trace the type and the caller before claiming a flaw.
+ 4. **Severity defensible** — would the §2b severity survive the Contrarian?
+
+ Any "no / unsure" → downgrade or drop. A **blocker** additionally requires the offending snippet, the failure scenario, and **why existing guards don't already catch it** (neighbor code, framework default, narrowing on the prior line). If you can't write that, it is not a blocker.
+
+ ## 2d. Common False Positives — skip these
+
+ Recurring LLM mis-flags. Each has a disqualifying condition — check it before filing. The fix is always *trace it*, not *pattern-match the syntax*.
+
+ - **"Possible null deref"** when the preceding line narrows the type (`if (!x) return`, early-return, `?.` already guarding) → trace the type flow; drop.
+ - **"N+1 query"** on a fixed-cardinality loop (known small constant) or a DataLoader / batched path → not N+1; drop.
+ - **"Missing await"** on an intentionally detached call (`void promise`, fire-and-forget with a comment, a queued job) → check for `void` / comment first; drop.
+ - **"Unhandled promise rejection"** on a promise that is `.catch`-chained or `await`ed in a `try` → trace the chain; drop.
+ - **"Math.random() is insecure"** in a non-crypto context (jitter, sampling, test data, cache-bust) → security theater; drop. Flag only on tokens/keys/IDs.
+ - **"Missing input validation"** when a traced caller already validates at the boundary → trace one caller; internal code may trust it (errors-at-the-boundary). Drop or route as a boundary note.
+ - **"Magic number / no constant"**, **"add a comment"**, **"could be more generic"** with no project anchor → style preference; drop (or §4.6 suggestion at most).
+
+ Closing test for any code-level finding: **would a senior engineer on this team actually change this in review?** If no, skip.
+
  ---
 
  ## 3. Complexity Gating
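Note on the Pre-Report Gate (§2c) in the hunk above: the four questions and the extra blocker requirements can be read as a filter applied before a finding is filed. The diff says "downgrade or drop" without stating which applies when, so the policy below is an assumption. A "no" on questions 1 to 3 drops the finding, a "no" on question 4 downgrades it one step, and a blocker missing any of its three supporting items becomes a warning. The `Candidate` type and the `LADDER` of severities are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical severity ladder, most severe first.
LADDER = ["blocker", "warning", "suggestion"]


@dataclass
class Candidate:
    severity: str              # one of LADDER
    exact_line: bool           # Q1: precise file:line cited
    failure_mode: bool         # Q2: input -> state -> bad outcome stated
    context_read: bool         # Q3: callers, imports, and tests read
    severity_defensible: bool  # Q4: severity survives challenge
    snippet: str = ""          # blocker only: the offending code
    failure_scenario: str = ""  # blocker only: how it fails
    why_guards_miss: str = ""   # blocker only: why existing guards don't catch it


def downgrade(severity: str) -> str:
    """Move one step down the ladder; falling off the end means drop."""
    i = LADDER.index(severity)
    return LADDER[i + 1] if i + 1 < len(LADDER) else "drop"


def pre_report_gate(c: Candidate) -> str:
    """Return the severity to file under the assumed policy, or 'drop'."""
    if not (c.exact_line and c.failure_mode and c.context_read):
        return "drop"
    severity = c.severity
    if not c.severity_defensible:
        severity = downgrade(severity)
    if severity == "blocker" and not (c.snippet and c.failure_scenario and c.why_guards_miss):
        severity = "warning"  # cannot justify a blocker without all three items
    return severity
```

The §2d checks would run alongside this gate. They are left out here because each one depends on tracing real code.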
@@ -170,12 +195,13 @@ Evaluate:
 
  ### 4.2 Senior Engineer
 
- Treat accepted decisions as constraints — cite ADR ID before suggesting an override.
+ Treat accepted decisions as constraints — cite ADR ID before suggesting an override. On PR/pre-commit, run the Pre-Report Gate (§2c) and check §2d before filing any code-level finding.
 
  Evaluate:
  - **Build vs Buy** *(design only)*: Was prior art research thorough? If a well-maintained library exists that the manifest doesn't mention, **blocker**.
  - **Simplicity (YAGNI ladder)**: Walk `../references/principles.md` §4 in order — could it not exist (YAGNI), does the stdlib do it, does a native platform feature cover it, does an installed dep solve it, is it one line? Flag the first rung the code skipped: unnecessary abstraction, indirection, premature generalization, config-driven where a direct call would do. Abstract on the third real use, not the first (**Rule of Three**) — two copies is not yet a pattern. Every finding **names the concrete replacement** (`stdlib: 27-line validator → "@" in email, 1 line`), not just "this seems complex" — a finding the author can't act on is noise.
  - **Architecture**: Decisions sensible for this codebase? Will this paint us into a corner?
+ - **Unrecorded decision**: Does the change make an architectural or technology choice — new dependency, pattern, module boundary, NFR target — with no ADR recorded? An architectural decision without a decision record → finding; route to an ADR via `grimoire-draft`, and check it satisfies the capability-surface rule (ADR-0036). Apply the novelty gate (`grimoire-audit` §3) — don't flag default tooling picks.
  - **Conventions** *(PR/pre-commit)*: Does new code match file layout, naming, and patterns already in the touched areas? Check `.grimoire/docs/<area>.md` if present.
  - **Reuse / reinvention**: Existing utilities re-implemented (`grep` similar names; area-doc reusable lists), or stdlib / native-platform / installed-dep functionality hand-rolled (principles.md §3 — don't reinvent the wheel). Name what already does the job.
  - **Dead code** *(PR/pre-commit)*: Functions added but not called, imports unused, commented-out code, stubs with no implementation.
@@ -221,6 +247,8 @@ Most reviews: only **Data disclosure** + **Linking/Identifying** apply — skip
 
  #### Code-level scan *(PR/pre-commit only)*
 
+ Run the Pre-Report Gate (§2c) and check §2d before filing — security findings draw the most reflexive false positives (theoretical injection on validated input, `Math.random()` outside crypto).
+
  - **Secrets**: Grep diff for hardcoded keys, tokens, passwords, cloud credentials, JWT secrets. Any hit = **blocker**.
  - **Injection**: Raw SQL with string concatenation, shell-exec with user input, `eval`/`exec`, unsafe deserialization. Tag OWASP + CWE.
  - **Input validation**: New endpoints without schema validation, file uploads without size/type limits, path params used directly in filesystem calls.
@@ -288,7 +316,7 @@ Verify the diff matches the project's code-style and comment standards. This is
  4. Lint/format config in repo root: `.editorconfig`, `eslint.config.*`, `.prettierrc*`, `pyproject.toml` (ruff/black), `.rubocop.yml`, `rustfmt.toml`, `.golangci.yml`, etc.
  5. **Neighboring files** in the touched directories — derive convention from what already exists when no config exists
 
- If none of the above pin a rule, **don't invent one**. Style preferences without a project anchor are dropped.
+ If none of the above pin a rule, **don't invent one**. Style preferences without a project anchor are dropped. The Pre-Report Gate (§2c) and §2d apply here too — most style nits without a config anchor are §2d false positives.
 
  #### Evaluate