npm - @kiwidata/grimoire - Versions diffs - 0.2.2 → 0.3.0 - Mend

@kiwidata/grimoire 0.2.2 → 0.3.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (6) hide show

package/AGENTS.md +9 -1
package/package.json +1 -1
package/skills/grimoire-audit/SKILL.md +12 -1
package/skills/grimoire-bug/SKILL.md +3 -0
package/skills/grimoire-verify/SKILL.md +19 -0
package/skills/references/review-personas.md +30 -2

package/AGENTS.md CHANGED Viewed

@@ -248,7 +248,15 @@ This is what makes `grimoire trace` work. Without it, the commit is invisible to
 ### Decision Numbering
 - Sequential, zero-padded: `0001-`, `0002-`, etc.
 - Never reuse numbers
-- Superseded decisions keep their number, status updated to `superseded by NNNN`
+### Decision Lifecycle
+Status moves `proposed → accepted → (deprecated | superseded by NNNN)`:
+- `proposed` — drafted, not yet adopted.
+- `accepted` — in force; treated as a constraint by every stage.
+- `deprecated` — no longer recommended, with no direct replacement (the need went away).
+- `superseded by NNNN` — replaced by a newer decision.
+Supersession is **two-way and explicit**: the superseding ADR back-links the one it replaces (in Context or Decision Drivers), and the superseded ADR keeps its number with status set to `superseded by NNNN`. This is the only home for the link — don't restate it elsewhere.
 ### Step Definitions
 Organize by **domain concept**, NOT by feature file. Check the project's existing test setup and match its BDD framework conventions. See the active skill's testing reference for ecosystem-specific patterns.

package/package.json CHANGED Viewed

@@ -1,6 +1,6 @@
 {
   "name": "@kiwidata/grimoire",
-  "version": "0.2.2",
+  "version": "0.3.0",
   "description": "Gherkin + MADR spec-driven development for AI coding assistants",
   "type": "module",
   "bin": {

package/skills/grimoire-audit/SKILL.md CHANGED Viewed

@@ -107,6 +107,9 @@ For confirmed items, create a grimoire change:
 Group related items into single changes — don't create one change per discovery.
 ### 6. Dead Feature Detection
+**Detection is deterministic.** Every dead/stale finding cites exact `file:line` (or ADR id) evidence from a reproducible check — codebase-memory-mcp graph queries (`search_graph` / `get_architecture`) per [0029]/[0030], with `grep` / `git blame` only where the graph has no answer (e.g. `@skip` age). The same commit yields the same findings. The LLM summarizes and interviews; it does not score the codebase by impression.
 Check for documented features and decisions that may no longer be accurate:
 **Dead features** — feature files that describe behavior the code no longer implements:
@@ -137,7 +140,15 @@ After the interview, summarize:
 - How many decisions are documented vs. undocumented
 - How many decisions are stale
 - How many conventions files drifted vs. up-to-date
-- Suggest which areas to address first (highest risk / most complex / most frequently changed)
+Then emit a **Top Actions** list — most-risk first, each with the exact path and the single next move. The ranking comes from the deterministic checks (§6), not impression, so the same commit yields the same list:
+```markdown
+## Top Actions
+1. `features/billing/invoice.feature` — dead (InvoiceView deleted ~3mo ago); create a removal change.
+2. `.grimoire/decisions/0007-search-backend.md` — stale (library no longer in deps); deprecate or update.
+3. `.grimoire/docs/conventions/api.md` — drifted (views moved to `src/api/handlers/`); refresh.
+```
 ## Important
 - This is a COLLABORATIVE process, not a dump. Interview, don't lecture.

package/skills/grimoire-bug/SKILL.md CHANGED Viewed

@@ -55,6 +55,8 @@ Before touching any production code:
 2. Run it — **it MUST FAIL**, reproducing the bug
 3. If it passes, your test doesn't actually reproduce the bug. Fix the test until it fails for the right reason.
+**Name it after the bug.** This repro test stays as the permanent regression test — name it so the bug is obvious (`test_password_reset_special_chars`; scenario "Password reset with plus-sign email"). One bug → one named regression test. This is how the same bug doesn't come back: a future change that reintroduces it goes red on a test that names the defect.
 This is non-negotiable. A bug fix without a reproduction test is a guess that might work. A failing test is proof you understand the problem.
 ### 4. Document the Bug
@@ -193,6 +195,7 @@ Report to the user:
 ## Important
 - **Reproduce before you fix.** No exceptions. If you can't reproduce it, you don't understand it, and your fix is a guess.
+- **The test is the source of truth, not your self-review.** When the same agent writes a fix and then reviews it, the same wrong assumption rides into both steps — "looks correct" is not evidence. The red→green of the named regression test (and the configured suites) is the proof. Don't declare a bug fixed on a code re-read; declare it fixed when the mechanical gate flips and stays green.
 - **Small fixes only.** If the bug fix requires significant architectural changes, it's not a bug fix — route to `grimoire-draft` for a proper change.
 - **Don't over-document.** The test is the documentation. A one-line comment in the test explaining the bug is enough. Don't create tracking files, bug reports, or manifests for a bug fix.
 - **The feature file is truth.** If a scenario describes behavior the user now says is wrong, that's a spec change, not a bug. Handle it through `grimoire-draft`.

package/skills/grimoire-verify/SKILL.md CHANGED Viewed

@@ -118,9 +118,27 @@ For each step definition:
    - **[warning]** `test_auth.py:58` — step "Then user should exist" only asserts `is not None` — check the actual user properties
    ```
+**Regression coverage:** When verifying a bug fix, confirm the fix ships with a regression test **named after the bug** (see `grimoire-bug`). A bug fix with no test that goes red-without-the-fix and pins the defect → WARNING — the bug can silently return.
 If `grimoire test-quality` CLI command is available, suggest running it for a comprehensive analysis.
 To run tests directly: use `config.tools.bdd_test` for BDD and `config.tools.unit_test` for unit tests.
+### 3.E Behavioral Verification *(optional — user-facing changes only)*
+Sections 3.B–3.D verify statically (code exists, asserts, follows decisions) and run the configured suites. They do **not** drive the running app. When the change is user-facing and the app can be run, add a behavioral pass; otherwise skip and say so. This mode adds **no mandatory dependency** — if there's no way to drive the app, mark it INCONCLUSIVE and rely on 3.A–3.D.
+**Read-only by default.** Read-only navigation/inspection needs no opt-in. Any state-changing action requires explicit user opt-in **and** a non-production target (local/staging URL, seeded creds). Never run mutations against production.
+**Verdict.** Every behavioral pass ends in exactly one:
+- **SHIP** — behavior matches the spec; no material issues.
+- **SHIP WITH FIXES** — works, with the non-blocking issues listed.
+- **DO NOT SHIP** — a scenario's promised outcome does not hold.
+- **INCONCLUSIVE** — could not verify (no baseline, app wouldn't run, tooling absent).
+**No baseline ⇒ INCONCLUSIVE, never a silent PASS.** Same rule as §3.C2: without a reference state you cannot claim behavior is correct. Report INCONCLUSIVE and fall back to static verification — do not dress up "I couldn't check" as a pass.
+**Click-path final-state check.** For each touchpoint the change affects, build a side-effect map — `action → {state it sets, state it resets}` — then trace the sequence and ask: *is the FINAL state what the label/spec promises?* This catches the silent-undo class (action B resets what action A just set) that static reading and single-assert tests miss.
 ### 4. Security Compliance Verification
 Verify that security guidance from plan and review stages was followed in implementation. Read `../references/security-compliance.md` for the full checklist.
@@ -222,6 +240,7 @@ Produce a structured report:
 - Scenarios verified: X
 - Decisions verified: X
 - Security checks: X passed, X failed
+- Behavioral verdict: <SHIP | SHIP WITH FIXES | DO NOT SHIP | INCONCLUSIVE | n/a (static only)>
 - Issues found: X critical, X warnings, X suggestions
 ## Critical Issues

package/skills/references/review-personas.md CHANGED Viewed

@@ -116,6 +116,31 @@ Severity inflation patterns to avoid:
 - "Untested edge case" when no scenario in the briefing covers it → not a blocker.
 - "Missing observability" on a level 1-2 change → suggestion, never blocker.
+## 2c. Pre-Report Gate *(diff-review personas: Senior Engineer code-level, Security code-level scan, Code Style)*
+The materiality gate (§2) asks "does this matter to *this* project". The Pre-Report Gate asks the prior question: "is this *even a real issue*". Both apply; this one runs first on code-level findings. Before writing any finding, answer four questions:
+1. **Exact line** — can you cite the precise `file:line` the finding lives at?
+2. **Concrete failure mode** — can you state input → state → bad outcome? Not "could be unsafe" — the actual trigger and consequence.
+3. **Context read** — have you read the callers, imports, and tests around the line, not just the hunk? Trace the type and the caller before claiming a flaw.
+4. **Severity defensible** — would the §2b severity survive the Contrarian?
+Any "no / unsure" → downgrade or drop. A **blocker** additionally requires the offending snippet, the failure scenario, and **why existing guards don't already catch it** (neighbor code, framework default, narrowing on the prior line). If you can't write that, it is not a blocker.
+## 2d. Common False Positives — skip these
+Recurring LLM mis-flags. Each has a disqualifying condition — check it before filing. The fix is always *trace it*, not *pattern-match the syntax*.
+- **"Possible null deref"** when the preceding line narrows the type (`if (!x) return`, early-return, `?.` already guarding) → trace the type flow; drop.
+- **"N+1 query"** on a fixed-cardinality loop (known small constant) or a DataLoader / batched path → not N+1; drop.
+- **"Missing await"** on an intentionally detached call (`void promise`, fire-and-forget with a comment, a queued job) → check for `void` / comment first; drop.
+- **"Unhandled promise rejection"** on a promise that is `.catch`-chained or `await`ed in a `try` → trace the chain; drop.
+- **"Math.random() is insecure"** in a non-crypto context (jitter, sampling, test data, cache-bust) → security theater; drop. Flag only on tokens/keys/IDs.
+- **"Missing input validation"** when a traced caller already validates at the boundary → trace one caller; internal code may trust it (errors-at-the-boundary). Drop or route as a boundary note.
+- **"Magic number / no constant"**, **"add a comment"**, **"could be more generic"** with no project anchor → style preference; drop (or §4.6 suggestion at most).
+Closing test for any code-level finding: **would a senior engineer on this team actually change this in review?** If no, skip.
 ---
 ## 3. Complexity Gating
@@ -170,12 +195,13 @@ Evaluate:
 ### 4.2 Senior Engineer
-Treat accepted decisions as constraints — cite ADR ID before suggesting an override.
+Treat accepted decisions as constraints — cite ADR ID before suggesting an override. On PR/pre-commit, run the Pre-Report Gate (§2c) and check §2d before filing any code-level finding.
 Evaluate:
 - **Build vs Buy** *(design only)*: Was prior art research thorough? If a well-maintained library exists that the manifest doesn't mention, **blocker**.
 - **Simplicity (YAGNI ladder)**: Walk `../references/principles.md` §4 in order — could it not exist (YAGNI), does the stdlib do it, does a native platform feature cover it, does an installed dep solve it, is it one line? Flag the first rung the code skipped: unnecessary abstraction, indirection, premature generalization, config-driven where a direct call would do. Abstract on the third real use, not the first (**Rule of Three**) — two copies is not yet a pattern. Every finding **names the concrete replacement** (`stdlib: 27-line validator → "@" in email, 1 line`), not just "this seems complex" — a finding the author can't act on is noise.
 - **Architecture**: Decisions sensible for this codebase? Will this paint us into a corner?
+- **Unrecorded decision**: Does the change make an architectural or technology choice — new dependency, pattern, module boundary, NFR target — with no ADR recorded? An architectural decision without a decision record → finding; route to an ADR via `grimoire-draft`, and check it satisfies the capability-surface rule (ADR-0036). Apply the novelty gate (`grimoire-audit` §3) — don't flag default tooling picks.
 - **Conventions** *(PR/pre-commit)*: Does new code match file layout, naming, and patterns already in the touched areas? Check `.grimoire/docs/<area>.md` if present.
 - **Reuse / reinvention**: Existing utilities re-implemented (`grep` similar names; area-doc reusable lists), or stdlib / native-platform / installed-dep functionality hand-rolled (principles.md §3 — don't reinvent the wheel). Name what already does the job.
 - **Dead code** *(PR/pre-commit)*: Functions added but not called, imports unused, commented-out code, stubs with no implementation.
@@ -221,6 +247,8 @@ Most reviews: only **Data disclosure** + **Linking/Identifying** apply — skip
 #### Code-level scan *(PR/pre-commit only)*
+Run the Pre-Report Gate (§2c) and check §2d before filing — security findings draw the most reflexive false positives (theoretical injection on validated input, `Math.random()` outside crypto).
 - **Secrets**: Grep diff for hardcoded keys, tokens, passwords, cloud credentials, JWT secrets. Any hit = **blocker**.
 - **Injection**: Raw SQL with string concatenation, shell-exec with user input, `eval`/`exec`, unsafe deserialization. Tag OWASP + CWE.
 - **Input validation**: New endpoints without schema validation, file uploads without size/type limits, path params used directly in filesystem calls.
@@ -288,7 +316,7 @@ Verify the diff matches the project's code-style and comment standards. This is
 4. Lint/format config in repo root: `.editorconfig`, `eslint.config.*`, `.prettierrc*`, `pyproject.toml` (ruff/black), `.rubocop.yml`, `rustfmt.toml`, `.golangci.yml`, etc.
 5. **Neighboring files** in the touched directories — derive convention from what already exists when no config exists
-If none of the above pin a rule, **don't invent one**. Style preferences without a project anchor are dropped.
+If none of the above pin a rule, **don't invent one**. Style preferences without a project anchor are dropped. The Pre-Report Gate (§2c) and §2d apply here too — most style nits without a config anchor are §2d false positives.
 #### Evaluate