@kiwidata/grimoire 0.2.2 → 0.3.0
package/AGENTS.md
CHANGED
@@ -248,7 +248,15 @@ This is what makes `grimoire trace` work. Without it, the commit is invisible to
 ### Decision Numbering
 - Sequential, zero-padded: `0001-`, `0002-`, etc.
 - Never reuse numbers
-
+
+### Decision Lifecycle
+Status moves `proposed → accepted → (deprecated | superseded by NNNN)`:
+- `proposed` — drafted, not yet adopted.
+- `accepted` — in force; treated as a constraint by every stage.
+- `deprecated` — no longer recommended, with no direct replacement (the need went away).
+- `superseded by NNNN` — replaced by a newer decision.
+
+Supersession is **two-way and explicit**: the superseding ADR back-links the one it replaces (in Context or Decision Drivers), and the superseded ADR keeps its number with status set to `superseded by NNNN`. This is the only home for the link — don't restate it elsewhere.
 
 ### Step Definitions
 Organize by **domain concept**, NOT by feature file. Check the project's existing test setup and match its BDD framework conventions. See the active skill's testing reference for ecosystem-specific patterns.
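Reviewer note on the Decision Lifecycle hunk above: the status flow reads as a small state machine with a two-way supersession step. The following is a minimal Python sketch of that reading, not code from the package. It assumes ADRs are plain dicts with `id`, `status`, and `context` keys; `ALLOWED`, `kind`, `can_transition`, and `supersede` are hypothetical names introduced only for illustration.

```python
import re

# Hypothetical model of the lifecycle in the hunk above; none of these names
# come from the grimoire package.
SUPERSEDED = re.compile(r"superseded by \d{4}")

# Allowed moves: proposed -> accepted -> (deprecated | superseded by NNNN).
ALLOWED = {
    "proposed": {"accepted"},
    "accepted": {"deprecated", "superseded"},
}


def kind(status: str) -> str:
    """Collapse 'superseded by NNNN' to the bare kind 'superseded'."""
    return "superseded" if SUPERSEDED.fullmatch(status) else status


def can_transition(current: str, new: str) -> bool:
    """True when the move follows the documented status flow."""
    return kind(new) in ALLOWED.get(kind(current), set())


def supersede(old: dict, new: dict) -> None:
    """Record supersession both ways: status on the old ADR, back-link on the new."""
    target = f"superseded by {new['id']}"
    if not can_transition(old["status"], target):
        raise ValueError(f"cannot supersede an ADR in status {old['status']!r}")
    # The old ADR keeps its number; only its status changes.
    old["status"] = target
    # Back-link lives in Context (assumed here to be a plain string field).
    new["context"] = (new.get("context", "") + f" Supersedes {old['id']}.").strip()
```

Under this sketch, `deprecated` and `superseded by NNNN` are terminal: nothing transitions out of them, which matches the arrow direction in the hunk.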
package/package.json
CHANGED
@@ -107,6 +107,9 @@ For confirmed items, create a grimoire change:
 Group related items into single changes — don't create one change per discovery.
 
 ### 6. Dead Feature Detection
+
+**Detection is deterministic.** Every dead/stale finding cites exact `file:line` (or ADR id) evidence from a reproducible check — codebase-memory-mcp graph queries (`search_graph` / `get_architecture`) per [0029]/[0030], with `grep` / `git blame` only where the graph has no answer (e.g. `@skip` age). The same commit yields the same findings. The LLM summarizes and interviews; it does not score the codebase by impression.
+
 Check for documented features and decisions that may no longer be accurate:
 
 **Dead features** — feature files that describe behavior the code no longer implements:
@@ -137,7 +140,15 @@ After the interview, summarize:
 - How many decisions are documented vs. undocumented
 - How many decisions are stale
 - How many conventions files drifted vs. up-to-date
-
+
+Then emit a **Top Actions** list — most-risk first, each with the exact path and the single next move. The ranking comes from the deterministic checks (§6), not impression, so the same commit yields the same list:
+
+```markdown
+## Top Actions
+1. `features/billing/invoice.feature` — dead (InvoiceView deleted ~3mo ago); create a removal change.
+2. `.grimoire/decisions/0007-search-backend.md` — stale (library no longer in deps); deprecate or update.
+3. `.grimoire/docs/conventions/api.md` — drifted (views moved to `src/api/handlers/`); refresh.
+```
 
 ## Important
 - This is a COLLABORATIVE process, not a dump. Interview, don't lecture.
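Reviewer note on the Top Actions hunk above: "the same commit yields the same list" holds only if the ordering depends on the findings alone and never on the order in which checks happened to report them. A minimal Python sketch of one way to get that property follows. The risk order (dead, then stale, then drifted) is an assumption read off the sample list, and `Finding`, `RISK_ORDER`, and `top_actions` are hypothetical names, not part of the package.

```python
from dataclasses import dataclass

# Assumed risk order, taken from the order of the sample list in the hunk.
RISK_ORDER = {"dead": 0, "stale": 1, "drifted": 2}


@dataclass(frozen=True)
class Finding:
    kind: str       # "dead" | "stale" | "drifted"
    path: str       # exact path the finding refers to
    evidence: str   # file:line or ADR id from a reproducible check
    next_move: str  # the single next action


def top_actions(findings: list[Finding]) -> list[str]:
    """Order findings by risk, then path, so input order never changes the result."""
    ranked = sorted(findings, key=lambda f: (RISK_ORDER[f.kind], f.path))
    return [
        f"{i}. `{f.path}` - {f.kind} ({f.evidence}); {f.next_move}"
        for i, f in enumerate(ranked, start=1)
    ]
```

Sorting on a total key (risk, then path) is the design choice that matters here: two runs over the same findings produce identical output even if the checks report in a different order.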
@@ -55,6 +55,8 @@ Before touching any production code:
 2. Run it — **it MUST FAIL**, reproducing the bug
 3. If it passes, your test doesn't actually reproduce the bug. Fix the test until it fails for the right reason.
 
+**Name it after the bug.** This repro test stays as the permanent regression test — name it so the bug is obvious (`test_password_reset_special_chars`; scenario "Password reset with plus-sign email"). One bug → one named regression test. This is how the same bug doesn't come back: a future change that reintroduces it goes red on a test that names the defect.
+
 This is non-negotiable. A bug fix without a reproduction test is a guess that might work. A failing test is proof you understand the problem.
 
 ### 4. Document the Bug
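Reviewer note on the "Name it after the bug" hunk above: a sketch of what such a regression test might look like in Python. Only the test name `test_password_reset_special_chars` comes from the hunk. The function `is_valid_reset_email`, the regex, and the imagined defect (a pattern that rejected `+` in the local part) are hypothetical stand-ins so the example is self-contained.

```python
import re

# Hypothetical fixed implementation. The imagined buggy version used
# r"[\w.]+@[\w.]+", which rejected "+" in the local part.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w.-]+\.\w+")


def is_valid_reset_email(email: str) -> bool:
    """Return True when a password reset may be requested for this address."""
    return EMAIL_RE.fullmatch(email) is not None


def test_password_reset_special_chars():
    # Regression: reset failed for plus-sign addresses such as this one.
    assert is_valid_reset_email("ana+work@example.com")
    # Guard against over-correcting: malformed addresses are still rejected.
    assert not is_valid_reset_email("ana+work@")
```

Against the imagined buggy pattern the first assertion fails, which is the red state the hunk asks for; with the fix it passes and the test stays in the suite under a name that points at the defect.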
@@ -193,6 +195,7 @@ Report to the user:
 
 ## Important
 - **Reproduce before you fix.** No exceptions. If you can't reproduce it, you don't understand it, and your fix is a guess.
+- **The test is the source of truth, not your self-review.** When the same agent writes a fix and then reviews it, the same wrong assumption rides into both steps — "looks correct" is not evidence. The red→green of the named regression test (and the configured suites) is the proof. Don't declare a bug fixed on a code re-read; declare it fixed when the mechanical gate flips and stays green.
 - **Small fixes only.** If the bug fix requires significant architectural changes, it's not a bug fix — route to `grimoire-draft` for a proper change.
 - **Don't over-document.** The test is the documentation. A one-line comment in the test explaining the bug is enough. Don't create tracking files, bug reports, or manifests for a bug fix.
 - **The feature file is truth.** If a scenario describes behavior the user now says is wrong, that's a spec change, not a bug. Handle it through `grimoire-draft`.
@@ -118,9 +118,27 @@ For each step definition:
 - **[warning]** `test_auth.py:58` — step "Then user should exist" only asserts `is not None` — check the actual user properties
 ```
 
+**Regression coverage:** When verifying a bug fix, confirm the fix ships with a regression test **named after the bug** (see `grimoire-bug`). A bug fix with no test that goes red-without-the-fix and pins the defect → WARNING — the bug can silently return.
+
 If `grimoire test-quality` CLI command is available, suggest running it for a comprehensive analysis.
 To run tests directly: use `config.tools.bdd_test` for BDD and `config.tools.unit_test` for unit tests.
 
+### 3.E Behavioral Verification *(optional — user-facing changes only)*
+
+Sections 3.B–3.D verify statically (code exists, asserts, follows decisions) and run the configured suites. They do **not** drive the running app. When the change is user-facing and the app can be run, add a behavioral pass; otherwise skip and say so. This mode adds **no mandatory dependency** — if there's no way to drive the app, mark it INCONCLUSIVE and rely on 3.A–3.D.
+
+**Read-only by default.** Read-only navigation/inspection needs no opt-in. Any state-changing action requires explicit user opt-in **and** a non-production target (local/staging URL, seeded creds). Never run mutations against production.
+
+**Verdict.** Every behavioral pass ends in exactly one:
+- **SHIP** — behavior matches the spec; no material issues.
+- **SHIP WITH FIXES** — works, with the non-blocking issues listed.
+- **DO NOT SHIP** — a scenario's promised outcome does not hold.
+- **INCONCLUSIVE** — could not verify (no baseline, app wouldn't run, tooling absent).
+
+**No baseline ⇒ INCONCLUSIVE, never a silent PASS.** Same rule as §3.C2: without a reference state you cannot claim behavior is correct. Report INCONCLUSIVE and fall back to static verification — do not dress up "I couldn't check" as a pass.
+
+**Click-path final-state check.** For each touchpoint the change affects, build a side-effect map — `action → {state it sets, state it resets}` — then trace the sequence and ask: *is the FINAL state what the label/spec promises?* This catches the silent-undo class (action B resets what action A just set) that static reading and single-assert tests miss.
+
 ### 4. Security Compliance Verification
 
 Verify that security guidance from plan and review stages was followed in implementation. Read `../references/security-compliance.md` for the full checklist.
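Reviewer note on the click-path final-state check in the hunk above: the side-effect map can be replayed mechanically. The sketch below is one possible Python reading, assuming state is a flat dict and that "reset" means falling back to the initial value. `final_state`, `silent_undos`, and the sample actions (`apply_filter`, `change_page`) are hypothetical and not taken from the package.

```python
# Hypothetical side-effect map: action -> (state it sets, state it resets).
SideEffects = dict[str, tuple[dict[str, object], set[str]]]


def final_state(path: list[str], effects: SideEffects, initial: dict) -> dict:
    """Replay a click path and return the state left after the last action."""
    state = dict(initial)
    for action in path:
        sets, resets = effects[action]
        for key in resets:
            state[key] = initial.get(key)  # assumed: reset restores the initial value
        state.update(sets)
    return state


def silent_undos(path: list[str], effects: SideEffects, initial: dict, promised: dict) -> list[str]:
    """Return the keys whose final value differs from what the label/spec promises."""
    final = final_state(path, effects, initial)
    return sorted(k for k, want in promised.items() if final.get(k) != want)


effects = {
    "apply_filter": ({"filter": "unpaid"}, set()),
    "change_page": ({"page": 2}, {"filter"}),  # changing page clears the filter
}
initial = {"filter": None, "page": 1}
promised = {"filter": "unpaid", "page": 2}
print(silent_undos(["apply_filter", "change_page"], effects, initial, promised))
# -> ['filter']
```

Each action passes on its own (the filter is set, the page changes); only the trace over the whole sequence shows that the second action undid the first, which is the class of defect the hunk describes.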
@@ -222,6 +240,7 @@ Produce a structured report:
 - Scenarios verified: X
 - Decisions verified: X
 - Security checks: X passed, X failed
+- Behavioral verdict: <SHIP | SHIP WITH FIXES | DO NOT SHIP | INCONCLUSIVE | n/a (static only)>
 - Issues found: X critical, X warnings, X suggestions
 
 ## Critical Issues
@@ -116,6 +116,31 @@ Severity inflation patterns to avoid:
 - "Untested edge case" when no scenario in the briefing covers it → not a blocker.
 - "Missing observability" on a level 1-2 change → suggestion, never blocker.
 
+## 2c. Pre-Report Gate *(diff-review personas: Senior Engineer code-level, Security code-level scan, Code Style)*
+
+The materiality gate (§2) asks "does this matter to *this* project". The Pre-Report Gate asks the prior question: "is this *even a real issue*". Both apply; this one runs first on code-level findings. Before writing any finding, answer four questions:
+
+1. **Exact line** — can you cite the precise `file:line` the finding lives at?
+2. **Concrete failure mode** — can you state input → state → bad outcome? Not "could be unsafe" — the actual trigger and consequence.
+3. **Context read** — have you read the callers, imports, and tests around the line, not just the hunk? Trace the type and the caller before claiming a flaw.
+4. **Severity defensible** — would the §2b severity survive the Contrarian?
+
+Any "no / unsure" → downgrade or drop. A **blocker** additionally requires the offending snippet, the failure scenario, and **why existing guards don't already catch it** (neighbor code, framework default, narrowing on the prior line). If you can't write that, it is not a blocker.
+
+## 2d. Common False Positives — skip these
+
+Recurring LLM mis-flags. Each has a disqualifying condition — check it before filing. The fix is always *trace it*, not *pattern-match the syntax*.
+
+- **"Possible null deref"** when the preceding line narrows the type (`if (!x) return`, early-return, `?.` already guarding) → trace the type flow; drop.
+- **"N+1 query"** on a fixed-cardinality loop (known small constant) or a DataLoader / batched path → not N+1; drop.
+- **"Missing await"** on an intentionally detached call (`void promise`, fire-and-forget with a comment, a queued job) → check for `void` / comment first; drop.
+- **"Unhandled promise rejection"** on a promise that is `.catch`-chained or `await`ed in a `try` → trace the chain; drop.
+- **"Math.random() is insecure"** in a non-crypto context (jitter, sampling, test data, cache-bust) → security theater; drop. Flag only on tokens/keys/IDs.
+- **"Missing input validation"** when a traced caller already validates at the boundary → trace one caller; internal code may trust it (errors-at-the-boundary). Drop or route as a boundary note.
+- **"Magic number / no constant"**, **"add a comment"**, **"could be more generic"** with no project anchor → style preference; drop (or §4.6 suggestion at most).
+
+Closing test for any code-level finding: **would a senior engineer on this team actually change this in review?** If no, skip.
+
 ---
 
 ## 3. Complexity Gating
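Reviewer note on the Pre-Report Gate hunk above: the four questions plus the extra blocker requirement can be read as a small decision function. The Python sketch below is one such reading, not the package's logic. The hunk says only "downgrade or drop" without saying when to do which, so the split used here (drop when Q1 or Q2 fails, downgrade one level when Q3 or Q4 fails) is an assumption, as are the names `CodeFinding`, `LEVELS`, `downgrade`, and `pre_report_gate`.

```python
from dataclasses import dataclass

LEVELS = ["suggestion", "warning", "blocker"]  # lowest to highest


@dataclass
class CodeFinding:
    severity: str              # one of LEVELS
    exact_line: bool           # Q1: precise file:line cited
    failure_mode: bool         # Q2: input -> state -> bad outcome stated
    context_read: bool         # Q3: callers, imports, and tests were read
    severity_defensible: bool  # Q4: severity would survive the Contrarian
    snippet: str = ""          # blocker only: the offending code
    scenario: str = ""         # blocker only: the failure scenario
    why_guards_miss: str = ""  # blocker only: why existing guards don't catch it


def downgrade(severity: str) -> str:
    """Step down one level; a suggestion that is downgraded is dropped."""
    i = LEVELS.index(severity)
    return LEVELS[i - 1] if i > 0 else "drop"


def pre_report_gate(f: CodeFinding) -> str:
    """Return the severity to file at, or "drop" (assumed policy, see note above)."""
    if not (f.exact_line and f.failure_mode):
        return "drop"  # nothing concrete to cite
    severity = f.severity
    if not (f.context_read and f.severity_defensible):
        severity = downgrade(severity)
    if severity == "blocker" and not (f.snippet and f.scenario and f.why_guards_miss):
        severity = "warning"  # a blocker needs all three pieces of evidence
    return severity
```

The point of writing it this way is that a blocker is the hardest state to reach: it needs four "yes" answers and three non-empty pieces of evidence, which mirrors the closing sentence of §2c.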
@@ -170,12 +195,13 @@ Evaluate:
 
 ### 4.2 Senior Engineer
 
-Treat accepted decisions as constraints — cite ADR ID before suggesting an override.
+Treat accepted decisions as constraints — cite ADR ID before suggesting an override. On PR/pre-commit, run the Pre-Report Gate (§2c) and check §2d before filing any code-level finding.
 
 Evaluate:
 - **Build vs Buy** *(design only)*: Was prior art research thorough? If a well-maintained library exists that the manifest doesn't mention, **blocker**.
 - **Simplicity (YAGNI ladder)**: Walk `../references/principles.md` §4 in order — could it not exist (YAGNI), does the stdlib do it, does a native platform feature cover it, does an installed dep solve it, is it one line? Flag the first rung the code skipped: unnecessary abstraction, indirection, premature generalization, config-driven where a direct call would do. Abstract on the third real use, not the first (**Rule of Three**) — two copies is not yet a pattern. Every finding **names the concrete replacement** (`stdlib: 27-line validator → "@" in email, 1 line`), not just "this seems complex" — a finding the author can't act on is noise.
 - **Architecture**: Decisions sensible for this codebase? Will this paint us into a corner?
+- **Unrecorded decision**: Does the change make an architectural or technology choice — new dependency, pattern, module boundary, NFR target — with no ADR recorded? An architectural decision without a decision record → finding; route to an ADR via `grimoire-draft`, and check it satisfies the capability-surface rule (ADR-0036). Apply the novelty gate (`grimoire-audit` §3) — don't flag default tooling picks.
 - **Conventions** *(PR/pre-commit)*: Does new code match file layout, naming, and patterns already in the touched areas? Check `.grimoire/docs/<area>.md` if present.
 - **Reuse / reinvention**: Existing utilities re-implemented (`grep` similar names; area-doc reusable lists), or stdlib / native-platform / installed-dep functionality hand-rolled (principles.md §3 — don't reinvent the wheel). Name what already does the job.
 - **Dead code** *(PR/pre-commit)*: Functions added but not called, imports unused, commented-out code, stubs with no implementation.
@@ -221,6 +247,8 @@ Most reviews: only **Data disclosure** + **Linking/Identifying** apply — skip
 
 #### Code-level scan *(PR/pre-commit only)*
 
+Run the Pre-Report Gate (§2c) and check §2d before filing — security findings draw the most reflexive false positives (theoretical injection on validated input, `Math.random()` outside crypto).
+
 - **Secrets**: Grep diff for hardcoded keys, tokens, passwords, cloud credentials, JWT secrets. Any hit = **blocker**.
 - **Injection**: Raw SQL with string concatenation, shell-exec with user input, `eval`/`exec`, unsafe deserialization. Tag OWASP + CWE.
 - **Input validation**: New endpoints without schema validation, file uploads without size/type limits, path params used directly in filesystem calls.
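Reviewer note on the code-level scan hunk above: the `Math.random()` carve-out has a close analogue in other languages. The sketch below shows it in Python, chosen only to match the other notes on this diff. `retry_delay` and `new_reset_token` are hypothetical functions; the library calls (`random.uniform`, `secrets.token_urlsafe`) are from the Python standard library.

```python
import random
import secrets


def retry_delay(attempt: int, base: float = 0.5) -> float:
    """Exponential backoff with jitter: a non-crypto use, so `random` is fine."""
    return base * (2 ** attempt) * random.uniform(0.5, 1.5)


def new_reset_token() -> str:
    """Security-sensitive value: a non-crypto PRNG here would be a real finding."""
    return secrets.token_urlsafe(32)
```

A reviewer applying §2d would leave `retry_delay` alone and would flag `new_reset_token` only if it had been written with `random` instead of `secrets`.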
@@ -288,7 +316,7 @@ Verify the diff matches the project's code-style and comment standards. This is
 4. Lint/format config in repo root: `.editorconfig`, `eslint.config.*`, `.prettierrc*`, `pyproject.toml` (ruff/black), `.rubocop.yml`, `rustfmt.toml`, `.golangci.yml`, etc.
 5. **Neighboring files** in the touched directories — derive convention from what already exists when no config exists
 
-If none of the above pin a rule, **don't invent one**. Style preferences without a project anchor are dropped.
+If none of the above pin a rule, **don't invent one**. Style preferences without a project anchor are dropped. The Pre-Report Gate (§2c) and §2d apply here too — most style nits without a config anchor are §2d false positives.
 
 #### Evaluate
 