@static-var/keystone 0.1.0
- package/.agents/plugins/marketplace.json +24 -0
- package/.claude-plugin/marketplace.json +24 -0
- package/.claude-plugin/plugin.json +12 -0
- package/.codex-plugin/plugin.json +12 -0
- package/.pi/extensions/keystone.ts +172 -0
- package/HOW_IT_WORKS.md +424 -0
- package/Makefile +19 -0
- package/README.md +253 -0
- package/package.json +86 -0
- package/packaging.allowlist +32 -0
- package/scripts/build-metadata.py +99 -0
- package/scripts/package-keystone.sh +59 -0
- package/scripts/validate-keystone.py +261 -0
- package/scripts/validate-package.py +140 -0
- package/skills/keystone/SKILL.md +69 -0
- package/skills/keystone/modules/breakdown.md +239 -0
- package/skills/keystone/modules/build.md +284 -0
- package/skills/keystone/modules/debug.md +198 -0
- package/skills/keystone/modules/gates/isolation.md +56 -0
- package/skills/keystone/modules/gates/proof.md +54 -0
- package/skills/keystone/modules/gates/red.md +59 -0
- package/skills/keystone/modules/gates/review.md +56 -0
- package/skills/keystone/modules/gates/ship.md +57 -0
- package/skills/keystone/modules/health.md +124 -0
- package/skills/keystone/modules/helpers/subagents.md +134 -0
- package/skills/keystone/modules/research.md +86 -0
- package/skills/keystone/modules/review.md +270 -0
- package/skills/keystone/modules/router.md +36 -0
- package/skills/keystone/modules/shape.md +125 -0
- package/skills/keystone/modules/ship.md +130 -0
|
@@ -0,0 +1,198 @@
|
|
|
1
|
+
# Keystone Debug Module
|
|
2
|
+
|
|
3
|
+
## Core principle
|
|
4
|
+
Find the root cause before fixing. Debugging is an evidence ladder: observe the failure, reproduce it, minimize it, trace the mechanism, test falsifiable hypotheses, prove the cause, fix narrowly, guard against regression, verify with exact output, and clean up. No guess-and-check, no cargo-cult edits, no shipping/finalization work.
|
|
5
|
+
|
|
6
|
+
Quick start:
|
|
7
|
+
1. Capture the exact symptom and smallest known failing command/input.
|
|
8
|
+
2. Reproduce the failure, or preserve the best available evidence if reproduction is blocked.
|
|
9
|
+
3. Minimize, then test one falsifiable hypothesis at a time.
|
|
10
|
+
4. Prove the mechanism before fixing; add a regression guard and verify with exact output.
|
|
11
|
+
|
|
12
|
+
## Load when
|
|
13
|
+
Load when the user reports an error, failing test, broken behavior, regression, flaky result, performance anomaly, unexpected output, integration failure, suspicious logs, silent failure, data corruption, or asks to troubleshoot why something happened.
|
|
14
|
+
|
|
15
|
+
Also load when the task involves:
|
|
16
|
+
- deciding whether a failure is local, historical, environmental, data-dependent, timing-dependent, or cross-system;
|
|
17
|
+
- diagnosing nondeterminism, race conditions, retries, timeouts, queues, caches, or distributed boundaries;
|
|
18
|
+
- interpreting logs/traces/metrics to explain a symptom;
|
|
19
|
+
- proving whether a suspected fix actually addresses the cause.
|
|
20
|
+
|
|
21
|
+
## Not for
|
|
22
|
+
- Implementing new behavior unrelated to the failure.
|
|
23
|
+
- General code improvements without a reproduced problem.
|
|
24
|
+
- Shipping, release finalization, merge strategy, or deployment handoff; use `ship` after proof/review.
|
|
25
|
+
- Broad repository audits; use `health`.
|
|
26
|
+
- Spec decisions where no failure exists; use `shape`.
|
|
27
|
+
- Replacing review, test strategy, or gate validation modules.
|
|
28
|
+
|
|
29
|
+
## Outcome contract
|
|
30
|
+
Deliver a debug report with:
|
|
31
|
+
- symptom, impact, affected users/systems, and failure classification;
|
|
32
|
+
- reproduction steps, exact command/input/environment, or why reproduction was not possible;
|
|
33
|
+
- minimized failing case when feasible, including what was removed and what still fails;
|
|
34
|
+
- evidence gathered through logs, tests, code inspection, history, metrics, traces, or instrumentation;
|
|
35
|
+
- hypotheses considered, which were disproven, and the surviving root-cause hypothesis;
|
|
36
|
+
- proof that the root cause explains the symptom and predicts observed behavior;
|
|
37
|
+
- exact fix made or proposed, scoped to the proven cause;
|
|
38
|
+
- regression test, guard, monitor, or explicit reason none is feasible;
|
|
39
|
+
- verification commands and exact results/output evidence;
|
|
40
|
+
- cleanup performed, temporary diagnostics removed, and remaining uncertainty or escalation.
|
|
41
|
+
|
|
42
|
+
Stop only when one of these is true:
|
|
43
|
+
- the root cause is proven, the narrow fix is verified, and regression protection is in place or justified;
|
|
44
|
+
- reproduction is impossible after documented attempts and the best available evidence has been preserved;
|
|
45
|
+
- escalation criteria are met.
|
|
46
|
+
|
|
47
|
+
## Modes
|
|
48
|
+
- **Triage:** classify severity, scope, reproducibility, recency, ownership, and risk before fixing.
|
|
49
|
+
- **Reproduce/minimize:** create the smallest reliable case that demonstrates the failure.
|
|
50
|
+
- **Instrument/trace:** add temporary logs, probes, traces, assertions, metrics, or diagnostics to observe reality.
|
|
51
|
+
- **Hypothesis test:** test one falsifiable explanation at a time and record pass/fail evidence.
|
|
52
|
+
- **Fix:** make the smallest change that addresses the proven root cause.
|
|
53
|
+
- **Stabilize flaky behavior:** collect repeated runs, isolate nondeterminism, and prove the stabilizing change.
|
|
54
|
+
- **Performance investigation:** measure baseline, localize bottleneck, prove causality, then optimize narrowly.
|
|
55
|
+
- **Log/silent-failure investigation:** reconstruct the timeline from logs/traces/state when direct failure output is absent.
|
|
56
|
+
- **Escalation:** stop local edits and ask for data, access, owner input, incident handling, or risk approval.
|
|
57
|
+
|
|
58
|
+
## Process
|
|
59
|
+
1. **Classify the failure.** Decide whether it is deterministic, flaky, a regression, environment-specific, data-dependent, a multi-system boundary issue, performance-related, silent/log-only, data corruption, security-sensitive, or destructive. Record severity, scope, recency, and owner.
|
|
60
|
+
2. **Capture the exact symptom.** Include command, input, error text, expected vs actual, environment, versions, seed/timezone/locale, frequency, affected data, and recent changes.
|
|
61
|
+
3. **Reproduce before fixing.** Run the smallest known failing command or scenario. If reproduction is impossible, state the constraint, preserve available evidence, and switch to log/history/data analysis.
|
|
62
|
+
4. **Minimize the case.** Reduce inputs, files, flags, mocks, services, data rows, timing windows, browser/device matrix, or integration surface while keeping the failure. Do not minimize away the bug.
|
|
63
|
+
5. **Inspect code, data, and history.** Distinguish trigger from cause. If the failure is likely historical, use bisect or targeted history review before guessing.
|
|
64
|
+
6. **Trace/instrument narrowly.** Add the smallest temporary diagnostic that can confirm or falsify a hypothesis. Prefer assertions, structured logs, counters, spans, query plans, snapshots, or deterministic seeds over broad logging.
|
|
65
|
+
7. **Form falsifiable hypotheses.** A good hypothesis names a mechanism and prediction. Test one at a time. Record disproven hypotheses instead of silently abandoning them.
|
|
66
|
+
8. **Prove the root cause.** Show that the cause explains the symptom, reproduces or predicts the failure, and that removing/changing the cause removes the failure. Symptoms alone are not proof.
|
|
67
|
+
9. **Fix narrowly.** Change only the code/config/data handling required by the proven cause. Avoid opportunistic refactors, rewrites, or unrelated cleanup.
|
|
68
|
+
10. **Add a regression guard.** Prefer a failing-before/passing-after test. If impractical, add an assertion, monitor, fixture, replay, seed, contract test, migration check, or documented manual proof.
|
|
69
|
+
11. **Verify with exact output.** Run focused verification first, then broader commands if risk warrants. Capture command names and result snippets, not just “tests pass.”
|
|
70
|
+
12. **Clean up.** Remove temporary diagnostics, revert failed experiments, leave useful permanent observability only when justified, and report remaining uncertainty.
|
|
71
|
+
|
|
72
|
+
Debug decision tree / failure classification:
|
|
73
|
+
- **Cannot reproduce?** Verify environment/input parity, collect logs/traces/state, check recent changes, ask for missing data, and escalate if still blocked.
|
|
74
|
+
- **Reproduces reliably now but worked before?** Treat as historical regression; run targeted history review or bisect.
|
|
75
|
+
- **Fails intermittently?** Treat as flaky/race; run many iterations, fix seed/time/timezone/network where possible, and look for shared state, ordering, async waits, retries, clocks, locks, caches, and resource leaks.
|
|
76
|
+
- **Fails at a service boundary?** Trace request IDs across systems, and compare contracts, schemas, auth, serialization, retries, timeouts, idempotency, and partial failure handling.
|
|
77
|
+
- **Slow or resource-heavy?** Measure before changing, identify the bounded bottleneck, compare profiles/plans/metrics, and prove the optimization changes the measured bottleneck.
|
|
78
|
+
- **Only logs show failure or behavior is silent?** Reconstruct timeline from logs, traces, metrics, state transitions, audit records, and exit codes; add diagnostics only to close evidence gaps.
|
|
79
|
+
- **Data is wrong or corrupted?** Freeze destructive actions, preserve samples, and identify the write/read path, migration/import history, concurrency, validation gaps, and blast radius before fixing.
|
|
80
|
+
|
|
81
|
+
Operational playbook:
|
|
82
|
+
- **Reproduce:** exact command/input/environment; note frequency and baseline output.
|
|
83
|
+
- **Minimize:** remove variables until the smallest failing case remains; document removed variables.
|
|
84
|
+
- **Trace/instrument:** observe the suspected mechanism with scoped, reversible diagnostics.
|
|
85
|
+
- **Hypothesize:** write falsifiable mechanism + prediction; test one hypothesis per experiment.
|
|
86
|
+
- **Prove:** connect evidence to cause; show why alternatives fail or are less likely.
|
|
87
|
+
- **Fix narrowly:** edit only the proven mechanism.
|
|
88
|
+
- **Regression guard:** create failing-before/passing-after proof or equivalent guard.
|
|
89
|
+
- **Cleanup:** remove debug litter and failed experiments; keep only justified observability.
|
|
90
|
+
|
|
91
|
+
Bisect strategy when historical regression is likely:
|
|
92
|
+
- Establish one known-good and one known-bad revision using the same command, data, and environment.
|
|
93
|
+
- Make the reproduction deterministic enough for `git bisect`; if flaky, use a looped script with a clear pass/fail threshold.
|
|
94
|
+
- Keep the bisect command side-effect safe; reset generated files between runs.
|
|
95
|
+
- When bisect identifies a commit, inspect the diff for mechanism and still prove causality in current code.
|
|
96
|
+
- Do not treat the first bad commit as the root cause until the mechanism explains the symptom.
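
For a flaky reproduction, the looped script with a pass/fail threshold could look like the minimal sketch below. The names `run_once`, `bisect_verdict`, and `main`, the default of 20 runs, and the zero-failure threshold are illustrative assumptions, not part of Keystone. The return values follow the `git bisect run` convention: 0 marks the revision good, 1 marks it bad, and 125 asks bisect to skip it.

```python
import subprocess
from typing import Callable, Sequence


def run_once(cmd: Sequence[str]) -> bool:
    """Return True when the reproduction command exits with code 0."""
    return subprocess.run(cmd, capture_output=True).returncode == 0


def bisect_verdict(
    attempt: Callable[[], bool], runs: int = 20, max_failures: int = 0
) -> int:
    """Repeat `attempt` and map the outcome to a `git bisect run` exit code.

    Returns 1 (bad) as soon as failures exceed `max_failures`, else 0 (good).
    """
    failures = 0
    for _ in range(runs):
        if not attempt():
            failures += 1
            if failures > max_failures:
                return 1  # threshold crossed: stop early and mark bad
    return 0  # no more than `max_failures` failures in `runs` attempts


def main(argv: Sequence[str]) -> int:
    """Treat `argv` as the reproduction command (hypothetical entry point)."""
    if not argv:
        return 125  # nothing to run: ask bisect to skip this revision
    return bisect_verdict(lambda: run_once(argv))
```

Saved as a script that ends with `sys.exit(main(sys.argv[1:]))`, this could be passed to `git bisect run` together with the reproduction command. The run count should be chosen from the observed failure rate so that a bad revision is unlikely to pass by chance.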
|
|
97
|
+
|
|
98
|
+
Scenario checklists:
|
|
99
|
+
- **Regression:** What changed? Is there a known-good revision? Can the same command prove good vs bad? Is the failing behavior tied to code, dependency, config, data, or environment?
|
|
100
|
+
- **Flaky test/race:** How often does it fail over 20/50/100 runs? Does order, parallelism, clock, seed, async wait, network, cache, filesystem, or shared state affect it? Does instrumentation change timing?
|
|
101
|
+
- **Multi-system boundary:** What is the correlation/request ID? Which system first diverges from expected state? Are contracts, schemas, auth, encoding, idempotency, retries, timeout budgets, and partial failures aligned?
|
|
102
|
+
- **Performance:** What metric is bad and by how much? What is the baseline? Is the bottleneck CPU, memory, IO, network, DB, lock contention, rendering, bundle size, or algorithmic complexity? Does the fix improve that metric?
|
|
103
|
+
- **Logs/silent failure:** What timeline do logs/traces/metrics imply? Are there swallowed exceptions, ignored return values, missing awaits, nonzero exits, dropped events, sampling gaps, or log-level/config differences?
|
|
104
|
+
- **Data corruption:** What data is affected? Is the source of truth known? Which writer last touched it? Are migrations, backfills, imports, concurrent writes, validation, serialization, timezone/locale, or precision involved? Is rollback/destructive repair safe?
|
|
105
|
+
|
|
106
|
+
Good/bad hypotheses:
|
|
107
|
+
- **Good:** “The checkout total is doubled because retrying `capturePayment` replays a non-idempotent side effect; if true, two calls with the same request ID will create two ledger rows.”
|
|
108
|
+
- **Bad:** “Payments are broken.”
|
|
109
|
+
- **Good:** “The test flakes because it asserts before the debounce timer fires; if true, using fake timers or awaiting the debounce settles it across 100 runs.”
|
|
110
|
+
- **Bad:** “Probably async weirdness.”
|
|
111
|
+
- **Good:** “The query slowed after commit X because the new filter prevents index `idx_orders_account_created` from being used; if true, `EXPLAIN` will show a sequential scan and restoring the predicate shape will restore the plan.”
|
|
112
|
+
- **Bad:** “The database is slow.”
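
A good hypothesis can usually be turned into an executable prediction. The sketch below restates the first example (a retried capture replaying a non-idempotent side effect) against an in-memory stand-in. The `Ledger` class, its methods, and the request ID are hypothetical and only illustrate the shape of the experiment.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class Ledger:
    """In-memory stand-in for a payment ledger (hypothetical)."""

    rows: List[Tuple[str, int]] = field(default_factory=list)
    seen_request_ids: Set[str] = field(default_factory=set)

    def capture_naive(self, request_id: str, amount_cents: int) -> None:
        # Non-idempotent: every call appends a row, including retries.
        self.rows.append((request_id, amount_cents))

    def capture_idempotent(self, request_id: str, amount_cents: int) -> None:
        # Idempotent: a repeated request ID is ignored.
        if request_id in self.seen_request_ids:
            return
        self.seen_request_ids.add(request_id)
        self.rows.append((request_id, amount_cents))


def rows_after_retry(idempotent: bool) -> int:
    """Issue the same request twice and count the resulting ledger rows."""
    ledger = Ledger()
    capture = ledger.capture_idempotent if idempotent else ledger.capture_naive
    for _ in range(2):  # original call plus one retry
        capture("req-123", 4200)
    return len(ledger.rows)
```

The prediction is falsifiable: if the naive path produced one row after a retry, the hypothesis would be disproven and should be recorded as such.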
|
|
113
|
+
|
|
114
|
+
Good/bad minimization:
|
|
115
|
+
- **Good:** Reduce a failing import from a production-sized CSV to three rows that preserve the bad encoding, duplicate key, and null timestamp that trigger the failure.
|
|
116
|
+
- **Bad:** Replace the import with a mock that no longer exercises parsing, deduplication, or timestamp handling.
|
|
117
|
+
- **Good:** Reduce a browser failure to one route, one viewport, one user role, and one API response fixture while keeping the visible defect.
|
|
118
|
+
- **Bad:** Disable authentication, caching, and the API client so the boundary bug disappears.
|
|
119
|
+
- **Good:** For a flaky test, run the same test alone, in file order, shuffled, with fixed seed, and in parallel to identify the minimal timing/order dependency.
|
|
120
|
+
- **Bad:** Add arbitrary sleeps until the failure stops appearing once.
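
Good minimization keeps checking that the failure still reproduces after each removal. A minimal sketch of that loop, using one-at-a-time removal (a simplified relative of delta debugging), is shown below. The `minimize` function and the `still_fails` predicate are hypothetical names; in practice the predicate would run the real failing command against the reduced input.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def minimize(items: Sequence[T], still_fails: Callable[[List[T]], bool]) -> List[T]:
    """Greedily drop items one at a time while the failure still reproduces."""
    current = list(items)
    if not still_fails(current):
        raise ValueError("the starting case does not fail; nothing to minimize")
    i = 0
    while i < len(current):
        candidate = current[:i] + current[i + 1:]
        if still_fails(candidate):
            current = candidate  # item i was irrelevant; keep it removed
        else:
            i += 1  # item i is needed to trigger the failure
    return current
```

Because every candidate is re-checked against the real failure, this approach cannot "minimize away the bug"; it stops at a case where removing any single remaining item makes the failure disappear.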
|
|
121
|
+
|
|
122
|
+
Escalation/stuck criteria:
|
|
123
|
+
- Escalate after **three disproven hypotheses** without a stronger next test.
|
|
124
|
+
- Escalate when there is **no reproduction** and required logs/data/access are unavailable.
|
|
125
|
+
- Escalate immediately for destructive actions, data-loss risk, security/privacy exposure, production incident impact, legal/compliance concerns, or uncertain repair of corrupted data.
|
|
126
|
+
- Escalate when the next diagnostic requires credentials, production data, high-cost infrastructure, schema/data mutation, or owner approval.
|
|
127
|
+
- Escalation output must include symptom, impact, attempts, disproven hypotheses, missing evidence, requested help/access, and safest next action.
|
|
128
|
+
|
|
129
|
+
## Subagents and reasoning
|
|
130
|
+
Default reasoning: `high`. Use oracle/debug subagents for independent root-cause analysis, log review, performance profile interpretation, bisect planning, or hypothesis generation. Use `xhigh` for intermittent, cross-system, security, performance, data-loss, privacy, destructive, or production-impacting failures.
|
|
131
|
+
|
|
132
|
+
Subagents may inspect and reason independently, but fixes should converge on one evidence-backed root cause. Ask subagents for competing hypotheses and evidence gaps, not broad code review. When subagents disagree, run the smallest test that distinguishes their explanations. Do not let parallel analysis become parallel guess-and-check edits.
|
|
133
|
+
|
|
134
|
+
## Hard rules
|
|
135
|
+
- No fix before evidence supports the root cause.
|
|
136
|
+
- No “try this” edits unless explicitly labeled as diagnostic experiments and reverted if disproven.
|
|
137
|
+
- Symptoms are not root causes; keep asking what mechanism produced the symptom.
|
|
138
|
+
- One hypothesis per experiment; record the prediction and result.
|
|
139
|
+
- Three disproven hypotheses without progress trigger escalation or a new evidence source.
|
|
140
|
+
- Regression coverage is required when practical; if impractical, explain why and provide alternate proof.
|
|
141
|
+
- Temporary instrumentation must be removed or clearly documented before handoff.
|
|
142
|
+
- Do not declare a failure fixed without command output, exact output proof, metric delta, trace evidence, or equivalent verification evidence.
|
|
143
|
+
- Do not broaden scope into feature work, refactoring, shipping, finalization, or unrelated cleanup.
|
|
144
|
+
- For data-loss/security/destructive risk, preserve evidence and escalate before mutation.
|
|
145
|
+
|
|
146
|
+
## Failure modes
|
|
147
|
+
- **Guess-and-check spiral:** changing code until symptoms disappear without knowing why.
|
|
148
|
+
- **Confirmation bias:** keeping the first hypothesis despite contradictory evidence.
|
|
149
|
+
- **No-op advice:** saying “check logs,” “add tests,” or “investigate further” without specifying which evidence, command, owner, or stop condition.
|
|
150
|
+
- **Overbroad fix:** refactoring or redesigning more than the failure requires.
|
|
151
|
+
- **Unreproducible confidence:** claiming success from one weak signal on flaky behavior.
|
|
152
|
+
- **Minimization that removes the bug:** simplifying the case until the relevant boundary, data shape, or timing condition disappears.
|
|
153
|
+
- **Instrumentation Heisenbug:** diagnostics change timing, ordering, load, or state enough to hide the failure.
|
|
154
|
+
- **First-bad-commit tunnel vision:** assuming bisect output is the mechanism without proof.
|
|
155
|
+
- **Boundary blame ping-pong:** assuming another service owns the issue without request-level evidence.
|
|
156
|
+
- **Diagnostic litter:** leaving logs, probes, sleeps, flags, generated data, or debug-only state behind.
|
|
157
|
+
|
|
158
|
+
## Output format
|
|
159
|
+
```markdown
|
|
160
|
+
## Debug report
|
|
161
|
+
Symptom: ...
|
|
162
|
+
Impact/scope: ...
|
|
163
|
+
Failure classification: ...
|
|
164
|
+
Stop condition: fixed / blocked / escalated ...
|
|
165
|
+
|
|
166
|
+
### Reproduction
|
|
167
|
+
- Steps/command: ...
|
|
168
|
+
- Environment/input: ...
|
|
169
|
+
- Frequency: ...
|
|
170
|
+
- Minimal case: ...
|
|
171
|
+
|
|
172
|
+
### Evidence and root cause
|
|
173
|
+
- Evidence ladder:
|
|
174
|
+
1. Observation: ...
|
|
175
|
+
2. Trace/minimization: ...
|
|
176
|
+
3. Hypothesis test: ...
|
|
177
|
+
4. Proof: ...
|
|
178
|
+
- Disproven hypotheses: ...
|
|
179
|
+
- Root cause: ...
|
|
180
|
+
|
|
181
|
+
### Fix
|
|
182
|
+
- Changed: ...
|
|
183
|
+
- Why this addresses the cause: ...
|
|
184
|
+
- Scope intentionally not changed: ...
|
|
185
|
+
|
|
186
|
+
### Regression protection
|
|
187
|
+
- Test/guard: ...
|
|
188
|
+
- Failing-before/passing-after proof or alternate guard: ...
|
|
189
|
+
|
|
190
|
+
### Verification
|
|
191
|
+
- Command/result: ...
|
|
192
|
+
- Exact output proof: ...
|
|
193
|
+
|
|
194
|
+
### Cleanup / remaining uncertainty
|
|
195
|
+
- Temporary diagnostics removed: ...
|
|
196
|
+
- Remaining risk/uncertainty: ...
|
|
197
|
+
- Escalation needed, if any: ...
|
|
198
|
+
```
|
|
@@ -0,0 +1,56 @@
|
|
|
1
|
+
# Isolation Gate
|
|
2
|
+
|
|
3
|
+
## Purpose
|
|
4
|
+
Confirm mutation can happen safely before the first file change.
|
|
5
|
+
|
|
6
|
+
This gate protects user work, local experiments, and unrelated files. It is binary: either the workspace is isolated for the requested blast radius, or mutation stops.
|
|
7
|
+
|
|
8
|
+
## Required checks
|
|
9
|
+
1. Identify the workspace:
|
|
10
|
+
- Run `git rev-parse --show-toplevel`.
|
|
11
|
+
- Run `git branch --show-current`.
|
|
12
|
+
- Run `git worktree list` or compare `git rev-parse --git-dir` with `git rev-parse --git-common-dir`.
|
|
13
|
+
2. Capture dirty state:
|
|
14
|
+
- Run `git status --porcelain` before editing.
|
|
15
|
+
- Treat every listed path as user-owned until proven otherwise.
|
|
16
|
+
3. State the planned blast radius:
|
|
17
|
+
- List the exact files or directories expected to change.
|
|
18
|
+
- List protected files, generated files, scripts, tests, and modules that must not change.
|
|
19
|
+
4. Compare dirty files to planned changes:
|
|
20
|
+
- Planned + clean: safe to edit.
|
|
21
|
+
- Planned + already dirty: ask whether to build on, inspect, or avoid those changes.
|
|
22
|
+
- Unplanned + dirty: do not touch.
|
|
23
|
+
5. Build a collision matrix before mutation.
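
The read-only checks above could be gathered in one place, as in the sketch below. The helper names (`git`, `is_linked_worktree`, `workspace_facts`) are assumptions for illustration; the git subcommands are the ones listed in the checks. Nothing here modifies the workspace.

```python
import subprocess
from pathlib import Path
from typing import Dict


def git(*args: str, cwd: str = ".") -> str:
    """Run a git command and return stdout without the trailing newline."""
    result = subprocess.run(
        ["git", *args], cwd=cwd, check=True, capture_output=True, text=True
    )
    # Only strip newlines: leading spaces matter in `git status --porcelain`.
    return result.stdout.rstrip("\n")


def is_linked_worktree(git_dir: str, common_dir: str, cwd: str = ".") -> bool:
    """A linked worktree has a git dir that differs from the common git dir."""
    base = Path(cwd)
    return (base / git_dir).resolve() != (base / common_dir).resolve()


def workspace_facts(cwd: str = ".") -> Dict[str, object]:
    """Collect root, branch, worktree state, and dirty paths without mutating."""
    return {
        "root": git("rev-parse", "--show-toplevel", cwd=cwd),
        "branch": git("branch", "--show-current", cwd=cwd),
        "linked_worktree": is_linked_worktree(
            git("rev-parse", "--git-dir", cwd=cwd),
            git("rev-parse", "--git-common-dir", cwd=cwd),
            cwd=cwd,
        ),
        "dirty": git("status", "--porcelain", cwd=cwd).splitlines(),
    }
```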
|
|
24
|
+
|
|
25
|
+
## Collision matrix
|
|
26
|
+
|
|
27
|
+
| Dirty path | Planned to edit? | Owner known? | Action |
|
|
28
|
+
| --- | --- | --- | --- |
|
|
29
|
+
| No dirty paths | N/A | N/A | Pass |
|
|
30
|
+
| Dirty path inside blast radius | Yes | User/unknown | Ask before editing that file |
|
|
31
|
+
| Dirty path outside blast radius | No | User/unknown | Leave untouched |
|
|
32
|
+
| Dirty path is generated artifact | Maybe | Tool/unknown | Do not delete unless user approved |
|
|
33
|
+
| Dirty path conflicts with requested scope | Yes | Unknown | Fail until clarified |
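
The first rows of the matrix can be approximated mechanically. The sketch below parses `git status --porcelain` (v1) output and compares each dirty path with a planned blast radius. The function names and action strings are hypothetical, ownership and generated-artifact detection are left out, and quoted paths with special characters are not unquoted.

```python
from typing import Dict, Iterable, List


def dirty_paths(porcelain: str) -> List[str]:
    """Extract paths from `git status --porcelain` (v1) output."""
    paths = []
    for line in porcelain.splitlines():
        if len(line) < 4:
            continue  # too short to hold "XY <path>"
        path = line[3:]  # skip the two status letters and the space
        if " -> " in path:  # rename/copy entries read "old -> new"
            path = path.split(" -> ", 1)[1]
        paths.append(path)
    return paths


def in_blast_radius(path: str, planned: Iterable[str]) -> bool:
    """True when `path` is a planned file or sits under a planned directory."""
    return any(path == p or path.startswith(p.rstrip("/") + "/") for p in planned)


def collision_matrix(porcelain: str, planned: Iterable[str]) -> Dict[str, str]:
    """Map each dirty path to the action the matrix prescribes."""
    planned = list(planned)
    return {
        path: "ask before editing" if in_blast_radius(path, planned) else "leave untouched"
        for path in dirty_paths(porcelain)
    }


def gate_passes(matrix: Dict[str, str]) -> bool:
    """Pass only while no dirty path overlaps the plan without approval."""
    return all(action == "leave untouched" for action in matrix.values())
```

In this sketch an overlap always fails the gate; recording user approval for a dirty path inside the blast radius would be a separate, explicit step.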
|
|
34
|
+
|
|
35
|
+
## Pass condition
|
|
36
|
+
Pass only when all are true:
|
|
37
|
+
- Workspace root, branch, and worktree state are known.
|
|
38
|
+
- `git status --porcelain` has been captured.
|
|
39
|
+
- Planned blast radius is explicit.
|
|
40
|
+
- Dirty files are either absent, inside approved scope, or guaranteed untouched.
|
|
41
|
+
- No auto-stash, auto-commit, reset, checkout, cleanup, or generated-file deletion is needed.
|
|
42
|
+
|
|
43
|
+
## Fail action
|
|
44
|
+
Stop before changing files. Do not auto-stash, auto-commit, reset, or "clean up" user work.
|
|
45
|
+
|
|
46
|
+
Report:
|
|
47
|
+
|
|
48
|
+
```text
|
|
49
|
+
Isolation Gate: FAIL
|
|
50
|
+
Workspace: <root>
|
|
51
|
+
Branch/worktree: <branch and worktree state>
|
|
52
|
+
Planned blast radius: <files/dirs>
|
|
53
|
+
Dirty files: <git status --porcelain output>
|
|
54
|
+
Collision: <which dirty paths overlap or create risk>
|
|
55
|
+
Needed decision: <ask user to approve, narrow scope, or provide a clean workspace>
|
|
56
|
+
```
|
|
@@ -0,0 +1,54 @@
|
|
|
1
|
+
# Proof Gate
|
|
2
|
+
|
|
3
|
+
## Purpose
|
|
4
|
+
Verify claims with evidence before reporting success.
|
|
5
|
+
|
|
6
|
+
This gate rejects vibe-based completion. Code inspection can support a claim, but inspection alone is not proof that behavior works. Proof must connect the intended outcome to observable evidence.
|
|
7
|
+
|
|
8
|
+
## Required checks
|
|
9
|
+
1. Name the claim:
|
|
10
|
+
- What changed?
|
|
11
|
+
- What user-visible or maintainer-visible outcome is now true?
|
|
12
|
+
2. Choose evidence that matches the scope:
|
|
13
|
+
- Logic: unit tests, integration tests, property checks, reproduction scripts, or before/after examples.
|
|
14
|
+
- UI: browser/app interaction, screenshots, accessibility checks, visual diff, or manual steps with observed result.
|
|
15
|
+
- Config/build: validation command, dry run, parser/linter, build, deploy preview, or tool output.
|
|
16
|
+
- Docs/content: link/render check, validator, examples that exercise the documented path, or human-readable diff against requirements.
|
|
17
|
+
3. Run the strongest practical verification.
|
|
18
|
+
4. Capture concrete output:
|
|
19
|
+
- Command name.
|
|
20
|
+
- Pass/fail result.
|
|
21
|
+
- Relevant output, screenshot path, or inspected artifact.
|
|
22
|
+
5. Disclose gaps:
|
|
23
|
+
- Untested paths.
|
|
24
|
+
- Commands unavailable.
|
|
25
|
+
- Manual assumptions.
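
Capturing concrete output can be as simple as recording the command, its result, and the tail of its output together. The `Evidence` record and `capture_evidence` helper below are a hypothetical sketch, not a Keystone API.

```python
import shlex
import subprocess
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Evidence:
    command: str      # the exact command line that was run
    passed: bool      # True when the exit code was 0
    output_tail: str  # last lines of combined stdout and stderr


def capture_evidence(cmd: Sequence[str], tail_lines: int = 10) -> Evidence:
    """Run `cmd` and record the command, pass/fail result, and last output lines."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    combined = (result.stdout + result.stderr).splitlines()
    return Evidence(
        command=shlex.join(cmd),
        passed=result.returncode == 0,
        output_tail="\n".join(combined[-tail_lines:]),
    )
```

A record like this supports the command name, pass/fail result, and relevant output items; gaps such as untested paths still have to be disclosed by hand.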
|
|
26
|
+
|
|
27
|
+
## Not proof
|
|
28
|
+
- "The code looks right."
|
|
29
|
+
- "I updated the file."
|
|
30
|
+
- "This should work."
|
|
31
|
+
- "No errors in my editor."
|
|
32
|
+
- Reviewing a diff without executing or validating the affected behavior.
|
|
33
|
+
|
|
34
|
+
## Good proof examples
|
|
35
|
+
- `pytest tests/test_pricing.py -q` passes and includes the changed rule.
|
|
36
|
+
- Browser flow: open checkout, apply discount, observe total updates to `$42`.
|
|
37
|
+
- `python3 scripts/validate-keystone.py` passes after editing skill modules.
|
|
38
|
+
- Config proof: `terraform plan` exits 0 and shows only expected resource changes.
|
|
39
|
+
|
|
40
|
+
## Pass condition
|
|
41
|
+
Pass only when the success claim is backed by evidence that directly exercises or validates the changed scope, and all known gaps are disclosed.
|
|
42
|
+
|
|
43
|
+
## Fail action
|
|
44
|
+
Do not claim completion. If verification cannot be run, provide a fallback proof plan instead of pretending.
|
|
45
|
+
|
|
46
|
+
```text
|
|
47
|
+
Proof Gate: FAIL
|
|
48
|
+
Claim needing proof: <claim>
|
|
49
|
+
Attempted evidence: <commands/inspection/manual steps>
|
|
50
|
+
Result: <output or blocker>
|
|
51
|
+
Unverified risk: <what may still be wrong>
|
|
52
|
+
Fallback proof plan: <exact command/manual check/human verification needed>
|
|
53
|
+
Completion language allowed: <"implemented, not verified" or "blocked pending proof">
|
|
54
|
+
```
|
|
@@ -0,0 +1,59 @@
|
|
|
1
|
+
# Red Gate
|
|
2
|
+
|
|
3
|
+
## Purpose
|
|
4
|
+
Establish a meaningful failing signal before implementation when practical.
|
|
5
|
+
|
|
6
|
+
The red signal proves the current system lacks the desired behavior and that the chosen check can detect the fix. It is binary: either a red-capable check exists, or an explicit alternative proof plan is recorded.
|
|
7
|
+
|
|
8
|
+
## Required checks
|
|
9
|
+
1. State expected behavior in concrete terms.
|
|
10
|
+
2. Identify a red-capable check:
|
|
11
|
+
- Test that should fail before the change.
|
|
12
|
+
- Reproduction command or script.
|
|
13
|
+
- Acceptance example with expected/actual output.
|
|
14
|
+
- UI flow that currently shows the defect.
|
|
15
|
+
- Validator that rejects the current artifact.
|
|
16
|
+
3. Run or document the check before implementation when safe.
|
|
17
|
+
4. Confirm the failure is meaningful:
|
|
18
|
+
- It fails for the right reason.
|
|
19
|
+
- It would pass after the intended behavior exists.
|
|
20
|
+
- It is not only testing mocks, fixtures, snapshots, or implementation details.
|
|
21
|
+
5. Preserve the red evidence in notes, test output, or commit/PR description.
|
|
22
|
+
|
|
23
|
+
## Good red examples
|
|
24
|
+
- A unit test expects tax to round half-up and currently receives half-even.
|
|
25
|
+
- A browser flow submits an empty required field and currently allows submission.
|
|
26
|
+
- A docs validator fails because a required gate file lacks a failure report template.
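
The first example can be made concrete with Python's `decimal` module. In the sketch below, `tax_current` stands in for the existing half-even behavior and `tax_intended` for the desired half-up behavior; both functions, the amount, and the rate are hypothetical. The check is red against the current function and green once the intended rounding exists.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP

CENT = Decimal("0.01")


def tax_current(amount: Decimal, rate: Decimal) -> Decimal:
    """Hypothetical current behavior: banker's rounding (half-even)."""
    return (amount * rate).quantize(CENT, rounding=ROUND_HALF_EVEN)


def tax_intended(amount: Decimal, rate: Decimal) -> Decimal:
    """Intended behavior: round half-up."""
    return (amount * rate).quantize(CENT, rounding=ROUND_HALF_UP)


def red_check(tax_fn) -> bool:
    """Return True when `tax_fn` rounds an exact half-cent tie upward."""
    # 2.50 * 0.05 = 0.1250, an exact tie between 0.12 and 0.13.
    return tax_fn(Decimal("2.50"), Decimal("0.05")) == Decimal("0.13")
```

The input is chosen so the two rounding modes disagree, which is what makes the failure meaningful: the check fails for the right reason and would pass after the intended behavior exists.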
|
|
27
|
+
|
|
28
|
+
## Bad red examples
|
|
29
|
+
- A test that only checks a mocked service was called.
|
|
30
|
+
- A snapshot update with no behavioral assertion.
|
|
31
|
+
- A failing linter unrelated to the requested change.
|
|
32
|
+
- A test that fails because the test setup is broken.
|
|
33
|
+
|
|
34
|
+
## When red is impractical
|
|
35
|
+
Red may be impractical for copy-only changes, emergency fixes, unavailable environments, non-deterministic external systems, or when writing the check would exceed the change risk.
|
|
36
|
+
|
|
37
|
+
If red is skipped, name the reason and substitute a stronger proof plan:
|
|
38
|
+
- targeted validator,
|
|
39
|
+
- manual reproduction steps,
|
|
40
|
+
- review checklist,
|
|
41
|
+
- before/after artifact comparison,
|
|
42
|
+
- deploy-preview or staging verification.
|
|
43
|
+
|
|
44
|
+
## Pass condition
|
|
45
|
+
Pass only when either:
|
|
46
|
+
- a meaningful red-capable signal is documented before implementation, or
|
|
47
|
+
- red is explicitly impractical and an alternative proof plan is concrete enough to verify the outcome later.
|
|
48
|
+
|
|
49
|
+
## Fail action
|
|
50
|
+
Do not proceed as if behavior is proven. Stop or record the exception before implementation.
|
|
51
|
+
|
|
52
|
+
```text
|
|
53
|
+
Red Gate: FAIL
|
|
54
|
+
Expected behavior: <specific outcome>
|
|
55
|
+
Proposed red check: <test/repro/validator/manual flow>
|
|
56
|
+
Why it is not usable: <reason>
|
|
57
|
+
Risk of skipping red: <what could regress or remain unproven>
|
|
58
|
+
Alternative proof plan: <exact post-change evidence required>
|
|
59
|
+
```
|
|
@@ -0,0 +1,56 @@
|
|
|
1
|
+
# Review Gate
|
|
2
|
+
|
|
3
|
+
## Purpose
|
|
4
|
+
Confirm work has received the required review before finalization.
|
|
5
|
+
|
|
6
|
+
Review is evidence, not a feeling. This gate separates blocking findings from non-blocking follow-ups and prevents shipping work that needs independent review.
|
|
7
|
+
|
|
8
|
+
This gate is binary: pass or fail.
|
|
9
|
+
|
|
10
|
+
## Required checks
|
|
11
|
+
1. Identify required review type:
|
|
12
|
+
- Self-review is sufficient only for docs-only changes, low-risk config, or tiny low-risk refactors with green proof.
|
|
13
|
+
- Independent review is required for security, data, public API, billing/payment, auth/permissions, architecture, migrations, releases, broad refactors, or user-impacting changes.
|
|
14
|
+
- Use a human reviewer, automated reviewer, domain owner, security review, design review, or read-only review pointer as appropriate.
|
|
15
|
+
2. Provide review input:
|
|
16
|
+
- Scope summary.
|
|
17
|
+
- Files changed.
|
|
18
|
+
- Requirements or acceptance criteria.
|
|
19
|
+
- Proof evidence already gathered.
|
|
20
|
+
- Known risks and skipped checks.
|
|
21
|
+
3. Capture review evidence:
|
|
22
|
+
- Reviewer name/tool.
|
|
23
|
+
- Date or run identifier.
|
|
24
|
+
- Link, comment, command output, or quoted findings.
|
|
25
|
+
4. Separate findings:
|
|
26
|
+
- Blockers: correctness, scope violation, data loss, security, broken verification, missing required proof.
|
|
27
|
+
- Non-blockers: style, cleanup, future refactors, optional docs, nice-to-have tests.
|
|
28
|
+
5. Resolve or explicitly accept blockers before shipping.
|
|
29
|
+
|
|
30
|
+
## Read-only review pointer
|
|
31
|
+
When review cannot be performed by the worker, leave a read-only pointer that lets a reviewer inspect without changing work:
|
|
32
|
+
|
|
33
|
+
```text
|
|
34
|
+
Review requested for: <branch/worktree/PR/diff>
|
|
35
|
+
Scope: <requested change>
|
|
36
|
+
Changed files: <files>
|
|
37
|
+
Proof: <commands and results>
|
|
38
|
+
Known gaps: <unverified items>
|
|
39
|
+
Please classify findings as BLOCKER or NON-BLOCKER.
|
|
40
|
+
```
|
|
41
|
+
|
|
42
|
+
## Pass condition
|
|
43
|
+
Pass only when the required review evidence exists and no blocking findings remain unresolved, or the user explicitly accepts the remaining blockers.
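
The pass condition is a small boolean rule, sketched below under the assumption that review results are collected into a simple record. The `Review` dataclass and `review_gate` function are hypothetical names for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Review:
    evidence: str = ""  # link, comment, or command output
    blockers: List[str] = field(default_factory=list)
    non_blockers: List[str] = field(default_factory=list)
    blockers_accepted_by_user: bool = False


def review_gate(review: Review) -> str:
    """Return "PASS" or "FAIL" following the pass condition stated above."""
    if not review.evidence:
        return "FAIL"  # no review evidence recorded
    if review.blockers and not review.blockers_accepted_by_user:
        return "FAIL"  # unresolved blockers remain
    return "PASS"  # non-blockers never block; they become follow-ups
```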
|
|
44
|
+
|
|
45
|
+
## Fail action
|
|
46
|
+
Return to the appropriate module or gate. Do not ship while blockers are open.
|
|
47
|
+
|
|
48
|
+
```text
|
|
49
|
+
Review Gate: FAIL
|
|
50
|
+
Review source: <human/tool/pending>
|
|
51
|
+
Evidence: <link/output/comment or "none">
|
|
52
|
+
Blockers: <list>
|
|
53
|
+
Non-blockers: <list>
|
|
54
|
+
Required next action: <fix, re-review, user acceptance, or request review>
|
|
55
|
+
Read-only review pointer: <branch/PR/diff path if review is pending>
|
|
56
|
+
```
|
|
@@ -0,0 +1,57 @@
|
|
|
1
|
+
# Ship Gate
|
|
2
|
+
|
|
3
|
+
## Purpose
|
|
4
|
+
Ensure finalization happens only after completed, verified, reviewed work is ready for delivery.
|
|
5
|
+
|
|
6
|
+
Shipping is a handoff decision. This gate confirms the work can be understood, verified, rolled back, and continued by someone else. It is binary: ready to hand off or not ready.
|
|
7
|
+
|
|
8
|
+
## Required checks
|
|
9
|
+
1. Proof gate status:
|
|
10
|
+
- Evidence commands and results are recorded.
|
|
11
|
+
- Unverified areas are disclosed.
|
|
12
|
+
- Any proof exception is explicit.
|
|
13
|
+
2. Review gate status:
|
|
14
|
+
- Review evidence is recorded.
|
|
15
|
+
- Blockers are resolved or explicitly accepted by the user.
|
|
16
|
+
- Non-blockers are listed as follow-ups.
|
|
17
|
+
3. Delivery notes:
|
|
18
|
+
- What changed.
|
|
19
|
+
- Why it changed.
|
|
20
|
+
- Files changed.
|
|
21
|
+
- Validation performed.
|
|
22
|
+
- Risks and limitations.
|
|
23
|
+
4. Rollback or recovery evidence:
|
|
24
|
+
- Revert commit/PR guidance, feature flag, config rollback, backup path, or "docs-only revert is safe" note.
|
|
25
|
+
- Data migrations, destructive actions, or irreversible steps are called out.
|
|
26
|
+
5. Handoff evidence:
|
|
27
|
+
- Next human action, deployment step, release note, PR link, or review pointer.
|
|
28
|
+
- Owners for follow-ups are identified when known.
|
|
29
|
+
6. No stealth fix:
|
|
30
|
+
- Do not include unrelated fixes.
|
|
31
|
+
- Do not hide failed checks.
|
|
32
|
+
- Do not silently modify files outside the approved blast radius.
|
|
33
|
+
|
|
34
|
+
## Good ship note
|
|
35
|
+
```text
|
|
36
|
+
Changed: Expanded five Keystone gate docs into operational pass/fail gates.
|
|
37
|
+
Validation: python3 scripts/validate-keystone.py passed.
|
|
38
|
+
Review: Pending human review; no automated blockers found.
|
|
39
|
+
Rollback: Revert this docs-only change; no migration or runtime state.
|
|
40
|
+
Follow-ups: Human reviewer to confirm wording matches Keystone doctrine.
|
|
41
|
+
```
|
|
42
|
+
|
|
43
|
+
## Pass condition
|
|
44
|
+
Pass only when proof, review, delivery notes, rollback/handoff evidence, and scope compliance are all present.
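
A minimal sketch of this all-or-nothing check, assuming each of the five inputs has already been reduced to a boolean. The key names and the `ship_gate` function are hypothetical, chosen only for illustration.

```python
from __future__ import annotations

# Hypothetical keys mirroring the five inputs named in the pass condition.
REQUIRED_SHIP_EVIDENCE = (
    "proof",
    "review",
    "delivery_notes",
    "rollback_handoff",
    "scope_compliance",
)


def ship_gate(evidence: dict[str, bool]) -> tuple[str, list[str]]:
    """Return ("PASS", []) or ("FAIL", [missing items in checklist order]).

    A key that is absent counts as missing, so the gate fails closed.
    """
    missing = [
        name for name in REQUIRED_SHIP_EVIDENCE
        if not evidence.get(name, False)
    ]
    return ("FAIL" if missing else "PASS", missing)
```

Treating absent keys as missing keeps the gate binary: an unrecorded check can never be mistaken for a passed one.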
|
|
45
|
+
|
|
46
|
+
## Fail action
|
|
47
|
+
Stop finalization and route to the missing gate. Do not merge, release, mark complete, or imply production readiness.
|
|
48
|
+
|
|
49
|
+
```text
|
|
50
|
+
Ship Gate: FAIL
|
|
51
|
+
Missing proof: <none or details>
|
|
52
|
+
Missing review: <none or details>
|
|
53
|
+
Missing delivery notes: <none or details>
|
|
54
|
+
Rollback/handoff gap: <none or details>
|
|
55
|
+
Scope violation or stealth fix risk: <none or details>
|
|
56
|
+
Required next action: <which gate/module to run before shipping>
|
|
57
|
+
```
|
|
@@ -0,0 +1,124 @@
|
|
|
1
|
+
# Keystone Health Module
|
|
2
|
+
|
|
3
|
+
## Core principle
|
|
4
|
+
Health is a read-only, whole-project condition scan. It turns repository evidence into mechanical findings about system drift, fragility, and maintenance risk; it does not repair, refactor, update, or reconfigure anything by default.
|
|
5
|
+
|
|
6
|
+
Health answers “what is the condition of this project or subsystem?” Review answers “is this specific change acceptable?” If the task centers on a PR/diff/patch, use `review`; if it centers on repo-wide readiness, drift, tooling, docs, tests, releases, or operational condition, use `health`.
|
|
7
|
+
|
|
8
|
+
## Load when
|
|
9
|
+
Load when the user asks for a health check, readiness scan, risk assessment, project status audit, tooling audit, dependency/config review, maintenance assessment, drift detection, “what should we worry about?”, or “is this repo in good shape?”
|
|
10
|
+
|
|
11
|
+
## Not for
|
|
12
|
+
- Fixing issues found during the audit.
|
|
13
|
+
- Reviewing a specific change, PR, patch, or commit; use `review`.
|
|
14
|
+
- Debugging a specific failure to root cause; use `debug`.
|
|
15
|
+
- Shipping a completed branch; use `ship`.
|
|
16
|
+
- Designing new product behavior; use `shape`.
|
|
17
|
+
- Research briefs about a narrow question; use `research`.
|
|
18
|
+
|
|
19
|
+
## Outcome contract
|
|
20
|
+
Deliver a health report where every finding maps:
|
|
21
|
+
|
|
22
|
+
`finding -> evidence -> impact -> confidence -> next Keystone module`
|
|
23
|
+
|
|
24
|
+
The report must include:
|
|
25
|
+
- audit scope and evidence inspected;
|
|
26
|
+
- concrete checklist status for tooling/CI drift, docs-vs-reality, config/env rot, dependency health, test health/flakiness, package/release health, and instruction/skill drift;
|
|
27
|
+
- risks ranked by severity and confidence using the Health priority rubric;
|
|
28
|
+
- checks not run and why;
|
|
29
|
+
- explicit no-fix confirmation unless repairs were requested.
|
|
30
|
+
|
|
31
|
+
Health priority rubric:
|
|
32
|
+
- **Critical:** Broken now, release-blocking, security-sensitive, data-loss-prone, or prevents required project operation. Urgency: act before `ship` or before depending on the affected subsystem. Next Keystone module: usually `debug` for failing behavior or `build` for repairs; use `ship` only for release gate follow-up after the issue is fixed.
|
|
33
|
+
- **Watch:** Risky, stale, drifting, or likely to become blocking, but not proven broken under current evidence. Urgency: schedule remediation or investigation soon; do not let it become untracked backlog. Next Keystone module: usually `research` to verify unknowns, `breakdown` to plan multi-step remediation, or `build` for contained repairs.
|
|
34
|
+
- **Info:** Healthy signal, minor inconsistency, low-impact cleanup, or explicitly unknown/unchecked area. Urgency: no immediate action unless priorities change. Next Keystone module: `no-op` when informational, or `review`/`ship` when the next step is validation rather than repair.
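
The finding chain and the rubric above could be modeled roughly as follows. This is a sketch under assumptions: the class and field names are hypothetical, and ordering findings by severity first and confidence second is one plausible reading of "ranked by severity and confidence".

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    """Health priority rubric; lower value sorts first."""
    CRITICAL = 0
    WATCH = 1
    INFO = 2


class Confidence(IntEnum):
    HIGH = 0
    MEDIUM = 1
    LOW = 2


@dataclass(frozen=True)
class HealthFinding:
    """severity + finding -> evidence -> impact -> confidence -> next module."""
    severity: Severity
    finding: str
    evidence: str
    impact: str
    confidence: Confidence
    next_module: str

    def __post_init__(self) -> None:
        # Reject incomplete chains instead of emitting a partial finding.
        for name in ("finding", "evidence", "impact", "next_module"):
            if not getattr(self, name).strip():
                raise ValueError(f"health finding is missing {name!r}")


def rank(findings: list[HealthFinding]) -> list[HealthFinding]:
    """Order by severity, then by confidence (assumed tie-break)."""
    return sorted(findings, key=lambda f: (f.severity, f.confidence))
```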
|
|
35
|
+
|
|
36
|
+
## Modes
|
|
37
|
+
- **Project snapshot:** summarize structure, active areas, scripts, tests, CI, docs, and current branch state.
|
|
38
|
+
- **Tooling/CI drift audit:** compare manifests, scripts, Make targets, task runners, hooks, CI workflows, matrix versions, cache keys, and required checks against actual files and commands.
|
|
39
|
+
- **Docs-vs-reality audit:** compare README, contributor docs, runbooks, release docs, examples, and command snippets with actual repo layout and executable scripts.
|
|
40
|
+
- **Config/env rot audit:** inspect templates, `.env.example`, config schemas, secrets documentation, Docker/compose/devcontainer files, SDK/runtime pins, and environment assumptions for staleness or inconsistency.
|
|
41
|
+
- **Dependency health audit:** inspect manifests, lockfiles, version pins, deprecated packages, engine/toolchain constraints, known update pressure, and manifest-lock consistency.
|
|
42
|
+
- **Test health/flakiness audit:** inspect test commands, skipped/quarantined tests, retries, snapshots, coverage signals, slow/flaky markers, CI-only behavior, and failure history when available.
|
|
43
|
+
- **Package/release health audit:** inspect package metadata, allowlists, build artifacts, version/changelog flow, release scripts, publish dry-run support, tags, and multi-target release paths.
|
|
44
|
+
- **Instruction/skill drift audit:** inspect project instructions, agent docs, skill/module docs, validator rules, and examples for contradictions or stale references.
|
|
45
|
+
- **Release readiness health:** assess project-level risk categories before `ship`, without preparing release artifacts.
|
|
46
|
+
- **Risk triage:** rank issues by impact, likelihood, evidence strength, and recommended next module.
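
As one illustration of a docs-vs-reality comparison, the sketch below looks for `npm run <name>` commands in documentation that have no matching entry in a `package.json` `scripts` object. It assumes an npm-style project and takes file contents as strings so that it stays read-only; `undeclared_scripts` is a hypothetical helper, not a Keystone tool.

```python
from __future__ import annotations

import json
import re

# Matches "npm run <script-name>" in prose or command snippets.
NPM_RUN = re.compile(r"\bnpm run ([A-Za-z0-9:_-]+)")


def undeclared_scripts(readme_text: str, package_json_text: str) -> list[str]:
    """Return script names the docs mention but package.json does not declare."""
    declared = json.loads(package_json_text).get("scripts", {})
    referenced = NPM_RUN.findall(readme_text)
    return sorted({name for name in referenced if name not in declared})
```

Each name returned is only a candidate finding; it still needs evidence, impact, confidence, and a next module before it belongs in a report.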
|
|
47
|
+
|
|
48
|
+
## Process
|
|
49
|
+
1. Define scope: whole repo, subsystem, tooling, release readiness, dependencies, docs, instructions, or operational process.
|
|
50
|
+
2. Confirm read-only posture. Say what will be inspected and avoid state-changing commands unless the user explicitly requested repairs.
|
|
51
|
+
3. Inspect before judging: read manifests, scripts, CI, tests, docs, configs, package/release files, instruction files, recent status, and relevant gates.
|
|
52
|
+
4. Apply the concrete audit checklists:
|
|
53
|
+
- **Tooling/CI drift:** declared scripts exist; CI calls valid commands; local and CI tool versions align; required checks match current project; generated/cache paths are current; hooks and Make/package targets agree.
|
|
54
|
+
- **Docs-vs-reality:** documented setup/test/build/release commands exist; referenced paths still exist; examples match current APIs/CLIs; screenshots/output snippets are not misleading; contributor docs match workflow.
|
|
55
|
+
- **Config/env rot:** sample env files cover required variables; config names match code; defaults are safe; obsolete variables are not documented as required; runtime/container/dev environment pins are current.
|
|
56
|
+
- **Dependency health:** manifests and lockfiles agree; package managers are not mixed accidentally; runtime engine constraints are plausible; deprecated/abandoned/high-risk dependencies are called out with evidence; update risk is separated from breakage.
|
|
57
|
+
- **Test health/flakiness:** test entrypoints are discoverable; skips/todos/quarantine/retry settings are listed; flaky markers or timing-sensitive tests are noted; coverage signals are reported only if measured; CI-only gaps are identified.
|
|
58
|
+
- **Package/release health:** package allowlists include required files and exclude junk; build artifacts are reproducible or documented; version/changelog/release scripts align; publish dry-run or validation exists; multi-target outputs are accounted for.
|
|
59
|
+
- **Instruction/skill drift:** AGENTS/CLAUDE/GEMINI/Codex/plugin docs and Keystone skill docs agree; module boundaries are current; validators and examples match required headings/behavior.
|
|
60
|
+
5. Run safe focused checks when useful and allowed, such as `git status --short`, listing workflow files, reading manifests, `--help`, dry-run validation, or project validators. Avoid install/update/format/fix/publish commands by default.
|
|
61
|
+
6. Detect drift mechanically: documentation pointing to missing scripts, scripts referencing missing files, stale generated assets, inconsistent versions, orphaned configs, CI mismatch, package metadata mismatch, or contradictory instructions.
|
|
62
|
+
7. For each finding, write the required chain: severity, finding, evidence, impact, confidence, and next Keystone module (`research`, `debug`, `shape`, `breakdown`, `build`, `review`, `ship`, or no-op). Assign severity from the Health priority rubric; do not produce unranked findings. If the finding is about tests or package metadata, still route it to one of the modules listed above: usually `build` for repairs, `debug` for failing behavior, `review` for validation, or `ship` for final package readiness.
|
|
63
|
+
8. Separate statuses: **broken now**, **risky**, **stale**, **unknown**, and **healthy**. Map broken-now items to Critical unless evidence shows low impact; map risky or stale items to Watch unless release/security impact makes them Critical; map healthy, minor, and unchecked informational notes to Info. Do not convert unknowns into failures.
|
|
64
|
+
9. Stop at reporting unless the user explicitly requested fixes. If repairs are requested, route to the appropriate module instead of silently switching modes.
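
The status-to-severity mapping in step 8 can be sketched as a lookup with two overrides. The function name and flags are hypothetical, and the step does not say which level a low-impact broken-now item should take, so downgrading it to `Watch` is an assumption of this sketch.

```python
from __future__ import annotations

# Default severity for each status named in step 8.
DEFAULT_SEVERITY = {
    "broken now": "Critical",
    "risky": "Watch",
    "stale": "Watch",
    "unknown": "Info",
    "healthy": "Info",
}


def severity_for(
    status: str,
    *,
    low_impact: bool = False,
    release_or_security_impact: bool = False,
) -> str:
    """Map a finding status to Critical / Watch / Info."""
    if status not in DEFAULT_SEVERITY:
        raise ValueError(f"unknown status: {status!r}")
    if status == "broken now" and low_impact:
        # Assumption: the source only says "unless evidence shows low impact".
        return "Watch"
    if status in ("risky", "stale") and release_or_security_impact:
        return "Critical"
    # Unknowns are never escalated into failures.
    return DEFAULT_SEVERITY[status]
```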
|
|
65
|
+
|
|
66
|
+
## Subagents and reasoning
|
|
67
|
+
Default reasoning: `medium`. Use read-only scout subagents for broad inventory and reviewer subagents for independent risk triage. Use `high` for release readiness, security-sensitive audits, large monorepos, severe tooling drift, instruction drift affecting agent behavior, or when health findings affect go/no-go decisions. Subagents must remain read-only unless repairs are explicitly requested.
|
|
68
|
+
|
|
69
|
+
## Hard rules
|
|
70
|
+
- Read-only by default: no fixing, formatting, dependency updates, cleanup, generation, or config changes unless explicitly requested.
|
|
71
|
+
- Health is whole-project/system condition; Review is a specific change. Do not use Health to approve a PR diff.
|
|
72
|
+
- Every finding must include severity, finding, evidence, impact, confidence, and next Keystone module.
|
|
73
|
+
- Use only the Health priority rubric for severity (`Critical`, `Watch`, `Info`) unless the user explicitly requests equivalent labels; do not emit unranked dumps.
|
|
74
|
+
- Evidence categories must be named; unchecked areas must be listed with reasons.
|
|
75
|
+
- Do not overstate confidence. Label inferred risks and explain what would verify them.
|
|
76
|
+
- Prefer safe read-only or focused validation commands.
|
|
77
|
+
- Separate “broken now,” “risky,” “stale,” and “unknown.”
|
|
78
|
+
- Health can recommend `ship`, but does not replace ship proof gates.
|
|
79
|
+
|
|
80
|
+
## Failure modes
|
|
81
|
+
- **Audit-as-fix:** making opportunistic changes during a scan.
|
|
82
|
+
- **Review confusion:** judging a specific PR/change instead of system condition.
|
|
83
|
+
- **Checklist theater:** listing categories without evidence.
|
|
84
|
+
- **Evidence gaps:** reporting findings that lack evidence, impact, confidence, or a next module.
|
|
85
|
+
- **False certainty:** declaring healthy because a narrow check passed.
|
|
86
|
+
- **Drift blindness:** missing mismatches between docs, scripts, CI, package metadata, instructions, and actual files.
|
|
87
|
+
- **Unranked dump:** overwhelming the user with findings that carry no severity or next module.
|
|
88
|
+
|
|
89
|
+
## Output format
|
|
90
|
+
```markdown
|
|
91
|
+
## Health report
|
|
92
|
+
Scope: ...
|
|
93
|
+
No-fix status: confirmed / repairs requested
|
|
94
|
+
Boundary: Health system-condition scan, not Review of a specific change
|
|
95
|
+
|
|
96
|
+
### Evidence inspected
|
|
97
|
+
- ...
|
|
98
|
+
|
|
99
|
+
### Checklist status
|
|
100
|
+
| Checklist | Status | Evidence | Confidence |
|
|
101
|
+
|---|---|---|---|
|
|
102
|
+
| Tooling/CI drift | ... | ... | ... |
|
|
103
|
+
| Docs-vs-reality | ... | ... | ... |
|
|
104
|
+
| Config/env rot | ... | ... | ... |
|
|
105
|
+
| Dependency health | ... | ... | ... |
|
|
106
|
+
| Test health/flakiness | ... | ... | ... |
|
|
107
|
+
| Package/release health | ... | ... | ... |
|
|
108
|
+
| Instruction/skill drift | ... | ... | ... |
|
|
109
|
+
|
|
110
|
+
Classify findings using the Health priority rubric above: `Critical`, `Watch`, or `Info`.
|
|
111
|
+
|
|
112
|
+
### Findings
|
|
113
|
+
1. Critical / Watch / Info — finding
|
|
114
|
+
- Evidence:
|
|
115
|
+
- Impact:
|
|
116
|
+
- Confidence: High / Medium / Low
|
|
117
|
+
- Next Keystone module:
|
|
118
|
+
|
|
119
|
+
### Checks not run
|
|
120
|
+
- ...
|
|
121
|
+
|
|
122
|
+
### Overall assessment
|
|
123
|
+
Healthy / Watch / At risk / Blocked — rationale
|
|
124
|
+
```
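
A small sketch of how the checklist table in this format might be rendered so that all seven checklists always appear, with unchecked ones stated explicitly. The `checklist_table` helper and its placeholder values (`Not checked`, `none`, `n/a`) are assumptions for illustration only.

```python
from __future__ import annotations

# The seven checklists named in the output format, in table order.
CHECKLISTS = (
    "Tooling/CI drift",
    "Docs-vs-reality",
    "Config/env rot",
    "Dependency health",
    "Test health/flakiness",
    "Package/release health",
    "Instruction/skill drift",
)


def checklist_table(results: dict[str, tuple[str, str, str]]) -> str:
    """Render the table; values are (status, evidence, confidence) per checklist."""
    lines = [
        "| Checklist | Status | Evidence | Confidence |",
        "|---|---|---|---|",
    ]
    for name in CHECKLISTS:
        # A checklist with no result is reported as unchecked, never omitted.
        status, evidence, confidence = results.get(
            name, ("Not checked", "none", "n/a")
        )
        lines.append(f"| {name} | {status} | {evidence} | {confidence} |")
    return "\n".join(lines)
```

Rows rendered as unchecked should also appear under "Checks not run" with a reason.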
|