@static-var/keystone 0.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,134 @@
1
+ # Keystone Subagents and Reasoning Helper
2
+
3
+ ## Purpose
4
+ Teach Keystone when subagents can be used, which hosts support them, and what reasoning level each Keystone module should prefer.
5
+
6
+ This helper is advisory. Host capability wins over Keystone preference. If the current harness cannot set reasoning per subagent, encode the desired reasoning in the subagent prompt or do the work inline.
7
+
8
+ ## Delegation rule
9
+ Use subagents only when the task has a clear boundary and a useful handoff artifact.
10
+
11
+ Good delegation targets:
12
+ - read-only reconnaissance
13
+ - independent implementation slices
14
+ - focused debugging/root-cause analysis
15
+ - read-only review
16
+ - documentation/copy drafting
17
+
18
+ Do not delegate when:
19
+ - the task needs tight conversational clarification
20
+ - one agent must continuously coordinate shared mutable state
21
+ - the host cannot preserve enough context for safe handoff
22
+ - the subagent would need secrets or permissions the parent should not share
23
+ - delegation setup, context packaging, merge/review effort, or verification overhead is likely greater than doing the task inline
24
+
25
+ ## Delegation cost heuristic
26
+
27
+ Before spawning a subagent, compare expected benefit with coordination cost.
28
+
29
+ Delegate when at least one is true:
30
+ - the subtask can run in parallel with other independent work
31
+ - the subtask requires a different role, perspective, or reasoning depth
32
+ - the repository/source search is large enough that a scout can save parent context
33
+ - independent review or root-cause analysis materially reduces release risk
34
+
35
+ Do not delegate when most of these are true:
36
+ - the task can be done inline in a few minutes
37
+ - the parent would need to explain more context than the subagent can save
38
+ - outputs will require complex merge arbitration
39
+ - the work touches the same mutable files as another active agent without isolation
40
+ - verification cannot be independently stated in the prompt
41
+
42
+ If uncertain, prefer one narrow read-only scout/reviewer over multiple workers.
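The delegation cost heuristic can be sketched as a small decision function. This is an illustration under stated assumptions, not part of Keystone:

- The `SubtaskSignals` fields are hypothetical names for the listed conditions.
- "Most" is read as a strict majority of the five cost signals.
- Mixed signals are treated as the "uncertain" case, which falls back to a single read-only scout.

```typescript
// Hypothetical encoding of the delegation cost heuristic. Keystone states these
// signals in prose only; the field names are illustrative, not an API.
interface SubtaskSignals {
  // Benefit signals: "Delegate when at least one is true".
  parallelizable: boolean;
  needsDifferentRoleOrDepth: boolean;
  searchSavesParentContext: boolean;
  reducesReleaseRisk: boolean;
  // Cost signals: "Do not delegate when most of these are true".
  quickInline: boolean;
  explanationExceedsSavings: boolean;
  complexMergeArbitration: boolean;
  sharedFilesWithoutIsolation: boolean;
  verificationNotStatable: boolean;
}

type DelegationChoice = "delegate" | "inline" | "single-read-only-scout";

function chooseDelegation(s: SubtaskSignals): DelegationChoice {
  const benefits = [
    s.parallelizable,
    s.needsDifferentRoleOrDepth,
    s.searchSavesParentContext,
    s.reducesReleaseRisk,
  ].filter(Boolean).length;
  const costSignals = [
    s.quickInline,
    s.explanationExceedsSavings,
    s.complexMergeArbitration,
    s.sharedFilesWithoutIsolation,
    s.verificationNotStatable,
  ];
  const costs = costSignals.filter(Boolean).length;

  // No benefit, or a strict majority of cost signals: do the work inline.
  if (benefits === 0 || costs > costSignals.length / 2) return "inline";
  // Clear benefit and no cost signal: delegate.
  if (costs === 0) return "delegate";
  // Mixed signals are the "uncertain" case here (assumption): one narrow scout/reviewer.
  return "single-read-only-scout";
}
```

Consider a large repository search (a benefit) that could also be done inline in a few minutes (one cost signal). It lands on the single-scout fallback rather than a fan-out of workers.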
43
+
44
+ ## Host capability matrix
45
+
46
+ | Harness | Subagents | Per-subagent reasoning/effort | How to configure | Keystone policy |
47
+ |---|---:|---:|---|---|
48
+ | Pi coding agent with `pi-subagents` | yes | yes | `.pi/agents/<name>.md` frontmatter `model`, `thinking`, `profile`, or `Agent({ subagent_type, thinking, model, profile })` | Use native roles; prefer role defaults unless Keystone needs a one-off override. |
49
+ | Claude Code | yes | partial | `.claude/agents/<name>.md` frontmatter supports subagent config, including `model`; the built-in Explore subagent accepts a thoroughness level (quick, medium, or very thorough), but custom agents expose no general reasoning knob | Use `model` plus explicit prompt instructions; request an Explore thoroughness level when applicable. |
50
+ | Codex CLI/app | unclear/host-dependent | partial/global | known global config includes `model_reasoning_effort`; no stable per-subagent effort schema confirmed | Treat Keystone reasoning as advisory text unless the active Codex host exposes a per-agent effort control. |
51
+ | T3 Code | not confirmed | not confirmed | no confirmed public/local schema | Treat as unsupported; run inline or through the underlying Claude/Codex/OpenCode provider if available. |
52
+ | OpenCode | yes | partial/provider-dependent | agent config supports `mode: "subagent"`, `model`, and provider-specific `variant`; no universal reasoning field is confirmed | Use subagent mode; map Keystone reasoning to model/variant only where the provider exposes effort variants, otherwise write the desired reasoning in the prompt. |
53
+ | GitHub Copilot / VS Code | yes | partial | `.github/agents/*.agent.md` supports `model`, `agents`, `user-invocable`, `disable-model-invocation`; no general reasoning knob found | Use custom agents and model choice; put reasoning expectation in the agent prompt. |
54
+
55
+ ## Canonical reasoning scale
56
+
57
+ Keystone uses this host-neutral scale:
58
+
59
+ | Level | Use for |
60
+ |---|---|
61
+ | `off` | deterministic formatting, mechanical edits, no reasoning needed |
62
+ | `minimal` | tiny lookups, simple classification, trivial copy changes |
63
+ | `low` | ordinary reading, straightforward writing, small scoped tasks |
64
+ | `medium` | normal implementation, UI decisions, moderate research |
65
+ | `high` | architecture, debugging, review, planning, ambiguous tradeoffs |
66
+ | `xhigh` | hard root-cause analysis, security-sensitive review, major design decisions |
67
+
68
+ If a host uses another vocabulary, map to the nearest equivalent. If no setting exists, write the desired level into the prompt, for example: "Use high reasoning; explore alternatives before deciding."
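Here is a minimal sketch of "map to the nearest equivalent". It assumes the host's own vocabulary has already been translated onto Keystone's six levels. `nearestSupported` and `reasoningDirective` are hypothetical helpers. Breaking ties toward more reasoning is an assumption, not a documented rule.

```typescript
// Keystone's host-neutral scale, ordered from least to most reasoning.
const LEVELS = ["off", "minimal", "low", "medium", "high", "xhigh"] as const;
type Level = (typeof LEVELS)[number];

// Pick the supported level closest to the one Keystone wants. `supported` is
// assumed to be the host's vocabulary already mapped onto this scale.
function nearestSupported(wanted: Level, supported: readonly Level[]): Level | undefined {
  const target = LEVELS.indexOf(wanted);
  let best: Level | undefined;
  for (const level of supported) {
    if (best === undefined) {
      best = level;
      continue;
    }
    const distance = Math.abs(LEVELS.indexOf(level) - target);
    const bestDistance = Math.abs(LEVELS.indexOf(best) - target);
    // On a tie, lean toward more reasoning rather than less (an assumption).
    if (distance < bestDistance || (distance === bestDistance && LEVELS.indexOf(level) > LEVELS.indexOf(best))) {
      best = level;
    }
  }
  return best;
}

// Either a host setting or, when no setting exists, text for the subagent prompt.
function reasoningDirective(wanted: Level, supported: readonly Level[]): { setting: Level } | { promptText: string } {
  const setting = nearestSupported(wanted, supported);
  return setting !== undefined ? { setting } : { promptText: `Use ${wanted} reasoning.` };
}
```

A host that only offers `low`, `medium`, and `high` would receive `high` for an `xhigh` request. A host with no setting gets the prompt text instead.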
69
+
70
+ ## Keystone module defaults
71
+
72
+ | Keystone file | Preferred role | Default reasoning | Escalate when |
73
+ |---|---|---:|---|
74
+ | `modules/router.md` | none or lightweight classifier | `low` | request is ambiguous across several irreversible actions |
75
+ | `modules/research.md` | scout/read-only explorer or oracle | `medium` | repository is large, source relationships are unclear, or claims affect architecture, market, safety, or release decisions (`high`) |
76
+ | `modules/shape.md` | writer, UI/design reviewer, or oracle | `medium` | visual systems, accessibility, complex positioning, architecture, product viability, or irreversible scope decisions are involved (`high`/`xhigh`) |
77
+ | `modules/breakdown.md` | planner plus reviewer | `high` | plan spans multiple independent agents or risky sequencing (`xhigh`) |
78
+ | `modules/build.md` | worker | `medium` | concurrency, migrations, broad refactors, or unfamiliar stack (`high`) |
79
+ | `modules/debug.md` | oracle/root-cause investigator | `high` | intermittent, cross-system, performance, or data-loss failures (`xhigh`) |
80
+ | `modules/review.md` | reviewer/read-only | `high` | security, release, data migration, or public API review (`xhigh`) |
81
+ | `modules/ship.md` | ship coordinator | `medium` | release has unresolved risk or multi-host packaging (`high`) |
82
+ | `modules/health.md` | scout plus reviewer | `medium` | broad repository/tooling drift or release readiness audit (`high`) |
83
+ | `modules/gates/*.md` | none | `low` | evidence is contradictory or safety-critical (`medium`) |
84
+
85
+ ## Pi role mapping
86
+
87
+ When Pi subagents are available, use the narrowest role and usually keep its configured defaults:
88
+
89
+ | Need | Pi role | Typical thinking |
90
+ |---|---|---:|
91
+ | codebase exploration | `scout` | `low` |
92
+ | implementation | `worker` | `medium` |
93
+ | code/spec review | `reviewer` | `high` |
94
+ | architecture/root-cause second opinion | `oracle` | `high` or `xhigh` |
95
+ | docs/copy | `writer` | `low` |
96
+
97
+ Only override `thinking` when the module table says to escalate or de-escalate.
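The module defaults table and the Pi override rule can be combined into one lookup. This sketch only computes the value a caller might pass as `thinking`; it does not call any Pi API. The module keys are shortened from the table. Where `shape` lists `high`/`xhigh`, taking the lower level first is an assumption.

```typescript
type ReasoningLevel = "off" | "minimal" | "low" | "medium" | "high" | "xhigh";

interface ModuleReasoning {
  defaultLevel: ReasoningLevel;
  // Escalation levels named in the table; empty when the table gives none.
  escalateTo: ReasoningLevel[];
}

// Transcribed from the Keystone module defaults table (keys shortened).
const MODULE_REASONING: Record<string, ModuleReasoning> = {
  router: { defaultLevel: "low", escalateTo: [] },
  research: { defaultLevel: "medium", escalateTo: ["high"] },
  shape: { defaultLevel: "medium", escalateTo: ["high", "xhigh"] },
  breakdown: { defaultLevel: "high", escalateTo: ["xhigh"] },
  build: { defaultLevel: "medium", escalateTo: ["high"] },
  debug: { defaultLevel: "high", escalateTo: ["xhigh"] },
  review: { defaultLevel: "high", escalateTo: ["xhigh"] },
  ship: { defaultLevel: "medium", escalateTo: ["high"] },
  health: { defaultLevel: "medium", escalateTo: ["high"] },
  gates: { defaultLevel: "low", escalateTo: ["medium"] },
};

// Hypothetical helper: a `thinking` override only when the module escalates;
// `undefined` means "keep the Pi role's configured default".
function thinkingOverride(module: string, escalate: boolean): ReasoningLevel | undefined {
  const row = MODULE_REASONING[module];
  if (row === undefined) throw new Error(`unknown Keystone module: ${module}`);
  if (!escalate || row.escalateTo.length === 0) return undefined;
  // `shape` lists high/xhigh; this sketch starts at the lower level.
  return row.escalateTo[0];
}

// For hosts without role defaults: the level to request or write into the prompt.
function effectiveLevel(module: string, escalate: boolean): ReasoningLevel {
  const override = thinkingOverride(module, escalate);
  return override ?? MODULE_REASONING[module].defaultLevel;
}
```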
98
+
99
+ ## Safe parallel work pattern
100
+
101
+ 1. `breakdown` identifies independent tasks and their verification commands.
102
+ 2. `build` passes `gates/isolation.md` before any mutation.
103
+ 3. If the host supports isolated worktrees, each worker gets a separate worktree or host-isolated workspace.
104
+ 4. Each worker reports files changed, tests run, and concerns.
105
+ 5. Parent reconciles outputs before further mutation: accept, reject, or send back with a narrower prompt.
106
+ 6. `review` runs as read-only, preferably in a separate reviewer subagent.
107
+ 7. `ship` finalizes only after proof and review gates pass.
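Step 5 can be illustrated with a reconciliation pass over worker reports. The `WorkerReport` fields mirror step 4. The decision rules below are one reasonable reading of the pattern, not rules Keystone spells out:

- Reject overlapping edits.
- Send back work without passing proof.
- Otherwise accept.

```typescript
// Hypothetical shape of the report step 4 asks each worker to return.
interface WorkerReport {
  worker: string;
  filesChanged: string[];
  testsRun: string[];
  testsPassed: boolean;
  concerns: string[]; // surfaced to the parent; not auto-decided below
}

type ReconcileDecision = "accept" | "reject" | "send-back";

function reconcile(reports: WorkerReport[]): Record<string, ReconcileDecision> {
  // Count how many distinct workers touched each file.
  const touches: Record<string, number> = {};
  for (const r of reports) {
    const unique = r.filesChanged.filter((f, i, all) => all.indexOf(f) === i);
    for (const f of unique) touches[f] = (touches[f] ?? 0) + 1;
  }

  const decisions: Record<string, ReconcileDecision> = {};
  for (const r of reports) {
    if (r.filesChanged.some((f) => touches[f] > 1)) {
      // Overlapping edits mean isolation failed; do not merge them blindly.
      decisions[r.worker] = "reject";
    } else if (r.testsRun.length === 0 || !r.testsPassed) {
      // No passing proof yet: send back with a narrower prompt.
      decisions[r.worker] = "send-back";
    } else {
      decisions[r.worker] = "accept";
    }
  }
  return decisions;
}
```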
108
+
109
+ ## Prompt contract for delegated work
110
+
111
+ Every subagent prompt should include:
112
+
113
+ - exact task scope
114
+ - files or areas allowed to change/read
115
+ - protected files
116
+ - expected output artifact or report
117
+ - reasoning level requested if the host cannot enforce it
118
+ - verification command expected
119
+ - instruction not to broaden scope
120
+ - timeout or stopping condition when the host supports it
121
+
122
+ For read-only subagents, explicitly say: "Do not edit files." For review subagents, explicitly say: "Return findings only; do not fix."
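Here is a sketch of assembling a subagent prompt from the contract items. The `DelegationContract` type and the wording of most lines are hypothetical. Only the two quoted instructions come from Keystone.

```typescript
// Hypothetical contract object; Keystone lists these items in prose only.
interface DelegationContract {
  kind: "worker" | "read-only" | "review";
  scope: string;
  allowedPaths: string[];
  protectedPaths: string[];
  expectedArtifact: string;
  verificationCommand: string;
  reasoning?: string; // set only when the host cannot enforce reasoning
  stoppingCondition?: string; // set only when the host supports timeouts/stops
}

function buildSubagentPrompt(c: DelegationContract): string {
  const lines: string[] = [
    `Task scope: ${c.scope}`,
    `Allowed files/areas: ${c.allowedPaths.join(", ")}`,
    `Protected files (do not touch): ${c.protectedPaths.join(", ") || "none"}`,
    `Expected output: ${c.expectedArtifact}`,
    `Verification command: ${c.verificationCommand}`,
    "Do not broaden scope beyond this task.",
  ];
  if (c.reasoning) lines.push(`Use ${c.reasoning} reasoning.`);
  if (c.stoppingCondition) lines.push(`Stop when: ${c.stoppingCondition}`);
  // Review subagents are read-only too, so they get both instructions.
  if (c.kind !== "worker") lines.push("Do not edit files.");
  if (c.kind === "review") lines.push("Return findings only; do not fix.");
  return lines.join("\n");
}
```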
123
+
124
+ ## Subagent result handling
125
+
126
+ Treat subagent output as evidence, not truth. The parent remains responsible for verification and final routing.
127
+
128
+ - Timeout or no response: mark the subtask incomplete, preserve any partial logs, and either finish inline or re-delegate with a smaller scope if the remaining work is still worth the overhead.
129
+ - Bad output or scope creep: reject the result, record why, and re-delegate only with a tighter prompt, protected-file list, and explicit expected artifact. Otherwise do it inline.
130
+ - Partial completion: accept only independently verified pieces; carry unfinished work in the handoff packet with files touched, tests run, and remaining risks.
131
+ - Conflicting outputs: compare evidence and reproduction/proof commands first. If both are plausible and risk is material, ask an oracle/reviewer for read-only arbitration or route to `debug`; do not merge contradictory fixes blindly.
132
+ - Failed verification: route through the appropriate Keystone module (`debug` for unexplained failures, `build` for contained fixes, `review` for risk assessment) and rerun the original proof before continuing.
133
+
134
+ Re-delegate only when the next prompt can be narrower than the failed one, the expected artifact is concrete, and the benefit still exceeds coordination cost. Otherwise continue inline and record the reason.
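The re-delegation rule and the failed-verification routing can be written as a guard and a lookup table. Two simplifying assumptions apply: "narrower" is treated as a strict subset of allowed paths, and benefit and cost are scored as plain numbers.

```typescript
// Hypothetical description of a delegation prompt, reduced to what the rule checks.
interface DelegationPrompt {
  scopePaths: string[]; // files or areas the prompt allows
  expectedArtifact: string; // "" when no concrete artifact is named
}

// Re-delegate only if the next prompt is narrower than the failed one, names a
// concrete artifact, and the benefit still exceeds coordination cost.
// "Narrower" is approximated as a strict subset of the allowed paths.
function shouldRedelegate(
  failed: DelegationPrompt,
  next: DelegationPrompt,
  benefit: number,
  coordinationCost: number,
): boolean {
  const subset = next.scopePaths.every((p) => failed.scopePaths.indexOf(p) !== -1);
  const narrower = subset && next.scopePaths.length < failed.scopePaths.length;
  const concrete = next.expectedArtifact.trim().length > 0;
  return narrower && concrete && benefit > coordinationCost;
}

// Failed verification routes through a Keystone module, per the list above.
type FailureKind = "unexplained" | "contained" | "risk-assessment";
const FAILED_VERIFICATION_ROUTE: Record<FailureKind, "debug" | "build" | "review"> = {
  unexplained: "debug",
  contained: "build",
  "risk-assessment": "review",
};
```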
@@ -0,0 +1,86 @@
1
+ # Keystone Research Module
2
+
3
+ ## Core principle
4
+ Research is evidence gathering before action. Inspect available material first, preserve source quality, separate facts from assumptions, and do not mutate the project unless the user explicitly asks for a durable research artifact.
5
+
6
+ ## Load when
7
+ Load when the user asks to read, inspect, summarize, inventory, extract, compare, explain, investigate options, gather technical or market context, validate claims, or answer “what is true here?” before a decision.
8
+
9
+ ## Not for
10
+ - Implementing, refactoring, editing, or fixing code.
11
+ - Shaping product direction beyond evidence-backed options.
12
+ - Broad tooling risk audits; use `health`.
13
+ - Root-cause repair of a failure; hand off to `debug` once initial context is gathered.
14
+ - Guessing when evidence can be inspected.
15
+
16
+ ## Outcome contract
17
+ Deliver a research brief that states:
18
+ - question or decision being supported;
19
+ - sources inspected, with file paths, commands, URLs, or other citations;
20
+ - source-quality notes (primary vs secondary, current vs stale, authoritative vs anecdotal);
21
+ - findings separated from assumptions and unknowns;
22
+ - confidence level and why;
23
+ - recommended next module or no-op if no action is warranted.
24
+
25
+ ## Modes
26
+ - **Repository read:** inspect files, history, configs, tests, docs, and existing behavior. Prefer primary project evidence.
27
+ - **External research:** compare outside documentation, standards, issues, market examples, or APIs. Cite URLs and note recency.
28
+ - **Synthesis:** combine several sources into a decision-ready summary with tradeoffs and confidence.
29
+ - **Discovery scout:** map a large unknown area without drawing strong conclusions until evidence is sampled.
30
+
31
+ ## Process
32
+ 1. Restate the research question and the decision it informs.
33
+ 2. Inspect before asking: search/read the repo, docs, logs, or provided sources before requesting more context.
34
+ 3. Prefer primary evidence: source code, tests, product docs, official docs, reproducible commands, direct user-provided material.
35
+ 4. Track citations as you go. Every important claim should point to evidence or be labeled as an assumption.
36
+ 5. Evaluate source quality: age, authority, completeness, bias, and whether evidence is direct or inferred.
37
+ 6. Compare alternatives when relevant, including costs, risks, constraints, and no-op implications.
38
+ 7. State unknowns explicitly. Do not fill gaps with confident-sounding speculation.
39
+ 8. Recommend the smallest next step: `shape`, `debug`, `health`, `breakdown`, `build`, `review`, or stop.
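Step 4 of the process can be checked mechanically before a brief is delivered. This hypothetical lint flags facts and interpretations that carry no citation. The `Claim` type is illustrative, not a Keystone schema.

```typescript
type ClaimKind = "fact" | "interpretation" | "assumption" | "recommendation";

interface Claim {
  text: string;
  kind: ClaimKind;
  citations: string[]; // file paths, commands, URLs, or other sources
}

// Facts and interpretations must point at evidence; anything unsupported
// should be cited or relabeled as an assumption/unknown before delivery.
function unsupportedClaims(claims: Claim[]): Claim[] {
  return claims.filter(
    (c) => (c.kind === "fact" || c.kind === "interpretation") && c.citations.length === 0,
  );
}
```

Run against the module's worked example, the bad finding ("Stripe probably handles retries safely") would be flagged until it is cited or relabeled as an assumption.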
40
+
41
+ ## Subagents and reasoning
42
+ Default reasoning: `medium`. Use read-only scout subagents when the search space is large or evidence can be gathered independently. Use `low` for narrow file summaries. Use `high` when findings affect architecture, security, safety, release decisions, legal/market claims, or irreversible product direction. Subagents must remain read-only unless the user requested an artifact.
43
+
44
+ ## Hard rules
45
+ - No mutation by default: do not edit files, run formatters, or otherwise alter state; run only harmless read-only commands.
46
+ - Cite evidence for material claims; if evidence is unavailable, say so.
47
+ - Distinguish facts, interpretations, assumptions, and recommendations.
48
+ - Do not ask for information that can be inspected first.
49
+ - Do not present search results or model knowledge as authoritative without source-quality caveats.
50
+
51
+ ## Failure modes
52
+ - **Context theater:** long summaries without citations or decision relevance.
53
+ - **Source laundering:** treating blogs, stale docs, or guesses as facts.
54
+ - **Premature shaping:** deciding product behavior before evidence is clear.
55
+ - **Mutation creep:** “just fixing” or rewriting while researching.
56
+ - **Hidden uncertainty:** omitting confidence, unknowns, or contradictory evidence.
57
+
58
+ ## Worked example
59
+ Good research finding: “Official Stripe docs state that Stripe saves the status code and body of the first request made with a given idempotency key, whether it succeeds or fails, and returns that saved result for later requests with the same key; this means retrying payment capture should reuse the original key, not generate a new one. Confidence: High — primary docs, current page.”
60
+
61
+ Bad research finding: “Stripe probably handles retries safely, so we can just retry the request.”
62
+
63
+ ## Output format
64
+ ```markdown
65
+ ## Research brief
66
+ Question: ...
67
+
68
+ ### Evidence inspected
69
+ - `path/or/source`: what it shows, quality note
70
+
71
+ ### Findings
72
+ - Fact — citation
73
+ - Interpretation — citation + reasoning
74
+
75
+ ### Assumptions / unknowns
76
+ - ...
77
+
78
+ ### Options or implications
79
+ - ...
80
+
81
+ ### Confidence
82
+ High/Medium/Low — why
83
+
84
+ ### Recommended next step
85
+ Module or no-op, with rationale
86
+ ```
@@ -0,0 +1,270 @@
1
+ # Keystone Review Module
2
+ ## Core principle
3
+ Review is an independent, read-only attempt to disprove readiness.
4
+
5
+ Ask two questions at the same time:
6
+ 1. **Spec axis:** does the work satisfy the stated requirements and acceptance criteria?
7
+ 2. **Standards axis:** is it secure, correct, maintainable, tested, and safe to operate?
8
+
9
+ Do not assume changed lines are the blast radius. Trace callers, callees, contracts,
10
+ data flow, tests, runtime paths, and user impact before giving a verdict.
11
+
12
+ ## Load when
13
+ Load when the user asks for code review, critique, audit, readiness assessment,
14
+ release/merge review, security review, regression review, or review of a diff, branch,
15
+ PR, patch, migration, fix, plan output, or completed implementation.
16
+
17
+ Also load when another Keystone module needs `gates/review.md` satisfied before ship.
18
+
19
+ ## Not for
20
+ Do not use Review for:
21
+ - fixing, refactoring, formatting, or rewriting code
22
+ - committing, merging, tagging, publishing, or shipping
23
+ - initial implementation planning before a reviewable artifact exists
24
+ - open-ended research with no concrete artifact to assess
25
+ - debugging where the requested outcome is a fix
26
+
27
+ If asked to review and fix, review first and stop. Hand findings to `build`, `debug`,
28
+ `research`, `ship`, or a human only after the user gives explicit permission.
29
+
30
+ ## Outcome contract
31
+ A complete review returns:
32
+ - verdict: **Block**, **Caution**, or **Looks good**
33
+ - findings ordered P0, P1, P2, P3, then Nitpicks
34
+ - evidence for every finding: file/line, behavior path, contract, test, log, or doc
35
+ - user impact and why the severity is justified
36
+ - remediation guidance without applying the fix
37
+ - tests that should be added or updated for affected behavior
38
+ - scope reviewed, validation run, limitations, and read-only confirmation
39
+
40
+ The review is incomplete if it only inspects the diff, only comments on style, or
41
+ cannot explain how the work behaves at runtime.
42
+
43
+ ## Review passes
44
+ Perform multiple passes. New evidence from one pass expands later passes.
45
+ ### Pass 0: scope and baseline
46
+ - Identify artifact reviewed: diff, branch, files, release candidate, or plan result.
47
+ - Read the user request, issue, spec, acceptance criteria, and claimed completion.
48
+ - Check repository status without modifying files.
49
+ - Record uncommitted work as context; do not clean it up.
50
+
51
+ ### Pass 1: spec compliance
52
+ - Compare implementation against explicit requirements and non-goals.
53
+ - Check edge cases, error states, and acceptance criteria.
54
+ - Separate spec misses from standards concerns.
55
+ - Treat a clean implementation of the wrong behavior as a finding.
56
+
57
+ ### Pass 2: correctness and runtime paths
58
+ - Trace primary success and failure paths end to end.
59
+ - Follow changed functions into helpers, services, adapters, persistence, UI, jobs, and
60
+ serializers.
61
+ - Validate inputs, outputs, invariants, state transitions, retries, ordering,
62
+ concurrency assumptions, and error propagation.
63
+ - Look for nullability, off-by-one, time, encoding, pagination, caching, idempotency,
64
+ cancellation, and partial-failure issues.
65
+
66
+ ### Pass 3: regression and compatibility
67
+ - Identify callers, consumers, and workflows that rely on old behavior.
68
+ - Check public APIs, CLIs, schemas, migrations, persisted data, environment variables,
69
+ feature flags, configuration defaults, and documentation.
70
+ - Consider rollback, downgrade, mixed-version, and incremental rollout risks.
71
+ - Search for tests or fixtures that encode previous behavior.
72
+
73
+ ### Pass 4: security, privacy, and abuse resistance
74
+ - Review authentication, authorization, tenancy, secrets, logging, validation,
75
+ injection, XSS, SSRF, path traversal, unsafe deserialization, and RCE surfaces.
76
+ - Check whether sensitive data leaks through errors, logs, telemetry, URLs, caches,
77
+ exports, screenshots, or third-party calls.
78
+ - Consider malicious users, compromised clients, replay, races, resource exhaustion,
79
+ privilege escalation, and denial of service.
80
+
81
+ ### Pass 5: tests and proof
82
+ - Map changed behavior to existing tests.
83
+ - Identify missing unit, integration, contract, regression, migration, security,
84
+ accessibility, performance, or end-to-end coverage.
85
+ - Prefer behavior assertions over implementation trivia.
86
+ - Run focused read-only validation when practical: existing tests, type checks, lint,
87
+ builds, or targeted commands.
88
+ - If validation cannot run, state why and what should be run.
89
+
90
+ ### Pass 6: maintainability and architecture
91
+ - Assess clarity, cohesion, naming, dependency direction, duplication, complexity,
92
+ observability, and debuggability.
93
+ - Check architectural boundaries, local conventions, and API contracts.
94
+ - Flag brittle abstractions, hidden coupling, unnecessary cleverness, and premature
95
+ generalization when they create real maintenance risk.
96
+
97
+ ### Pass 7: user impact and final consistency
98
+ - Translate technical issues into affected personas, workflows, data, accessibility,
99
+ performance, reliability, and support burden.
100
+ - Re-rank findings by blast radius, likelihood, recoverability, and detectability.
101
+ - De-duplicate findings, verify evidence, and state limitations honestly.
102
+
103
+ ## Severity rubric
104
+ Severity reflects realistic impact, not fix size.
105
+
106
+ ### P0: Critical blocker
107
+ Immediate or likely severe harm. Examples:
108
+ - data loss, corruption, or irreversible destructive action
109
+ - unauthorized access, privilege escalation, secret exposure, or major privacy breach
110
+ - production outage or release artifact that cannot safely deploy
111
+ - legal/compliance risk with material impact
112
+
113
+ P0 means do not ship or merge without accountable human acceptance and mitigation.
114
+
115
+ ### P1: Blocking defect
116
+ High-impact issue that violates core requirements or creates serious regression risk.
117
+ Examples:
118
+ - primary workflow broken for a meaningful user segment
119
+ - incorrect billing, permissions, persistence, or business logic
120
+ - migration or compatibility gap that can break real deployments
121
+ - high-risk behavior lacking tests plus a plausible failure mode
122
+
123
+ P1 normally blocks ship.
124
+
125
+ ### P2: Important non-blocker or conditional blocker
126
+ Material issue with bounded impact, lower likelihood, or workaround. Examples:
127
+ - edge case with clear user impact
128
+ - moderate-risk test gap
129
+ - maintainability issue likely to cause near-term bugs
130
+ - weak observability for a risky path
131
+
132
+ State whether release context makes it blocking.
133
+
134
+ ### P3: Low-risk improvement
135
+ Valid concern with limited impact. Examples:
136
+ - confusing name or local complexity that slows future work
137
+ - minor non-hot-path performance inefficiency
138
+ - incomplete docs for non-critical behavior
139
+ - small test organization weakness
140
+
141
+ P3 should not block unless it compounds with related risks.
142
+
143
+ ### Nitpick
144
+ Cosmetic, preference-level, or optional feedback: unenforced formatting, wording tweaks,
145
+ or style suggestions with no correctness or maintainability impact. Keep nitpicks
146
+ separate from severity findings.
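The ordering requirement and one possible verdict mapping can be sketched as follows. The rubric does not define exactly how severities roll up into **Block**, **Caution**, or **Looks good**, so `verdictFor` is an assumption:

- Every P1 is treated as blocking.
- A P2 blocks only when release context is marked blocking.
- Compounding P3 risks are ignored.

```typescript
const SEVERITY_ORDER = ["P0", "P1", "P2", "P3", "Nitpick"] as const;
type Severity = (typeof SEVERITY_ORDER)[number];
type ReviewVerdict = "Block" | "Caution" | "Looks good";

interface Finding {
  title: string;
  severity: Severity;
  // P2 only: whether release context makes this finding blocking.
  blockingInContext?: boolean;
}

// Findings are reported P0 first and Nitpicks last.
function orderFindings(findings: Finding[]): Finding[] {
  return findings
    .slice()
    .sort((a, b) => SEVERITY_ORDER.indexOf(a.severity) - SEVERITY_ORDER.indexOf(b.severity));
}

// Hypothetical roll-up: P1 "normally" blocks, but this sketch always treats it
// as blocking, and it does not model P3 risks that compound.
function verdictFor(findings: Finding[]): ReviewVerdict {
  const blocking = findings.some(
    (f) => f.severity === "P0" || f.severity === "P1" || (f.severity === "P2" && f.blockingInContext === true),
  );
  if (blocking) return "Block";
  return findings.some((f) => f.severity === "P2") ? "Caution" : "Looks good";
}
```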
147
+
148
+ ## Impact tracing
149
+ For each meaningful change, trace:
150
+ - **Entry points:** user action, API route, CLI, job, event, hook, or import.
151
+ - **Callers:** who invokes this and what assumptions they make.
152
+ - **Callees:** helpers, libraries, persistence, network calls, and side effects.
153
+ - **Data flow:** input, validation, transformation, storage, serialization, output.
154
+ - **Contracts:** types, schemas, public APIs, flags, config, docs, and errors.
155
+ - **Runtime paths:** success, failure, retry, timeout, cancellation, concurrency.
156
+ - **Tests:** existing coverage, missing assertions, fixtures, mocks, snapshots.
157
+ - **Users:** visible behavior, accessibility, performance, reliability, trust.
158
+
159
+ If tracing leaves uncertainty, gather more read-only evidence or report the limitation.
160
+ Do not invent confidence.
161
+
162
+ ## Security and regression checklist
163
+ Ask for every non-trivial review:
164
+ - Can a user access, modify, infer, or delete data they should not?
165
+ - Are authn, authz, tenancy, and ownership checked at the right layer?
166
+ - Can untrusted input reach queries, interpreters, shells, paths, templates, redirects,
167
+ or deserializers unsafely?
168
+ - Are secrets, tokens, PII, or internal identifiers exposed in logs, errors, telemetry,
169
+ URLs, caches, or client bundles?
170
+ - Did defaults, permissions, feature flags, or safeguards become unsafe?
171
+ - Are races, duplicate submissions, retries, replay, and out-of-order events safe?
172
+ - Can persisted data be corrupted, stranded, or made hard to roll back?
173
+ - Are public APIs, stored data, configs, and integrations backward compatible?
174
+ - Does failure degrade safely without hidden partial success?
175
+ - Are performance, resource use, accessibility, localization, and platform differences
176
+ acceptable for realistic users and abuse?
177
+ - Do tests cover the affected behavior and important regression paths?
178
+
179
+ ## Subagents and reasoning
180
+ Default reasoning: `high`.
181
+
182
+ Use read-only reviewer subagents for separable risks: security/privacy, test coverage,
183
+ architecture/API compatibility, persistence/migration, accessibility/user impact,
184
+ performance, concurrency, or release risk. Escalate to `xhigh` for security-sensitive,
185
+ data-loss, billing, permissions, public API, migration, or cross-system reviews.
186
+
187
+ Subagents must receive the read-only contract and return evidence-backed findings, not
188
+ patches. Reconcile duplicates and conflicts before reporting. The primary reviewer
189
+ owns final severity and verdict.
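Reconciling subagent findings might look like the sketch below. Keying duplicates on exact location plus case-insensitive title is a hypothetical simplification, since real duplicates are often phrased differently. The merged severity is only provisional: disagreements are flagged so the primary reviewer makes the final call.

```typescript
const SEVERITY_RANK = ["P0", "P1", "P2", "P3", "Nitpick"] as const;
type RankedSeverity = (typeof SEVERITY_RANK)[number];

interface SubagentFinding {
  source: string; // which reviewer subagent reported it
  location: string; // e.g. "file.ts:42"
  title: string;
  severity: RankedSeverity;
}

interface MergedFinding {
  location: string;
  title: string;
  severity: RankedSeverity; // provisional; the primary reviewer owns the final value
  sources: string[];
  severityConflict: boolean;
}

// Hypothetical merge: duplicates share a location and a case-insensitive title.
function mergeFindings(findings: SubagentFinding[]): MergedFinding[] {
  const byKey: Record<string, MergedFinding> = {};
  for (const f of findings) {
    const key = `${f.location}::${f.title.toLowerCase()}`;
    const existing = byKey[key];
    if (existing === undefined) {
      byKey[key] = { location: f.location, title: f.title, severity: f.severity, sources: [f.source], severityConflict: false };
      continue;
    }
    if (existing.sources.indexOf(f.source) === -1) existing.sources.push(f.source);
    if (f.severity !== existing.severity) {
      existing.severityConflict = true;
      // Keep the more severe rating until the primary reviewer decides.
      if (SEVERITY_RANK.indexOf(f.severity) < SEVERITY_RANK.indexOf(existing.severity)) {
        existing.severity = f.severity;
      }
    }
  }
  return Object.keys(byKey).map((k) => byKey[k]);
}
```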
190
+
191
+ ## Hard rules
192
+ - Strictly read-only: do not edit, format, generate, stage, commit, merge, tag, publish,
193
+ or ship files.
194
+ - Do not silently fix issues discovered during review.
195
+ - Do not run destructive or project-mutating commands.
196
+ - Do not rely only on changed lines; inspect impacted code paths and contracts.
197
+ - Do not approve solely because tests pass.
198
+ - Do not report speculation as fact; mark uncertainty.
199
+ - Do not bury blockers under minor comments.
200
+ - Do not disguise style preferences as correctness findings.
201
+ - Do not omit needed tests when behavior changed.
202
+ - Do not mark `gates/review.md` as satisfied unless blockers and non-blockers are separated.
203
+
204
+ ## Failure modes
205
+ Avoid these anti-patterns:
206
+ - **Single-pass skim:** one read of changed lines plus generic comments.
207
+ - **Diff tunnel vision:** missing callers, callees, contracts, and user impact.
208
+ - **Checklist theater:** naming security/tests without tracing actual risk.
209
+ - **Green-test rubber stamp:** assuming current tests prove new behavior.
210
+ - **Spec blindness:** judging code quality while requirements are unmet.
211
+ - **Standards blindness:** accepting unsafe or fragile code because the narrow spec passes.
212
+ - **Severity inflation:** turning preferences into blockers.
213
+ - **Severity deflation:** downgrading real user harm because the fix is small.
214
+ - **Patch creep:** fixing, refactoring, or committing instead of reviewing.
215
+ - **Unowned uncertainty:** failing to state what was not verified.
216
+
217
+ ## Output format
218
+ Worked finding example:
219
+ ```markdown
220
+ ### P1
221
+ - Missing tenant check on invoice export
222
+ - Evidence: `api/exportInvoice.ts:42` accepts `invoiceId` and loads the invoice without comparing `invoice.accountId` to the authenticated account; `/invoices/:id/export` is reachable by any logged-in user.
223
+ - Impact: A user who guesses another invoice ID can download billing data from a different account, which is a privacy and authorization breach.
224
+ - Recommendation: Enforce tenant ownership before export and return the existing unauthorized response on mismatch.
225
+ - Tests needed: Add an integration test where account A requests account B's invoice and receives 403/no file, plus a happy-path same-account export test.
226
+ ```
227
+
228
+ Use this structure:
229
+ ```markdown
230
+ ## Verdict
231
+ Block | Caution | Looks good
232
+
233
+ ## Scope reviewed
234
+ - Artifact reviewed:
235
+ - Key files/paths inspected:
236
+ - Validation run:
237
+ - Review limitations:
238
+ - Read-only confirmation: no files changed by this review
239
+
240
+ ## Findings
241
+ ### P0
242
+ - [Title]
243
+ - Evidence:
244
+ - Impact:
245
+ - Recommendation:
246
+ - Tests needed:
247
+ ### P1
248
+ None
249
+
250
+ ### P2
251
+ None
252
+
253
+ ### P3
254
+ None
255
+
256
+ ## Nitpicks
257
+ None
258
+
259
+ ## Tests to add or update
260
+ - Behavior:
261
+ - Suggested coverage:
262
+ - Why it matters:
263
+ ## Handoff
264
+ - Blockers:
265
+ - Non-blocking follow-up:
266
+ - Suggested owner module: build, debug, research, ship, or human
267
+ ```
268
+
269
+ If a severity has no findings, write `None`. Recommendations must be actionable but
270
+ must not be applied by Review.
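As a small illustration of the `None` rule, this hypothetical renderer emits the Findings and Nitpicks sections of the output template. It keeps only titles. The evidence, impact, recommendation, and test fields would be rendered the same way.

```typescript
type ReportSeverity = "P0" | "P1" | "P2" | "P3" | "Nitpick";

interface ReportedFinding {
  severity: ReportSeverity;
  title: string;
}

const SEVERITY_HEADINGS: ReportSeverity[] = ["P0", "P1", "P2", "P3"];

// Render the Findings and Nitpicks sections, writing `None` for any
// severity without findings.
function renderFindings(findings: ReportedFinding[]): string {
  const titlesFor = (s: ReportSeverity): string[] =>
    findings.filter((f) => f.severity === s).map((f) => `- ${f.title}`);
  const out: string[] = ["## Findings"];
  for (const s of SEVERITY_HEADINGS) {
    const items = titlesFor(s);
    out.push(`### ${s}`, ...(items.length > 0 ? items : ["None"]), "");
  }
  const nits = titlesFor("Nitpick");
  out.push("## Nitpicks", ...(nits.length > 0 ? nits : ["None"]));
  return out.join("\n");
}
```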
@@ -0,0 +1,36 @@
1
+ # Keystone Router Module
2
+
3
+ ## Intent
4
+ Classify the user's request and select one Keystone primary module.
5
+
6
+ ## Load when
7
+ The task is ambiguous, explicitly asks for routing, or starts from `/keystone` without a clear module fit.
8
+
9
+ ## Allowed mutation
10
+ None, except writing a short routing decision in the conversation.
11
+
12
+ ## Must not
13
+ Modify files, perform implementation, or expose internal modules as public slash commands.
14
+
15
+ ## May call
16
+ One primary module after classification. Gates only if that primary module requires them.
17
+
18
+ ## Subagents and reasoning
19
+ Default reasoning: `low`. Do not deploy subagents for simple routing; ask one clarifying question if a safe single route is not clear. See `helpers/subagents.md`.
20
+
21
+ ## Routing heuristics
22
+ Prefer the module indicated by the user's strongest current need, not the first verb alone. Weigh multiple signals together:
23
+ - Failure, error, broken behavior, repro, or "fix" language points to `debug` before implementation.
24
+ - Review, verify, approve, merge, or ship language points to `review` before release or cleanup.
25
+ - New capability requests route by maturity: use `shape` when intent is fuzzy, `breakdown` when the outcome is clear but work needs decomposition, and `build` when scope and acceptance criteria are already concrete.
26
+
27
+ ## Disambiguation examples
28
+ - "debug this and fix it" -> select `debug`; establish cause before changing code.
29
+ - "review and ship" -> select `review`; verify readiness before any shipping step.
30
+ - "add feature" -> select `shape`, `breakdown`, or `build` based on maturity; ask one concise clarifying question if maturity is not inferable.
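The routing heuristics and examples can be approximated with keyword signals, as in this sketch. The regular expressions, the `Maturity` input, and the question wording are all hypothetical. A keyword check is much cruder than weighing the user's strongest current need. The priority order (failure, then readiness, then new capability) simply follows the order of the heuristics list.

```typescript
type PrimaryModule = "debug" | "review" | "shape" | "breakdown" | "build";
type Maturity = "fuzzy" | "needs-decomposition" | "concrete" | "unknown";
type RouteResult = { module: PrimaryModule; reason: string } | { clarifyingQuestion: string };

// Crude, hypothetical keyword signals for the three heuristics.
const FAILURE_SIGNALS = /\b(debug\w*|fix\w*|errors?|broken|fail\w*|repro\w*)\b/;
const READINESS_SIGNALS = /\b(review\w*|verify|approve|merge|ship\w*)\b/;
const CAPABILITY_SIGNALS = /\b(add|feature|capability)\b/;

function route(request: string, maturity: Maturity = "unknown"): RouteResult {
  const text = request.toLowerCase();
  if (FAILURE_SIGNALS.test(text)) {
    return { module: "debug", reason: "Failure language: establish cause before changing code." };
  }
  if (READINESS_SIGNALS.test(text)) {
    return { module: "review", reason: "Readiness language: verify before any shipping step." };
  }
  if (CAPABILITY_SIGNALS.test(text)) {
    if (maturity === "fuzzy") return { module: "shape", reason: "Intent is still fuzzy." };
    if (maturity === "needs-decomposition") return { module: "breakdown", reason: "Outcome is clear but needs decomposition." };
    if (maturity === "concrete") return { module: "build", reason: "Scope and acceptance criteria are concrete." };
    return { clarifyingQuestion: "Is the desired behavior already specified, or should we shape it first?" };
  }
  // Exit gate: exactly one module, or one clarifying question.
  return { clarifyingQuestion: "What outcome do you need from Keystone here?" };
}
```

Note that `route("review the fix")` returns `debug` here because "fix" is checked first. That is exactly the first-verb trap the heuristics warn about, and the reason the real router applies judgment rather than keywords.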
31
+
32
+ ## Handoff
33
+ Name the selected primary module and the reason in one sentence, then continue under that module's contract.
34
+
35
+ ## Exit gate
36
+ Exactly one primary module is selected, or one clarifying question is asked.
@@ -0,0 +1,125 @@
1
+ # Keystone Shape Module
2
+
3
+ ## Core principle
4
+ Shape is a specification algorithm: turn an unclear intent into exact behavior, constraints, tradeoffs, and acceptance criteria before anyone builds. It decides what should be true, not whether code is complete.
5
+
6
+ ## Load when
7
+ Load when the user asks to draft, rewrite, design, spec, define product behavior, improve UI/UX, choose visual direction, name or explain a feature, make scope or architecture tradeoffs, prepare acceptance criteria, or turn research into an implementation-ready direction.
8
+
9
+ ## Not for
10
+ - Writing implementation code or changing runtime behavior; hand off to `build`.
11
+ - Diagnosing failures; use `debug`.
12
+ - Broad repository health or release readiness; use `health` or `ship`.
13
+ - Inventing facts that should be researched first.
14
+ - Polishing completed work or presenting it as proof that it is shippable.
15
+
16
+ ## Outcome contract
17
+ Deliver a shaped proposal that includes:
18
+ - goal, user/audience, and success criteria;
19
+ - product behavior and UX states, including happy, empty, loading/pending, error/failure, and edge/constraint states where relevant;
20
+ - copy or content direction when user-facing text matters;
21
+ - architecture and scope tradeoffs at the level needed for planning, not implementation;
22
+ - alternatives considered and why one direction is preferred;
23
+ - acceptance criteria and non-goals;
24
+ - recommended next module (`breakdown`, `build`, `review`, `research`, or no-op).
25
+
26
+ ## Modes
27
+ - **Product shape:** specify the user job, trigger, actor permissions, core flow, business rules, constraints, success metrics, non-goals, and acceptance criteria.
28
+ - **UX/UI shape:** specify layout hierarchy, navigation, interaction model, responsiveness, accessibility, visual constraints, and the 5-state UX checklist: happy, empty, loading/pending, error/failure, edge/constraint.
29
+ - **Copy shape:** specify audience, message hierarchy, claims, tone, CTA, labels, empty/error text, and prohibited vague claims.
30
+ - **Technical shape:** specify boundary placement, API granularity, data flow, state ownership, dependency direction, persistence/integration seams, and architectural tradeoffs without writing code.
31
+ - **Alternative exploration:** present multiple viable directions before choosing or asking the user to choose.
32
+
33
+ ## Process
34
+ 1. Classify the request into one or more modes: product, UX/UI, copy, technical, or alternatives.
35
+ 2. Identify the goal, primary user/audience, job-to-be-done, context of use, and success criteria.
36
+ 3. Inspect existing product patterns, domain language, and provided material before inventing new conventions.
37
+ 4. Convert intent into exact rules:
38
+ - Product: actor, trigger, preconditions, action, result, permissions, limits, and measurable success.
39
+ - UX/UI: screen/region hierarchy, controls, transitions, accessibility behavior, responsive behavior, and the 5-state UX checklist.
40
+ - Copy: exact headline/body/CTA/error text or content rules, with claims grounded in known facts.
41
+ - Technical: components/modules involved, ownership boundaries, contracts, data flow, failure handling, and migration or rollout constraints.
42
+ 5. Apply technical shaping heuristics when architecture matters:
43
+ - **Boundary placement:** put boundaries where ownership, volatility, testability, or external systems change; do not split stable one-step logic.
44
+ - **API granularity:** prefer operations that match caller intent; avoid both chatty micro-methods and god endpoints that hide unrelated behavior.
45
+ - **Data flow:** name source of truth, state transitions, sync/async edges, validation points, and where errors surface.
46
+ - **Architectural tradeoffs:** state what becomes simpler, harder, slower, safer, more testable, or more coupled.
47
+ 6. Ban fluffy terms unless translated to behavior. Words like “modern,” “clean,” “intuitive,” “delightful,” “seamless,” or “user-friendly” must become observable rules.
48
+ 7. Offer alternatives when the direction is not obvious. Include the no-op option if legitimate.
49
+ 8. Convert the chosen direction into acceptance criteria that can be implemented and reviewed.
50
+ 9. Stop at the spec boundary. If the user asks for design plus build, finish Shape with the spec and recommended handoff to `build`; do not implement code.
51
+
52
+ ## Subagents and reasoning
53
+ Default reasoning: `medium`. Use writer, UI, design, or architecture subagents for bounded alternatives, critique, or parallel concepts. Use `high` for multi-screen flows, accessibility-sensitive experiences, design-system impact, pricing/positioning, architecture boundaries, or major scope decisions. Subagents should produce options or critique, not unrequested implementation.
54
+
55
+ ## Hard rules
56
+ - Shape is not build: do not edit production code or runtime behavior.
57
+ - If the user asks for design and implementation together, Shape stops after the specification and hands off to `build`.
58
+ - Ground claims in research or existing product evidence; call `research` when facts are missing.
59
+ - Always identify user/audience and success criteria for product-facing work.
60
+ - Include acceptance criteria before handing off to implementation.
61
+ - Translate fluffy descriptors into exact behavior; otherwise remove them.
62
+ - Do not dodge the no-op: if the best answer is “do nothing” or “decide later,” say so and state the criteria that support it.
63
+
64
+ ## Failure modes
65
+ - **Abstract advice:** principles without actors, states, rules, tradeoffs, or acceptance criteria.
66
+ - **Pretty but unusable:** visual ideas without behavior, states, or acceptance criteria.
67
+ - **Fluffy spec:** “modern/user-friendly” language without exact behavior.
68
+ - **Spec as proof:** implying a design solves the problem before implementation or validation.
69
+ - **Audience blur:** writing for everyone and satisfying no one.
70
+ - **Scope fog:** hiding hard tradeoffs until build time.
71
+ - **Premature code:** implementing while still deciding what should exist.
72
+
73
+ ## Examples
74
+ Good product shape: “When a workspace has no projects, show an empty state with the title ‘Create your first project,’ a one-sentence explanation, a primary ‘New project’ CTA, and no table chrome. Success: the first-project creation rate increases.”
75
+ Bad product shape: “Make the dashboard more useful and modern.”
76
+
77
+ Good UX/UI shape: “On save, disable the Save button, keep the form’s editable fields visible, and show inline progress text ‘Saving…’; on failure, move focus to the first invalid field.”
78
+ Bad UX/UI shape: “Use a clean, user-friendly save experience.”
79
+
80
+ Good copy shape: “CTA says ‘Start free trial’ because billing is not required; avoid ‘Buy now.’ Error text names the failed action and the recovery step: ‘We couldn’t send the invite. Check the email address and try again.’”
81
+ Bad copy shape: “Use friendly copy that reduces friction.”
82
+
83
+ Good technical shape: “Keep validation in the domain service because the API and the background import both need it; expose one `createInvite` operation that returns accepted, duplicate, or invalid-email outcomes.”
84
+ Bad technical shape: “Add a helper/manager layer so the architecture is scalable.”
85
+
86
+ Worked technical shape: “For export retries, keep the queue worker as the owner of retry state, expose `requestExport(accountId, format)` from the API, persist `pending|running|failed|ready` status in `exports`, and surface failures through the existing job status endpoint. Tradeoff: one extra status read, but retry policy stays out of controllers and can be tested without HTTP.”
87
+
88
+ ## Output format
89
+ ```markdown
90
+ ## Shaped direction
91
+ Goal: ...
92
+ Audience/user: ...
93
+ Mode(s): product | UX/UI | copy | technical | alternatives
94
+
95
+ ### Proposed behavior / experience
96
+ - Actor/trigger/preconditions: ...
97
+ - Rules/results: ...
98
+
99
+ ### UX states and copy
100
+ - Happy: ...
101
+ - Empty: ...
102
+ - Loading/pending: ...
103
+ - Error/failure: ...
104
+ - Edge/constraint: ...
105
+ - Key copy: ...
106
+
107
+ ### Technical shape
108
+ - Boundaries/API/data flow: ...
109
+ - Tradeoffs: ...
110
+
111
+ ### Scope and tradeoffs
112
+ - In: ...
113
+ - Out: ...
114
+ - Tradeoffs: ...
115
+
116
+ ### Alternatives considered
117
+ - Option A: ...
118
+ - Option B/no-op: ...
119
+
120
+ ### Acceptance criteria
121
+ - ...
122
+
123
+ ### Recommended next step
124
+ Module and rationale
125
+ ```