devlyn-cli 1.13.0 → 1.15.0

Files changed (24)
  1. package/CLAUDE.md +28 -149
  2. package/README.md +30 -1
  3. package/config/skills/devlyn:auto-resolve/SKILL.md +167 -453
  4. package/config/skills/devlyn:auto-resolve/evals/evals.json +21 -0
  5. package/config/skills/devlyn:auto-resolve/evals/task-doctor-subcommand.md +42 -0
  6. package/config/skills/devlyn:auto-resolve/references/build-gate.md +36 -22
  7. package/config/skills/devlyn:auto-resolve/references/engine-routing.md +43 -165
  8. package/config/skills/devlyn:auto-resolve/references/findings-schema.md +103 -0
  9. package/config/skills/devlyn:auto-resolve/references/phases/phase-1-build.md +54 -0
  10. package/config/skills/devlyn:auto-resolve/references/phases/phase-2-evaluate.md +45 -0
  11. package/config/skills/devlyn:auto-resolve/references/phases/phase-3-critic.md +84 -0
  12. package/config/skills/devlyn:auto-resolve/references/pipeline-routing.md +114 -0
  13. package/config/skills/devlyn:auto-resolve/references/pipeline-state.md +201 -0
  14. package/config/skills/devlyn:auto-resolve/scripts/archive_run.py +104 -0
  15. package/config/skills/devlyn:auto-resolve/scripts/terminal_verdict.py +96 -0
  16. package/config/skills/devlyn:ideate/SKILL.md +17 -78
  17. package/config/skills/devlyn:ideate/references/codex-critic-template.md +42 -0
  18. package/config/skills/devlyn:ideate/references/templates/item-spec.md +4 -0
  19. package/config/skills/devlyn:preflight/SKILL.md +25 -40
  20. package/config/skills/devlyn:preflight/references/auditors/code-auditor.md +6 -10
  21. package/config/skills/devlyn:reap/SKILL.md +104 -0
  22. package/config/skills/devlyn:reap/scripts/reap.sh +129 -0
  23. package/config/skills/devlyn:reap/scripts/scan.sh +116 -0
  24. package/package.json +5 -1
@@ -31,20 +31,11 @@ Parse these from the user's invocation message:
31
31
 
32
32
  **Engine pre-flight** (runs unless `--engine claude` was explicitly passed):
33
33
  - The default engine is `auto`. If the user did not pass `--engine`, the engine is `auto` — not `claude`.
34
- - Call `mcp__codex-cli__ping` to verify the Codex MCP server is available. If ping fails, warn the user and offer: [1] Continue with `--engine claude`, [2] Abort.
35
- - Read `references/challenge-rubric.md` up front. The engine routing table lives in the auto-resolve skill's `references/engine-routing.md` under "Pipeline Phase Routing (ideate)" — read that on demand when routing decisions are needed.
34
+ - Call `mcp__codex-cli__ping` to verify Codex MCP availability. On failure, **silently fall back to `--engine claude`** and note `engine downgraded: codex-ping failed` in your eventual output summary. Do not present a menu; do not abort. This matches auto-resolve's hands-off contract.
35
+ - Read `references/challenge-rubric.md` up front.
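The silent-fallback rule in the new pre-flight bullets can be sketched as below. This is a minimal illustration, not the skill's implementation: `resolve_engine` and the `ping` callable are hypothetical names standing in for the `mcp__codex-cli__ping` tool call. Only the `auto` and `claude` engine values and the `engine downgraded: codex-ping failed` note come from the text.

```python
from typing import Callable, Optional, Tuple


def resolve_engine(
    requested: Optional[str], ping: Callable[[], bool]
) -> Tuple[str, Optional[str]]:
    """Return (engine, summary_note) following the hands-off fallback rule.

    `ping` is a hypothetical stand-in for calling `mcp__codex-cli__ping`;
    it should return True when the Codex MCP server answers.
    """
    engine = requested or "auto"  # no --engine flag means `auto`, not `claude`
    if engine == "claude":
        return engine, None  # pre-flight is skipped when claude was passed explicitly
    try:
        available = ping()
    except Exception:
        available = False  # assumption: any ping error counts as "unavailable"
    if available:
        return engine, None
    # Silent downgrade: no menu, no abort, only a note for the output summary.
    return "claude", "engine downgraded: codex-ping failed"
```

The note is returned rather than printed so the caller can place it in the eventual summary, as the bullet describes.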
36
36
 
37
37
  **Consolidated flag**: `--with-codex` was rolled into the smarter `--engine auto` default. If the user passes it, inform them once and proceed with `--engine auto`: "Note: `--with-codex` was consolidated into `--engine auto` (default), which routes the CHALLENGE rubric pass to Codex automatically. No flag needed. Continuing with `--engine auto`."
38
38
 
39
- <why_this_matters>
40
- When ideas flow directly from conversation to `/devlyn:auto-resolve`, context degrades at each handoff:
41
- - Abstract vision statements cause over-engineering (the agent optimizes for principles instead of deliverables)
42
- - Full roadmaps create attention noise (49 irrelevant items dilute focus on item #3)
43
- - Done criteria generated from vague prompts miss the user's actual intent
44
-
45
- This skill solves the context engineering problem by producing **self-contained specs** — each carries just enough context for auto-resolve to work autonomously.
46
- </why_this_matters>
47
-
48
39
  ## Output Architecture
49
40
 
50
41
  The skill produces a three-layer progressive disclosure structure:
@@ -105,6 +96,8 @@ Before starting, identify what the user needs:
105
96
  | User shares links/resources to process | **Research-first** | Lead with Explore (research synthesis), then standard flow |
106
97
  | Existing roadmap, user wants to reprioritize | **Replan** | Read existing docs, focus on Converge, update documents |
107
98
 
99
+ **Tie-breaks when a request matches two modes:** choose the narrowest mode that satisfies the request. Quick Add wins over Expand when the user has one concrete item in mind. Research-first wins over Deep-dive when links or resources are the primary input. Deep-dive wins over Expand when one topic specifically needs depth. Replan is chosen only when priority or order changes are explicit. If two modes still look equally plausible after applying these rules, present the top two to the user and let them pick — silently choosing one wastes the session if the other was right.
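The tie-break rules above can be read as a small elimination procedure. The sketch below is one possible reading under stated assumptions: the `Signals` flags and the `break_tie` function are hypothetical, and the alphabetical ordering of an unresolved pair is only there for determinism, since the text does not define how the "top two" are ranked.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Signals:
    # Hypothetical flags an orchestrator might derive from the user's request.
    one_concrete_item: bool = False
    links_are_primary_input: bool = False
    one_topic_needs_depth: bool = False
    explicit_reprioritization: bool = False


def break_tie(candidates: Set[str], s: Signals) -> List[str]:
    """Return one mode, or two modes when the rules cannot decide."""
    modes = set(candidates)
    # Replan is chosen only when priority or order changes are explicit.
    if not s.explicit_reprioritization and len(modes) > 1:
        modes.discard("Replan")
    if {"Quick Add", "Expand"} <= modes and s.one_concrete_item:
        modes.discard("Expand")
    if {"Research-first", "Deep-dive"} <= modes and s.links_are_primary_input:
        modes.discard("Deep-dive")
    if {"Deep-dive", "Expand"} <= modes and s.one_topic_needs_depth:
        modes.discard("Expand")
    # One survivor: announce it. Otherwise present two candidates to the user.
    return sorted(modes)[:2]
```

A result of length two means the orchestrator should ask the user to pick rather than choose silently.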
100
+
108
101
  Announce the detected mode and confirm before proceeding.
109
102
 
110
103
  ### Expand Mode Detail
@@ -158,23 +151,20 @@ To implement:
158
151
 
159
152
  ### Context Archiving
160
153
 
161
- ROADMAP.md is the tactical index. Every row that isn't Planned / In Progress / Blocked is noise — it dilutes attention, pads the file past its 150-line target, and makes future ideation sessions read stale context they'll have to mentally filter out. Done work should move; it shouldn't disappear.
154
+ ROADMAP.md is the tactical index. Done work should move to a collapsed `## Completed` block at the bottom, not clutter the active view. Item spec files stay on disk at `docs/roadmap/phase-N/{id}.md`; only the index row moves.
162
155
 
163
- The goal state: the active section of ROADMAP.md only lists work that still needs doing. Everything completed lives under a collapsed `## Completed` block at the bottom. Item spec files themselves stay in place — they remain on disk at `docs/roadmap/phase-N/{id}.md` because other specs may reference them — only the index row moves.
156
+ #### The Archive Pass (conditional)
164
157
 
165
- #### The Archive Pass
158
+ Run this at the start of Quick Add / Expand / Replan **only when** `docs/ROADMAP.md` contains at least one phase where every row is `Done`. A quick scan tells you within seconds. Skip the pass otherwise — running it on a roadmap with no fully-done phases is no-op bookkeeping that burns the user's turn.
166
159
 
167
- Run this at the start of every Quick Add, Expand, and Replan session (each mode's "On entry" checklist tells you when). It's deterministic and cheap. Never skip it to "save time" — the time you save by skipping it is immediately spent by you and the user arguing about a roadmap that shows phantom work.
160
+ When it runs:
168
161
 
169
- 1. **Read `docs/ROADMAP.md`.** For each phase, look at the Status column of every row.
170
- 2. **For each phase where every row is `Done`:** archive the whole phase.
171
- - Cut the phase's `## Phase N: …` heading and table out of the active section.
172
- - If no `## Completed` section exists yet at the bottom of the file, create one.
173
- - Add a `<details>` block inside Completed for this phase (see format below). Use the latest completion date you can find in the item spec frontmatter (`completed:` field, or today's date if absent). Item count is the row count.
174
- 3. **For individual `Done` rows inside an otherwise-active phase:** leave them in place. A row only moves when its whole phase is finished. (Mixed-state phases stay mixed so the user can see recent wins alongside open work.)
175
- 4. **Scan the Backlog table.** Surface any row whose "Revisit" date has passed — mention it to the user as a replan candidate. Don't auto-promote it; that's a conversation.
176
- 5. **Scan `docs/roadmap/decisions/`.** Flag any decision whose status is `accepted` but whose reasoning is visibly contradicted by the work that's now Done. Don't silently edit decisions; raise them as open questions.
177
- 6. **Report what you did.** Before moving on to the mode's main work, tell the user in one short paragraph: "Archived Phase 1 (3 items). Active roadmap is now Phase 2 (2 items). Proceeding with [Quick Add / Expand / Replan]." Skip the report only if nothing changed.
162
+ 1. Read `docs/ROADMAP.md`.
163
+ 2. For each phase where every row is `Done`: cut the `## Phase N: …` heading and table, move it into a new or existing `## Completed` block at the bottom as a `<details>` entry (see format below). Use the latest completion date found in item spec frontmatter (`completed:`), or today's if absent. Item count is the row count.
164
+ 3. Individual `Done` rows inside an otherwise-active phase stay put; mixed phases show recent wins alongside open work.
165
+ 4. Scan the Backlog table; surface any row whose `Revisit` date has passed as a replan candidate (don't auto-promote; that's a conversation).
166
+ 5. Scan `docs/roadmap/decisions/` for `accepted` decisions whose reasoning is visibly contradicted by newly-Done work; raise them as open questions rather than silently editing.
167
+ 6. One-sentence report of what was archived, then proceed with the mode's main work. Skip the report if nothing changed.
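The "quick scan" that decides whether the Archive Pass runs at all could look like the sketch below. It assumes, without confirmation from the source, that each phase in `docs/ROADMAP.md` is a `## Phase N: ...` heading followed by a pipe table with a `Status` header column; `fully_done_phases` is a hypothetical helper name.

```python
import re
from typing import Dict, List, Optional


def fully_done_phases(roadmap_text: str) -> List[str]:
    """Return headings of phases whose every table row has Status == Done.

    Assumed layout (not confirmed by the source): `## Phase N: ...` heading,
    then a pipe table whose header row contains a `Status` column.
    """
    phases: Dict[str, List[str]] = {}
    current: Optional[str] = None
    status_idx: Optional[int] = None
    for line in roadmap_text.splitlines():
        if line.startswith("## "):
            # Only `## Phase N` headings open a phase; other sections are ignored.
            current = line[3:].strip() if re.match(r"## Phase \d+", line) else None
            status_idx = None
            if current is not None:
                phases[current] = []
            continue
        if current is None or not line.lstrip().startswith("|"):
            continue
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if status_idx is None:
            if "Status" in cells:
                status_idx = cells.index("Status")  # header row found
            continue
        if all(set(c) <= set("-: ") for c in cells):
            continue  # separator row such as |---|---|
        if status_idx < len(cells):
            phases[current].append(cells[status_idx])
    return [
        name
        for name, statuses in phases.items()
        if statuses and all(s == "Done" for s in statuses)
    ]
```

An empty result means the pass is skipped; a non-empty result lists the phases to move into the `## Completed` block.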
178
168
 
179
169
  **Completed block format** (place at the bottom of ROADMAP.md, below Decisions):
180
170
 
@@ -207,7 +197,7 @@ When a decision becomes wrong because the world changed under it:
207
197
  The biggest risk in ideation is premature convergence — jumping to solutions before understanding the problem. This phase prevents that.
208
198
 
209
199
  Establish through conversation:
210
- 1. **Problem statement**: What problem or opportunity? For whom? Why now?
200
+ 1. **Job-to-be-Done**: In one sentence — "When [situation], [user] wants to [motivation], so they can [outcome]." Capture this before anything else. If the user cannot produce it, that is itself the finding — pause and explore the situation until the sentence exists. A bare problem statement without this frame is a state description, not a job, and downstream specs built from it will describe system behavior instead of customer progress.
211
201
  2. **Constraints**: What can't change? (tech stack, timeline, existing commitments)
212
202
  3. **Success criteria**: How will we know this worked? (outcomes, not outputs)
213
203
  4. **Anti-goals**: What are we explicitly NOT trying to do?
@@ -232,6 +222,7 @@ When relevant, actively research before and during brainstorming:
232
222
  - **Technical feasibility**: Can this be built within the constraints? Where are the hard parts?
233
223
  - **Patterns and prior art**: How have similar problems been solved?
234
224
  - **Market/user context**: Who else needs this? What do they currently use?
225
+ - **Evidence discipline**: Treat prior art as source-backed only when verified by a fetched link or documentation the user can open. If a pattern is inferred from memory or analogy, label it `[UNVERIFIED]` inline and do not present it as market fact. The CHALLENGE rubric's NO GUESSWORK axis fires hard on unlabeled claims that look authoritative but are actually recall.
235
226
 
236
227
  Not every ideation needs all of these — a personal side project doesn't need market research. Judge what's relevant and use subagents for parallel research when multiple topics need investigation.
237
228
  </research_protocol>
@@ -317,8 +308,6 @@ Engage maximum thinking effort here — both the solo rubric pass and, if enable
317
308
  Before finalizing the rubric pass, verify your findings against the rubric one more time: every flagged item should have a specific Quote, a failing axis, and a concrete revision — not a vague concern.
318
309
  </thinking_effort>
319
310
 
320
- The user has been burned by plans that look good on the surface but fall apart under scrutiny. Every time they accept a plan and then ask "is this no-workaround, no-guesswork, no-overengineering, world-class best practice, optimized?" the honest answer is almost always no. This phase makes that the *default* behavior — the plan challenges itself before the user has to.
321
-
322
311
  ### The rubric — single source of truth
323
312
 
324
313
  Read `references/challenge-rubric.md` before starting. That file is the only definition of the 5 axes, the finding format, the hard rule about respecting explicit user intent, and the good-vs-bad examples. Both the solo pass and the Codex pass use the same rubric; do not re-derive it inline.
@@ -329,48 +318,13 @@ Apply the rubric to the internal convergence draft. Produce findings in the form
329
318
 
330
319
  For Quick Add with one new item, one solo pass is enough. For a full greenfield or expand plan, run the rubric once, revise, and run it again on the revision. If a third pass would be needed, the plan has structural problems that belong in the user-facing summary as open questions — surface them rather than iterating further.
331
320
 
332
- If the plan came from one model in one pass, it almost always fails at least one axis somewhere. Nodding along to your own draft defeats the entire point of the phase.
333
-
334
321
  ### Codex critic pass (engine-routed)
335
322
 
336
323
  **If `--engine auto`** (default): Codex runs the CHALLENGE rubric pass automatically as critic.
337
324
 
338
325
  Call `mcp__codex-cli__codex` with `model: "gpt-5.4"`, `reasoningEffort: "xhigh"`, `sandbox: "read-only"`, `workingDirectory: <project root>`. The `prompt` parameter is built from the packaged plan + the inlined rubric + the appended Codex instructions. Codex has no filesystem access to this project, so everything it needs travels in the prompt.
339
326
 
340
- **Step 1 — Package the post-solo plan.** Build the prompt with these sections in this order:
341
-
342
- ```
343
- ## Problem framing (from FRAME phase)
344
- [problem statement, constraints, success criteria, anti-goals]
345
-
346
- ## Confirmed facts vs assumptions
347
- Confirmed by user: [list each fact the user explicitly confirmed]
348
- Assumptions (not yet confirmed): [list each assumption the agent made]
349
-
350
- ## Plan (post-solo-CHALLENGE)
351
- Vision: [one sentence]
352
- Phase 1 ([theme]): [items with one-line descriptions and dependencies]
353
- Phase 2 ([theme]): ...
354
- Architecture decisions: [each with what / why / alternatives considered]
355
- Deferred to backlog: [items + reason]
356
-
357
- ## Findings from the solo rubric pass
358
- [list each with: severity, axis, quote, why, fix, whether applied]
359
-
360
- ## Rubric
361
- [INLINE the full text of references/challenge-rubric.md here verbatim — Codex needs the rubric definition in the prompt itself]
362
-
363
- ## Your job
364
- You are applying an independent rubric pass to the PLANNING document above. This is a roadmap, not code — judge the shape of the plan, not implementation details. The user explicitly asked to be challenged because soft-pedaled plans waste their time.
365
-
366
- You are running AFTER a solo pass by Claude. Catch what the solo pass missed; do not just agree with what it already caught. For each existing solo finding, reply either "confirmed" (with one-line agreement) or "I would frame this differently" (with a reason). Then add your own findings that the solo pass missed.
367
-
368
- Use the finding format from the rubric above: Severity / Quote / Axis / Why / Fix. The Quote field is load-bearing — anchor each finding to a specific line from the plan.
369
-
370
- Respect explicit user intent. If the user confirmed something in the "Confirmed facts" section, the rubric does not override it silently. Raise the conflict as a note and let the orchestrator surface it to the user.
371
-
372
- End with a verdict: PASS / PASS WITH MINOR FIXES / FAIL — REVISION REQUIRED, plus a one-line explanation.
373
- ```
327
+ **Step 1 — Package the post-solo plan.** Build the prompt per `references/codex-critic-template.md` (section order, rubric inlining, Codex-specific instructions all live there verbatim — follow the template structure, fill in the plan/findings sections).
374
328
 
375
329
  **Step 2 — Reconcile.** Merge the two finding lists:
376
330
  - Same finding from both → keep the more specific wording, mark "confirmed by both"
@@ -508,21 +462,6 @@ After completing each item:
508
462
 
509
463
  The auto-resolve prompt explicitly tells the build agent to read the spec file — this ensures done-criteria are adopted from the spec rather than generated from scratch, preserving the ideation context through to implementation.
510
464
 
511
- ## Quality Checklist
512
-
513
- Before finalizing, verify:
514
- - [ ] Every roadmap item has a linked spec file
515
- - [ ] Every spec has testable requirements (not vague statements)
516
- - [ ] Every spec has an Out of Scope section
517
- - [ ] Every spec's Context section is 3 sentences or fewer
518
- - [ ] ROADMAP.md is an index only — no inline specifications
519
- - [ ] No spec requires reading VISION.md to be understood (self-contained)
520
- - [ ] Dependencies between items are documented in both specs
521
- - [ ] Architecture decisions include reasoning and alternatives considered
522
- - [ ] CHALLENGE ran against `references/challenge-rubric.md` (solo, plus Codex critic on `--engine auto`); no item still fails any axis at CRITICAL or HIGH severity
523
- - [ ] User saw the post-challenge plan as the first and only confirmation prompt — no pre-challenge draft was shown first
524
- - [ ] Any rubric finding that conflicted with explicit user intent was surfaced as an open question, not silently applied
525
-
526
465
  ## Language
527
466
 
528
467
  Generate all documents in the language the user communicates in. If the user mixes languages, match their primary language for prose and keep technical terms in English.
@@ -0,0 +1,42 @@
1
+ # Codex Critic Prompt Template (Phase 3.5)
2
+
3
+ Used by `devlyn:ideate` when `--engine auto` or `--engine claude` (role reversal). Call `mcp__codex-cli__codex` with `model: "gpt-5.4"`, `reasoningEffort: "xhigh"`, `sandbox: "read-only"`, `workingDirectory: <project root>`. Codex has no filesystem access to this project — everything it needs travels in the prompt.
4
+
5
+ Assemble the prompt with these sections in this exact order, filling in placeholders:
6
+
7
+ ```
8
+ ## Problem framing (from FRAME phase)
9
+ [problem statement, constraints, success criteria, anti-goals]
10
+
11
+ ## Confirmed facts vs assumptions
12
+ Confirmed by user: [list each fact the user explicitly confirmed]
13
+ Assumptions (not yet confirmed): [list each assumption the agent made]
14
+
15
+ ## Plan (post-solo-CHALLENGE)
16
+ Vision: [one sentence]
17
+ Phase 1 ([theme]): [items with one-line descriptions and dependencies]
18
+ Phase 2 ([theme]): ...
19
+ Architecture decisions: [each with what / why / alternatives considered]
20
+ Deferred to backlog: [items + reason]
21
+
22
+ ## Findings from the solo rubric pass
23
+ [list each with: severity, axis, quote, why, fix, whether applied]
24
+
25
+ ## Rubric
26
+ [INLINE the full text of references/challenge-rubric.md here verbatim — Codex needs the rubric definition in the prompt itself]
27
+
28
+ ## Your job
29
+ You are applying an independent rubric pass to the PLANNING document above. This is a roadmap, not code — judge the shape of the plan, not implementation details. The user explicitly asked to be challenged because soft-pedaled plans waste their time.
30
+
31
+ You are running AFTER a solo pass by Claude. Catch what the solo pass missed; do not just agree with what it already caught. For each existing solo finding, reply either "confirmed" (with one-line agreement) or "I would frame this differently" (with a reason). Then add your own findings that the solo pass missed.
32
+
33
+ Use the finding format from the rubric above: Severity / Quote / Axis / Why / Fix. The Quote field is load-bearing — anchor each finding to a specific line from the plan.
34
+
35
+ Respect explicit user intent. If the user confirmed something in the "Confirmed facts" section, the rubric does not override it silently. Raise the conflict as a note and let the orchestrator surface it to the user.
36
+
37
+ End with a verdict: PASS / PASS WITH MINOR FIXES / FAIL — REVISION REQUIRED, plus a one-line explanation.
38
+ ```
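Assembling the prompt in the fixed section order could be done along these lines. This is a sketch under assumptions: `build_critic_prompt` and the dictionary-based input are hypothetical, and only the section headings and their order are taken from the template above.

```python
from typing import Dict, List

# Section headings in the order the template prescribes.
SECTION_ORDER: List[str] = [
    "Problem framing (from FRAME phase)",
    "Confirmed facts vs assumptions",
    "Plan (post-solo-CHALLENGE)",
    "Findings from the solo rubric pass",
    "Rubric",
    "Your job",
]


def build_critic_prompt(sections: Dict[str, str]) -> str:
    """Join the filled-in sections in template order; fail if any is empty."""
    missing = [name for name in SECTION_ORDER if not sections.get(name, "").strip()]
    if missing:
        # Codex cannot read project files, so a missing section cannot be recovered later.
        raise ValueError(f"missing sections: {missing}")
    return "\n\n".join(
        f"## {name}\n{sections[name].strip()}" for name in SECTION_ORDER
    )
```

Failing loudly on an empty section reflects the constraint that everything Codex needs must travel in the prompt.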
39
+
40
+ ## Why a separate file
41
+
42
+ Inlining the rubric and the boilerplate instructions into the orchestrator SKILL.md burned ~30 lines per load of the ideate skill. The critic packaging runs exactly once per session; the template only needs to be read at Phase 3.5 time. On-demand loading matches the progressive-disclosure pattern used across the devlyn harness.
@@ -22,6 +22,10 @@ depends-on: []
22
22
  <!-- Extract only the relevant context from the vision — don't make the implementation agent read the full vision document. -->
23
23
  [Project] does [what]. This feature [enables/improves/fixes] [specific user capability].
24
24
 
25
+ ## Customer Frame
26
+ <!-- One sentence. When [situation], [user] wants to [motivation] so they can [outcome]. -->
27
+ <!-- Use this to resolve ambiguous requirements: prefer the behavior that best serves this user outcome, and do not add capabilities outside this frame. -->
28
+
25
29
  ## Objective
26
30
  <!-- One sentence: what the user can do after this is implemented. -->
27
31
 
@@ -1,18 +1,14 @@
1
1
  ---
2
2
  name: devlyn:preflight
3
3
  description: >
4
- Final alignment check between vision/roadmap documents and the actual codebase, the last step
5
- before declaring a roadmap phase complete. Reads every commitment from VISION.md, ROADMAP.md,
6
- and item specs, then audits the implementation with evidence-based analysis citing file:line
7
- for every finding. Catches missing features, incomplete implementations, spec divergence, bugs,
8
- and documentation drift. Also validates in the browser for web projects and checks documentation
9
- alignment. Use when the user has finished implementing a roadmap and wants to verify nothing was
10
- missed. Triggers on "preflight", "preflight check", "gap analysis", "gap check", "did I miss
11
- anything", "check against the roadmap", "verify implementation", "alignment check", "are we done",
12
- "final check before shipping", or when the user says they've finished implementing and wants
13
- verification. This is different from /devlyn:evaluate (which grades a single changeset) and
14
- /devlyn:review (which reviews code quality) — preflight audits the ENTIRE project against its
15
- planning documents holistically.
4
+ Final alignment check between vision/roadmap documents and the actual codebase before declaring
5
+ a roadmap phase complete. Reads commitments from VISION.md, ROADMAP.md, and item specs, then
6
+ audits the implementation with file:line evidence. Catches missing/incomplete features, spec
7
+ divergence, bugs, and doc drift; validates browser behavior for web projects. Use when
8
+ implementation is finished and you want a holistic roadmap-vs-code verification. Triggers on
9
+ "preflight", "gap analysis", "did I miss anything", "check against the roadmap", "verify
10
+ implementation", "are we done". Differs from /devlyn:evaluate (single changeset) and
11
+ /devlyn:review (code quality); preflight audits the entire project against planning docs.
16
12
  ---
17
13
 
18
14
  # Vision-to-Implementation Preflight Check
@@ -54,7 +50,7 @@ Example with engine: `/devlyn:preflight --engine auto`
54
50
 
55
51
  **Engine pre-flight** (runs unless `--engine claude` was explicitly passed):
56
52
  - The default engine is `auto`. If the user did not pass `--engine`, the engine is `auto` — NOT `claude`.
57
- - Call `mcp__codex-cli__ping` to verify Codex MCP availability. If ping fails, fall back to `--engine claude` with a warning.
53
+ - Call `mcp__codex-cli__ping` to verify Codex MCP availability. On failure, **silently fall back to `--engine claude`** and note `engine downgraded: codex-ping failed` in the final preflight report header. Do not abort. Matches the hands-off contract used by auto-resolve and ideate.
58
54
 
59
55
  ## PHASE 0: DISCOVER & SCOPE
60
56
 
@@ -104,11 +100,12 @@ Read all in-scope planning documents and build a **commitment registry** — eve
104
100
  4. **Filter out** (excluded from audit entirely):
105
101
  - Items in `backlog/` or `deferred.md`
106
102
  - Items with `status: cut` in ROADMAP.md
107
- - Out of Scope entries — these are anti-commitments (things promised NOT to build)
108
103
 
109
- 5. **Separate planned items**: Items with `status: planned` in their spec frontmatter or "Planned" in ROADMAP.md are not expected to be implemented yet. Include them in a `[PLANNED]` section of the registry for visibility, but do **not** audit them or report them as findings. Flagging planned items as MISSING creates noise and buries the real gaps in work that was supposed to be done.
104
+ 5. **Anti-commitments ARE audited** (Out of Scope entries in each spec). These are "must NOT build" claims: if the codebase has shipped something the spec explicitly excluded, that is a WORKAROUND / scope-creep finding, not a success. The code-auditor checks each anti-commitment: "is this excluded behavior present in the code?" If yes, emit a finding with `rule_id: "scope.anti-commitment-violation"` (severity HIGH).
110
105
 
111
- 6. **Write to `.devlyn/commitment-registry.md`**:
106
+ 6. **Separate planned items**: Items with `status: planned` in their spec frontmatter or "Planned" in ROADMAP.md are not expected to be implemented yet. Include them in a `[PLANNED]` section of the registry for visibility, but do **not** audit them as missing. Flagging planned items as MISSING creates noise and buries the real gaps in work that was supposed to be done.
107
+
108
+ 7. **Write to `.devlyn/commitment-registry.md`**:
112
109
 
113
110
  ```markdown
114
111
  # Commitment Registry
@@ -124,15 +121,15 @@ Total commitments: [N]
124
121
  - [INTEGRATION] Auth middleware applied to all /api/* routes
125
122
  - [TEST] Auth flow covered by E2E tests
126
123
 
127
- ## Anti-Commitments (Out of Scope)
128
- - [item 1.1] Does NOT include social login
129
- - [item 1.2] Does NOT include real-time inventory sync
124
+ ## Anti-Commitments (Out of Scope — audited as "must NOT exist in code")
125
+ - [item 1.1] Must NOT include social login
126
+ - [item 1.2] Must NOT include real-time inventory sync
130
127
 
131
- ## Not Started (Planned — excluded from audit)
128
+ ## Not Started (Planned; not audited for presence, but anti-commitments inside them still apply)
132
129
  ### 2.1 [item title] (spec status: planned)
133
130
  - [FEATURE] WebSocket connection on page load
134
131
  - [FEATURE] Real-time task list updates
135
- [These items are tracked for visibility but NOT audited or reported as findings]
132
+ [Planned items are tracked for visibility; code-auditor does not flag as MISSING.]
136
133
  ```
137
134
 
138
135
  ## PHASE 2: AUDIT
@@ -168,36 +165,23 @@ Tests user-facing features in the browser against commitment registry. Writes to
168
165
 
169
166
  ## PHASE 3: SYNTHESIZE & REPORT
170
167
 
171
- After all auditors report:
168
+ Auditors already emit each finding with its category (`MISSING`/`INCOMPLETE`/`DIVERGENT`/`BROKEN`/`UNDOCUMENTED`/`STALE_DOC`/`scope.anti-commitment-violation`) and severity (`CRITICAL`/`HIGH`/`MEDIUM`/`LOW`). Synthesis passes them through — do NOT re-classify or re-severity-label. That would replace domain judgment with orchestrator mechanics.
172
169
 
173
170
  1. **Read all audit files** in parallel:
174
171
  - `.devlyn/audit-code.md`
175
172
  - `.devlyn/audit-docs.md` (if exists)
176
173
  - `.devlyn/audit-browser.md` (if exists)
177
174
 
178
- 2. **Deduplicate**: If multiple auditors flagged the same issue, merge into one finding at the highest severity.
179
-
180
- 3. **Filter accepted divergences**: If `.devlyn/preflight-accepted.md` exists, remove any findings that match accepted entries.
181
-
182
- 4. **Classify each finding** using these categories:
183
-
184
- | Category | Description | Typical source |
185
- |----------|-------------|----------------|
186
- | `MISSING` | In roadmap but not implemented | code-auditor |
187
- | `INCOMPLETE` | Implementation started but unfinished | code-auditor |
188
- | `DIVERGENT` | Implemented differently than spec says | code-auditor |
189
- | `BROKEN` | Implemented but has a bug | code-auditor, browser-auditor |
190
- | `UNDOCUMENTED` | Implemented but not in docs | docs-auditor |
191
- | `STALE_DOC` | Docs don't match current code | docs-auditor |
175
+ 2. **Deduplicate**: if multiple auditors flagged the same issue (same category + file:line), merge into one finding at the highest severity the reporting auditor assigned. Trust the auditor's severity — do not override.
192
176
 
193
- 5. **Assign severity**: CRITICAL (blocks shipping), HIGH (should fix), MEDIUM (fix or accept), LOW (cosmetic)
177
+ 3. **Filter accepted divergences**: if `.devlyn/preflight-accepted.md` exists, remove findings whose (category, commitment) matches an accepted entry.
194
178
 
195
- 6. **Compare with previous run** (if `.devlyn/PREFLIGHT-REPORT.md` existed):
179
+ 4. **Compare with previous run** (if `.devlyn/PREFLIGHT-REPORT.md` existed):
196
180
  - `RESOLVED`: finding from previous run no longer present
197
181
  - `PERSISTS`: finding still present
198
182
  - `NEW`: finding not in previous run
199
183
 
200
- 7. **Generate `.devlyn/PREFLIGHT-REPORT.md`**:
184
+ 5. **Generate `.devlyn/PREFLIGHT-REPORT.md`**:
201
185
 
202
186
  ```markdown
203
187
  # Preflight Report
@@ -212,6 +196,7 @@ Previous run: [timestamp / none]
212
196
  | INCOMPLETE | [N] |
213
197
  | DIVERGENT | [N] |
214
198
  | BROKEN | [N] |
199
+ | SCOPE_VIOLATION | [N] |
215
200
  | UNDOCUMENTED | [N] |
216
201
  | STALE_DOC | [N] |
217
202
  | **Total findings** | **[N]** |
@@ -266,7 +251,7 @@ These items are acknowledged future work per the roadmap. They will be audited w
266
251
  - [list any, or "None"]
267
252
  ```
268
253
 
269
- 8. **Present the report** to the user with a summary.
254
+ 6. **Present the report** to the user with a summary.
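The deduplicate and filter steps of the synthesis phase can be sketched as follows. The `Finding` field names and the `synthesize` function are illustrative assumptions, since the real findings schema is not shown in this diff; the merge key (category plus file:line), the "highest severity wins" rule, and the (category, commitment) match for accepted divergences come from the revised steps.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set, Tuple

SEVERITY_RANK = {"LOW": 0, "MEDIUM": 1, "HIGH": 2, "CRITICAL": 3}


@dataclass(frozen=True)
class Finding:
    # Illustrative fields only; the packaged schema may differ.
    category: str
    location: str  # "file:line"
    commitment: str
    severity: str


def synthesize(
    findings: Iterable[Finding], accepted: Set[Tuple[str, str]]
) -> List[Finding]:
    """Merge duplicates at the highest auditor-assigned severity, then drop accepted ones."""
    merged: Dict[Tuple[str, str], Finding] = {}
    for f in findings:
        key = (f.category, f.location)
        kept = merged.get(key)
        if kept is None or SEVERITY_RANK[f.severity] > SEVERITY_RANK[kept.severity]:
            merged[key] = f  # keep the auditor's own label, never invent a new one
    return [f for f in merged.values() if (f.category, f.commitment) not in accepted]
```

Note that the function only selects among severities the auditors already assigned; it never re-classifies a finding.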
270
255
 
271
256
  ## PHASE 4: TRIAGE & PROMOTE
272
257
 
@@ -8,16 +8,7 @@ You are auditing a codebase against its planning commitments. Your job is to ver
8
8
 
9
9
  Read `.devlyn/commitment-registry.md` for the full list of commitments to verify. Skip any items in the "Not Started (Planned)" section — those are acknowledged future work, not gaps.
10
10
 
11
- **Step 0 — Build health check**: Before auditing individual commitments, verify the project actually builds. Detect the project type(s) and run their build/typecheck commands:
12
- - `package.json` with `next` → `npx tsc --noEmit && npx next build`
13
- - `package.json` with `vite` + `tsconfig.json` → `npx tsc --noEmit`
14
- - `Cargo.toml` → `cargo check --all-targets`
15
- - `go.mod` → `go build ./... && go vet ./...`
16
- - `foundry.toml` → `forge build`
17
- - `hardhat.config.*` → `npx hardhat compile`
18
- - Monorepo (`pnpm-workspace.yaml`/`turbo.json`) → workspace-wide build
19
- - `Dockerfile*` → `docker build` (if Docker available)
20
- - For other project types, look for a `build` script in `package.json` or equivalent
11
+ **Step 0 — Build health check**: Before auditing individual commitments, verify the project actually builds. Run the build gate exactly as defined in `config/skills/devlyn:auto-resolve/references/build-gate.md` (detection matrix, commands, package manager rules, monorepo handling, Docker). That file is the SINGLE source of truth for build commands across devlyn-cli; preflight does not maintain a second matrix.
21
12
 
22
13
  Any build/typecheck failure is a BROKEN finding at CRITICAL severity — code that doesn't compile cannot fulfill any commitment. Include the full compiler error output with file:line references. This catches type errors, missing imports, cross-package drift, and Dockerfile build failures that text-based code reading alone cannot detect.
23
14
 
@@ -33,6 +24,11 @@ Any build/typecheck failure is a BROKEN finding at CRITICAL severity — code th
33
24
  | INCOMPLETE | Implementation started but doesn't fully satisfy | What's there + what's missing, both with file:line |
34
25
  | DIVERGENT | Implementation does something different than specified | Spec requirement vs actual behavior, with file:line |
35
26
  | BROKEN | Implementation exists but has a bug preventing it from working | The bug with file:line |
27
+ | SCOPE_VIOLATION | Code ships behavior an anti-commitment (`Out of Scope`) explicitly excluded | file:line showing the prohibited behavior |
28
+
29
+ **Anti-commitment audit** (new in v3.4): the registry's `## Anti-Commitments` section lists features the spec promised NOT to build. Check each one against the code:
30
+ - If the excluded behavior is present, emit a finding with `rule_id: "scope.anti-commitment-violation"` and severity `HIGH` (or `CRITICAL` if it also violates a constraint). This catches scope creep and shipped workarounds that raw commitment checks would miss.
31
+ - If the excluded behavior is absent, no finding — anti-commitments are satisfied by absence.
36
32
 
37
33
  **Beyond the commitment checklist**, also investigate:
38
34
  - Cross-feature integration gaps: features that should connect but don't
@@ -0,0 +1,104 @@
1
+ ---
2
+ description: Safely count and kill orphaned child processes (PPID=1) left behind by Claude Code MCP plugins, Superset terminal tabs, and codex wrappers. Use this whenever the user says "too many processes", "can't open terminals", "pty/process limit", "hundreds of bun/codex/workerd piling up", "clean up orphans", "reap processes", or reports new terminals failing to spawn on macOS. Also use proactively after long Claude sessions to prevent hitting kern.maxprocperuid or kern.tty.ptmx_max limits. ONLY touches a conservative whitelist of known leaks — never guesses on unknown processes.
3
+ allowed-tools: Read, Bash(ps:*), Bash(lsof:*), Bash(pgrep:*), Bash(awk:*), Bash(id:*), Bash(sysctl:*), Bash(bash:*)
4
+ argument-hint: [scan | kill | kill --force | kill --dry-run | kill --include workerd | kill --only telegram-bun]
5
+ ---
6
+
7
+ <role>
8
+ You are a process-hygiene janitor for macOS. Your job is to find leaked orphan processes (PPID=1, user-owned) that accumulate from buggy tools — MCP plugins that don't reap children on stdin EOF, terminal apps that don't SIGTERM process groups on tab close, codex wrappers that leave `tail -F` behind — and let the user remove them safely.
9
+
10
+ Your operating principle: **the user's trust costs more than one missed cleanup.** If a process doesn't match a verified whitelist entry, leave it alone and report it as UNKNOWN so the user can decide. Never guess.
11
+ </role>
12
+
13
+ <user_input>
14
+ $ARGUMENTS
15
+ </user_input>
16
+
17
+ <process>
18
+
19
+ ## Phase 1: Parse intent
20
+
21
+ Look at `$ARGUMENTS` and classify:
22
+
23
+ | Input | Mode |
24
+ |---|---|
25
+ | empty, `scan`, `status`, `count`, `list`, or anything non-imperative | **SCAN only** (default) |
26
+ | starts with `kill`, `reap`, `clean`, `prune`, `죽여` (Korean: "kill"), `정리` (Korean: "clean up") | **KILL** mode |
27
+
28
+ In KILL mode, also parse:
29
+ - `--force` → SIGKILL instead of SIGTERM
30
+ - `--include workerd` → extend the default whitelist with the workerd-dev category
31
+ - `--only <category>` → restrict to a single category
32
+ - `--dry-run` → list kills but don't send signals
33
+
34
+ If the user's intent is ambiguous (e.g., they say "지워줘" (Korean: "delete it") but didn't specify `--force` or `--include`), **default to SCAN first**, show the result, and then ask whether to proceed with kill. Never escalate to `--force` without an explicit request.
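
The classification table above can be sketched as a pair of small shell functions. This is only an illustration, not the skill's actual parser (the skill leaves parsing to the model). `classify_mode` and `wants_force` are hypothetical names, and the sketch assumes the whole argument string arrives as one parameter and that "starts with" means the first whitespace-delimited word.

```bash
# Hypothetical helper: map an argument string to SCAN or KILL mode.
classify_mode() {
  args="${1:-}"
  # Keep only the first whitespace-delimited word.
  first="${args%%[[:space:]]*}"
  case "$first" in
    kill|reap|clean|prune|죽여|정리) printf 'KILL\n' ;;
    *) printf 'SCAN\n' ;;   # empty, scan, status, or anything non-imperative
  esac
}

# Hypothetical helper: succeeds only when --force appears as its own word,
# so force is never inferred from anything but an explicit flag.
wants_force() {
  case " ${1:-} " in
    *" --force "*) return 0 ;;
    *) return 1 ;;
  esac
}
```

For example, `classify_mode 'kill --only telegram-bun'` prints `KILL`, while an empty string or `status` falls through to `SCAN`.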
35
+
36
+ ## Phase 2: SCAN
37
+
38
+ Always run scan first — even in KILL mode — so the user sees what is about to happen.
39
+
40
+ Run the bundled scanner. The skill is installed at `~/.claude/skills/devlyn:reap/`:
41
+
42
+ ```bash
43
+ bash ~/.claude/skills/devlyn:reap/scripts/scan.sh
44
+ ```
45
+
46
+ Report the output verbatim to the user. Then add your own 2-line summary:
47
+
48
+ - total orphan count across whitelist categories
49
+ - any UNKNOWN_ORPHANS that the user might want to investigate manually
50
+
51
+ Also surface the macOS limits for context, only once per session:
52
+
53
+ ```bash
54
+ sysctl kern.maxprocperuid kern.tty.ptmx_max 2>/dev/null
55
+ ```
56
+
57
+ ## Phase 3: KILL (only when requested)
58
+
59
+ Run the reap script with the parsed flags:
60
+
61
+ ```bash
62
+ bash ~/.claude/skills/devlyn:reap/scripts/reap.sh [flags]
63
+ ```
64
+
65
+ Show the output verbatim. The script re-verifies `PPID==1 && user==current` for every PID right before signaling — a process that was legitimately adopted since the scan will be skipped, not killed.
66
+
67
+ After kill, re-run scan to confirm the counts dropped. If any whitelisted PIDs are still present after SIGTERM and 2 seconds, mention that `--force` (SIGKILL) is available.
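
One way to picture the post-kill check is a liveness probe over the PIDs that were signaled. The sketch below is an assumption, not part of the bundled scripts: `survivors` is a hypothetical name. It relies on `kill -0`, which delivers no signal and only reports whether the process exists and can be signaled. It does not re-verify orphan status, so re-running the scan remains the authoritative check.

```bash
# Hypothetical helper: print each PID from the argument list that is still alive.
survivors() {
  for pid in "$@"; do
    # kill -0 sends nothing; it only tests that the PID exists and is signalable.
    if kill -0 "$pid" 2>/dev/null; then
      printf '%s\n' "$pid"
    fi
  done
}
```

A caller could wait two seconds after SIGTERM, run `survivors` on the signaled PIDs, and mention `--force` only if the output is non-empty.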
68
+
69
+ ## Phase 4: Recommend (only if signals of chronic leak)
70
+
71
+ If `telegram-bun` count > 10 OR oldest whitelisted orphan > 24h, tell the user this is a recurring leak and suggest one of:
72
+
73
+ 1. **Patch the telegram plugin** — add `process.stdin.on('end', () => process.exit(0))` to `server.ts` so the child dies when Claude Code exits.
74
+ 2. **Schedule this skill** — run `/devlyn:reap kill` periodically (e.g., via the `/loop` skill or a launchd agent).
75
+ 3. **Update Superset** — newer versions may SIGTERM process groups on tab close.
76
+
77
+ Do NOT apply these automatically. Recommend and let the user choose.
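
The chronic-leak thresholds above can be checked mechanically from the `etime` column that `ps` reports in `[[dd-]hh:]mm:ss` form. The following is a minimal sketch under that assumption. `etime_to_seconds` and `is_chronic_leak` are hypothetical names and do not appear in the bundled scripts.

```bash
# Convert a ps etime value ([[dd-]hh:]mm:ss) to seconds.
etime_to_seconds() {
  printf '%s\n' "$1" | awk -F'[-:]' '{
    s = $NF + 60 * $(NF - 1)              # seconds and minutes are always present
    if (NF >= 3) s += 3600 * $(NF - 2)    # hours, when present
    if (NF >= 4) s += 86400 * $(NF - 3)   # days, when present
    print s
  }'
}

# Hypothetical check: succeeds when either Phase 4 threshold is exceeded.
#   $1 = telegram-bun count, $2 = etime of the oldest whitelisted orphan
is_chronic_leak() {
  count="$1"
  oldest_etime="$2"
  [ "$count" -gt 10 ] || [ "$(etime_to_seconds "$oldest_etime")" -gt 86400 ]
}
```

With these helpers, `is_chronic_leak 3 2-03:04:05` succeeds because the oldest orphan is more than 24 hours old, even though the count is low.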
78
+
79
+ </process>
80
+
81
+ <safety>
82
+
83
+ ## Never-touch rules
84
+
85
+ - **NEVER kill** a process whose command does not match a whitelist category in `scan.sh`. Unknown = informational only.
86
+ - **NEVER kill** anything where `ps -o ppid=` returns something other than `1` at signal time.
87
+ - **NEVER kill** processes owned by another user (the scripts check `id -un`).
88
+ - **NEVER use** `killall`, `pkill -9`, or wildcard `kill $(pgrep ...)` in this skill. Always iterate PIDs individually with per-PID re-verification.
89
+ - **NEVER suggest** `sudo` escalation — this is a user-scope cleanup tool.
90
+
91
+ ## Whitelist definitions
92
+
93
+ These are the ONLY categories reap.sh will touch:
94
+
95
+ | Category | Match | Why safe |
96
+ |---|---|---|
97
+ | `telegram-bun` | `bun server.ts` **AND** cwd contains `/plugins/cache/claude-plugins-official/telegram/` | The Telegram MCP plugin leaks one process per Claude session. Verified by cwd, not just cmdline. |
98
+ | `superset-codex-bash` | `/bin/bash .*/.superset/bin/codex` with PPID=1 | The `.superset/bin/codex` wrapper exits without killing its `tail` child, leaving orphaned bash copies behind. |
99
+ | `superset-codex-tail` | `tail -F .*superset-codex-session-.*\.jsonl` with PPID=1 | Log tail from the same wrapper; always safe to stop. |
100
+ | `workerd` (opt-in) | `@cloudflare/workerd-darwin-*/bin/workerd serve ` with PPID=1 | moonmaker-engine dev server that survives tab close. Opt-in because the user may have an active dev session. |
101
+
102
+ If the user asks to add a new category, **edit scan.sh and reap.sh together** — both must know the same pattern so scan never promises a cleanup that reap won't deliver.
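
To illustrate how a command line maps onto these categories, here is a minimal sketch that reuses the extended regular expressions from `reap.sh`. `category_for_command` is a hypothetical name. The sketch covers only the command-line half of the match: a `telegram-bun` hit is reported as a candidate because the cwd check still has to pass, and the PPID and owner checks are assumed to happen elsewhere.

```bash
# Hypothetical helper: name the whitelist category a command line falls into.
category_for_command() {
  cmd="${1:-}"
  if printf '%s\n' "$cmd" | grep -Eq '/bun[^ ]* server\.ts( |$)'; then
    # Command line matches, but the cwd must still be verified separately.
    printf 'telegram-bun-candidate\n'
  elif printf '%s\n' "$cmd" | grep -Eq '/bin/bash .*/\.superset/bin/codex( |$)'; then
    printf 'superset-codex-bash\n'
  elif printf '%s\n' "$cmd" | grep -Eq 'tail .*superset-codex-session-.*\.jsonl'; then
    printf 'superset-codex-tail\n'
  elif printf '%s\n' "$cmd" | grep -Eq '@cloudflare/workerd-darwin-[^/]+/bin/workerd serve '; then
    printf 'workerd\n'
  else
    # Anything else is informational only and must never be signaled.
    printf 'UNKNOWN\n'
  fi
}
```

Keeping every pattern in one function like this is one way to avoid drift between the scanner and the reaper, since both would read the same definitions.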
103
+
104
+ </safety>
@@ -0,0 +1,129 @@
1
+ #!/usr/bin/env bash
2
+ # devlyn:reap — kill orphan processes from safe whitelist categories.
3
+ # Verifies PPID==1 and user-ownership AGAIN at kill time to avoid racing a
4
+ # legitimately-reparented process. Unknown orphans are never killed.
5
+ #
6
+ # Usage:
7
+ # reap.sh # default categories, SIGTERM
8
+ # reap.sh --force # SIGKILL instead of SIGTERM
9
+ # reap.sh --include workerd # add workerd-dev to the default set
10
+ # reap.sh --only telegram-bun # restrict to a single category
11
+ # reap.sh --dry-run # print what WOULD be killed, kill nothing
12
+
13
+ set -u
14
+ LC_ALL=C
15
+ export LC_ALL
16
+
17
+ ME="$(id -un)"
18
+ SIGNAL="TERM"
19
+ DRY=0
20
+ INCLUDE=""
21
+ ONLY=""
22
+
23
+ while [ $# -gt 0 ]; do
24
+ case "$1" in
25
+ --force) SIGNAL="KILL" ;;
26
+ --dry-run) DRY=1 ;;
27
+ --include) shift; INCLUDE="${INCLUDE},${1:?--include requires a category}" ;;
28
+ --only) shift; ONLY="${1:?--only requires a category}" ;;
29
+ -h|--help)
30
+ sed -n '2,11p' "$0"; exit 0 ;;
31
+ *)
32
+ printf 'unknown flag: %s\n' "$1" >&2; exit 2 ;;
33
+ esac
34
+ shift
35
+ done
36
+
37
+ DEFAULT_CATEGORIES="telegram-bun,superset-codex-bash,superset-codex-tail"
38
+ if [ -n "$ONLY" ]; then
39
+ CATEGORIES="$ONLY"
40
+ else
41
+ CATEGORIES="${DEFAULT_CATEGORIES}${INCLUDE}"
42
+ fi
43
+
44
+ SNAPSHOT="$(ps -eo pid=,ppid=,user=,etime=,command= 2>/dev/null | awk -v me="$ME" '$2==1 && $3==me')"
45
+
46
+ collect_pids() {
47
+ local category="$1"
48
+ case "$category" in
49
+ telegram-bun)
50
+ # cwd-verified — same logic as scan.sh
51
+ printf '%s\n' "$SNAPSHOT" \
52
+ | grep -E '/bun[^ ]* server\.ts( |$)' \
53
+ | awk '{print $1}' \
54
+ | while read -r pid; do
55
+ cwd="$(lsof -a -d cwd -p "$pid" 2>/dev/null | awk 'NR==2 {for(i=9;i<=NF;i++) printf "%s ", $i; print ""}')"
56
+ case "$cwd" in
57
+ *"/plugins/cache/claude-plugins-official/telegram/"*) printf '%s\n' "$pid" ;;
58
+ esac
59
+ done
60
+ ;;
61
+ superset-codex-bash)
62
+ printf '%s\n' "$SNAPSHOT" | grep -E '/bin/bash .*/\.superset/bin/codex( |$)' | awk '{print $1}' ;;
63
+ superset-codex-tail)
64
+ printf '%s\n' "$SNAPSHOT" | grep -E 'tail .*superset-codex-session-.*\.jsonl' | awk '{print $1}' ;;
65
+ workerd)
66
+ printf '%s\n' "$SNAPSHOT" | grep -E '@cloudflare/workerd-darwin-[^/]+/bin/workerd serve ' | awk '{print $1}' ;;
67
+ *)
68
+ printf 'unknown category: %s\n' "$category" >&2
69
+ return 1 ;;
70
+ esac
71
+ }
72
+
73
+ TOTAL_KILLED=0
74
+ TOTAL_SKIPPED=0
75
+
76
+ # Split the comma-separated category list without letting IFS leak into the
77
+ # inner loop that iterates newline-separated PIDs.
78
+ CATS_ARR=()
79
+ OLD_IFS="$IFS"
80
+ IFS=,
81
+ for c in $CATEGORIES; do
82
+ [ -n "$c" ] && CATS_ARR+=("$c")
83
+ done
84
+ IFS="$OLD_IFS"
85
+
86
+ for cat in "${CATS_ARR[@]}"; do
87
+ pids="$(collect_pids "$cat")" || continue
88
+ if [ -z "$pids" ]; then
89
+ printf '[%s] nothing to kill\n' "$cat"
90
+ continue
91
+ fi
92
+ while IFS= read -r pid; do
93
+ [ -z "$pid" ] && continue
94
+ # Re-verify right before killing. Any of these mean "don't touch":
95
+ # - process already gone
96
+ # - PPID is no longer 1 (got adopted by a real parent — not our target)
97
+ # - owner changed (extremely unlikely but cheap to check)
98
+ live_info="$(ps -o ppid=,user= -p "$pid" 2>/dev/null)"
99
+ if [ -z "$live_info" ]; then
100
+ printf '[%s] %s skipped (already exited)\n' "$cat" "$pid"
101
+ TOTAL_SKIPPED=$((TOTAL_SKIPPED+1))
102
+ continue
103
+ fi
104
+ live_ppid="$(printf '%s' "$live_info" | awk '{print $1}')"
105
+ live_user="$(printf '%s' "$live_info" | awk '{print $2}')"
106
+ if [ "$live_ppid" != "1" ] || [ "$live_user" != "$ME" ]; then
107
+ printf '[%s] %s skipped (ppid=%s user=%s — no longer orphan)\n' "$cat" "$pid" "$live_ppid" "$live_user"
108
+ TOTAL_SKIPPED=$((TOTAL_SKIPPED+1))
109
+ continue
110
+ fi
111
+ if [ "$DRY" -eq 1 ]; then
112
+ printf '[%s] %s would SIG%s\n' "$cat" "$pid" "$SIGNAL"
113
+ else
114
+ if kill -s "$SIGNAL" "$pid" 2>/dev/null; then
115
+ printf '[%s] %s SIG%s sent\n' "$cat" "$pid" "$SIGNAL"
116
+ TOTAL_KILLED=$((TOTAL_KILLED+1))
117
+ else
118
+ printf '[%s] %s kill failed\n' "$cat" "$pid"
119
+ TOTAL_SKIPPED=$((TOTAL_SKIPPED+1))
120
+ fi
121
+ fi
122
+ done <<< "$pids"
123
+ done
124
+
125
+ if [ "$DRY" -eq 1 ]; then
126
+ printf '\ndry-run complete.\n'
127
+ else
128
+ printf '\ndone. killed=%s skipped=%s\n' "$TOTAL_KILLED" "$TOTAL_SKIPPED"
129
+ fi
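
The `telegram-bun` branch above reads the cwd from lsof's column output by field position. As an alternative sketch (not what `reap.sh` does), lsof's `-F n` field output puts the path on its own line prefixed with `n`, which avoids counting columns when the path contains spaces. `cwd_from_lsof_fields` and `is_telegram_plugin_cwd` are hypothetical names.

```bash
# Hypothetical helper: read `lsof -a -d cwd -p PID -Fn` output on stdin
# and print the cwd path (the first line that starts with "n").
cwd_from_lsof_fields() {
  awk '/^n/ { print substr($0, 2); exit }'
}

# Hypothetical helper: succeeds when the cwd lies inside the Telegram plugin cache.
is_telegram_plugin_cwd() {
  case "$1" in
    *"/plugins/cache/claude-plugins-official/telegram/"*) return 0 ;;
    *) return 1 ;;
  esac
}
```

A caller would combine the two as `cwd="$(lsof -a -d cwd -p "$pid" -Fn 2>/dev/null | cwd_from_lsof_fields)"` and then test `is_telegram_plugin_cwd "$cwd"`.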