npm - devlyn-cli - Versions diffs - 1.13.0 → 1.15.0 - Mend

devlyn-cli 1.13.0 → 1.15.0


Files changed (24)

package/config/skills/devlyn:ideate/SKILL.md CHANGED

@@ -31,20 +31,11 @@ Parse these from the user's invocation message:
 **Engine pre-flight** (runs unless `--engine claude` was explicitly passed):
 - The default engine is `auto`. If the user did not pass `--engine`, the engine is `auto` — not `claude`.
-- Call `mcp__codex-cli__ping` to verify the Codex MCP server is available. If ping fails, warn the user and offer: [1] Continue with `--engine claude`, [2] Abort.
-- Read `references/challenge-rubric.md` up front. The engine routing table lives in the auto-resolve skill's `references/engine-routing.md` under "Pipeline Phase Routing (ideate)" — read that on demand when routing decisions are needed.
+- Call `mcp__codex-cli__ping` to verify Codex MCP availability. On failure, **silently fall back to `--engine claude`** and note `engine downgraded: codex-ping failed` in your eventual output summary. Do not present a menu; do not abort. This matches auto-resolve's hands-off contract.
+- Read `references/challenge-rubric.md` up front.
 **Consolidated flag**: `--with-codex` was rolled into the smarter `--engine auto` default. If the user passes it, inform them once and proceed with `--engine auto`: "Note: `--with-codex` was consolidated into `--engine auto` (default), which routes the CHALLENGE rubric pass to Codex automatically. No flag needed. Continuing with `--engine auto`."
-<why_this_matters>
-When ideas flow directly from conversation to `/devlyn:auto-resolve`, context degrades at each handoff:
-- Abstract vision statements cause over-engineering (the agent optimizes for principles instead of deliverables)
-- Full roadmaps create attention noise (49 irrelevant items dilute focus on item #3)
-- Done criteria generated from vague prompts miss the user's actual intent
-This skill solves the context engineering problem by producing **self-contained specs** — each carries just enough context for auto-resolve to work autonomously.
-</why_this_matters>
 ## Output Architecture
 The skill produces a three-layer progressive disclosure structure:
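Reviewer sketch: the new pre-flight rule in this hunk (default to `auto`, ping Codex, downgrade silently on failure) can be illustrated roughly as follows. This is not code from the package. `resolve_engine`, `EngineDecision`, and the `ping` callable are hypothetical names, with `ping` standing in for the `mcp__codex-cli__ping` tool call.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class EngineDecision:
    engine: str                                      # engine that will actually run
    notes: List[str] = field(default_factory=list)   # lines for the output summary


def resolve_engine(requested: Optional[str], ping: Callable[[], bool]) -> EngineDecision:
    """Pick an engine without prompting the user (hypothetical helper)."""
    engine = requested or "auto"          # no flag means `auto`, not `claude`
    if engine == "claude":
        return EngineDecision("claude")   # pre-flight is skipped entirely
    try:
        available = ping()                # stand-in for mcp__codex-cli__ping
    except Exception:
        available = False                 # any ping error counts as a failure
    if available:
        return EngineDecision(engine)
    # Silent downgrade: no menu, no abort, only a note for the summary.
    return EngineDecision("claude", ["engine downgraded: codex-ping failed"])
```

The returned note is meant to be carried into the eventual output summary, which is where the diff says the downgrade should be reported.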
@@ -105,6 +96,8 @@ Before starting, identify what the user needs:
 | User shares links/resources to process | **Research-first** | Lead with Explore (research synthesis), then standard flow |
 | Existing roadmap, user wants to reprioritize | **Replan** | Read existing docs, focus on Converge, update documents |
+**Tie-breaks when a request matches two modes:** choose the narrowest mode that satisfies the request. Quick Add wins over Expand when the user has one concrete item in mind. Research-first wins over Deep-dive when links or resources are the primary input. Deep-dive wins over Expand when one topic specifically needs depth. Replan is chosen only when priority or order changes are explicit. If two modes still look equally plausible after applying these rules, present the top two to the user and let them pick — silently choosing one wastes the session if the other was right.
 Announce the detected mode and confirm before proceeding.
 ### Expand Mode Detail
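Reviewer sketch: the tie-break rules added in this hunk read like a small elimination procedure. A minimal illustration, assuming the caller has already reduced the request to a set of candidate mode names and a few boolean signals. The signal keys (`one_concrete_item`, `links_primary_input`, `one_topic_needs_depth`, `explicit_reorder`) are invented for this example and do not appear in the skill.

```python
from typing import Dict, List, Set

# (winner, loser, signal that must be true for the rule to apply)
TIE_BREAK_RULES = [
    ("Quick Add", "Expand", "one_concrete_item"),
    ("Research-first", "Deep-dive", "links_primary_input"),
    ("Deep-dive", "Expand", "one_topic_needs_depth"),
]


def break_tie(candidates: Set[str], signals: Dict[str, bool]) -> List[str]:
    """Return the surviving modes; more than one means ask the user."""
    remaining = set(candidates)
    # Replan is chosen only when a priority or order change is explicit.
    if len(remaining) > 1 and not signals.get("explicit_reorder", False):
        remaining.discard("Replan")
    for winner, loser, signal in TIE_BREAK_RULES:
        if winner in remaining and loser in remaining and signals.get(signal, False):
            remaining.discard(loser)
    return sorted(remaining)
```

A single surviving mode is the detected mode. If two or more survive, the rule in the diff applies: present the top candidates to the user instead of choosing silently.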
@@ -158,23 +151,20 @@ To implement:
 ### Context Archiving
-ROADMAP.md is the tactical index. Every row that isn't Planned / In Progress / Blocked is noise — it dilutes attention, pads the file past its 150-line target, and makes future ideation sessions read stale context they'll have to mentally filter out. Done work should move; it shouldn't disappear.
+ROADMAP.md is the tactical index. Done work should move to a collapsed `## Completed` block at the bottom, not clutter the active view. Item spec files stay on disk at `docs/roadmap/phase-N/{id}.md` — only the index row moves.
-The goal state: the active section of ROADMAP.md only lists work that still needs doing. Everything completed lives under a collapsed `## Completed` block at the bottom. Item spec files themselves stay in place — they remain on disk at `docs/roadmap/phase-N/{id}.md` because other specs may reference them — only the index row moves.
+#### The Archive Pass (conditional)
-#### The Archive Pass
+Run this at the start of Quick Add / Expand / Replan **only when** `docs/ROADMAP.md` contains at least one phase where every row is `Done`. A quick scan tells you within seconds. Skip the pass otherwise — running it on a roadmap with no fully-done phases is no-op bookkeeping that burns the user's turn.
-Run this at the start of every Quick Add, Expand, and Replan session (each mode's "On entry" checklist tells you when). It's deterministic and cheap. Never skip it to "save time" — the time you save by skipping it is immediately spent by you and the user arguing about a roadmap that shows phantom work.
+When it runs:
-1. **Read `docs/ROADMAP.md`.** For each phase, look at the Status column of every row.
-2. **For each phase where every row is `Done`:** archive the whole phase.
-   - Cut the phase's `## Phase N: …` heading and table out of the active section.
-   - If no `## Completed` section exists yet at the bottom of the file, create one.
-   - Add a `<details>` block inside Completed for this phase (see format below). Use the latest completion date you can find in the item spec frontmatter (`completed:` field, or today's date if absent). Item count is the row count.
-3. **For individual `Done` rows inside an otherwise-active phase:** leave them in place. A row only moves when its whole phase is finished. (Mixed-state phases stay mixed so the user can see recent wins alongside open work.)
-4. **Scan the Backlog table.** Surface any row whose "Revisit" date has passed — mention it to the user as a replan candidate. Don't auto-promote it; that's a conversation.
-5. **Scan `docs/roadmap/decisions/`.** Flag any decision whose status is `accepted` but whose reasoning is visibly contradicted by the work that's now Done. Don't silently edit decisions; raise them as open questions.
-6. **Report what you did.** Before moving on to the mode's main work, tell the user in one short paragraph: "Archived Phase 1 (3 items). Active roadmap is now Phase 2 (2 items). Proceeding with [Quick Add / Expand / Replan]." Skip the report only if nothing changed.
+1. Read `docs/ROADMAP.md`.
+2. For each phase where every row is `Done`: cut the `## Phase N: …` heading and table, move it into a new or existing `## Completed` block at the bottom as a `<details>` entry (see format below). Use the latest completion date found in item spec frontmatter (`completed:`), or today's if absent. Item count is the row count.
+3. Individual `Done` rows inside an otherwise-active phase stay put — mixed phases show recent wins alongside open work.
+4. Scan the Backlog table; surface any row whose `Revisit` date has passed as a replan candidate (don't auto-promote — that's a conversation).
+5. Scan `docs/roadmap/decisions/` for `accepted` decisions whose reasoning is visibly contradicted by newly-Done work; raise them as open questions rather than silently editing.
+6. One-sentence report of what was archived, then proceed with the mode's main work. Skip the report if nothing changed.
 **Completed block format** (place at the bottom of ROADMAP.md, below Decisions):
@@ -207,7 +197,7 @@ When a decision becomes wrong because the world changed under it:
 The biggest risk in ideation is premature convergence — jumping to solutions before understanding the problem. This phase prevents that.
 Establish through conversation:
-1. **Problem statement**: What problem or opportunity? For whom? Why now?
+1. **Job-to-be-Done**: In one sentence — "When [situation], [user] wants to [motivation], so they can [outcome]." Capture this before anything else. If the user cannot produce it, that is itself the finding — pause and explore the situation until the sentence exists. A bare problem statement without this frame is a state description, not a job, and downstream specs built from it will describe system behavior instead of customer progress.
 2. **Constraints**: What can't change? (tech stack, timeline, existing commitments)
 3. **Success criteria**: How will we know this worked? (outcomes, not outputs)
 4. **Anti-goals**: What are we explicitly NOT trying to do?
@@ -232,6 +222,7 @@ When relevant, actively research before and during brainstorming:
 - **Technical feasibility**: Can this be built within the constraints? Where are the hard parts?
 - **Patterns and prior art**: How have similar problems been solved?
 - **Market/user context**: Who else needs this? What do they currently use?
+- **Evidence discipline**: Treat prior art as source-backed only when verified by a fetched link or documentation the user can open. If a pattern is inferred from memory or analogy, label it `[UNVERIFIED]` inline and do not present it as market fact. The CHALLENGE rubric's NO GUESSWORK axis fires hard on unlabeled claims that look authoritative but are actually recall.
 Not every ideation needs all of these — a personal side project doesn't need market research. Judge what's relevant and use subagents for parallel research when multiple topics need investigation.
 </research_protocol>
@@ -317,8 +308,6 @@ Engage maximum thinking effort here — both the solo rubric pass and, if enable
 Before finalizing the rubric pass, verify your findings against the rubric one more time: every flagged item should have a specific Quote, a failing axis, and a concrete revision — not a vague concern.
 </thinking_effort>
-The user has been burned by plans that look good on the surface but fall apart under scrutiny. Every time they accept a plan and then ask "is this no-workaround, no-guesswork, no-overengineering, world-class best practice, optimized?" the honest answer is almost always no. This phase makes that the *default* behavior — the plan challenges itself before the user has to.
 ### The rubric — single source of truth
 Read `references/challenge-rubric.md` before starting. That file is the only definition of the 5 axes, the finding format, the hard rule about respecting explicit user intent, and the good-vs-bad examples. Both the solo pass and the Codex pass use the same rubric; do not re-derive it inline.
@@ -329,48 +318,13 @@ Apply the rubric to the internal convergence draft. Produce findings in the form
 For Quick Add with one new item, one solo pass is enough. For a full greenfield or expand plan, run the rubric once, revise, and run it again on the revision. If a third pass would be needed, the plan has structural problems that belong in the user-facing summary as open questions — surface them rather than iterating further.
-If the plan came from one model in one pass, it almost always fails at least one axis somewhere. Nodding along to your own draft defeats the entire point of the phase.
 ### Codex critic pass (engine-routed)
 **If `--engine auto`** (default): Codex runs the CHALLENGE rubric pass automatically as critic.
 Call `mcp__codex-cli__codex` with `model: "gpt-5.4"`, `reasoningEffort: "xhigh"`, `sandbox: "read-only"`, `workingDirectory: <project root>`. The `prompt` parameter is built from the packaged plan + the inlined rubric + the appended Codex instructions. Codex has no filesystem access to this project, so everything it needs travels in the prompt.
-**Step 1 — Package the post-solo plan.** Build the prompt with these sections in this order:
-```
-## Problem framing (from FRAME phase)
-[problem statement, constraints, success criteria, anti-goals]
-## Confirmed facts vs assumptions
-Confirmed by user: [list each fact the user explicitly confirmed]
-Assumptions (not yet confirmed): [list each assumption the agent made]
-## Plan (post-solo-CHALLENGE)
-Vision: [one sentence]
-Phase 1 ([theme]): [items with one-line descriptions and dependencies]
-Phase 2 ([theme]): ...
-Architecture decisions: [each with what / why / alternatives considered]
-Deferred to backlog: [items + reason]
-## Findings from the solo rubric pass
-[list each with: severity, axis, quote, why, fix, whether applied]
-## Rubric
-[INLINE the full text of references/challenge-rubric.md here verbatim — Codex needs the rubric definition in the prompt itself]
-## Your job
-You are applying an independent rubric pass to the PLANNING document above. This is a roadmap, not code — judge the shape of the plan, not implementation details. The user explicitly asked to be challenged because soft-pedaled plans waste their time.
-You are running AFTER a solo pass by Claude. Catch what the solo pass missed; do not just agree with what it already caught. For each existing solo finding, reply either "confirmed" (with one-line agreement) or "I would frame this differently" (with a reason). Then add your own findings that the solo pass missed.
-Use the finding format from the rubric above: Severity / Quote / Axis / Why / Fix. The Quote field is load-bearing — anchor each finding to a specific line from the plan.
-Respect explicit user intent. If the user confirmed something in the "Confirmed facts" section, the rubric does not override it silently. Raise the conflict as a note and let the orchestrator surface it to the user.
-End with a verdict: PASS / PASS WITH MINOR FIXES / FAIL — REVISION REQUIRED, plus a one-line explanation.
-```
+**Step 1 — Package the post-solo plan.** Build the prompt per `references/codex-critic-template.md` (section order, rubric inlining, Codex-specific instructions all live there verbatim — follow the template structure, fill in the plan/findings sections).
 **Step 2 — Reconcile.** Merge the two finding lists:
 - Same finding from both → keep the more specific wording, mark "confirmed by both"
@@ -508,21 +462,6 @@ After completing each item:
 The auto-resolve prompt explicitly tells the build agent to read the spec file — this ensures done-criteria are adopted from the spec rather than generated from scratch, preserving the ideation context through to implementation.
-## Quality Checklist
-Before finalizing, verify:
-- [ ] Every roadmap item has a linked spec file
-- [ ] Every spec has testable requirements (not vague statements)
-- [ ] Every spec has an Out of Scope section
-- [ ] Every spec's Context section is 3 sentences or fewer
-- [ ] ROADMAP.md is an index only — no inline specifications
-- [ ] No spec requires reading VISION.md to be understood (self-contained)
-- [ ] Dependencies between items are documented in both specs
-- [ ] Architecture decisions include reasoning and alternatives considered
-- [ ] CHALLENGE ran against `references/challenge-rubric.md` (solo, plus Codex critic on `--engine auto`); no item still fails any axis at CRITICAL or HIGH severity
-- [ ] User saw the post-challenge plan as the first and only confirmation prompt — no pre-challenge draft was shown first
-- [ ] Any rubric finding that conflicted with explicit user intent was surfaced as an open question, not silently applied
 ## Language
 Generate all documents in the language the user communicates in. If the user mixes languages, match their primary language for prose and keep technical terms in English.

package/config/skills/devlyn:ideate/references/codex-critic-template.md ADDED

@@ -0,0 +1,42 @@
+# Codex Critic Prompt Template (Phase 3.5)
+Used by `devlyn:ideate` when `--engine auto` or `--engine claude` (role reversal). Call `mcp__codex-cli__codex` with `model: "gpt-5.4"`, `reasoningEffort: "xhigh"`, `sandbox: "read-only"`, `workingDirectory: <project root>`. Codex has no filesystem access to this project — everything it needs travels in the prompt.
+Assemble the prompt with these sections in this exact order, filling in placeholders:
+```
+## Problem framing (from FRAME phase)
+[problem statement, constraints, success criteria, anti-goals]
+## Confirmed facts vs assumptions
+Confirmed by user: [list each fact the user explicitly confirmed]
+Assumptions (not yet confirmed): [list each assumption the agent made]
+## Plan (post-solo-CHALLENGE)
+Vision: [one sentence]
+Phase 1 ([theme]): [items with one-line descriptions and dependencies]
+Phase 2 ([theme]): ...
+Architecture decisions: [each with what / why / alternatives considered]
+Deferred to backlog: [items + reason]
+## Findings from the solo rubric pass
+[list each with: severity, axis, quote, why, fix, whether applied]
+## Rubric
+[INLINE the full text of references/challenge-rubric.md here verbatim — Codex needs the rubric definition in the prompt itself]
+## Your job
+You are applying an independent rubric pass to the PLANNING document above. This is a roadmap, not code — judge the shape of the plan, not implementation details. The user explicitly asked to be challenged because soft-pedaled plans waste their time.
+You are running AFTER a solo pass by Claude. Catch what the solo pass missed; do not just agree with what it already caught. For each existing solo finding, reply either "confirmed" (with one-line agreement) or "I would frame this differently" (with a reason). Then add your own findings that the solo pass missed.
+Use the finding format from the rubric above: Severity / Quote / Axis / Why / Fix. The Quote field is load-bearing — anchor each finding to a specific line from the plan.
+Respect explicit user intent. If the user confirmed something in the "Confirmed facts" section, the rubric does not override it silently. Raise the conflict as a note and let the orchestrator surface it to the user.
+End with a verdict: PASS / PASS WITH MINOR FIXES / FAIL — REVISION REQUIRED, plus a one-line explanation.
+```
+## Why a separate file
+Inlining the rubric and the boilerplate instructions into the orchestrator SKILL.md burned ~30 lines per load of the ideate skill. The critic packaging runs exactly once per session; the template only needs to be read at Phase 3.5 time. On-demand loading matches the progressive-disclosure pattern used across the devlyn harness.
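Reviewer sketch: the template above requires a fixed section order and a fully inlined rubric, because Codex cannot read project files. One way to enforce both is a small assembler such as the following. `build_critic_prompt` and its input dictionary are hypothetical. Only the section titles are taken from the template.

```python
from typing import Dict

# Section titles in the exact order the template prescribes.
SECTION_ORDER = [
    "Problem framing (from FRAME phase)",
    "Confirmed facts vs assumptions",
    "Plan (post-solo-CHALLENGE)",
    "Findings from the solo rubric pass",
    "Rubric",
    "Your job",
]


def build_critic_prompt(sections: Dict[str, str]) -> str:
    """Join the sections in template order; refuse to build a partial prompt."""
    missing = [name for name in SECTION_ORDER
               if not sections.get(name, "").strip()]
    if missing:
        # Codex has no filesystem access, so an empty section cannot be
        # recovered later. Failing early is the safer choice in this sketch.
        raise ValueError(f"missing sections: {missing}")
    return "\n\n".join(
        f"## {name}\n{sections[name].strip()}" for name in SECTION_ORDER
    )
```

The `Rubric` entry would hold the verbatim text of `references/challenge-rubric.md`, and the resulting string would be passed as the `prompt` parameter of the Codex call.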

package/config/skills/devlyn:ideate/references/templates/item-spec.md CHANGED

@@ -22,6 +22,10 @@ depends-on: []
 <!-- Extract only the relevant context from the vision — don't make the implementation agent read the full vision document. -->
 [Project] does [what]. This feature [enables/improves/fixes] [specific user capability].
+## Customer Frame
+<!-- One sentence. When [situation], [user] wants to [motivation] so they can [outcome]. -->
+<!-- Use this to resolve ambiguous requirements: prefer the behavior that best serves this user outcome, and do not add capabilities outside this frame. -->
 ## Objective
 <!-- One sentence: what the user can do after this is implemented. -->

package/config/skills/devlyn:preflight/SKILL.md CHANGED

@@ -1,18 +1,14 @@
 ---
 name: devlyn:preflight
 description: >
-  Final alignment check between vision/roadmap documents and the actual codebase — the last step
-  before declaring a roadmap phase complete. Reads every commitment from VISION.md, ROADMAP.md,
-  and item specs, then audits the implementation with evidence-based analysis citing file:line
-  for every finding. Catches missing features, incomplete implementations, spec divergence, bugs,
-  and documentation drift. Also validates in the browser for web projects and checks documentation
-  alignment. Use when the user has finished implementing a roadmap and wants to verify nothing was
-  missed. Triggers on "preflight", "preflight check", "gap analysis", "gap check", "did I miss
-  anything", "check against the roadmap", "verify implementation", "alignment check", "are we done",
-  "final check before shipping", or when the user says they've finished implementing and wants
-  verification. This is different from /devlyn:evaluate (which grades a single changeset) and
-  /devlyn:review (which reviews code quality) — preflight audits the ENTIRE project against its
-  planning documents holistically.
+  Final alignment check between vision/roadmap documents and the actual codebase before declaring
+  a roadmap phase complete. Reads commitments from VISION.md, ROADMAP.md, and item specs, then
+  audits the implementation with file:line evidence. Catches missing/incomplete features, spec
+  divergence, bugs, and doc drift; validates browser behavior for web projects. Use when
+  implementation is finished and you want a holistic roadmap-vs-code verification. Triggers on
+  "preflight", "gap analysis", "did I miss anything", "check against the roadmap", "verify
+  implementation", "are we done". Differs from /devlyn:evaluate (single changeset) and
+  /devlyn:review (code quality) — preflight audits the entire project against planning docs.
 ---
 # Vision-to-Implementation Preflight Check
@@ -54,7 +50,7 @@ Example with engine: `/devlyn:preflight --engine auto`
 **Engine pre-flight** (runs unless `--engine claude` was explicitly passed):
 - The default engine is `auto`. If the user did not pass `--engine`, the engine is `auto` — NOT `claude`.
-- Call `mcp__codex-cli__ping` to verify Codex MCP availability. If ping fails, fall back to `--engine claude` with a warning.
+- Call `mcp__codex-cli__ping` to verify Codex MCP availability. On failure, **silently fall back to `--engine claude`** and note `engine downgraded: codex-ping failed` in the final preflight report header. Do not abort. Matches the hands-off contract used by auto-resolve and ideate.
 ## PHASE 0: DISCOVER & SCOPE
@@ -104,11 +100,12 @@ Read all in-scope planning documents and build a **commitment registry** — eve
 4. **Filter out** (excluded from audit entirely):
    - Items in `backlog/` or `deferred.md`
    - Items with `status: cut` in ROADMAP.md
-   - Out of Scope entries — these are anti-commitments (things promised NOT to build)
-5. **Separate planned items**: Items with `status: planned` in their spec frontmatter or "Planned" in ROADMAP.md are not expected to be implemented yet. Include them in a `[PLANNED]` section of the registry for visibility, but do **not** audit them or report them as findings. Flagging planned items as MISSING creates noise and buries the real gaps in work that was supposed to be done.
+5. **Anti-commitments ARE audited** (Out of Scope entries in each spec). These are "must NOT build" claims — if the codebase has shipped something the spec explicitly excluded, that is a WORKAROUND / scope-creep finding, not a success. The code-auditor checks each anti-commitment: "is this excluded behavior present in the code?" If yes → emit a finding with `rule_id: "scope.anti-commitment-violation"` (severity HIGH).
-6. **Write to `.devlyn/commitment-registry.md`**:
+6. **Separate planned items**: Items with `status: planned` in their spec frontmatter or "Planned" in ROADMAP.md are not expected to be implemented yet. Include them in a `[PLANNED]` section of the registry for visibility, but do **not** audit them as missing. Flagging planned items as MISSING creates noise and buries the real gaps in work that was supposed to be done.
+7. **Write to `.devlyn/commitment-registry.md`**:
 ```markdown
 # Commitment Registry
@@ -124,15 +121,15 @@ Total commitments: [N]
 - [INTEGRATION] Auth middleware applied to all /api/* routes
 - [TEST] Auth flow covered by E2E tests
-## Anti-Commitments (Out of Scope)
-- [item 1.1] Does NOT include social login
-- [item 1.2] Does NOT include real-time inventory sync
+## Anti-Commitments (Out of Scope — audited as "must NOT exist in code")
+- [item 1.1] Must NOT include social login
+- [item 1.2] Must NOT include real-time inventory sync
-## Not Started (Planned — excluded from audit)
+## Not Started (Planned — not audited for presence, but still anti-commitments inside them apply)
 ### 2.1 [item title] (spec status: planned)
 - [FEATURE] WebSocket connection on page load
 - [FEATURE] Real-time task list updates
-[These items are tracked for visibility but NOT audited or reported as findings]
+[Planned items are tracked for visibility; code-auditor does not flag as MISSING.]
 ```
 ## PHASE 2: AUDIT
@@ -168,36 +165,23 @@ Tests user-facing features in the browser against commitment registry. Writes to
 ## PHASE 3: SYNTHESIZE & REPORT
-After all auditors report:
+Auditors already emit each finding with its category (`MISSING`/`INCOMPLETE`/`DIVERGENT`/`BROKEN`/`UNDOCUMENTED`/`STALE_DOC`/`scope.anti-commitment-violation`) and severity (`CRITICAL`/`HIGH`/`MEDIUM`/`LOW`). Synthesis passes them through — do NOT re-classify or re-severity-label. That would replace domain judgment with orchestrator mechanics.
 1. **Read all audit files** in parallel:
    - `.devlyn/audit-code.md`
    - `.devlyn/audit-docs.md` (if exists)
    - `.devlyn/audit-browser.md` (if exists)
-2. **Deduplicate**: If multiple auditors flagged the same issue, merge into one finding at the highest severity.
-3. **Filter accepted divergences**: If `.devlyn/preflight-accepted.md` exists, remove any findings that match accepted entries.
-4. **Classify each finding** using these categories:
-| Category | Description | Typical source |
-|----------|-------------|----------------|
-| `MISSING` | In roadmap but not implemented | code-auditor |
-| `INCOMPLETE` | Implementation started but unfinished | code-auditor |
-| `DIVERGENT` | Implemented differently than spec says | code-auditor |
-| `BROKEN` | Implemented but has a bug | code-auditor, browser-auditor |
-| `UNDOCUMENTED` | Implemented but not in docs | docs-auditor |
-| `STALE_DOC` | Docs don't match current code | docs-auditor |
+2. **Deduplicate**: if multiple auditors flagged the same issue (same category + file:line), merge into one finding at the highest severity the reporting auditor assigned. Trust the auditor's severity — do not override.
-5. **Assign severity**: CRITICAL (blocks shipping), HIGH (should fix), MEDIUM (fix or accept), LOW (cosmetic)
+3. **Filter accepted divergences**: if `.devlyn/preflight-accepted.md` exists, remove findings whose (category, commitment) matches an accepted entry.
-6. **Compare with previous run** (if `.devlyn/PREFLIGHT-REPORT.md` existed):
+4. **Compare with previous run** (if `.devlyn/PREFLIGHT-REPORT.md` existed):
    - `RESOLVED`: finding from previous run no longer present
    - `PERSISTS`: finding still present
    - `NEW`: finding not in previous run
-7. **Generate `.devlyn/PREFLIGHT-REPORT.md`**:
+5. **Generate `.devlyn/PREFLIGHT-REPORT.md`**:
 ```markdown
 # Preflight Report
@@ -212,6 +196,7 @@ Previous run: [timestamp / none]
 | INCOMPLETE | [N] |
 | DIVERGENT | [N] |
 | BROKEN | [N] |
+| SCOPE_VIOLATION | [N] |
 | UNDOCUMENTED | [N] |
 | STALE_DOC | [N] |
 | **Total findings** | **[N]** |
@@ -266,7 +251,7 @@ These items are acknowledged future work per the roadmap. They will be audited w
 - [list any, or "None"]
 ```
-8. **Present the report** to the user with a summary.
+6. **Present the report** to the user with a summary.
 ## PHASE 4: TRIAGE & PROMOTE
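Reviewer sketch: the revised synthesis steps (deduplicate on category plus file:line, keep the highest auditor-assigned severity, drop accepted entries, then compare with the previous run) amount to a few set and dictionary operations. The `Finding` shape and the function names below are assumptions made for illustration. The audit files themselves are Markdown and their exact fields are not shown in this diff.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set, Tuple

SEVERITY_RANK = {"LOW": 0, "MEDIUM": 1, "HIGH": 2, "CRITICAL": 3}


@dataclass(frozen=True)
class Finding:
    category: str     # e.g. MISSING, BROKEN, STALE_DOC
    location: str     # file:line cited by the auditor
    commitment: str   # registry entry the finding refers to
    severity: str     # assigned by the auditor, never re-labelled here


def deduplicate(findings: Iterable[Finding]) -> List[Finding]:
    """Merge findings with the same (category, file:line), keeping the
    one with the highest severity an auditor assigned."""
    best: Dict[Tuple[str, str], Finding] = {}
    for f in findings:
        key = (f.category, f.location)
        kept = best.get(key)
        if kept is None or SEVERITY_RANK[f.severity] > SEVERITY_RANK[kept.severity]:
            best[key] = f
    return list(best.values())


def filter_accepted(findings: Iterable[Finding],
                    accepted: Set[Tuple[str, str]]) -> List[Finding]:
    """Drop findings whose (category, commitment) was already accepted."""
    return [f for f in findings if (f.category, f.commitment) not in accepted]


def compare_runs(previous: Set[Tuple[str, str]],
                 current: Set[Tuple[str, str]]) -> Dict[str, Set[Tuple[str, str]]]:
    """Label finding keys as RESOLVED, PERSISTS, or NEW across two runs."""
    return {
        "RESOLVED": previous - current,
        "PERSISTS": previous & current,
        "NEW": current - previous,
    }
```

Nothing in this sketch assigns or changes a category or severity. It only selects among values the auditors already produced, in line with the pass-through rule stated at the top of Phase 3. How findings are keyed across runs in `compare_runs` is left to the caller, since the diff does not specify it.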

package/config/skills/devlyn:preflight/references/auditors/code-auditor.md CHANGED

@@ -8,16 +8,7 @@ You are auditing a codebase against its planning commitments. Your job is to ver
 Read `.devlyn/commitment-registry.md` for the full list of commitments to verify. Skip any items in the "Not Started (Planned)" section — those are acknowledged future work, not gaps.
-**Step 0 — Build health check**: Before auditing individual commitments, verify the project actually builds. Detect the project type(s) and run their build/typecheck commands:
-- `package.json` with `next` → `npx tsc --noEmit && npx next build`
-- `package.json` with `vite` + `tsconfig.json` → `npx tsc --noEmit`
-- `Cargo.toml` → `cargo check --all-targets`
-- `go.mod` → `go build ./... && go vet ./...`
-- `foundry.toml` → `forge build`
-- `hardhat.config.*` → `npx hardhat compile`
-- Monorepo (`pnpm-workspace.yaml`/`turbo.json`) → workspace-wide build
-- `Dockerfile*` → `docker build` (if Docker available)
-- For other project types, look for a `build` script in `package.json` or equivalent
+**Step 0 — Build health check**: Before auditing individual commitments, verify the project actually builds. Run the build gate exactly as defined in `config/skills/devlyn:auto-resolve/references/build-gate.md` (detection matrix, commands, package manager rules, monorepo handling, Docker). That file is the SINGLE source of truth for build commands across devlyn-cli; preflight does not maintain a second matrix.
 Any build/typecheck failure is a BROKEN finding at CRITICAL severity — code that doesn't compile cannot fulfill any commitment. Include the full compiler error output with file:line references. This catches type errors, missing imports, cross-package drift, and Dockerfile build failures that text-based code reading alone cannot detect.
@@ -33,6 +24,11 @@ Any build/typecheck failure is a BROKEN finding at CRITICAL severity — code th
 | INCOMPLETE | Implementation started but doesn't fully satisfy | What's there + what's missing, both with file:line |
 | DIVERGENT | Implementation does something different than specified | Spec requirement vs actual behavior, with file:line |
 | BROKEN | Implementation exists but has a bug preventing it from working | The bug with file:line |
+| SCOPE_VIOLATION | Code ships behavior an anti-commitment (`Out of Scope`) explicitly excluded | file:line showing the prohibited behavior |
+**Anti-commitment audit** (new in v3.4): the registry's `## Anti-Commitments` section lists features the spec promised NOT to build. Check each one against the code:
+- If the excluded behavior is present, emit a finding with `rule_id: "scope.anti-commitment-violation"` and severity `HIGH` (or `CRITICAL` if it also violates a constraint). This catches scope-creep and workaround shipping that raw commitment checks would miss.
+- If the excluded behavior is absent, no finding — anti-commitments are satisfied by absence.
 **Beyond the commitment checklist**, also investigate:
 - Cross-feature integration gaps: features that should connect but don't

package/config/skills/devlyn:reap/SKILL.md ADDED

@@ -0,0 +1,104 @@
+---
+description: Safely count and kill orphaned child processes (PPID=1) left behind by Claude Code MCP plugins, Superset terminal tabs, and codex wrappers. Use this whenever the user says "too many processes", "can't open terminals", "pty/process limit", "hundreds of bun/codex/workerd piling up", "clean up orphans", "reap processes", or reports new terminals failing to spawn on macOS. Also use proactively after long Claude sessions to prevent hitting kern.maxprocperuid or kern.tty.ptmx_max limits. ONLY touches a conservative whitelist of known leaks — never guesses on unknown processes.
+allowed-tools: Read, Bash(ps:*), Bash(lsof:*), Bash(pgrep:*), Bash(awk:*), Bash(id:*), Bash(sysctl:*), Bash(bash:*)
+argument-hint: [scan | kill | kill --force | kill --include workerd | kill --only telegram-bun]
+---
+<role>
+You are a process-hygiene janitor for macOS. Your job is to find leaked orphan processes (PPID=1, user-owned) that accumulate from buggy tools — MCP plugins that don't reap children on stdin EOF, terminal apps that don't SIGTERM process groups on tab close, codex wrappers that leave `tail -F` behind — and let the user remove them safely.
+Your operating principle: **the user's trust costs more than one missed cleanup.** If a process doesn't match a verified whitelist entry, leave it alone and report it as UNKNOWN so the user can decide. Never guess.
+</role>
+<user_input>
+$ARGUMENTS
+</user_input>
+<process>
+## Phase 1: Parse intent
+Look at `$ARGUMENTS` and classify:
+| Input | Mode |
+|---|---|
+| empty, `scan`, `status`, `count`, `list`, or anything non-imperative | **SCAN only** (default) |
+| starts with `kill`, `reap`, `clean`, `prune`, `죽여`, `정리` | **KILL** mode |
+In KILL mode, also parse:
+- `--force` → SIGKILL instead of SIGTERM
+- `--include workerd` → extend the default whitelist with the workerd-dev category
+- `--only <category>` → restrict to a single category
+- `--dry-run` → list kills but don't send signals
+If the user's intent is ambiguous (e.g., they say "지워줘" (Korean: "delete it") but didn't specify `--force` or `--include`), **default to SCAN first**, show the result, and then ask whether to proceed with kill. Never escalate to `--force` without an explicit request.
+## Phase 2: SCAN
+Always run scan first — even in KILL mode — so the user sees what is about to happen.
+Run the bundled scanner. The skill is installed at `~/.claude/skills/devlyn:reap/`:
+```bash
+bash ~/.claude/skills/devlyn:reap/scripts/scan.sh
+```
+Report the output verbatim to the user. Then add your own 2-line summary:
+- total orphan count across whitelist categories
+- any UNKNOWN_ORPHANS that the user might want to investigate manually
+Also surface the macOS limits for context, only once per session:
+```bash
+sysctl kern.maxprocperuid kern.tty.ptmx_max 2>/dev/null
+```
+## Phase 3: KILL (only when requested)
+Run the reap script with the parsed flags:
+```bash
+bash ~/.claude/skills/devlyn:reap/scripts/reap.sh [flags]
+```
+Show the output verbatim. The script re-verifies `PPID==1 && user==current` for every PID right before signaling — a process that was legitimately adopted since the scan will be skipped, not killed.
+After kill, re-run scan to confirm the counts dropped. If any whitelisted PIDs are still present after SIGTERM and 2 seconds, mention that `--force` (SIGKILL) is available.
+## Phase 4: Recommend (only if there are signs of a chronic leak)
+If `telegram-bun` count > 10 OR oldest whitelisted orphan > 24h, tell the user this is a recurring leak and suggest one of:
+1. **Patch the telegram plugin** — add `process.stdin.on('end', () => process.exit(0))` to `server.ts` so the child dies when Claude Code exits.
+2. **Schedule this skill** — run `/devlyn:reap kill` periodically (e.g., via the `/loop` skill or a launchd agent).
+3. **Update Superset** — newer versions may SIGTERM process groups on tab close.
+Do NOT apply these automatically. Recommend and let the user choose.
+</process>
+<safety>
+## Never-touch rules
+- **NEVER kill** a process whose command does not match a whitelist category in `scan.sh`. Unknown = informational only.
+- **NEVER kill** anything where `ps -o ppid=` returns something other than `1` at signal time.
+- **NEVER kill** processes owned by another user (the scripts check `id -un`).
+- **NEVER use** `killall`, `pkill -9`, or wildcard `kill $(pgrep ...)` in this skill. Always iterate PIDs individually with per-PID re-verification.
+- **NEVER suggest** `sudo` escalation — this is a user-scope cleanup tool.
+## Whitelist definitions
+These are the ONLY categories reap.sh will touch:
+| Category | Match | Why safe |
+|---|---|---|
+| `telegram-bun` | `bun server.ts` **AND** cwd contains `/plugins/cache/claude-plugins-official/telegram/` | Telegram MCP plugin leaks one per Claude session. Verified by cwd, not just cmdline. |
+| `superset-codex-bash` | `/bin/bash .*/.superset/bin/codex` with PPID=1 | `.superset/bin/codex` wrapper exits without killing its tail child; bash copies left behind. |
+| `superset-codex-tail` | `tail .*superset-codex-session-.*\.jsonl` with PPID=1 | Log tail from the same wrapper, always safe to stop. |
+| `workerd` (opt-in) | `@cloudflare/workerd-darwin-*/bin/workerd serve ` with PPID=1 | moonmaker-engine dev server that survives tab close. Opt-in because the user may have an active dev session. |
+If the user asks to add a new category, **edit scan.sh and reap.sh together** — both must know the same pattern so scan never promises a cleanup that reap won't deliver.
+</safety>
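
Phase 4 of the SKILL.md keys on an orphan being older than 24h, and reap.sh asks `ps` for `etime=`, which `ps` prints as `[[dd-]hh:]mm:ss`. The following is a minimal sketch of how that threshold could be checked. `etime_to_seconds` and `older_than_24h` are hypothetical helper names used for illustration only; nothing in this diff shows how the shipped scripts compute orphan age.

```bash
#!/usr/bin/env bash
# Hypothetical helpers, not part of the package.

# Convert a ps(1) etime value of the form [[dd-]hh:]mm:ss into seconds.
# Assumes the input is well-formed (digits only between the separators).
etime_to_seconds() {
  local etime="$1" days=0 hours=0 mins=0 secs=0
  # Split off the optional "dd-" day prefix.
  case "$etime" in
    *-*) days="${etime%%-*}"; etime="${etime#*-}" ;;
  esac
  # Split the remainder on ":" into positional parameters.
  local old_ifs="$IFS"
  IFS=:
  set -- $etime
  IFS="$old_ifs"
  case $# in
    3) hours="$1"; mins="$2"; secs="$3" ;;
    2) mins="$1"; secs="$2" ;;
    *) return 1 ;;
  esac
  # 10# forces base 10 so values like "08" are not read as octal.
  printf '%s\n' $(( 10#$days * 86400 + 10#$hours * 3600 + 10#$mins * 60 + 10#$secs ))
}

# Succeed when the given etime value is strictly more than 24 hours.
older_than_24h() {
  local secs
  secs="$(etime_to_seconds "$1")" || return 1
  [ "$secs" -gt 86400 ]
}
```

For example, `etime_to_seconds 2-03:04:05` prints `183845`, and `older_than_24h 23:59:59` fails because that orphan is one second short of a day.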

package/config/skills/devlyn:reap/scripts/reap.sh ADDED

@@ -0,0 +1,129 @@
+#!/usr/bin/env bash
+# devlyn:reap — kill orphan processes from safe whitelist categories.
+# Verifies PPID==1 and user-ownership AGAIN at kill time to avoid racing a
+# legitimately-reparented process. Unknown orphans are never killed.
+#
+# Usage:
+#   reap.sh                       # default categories, SIGTERM
+#   reap.sh --force               # SIGKILL instead of SIGTERM
+#   reap.sh --include workerd     # add workerd-dev to the default set
+#   reap.sh --only telegram-bun   # restrict to a single category
+#   reap.sh --dry-run             # print what WOULD be killed, kill nothing
+set -u
+LC_ALL=C
+export LC_ALL
+ME="$(id -un)"
+SIGNAL="TERM"
+DRY=0
+INCLUDE=""
+ONLY=""
+while [ $# -gt 0 ]; do
+  case "$1" in
+    --force)     SIGNAL="KILL" ;;
+    --dry-run)   DRY=1 ;;
+    --include)   shift; INCLUDE="${INCLUDE},${1:?--include requires a category}" ;;
+    --only)      shift; ONLY="${1:?--only requires a category}" ;;
+    -h|--help)
+      sed -n '2,14p' "$0"; exit 0 ;;
+    *)
+      printf 'unknown flag: %s\n' "$1" >&2; exit 2 ;;
+  esac
+  shift
+done
+DEFAULT_CATEGORIES="telegram-bun,superset-codex-bash,superset-codex-tail"
+if [ -n "$ONLY" ]; then
+  CATEGORIES="$ONLY"
+else
+  CATEGORIES="${DEFAULT_CATEGORIES}${INCLUDE}"
+fi
+SNAPSHOT="$(ps -eo pid=,ppid=,user=,etime=,command= 2>/dev/null | awk -v me="$ME" '$2==1 && $3==me')"
+collect_pids() {
+  local category="$1"
+  case "$category" in
+    telegram-bun)
+      # cwd-verified — same logic as scan.sh
+      printf '%s\n' "$SNAPSHOT" \
+        | grep -E '/bun[^ ]* server\.ts( |$)' \
+        | awk '{print $1}' \
+        | while read -r pid; do
+            cwd="$(lsof -a -d cwd -p "$pid" 2>/dev/null | awk 'NR==2 {for(i=9;i<=NF;i++) printf "%s ", $i; print ""}')"
+            case "$cwd" in
+              *"/plugins/cache/claude-plugins-official/telegram/"*) printf '%s\n' "$pid" ;;
+            esac
+          done
+      ;;
+    superset-codex-bash)
+      printf '%s\n' "$SNAPSHOT" | grep -E '/bin/bash .*/\.superset/bin/codex( |$)' | awk '{print $1}' ;;
+    superset-codex-tail)
+      printf '%s\n' "$SNAPSHOT" | grep -E 'tail .*superset-codex-session-.*\.jsonl' | awk '{print $1}' ;;
+    workerd)
+      printf '%s\n' "$SNAPSHOT" | grep -E '@cloudflare/workerd-darwin-[^/]+/bin/workerd serve ' | awk '{print $1}' ;;
+    *)
+      printf 'unknown category: %s\n' "$category" >&2
+      return 1 ;;
+  esac
+}
+TOTAL_KILLED=0
+TOTAL_SKIPPED=0
+# Split the comma-separated category list without letting IFS leak into the
+# inner loop that iterates newline-separated PIDs.
+CATS_ARR=()
+OLD_IFS="$IFS"
+IFS=,
+for c in $CATEGORIES; do
+  [ -n "$c" ] && CATS_ARR+=("$c")
+done
+IFS="$OLD_IFS"
+for cat in "${CATS_ARR[@]}"; do
+  pids="$(collect_pids "$cat")" || continue
+  if [ -z "$pids" ]; then
+    printf '[%s] nothing to kill\n' "$cat"
+    continue
+  fi
+  while IFS= read -r pid; do
+    [ -z "$pid" ] && continue
+    # Re-verify right before killing. Any of these mean "don't touch":
+    #   - process already gone
+    #   - PPID is no longer 1 (got adopted by a real parent — not our target)
+    #   - owner changed (extremely unlikely but cheap to check)
+    live_info="$(ps -o ppid=,user= -p "$pid" 2>/dev/null)"
+    if [ -z "$live_info" ]; then
+      printf '[%s] %s  skipped (already exited)\n' "$cat" "$pid"
+      TOTAL_SKIPPED=$((TOTAL_SKIPPED+1))
+      continue
+    fi
+    live_ppid="$(printf '%s' "$live_info" | awk '{print $1}')"
+    live_user="$(printf '%s' "$live_info" | awk '{print $2}')"
+    if [ "$live_ppid" != "1" ] || [ "$live_user" != "$ME" ]; then
+      printf '[%s] %s  skipped (ppid=%s user=%s — no longer orphan)\n' "$cat" "$pid" "$live_ppid" "$live_user"
+      TOTAL_SKIPPED=$((TOTAL_SKIPPED+1))
+      continue
+    fi
+    if [ "$DRY" -eq 1 ]; then
+      printf '[%s] %s  would SIG%s\n' "$cat" "$pid" "$SIGNAL"
+    else
+      if kill -s "$SIGNAL" "$pid" 2>/dev/null; then
+        printf '[%s] %s  SIG%s sent\n' "$cat" "$pid" "$SIGNAL"
+        TOTAL_KILLED=$((TOTAL_KILLED+1))
+      else
+        printf '[%s] %s  kill failed\n' "$cat" "$pid"
+        TOTAL_SKIPPED=$((TOTAL_SKIPPED+1))
+      fi
+    fi
+  done <<< "$pids"
+done
+if [ "$DRY" -eq 1 ]; then
+  printf '\ndry-run complete.\n'
+else
+  printf '\ndone. killed=%s skipped=%s\n' "$TOTAL_KILLED" "$TOTAL_SKIPPED"
+fi
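
The whitelist in reap.sh is a set of `grep -E` patterns applied to the `ps` command column. To see how a given command line would be classified, the same patterns can be exercised in isolation. This is a sketch only: `classify_command` is a hypothetical helper that does not exist in the package, and it reports `telegram-bun-candidate` rather than `telegram-bun` because the real check also requires the `lsof` cwd verification, which is omitted here.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not part of the package: print the whitelist category
# whose reap.sh pattern matches the given command line, or UNKNOWN.
classify_command() {
  local cmd="$1"
  if printf '%s\n' "$cmd" | grep -Eq '/bin/bash .*/\.superset/bin/codex( |$)'; then
    printf 'superset-codex-bash\n'
  elif printf '%s\n' "$cmd" | grep -Eq 'tail .*superset-codex-session-.*\.jsonl'; then
    printf 'superset-codex-tail\n'
  elif printf '%s\n' "$cmd" | grep -Eq '@cloudflare/workerd-darwin-[^/]+/bin/workerd serve '; then
    printf 'workerd\n'
  elif printf '%s\n' "$cmd" | grep -Eq '/bun[^ ]* server\.ts( |$)'; then
    # Command line matches; reap.sh additionally verifies the process cwd.
    printf 'telegram-bun-candidate\n'
  else
    printf 'UNKNOWN\n'
  fi
}
```

For example, `classify_command '/bin/bash /Users/me/.superset/bin/codex'` prints `superset-codex-bash`, while a bare `bun server.ts` with no path prefix prints `UNKNOWN`, because the pattern requires a `/` immediately before `bun`. The sample paths are made up for illustration.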