npm - devlyn-cli - Versions diffs - 1.15.0 → 2.1.0 - Mend

devlyn-cli 1.15.0 → 2.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (158)

package/config/skills/devlyn:resolve/SKILL.md CHANGED

@@ -1,187 +1,175 @@
-<role>
-You are a Senior Debugging Engineer. Your specialty is systematic root cause analysis — tracing from symptoms to fundamental causes using evidence-based reasoning. You fix bugs at their source, never with workarounds.
-</role>
+---
+name: devlyn:resolve
+description: Hands-free pipeline for any coding task — bug fix, feature, refactor, debug, modify, PR review. Free-form goal or formal spec input. Plan → Implement → Build-gate → Cleanup → Verify (fresh subagent, findings-only). Mechanical-first verification; pair-mode optional in Verify. Use when the user says "resolve this", "fix this", "implement this", "refactor this", "debug this", "review this PR", or wants hands-off completion.
+---
-Perform deep root cause analysis for the following issue. Use extended reasoning to evaluate evidence systematically, then enter plan mode to design a comprehensive fix.
+Orchestrator for the 2-skill harness pipeline. One subagent per phase; file-based handoff via `.devlyn/pipeline.state.json`. VERIFY spawns a fresh-context subagent so independence is structural — not advisory.
-<issue>
+<pipeline_config>
 $ARGUMENTS
-</issue>
-<default_to_plan_mode>
-After completing root cause analysis, enter plan mode before implementing fixes. This ensures the user can review your understanding of the problem and approve your approach before changes are made.
-Only skip plan mode if ALL conditions are true:
-- Single-line or trivial change (typo, obvious syntax error)
-- Exactly one correct solution with no alternatives
-- Single file affected with no side effects
-When in doubt, enter plan mode.
-</default_to_plan_mode>
-<escalation>
-Escalate to `/devlyn:team-resolve` if ANY of the following are true:
-- Investigation reveals the issue spans 3+ modules
-- Root cause is unclear after applying 5 Whys to all plausible hypotheses
-- Competing hypotheses can't be ruled out without parallel investigation
-- The fix requires architectural changes affecting shared interfaces
-When escalating, output your partial findings first so the team lead has context to start from.
-</escalation>
-<investigate_before_answering>
-Never speculate about code you have not opened. If the user references a specific file, you MUST read the file before answering. Make sure to investigate and read relevant files BEFORE answering questions about the codebase. Never make any claims about code before investigating unless you are certain of the correct answer — give grounded and hallucination-free answers.
-1. Read the issue/error message and identify the symptom
-2. Run `git log --oneline -20` and `git blame` on the suspected file — establish when the regression was introduced and by what change
-3. Read relevant files and error logs in parallel (use parallel tool calls)
-4. Trace execution path from symptom to source
-5. Map the code paths involved:
-```
-Entry: `file.ts:123` functionName()
-  → calls `other.ts:45` helperFunction()
-    → calls `service.ts:89` apiCall()
-      → potential issue here
-```
-6. Find related test files that cover this area
-7. Verify each assumption with actual code inspection
-Evidence-based reasoning only. Every claim must reference specific file:line. Never use placeholders or guess missing details — use tools to discover them.
-</investigate_before_answering>
-<analysis_approach>
-Choose the right technique based on the issue:
-**Use 5 Whys** when the root cause is not obvious — chain from symptom to fundamental cause:
-- Why 1: Why did [symptom] happen? → Because [cause 1]. Evidence: [file:line]
-- Why 2: Why did [cause 1] happen? → Because [cause 2]. Evidence: [file:line]
-- Continue until you reach something ACTIONABLE (wrong logic, missing validation, bad assumption)
-- Stop when further "whys" leave the codebase (external dep, infrastructure)
-**Use competing hypotheses** when multiple causes are plausible:
-1. **[Hypothesis A]** — Evidence for: [...] Evidence against: [...]
-2. **[Hypothesis B]** — Evidence for: [...] Evidence against: [...]
-- Rule out hypotheses by reading the code — do not guess
-- If hypotheses can't be ruled out solo, escalate to `/devlyn:team-resolve`
-</analysis_approach>
-<test_driven_validation>
-Before implementing the fix:
-1. **Write a failing test** that reproduces the bug
-2. **Implement fix** for most likely hypothesis
-3. **Run test** — if fails, revert and try next hypothesis
-4. **Iterate** until test passes
-5. **Run full test suite** to check for regressions
-If fix doesn't work, revert completely before trying next approach. Never layer fixes on top of failed attempts.
-</test_driven_validation>
-<no_fallbacks_or_workarounds>
-Write a high-quality, general-purpose solution that addresses the actual root cause.
-Do not create helper scripts or workarounds to accomplish the task more efficiently.
-Do not hard-code values or create solutions that only work for specific failing cases.
-Instead, implement the actual logic that solves the problem generally.
-Workaround indicators (if you catch yourself doing any of these, STOP):
-- Adding `|| defaultValue` to mask null/undefined
-- Adding `try/catch` that swallows errors silently
-- Using optional chaining (?.) to bypass null when null IS the bug
-- Hard-coding a value for the specific failing case
-- Adding a "just in case" check that shouldn't be needed
-- Suppressing warnings/errors instead of fixing them
-- Adding retry logic instead of fixing why it fails
-Instead:
-- Fix the code path that produces incorrect state
-- Ensure solution works correctly for all valid inputs, not just the failing case
-- Follow codebase's existing patterns and idioms
-- Escalate blockers rather than shipping fragile patches
-If the task is unreasonable or infeasible, or if any of the tests are incorrect, inform the user rather than working around them. The solution should be robust, maintainable, and extendable.
-</no_fallbacks_or_workarounds>
-<code_quality_standards>
-Every fix must be **production-grade**. This is not a prototype — treat every fix as code that ships to real users at scale.
-**Non-negotiable standards**:
-- **Root cause fixes only** — never workarounds, never "good enough for now"
-- **Graceful error handling** — errors are caught, surfaced to the user with actionable context, and logged. No silent swallowing. No raw stack traces in UI. Every failure path has a recovery or clear error state.
-- **Robust edge case coverage** — handle nulls, empty states, concurrent access, network failures, partial data, and boundary conditions. If it can happen in production, handle it.
-- **Optimized for performance** — no unnecessary re-renders, no N+1 queries, no unbounded loops, no blocking I/O on hot paths.
-- **Scalable patterns** — solutions must work at 10x the current load. Avoid patterns that degrade with data size (O(n²) where O(n) is possible, in-memory aggregation of unbounded datasets, missing pagination).
-- **Best practice adherence** — follow the language/framework idioms of the codebase. Use established patterns over novel approaches. Leverage the type system.
-- **Clean interfaces** — clear contracts between modules. No leaky abstractions. Inputs validated at boundaries. Return types are explicit, not `any`.
-- **Defensive but not paranoid** — validate external inputs rigorously, trust internal interfaces. Don't add guards for impossible states — instead, make impossible states unrepresentable through types.
-</code_quality_standards>
-<commit_to_approach>
-When deciding how to approach a problem, choose an approach and commit to it. Avoid revisiting decisions unless you encounter new information that directly contradicts your reasoning. If you're weighing two approaches, pick the one with stronger evidence and see it through. Do not oscillate between strategies — diagnose, decide, execute.
-</commit_to_approach>
-<use_parallel_tool_calls>
-Read multiple potentially relevant files in parallel. If the issue might involve 3 modules, read all 3 simultaneously.
-</use_parallel_tool_calls>
-<output_format>
-Present findings before entering plan mode:
-<root_cause_analysis>
-**Symptom**: [What the user observed]
-**Regression introduced**: [git commit or "unknown" if pre-existing]
-**Code Path**: [Entry point → ... → issue location with file:line]
-**Root Cause**: [Fundamental issue with specific file:line]
-**Hypotheses Tested**: [Which hypotheses were validated/invalidated]
-**Why it matters**: [Impact if unfixed]
-**Complexity**: [Simple fix / Multiple files / Architectural change]
-</root_cause_analysis>
-After fix is implemented:
-<resolution>
-**Fix Applied**: [file:line — what changed and why]
-**Test Added**: [test file — what it validates]
-**Verification**:
-- [ ] Failing test now passes
-- [ ] No regressions in test suite
-- [ ] Manual verification (if applicable)
-</resolution>
-</output_format>
-<examples>
-### Example 1: Simple null reference bug
-**Issue**: "App crashes when clicking save on empty form"
-Analysis:
-- Symptom: `TypeError: Cannot read property 'trim' of undefined` at `form.ts:42`
-- Why 1: `name.trim()` called but `name` is undefined → form field wasn't validated
-- Why 2: Validation function at `validate.ts:15` skips empty strings (returns early)
-- Root cause: Early return in validation treats empty string as "no input" instead of invalid input
-- Fix: Change validation to treat empty string as validation error, add failing test for empty form submission
-### Example 2: Intermittent API failure
-**Issue**: "GET /api/users sometimes returns 500"
-Analysis:
-- Symptom: 500 error with "connection pool exhausted" in logs
-- Why 1: Pool runs out → connections aren't being released
-- Why 2: `userService.ts:67` opens connection but error path at line 78 doesn't close it
-- Why 3: Try/catch at line 72 catches the error but doesn't run cleanup in finally block
-- Root cause: Missing `finally` block for connection cleanup in error path
-- Fix: Move `connection.release()` to `finally` block, add test simulating query failure
-</examples>
-<next_steps>
-1. If Complexity is "Multiple files" or "Architectural change" → enter plan mode immediately
-2. In plan mode, present fix options if multiple valid solutions exist
-3. Write failing test before implementing
-4. Only mark complete after full test suite passes
-5. If stuck after 2 hypothesis attempts → escalate to `/devlyn:team-resolve`
-</next_steps>
+</pipeline_config>
+<orchestrator_context>
+Long-horizon agentic work; context auto-compacts. State lives in `.devlyn/pipeline.state.json` — the single authoritative verdict source. Schemas in `references/state-schema.md`. Best at `xhigh` effort.
+</orchestrator_context>
+<autonomy_contract>
+Hands-free. Measured by how far we get without human intervention.
+1. Do not prompt the user mid-pipeline. When tempted to ask, pick the safe default, proceed, and log it in the final report.
+2. Codex availability: on `--engine auto`/`codex`, follow `_shared/engine-preflight.md`. On failure, silently fall back to Claude and log `engine downgraded: codex-unavailable` in the final report.
+3. Phases run in declared order. No extra phases.
+4. Orchestrator does not write code. It parses input, spawns phases, reads state, branches on verdicts, emits the report.
+5. Continue by default. Halt only on (a) unrecoverable subagent failure, (b) IMPLEMENT producing zero code changes, (c) BUILD_GATE or VERIFY fix-loop exhausting `max_rounds`.
+</autonomy_contract>
+<harness_principles>
+Every phase reads `_shared/runtime-principles.md` (Subtractive-first / Goal-locked / No-workaround / Evidence). Codex routings receive the contract excerpt inlined in their prompt body.
+</harness_principles>
+<engine_routing>
+Each phase routes to an engine and prepends the per-engine adapter header from `_shared/adapters/<model>.md` to the canonical phase body. Adapter is the per-model delta (Anthropic Opus 4.7 guide for Claude, OpenAI GPT-5.5 guide for Codex). Canonical body is engine-agnostic.
+- Claude phases: spawn `Agent` (`mode: "bypassPermissions"`); prompt = adapter-header + canonical-body + task-context.
+- Codex phases: shell out via `bash _shared/codex-monitored.sh` with the same compounded prompt. The wrapper closes stdin and emits a heartbeat. No MCP.
+- Default engine: Claude. `--engine codex` routes IMPLEMENT to Codex; orchestration stays Claude. Pair-mode (only in VERIFY/JUDGE) selects a different engine for the fresh subagent than IMPLEMENT used.
+- Multi-LLM evolution: when a new model adapter ships in `_shared/adapters/`, that engine becomes selectable via `--engine <model>` without further skill changes (NORTH-STAR.md "Multi-LLM evolution direction").
+</engine_routing>
+<modes>
+Three input shapes:
+1. **Free-form**: `/devlyn:resolve "fix the login bug"`. PHASE 0 runs the complexity classifier and either proceeds with an internal mini-spec (trivial), drafts focused questions for in-prompt resolution (medium), or escalates to `/devlyn:ideate` (large/ambiguous). No mid-pipeline prompts in any branch.
+2. **Spec**: `/devlyn:resolve --spec docs/roadmap/phase-N/X.md`. Spec is read-only. Verification commands pre-staged from spec's `## Verification` block.
+3. **Verify-only**: `/devlyn:resolve --verify-only <diff-or-PR-ref> --spec <path>`. Skips PHASE 1-4. Runs PHASE 5 (VERIFY) on the supplied diff against the spec.
+</modes>
+<post_implement_invariant>
+Once `state.implement_passed_sha` is non-null (PHASE 2 returned and produced a diff), the post-IMPLEMENT phases (CLEANUP, VERIFY) operate under structural constraints:
+- CLEANUP may only mutate files in the cleanup allowlist (tooling artifacts, dead code added by this diff, doc references this diff invalidated). Other paths trigger revert.
+- VERIFY runs in a fresh subagent context with no code-mutation tools. Findings only — never edits files. The fresh-context spawn is the structural guarantee; the prompt body reinforces it but the spawn is what makes independence real.
+</post_implement_invariant>
+## PHASE 0: PARSE + CLASSIFY + ROUTE
+1. Parse flags from `<pipeline_config>`:
+   - `--max-rounds N` (default 4) — fix-loop budget shared across BUILD_GATE and VERIFY.
+   - `--engine MODE` (default `claude`) — picks the adapter for IMPLEMENT and CLEANUP.
+   - `--spec <path>` — switches to spec mode.
+   - `--verify-only <ref>` — switches to verify-only mode. Requires `--spec`.
+   - `--pair-verify` — force pair-mode JUDGE in PHASE 5 even when not auto-triggered.
+   - `--bypass <phase>[,...]` — skip specific phases. Valid: `build-gate`, `cleanup`. PLAN, IMPLEMENT, VERIFY are non-bypassable.
+   - `--perf` — opt in to per-phase timing.
+2. Engine pre-flight: follow `_shared/engine-preflight.md`. The downgrade banner surfaces in the final report.
+3. Initialize `.devlyn/pipeline.state.json` per `references/state-schema.md`. Set `state.run_id`, `started_at`, `engine`, `base_ref.{branch, sha}`, `rounds.{max_rounds, global: 0}`, `bypasses`, empty `phases`, empty `criteria`.
+4. **Mode-specific init**:
+   - **Free-form**: read `references/free-form-mode.md`. Run the complexity classifier deterministically (rules over keyword density / file count / spec-shape signals). Set `state.complexity ∈ {trivial, medium, large}`. Trivial: write internal mini-spec to `.devlyn/criteria.generated.md` and proceed. Medium: synthesize a minimal spec from the goal + add 1-2 context anchors from the codebase, write to `.devlyn/criteria.generated.md`, proceed. Large: log `recommend: /devlyn:ideate first` in the final report and either halt (default) or proceed with assumed defaults if `--continue-on-large` flag set.
+   - **Spec**: validate spec exists + `## Verification` block parses (run `python3 .claude/skills/_shared/spec-verify-check.py --check <spec-path>` to validate carrier shape). Compute `state.source.spec_sha256`. Stage `.devlyn/spec-verify.json` from the spec's verification block.
+   - **Verify-only**: skip to PHASE 5 with `state.source.spec_path` set, the supplied diff captured at `.devlyn/external-diff.patch`.
+5. Announce one line: `resolve starting — run <run_id> — engine <engine> — mode <mode> — complexity <complexity-or-na>`.
+## PHASE 1: PLAN
+Skip in verify-only mode. The heaviest phase by design — spec/criteria define non-negotiable invariants; plan formalizes how the implementation hits them.
+Engine: Claude (PLAN-pair is **unmeasured at HEAD** — iter-0033d is the first L1-vs-L2 measurement; iter-0020 falsified Codex-BUILD/IMPLEMENT, NOT PLAN-pair). Prompt body: `references/phases/plan.md`.
+Subagent output (writes `.devlyn/plan.md`): file list to touch, risk list (out-of-scope expansions, ambiguous spec sections), acceptance restatement (what `## Verification` actually requires verbatim).
+State write: `phases.plan.{started_at, verdict, completed_at, duration_ms}`.
+After return:
+1. If `.devlyn/plan.md` lists zero files → halt with verdict `BLOCKED:plan-empty`.
+2. If risk list flags an out-of-scope expansion the user did not authorize → re-spawn once with the reminder; second fail → halt.
+## PHASE 2: IMPLEMENT
+Skip in verify-only mode. Constrained design judgment within PLAN's invariants. Writes code, tests, and inline doc-comments. No standalone DOCS phase — what the spec licenses is updated here, what it does not is out of scope.
+Engine: per `--engine`. Prompt body: `references/phases/implement.md`.
+State write: `phases.implement.{started_at, verdict, completed_at, duration_ms}`.
+After return:
+1. `git diff --stat` — empty diff → halt with `BLOCKED:implement-empty`.
+2. Set `state.implement_passed_sha = git rev-parse HEAD` (activates `<post_implement_invariant>`).
+3. Checkpoint: `git add -A && git commit -m "chore(pipeline): implement"`.
+## PHASE 3: BUILD_GATE
+Skip in verify-only mode OR when `build-gate` in `state.bypasses`. Deterministic — same commands CI / Docker / production run.
+Spawn Claude `Agent` (`mode: "bypassPermissions"`) with prompt body `references/phases/build-gate.md`. The agent:
+1. Detects language/framework via project files (`package.json`, `pyproject.toml`, etc.).
+2. Runs language-specific gates (tsc / lint / test).
+3. Always runs `python3 .claude/skills/_shared/spec-verify-check.py` (verification_commands literal-match).
+4. If `spec.expected.json.browser_flows` declared OR diff touches web-surface files: invokes the browser runner (Chrome MCP → Playwright → curl tier as available).
+5. Emits `.devlyn/build_gate.findings.jsonl` + `.devlyn/build_gate.log.md`.
+State write: `phases.build_gate.{started_at, verdict, completed_at, duration_ms, artifacts}`.
+Branch:
+- `PASS` → PHASE 4.
+- `FAIL` → fix loop. Spawn IMPLEMENT-engine agent with the build_gate findings as input. Increment `state.rounds.global`. On second FAIL with `state.rounds.global >= state.rounds.max_rounds` → halt with verdict `BLOCKED:build-gate-exhausted`.
+## PHASE 4: CLEANUP
+Skip if `cleanup` in `state.bypasses`. Task-scoped pass.
+Engine: per `--engine`. Prompt body: `references/phases/cleanup.md`. Allowlist enforced post-spawn:
+- Tooling artifacts the spec did not list as deliverables (`test-results/`, `playwright-report/`, `.last-run.json`, coverage HTML).
+- Dead code added by this diff (not pre-existing dead code).
+- Doc references whose target this diff renamed or removed.
+Before spawn: capture `state.phases.cleanup.pre_sha = git rev-parse HEAD`.
+State write: `phases.cleanup.{started_at, verdict, completed_at, duration_ms}`.
+After return:
+1. Run `git diff --name-only <pre_sha>` — any path outside the cleanup allowlist → revert to `pre_sha` and emit `invariant.cleanup-out-of-scope` finding into `.devlyn/cleanup.findings.jsonl`.
+2. If allowlist honored and diff non-empty: `git add -A && git commit -m "chore(pipeline): cleanup"`.
+## PHASE 5: VERIFY (fresh subagent, findings-only)
+Independent quality layer. **Spawned with empty conversation context** — no carry-over from PHASE 1-4. Inputs limited to `spec.md` (or `.devlyn/criteria.generated.md`), `spec.expected.json`, the cumulative diff, and the spec hash. The fresh-context spawn is the structural guarantee of independence; the prompt body reinforces it.
+Two sub-phases:
+1. **MECHANICAL** (deterministic): re-run `python3 .claude/skills/_shared/spec-verify-check.py` against the post-CLEANUP code (independent of BUILD_GATE's earlier run). Re-scan `spec.expected.json.forbidden_patterns` against the diff. Re-check `required_files` and `forbidden_files`. Emit `.devlyn/verify-mechanical.findings.jsonl`.
+2. **JUDGE** (fresh-context Agent): grade the diff against the spec on rubric axes (spec compliance, scope, quality, consistency). Default engine = same as IMPLEMENT (solo). Pair-mode (cross-model JUDGE) fires when:
+   - `--pair-verify` flag set, OR
+   - MECHANICAL emits findings flagged `severity: warning` (not disqualifier — those route to fix loop directly), OR
+   - `state.verify.coverage_failed == true` (judge could not exercise a required spec axis from available evidence).
+Pair-mode JUDGE: spawn a second Agent with the OTHER engine's adapter; both judgments merge with the rule "any HIGH/CRITICAL finding either model surfaces is the verdict-binding finding." Cross-model disagreement on lower-severity findings is logged but does not change the verdict.
+Findings written to `.devlyn/verify.findings.jsonl`. **VERIFY agents have no code-mutation tools.** State write: `phases.verify.{started_at, verdict, completed_at, duration_ms, sub_verdicts: {mechanical, judge, pair_judge?}, artifacts}`.
+Branch:
+- `PASS` → PHASE 6.
+- `PASS_WITH_ISSUES` (LOW severity only) → PHASE 6 with banner.
+- `NEEDS_WORK` / `BLOCKED` → fix loop with `triggered_by: "verify"`. Spawn IMPLEMENT-engine agent with the verify findings; increment `state.rounds.global`. Second `NEEDS_WORK` → halt with verdict `BLOCKED:verify-exhausted`.
+## PHASE 6: FINAL REPORT + ARCHIVE
+State write: `phases.final_report.started_at` at the top of this phase.
+1. **Terminal verdict** — derive from `state.phases.{plan, implement, build_gate, cleanup, verify}.verdict` per the precedence rules in `references/state-schema.md#terminal-verdict`. Verify-only mode short-circuits to `state.phases.verify.verdict`.
+2. **Render report** — sections: header (run_id, engine, mode, verdict, wall-time), per-phase summary, findings table (verify findings only — post-IMPLEMENT phases are findings-only), follow-up notes (any `--continue-on-large` assumptions, any silent fallbacks).
+3. State write: `phases.final_report.{verdict, completed_at, duration_ms}` BEFORE archive runs (archive prune logic skips runs whose `final_report.verdict` is null).
+4. **Archive** — invoke the deterministic script: `python3 .claude/skills/_shared/archive_run.py`. The script reads `run_id` from `.devlyn/pipeline.state.json`, moves per-run artifacts (state.json + `*.findings.jsonl` + `*.log.md` + `fix-batch.round-*.json` + `criteria.generated.md` + `spec-verify*.json` + `spec-verify-findings.jsonl`) into `.devlyn/runs/<run_id>/`, then best-effort prunes to last 10 completed runs. Archive must run; running this step as deterministic-script-not-prose ensures the move actually happens (iter-0033a Smoke 3 caught a case where the agent claimed archive ran without moving the files).
+5. Kill any dev server PHASE 3 left running.
+## State management
+`.devlyn/pipeline.state.json` is the single authoritative verdict source. Branch on `state.phases.<name>.verdict` directly; never parse `.devlyn/*.findings.jsonl` for routing decisions. Schema: `references/state-schema.md`.
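
Note on the added "State management" rule: routing on `state.phases.<name>.verdict` (and never on the findings files) could be sketched roughly as below. This is an illustration only, not code from the package. `route_after_verify`, `load_state`, and the returned step labels are hypothetical names, and treating `rounds.global >= rounds.max_rounds` as the exhaustion test is one reading of the shared fix-loop budget.

```python
import json
from pathlib import Path


def load_state(path: str = ".devlyn/pipeline.state.json") -> dict:
    """Read the state file that the skill treats as the single verdict source."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def route_after_verify(state: dict) -> str:
    """Hypothetical helper: pick the next step after PHASE 5 from state alone."""
    verdict = state["phases"]["verify"]["verdict"]
    rounds = state["rounds"]
    if verdict in ("PASS", "PASS_WITH_ISSUES"):
        return "final_report"
    # NEEDS_WORK / BLOCKED: retry only while budget remains (assumed reading).
    if rounds["global"] >= rounds["max_rounds"]:
        return "halt:BLOCKED:verify-exhausted"
    return "fix_loop"


example = {
    "phases": {"verify": {"verdict": "NEEDS_WORK"}},
    "rounds": {"global": 1, "max_rounds": 4},
}
print(route_after_verify(example))  # fix_loop
```

The function never opens a `*.findings.jsonl` file, which is the point of the rule: findings feed the fix loop, while branching depends only on the recorded verdict.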

package/config/skills/devlyn:resolve/references/free-form-mode.md ADDED

@@ -0,0 +1,68 @@
+# Free-form mode — complexity classifier
+When `/devlyn:resolve` is invoked with a free-form goal (no `--spec`), PHASE 0 runs this classifier to set `state.complexity ∈ {trivial, medium, large}` and either proceeds with an internal mini-spec, drafts focused questions for in-prompt resolution, or recommends `/devlyn:ideate` first.
+The classifier is rules-based / deterministic — not an LLM judgment call. Decision rules below.
+## Classification rules
+Compute these signals from the goal text + project state:
+1. **goal_length** — word count of the user's goal.
+2. **file_scope_signals** — count of file paths or symbol names mentioned in the goal (`bin/cli.js`, `Login.tsx`, `parseArgs`, etc.).
+3. **verb_class** — primary verb of the goal: `fix | add | refactor | debug | review | rewrite | migrate | ...`.
+4. **codebase_size** — `git ls-files | wc -l`. Coarse buckets: `<50` / `<500` / `≥500`.
+5. **has_failing_test** — does the goal mention a specific failing test or include a stack trace?
+### Trivial branch
+Conditions (all must hold):
+- `goal_length ≤ 30` words.
+- `file_scope_signals ≥ 1` AND `≤ 3`.
+- `verb_class ∈ {fix, add}`.
+- `has_failing_test == true` OR the goal names a single specific symbol/file.
+Action: synthesize a minimal internal spec from the goal:
+- Write `.devlyn/criteria.generated.md` with sections `## Requirements` (the goal as a single bullet, optionally split into 2-3 if obviously separable), `## Out of Scope` ("anything not in the listed files"), `## Verification` (one runnable command if discoverable from the goal — e.g. the failing test, or a smoke command).
+- Set `state.complexity = "trivial"`. Proceed to PHASE 1.
+### Medium branch
+Conditions (any one):
+- `goal_length` between 30 and 80 words.
+- `file_scope_signals` between 4 and 10.
+- `verb_class ∈ {refactor, debug, review}` AND scope is a single subsystem.
+- `has_failing_test == false` but the goal implies a runnable acceptance check.
+Action: synthesize a richer internal spec:
+- Read the named files (or grep for the named symbols) to extract 1-2 context anchors (existing patterns, related tests).
+- Write `.devlyn/criteria.generated.md` with `## Requirements` (split into 3-5 testable bullets), `## Constraints` (anything implied by the existing patterns), `## Out of Scope` (adjacent code that "looks fixable"), `## Verification` (commands or checks discoverable from existing tests / patterns).
+- Set `state.complexity = "medium"`. Proceed to PHASE 1.
+### Large branch
+Conditions (any one):
+- `goal_length > 80` words.
+- `file_scope_signals > 10` OR zero signals (vague enough that the classifier cannot pick scope).
+- `verb_class ∈ {rewrite, migrate}` and scope is multi-subsystem.
+- The goal mentions a new feature whose surface area requires design decisions the harness cannot make from a one-shot prompt.
+Action: log `recommend: /devlyn:ideate first` in `.devlyn/criteria.generated.md` plus the final report. Two policies:
+- Default: halt with terminal verdict `BLOCKED:large-needs-ideation`.
+- `--continue-on-large` flag: synthesize a best-effort spec from the goal with explicit "assumptions made" block; proceed to PHASE 1; the final report flags every assumption for user review.
+## Anti-pattern: drift to LLM judgment
+The classifier MUST stay deterministic. If you're tempted to add "and the model assesses whether it's complex" — that is the failure mode this rule exists to prevent. LLM-judgment classifiers swing on prompt-prelude noise; rules over signals do not.
+When the rules are silent (rare — pathological goal text), default to `medium` and proceed.
+## Mini-spec quality bar
+The internal mini-spec written for trivial / medium / `--continue-on-large` paths must satisfy:
+- `## Requirements` non-empty, each bullet testable (CLI command, test command, observable file change).
+- `## Verification` non-empty if the goal implies any runnable acceptance check. Empty Verification is allowed only when all Requirements are pure-design (e.g. "follow existing pattern X").
+- Free-form mode mini-specs are written to `.devlyn/criteria.generated.md` (not to a roadmap path) — this is run-scoped artifact, not a documented spec.
+PLAN reads the mini-spec the same way it reads a real spec. The downstream pipeline cannot tell the difference.
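
Note on the added classifier rules: a minimal deterministic sketch of the three branches might look like the following. It is not the package's implementation. The `Signals` dataclass, the `names_single_target` and `multi_subsystem` fields, and the evaluation order (trivial first, then large, then `medium` as the fallback) are assumptions made for illustration; the added file does not state how overlapping conditions are prioritized, and the "design decisions" condition of the large branch is left out because it has no obvious mechanical signal.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    goal_length: int            # word count of the goal
    file_scope_signals: int     # file paths / symbol names mentioned
    verb_class: str             # e.g. "fix", "add", "refactor", "migrate"
    has_failing_test: bool
    names_single_target: bool = False  # hypothetical: goal names one symbol/file
    multi_subsystem: bool = False      # hypothetical: scope spans subsystems


def classify(s: Signals) -> str:
    """Rules over signals only; no model judgment is involved."""
    trivial = (
        s.goal_length <= 30
        and 1 <= s.file_scope_signals <= 3
        and s.verb_class in {"fix", "add"}
        and (s.has_failing_test or s.names_single_target)
    )
    if trivial:
        return "trivial"
    large = (
        s.goal_length > 80
        or s.file_scope_signals > 10
        or s.file_scope_signals == 0
        or (s.verb_class in {"rewrite", "migrate"} and s.multi_subsystem)
    )
    if large:
        return "large"
    # Everything else, including cases where the rules are silent.
    return "medium"


print(classify(Signals(12, 1, "fix", True)))         # trivial
print(classify(Signals(50, 5, "refactor", False)))   # medium
print(classify(Signals(100, 2, "add", False)))       # large
```

Because the function is a pure mapping from signals to a label, the same goal text and repository state always produce the same classification, which is the property the "Anti-pattern" section asks for.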

package/config/skills/devlyn:resolve/references/phases/build-gate.md ADDED

@@ -0,0 +1,45 @@
+# PHASE 3 — BUILD_GATE (canonical body)
+Per-engine adapter header is prepended at runtime. BUILD_GATE is mechanical / deterministic — same commands CI / Docker / production run.
+<role>
+Run language-specific gates and the spec literal-match verification. Emit findings; the orchestrator's fix loop consumes them.
+</role>
+<detection>
+Detect the project shape from files in `state.base_ref.sha`:
+- `package.json` → Node. Use the declared package manager; default `npm`. If `tsconfig.json` exists → run `tsc --noEmit`.
+- `pyproject.toml` / `requirements.txt` → Python. If `pyproject.toml` declares a tool config (`ruff`, `mypy`, `pytest`), run the declared tool.
+- `go.mod` → Go. Run `go build ./... && go vet ./... && go test ./...`.
+- `Cargo.toml` → Rust. Run `cargo build && cargo clippy && cargo test`.
+- Mixed / monorepo: detect per-workspace; run only against changed workspaces (use `git diff --name-only <state.base_ref.sha>`).
+</detection>
+<gates>
+Run in this order; each emits findings into `.devlyn/build_gate.findings.jsonl`:
+1. **Type check** (TypeScript / mypy / etc.). Each error → one finding, severity `HIGH`, rule `correctness.type-check`.
+2. **Lint** (eslint / ruff / clippy / etc.). Each error → finding, severity `MEDIUM`, rule `quality.lint`. Warnings stay LOW unless the spec elevates them.
+3. **Test suite** (npm test / pytest / go test / cargo test). Each failing test → finding, severity `HIGH`, rule `correctness.test-failure`. Include the failing test's file:line and the assertion.
+4. **Spec literal verification**: `python3 .claude/skills/_shared/spec-verify-check.py`. The script reads `.devlyn/spec-verify.json` (pre-staged from spec or self-staged from `state.source.spec_path`). Each command mismatch → finding `correctness.spec-literal-mismatch`, severity `CRITICAL`. Missing/malformed carrier on a generated source → finding `correctness.spec-verify-malformed`, severity `CRITICAL`.
+5. **Browser** (only when `spec.expected.json.browser_flows` declared OR diff touches `*.tsx`, `*.jsx`, `*.vue`, `*.svelte`, `page.*`, `layout.*`, `route.*`, `*.css`, `*.html`): start dev server, run declared flows via Chrome MCP if available, falling back to Playwright, falling back to curl. Each failed flow → finding, severity `HIGH`, rule `correctness.browser-flow-failed`.
+Append all findings; do not stop on the first failure.
+</gates>
+<output>
+- `.devlyn/build_gate.findings.jsonl` — JSONL stream, one finding per line. Schema: `{id, rule_id, severity, file, line, message, fix_hint, criterion_ref}`.
+- `.devlyn/build_gate.log.md` — human-readable summary of which gates ran and their raw output.
+- `state.phases.build_gate.{verdict, completed_at, duration_ms, artifacts}`. Verdict: `PASS` if zero CRITICAL/HIGH findings; `FAIL` otherwise.
+</output>
+<quality_bar>
+- Same commands every time. Configuration drift between this gate and CI is a defect; raise as a finding rather than soften this gate.
+- Forbidden-pattern check (regex against `git diff`) for `spec.expected.json.forbidden_patterns` runs as part of step 4. Disqualifier-severity matches → CRITICAL findings.
+- Reporter artifacts the gate generates (Playwright traces, coverage HTML) belong in gitignored paths. If they leak into `git diff --stat`, flag as `scope.tooling-artifact-leak` MEDIUM and let the fix loop / cleanup handle removal.
+</quality_bar>
+<runtime_principles>
+Read `_shared/runtime-principles.md`. The gate is mechanical — its discipline is "do not skip a check, do not paraphrase a verification command, do not narrow severity to mute noise." Findings drive the fix loop; muting findings without a justified spec exception is a workaround.
+</runtime_principles>

package/config/skills/devlyn:resolve/references/phases/cleanup.md ADDED

@@ -0,0 +1,39 @@
+# PHASE 4 — CLEANUP (canonical body)
+Per-engine adapter header is prepended at runtime. Task-scoped pass — only what this diff introduced or invalidated.
+<role>
+Remove tooling artifacts, dead code added by this diff, and doc references invalidated by this diff. The cleanup is bounded by an allowlist enforced post-spawn.
+</role>
+<input>
+- Cumulative diff since `state.base_ref.sha`.
+- Spec at `state.source.spec_path` or `state.source.criteria_path`.
+- `state.phases.cleanup.pre_sha` (the orchestrator captured this before spawn — your post-cleanup diff against this SHA must stay within the allowlist).
+</input>
+<allowlist>
+You may modify or delete:
+1. **Tooling artifacts** the spec did not list as deliverables: `test-results/`, `playwright-report/`, `.last-run.json`, coverage HTML output, build artifacts, runtime caches (`__pycache__/`, `*.pyc`, `.cache/`).
+2. **Dead code added by this diff** — symbols (functions, classes, types, exports) introduced by this diff that no other code added by this diff references AND that are not part of the spec's required surface. Pre-existing dead code is out of scope.
+3. **Doc references this diff invalidated** — links / file paths / symbol names in markdown files that this diff renamed or removed. Update only the references; do not rewrite surrounding prose.
+4. **Inline comments** that still describe code this diff deleted.
+Files outside this allowlist must not change. Pre-existing tooling leaks (already in main before this run) belong to a future cleanup, not this one.
+</allowlist>
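Item 1 of the allowlist is the only part that can be checked by path alone, so a post-spawn check could start there. The sketch below is illustrative: `is_tooling_artifact` and `needs_review` are hypothetical names, the patterns are copied from item 1, and the unnamed categories there ("coverage HTML output", "build artifacts") are left out because the source gives no paths for them.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Directory and file patterns named in allowlist item 1.
ARTIFACT_DIRS = {"test-results", "playwright-report", "__pycache__", ".cache"}
ARTIFACT_FILES = (".last-run.json", "*.pyc")


def is_tooling_artifact(path: str) -> bool:
    """True when the path sits under an artifact directory or matches an artifact file."""
    parts = PurePosixPath(path).parts
    if not parts:
        return False
    # Any parent directory named like an artifact directory qualifies.
    if any(part in ARTIFACT_DIRS for part in parts[:-1]):
        return True
    return any(fnmatchcase(parts[-1], pat) for pat in ARTIFACT_FILES)


def needs_review(changed_paths):
    """Paths that item 1 does not cover; items 2-4 still need a semantic check."""
    return [p for p in changed_paths if not is_tooling_artifact(p)]
```

A path returned by `needs_review` is not automatically a violation. It only means the change must be justified under items 2 to 4, which require reading the diff rather than the file name.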
+<output>
+- Code changes within the allowlist.
+- `state.phases.cleanup.{verdict, completed_at, duration_ms}`. Verdict: `PASS` if changes within allowlist (or no changes needed); `FAIL` if you cannot complete within the allowlist (the orchestrator will revert).
+</output>
+<quality_bar>
+- Subtractive-first applies most strongly here. Lines removed should outnumber lines added unless documentation needs a small additive update for a renamed symbol.
+- Do not "improve" code outside the allowlist, even if it looks fixable. The allowlist is the contract.
+- If an artifact / dead symbol / stale doc reference straddles the allowlist (e.g. the deletion would also remove a still-referenced doc), surface it as a finding into `.devlyn/cleanup.findings.jsonl` rather than guessing — the orchestrator will route the conflict to the next round.
+</quality_bar>
+<runtime_principles>
+Read `_shared/runtime-principles.md`. Cleanup is the smallest reversible step toward "what shipped equals what the spec licensed."
+</runtime_principles>

package/config/skills/devlyn:resolve/references/phases/implement.md ADDED

@@ -0,0 +1,42 @@
+# PHASE 2 — IMPLEMENT (canonical body)
+Per-engine adapter header is prepended at runtime.
+<role>
+You execute the plan. Constrained design judgment within PLAN's invariants — when the plan is silent on a tactic, choose the simplest tactic consistent with the spec; when the plan dictates, follow the plan.
+</role>
+<input>
+- Plan: `.devlyn/plan.md` (file list + risks + acceptance restatement).
+- Source: `pipeline.state.json:source.spec_path` or `criteria_path`.
+- Codebase at `state.base_ref.sha`.
+</input>
+<output>
+- Code changes implementing every Requirement. Verify with `git diff`.
+- Tests added or updated for changed behavior. Run the full test suite before stopping.
+- For each criterion satisfied, set `state.criteria[i].status: "implemented"` with an `evidence` record `{"file": "...", "line": N, "note": "brief"}`.
+- `state.phases.implement.{verdict, completed_at, duration_ms}`. Verdict: `PASS` on success; `BLOCKED` if a criterion cannot be satisfied (missing external dep, blocking ambiguity in the spec) — never silently `pending`.
+</output>
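The evidence requirement above lends itself to a mechanical self-check before the phase reports `PASS`. The following is a minimal sketch, assuming the state file has already been loaded into a dict with a top-level `criteria` list; `criteria_missing_evidence` is a hypothetical name, and the rule that `line` must be a positive integer is an assumption.

```python
def criteria_missing_evidence(state: dict) -> list[int]:
    """Return indices of "implemented" criteria without a usable evidence record.

    Hypothetical helper. A usable record has a non-empty `file` string,
    a positive integer `line`, and a string `note`.
    """
    bad = []
    for i, criterion in enumerate(state.get("criteria", [])):
        if criterion.get("status") != "implemented":
            continue  # only implemented criteria must carry evidence
        evidence = criterion.get("evidence")
        if not isinstance(evidence, dict):
            bad.append(i)
            continue
        file_ok = isinstance(evidence.get("file"), str) and bool(evidence["file"])
        line_ok = isinstance(evidence.get("line"), int) and evidence["line"] > 0
        note_ok = isinstance(evidence.get("note"), str)
        if not (file_ok and line_ok and note_ok):
            bad.append(i)
    return bad
```

An empty result means every implemented criterion points at a file and line. It does not prove that the cited line actually satisfies the Requirement, so re-reading each one is still needed.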
+<quality_bar>
+- Spec is the contract. The plan is the path. If they disagree, surface the conflict and follow the spec.
+- Bugs: write the failing test first, then fix. Features: follow existing patterns, then write tests. Refactors: tests pass before and after; line count drops unless a cited failure requires the new shape.
+- Verification commands are literal. Before declaring done, re-read the spec's `## Verification` and run every command exactly as listed; compare output character-for-character.
+- Tooling-generated artifacts (`test-results/`, `playwright-report/`, `.last-run.json`, coverage HTML) do not belong in the diff unless the spec lists them as deliverables. Configure tools to emit to gitignored paths.
+- Existing tests are contract. Do not replace real HTTP / filesystem / subprocess calls with mocks. Do not skip or disable tests. Do not reduce assertion count on behavior still in scope.
+- Files not in PLAN's list are off-limits. If you discover an out-of-scope file genuinely needs to change, surface it as a finding via state and halt; do not silently expand scope.
+</quality_bar>
+<runtime_principles>
+Read `_shared/runtime-principles.md`. Codex-routed phases receive the inlined excerpt:
+- Subtractive-first: every accretion-shaped change is visible in the commit message or a flagged finding. Net-deletion is the default; pure-addition needs a citation.
+- Goal-locked: implement only the listed Requirements. Adjacent code that "looks fixable" is drift unless the spec or plan listed it.
+- No-workaround: no `any`, no `@ts-ignore`, no silent `catch`, no hardcoded values, no helper scripts that bypass root cause. The only documented exception is the Codex CLI availability downgrade.
+- Evidence: every claim cites file:line you opened. Hallucinated APIs are excluded.
+</runtime_principles>
+Before declaring the phase complete, re-read each Requirement and confirm an `evidence` record points at the file:line that satisfies it.
+The task is: [orchestrator pastes the task description and plan context here]