devlyn-cli 1.14.0 → 1.15.0

Files changed (22)
  1. package/CLAUDE.md +28 -149
  2. package/config/skills/devlyn:auto-resolve/SKILL.md +165 -515
  3. package/config/skills/devlyn:auto-resolve/evals/evals.json +21 -0
  4. package/config/skills/devlyn:auto-resolve/evals/task-doctor-subcommand.md +42 -0
  5. package/config/skills/devlyn:auto-resolve/references/build-gate.md +36 -22
  6. package/config/skills/devlyn:auto-resolve/references/engine-routing.md +43 -165
  7. package/config/skills/devlyn:auto-resolve/references/findings-schema.md +103 -0
  8. package/config/skills/devlyn:auto-resolve/references/phases/phase-1-build.md +54 -0
  9. package/config/skills/devlyn:auto-resolve/references/phases/phase-2-evaluate.md +45 -0
  10. package/config/skills/devlyn:auto-resolve/references/phases/phase-3-critic.md +84 -0
  11. package/config/skills/devlyn:auto-resolve/references/pipeline-routing.md +114 -0
  12. package/config/skills/devlyn:auto-resolve/references/pipeline-state.md +201 -0
  13. package/config/skills/devlyn:auto-resolve/scripts/archive_run.py +104 -0
  14. package/config/skills/devlyn:auto-resolve/scripts/terminal_verdict.py +96 -0
  15. package/config/skills/devlyn:ideate/SKILL.md +12 -64
  16. package/config/skills/devlyn:ideate/references/codex-critic-template.md +42 -0
  17. package/config/skills/devlyn:preflight/SKILL.md +25 -40
  18. package/config/skills/devlyn:preflight/references/auditors/code-auditor.md +6 -10
  19. package/config/skills/devlyn:reap/SKILL.md +104 -0
  20. package/config/skills/devlyn:reap/scripts/reap.sh +129 -0
  21. package/config/skills/devlyn:reap/scripts/scan.sh +116 -0
  22. package/package.json +5 -1
@@ -0,0 +1,21 @@
1
+ {
2
+ "skill_name": "devlyn:auto-resolve",
3
+ "note": "Real regression fixtures from measured A/B runs. Each eval is a task we have actually driven through the pipeline and scored blind against a bare-prompt baseline. Expand only with real runs, not synthesized prompts.",
4
+ "evals": [
5
+ {
6
+ "id": 1,
7
+ "name": "doctor-subcommand",
8
+ "route": "standard",
9
+ "complexity": "medium",
10
+ "prompt": "Implement per spec at evals/task-doctor-subcommand.md",
11
+ "spec": "evals/task-doctor-subcommand.md",
12
+ "expected_output": "Single-file CLI change adding a `doctor` subcommand to bin/devlyn.js with node-version, HOME/.claude, plugins, and skills checks, TTY-gated color, exit-code semantics, help integration. Zero new npm dependencies. No silent error catches.",
13
+ "baseline_source": "bare prompt (no skill pipeline) — see benchmark history commit 2d9a5f0..51a8aeb",
14
+ "historical_scores": {
15
+ "v3.4": { "skill": 57, "bare": 45, "margin": 12, "judge": "codex gpt-5.3-codex blind" },
16
+ "v3.4.1": { "skill": 59, "bare": 43, "margin": 16, "judge": "codex gpt-5.3-codex blind" }
17
+ },
18
+ "files": []
19
+ }
20
+ ]
21
+ }
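Each `historical_scores` entry above carries a `margin` that should equal `skill - bare`. The following is a minimal Python sketch of that consistency check, assuming the fixture has the shape shown in the hunk; `check_margins` and the trimmed `EVALS_JSON` copy are hypothetical and not part of the package.

```python
import json

# Trimmed copy of the fixture above; only the fields the check needs.
EVALS_JSON = """
{
  "skill_name": "devlyn:auto-resolve",
  "evals": [
    {
      "id": 1,
      "name": "doctor-subcommand",
      "historical_scores": {
        "v3.4": {"skill": 57, "bare": 45, "margin": 12},
        "v3.4.1": {"skill": 59, "bare": 43, "margin": 16}
      }
    }
  ]
}
"""


def check_margins(doc: dict) -> list[str]:
    """Return one message per entry whose margin differs from skill - bare."""
    problems = []
    for ev in doc["evals"]:
        for version, s in ev.get("historical_scores", {}).items():
            expected = s["skill"] - s["bare"]
            if s["margin"] != expected:
                problems.append(
                    f"{ev['name']} {version}: margin {s['margin']} != {expected}"
                )
    return problems


problems = check_margins(json.loads(EVALS_JSON))
print(problems or "all margins consistent")
```

A check like this could run before a new measured score is appended, so a hand-edited margin cannot drift from the two scores it summarizes.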
@@ -0,0 +1,42 @@
1
+ # Task: `devlyn doctor` subcommand
2
+
3
+ Add a new `doctor` subcommand to `bin/devlyn.js`. When the user runs `npx devlyn-cli doctor` (or `node bin/devlyn.js doctor`), it diagnoses the local devlyn-cli installation and prints a status report.
4
+
5
+ ## Requirements
6
+
7
+ 1. **Node version check** — `process.version >= v18.0.0`. Emit a status line. If below, mark FAIL.
8
+ 2. **`$HOME/.claude/` check** — exists as directory AND is writable. Missing → FAIL. Exists but not writable (EACCES) → FAIL with a distinct "permission" message.
9
+ 3. **Installed plugins scan** — read subdirectories of `$HOME/.claude/plugins/cache/` and print a summary line with the count. `--verbose` lists names.
10
+ 4. **Installed skills scan** — count files matching `$HOME/.claude/skills/**/SKILL.md`. Print count; `--verbose` lists relative paths.
11
+ 5. **Colored output** — each line prefixed with `[OK]` (green), `[WARN]` (yellow), `[FAIL]` (red) using ANSI escape codes, **only when `process.stdout.isTTY` is true**. Non-TTY → no color codes.
12
+ 6. **Summary line** — e.g., `doctor: 3 ok, 1 warn, 0 fail`.
13
+ 7. **Exit code** — `0` if zero fails, `1` if any fail.
14
+ 8. **`--verbose` flag** — expands details for plugins/skills scans.
15
+ 9. **Help integration** —
16
+ - `node bin/devlyn.js doctor --help` prints a short help block and exits 0.
17
+ - `node bin/devlyn.js --help` / `node bin/devlyn.js help` lists `doctor` as an available subcommand.
18
+
19
+ ## Constraints
20
+
21
+ - **Zero new dependencies.** Use only Node.js built-ins (`fs`, `path`, `os`, `process`).
22
+ - **No silent error catches.** Per project CLAUDE.md error-handling philosophy, do not wrap operations in `try { … } catch { return fallbackValue }`. All errors visible to the user with actionable messages.
23
+ - **HOME guard.** If `process.env.HOME` is undefined or empty, emit a clear FAIL line ("HOME environment variable is not set") and exit 1. Do not attempt to read arbitrary paths.
24
+ - **EACCES handling.** If `readdirSync` fails with EACCES, emit a permission-specific message quoting the offending path. Do not silently return an empty list.
25
+
26
+ ## Acceptance verification
27
+
28
+ Run each of these and they must behave as described:
29
+
30
+ - `node bin/devlyn.js doctor` — produces the status report, exits 0 on a clean machine.
31
+ - `HOME=/nonexistent node bin/devlyn.js doctor` — prints a FAIL line clearly referencing the missing `/nonexistent/.claude`, exits 1.
32
+ - `node bin/devlyn.js doctor | cat` — piped output contains no ANSI escape codes (`\x1b[`).
33
+ - `node bin/devlyn.js doctor --help` — prints help, exits 0.
34
+ - `node bin/devlyn.js --help` — mentions `doctor` in the list of subcommands.
35
+ - `git diff -- package.json` — no new entries under `dependencies`.
36
+ - `node bin/devlyn.js doctor --verbose` — lists each plugin directory and each skill path.
37
+
38
+ ## Out of Scope
39
+
40
+ - Auto-repair (don't offer to fix detected problems; just report).
41
+ - Checking remote/registry state (npm, GitHub).
42
+ - Any feature requiring a new npm dependency.
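The output rules in requirements 5 to 7 (TTY-gated color, summary line, exit code) can be illustrated independently of the CLI. The sketch below is in Python only to show the logic; the task itself requires the implementation to live in `bin/devlyn.js` using Node.js built-ins. The names `format_line` and `summarize` are hypothetical, and the caller is assumed to pass in whether stdout is a TTY.

```python
# ANSI colors per status label; applied only when the output is a TTY.
COLORS = {"OK": "\x1b[32m", "WARN": "\x1b[33m", "FAIL": "\x1b[31m"}
RESET = "\x1b[0m"


def format_line(status: str, text: str, is_tty: bool) -> str:
    """Prefix a status line with [OK]/[WARN]/[FAIL], colored only on a TTY."""
    label = f"[{status}]"
    if is_tty:
        label = f"{COLORS[status]}{label}{RESET}"
    return f"{label} {text}"


def summarize(statuses: list[str]) -> tuple[str, int]:
    """Build the summary line and the exit code (1 if any FAIL, else 0)."""
    ok, warn, fail = (statuses.count(s) for s in ("OK", "WARN", "FAIL"))
    return f"doctor: {ok} ok, {warn} warn, {fail} fail", 1 if fail else 0


# Piped output (is_tty=False) carries no escape codes.
print(format_line("FAIL", "HOME environment variable is not set", False))
print(summarize(["OK", "OK", "OK", "WARN"]))
```

Keeping the color decision in one function makes the `doctor | cat` acceptance check easy to reason about: with `is_tty` false, no escape sequence is ever produced.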
@@ -82,35 +82,49 @@ Use `--build-gate no-docker` to skip Docker builds for faster iteration during d
82
82
 
83
83
  ## Output Format
84
84
 
85
- Write results to `.devlyn/BUILD-GATE.md`:
85
+ Emit two files plus one state update (schemas: `references/findings-schema.md`, `references/pipeline-state.md`).
86
+
87
+ ### 1. `.devlyn/build_gate.findings.jsonl`
88
+
89
+ One JSON line per failing command's extracted root cause. Do NOT emit findings for PASSING commands. Each line follows the canonical findings schema:
90
+
91
+ ```jsonl
92
+ {"id":"BGATE-0001","rule_id":"build.type-error","level":"error","severity":"HIGH","confidence":0.99,"message":"Property 'config' does not exist on type 'SettingsTabsProps'","file":"dashboard/app/(dashboard)/settings/page.tsx","line":90,"phase":"build_gate","criterion_ref":null,"fix_hint":"Read dashboard/app/(dashboard)/settings/page.tsx:88-93 and dashboard/components/settings/SettingsTabs.tsx (the SettingsTabsProps type definition). Either add 'config' to SettingsTabsProps or remove the prop from the parent.","blocking":true,"status":"open"}
93
+ ```
94
+
95
+ Dedup key is `(rule_id, file, line)` per `findings-schema.md` — no fingerprint bookkeeping (removed in v3.4).
96
+
97
+ Suggested `rule_id` values: `build.type-error`, `build.lint-violation`, `build.dep-missing`, `build.docker-copy-mismatch`, `build.module-not-found`, `build.compile-error`.
98
+
99
+ ### 2. `.devlyn/build_gate.log.md`
100
+
101
+ Human-readable run log. This is where the FULL raw stderr/stdout lives — not in the JSONL `message`. Structure:
86
102
 
87
103
  ```markdown
88
- # Build Gate Results
89
- ## Verdict: [PASS / FAIL]
104
+ # Build Gate Run Log
90
105
  ## Detected Project Types
91
106
  - [type] ([path/])
92
- ## Gate Commands Run
93
- | # | Command | Dir | Exit | Status | Time |
94
- |---|---|---|---|---|---|
95
- | 1 | `npx tsc --noEmit` | dashboard/ | 0 | PASS | 4.2s |
96
- | 2 | `npx next build` | dashboard/ | 1 | FAIL | 9.8s |
97
- | 3 | `cargo check --all-targets` | services/indexer/ | 0 | PASS | 12.1s |
98
107
 
99
- ## Failures
108
+ ## Commands
109
+ | # | Command | Dir | Exit | Time |
110
+ |---|---|---|---|---|
111
+ | 1 | `npx tsc --noEmit` | dashboard/ | 0 | 4.2s |
112
+ | 2 | `npx next build` | dashboard/ | 1 | 9.8s |
100
113
 
101
- ### Command #2: `npx next build` (dashboard/, exit 1)
102
- ```
103
- [full error output; do NOT truncate. Build errors reference files from earlier in output.]
114
+ ## Raw Output (failing commands only)
115
+
116
+ ### #2: `npx next build` (dashboard/, exit 1)
117
+ \`\`\`
118
+ [full raw output — keep for debugging]
119
+ \`\`\`
104
120
  ```
105
121
 
106
- **Root file:line(s)**:
107
- - `dashboard/app/(dashboard)/settings/page.tsx:90` — Type error: Property 'config' does not exist on type 'SettingsTabsProps'
122
+ ### 3. Update `pipeline.state.json`
108
123
 
109
- **Fix guidance**:
110
- Read `dashboard/app/(dashboard)/settings/page.tsx:88-93` and `dashboard/components/settings/SettingsTabs.tsx` (the SettingsTabsProps type definition). Either add `config` to SettingsTabsProps or remove the prop from the parent. Then re-run `npx next build` from `dashboard/` to verify.
111
- ```
124
+ Set `phases.build_gate`:
125
+ - `verdict`: `PASS` if all exit codes == 0, else `FAIL`. If no gates detected, `PASS` with a note in log.md ("No build gate detected — project type unknown; consider adding `--build-gate deploy` if Dockerfiles are present.")
126
+ - `engine: "bash"`, `model: null`
127
+ - `started_at`, `completed_at`, `duration_ms`, `round`
128
+ - `artifacts.findings_file: ".devlyn/build_gate.findings.jsonl"`, `artifacts.log_file: ".devlyn/build_gate.log.md"`
112
129
 
113
- Verdict rules:
114
- - Any exit code != 0 → **FAIL**
115
- - All exit codes == 0 → **PASS**
116
- - No gates detected → **PASS** with note "No build gate detected — project type unknown. Consider adding `--build-gate deploy` if Dockerfiles are present."
130
+ The orchestrator branches on `phases.build_gate.verdict` — it does NOT re-read the findings or log file for routing decisions.
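The verdict rule and the shape of the `phases.build_gate` update described above can be sketched as follows. This is an illustration under stated assumptions, not the skill's actual script: the function names are hypothetical, and the timing fields (`started_at`, `completed_at`, `duration_ms`, `round`) are left out because their formats are defined in `references/pipeline-state.md`, which is not shown here.

```python
import json


def build_gate_verdict(exit_codes: list[int]) -> str:
    """PASS when every command exited 0 (or no gate ran at all), else FAIL."""
    return "PASS" if all(code == 0 for code in exit_codes) else "FAIL"


def build_gate_phase_state(exit_codes: list[int]) -> dict:
    """Assemble the non-timing fields of phases.build_gate."""
    return {
        "verdict": build_gate_verdict(exit_codes),
        "engine": "bash",
        "model": None,
        "artifacts": {
            "findings_file": ".devlyn/build_gate.findings.jsonl",
            "log_file": ".devlyn/build_gate.log.md",
        },
    }


def findings_jsonl(findings: list[dict]) -> str:
    """Serialize findings as JSONL: one compact JSON object per line."""
    return "".join(json.dumps(f, separators=(",", ":")) + "\n" for f in findings)


print(build_gate_phase_state([0, 1])["verdict"])
```

Because `all()` over an empty list is true, the "no gates detected" case falls out as `PASS` without a special branch; the accompanying note would still have to be written to the log file separately.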
@@ -1,204 +1,82 @@
1
1
  # Engine Routing: Intelligent Model Selection
2
2
 
3
- Instructions for routing work to the optimal model (Claude or Codex) per role and phase. Only read this file when `--engine` is set to `auto` or `codex`.
3
+ Routing rules for Claude / Codex / Dual per role and phase. Only read when `--engine` is `auto` or `codex`.
4
4
 
5
- The routing table below is derived from published benchmarks (April 2026) comparing Claude Opus 4.6 and GPT-5.4 across task-relevant dimensions. The principle: each role's work goes to the model that objectively performs better at that task type.
6
-
7
- ---
8
-
9
- ## Benchmark Basis
10
-
11
- | Dimension | Claude Opus 4.6 | GPT-5.4 | Gap | Source |
12
- |-----------|-----------------|---------|-----|--------|
13
- | Long-context retrieval (256k) | 92% | ~64% | Claude +28pp | MRCR v2 |
14
- | Graduate-level reasoning | 87.4% | 83.9% | Claude +3.5pp | GPQA Diamond |
15
- | Hard coding problems | ~46% | 57.7% | Codex +11.7pp | SWE-bench Pro |
16
- | Function-level code gen | 90.4% | 93.1% | Codex +2.7pp | HumanEval |
17
- | Terminal/CLI tasks | 65.4% | 75.1% | Codex +9.7pp | Terminal-Bench 2.0 |
18
- | Real-world issue resolution | ~80% | ~80% | Tied | SWE-bench Verified |
19
- | Security vulnerability detection | — | — | Tied | Semgrep 2025 study |
20
- | Agentic computer use | 72.7% | 75.0% | Codex +2.3pp | OSWorld |
21
- | Ambiguous intent handling | Preferred by 70% devs | — | Claude | Developer surveys |
22
-
23
- ---
24
-
25
- ## Codex Call Defaults
26
-
27
- Every Codex call in this file uses these defaults unless stated otherwise:
28
-
29
- ```
30
- model: "gpt-5.4"
31
- reasoningEffort: "xhigh"
32
- sandbox: varies per role (see table)
33
- workingDirectory: project root
34
- ```
35
-
36
- The `model` field accepts any string — pass `"gpt-5.4"` even if the MCP schema lists older defaults. The Codex CLI resolves it.
37
-
38
- ---
39
-
40
- ## Role Routing Table
41
-
42
- ### team-resolve roles
43
-
44
- | Role | Engine | Sandbox | Rationale |
45
- |------|--------|---------|-----------|
46
- | root-cause-analyst | **Claude** | — | A/B test: Claude traced git history (15 tool calls) finding exact commit + unchecked migration plan. Codex analyzed structure well but lacked git history depth. Tool access > SWE-bench Pro advantage for this role. |
47
- | test-engineer | **Codex** | workspace-write | Test code generation = HumanEval (+2.7pp), needs file write |
48
- | security-auditor | **Dual** | read-only | Semgrep: both find unique vulns; GAN > single model |
49
- | implementation-planner | **Codex** | read-only | Implementation planning = SWE-bench Pro (+11.7pp) |
50
- | product-designer | **Claude** | — | Ambiguous requirements, user intent = Claude strength |
51
- | ui-designer | **Claude** | — | Visual spec, design reasoning = non-coding task |
52
- | ux-designer | **Claude** | — | User flow analysis = ambiguous intent handling |
53
- | accessibility-auditor | **Claude** | — | A/B test: Claude found 12 issues (1 CRITICAL) vs Codex 4. WCAG auditing requires thoroughness and domain knowledge depth, not code generation speed. Claude 3x coverage. |
54
- | product-analyst | **Claude** | — | Requirements clarity, scope judgment = ambiguity handling |
55
- | architecture-reviewer | **Claude** | — | Codebase-wide pattern review = MRCR long-context (+28pp) |
56
- | performance-engineer | **Codex** | read-only | Terminal tasks + algorithm analysis = Terminal-Bench (+9.7pp) |
57
- | api-designer | **Dual** | read-only | A/B test: Claude found 9 issues, Codex found 6, with unique findings on both sides (Claude: --version, exit codes; Codex: YAML folded scalar parsing bug). Dual maximizes coverage for API surface review. |
58
-
59
- ### team-review roles
60
-
61
- | Role | Engine | Sandbox | Rationale |
62
- |------|--------|---------|-----------|
63
- | security-reviewer | **Dual** | read-only | Same as team-resolve security-auditor |
64
- | quality-reviewer | **Dual** | read-only | A/B test: Claude found 14 issues (2 HIGH), Codex found 11 (3 HIGH), only ~6 overlap. Dual yields ~19 unique findings (+36-73% coverage). Both models find HIGH-severity issues the other misses. |
65
- | test-analyst | **Codex** | workspace-write | Test gap analysis + test code suggestions |
66
- | ux-reviewer | **Claude** | — | UX flow assessment = ambiguity handling |
67
- | ui-reviewer | **Claude** | — | Design token consistency = non-coding task |
68
- | accessibility-reviewer | **Claude** | — | Same rationale as team-resolve accessibility-auditor: Claude 3x finding coverage on WCAG audits |
69
- | product-validator | **Claude** | — | Business logic intent = ambiguity handling |
70
- | api-reviewer | **Dual** | read-only | Same rationale as team-resolve api-designer: both models find unique API issues |
71
- | performance-reviewer | **Codex** | read-only | Algorithm complexity = Terminal-Bench (+9.7pp) |
72
-
73
- ### Summary distribution
74
-
75
- | Engine | team-resolve (12) | team-review (9) | Total |
76
- |--------|-------------------|-----------------|-------|
77
- | Claude | 7 | 4 | 11 |
78
- | Codex | 2 | 2 | 4 |
79
- | Dual | 3 | 3 | 6 |
5
+ Codex call defaults: `model: "gpt-5.4"`, `reasoningEffort: "xhigh"`, `sandbox` per role (below), `workingDirectory: <project root>`.
80
6
 
81
7
  ---
82
8
 
83
9
  ## Pipeline Phase Routing (auto-resolve)
84
10
 
85
- | Phase | --engine auto | --engine codex | --engine claude |
11
+ | Phase | `--engine auto` | `--engine codex` | `--engine claude` |
86
12
  |-------|--------------|----------------|-----------------|
87
- | BUILD (implementation) | **Codex** | Codex | Claude |
88
- | BUILD GATE | bash (model-agnostic) | bash | bash |
89
- | BROWSER VALIDATE | Claude (Chrome MCP only) | Claude | Claude |
13
+ | BUILD | **Codex** | Codex | Claude |
14
+ | BUILD GATE | bash | bash | bash |
15
+ | BROWSER VALIDATE | Claude (Chrome MCP) | Claude | Claude |
90
16
  | EVALUATE | **Claude** | Claude | Claude |
91
17
  | FIX LOOP | **Codex** | Codex | Claude |
92
- | SIMPLIFY | Claude | Codex | Claude |
93
- | REVIEW (team) | **Mixed per table** | Codex all | Claude all |
94
- | CHALLENGE | **Claude** | Claude | Claude |
95
- | SECURITY REVIEW | **Dual** | Codex | Claude |
96
- | CLEAN | Claude | Codex | Claude |
18
+ | CRITIC (design sub-pass) | Claude | Claude | Claude |
19
+ | CRITIC (security sub-pass) | **Dual** | Codex | Claude |
97
20
  | DOCS | Claude | Codex | Claude |
98
21
 
99
- Rationale for `--engine auto` choices:
100
- - BUILD/FIX: Codex — SWE-bench Pro 57.7% vs 46%. The biggest model gap is in hard coding tasks.
101
- - EVALUATE/CHALLENGE: Claude — evaluating a full diff requires long-context retrieval (MRCR +28pp) and skeptical reasoning (GPQA +3.5pp). Different model family from builder creates GAN dynamic.
102
- - BROWSER: Claude — Chrome MCP tools are Claude Code session-bound.
103
- - SECURITY: Dual — Semgrep study shows both models find unique vulnerabilities.
22
+ Rationale:
23
+ - BUILD/FIX: Codex — SWE-bench Pro advantage on hard coding.
24
+ - EVALUATE/CRITIC design sub-pass: Claude — long-context retrieval + skeptical reasoning; different model family from builder for GAN dynamic.
25
+ - BROWSER: Claude — Chrome MCP tools session-bound.
26
+ - CRITIC security: Dual on `auto` — Semgrep study shows each model finds unique vulnerabilities; for security, coverage trumps cost.
104
27
 
105
28
  ---
106
29
 
107
30
  ## Pipeline Phase Routing (ideate)
108
31
 
109
- | Phase | --engine auto | --engine codex | --engine claude |
32
+ | Phase | `--engine auto` | `--engine codex` | `--engine claude` |
110
33
  |-------|--------------|----------------|-----------------|
111
- | FRAME | **Claude** | Codex | Claude |
112
- | EXPLORE | **Claude** | Codex | Claude |
113
- | CONVERGE | **Claude** | Codex | Claude |
34
+ | FRAME | Claude | Codex | Claude |
35
+ | EXPLORE | Claude | Codex | Claude |
36
+ | CONVERGE | Claude | Codex | Claude |
114
37
  | CHALLENGE | **Codex** (rubric critic) | Claude (role reversal) | Claude |
115
- | DOCUMENT | **Claude** | Codex | Claude |
38
+ | DOCUMENT | Claude | Codex | Claude |
116
39
 
117
- Rationale:
118
- - FRAME/EXPLORE/CONVERGE: Claude — ambiguous intent handling, multi-perspective reasoning.
119
- - CHALLENGE: When `--engine auto`, Codex runs the rubric pass as critic — automatic on every run. When `--engine codex`, Claude runs the challenge (role reversal — builder and critic are always different models).
120
- - DOCUMENT: Claude — writing quality for spec generation.
40
+ CHALLENGE: when `--engine auto`, Codex runs the rubric as critic (builder and critic always different models).
121
41
 
122
42
  ---
123
43
 
124
44
  ## Pipeline Phase Routing (preflight)
125
45
 
126
- | Phase | --engine auto | --engine codex | --engine claude |
46
+ | Phase | `--engine auto` | `--engine codex` | `--engine claude` |
127
47
  |-------|--------------|----------------|-----------------|
128
- | EXTRACT COMMITMENTS | Claude | Codex | Claude |
129
- | CODE AUDIT | **Codex** | Codex | Claude |
130
- | DOCS AUDIT | **Claude** | **Claude** | Claude |
131
- | BROWSER AUDIT | Claude (Chrome MCP) | Claude | Claude |
48
+ | EXTRACT | Claude | Codex | Claude |
49
+ | AUDIT (code) | **Codex** | Codex | Claude |
50
+ | AUDIT (docs) | **Claude** | Claude | Claude |
51
+ | AUDIT (browser) | Claude | Claude | Claude |
132
52
  | SYNTHESIZE | Claude | Claude | Claude |
133
53
 
134
- DOCS AUDIT is always Claude regardless of `--engine` — writing-quality strength on documentation drift detection (READMEs, VISION.md prose, spec status accuracy) is the deciding factor, not code analysis. BROWSER AUDIT is always Claude because Chrome MCP tools are session-bound to Claude Code.
54
+ Docs auditor is always Claude (writing-quality strength for prose-drift detection). Browser is always Claude (Chrome MCP session-bound).
135
55
 
136
56
  ---
137
57
 
138
- ## How to Spawn a Codex Role
139
-
140
- For roles marked **Codex** in the routing table, call `mcp__codex-cli__codex` instead of spawning a Claude Agent subagent. Package the role's full prompt (from the skill's teammate prompt section) into the Codex call.
141
-
142
- Template:
143
-
144
- ```
145
- mcp__codex-cli__codex({
146
- prompt: "[full role prompt with issue context, file paths, and deliverable format]",
147
- model: "gpt-5.4",
148
- reasoningEffort: "xhigh",
149
- sandbox: "[read-only or workspace-write per table]",
150
- workingDirectory: "[project root]"
151
- })
152
- ```
153
-
154
- **Important**: Codex has no access to team infrastructure (TeamCreate, SendMessage, TaskCreate). For Codex roles:
155
- - Include ALL context inline in the prompt (issue description, file paths from investigation, deliverable format)
156
- - The orchestrator collects Codex's response and routes it where it would have gone via SendMessage
157
- - Codex roles cannot communicate with other teammates directly — the orchestrator relays findings
158
-
159
- For roles marked **Claude**, spawn a normal Agent subagent as before.
160
-
161
- ---
58
+ ## Team-role routing (when `--team` is on OR route == strict in auto-resolve, OR team-review in standalone)
162
59
 
163
- ## How to Spawn a Dual Role
164
-
165
- For roles marked **Dual**, run BOTH models in parallel and merge findings:
166
-
167
- 1. Spawn a Claude Agent subagent with the role's prompt
168
- 2. Call `mcp__codex-cli__codex` with the same role's prompt (sandbox: "read-only")
169
- 3. Wait for both to complete
170
- 4. Merge findings:
171
- - Same finding from both → keep more detailed description, mark "confirmed by both models"
172
- - Claude-only → keep as-is
173
- - Codex-only → prefix with `[codex]`
174
- - Conflicting findings → keep both, note the disagreement
175
- - Take the MORE SEVERE verdict between the two
176
-
177
- ---
178
-
179
- ## How to Spawn a Codex BUILD/FIX Agent
180
-
181
- For BUILD and FIX LOOP phases when engine routes to Codex:
182
-
183
- ```
184
- mcp__codex-cli__codex({
185
- prompt: "[full build/fix prompt with task description, done criteria, and implementation instructions]",
186
- model: "gpt-5.4",
187
- reasoningEffort: "xhigh",
188
- sandbox: "workspace-write",
189
- fullAuto: true,
190
- workingDirectory: "[project root]"
191
- })
192
- ```
60
+ | Role | Engine | Sandbox | Rationale |
61
+ |------|--------|---------|-----------|
62
+ | root-cause-analyst | Claude | — | Git-history traversal + tool access beats SWE-bench Pro for this role |
63
+ | test-engineer | Codex | workspace-write | HumanEval edge, needs file write |
64
+ | security-auditor | Dual | read-only | Semgrep: each finds unique vulns |
65
+ | implementation-planner | Codex | read-only | SWE-bench Pro +11.7pp |
66
+ | architecture-reviewer | Claude | — | Codebase-wide pattern review = MRCR strength |
67
+ | performance-engineer | Codex | read-only | Terminal-Bench edge |
68
+ | api-designer / api-reviewer | Dual | read-only | Both find unique API issues |
69
+ | quality-reviewer | Dual | read-only | Measured ~36–73% coverage gain from dual |
70
+ | ux/ui/accessibility-* | Claude | — | Ambiguity handling + WCAG domain depth |
193
71
 
194
- **After Codex completes**: verify changes were made (`git diff --stat`), then proceed to the next phase as normal. The file-based handoff (`.devlyn/done-criteria.md`, `.devlyn/EVAL-FINDINGS.md`, etc.) works identically — Codex writes the same files Claude would.
72
+ For Codex roles: `mcp__codex-cli__codex` with `model: "gpt-5.4"`, `reasoningEffort: "xhigh"`, sandbox per table. Include full role prompt inline; Codex has no access to TeamCreate/SendMessage/TaskCreate.
195
73
 
196
- **Session management**: For FIX LOOP iterations, use a fresh call each time (no `sessionId` reuse) because sandbox/fullAuto parameters only apply on the first call of a session.
74
+ For Dual roles: run both in parallel, merge findings. Same finding from both → keep more detailed wording, mark "confirmed by both". Codex-only → prefix `[codex]`. Conflicts → keep both.
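The Dual merge rule can be sketched in Python. Several choices here are assumptions rather than anything the routing file specifies: two findings are treated as "the same" when they share `(rule_id, file, line)`, "more detailed" is approximated by the longer `message`, and the confirmation mark is stored in a hypothetical `confirmed_by_both` field. Findings whose keys differ are all kept, which is how conflicting findings survive the merge.

```python
def finding_key(f: dict) -> tuple:
    """Assumed identity of a finding: (rule_id, file, line)."""
    return (f["rule_id"], f["file"], f["line"])


def merge_dual(claude: list[dict], codex: list[dict]) -> list[dict]:
    """Merge findings from both engines following the Dual-role rules."""
    merged = {finding_key(f): dict(f) for f in claude}
    claude_keys = set(merged)
    for f in codex:
        k = finding_key(f)
        if k in claude_keys:
            # Same finding from both: keep the more detailed wording.
            if len(f["message"]) > len(merged[k]["message"]):
                merged[k]["message"] = f["message"]
            merged[k]["confirmed_by_both"] = True
        elif k not in merged:
            # Codex-only finding: prefix the message.
            merged[k] = {**f, "message": "[codex] " + f["message"]}
    return list(merged.values())


claude = [{"rule_id": "security.xss", "file": "a.ts", "line": 3, "message": "XSS"}]
codex = [
    {"rule_id": "security.xss", "file": "a.ts", "line": 3,
     "message": "Unescaped user input rendered as HTML"},
    {"rule_id": "security.ssrf", "file": "b.ts", "line": 9,
     "message": "SSRF via fetch"},
]
for finding in merge_dual(claude, codex):
    print(finding["message"])
```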
197
75
 
198
76
  ---
199
77
 
200
- ## Override Behavior
78
+ ## Override behavior
201
79
 
202
- - `--engine claude` → all roles and phases use Claude (no Codex calls)
203
- - `--engine codex` → all phases use Codex for implementation/analysis, Claude only for orchestration and Chrome MCP
204
- - `--engine auto` (default) → each role and phase routes to the optimal model per this table
80
+ - `--engine claude` → all roles/phases use Claude (no Codex calls).
81
+ - `--engine codex` → all phases use Codex for implementation/analysis, Claude only for orchestration/Chrome MCP.
82
+ - `--engine auto` (default) → each role/phase routes per this table.
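The auto-resolve phase table in the new version of this file reads naturally as a lookup keyed by phase and `--engine` value. A minimal Python sketch, with the cell values transcribed from the added rows above; the `route` helper and the lowercase engine labels are hypothetical conveniences, not part of the skill.

```python
# Column order matches the table header: auto, codex, claude.
ENGINES = ("auto", "codex", "claude")

AUTO_RESOLVE_ROUTING = {
    "BUILD": ("codex", "codex", "claude"),
    "BUILD GATE": ("bash", "bash", "bash"),
    "BROWSER VALIDATE": ("claude", "claude", "claude"),
    "EVALUATE": ("claude", "claude", "claude"),
    "FIX LOOP": ("codex", "codex", "claude"),
    "CRITIC (design sub-pass)": ("claude", "claude", "claude"),
    "CRITIC (security sub-pass)": ("dual", "codex", "claude"),
    "DOCS": ("claude", "codex", "claude"),
}


def route(phase: str, engine: str = "auto") -> str:
    """Return which engine runs a phase for the given --engine setting."""
    return AUTO_RESOLVE_ROUTING[phase][ENGINES.index(engine)]


print(route("BUILD"))
print(route("CRITIC (security sub-pass)"))
print(route("EVALUATE", "codex"))
```

The table makes one property visible at a glance: EVALUATE and the design sub-pass stay on Claude under every setting, so the evaluator differs from the builder whenever Codex builds.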
@@ -0,0 +1,103 @@
1
+ # Findings Schema — `.devlyn/<phase>.findings.jsonl`
2
+
3
+ Structured findings format for phase outputs. One JSON object per line (JSONL). Written by Evaluate, Build Gate, Browser Validate, Critic, and any other phase that emits structured findings.
4
+
5
+ ## Purpose
6
+
7
+ Separate structured findings from prose summaries. The orchestrator and fix-loop need machine-readable data for:
8
+ - Filtering by severity / blocking / status.
9
+ - Cross-round dedup (single primary key — see below).
10
+ - Packing into fix-batch packets for the fix-loop subagent.
11
+ - Final report aggregation.
12
+
13
+ ## Canonical schema (per line)
14
+
15
+ ```json
16
+ {
17
+ "id": "<PHASE>-<4digit>",
18
+ "rule_id": "<category>.<kebab-name>",
19
+ "level": "note" | "warning" | "error",
20
+ "severity": "LOW" | "MEDIUM" | "HIGH" | "CRITICAL",
21
+ "confidence": <float 0.0..1.0>,
22
+ "message": "<one-line human description>",
23
+ "file": "<path relative to repo root>",
24
+ "line": <int, 1-based>,
25
+ "phase": "<phase name, e.g. 'evaluate'>",
26
+ "criterion_ref": "<anchor, e.g. 'spec://requirements/2'>" | null,
27
+ "fix_hint": "<concrete action quoting file:line to change>",
28
+ "blocking": true | false,
29
+ "status": "open" | "resolved" | "suppressed"
30
+ }
31
+ ```
32
+
33
+ ## Field semantics
34
+
35
+ ### Identity
36
+
37
+ - `id` — stable within a single run. Format: `<PHASE>-<4digit>` zero-padded. Examples: `EVAL-0007`, `BUILD-0001`, `CRIT-0003`, `BGATE-0004`.
38
+ - `rule_id` — stable across runs. Format: `<category>.<kebab-case-name>`. Use existing rule_ids before inventing new ones — keeps dedup working. Common categories:
39
+ - `correctness.*` — logic errors, silent failures, null access, wrong API contracts
40
+ - `design.*` — staff-engineer ship/no-ship concerns (non-atomic transactions, hidden assumptions, unidiomatic patterns)
41
+ - `security.*` — OWASP-anchored (sql-injection, xss, hardcoded-credential, missing-input-validation, missing-auth-check, insecure-dependency, permissive-cors, missing-csrf, privilege-escalation, data-exposure, path-traversal, ssrf)
42
+ - `ux.*` — missing error/loading/empty states
43
+ - `architecture.*` — pattern violations, duplication, missing integration
44
+ - `hygiene.*` — unused imports, dead code, unused deps (typically LOW)
45
+ - `types.*` — any-cast escapes, unsafe casts
46
+ - `scope.*` — out-of-scope violations (e.g. `scope.out-of-scope-violation`)
47
+ - `performance.*`, `style.*` — typically LOW
48
+
49
+ ### Severity and level
50
+
51
+ - `level` — SARIF-style coarse bucket: `error` blocks ship, `warning` should fix, `note` informational.
52
+ - `severity` — finer granularity for pipeline logic.
53
+
54
+ | severity | level | blocking default |
55
+ |----------|-------|------------------|
56
+ | `CRITICAL` | `error` | always true |
57
+ | `HIGH` | `error` | usually true |
58
+ | `MEDIUM` | `warning` | true (stricter for `security.*` — see pipeline-routing terminal state) |
59
+ | `LOW` | `note` | false |
60
+
61
+ - `confidence` — reporter's self-rating, `0.0`–`1.0`. Fix-loop prioritizes high-confidence HIGH findings first.
62
+
63
+ ### Message and location
64
+
65
+ - `message` — one line. NAME the issue, not the symptom. Good: `"Token validated on read path but not write path"`. Bad: `"Potential security issue"`.
66
+ - `file` — repo-relative path. No leading `./`, no absolute paths. Forward slashes.
67
+ - `line` — 1-based line number. Multi-line spans → primary (first) line.
68
+
69
+ ### Dedup primary key
70
+
71
+ The primary key is `(rule_id, file, line)`. Two findings with identical coordinates are the same issue. If EVAL runs again after a fix and the finding re-appears at the same spot, it's still unresolved. If the line shifted after a fix, the next EVAL regenerates the finding with the new line — cross-round drift heals naturally via re-evaluation, no hash-normalization bookkeeping.
72
+
73
+ ### Pipeline linkage
74
+
75
+ - `phase` — the phase that produced this finding (redundant with filename but keeps records self-describing when concatenated for fix-batch packets).
76
+ - `criterion_ref` — anchor to the criterion this finding affects, or `null` if cross-cutting.
77
+ - `fix_hint` — concrete action with file:line. Not "improve error handling" — e.g. `"Wrap read+write in db.transaction() at src/auth/session.ts:84-92; re-check order.status === 'pending' inside transaction before updating"`.
78
+
79
+ ### Lifecycle
80
+
81
+ - `blocking` — does this block ship?
82
+ - `status`:
83
+ - `open` — reported, awaiting action
84
+ - `resolved` — a fix-loop round applied a fix AND subsequent evaluation confirms the finding is gone (via `(rule_id, file, line)` absence in the new EVAL run)
85
+ - `suppressed` — intentionally not fixed with justification in the phase's `log.md`; requires user override or Out-of-Scope mapping in the spec
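The field rules above lend themselves to a small validator. The following Python sketch is one possible reading of them, not a validator shipped with the package: `validate_finding` is a hypothetical name, the regular expressions assume an uppercase phase prefix and lowercase kebab-case rule names, and free-text fields such as `message` and `fix_hint` are not checked.

```python
import re

ID_RE = re.compile(r"^[A-Z]+-\d{4}$")                           # e.g. EVAL-0007
RULE_ID_RE = re.compile(r"^[a-z]+\.[a-z0-9]+(-[a-z0-9]+)*$")    # e.g. build.type-error
LEVELS = {"note", "warning", "error"}
SEVERITIES = {"LOW", "MEDIUM", "HIGH", "CRITICAL"}
STATUSES = {"open", "resolved", "suppressed"}


def validate_finding(f: dict) -> list[str]:
    """Return a list of schema violations; an empty list means the finding passes."""
    errors = []
    if not ID_RE.match(str(f.get("id", ""))):
        errors.append("id must look like <PHASE>-<4digit>")
    if not RULE_ID_RE.match(str(f.get("rule_id", ""))):
        errors.append("rule_id must look like <category>.<kebab-name>")
    if f.get("level") not in LEVELS:
        errors.append("level must be note, warning, or error")
    if f.get("severity") not in SEVERITIES:
        errors.append("severity must be LOW, MEDIUM, HIGH, or CRITICAL")
    conf = f.get("confidence")
    if isinstance(conf, bool) or not isinstance(conf, (int, float)) or not 0.0 <= conf <= 1.0:
        errors.append("confidence must be a number in 0.0..1.0")
    path = f.get("file")
    if not isinstance(path, str) or not path or path.startswith(("./", "/")) or "\\" in path:
        errors.append("file must be repo-relative with forward slashes")
    line = f.get("line")
    if isinstance(line, bool) or not isinstance(line, int) or line < 1:
        errors.append("line must be a 1-based integer")
    if not isinstance(f.get("blocking"), bool):
        errors.append("blocking must be true or false")
    if f.get("status") not in STATUSES:
        errors.append("status must be open, resolved, or suppressed")
    return errors


sample = {
    "id": "BGATE-0001", "rule_id": "build.type-error", "level": "error",
    "severity": "HIGH", "confidence": 0.99,
    "file": "dashboard/app/(dashboard)/settings/page.tsx", "line": 90,
    "blocking": True, "status": "open",
}
print(validate_finding(sample))
```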
86
+
87
+ ## Dedup and round handling
88
+
89
+ Each phase writes a fresh `.findings.jsonl` on each execution. Fix rounds re-run EVAL, which produces a new file.
90
+
91
+ **Cross-round reconciliation** (fix-loop rounds only):
92
+ 1. Read prior round's `<phase>.findings.jsonl` (if any).
93
+ 2. For each finding in the new file: if prior open finding has same `(rule_id, file, line)` → reuse the prior `id`, keep `status: open`. Otherwise → new `id`.
94
+ 3. For each prior open finding not matched by the new file → set `status: resolved` in the prior file.
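The three reconciliation steps above can be sketched in Python. The sketch works on in-memory lists rather than files, and it makes one assumption the schema leaves open: a new `id` is allocated by continuing after the highest numeric suffix seen in the prior round. The `reconcile` name and its `prefix` parameter are hypothetical.

```python
def finding_key(f: dict) -> tuple:
    """Dedup primary key from the schema: (rule_id, file, line)."""
    return (f["rule_id"], f["file"], f["line"])


def reconcile(prior: list[dict], new: list[dict], prefix: str = "EVAL"):
    """Return (current_round, updated_prior) after cross-round reconciliation."""
    prior_open = {finding_key(f): f for f in prior if f["status"] == "open"}
    # Assumption: new ids continue after the highest suffix used so far.
    next_num = max((int(f["id"].rsplit("-", 1)[1]) for f in prior), default=0) + 1
    matched = set()
    current = []
    for f in new:
        k = finding_key(f)
        if k in prior_open:
            # Step 2a: same coordinates, so reuse the prior id and keep it open.
            current.append({**f, "id": prior_open[k]["id"], "status": "open"})
            matched.add(k)
        else:
            # Step 2b: unseen coordinates get a fresh id.
            current.append({**f, "id": f"{prefix}-{next_num:04d}"})
            next_num += 1
    # Step 3: prior open findings absent from the new run are resolved.
    updated_prior = [
        {**f, "status": "resolved"}
        if f["status"] == "open" and finding_key(f) not in matched
        else f
        for f in prior
    ]
    return current, updated_prior


prior = [
    {"id": "EVAL-0001", "rule_id": "types.any-cast", "file": "a.ts", "line": 4, "status": "open"},
    {"id": "EVAL-0002", "rule_id": "ux.missing-empty-state", "file": "b.tsx", "line": 9, "status": "open"},
]
new = [
    {"rule_id": "types.any-cast", "file": "a.ts", "line": 4, "status": "open"},
    {"rule_id": "hygiene.dead-code", "file": "c.ts", "line": 1, "status": "open"},
]
current, updated_prior = reconcile(prior, new)
print([f["id"] for f in current], [f["status"] for f in updated_prior])
```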
95
+
96
+ Fix-batch packet: orchestrator concatenates all phases' `.findings.jsonl`, filters `status == "open"`, drops `blocking == false` when round budget is tight, writes to `.devlyn/fix-batch.round-<N>.json`.
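A possible Python sketch of the fix-batch packing step follows. The source names the output path but not the packet's internal layout, so writing a plain JSON array is an assumption, as are the `build_fix_batch` name and the `budget_tight` flag standing in for "when round budget is tight".

```python
import json
from pathlib import Path


def build_fix_batch(devlyn_dir: Path, round_no: int, budget_tight: bool) -> Path:
    """Concatenate all *.findings.jsonl files, filter, and write the packet."""
    findings = []
    for path in sorted(devlyn_dir.glob("*.findings.jsonl")):
        for raw in path.read_text(encoding="utf-8").splitlines():
            if raw.strip():  # tolerate blank lines between records
                findings.append(json.loads(raw))
    batch = [
        f for f in findings
        if f["status"] == "open" and (f["blocking"] or not budget_tight)
    ]
    out = devlyn_dir / f"fix-batch.round-{round_no}.json"
    # Assumed layout: a JSON array of finding objects.
    out.write_text(json.dumps(batch, indent=2), encoding="utf-8")
    return out
```

Called as `build_fix_batch(Path(".devlyn"), round_no=2, budget_tight=True)`, this would keep only open, blocking findings for the second fix round.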
97
+
98
+ ## Non-goals
99
+
100
+ - Full SARIF 2.1.0 export (can derive later by mapping our fields to SARIF).
101
+ - Cross-project aggregation.
102
+ - Rule metadata catalogs.
103
+ - Code fix patches — `fix_hint` is prose; fix-loop subagent writes the patch.
@@ -0,0 +1,54 @@
1
+ # PHASE 1 — BUILD (agent prompt body)
2
+
3
+ Spawned when PHASE 1 runs. Engine: BUILD row of `engine-routing.md`.
4
+
5
+ Orchestrator passes the task description as the final section, and sets the team flag (`team: true|false`) per orchestrator rule: team only when `--team` flag OR `state.route.selected == "strict"`.
6
+
7
+ ---
8
+
9
+ <spec_integrity_check>
10
+ Before reading anything else:
11
+ - If `pipeline.state.json:source.type == "spec"`, compute `sha256(state.source.spec_path)`. If it differs from `state.source.spec_sha256`, write `phases.build.verdict: "BLOCKED"` with reason `"spec_sha256 mismatch"` and return. The spec changed mid-run — invariant violation per `references/pipeline-state.md`.
12
+ - If `source.type == "generated"` and `state.source.criteria_sha256` exists, verify the same way.
13
+ - If the hash field is absent (first phase to populate the file), skip this check this one time only.
14
+ </spec_integrity_check>
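The integrity check described in the block above amounts to re-hashing the source file and comparing against the stored digest. A minimal Python sketch, assuming the digest is stored as a lowercase hex string and that `state` is the parsed `pipeline.state.json`; `check_spec_integrity` is a hypothetical helper, and the `criteria_sha256 mismatch` reason string for generated criteria is an extrapolation from the spec case.

```python
import hashlib
from pathlib import Path
from typing import Optional


def sha256_file(path: str) -> str:
    """Hex SHA-256 digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def check_spec_integrity(state: dict) -> Optional[str]:
    """Return a BLOCKED reason if the source changed mid-run, else None."""
    source = state["source"]
    if source["type"] == "spec":
        path_key, hash_key = "spec_path", "spec_sha256"
    elif source["type"] == "generated":
        path_key, hash_key = "criteria_path", "criteria_sha256"
    else:
        return None
    expected = source.get(hash_key)
    if expected is None:
        # Hash not populated yet: the check is skipped this one time.
        return None
    if sha256_file(source[path_key]) != expected:
        return f"{hash_key} mismatch"
    return None
```

A caller would write `phases.build.verdict: "BLOCKED"` with the returned reason and stop whenever the result is not `None`.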
15
+
16
+ <goal>
17
+ Implement code changes that satisfy every pending criterion in `pipeline.state.json:criteria[]` without violating anything declared Out of Scope or Constraints. Make the source's intent run in the code.
18
+ </goal>
19
+
20
+ <input>
21
+ - Canonical criteria: `pipeline.state.json:source`. Follow `source.spec_path` (read directly, do not copy) or `source.criteria_path` (`.devlyn/criteria.generated.md` — may not yet exist; see OUTPUT CONTRACT).
22
+ - Codebase at `pipeline.state.json:base_ref.sha`.
23
+ - Task statement appended at the bottom of this prompt.
24
+ </input>
25
+
26
+ <output_contract>
27
+ - **Code changes** implementing every `pending` criterion. Verify with `git diff`.
28
+ - **state.json criteria updates**: for each criterion satisfied, set `status: "implemented"` and append an `evidence` record `{"file": "...", "line": N, "note": "brief"}`.
29
+ - **If `source.type == "generated"` and `.devlyn/criteria.generated.md` does not exist**: create it once with `## Requirements` (each `- [ ]` testable in under 30 seconds, specific, scoped), `## Out of Scope`, `## Verification`. Populate `state.criteria[]` with `{"id": "C<N>", "ref": "criteria.generated://requirements/<N-1>", "status": "pending", "evidence": [], "failed_by_finding_ids": []}`. Classify task complexity into `low` / `medium` / `high` and write to `phases.build.complexity`. Compute `criteria_sha256 = sha256(criteria.generated.md)` and store in `state.source.criteria_sha256`.
30
+ - **No pending criterion remains**: every `criteria[]` entry must transition to `status: "implemented"` with an `evidence` record before you exit. If a criterion genuinely cannot be satisfied (missing external dep, blocking ambiguity), set `phases.build.verdict: "BLOCKED"` and report. Never exit with a criterion still `pending`. BUILD must not mark any criterion `failed` — that's EVAL-only. Legal transitions: `pending → implemented`, or halt via `verdict: "BLOCKED"`.
31
+ - **Tests** added or updated for changed behavior. Run the full test suite before stopping.
32
+ - **Team** (only if orchestrator set `team: true`): use `TeamCreate` per the role table below; collect findings; shut down the team before exiting. Otherwise implement directly — the default.
33
+ </output_contract>
34
+
35
+ <quality_bar>
36
+ - Criteria and Out-of-Scope are the contract — never weaken, reword, or delete them.
37
+ - Read only files the source implicates (Architecture Notes + Dependencies + touched patterns), not the whole codebase.
38
+ - Bugs: failing test first, then fix. Features: follow existing patterns, then write tests. Refactors: tests pass before and after.
39
+ - Fix root causes only — no `any`, `@ts-ignore`, silent `catch`, or hardcoded values.
40
+ </quality_bar>
41
+
42
+ <principle>
43
+ The source is the contract. Your output is evidence that the contract now runs in code.
44
+ </principle>
45
+
46
+ <team_role_selection>
47
+ When `team: true`, select teammates by task type (per-role engine routing per `references/engine-routing.md`):
48
+ - Bug fix: root-cause-analyst + test-engineer (+ security-auditor / performance-engineer as needed)
49
+ - Feature: implementation-planner + test-engineer (+ ux-designer / architecture-reviewer / api-designer as needed)
50
+ - Refactor: architecture-reviewer + test-engineer
51
+ - UI/UX: product-designer + ux-designer + ui-designer (+ accessibility-auditor as needed)
52
+ </team_role_selection>
53
+
54
+ The task is: [orchestrator pastes the task description here]
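
As a rough illustration of the `<spec_integrity_check>` block in this prompt, the hash comparison could look like the sketch below. It assumes the state has already been loaded into a dict, that the hash is taken over the file's raw bytes, and that a generated-criteria mismatch is reported as `"criteria_sha256 mismatch"` (the prompt only spells out the reason string for the spec case). The helper names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    # Assumption: the hash is computed over the file's raw bytes.
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def check_source_integrity(state):
    """Return a BLOCKED record on hash mismatch, or None when the check passes.

    `state` is the parsed content of pipeline.state.json.
    """
    source = state["source"]
    if source["type"] == "spec":
        path_key, hash_key = "spec_path", "spec_sha256"
    else:
        path_key, hash_key = "criteria_path", "criteria_sha256"

    expected = source.get(hash_key)
    if expected is None:
        # Hash not recorded yet: the check is skipped this one time.
        return None

    if sha256_of(source[path_key]) != expected:
        return {"verdict": "BLOCKED", "reason": f"{hash_key} mismatch"}
    return None
```

A caller would write the returned record into `phases.build` and return immediately; on `None` it proceeds with the phase.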
@@ -0,0 +1,45 @@
1
+ # PHASE 2 — EVALUATE (agent prompt body)
2
+
3
+ Spawned when PHASE 2 runs. Engine: Claude (cross-model critic when builder was Codex).
4
+
5
+ ---
6
+
7
+ <spec_integrity_check>
8
+ Before reading anything: verify source hash per `references/phases/phase-1-build.md#spec_integrity_check`. Apply the same rule (spec_sha256 for spec runs, criteria_sha256 for generated).
9
+ </spec_integrity_check>
10
+
11
+ <goal>
12
+ Independently verify whether every criterion in `pipeline.state.json:criteria[]` is satisfied by the current code. Surface every defect with file:line evidence. You are a skeptic, not a cheerleader — praise is not your job.
13
+ </goal>
14
+
15
+ <input>
16
+ - Canonical rubric: `pipeline.state.json:source`. Follow `source.spec_path` or `source.criteria_path` and read Requirements + Out of Scope + Verification directly.
17
+ - Change surface: `git diff <pipeline.state.json:base_ref.sha>` + `git status`. Read every changed/new file in full — not just the hunks.
18
+ - Prior browser findings at `.devlyn/browser_validate.findings.jsonl` (if that phase ran).
19
+ </input>
20
+
21
+ <output_contract>
22
+ - **`.devlyn/evaluate.findings.jsonl`** — one JSON per line (schema: `references/findings-schema.md`). Per finding:
23
+ `id` (`EVAL-<4digit>`), `rule_id` (stable kebab-case, e.g. `correctness.silent-error`, `ux.missing-error-state`, `architecture.duplication`, `security.missing-validation`, `types.any-cast-escape`, `style.let-vs-const`, `scope.out-of-scope-violation`, `hygiene.unused-import`), `level` (`error`/`warning`/`note` — map from severity: CRITICAL/HIGH → error, MEDIUM → warning, LOW → note), `severity` (`CRITICAL`/`HIGH`/`MEDIUM`/`LOW`), `confidence` (0.0–1.0), `message` (one line naming the issue, not symptoms), `file`, `line` (1-based primary location), `phase: "evaluate"`, `criterion_ref` (exact `ref` string from a `criteria[]` entry — e.g. `"spec://requirements/2"` — when the finding fails a specific criterion; or a section anchor from `state.source.criteria_anchors` such as `"spec://constraints"` / `"spec://out-of-scope"` when cross-cutting; `null` when scope-broader than any anchor), `fix_hint` (concrete action quoting file:line), `blocking` (CRITICAL/HIGH/MEDIUM default true, LOW false), `status: "open"`. Dedup key is `(rule_id, file, line)` — no fingerprint bookkeeping.
24
+ - **`.devlyn/evaluate.log.md`** — 3–5 line human summary: verdict + criteria pass/fail counts + top 3 risks + cross-cutting patterns if any. Prose here; structured data in the JSONL.
25
+ - **state.json criteria updates** — every `criteria[]` entry leaves Evaluate in a terminal state. Incoming status from BUILD is normally `implemented`; transition each to `status: "verified"` (append `evidence` record confirming satisfaction) OR `status: "failed"` (set `failed_by_finding_ids` to the IDs you emitted). If a criterion is still `pending` (BUILD did not satisfy it), mark it `failed` with a finding whose `rule_id` is `correctness.criterion-unimplemented`. No `criteria[]` entry may remain `pending` or `implemented` after Evaluate.
26
+ - **state.json phases.evaluate** — `verdict` per taxonomy, `engine: "claude"`, `model`, timing, `round`, `artifacts.{findings_file, log_file}`.
27
+
28
+ Verdict taxonomy: `BLOCKED` (any CRITICAL) / `NEEDS_WORK` (HIGH or MEDIUM present) / `PASS_WITH_ISSUES` (LOW only) / `PASS` (clean).
29
+ </output_contract>
30
+
31
+ <quality_bar>
32
+ - Every finding points at a file:line you have opened and read. No real anchor = speculation; exclude it.
33
+ - Every failed criterion maps to ≥1 finding `id`.
34
+ - **Coverage over comfort**: report uncertain and LOW findings too; downstream filters rank them. Missing a real defect ships broken code — the asymmetry is decisive.
35
+ - Audit each changed file for: correctness (logic errors, silent failures, null access, wrong API contracts), architecture (pattern violations, duplication, missing integration), security (if auth/secrets/user-data touched: injection, hardcoded credentials, missing validation), frontend (if UI changed: missing error/loading/empty states, React anti-patterns, server/client boundaries), test coverage (untested modules, missing edge cases), hygiene (unused imports, dead code, unused deps — `hygiene.*` at LOW), defensive programming (recursion depth/cycle guards, boundary conditions, missing null checks — severity per blast radius: `correctness.*` when it can crash or corrupt, `hygiene.*` when cosmetic).
36
+ - Calibration: a catch block that logs but doesn't surface the error to the user → HIGH, not MEDIUM (logging ≠ error handling). A `let` that could be `const` → LOW (linters catch it). "Error handling is generally quite good" is not a finding — count instances, name files.
37
+ - "Pre-existing" findings still count if they relate to the criteria. Working software, not blame attribution.
38
+ - **Out-of-Scope violations are findings**: if BUILD added behavior the source's `## Out of Scope` excludes, emit `rule_id: "scope.out-of-scope-violation"`, `severity: HIGH`, `criterion_ref: "spec://out-of-scope"` (or `"criteria.generated://out-of-scope"`), `fix_hint` naming what to remove.
39
+ </quality_bar>
40
+
41
+ <principle>
42
+ Missing a real defect is worse than reporting an extra one. Asymmetric cost demands bias toward reporting.
43
+ </principle>
44
+
45
+ Do not delete `pipeline.state.json` or the JSONL/log files.
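
The severity mappings and verdict taxonomy in the output contract above reduce to a small lookup, sketched here for illustration. It is not the package's `terminal_verdict.py`; the names are hypothetical, and counting only findings with `status == "open"` is an assumption the prompt does not state.

```python
# Severity -> level mapping as stated in the output contract.
LEVEL_BY_SEVERITY = {
    "CRITICAL": "error",
    "HIGH": "error",
    "MEDIUM": "warning",
    "LOW": "note",
}


def default_blocking(severity):
    # CRITICAL/HIGH/MEDIUM default to blocking; LOW does not.
    return severity != "LOW"


def verdict_for(findings):
    """Derive the Evaluate verdict from the severities of open findings."""
    # Assumption: only findings still marked open count toward the verdict.
    severities = {
        f["severity"] for f in findings if f.get("status", "open") == "open"
    }
    if "CRITICAL" in severities:
        return "BLOCKED"
    if severities & {"HIGH", "MEDIUM"}:
        return "NEEDS_WORK"
    if "LOW" in severities:
        return "PASS_WITH_ISSUES"
    return "PASS"
```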