devlyn-cli 1.14.0 → 1.15.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/CLAUDE.md +28 -149
- package/config/skills/devlyn:auto-resolve/SKILL.md +165 -515
- package/config/skills/devlyn:auto-resolve/evals/evals.json +21 -0
- package/config/skills/devlyn:auto-resolve/evals/task-doctor-subcommand.md +42 -0
- package/config/skills/devlyn:auto-resolve/references/build-gate.md +36 -22
- package/config/skills/devlyn:auto-resolve/references/engine-routing.md +43 -165
- package/config/skills/devlyn:auto-resolve/references/findings-schema.md +103 -0
- package/config/skills/devlyn:auto-resolve/references/phases/phase-1-build.md +54 -0
- package/config/skills/devlyn:auto-resolve/references/phases/phase-2-evaluate.md +45 -0
- package/config/skills/devlyn:auto-resolve/references/phases/phase-3-critic.md +84 -0
- package/config/skills/devlyn:auto-resolve/references/pipeline-routing.md +114 -0
- package/config/skills/devlyn:auto-resolve/references/pipeline-state.md +201 -0
- package/config/skills/devlyn:auto-resolve/scripts/archive_run.py +104 -0
- package/config/skills/devlyn:auto-resolve/scripts/terminal_verdict.py +96 -0
- package/config/skills/devlyn:ideate/SKILL.md +12 -64
- package/config/skills/devlyn:ideate/references/codex-critic-template.md +42 -0
- package/config/skills/devlyn:preflight/SKILL.md +25 -40
- package/config/skills/devlyn:preflight/references/auditors/code-auditor.md +6 -10
- package/config/skills/devlyn:reap/SKILL.md +104 -0
- package/config/skills/devlyn:reap/scripts/reap.sh +129 -0
- package/config/skills/devlyn:reap/scripts/scan.sh +116 -0
- package/package.json +5 -1
|
@@ -0,0 +1,21 @@
|
|
|
1
|
+
{
|
|
2
|
+
"skill_name": "devlyn:auto-resolve",
|
|
3
|
+
"note": "Real regression fixtures from measured A/B runs. Each eval is a task we have actually driven through the pipeline and scored blind against a bare-prompt baseline. Expand only with real runs, not synthesized prompts.",
|
|
4
|
+
"evals": [
|
|
5
|
+
{
|
|
6
|
+
"id": 1,
|
|
7
|
+
"name": "doctor-subcommand",
|
|
8
|
+
"route": "standard",
|
|
9
|
+
"complexity": "medium",
|
|
10
|
+
"prompt": "Implement per spec at evals/task-doctor-subcommand.md",
|
|
11
|
+
"spec": "evals/task-doctor-subcommand.md",
|
|
12
|
+
"expected_output": "Single-file CLI change adding a `doctor` subcommand to bin/devlyn.js with node-version, HOME/.claude, plugins, and skills checks, TTY-gated color, exit-code semantics, help integration. Zero new npm dependencies. No silent error catches.",
|
|
13
|
+
"baseline_source": "bare prompt (no skill pipeline) — see benchmark history commit 2d9a5f0..51a8aeb",
|
|
14
|
+
"historical_scores": {
|
|
15
|
+
"v3.4": { "skill": 57, "bare": 45, "margin": 12, "judge": "codex gpt-5.3-codex blind" },
|
|
16
|
+
"v3.4.1": { "skill": 59, "bare": 43, "margin": 16, "judge": "codex gpt-5.3-codex blind" }
|
|
17
|
+
},
|
|
18
|
+
"files": []
|
|
19
|
+
}
|
|
20
|
+
]
|
|
21
|
+
}
|
|
@@ -0,0 +1,42 @@
|
|
|
1
|
+
# Task: `devlyn doctor` subcommand
|
|
2
|
+
|
|
3
|
+
Add a new `doctor` subcommand to `bin/devlyn.js`. When the user runs `npx devlyn-cli doctor` (or `node bin/devlyn.js doctor`), it diagnoses the local devlyn-cli installation and prints a status report.
|
|
4
|
+
|
|
5
|
+
## Requirements
|
|
6
|
+
|
|
7
|
+
1. **Node version check** — `process.version >= v18.0.0`. Emit a status line. If below, mark FAIL.
|
|
8
|
+
2. **`$HOME/.claude/` check** — exists as directory AND is writable. Missing → FAIL. Exists but not writable (EACCES) → FAIL with a distinct "permission" message.
|
|
9
|
+
3. **Installed plugins scan** — read subdirectories of `$HOME/.claude/plugins/cache/` and print a summary line with the count. `--verbose` lists names.
|
|
10
|
+
4. **Installed skills scan** — count files matching `$HOME/.claude/skills/**/SKILL.md`. Print count; `--verbose` lists relative paths.
|
|
11
|
+
5. **Colored output** — each line prefixed with `[OK]` (green), `[WARN]` (yellow), `[FAIL]` (red) using ANSI escape codes, **only when `process.stdout.isTTY` is true**. Non-TTY → no color codes.
|
|
12
|
+
6. **Summary line** — e.g., `doctor: 3 ok, 1 warn, 0 fail`.
|
|
13
|
+
7. **Exit code** — `0` if zero fails, `1` if any fail.
|
|
14
|
+
8. **`--verbose` flag** — expands details for plugins/skills scans.
|
|
15
|
+
9. **Help integration** —
|
|
16
|
+
- `node bin/devlyn.js doctor --help` prints a short help block and exits 0.
|
|
17
|
+
- `node bin/devlyn.js --help` / `node bin/devlyn.js help` lists `doctor` as an available subcommand.
|
|
18
|
+
|
|
19
|
+
## Constraints
|
|
20
|
+
|
|
21
|
+
- **Zero new dependencies.** Use only Node.js built-ins (`fs`, `path`, `os`, `process`).
|
|
22
|
+
- **No silent error catches.** Per project CLAUDE.md error-handling philosophy, do not wrap operations in `try { … } catch { return fallbackValue }`. All errors visible to the user with actionable messages.
|
|
23
|
+
- **HOME guard.** If `process.env.HOME` is undefined or empty, emit a clear FAIL line ("HOME environment variable is not set") and exit 1. Do not attempt to read arbitrary paths.
|
|
24
|
+
- **EACCES handling.** If `readdirSync` fails with EACCES, emit a permission-specific message quoting the offending path. Do not silently return an empty list.
|
|
25
|
+
|
|
26
|
+
## Acceptance verification
|
|
27
|
+
|
|
28
|
+
Run each of these and they must behave as described:
|
|
29
|
+
|
|
30
|
+
- `node bin/devlyn.js doctor` — produces the status report, exits 0 on a clean machine.
|
|
31
|
+
- `HOME=/nonexistent node bin/devlyn.js doctor` — prints a FAIL line clearly referencing the missing `/nonexistent/.claude`, exits 1.
|
|
32
|
+
- `node bin/devlyn.js doctor | cat` — piped output contains no ANSI escape codes (`\x1b[`).
|
|
33
|
+
- `node bin/devlyn.js doctor --help` — prints help, exits 0.
|
|
34
|
+
- `node bin/devlyn.js --help` — mentions `doctor` in the list of subcommands.
|
|
35
|
+
- `git diff -- package.json` — no new entries under `dependencies`.
|
|
36
|
+
- `node bin/devlyn.js doctor --verbose` — lists each plugin directory and each skill path.
|
|
37
|
+
|
|
38
|
+
## Out of Scope
|
|
39
|
+
|
|
40
|
+
- Auto-repair (don't offer to fix detected problems; just report).
|
|
41
|
+
- Checking remote/registry state (npm, GitHub).
|
|
42
|
+
- Any feature requiring a new npm dependency.
|
|
@@ -82,35 +82,49 @@ Use `--build-gate no-docker` to skip Docker builds for faster iteration during d
|
|
|
82
82
|
|
|
83
83
|
## Output Format
|
|
84
84
|
|
|
85
|
-
|
|
85
|
+
Emit two files plus one state update (schemas: `references/findings-schema.md`, `references/pipeline-state.md`).
|
|
86
|
+
|
|
87
|
+
### 1. `.devlyn/build_gate.findings.jsonl`
|
|
88
|
+
|
|
89
|
+
One JSON line per failing command's extracted root cause. Do NOT emit findings for PASSING commands. Each line follows the canonical findings schema:
|
|
90
|
+
|
|
91
|
+
```jsonl
|
|
92
|
+
{"id":"BGATE-0001","rule_id":"build.type-error","level":"error","severity":"HIGH","confidence":0.99,"message":"Property 'config' does not exist on type 'SettingsTabsProps'","file":"dashboard/app/(dashboard)/settings/page.tsx","line":90,"phase":"build_gate","criterion_ref":null,"fix_hint":"Read dashboard/app/(dashboard)/settings/page.tsx:88-93 and dashboard/components/settings/SettingsTabs.tsx (the SettingsTabsProps type definition). Either add 'config' to SettingsTabsProps or remove the prop from the parent.","blocking":true,"status":"open"}
|
|
93
|
+
```
|
|
94
|
+
|
|
95
|
+
Dedup key is `(rule_id, file, line)` per `findings-schema.md` — no fingerprint bookkeeping (removed in v3.4).
|
|
96
|
+
|
|
97
|
+
Suggested `rule_id` values: `build.type-error`, `build.lint-violation`, `build.dep-missing`, `build.docker-copy-mismatch`, `build.module-not-found`, `build.compile-error`.
|
|
98
|
+
|
|
99
|
+
### 2. `.devlyn/build_gate.log.md`
|
|
100
|
+
|
|
101
|
+
Human-readable run log. This is where the FULL raw stderr/stdout lives — not in the JSONL `message`. Structure:
|
|
86
102
|
|
|
87
103
|
```markdown
|
|
88
|
-
# Build Gate
|
|
89
|
-
## Verdict: [PASS / FAIL]
|
|
104
|
+
# Build Gate Run Log
|
|
90
105
|
## Detected Project Types
|
|
91
106
|
- [type] ([path/])
|
|
92
|
-
## Gate Commands Run
|
|
93
|
-
| # | Command | Dir | Exit | Status | Time |
|
|
94
|
-
|---|---|---|---|---|---|
|
|
95
|
-
| 1 | `npx tsc --noEmit` | dashboard/ | 0 | PASS | 4.2s |
|
|
96
|
-
| 2 | `npx next build` | dashboard/ | 1 | FAIL | 9.8s |
|
|
97
|
-
| 3 | `cargo check --all-targets` | services/indexer/ | 0 | PASS | 12.1s |
|
|
98
107
|
|
|
99
|
-
##
|
|
108
|
+
## Commands
|
|
109
|
+
| # | Command | Dir | Exit | Time |
|
|
110
|
+
|---|---|---|---|---|
|
|
111
|
+
| 1 | `npx tsc --noEmit` | dashboard/ | 0 | 4.2s |
|
|
112
|
+
| 2 | `npx next build` | dashboard/ | 1 | 9.8s |
|
|
100
113
|
|
|
101
|
-
|
|
102
|
-
|
|
103
|
-
|
|
114
|
+
## Raw Output — failing commands only
|
|
115
|
+
|
|
116
|
+
### #2: `npx next build` (dashboard/, exit 1)
|
|
117
|
+
\`\`\`
|
|
118
|
+
[full raw output — keep for debugging]
|
|
119
|
+
\`\`\`
|
|
104
120
|
```
|
|
105
121
|
|
|
106
|
-
|
|
107
|
-
- `dashboard/app/(dashboard)/settings/page.tsx:90` — Type error: Property 'config' does not exist on type 'SettingsTabsProps'
|
|
122
|
+
### 3. Update `pipeline.state.json`
|
|
108
123
|
|
|
109
|
-
|
|
110
|
-
|
|
111
|
-
|
|
124
|
+
Set `phases.build_gate`:
|
|
125
|
+
- `verdict`: `PASS` if all exit codes == 0, else `FAIL`. If no gates detected, `PASS` with a note in log.md ("No build gate detected — project type unknown; consider adding `--build-gate deploy` if Dockerfiles are present.")
|
|
126
|
+
- `engine: "bash"`, `model: null`
|
|
127
|
+
- `started_at`, `completed_at`, `duration_ms`, `round`
|
|
128
|
+
- `artifacts.findings_file: ".devlyn/build_gate.findings.jsonl"`, `artifacts.log_file: ".devlyn/build_gate.log.md"`
|
|
112
129
|
|
|
113
|
-
|
|
114
|
-
- Any exit code != 0 → **FAIL**
|
|
115
|
-
- All exit codes == 0 → **PASS**
|
|
116
|
-
- No gates detected → **PASS** with note "No build gate detected — project type unknown. Consider adding `--build-gate deploy` if Dockerfiles are present."
|
|
130
|
+
The orchestrator branches on `phases.build_gate.verdict` — it does NOT re-read the findings or log file for routing decisions.
|
|
@@ -1,204 +1,82 @@
|
|
|
1
1
|
# Engine Routing: Intelligent Model Selection
|
|
2
2
|
|
|
3
|
-
|
|
3
|
+
Routing rules for Claude / Codex / Dual per role and phase. Only read when `--engine` is `auto` or `codex`.
|
|
4
4
|
|
|
5
|
-
|
|
6
|
-
|
|
7
|
-
---
|
|
8
|
-
|
|
9
|
-
## Benchmark Basis
|
|
10
|
-
|
|
11
|
-
| Dimension | Claude Opus 4.6 | GPT-5.4 | Gap | Source |
|
|
12
|
-
|-----------|-----------------|---------|-----|--------|
|
|
13
|
-
| Long-context retrieval (256k) | 92% | ~64% | Claude +28pp | MRCR v2 |
|
|
14
|
-
| Graduate-level reasoning | 87.4% | 83.9% | Claude +3.5pp | GPQA Diamond |
|
|
15
|
-
| Hard coding problems | ~46% | 57.7% | Codex +11.7pp | SWE-bench Pro |
|
|
16
|
-
| Function-level code gen | 90.4% | 93.1% | Codex +2.7pp | HumanEval |
|
|
17
|
-
| Terminal/CLI tasks | 65.4% | 75.1% | Codex +9.7pp | Terminal-Bench 2.0 |
|
|
18
|
-
| Real-world issue resolution | ~80% | ~80% | Tied | SWE-bench Verified |
|
|
19
|
-
| Security vulnerability detection | — | — | Tied | Semgrep 2025 study |
|
|
20
|
-
| Agentic computer use | 72.7% | 75.0% | Codex +2.3pp | OSWorld |
|
|
21
|
-
| Ambiguous intent handling | Preferred by 70% devs | — | Claude | Developer surveys |
|
|
22
|
-
|
|
23
|
-
---
|
|
24
|
-
|
|
25
|
-
## Codex Call Defaults
|
|
26
|
-
|
|
27
|
-
Every Codex call in this file uses these defaults unless stated otherwise:
|
|
28
|
-
|
|
29
|
-
```
|
|
30
|
-
model: "gpt-5.4"
|
|
31
|
-
reasoningEffort: "xhigh"
|
|
32
|
-
sandbox: varies per role (see table)
|
|
33
|
-
workingDirectory: project root
|
|
34
|
-
```
|
|
35
|
-
|
|
36
|
-
The `model` field accepts any string — pass `"gpt-5.4"` even if the MCP schema lists older defaults. The Codex CLI resolves it.
|
|
37
|
-
|
|
38
|
-
---
|
|
39
|
-
|
|
40
|
-
## Role Routing Table
|
|
41
|
-
|
|
42
|
-
### team-resolve roles
|
|
43
|
-
|
|
44
|
-
| Role | Engine | Sandbox | Rationale |
|
|
45
|
-
|------|--------|---------|-----------|
|
|
46
|
-
| root-cause-analyst | **Claude** | — | A/B test: Claude traced git history (15 tool calls) finding exact commit + unchecked migration plan. Codex analyzed structure well but lacked git history depth. Tool access > SWE-bench Pro advantage for this role. |
|
|
47
|
-
| test-engineer | **Codex** | workspace-write | Test code generation = HumanEval (+2.7pp), needs file write |
|
|
48
|
-
| security-auditor | **Dual** | read-only | Semgrep: both find unique vulns; GAN > single model |
|
|
49
|
-
| implementation-planner | **Codex** | read-only | Implementation planning = SWE-bench Pro (+11.7pp) |
|
|
50
|
-
| product-designer | **Claude** | — | Ambiguous requirements, user intent = Claude strength |
|
|
51
|
-
| ui-designer | **Claude** | — | Visual spec, design reasoning = non-coding task |
|
|
52
|
-
| ux-designer | **Claude** | — | User flow analysis = ambiguous intent handling |
|
|
53
|
-
| accessibility-auditor | **Claude** | — | A/B test: Claude found 12 issues (1 CRITICAL) vs Codex 4. WCAG auditing requires thoroughness and domain knowledge depth, not code generation speed. Claude 3x coverage. |
|
|
54
|
-
| product-analyst | **Claude** | — | Requirements clarity, scope judgment = ambiguity handling |
|
|
55
|
-
| architecture-reviewer | **Claude** | — | Codebase-wide pattern review = MRCR long-context (+28pp) |
|
|
56
|
-
| performance-engineer | **Codex** | read-only | Terminal tasks + algorithm analysis = Terminal-Bench (+9.7pp) |
|
|
57
|
-
| api-designer | **Dual** | read-only | A/B test: Claude found 9 issues, Codex found 6, with unique findings on both sides (Claude: --version, exit codes; Codex: YAML folded scalar parsing bug). Dual maximizes coverage for API surface review. |
|
|
58
|
-
|
|
59
|
-
### team-review roles
|
|
60
|
-
|
|
61
|
-
| Role | Engine | Sandbox | Rationale |
|
|
62
|
-
|------|--------|---------|-----------|
|
|
63
|
-
| security-reviewer | **Dual** | read-only | Same as team-resolve security-auditor |
|
|
64
|
-
| quality-reviewer | **Dual** | read-only | A/B test: Claude found 14 issues (2 HIGH), Codex found 11 (3 HIGH), only ~6 overlap. Dual yields ~19 unique findings (+36-73% coverage). Both models find HIGH-severity issues the other misses. |
|
|
65
|
-
| test-analyst | **Codex** | workspace-write | Test gap analysis + test code suggestions |
|
|
66
|
-
| ux-reviewer | **Claude** | — | UX flow assessment = ambiguity handling |
|
|
67
|
-
| ui-reviewer | **Claude** | — | Design token consistency = non-coding task |
|
|
68
|
-
| accessibility-reviewer | **Claude** | — | Same rationale as team-resolve accessibility-auditor: Claude 3x finding coverage on WCAG audits |
|
|
69
|
-
| product-validator | **Claude** | — | Business logic intent = ambiguity handling |
|
|
70
|
-
| api-reviewer | **Dual** | read-only | Same rationale as team-resolve api-designer: both models find unique API issues |
|
|
71
|
-
| performance-reviewer | **Codex** | read-only | Algorithm complexity = Terminal-Bench (+9.7pp) |
|
|
72
|
-
|
|
73
|
-
### Summary distribution
|
|
74
|
-
|
|
75
|
-
| Engine | team-resolve (12) | team-review (9) | Total |
|
|
76
|
-
|--------|-------------------|-----------------|-------|
|
|
77
|
-
| Claude | 7 | 4 | 11 |
|
|
78
|
-
| Codex | 2 | 2 | 4 |
|
|
79
|
-
| Dual | 3 | 3 | 6 |
|
|
5
|
+
Codex call defaults: `model: "gpt-5.4"`, `reasoningEffort: "xhigh"`, `sandbox` per role (below), `workingDirectory: <project root>`.
|
|
80
6
|
|
|
81
7
|
---
|
|
82
8
|
|
|
83
9
|
## Pipeline Phase Routing (auto-resolve)
|
|
84
10
|
|
|
85
|
-
| Phase |
|
|
11
|
+
| Phase | `--engine auto` | `--engine codex` | `--engine claude` |
|
|
86
12
|
|-------|--------------|----------------|-----------------|
|
|
87
|
-
| BUILD
|
|
88
|
-
| BUILD GATE | bash
|
|
89
|
-
| BROWSER VALIDATE | Claude (Chrome MCP
|
|
13
|
+
| BUILD | **Codex** | Codex | Claude |
|
|
14
|
+
| BUILD GATE | bash | bash | bash |
|
|
15
|
+
| BROWSER VALIDATE | Claude (Chrome MCP) | Claude | Claude |
|
|
90
16
|
| EVALUATE | **Claude** | Claude | Claude |
|
|
91
17
|
| FIX LOOP | **Codex** | Codex | Claude |
|
|
92
|
-
|
|
|
93
|
-
|
|
|
94
|
-
| CHALLENGE | **Claude** | Claude | Claude |
|
|
95
|
-
| SECURITY REVIEW | **Dual** | Codex | Claude |
|
|
96
|
-
| CLEAN | Claude | Codex | Claude |
|
|
18
|
+
| CRITIC (design sub-pass) | Claude | Claude | Claude |
|
|
19
|
+
| CRITIC (security sub-pass) | **Dual** | Codex | Claude |
|
|
97
20
|
| DOCS | Claude | Codex | Claude |
|
|
98
21
|
|
|
99
|
-
Rationale
|
|
100
|
-
- BUILD/FIX: Codex — SWE-bench Pro
|
|
101
|
-
- EVALUATE/
|
|
102
|
-
- BROWSER: Claude — Chrome MCP tools
|
|
103
|
-
-
|
|
22
|
+
Rationale:
|
|
23
|
+
- BUILD/FIX: Codex — SWE-bench Pro advantage on hard coding.
|
|
24
|
+
- EVALUATE/CRITIC design sub-pass: Claude — long-context retrieval + skeptical reasoning; different model family from builder for GAN dynamic.
|
|
25
|
+
- BROWSER: Claude — Chrome MCP tools session-bound.
|
|
26
|
+
- CRITIC security: Dual on `auto` — Semgrep study shows each model finds unique vulnerabilities; for security, coverage trumps cost.
|
|
104
27
|
|
|
105
28
|
---
|
|
106
29
|
|
|
107
30
|
## Pipeline Phase Routing (ideate)
|
|
108
31
|
|
|
109
|
-
| Phase |
|
|
32
|
+
| Phase | `--engine auto` | `--engine codex` | `--engine claude` |
|
|
110
33
|
|-------|--------------|----------------|-----------------|
|
|
111
|
-
| FRAME |
|
|
112
|
-
| EXPLORE |
|
|
113
|
-
| CONVERGE |
|
|
34
|
+
| FRAME | Claude | Codex | Claude |
|
|
35
|
+
| EXPLORE | Claude | Codex | Claude |
|
|
36
|
+
| CONVERGE | Claude | Codex | Claude |
|
|
114
37
|
| CHALLENGE | **Codex** (rubric critic) | Claude (role reversal) | Claude |
|
|
115
|
-
| DOCUMENT |
|
|
38
|
+
| DOCUMENT | Claude | Codex | Claude |
|
|
116
39
|
|
|
117
|
-
|
|
118
|
-
- FRAME/EXPLORE/CONVERGE: Claude — ambiguous intent handling, multi-perspective reasoning.
|
|
119
|
-
- CHALLENGE: When `--engine auto`, Codex runs the rubric pass as critic — automatic on every run. When `--engine codex`, Claude runs the challenge (role reversal — builder and critic are always different models).
|
|
120
|
-
- DOCUMENT: Claude — writing quality for spec generation.
|
|
40
|
+
CHALLENGE: when `--engine auto`, Codex runs the rubric as critic (builder and critic always different models).
|
|
121
41
|
|
|
122
42
|
---
|
|
123
43
|
|
|
124
44
|
## Pipeline Phase Routing (preflight)
|
|
125
45
|
|
|
126
|
-
| Phase |
|
|
46
|
+
| Phase | `--engine auto` | `--engine codex` | `--engine claude` |
|
|
127
47
|
|-------|--------------|----------------|-----------------|
|
|
128
|
-
| EXTRACT
|
|
129
|
-
|
|
|
130
|
-
|
|
|
131
|
-
|
|
|
48
|
+
| EXTRACT | Claude | Codex | Claude |
|
|
49
|
+
| AUDIT (code) | **Codex** | Codex | Claude |
|
|
50
|
+
| AUDIT (docs) | **Claude** | Claude | Claude |
|
|
51
|
+
| AUDIT (browser) | Claude | Claude | Claude |
|
|
132
52
|
| SYNTHESIZE | Claude | Claude | Claude |
|
|
133
53
|
|
|
134
|
-
|
|
54
|
+
Docs auditor is always Claude (writing-quality strength for prose-drift detection). Browser is always Claude (Chrome MCP session-bound).
|
|
135
55
|
|
|
136
56
|
---
|
|
137
57
|
|
|
138
|
-
##
|
|
139
|
-
|
|
140
|
-
For roles marked **Codex** in the routing table, call `mcp__codex-cli__codex` instead of spawning a Claude Agent subagent. Package the role's full prompt (from the skill's teammate prompt section) into the Codex call.
|
|
141
|
-
|
|
142
|
-
Template:
|
|
143
|
-
|
|
144
|
-
```
|
|
145
|
-
mcp__codex-cli__codex({
|
|
146
|
-
prompt: "[full role prompt with issue context, file paths, and deliverable format]",
|
|
147
|
-
model: "gpt-5.4",
|
|
148
|
-
reasoningEffort: "xhigh",
|
|
149
|
-
sandbox: "[read-only or workspace-write per table]",
|
|
150
|
-
workingDirectory: "[project root]"
|
|
151
|
-
})
|
|
152
|
-
```
|
|
153
|
-
|
|
154
|
-
**Important**: Codex has no access to team infrastructure (TeamCreate, SendMessage, TaskCreate). For Codex roles:
|
|
155
|
-
- Include ALL context inline in the prompt (issue description, file paths from investigation, deliverable format)
|
|
156
|
-
- The orchestrator collects Codex's response and routes it where it would have gone via SendMessage
|
|
157
|
-
- Codex roles cannot communicate with other teammates directly — the orchestrator relays findings
|
|
158
|
-
|
|
159
|
-
For roles marked **Claude**, spawn a normal Agent subagent as before.
|
|
160
|
-
|
|
161
|
-
---
|
|
58
|
+
## Team-role routing (when `--team` is on OR route == strict in auto-resolve, OR team-review in standalone)
|
|
162
59
|
|
|
163
|
-
|
|
164
|
-
|
|
165
|
-
|
|
166
|
-
|
|
167
|
-
|
|
168
|
-
|
|
169
|
-
|
|
170
|
-
|
|
171
|
-
|
|
172
|
-
|
|
173
|
-
|
|
174
|
-
- Conflicting findings → keep both, note the disagreement
|
|
175
|
-
- Take the MORE SEVERE verdict between the two
|
|
176
|
-
|
|
177
|
-
---
|
|
178
|
-
|
|
179
|
-
## How to Spawn a Codex BUILD/FIX Agent
|
|
180
|
-
|
|
181
|
-
For BUILD and FIX LOOP phases when engine routes to Codex:
|
|
182
|
-
|
|
183
|
-
```
|
|
184
|
-
mcp__codex-cli__codex({
|
|
185
|
-
prompt: "[full build/fix prompt with task description, done criteria, and implementation instructions]",
|
|
186
|
-
model: "gpt-5.4",
|
|
187
|
-
reasoningEffort: "xhigh",
|
|
188
|
-
sandbox: "workspace-write",
|
|
189
|
-
fullAuto: true,
|
|
190
|
-
workingDirectory: "[project root]"
|
|
191
|
-
})
|
|
192
|
-
```
|
|
60
|
+
| Role | Engine | Sandbox | Rationale |
|
|
61
|
+
|------|--------|---------|-----------|
|
|
62
|
+
| root-cause-analyst | Claude | — | Git-history traversal + tool access beats SWE-bench Pro for this role |
|
|
63
|
+
| test-engineer | Codex | workspace-write | HumanEval edge, needs file write |
|
|
64
|
+
| security-auditor | Dual | read-only | Semgrep: each finds unique vulns |
|
|
65
|
+
| implementation-planner | Codex | read-only | SWE-bench Pro +11.7pp |
|
|
66
|
+
| architecture-reviewer | Claude | — | Codebase-wide pattern review = MRCR strength |
|
|
67
|
+
| performance-engineer | Codex | read-only | Terminal-Bench edge |
|
|
68
|
+
| api-designer / api-reviewer | Dual | read-only | Both find unique API issues |
|
|
69
|
+
| quality-reviewer | Dual | read-only | Measured ~36–73% coverage gain from dual |
|
|
70
|
+
| ux/ui/accessibility-* | Claude | — | Ambiguity handling + WCAG domain depth |
|
|
193
71
|
|
|
194
|
-
|
|
72
|
+
For Codex roles: `mcp__codex-cli__codex` with `model: "gpt-5.4"`, `reasoningEffort: "xhigh"`, sandbox per table. Include full role prompt inline; Codex has no access to TeamCreate/SendMessage/TaskCreate.
|
|
195
73
|
|
|
196
|
-
|
|
74
|
+
For Dual roles: run both in parallel, merge findings. Same finding from both → keep more detailed wording, mark "confirmed by both". Codex-only → prefix `[codex]`. Conflicts → keep both.
|
|
197
75
|
|
|
198
76
|
---
|
|
199
77
|
|
|
200
|
-
## Override
|
|
78
|
+
## Override behavior
|
|
201
79
|
|
|
202
|
-
- `--engine claude` → all roles
|
|
203
|
-
- `--engine codex` → all phases use Codex for implementation/analysis, Claude only for orchestration
|
|
204
|
-
- `--engine auto` (default) → each role
|
|
80
|
+
- `--engine claude` → all roles/phases use Claude (no Codex calls).
|
|
81
|
+
- `--engine codex` → all phases use Codex for implementation/analysis, Claude only for orchestration/Chrome MCP.
|
|
82
|
+
- `--engine auto` (default) → each role/phase routes per this table.
|
|
@@ -0,0 +1,103 @@
|
|
|
1
|
+
# Findings Schema — `.devlyn/<phase>.findings.jsonl`
|
|
2
|
+
|
|
3
|
+
Structured findings format for phase outputs. One JSON object per line (JSONL). Written by Evaluate, Build Gate, Browser Validate, Critic, and any other phase that emits structured findings.
|
|
4
|
+
|
|
5
|
+
## Purpose
|
|
6
|
+
|
|
7
|
+
Separate structured findings from prose summaries. The orchestrator and fix-loop need machine-readable data for:
|
|
8
|
+
- Filtering by severity / blocking / status.
|
|
9
|
+
- Cross-round dedup (single primary key — see below).
|
|
10
|
+
- Packing into fix-batch packets for the fix-loop subagent.
|
|
11
|
+
- Final report aggregation.
|
|
12
|
+
|
|
13
|
+
## Canonical schema (per line)
|
|
14
|
+
|
|
15
|
+
```json
|
|
16
|
+
{
|
|
17
|
+
"id": "<PHASE>-<4digit>",
|
|
18
|
+
"rule_id": "<category>.<kebab-name>",
|
|
19
|
+
"level": "note" | "warning" | "error",
|
|
20
|
+
"severity": "LOW" | "MEDIUM" | "HIGH" | "CRITICAL",
|
|
21
|
+
"confidence": <float 0.0..1.0>,
|
|
22
|
+
"message": "<one-line human description>",
|
|
23
|
+
"file": "<path relative to repo root>",
|
|
24
|
+
"line": <int, 1-based>,
|
|
25
|
+
"phase": "<phase name, e.g. 'evaluate'>",
|
|
26
|
+
"criterion_ref": "<anchor, e.g. 'spec://requirements/2'>" | null,
|
|
27
|
+
"fix_hint": "<concrete action quoting file:line to change>",
|
|
28
|
+
"blocking": true | false,
|
|
29
|
+
"status": "open" | "resolved" | "suppressed"
|
|
30
|
+
}
|
|
31
|
+
```
|
|
32
|
+
|
|
33
|
+
## Field semantics
|
|
34
|
+
|
|
35
|
+
### Identity
|
|
36
|
+
|
|
37
|
+
- `id` — stable within a single run. Format: `<PHASE>-<4digit>` zero-padded. Examples: `EVAL-0007`, `BUILD-0001`, `CRIT-0003`, `BGATE-0004`.
|
|
38
|
+
- `rule_id` — stable across runs. Format: `<category>.<kebab-case-name>`. Use existing rule_ids before inventing new ones — keeps dedup working. Common categories:
|
|
39
|
+
- `correctness.*` — logic errors, silent failures, null access, wrong API contracts
|
|
40
|
+
- `design.*` — staff-engineer ship/no-ship concerns (non-atomic transactions, hidden assumptions, unidiomatic patterns)
|
|
41
|
+
- `security.*` — OWASP-anchored (sql-injection, xss, hardcoded-credential, missing-input-validation, missing-auth-check, insecure-dependency, permissive-cors, missing-csrf, privilege-escalation, data-exposure, path-traversal, ssrf)
|
|
42
|
+
- `ux.*` — missing error/loading/empty states
|
|
43
|
+
- `architecture.*` — pattern violations, duplication, missing integration
|
|
44
|
+
- `hygiene.*` — unused imports, dead code, unused deps (typically LOW)
|
|
45
|
+
- `types.*` — any-cast escapes, unsafe casts
|
|
46
|
+
- `scope.*` — out-of-scope violations (e.g. `scope.out-of-scope-violation`)
|
|
47
|
+
- `performance.*`, `style.*` — typically LOW
|
|
48
|
+
|
|
49
|
+
### Severity and level
|
|
50
|
+
|
|
51
|
+
- `level` — SARIF-style coarse bucket: `error` blocks ship, `warning` should fix, `note` informational.
|
|
52
|
+
- `severity` — finer granularity for pipeline logic.
|
|
53
|
+
|
|
54
|
+
| severity | level | blocking default |
|
|
55
|
+
|----------|-------|------------------|
|
|
56
|
+
| `CRITICAL` | `error` | always true |
|
|
57
|
+
| `HIGH` | `error` | usually true |
|
|
58
|
+
| `MEDIUM` | `warning` | true (stricter for `security.*` — see pipeline-routing terminal state) |
|
|
59
|
+
| `LOW` | `note` | false |
|
|
60
|
+
|
|
61
|
+
- `confidence` — reporter's self-rating, `0.0`–`1.0`. Fix-loop prioritizes high-confidence HIGH findings first.
|
|
62
|
+
|
|
63
|
+
### Message and location
|
|
64
|
+
|
|
65
|
+
- `message` — one line. NAME the issue, not the symptom. Good: `"Token validated on read path but not write path"`. Bad: `"Potential security issue"`.
|
|
66
|
+
- `file` — repo-relative path. No leading `./`, no absolute paths. Forward slashes.
|
|
67
|
+
- `line` — 1-based line number. Multi-line spans → primary (first) line.
|
|
68
|
+
|
|
69
|
+
### Dedup primary key
|
|
70
|
+
|
|
71
|
+
The primary key is `(rule_id, file, line)`. Two findings with identical coordinates are the same issue. If EVAL runs again after a fix and the finding re-appears at the same spot, it's still unresolved. If the line shifted after a fix, the next EVAL regenerates the finding with the new line — cross-round drift heals naturally via re-evaluation, no hash-normalization bookkeeping.
|
|
72
|
+
|
|
73
|
+
### Pipeline linkage
|
|
74
|
+
|
|
75
|
+
- `phase` — the phase that produced this finding (redundant with filename but keeps records self-describing when concatenated for fix-batch packets).
|
|
76
|
+
- `criterion_ref` — anchor to the criterion this finding affects, or `null` if cross-cutting.
|
|
77
|
+
- `fix_hint` — concrete action with file:line. Not "improve error handling" — e.g. `"Wrap read+write in db.transaction() at src/auth/session.ts:84-92; re-check order.status === 'pending' inside transaction before updating"`.
|
|
78
|
+
|
|
79
|
+
### Lifecycle
|
|
80
|
+
|
|
81
|
+
- `blocking` — does this block ship?
|
|
82
|
+
- `status`:
|
|
83
|
+
- `open` — reported, awaiting action
|
|
84
|
+
- `resolved` — a fix-loop round applied a fix AND subsequent evaluation confirms the finding is gone (via `(rule_id, file, line)` absence in the new EVAL run)
|
|
85
|
+
- `suppressed` — intentionally not fixed with justification in the phase's `log.md`; requires user override or Out-of-Scope mapping in the spec
|
|
86
|
+
|
|
87
|
+
## Dedup and round handling
|
|
88
|
+
|
|
89
|
+
Each phase writes a fresh `.findings.jsonl` on each execution. Fix rounds re-run EVAL, which produces a new file.
|
|
90
|
+
|
|
91
|
+
**Cross-round reconciliation** (fix-loop rounds only):
|
|
92
|
+
1. Read prior round's `<phase>.findings.jsonl` (if any).
|
|
93
|
+
2. For each finding in the new file: if prior open finding has same `(rule_id, file, line)` → reuse the prior `id`, keep `status: open`. Otherwise → new `id`.
|
|
94
|
+
3. For each prior open finding not matched by the new file → set `status: resolved` in the prior file.
|
|
95
|
+
|
|
96
|
+
Fix-batch packet: orchestrator concatenates all phases' `.findings.jsonl`, filters `status == "open"`, drops `blocking == false` when round budget is tight, writes to `.devlyn/fix-batch.round-<N>.json`.
|
|
97
|
+
|
|
98
|
+
## Non-goals
|
|
99
|
+
|
|
100
|
+
- Full SARIF 2.1.0 export (can derive later by mapping our fields to SARIF).
|
|
101
|
+
- Cross-project aggregation.
|
|
102
|
+
- Rule metadata catalogs.
|
|
103
|
+
- Code fix patches — `fix_hint` is prose; fix-loop subagent writes the patch.
|
|
@@ -0,0 +1,54 @@
|
|
|
1
|
+
# PHASE 1 — BUILD (agent prompt body)
|
|
2
|
+
|
|
3
|
+
Spawned when PHASE 1 runs. Engine: BUILD row of `engine-routing.md`.
|
|
4
|
+
|
|
5
|
+
Orchestrator passes the task description as the final section, and sets the team flag (`team: true|false`) per orchestrator rule: team only when `--team` flag OR `state.route.selected == "strict"`.
|
|
6
|
+
|
|
7
|
+
---
|
|
8
|
+
|
|
9
|
+
<spec_integrity_check>
|
|
10
|
+
Before reading anything else:
|
|
11
|
+
- If `pipeline.state.json:source.type == "spec"`, compute `sha256(state.source.spec_path)`. If it differs from `state.source.spec_sha256`, write `phases.build.verdict: "BLOCKED"` with reason `"spec_sha256 mismatch"` and return. The spec changed mid-run — invariant violation per `references/pipeline-state.md`.
|
|
12
|
+
- If `source.type == "generated"` and `state.source.criteria_sha256` exists, verify the same way.
|
|
13
|
+
- If the hash field is absent (first phase to populate the file), skip this check this one time only.
|
|
14
|
+
</spec_integrity_check>
|
|
15
|
+
|
|
16
|
+
<goal>
|
|
17
|
+
Implement code changes that satisfy every pending criterion in `pipeline.state.json:criteria[]` without violating anything declared Out of Scope or Constraints. Make the source's intent run in the code.
|
|
18
|
+
</goal>
|
|
19
|
+
|
|
20
|
+
<input>
|
|
21
|
+
- Canonical criteria: `pipeline.state.json:source`. Follow `source.spec_path` (read directly, do not copy) or `source.criteria_path` (`.devlyn/criteria.generated.md` — may not yet exist; see OUTPUT CONTRACT).
|
|
22
|
+
- Codebase at `pipeline.state.json:base_ref.sha`.
|
|
23
|
+
- Task statement appended at the bottom of this prompt.
|
|
24
|
+
</input>
|
|
25
|
+
|
|
26
|
+
<output_contract>
|
|
27
|
+
- **Code changes** implementing every `pending` criterion. Verify with `git diff`.
|
|
28
|
+
- **state.json criteria updates**: for each criterion satisfied, set `status: "implemented"` and append an `evidence` record `{"file": "...", "line": N, "note": "brief"}`.
|
|
29
|
+
- **If `source.type == "generated"` and `.devlyn/criteria.generated.md` does not exist**: create it once with `## Requirements` (each `- [ ]` testable in under 30 seconds, specific, scoped), `## Out of Scope`, `## Verification`. Populate `state.criteria[]` with `{"id": "C<N>", "ref": "criteria.generated://requirements/<N-1>", "status": "pending", "evidence": [], "failed_by_finding_ids": []}`. Classify task complexity into `low` / `medium` / `high` and write to `phases.build.complexity`. Compute `criteria_sha256 = sha256(criteria.generated.md)` and store in `state.source.criteria_sha256`.
|
|
30
|
+
- **No pending criterion remains**: every `criteria[]` entry must transition to `status: "implemented"` with an `evidence` record before you exit. If a criterion genuinely cannot be satisfied (missing external dep, blocking ambiguity), set `phases.build.verdict: "BLOCKED"` and report. Never exit with a criterion still `pending`. BUILD must not mark any criterion `failed` — that's EVAL-only. Legal transitions: `pending → implemented`, or halt via `verdict: "BLOCKED"`.
|
|
31
|
+
- **Tests** added or updated for changed behavior. Run the full test suite before stopping.
|
|
32
|
+
- **Team** (only if orchestrator set `team: true`): use `TeamCreate` per the role table below; collect findings; shut down the team before exiting. Otherwise implement directly — the default.
|
|
33
|
+
</output_contract>
|
|
34
|
+
|
|
35
|
+
<quality_bar>
|
|
36
|
+
- Criteria and Out-of-Scope are the contract — never weaken, reword, or delete them.
|
|
37
|
+
- Read only files the source implicates (Architecture Notes + Dependencies + touched patterns), not the whole codebase.
|
|
38
|
+
- Bugs: failing test first, then fix. Features: follow existing patterns, then write tests. Refactors: tests pass before and after.
|
|
39
|
+
- Fix root causes only — no `any`, `@ts-ignore`, silent `catch`, or hardcoded values.
|
|
40
|
+
</quality_bar>
|
|
41
|
+
|
|
42
|
+
<principle>
|
|
43
|
+
The source is the contract. Your output is evidence that the contract now runs in code.
|
|
44
|
+
</principle>
|
|
45
|
+
|
|
46
|
+
<team_role_selection>
|
|
47
|
+
When `team: true`, select teammates by task type (per-role engine routing per `references/engine-routing.md`):
|
|
48
|
+
- Bug fix: root-cause-analyst + test-engineer (+ security-auditor / performance-engineer as needed)
|
|
49
|
+
- Feature: implementation-planner + test-engineer (+ ux-designer / architecture-reviewer / api-designer as needed)
|
|
50
|
+
- Refactor: architecture-reviewer + test-engineer
|
|
51
|
+
- UI/UX: product-designer + ux-designer + ui-designer (+ accessibility-auditor as needed)
|
|
52
|
+
</team_role_selection>
|
|
53
|
+
|
|
54
|
+
The task is: [orchestrator pastes the task description here]
|
|
@@ -0,0 +1,45 @@
|
|
|
1
|
+
# PHASE 2 — EVALUATE (agent prompt body)
|
|
2
|
+
|
|
3
|
+
Spawned when PHASE 2 runs. Engine: Claude (cross-model critic when builder was Codex).
|
|
4
|
+
|
|
5
|
+
---
|
|
6
|
+
|
|
7
|
+
<spec_integrity_check>
|
|
8
|
+
Before reading anything: verify source hash per `references/phases/phase-1-build.md#spec_integrity_check`. Apply the same rule (spec_sha256 for spec runs, criteria_sha256 for generated).
|
|
9
|
+
</spec_integrity_check>
|
|
10
|
+
|
|
11
|
+
<goal>
|
|
12
|
+
Independently verify whether every criterion in `pipeline.state.json:criteria[]` is satisfied by the current code. Surface every defect with file:line evidence. You are a skeptic, not a cheerleader — praise is not your job.
|
|
13
|
+
</goal>
|
|
14
|
+
|
|
15
|
+
<input>
|
|
16
|
+
- Canonical rubric: `pipeline.state.json:source`. Follow `source.spec_path` or `source.criteria_path` and read Requirements + Out of Scope + Verification directly.
|
|
17
|
+
- Change surface: `git diff <pipeline.state.json:base_ref.sha>` + `git status`. Read every changed/new file in full — not just the hunks.
|
|
18
|
+
- Prior browser findings at `.devlyn/browser_validate.findings.jsonl` (if that phase ran).
|
|
19
|
+
</input>
|
|
20
|
+
|
|
21
|
+
<output_contract>
|
|
22
|
+
- **`.devlyn/evaluate.findings.jsonl`** — one JSON per line (schema: `references/findings-schema.md`). Per finding:
|
|
23
|
+
`id` (`EVAL-<4digit>`), `rule_id` (stable kebab-case, e.g. `correctness.silent-error`, `ux.missing-error-state`, `architecture.duplication`, `security.missing-validation`, `types.any-cast-escape`, `style.let-vs-const`, `scope.out-of-scope-violation`, `hygiene.unused-import`), `level` (`error`/`warning`/`note` — map from severity: CRITICAL/HIGH → error, MEDIUM → warning, LOW → note), `severity` (`CRITICAL`/`HIGH`/`MEDIUM`/`LOW`), `confidence` (0.0–1.0), `message` (one line naming the issue, not symptoms), `file`, `line` (1-based primary location), `phase: "evaluate"`, `criterion_ref` (exact `ref` string from a `criteria[]` entry — e.g. `"spec://requirements/2"` — when the finding fails a specific criterion; or a section anchor from `state.source.criteria_anchors` such as `"spec://constraints"` / `"spec://out-of-scope"` when cross-cutting; `null` when scope-broader than any anchor), `fix_hint` (concrete action quoting file:line), `blocking` (CRITICAL/HIGH/MEDIUM default true, LOW false), `status: "open"`. Dedup key is `(rule_id, file, line)` — no fingerprint bookkeeping.
|
|
24
|
+
- **`.devlyn/evaluate.log.md`** — 3–5 line human summary: verdict + criteria pass/fail counts + top 3 risks + cross-cutting patterns if any. Prose here; structured data in the JSONL.
|
|
25
|
+
- **state.json criteria updates** — every `criteria[]` entry leaves Evaluate in a terminal state. Incoming status from BUILD is normally `implemented`; transition each to `status: "verified"` (append `evidence` record confirming satisfaction) OR `status: "failed"` (set `failed_by_finding_ids` to the IDs you emitted). If a criterion is still `pending` (BUILD did not satisfy it), mark it `failed` with a finding whose `rule_id` is `correctness.criterion-unimplemented`. No `criteria[]` entry may remain `pending` or `implemented` after Evaluate.
|
|
26
|
+
- **state.json phases.evaluate** — `verdict` per taxonomy, `engine: "claude"`, `model`, timing, `round`, `artifacts.{findings_file, log_file}`.
|
|
27
|
+
|
|
28
|
+
Verdict taxonomy: `BLOCKED` (any CRITICAL) / `NEEDS_WORK` (HIGH or MEDIUM present) / `PASS_WITH_ISSUES` (LOW only) / `PASS` (clean).
|
|
29
|
+
</output_contract>
|
|
30
|
+
|
|
31
|
+
<quality_bar>
|
|
32
|
+
- Every finding points at a file:line you have opened and read. No real anchor = speculation; exclude it.
|
|
33
|
+
- Every failed criterion maps to ≥1 finding `id`.
|
|
34
|
+
- **Coverage over comfort**: report uncertain and LOW findings too; downstream filters rank them. Missing a real defect ships broken code — the asymmetry is decisive.
|
|
35
|
+
- Audit each changed file for: correctness (logic errors, silent failures, null access, wrong API contracts), architecture (pattern violations, duplication, missing integration), security (if auth/secrets/user-data touched: injection, hardcoded credentials, missing validation), frontend (if UI changed: missing error/loading/empty states, React anti-patterns, server/client boundaries), test coverage (untested modules, missing edge cases), hygiene (unused imports, dead code, unused deps — `hygiene.*` at LOW), defensive programming (recursion depth/cycle guards, boundary conditions, missing null checks — severity per blast radius: `correctness.*` when it can crash or corrupt, `hygiene.*` when cosmetic).
|
|
36
|
+
- Calibration: a catch block that logs but doesn't surface the error to the user → HIGH, not MEDIUM (logging ≠ error handling). A `let` that could be `const` → LOW (linters catch it). "Error handling is generally quite good" is not a finding — count instances, name files.
|
|
37
|
+
- "Pre-existing" findings still count if they relate to the criteria. Working software, not blame attribution.
|
|
38
|
+
- **Out-of-Scope violations are findings**: if BUILD added behavior the source's `## Out of Scope` excludes, emit `rule_id: "scope.out-of-scope-violation"`, `severity: HIGH`, `criterion_ref: "spec://out-of-scope"` (or `"criteria.generated://out-of-scope"`), `fix_hint` naming what to remove.
|
|
39
|
+
</quality_bar>
|
|
40
|
+
|
|
41
|
+
<principle>
|
|
42
|
+
Missing a real defect is worse than reporting an extra one. Asymmetric cost demands bias toward reporting.
|
|
43
|
+
</principle>
|
|
44
|
+
|
|
45
|
+
Do not delete `pipeline.state.json` or the JSONL/log files.
|