devlyn-cli 1.13.0 → 1.15.0
- package/CLAUDE.md +28 -149
- package/README.md +30 -1
- package/config/skills/devlyn:auto-resolve/SKILL.md +167 -453
- package/config/skills/devlyn:auto-resolve/evals/evals.json +21 -0
- package/config/skills/devlyn:auto-resolve/evals/task-doctor-subcommand.md +42 -0
- package/config/skills/devlyn:auto-resolve/references/build-gate.md +36 -22
- package/config/skills/devlyn:auto-resolve/references/engine-routing.md +43 -165
- package/config/skills/devlyn:auto-resolve/references/findings-schema.md +103 -0
- package/config/skills/devlyn:auto-resolve/references/phases/phase-1-build.md +54 -0
- package/config/skills/devlyn:auto-resolve/references/phases/phase-2-evaluate.md +45 -0
- package/config/skills/devlyn:auto-resolve/references/phases/phase-3-critic.md +84 -0
- package/config/skills/devlyn:auto-resolve/references/pipeline-routing.md +114 -0
- package/config/skills/devlyn:auto-resolve/references/pipeline-state.md +201 -0
- package/config/skills/devlyn:auto-resolve/scripts/archive_run.py +104 -0
- package/config/skills/devlyn:auto-resolve/scripts/terminal_verdict.py +96 -0
- package/config/skills/devlyn:ideate/SKILL.md +17 -78
- package/config/skills/devlyn:ideate/references/codex-critic-template.md +42 -0
- package/config/skills/devlyn:ideate/references/templates/item-spec.md +4 -0
- package/config/skills/devlyn:preflight/SKILL.md +25 -40
- package/config/skills/devlyn:preflight/references/auditors/code-auditor.md +6 -10
- package/config/skills/devlyn:reap/SKILL.md +104 -0
- package/config/skills/devlyn:reap/scripts/reap.sh +129 -0
- package/config/skills/devlyn:reap/scripts/scan.sh +116 -0
- package/package.json +5 -1
|
@@ -1,538 +1,252 @@
|
|
|
1
1
|
---
|
|
2
2
|
name: devlyn:auto-resolve
|
|
3
|
-
description: Fully automated build-evaluate-
|
|
3
|
+
description: Fully automated build-evaluate-ship pipeline for any task type — bug fixes, new features, refactors, chores. Use this as the default starting point when the user wants hands-free implementation with zero human intervention. Runs a minimal goal-driven loop — build, evaluate, fix, critic, docs — as a single command. Use when the user says "auto resolve", "build this", "implement this feature", "fix this", "run the full pipeline", "refactor this", or wants to walk away and come back to finished work.
|
|
4
4
|
---
|
|
5
5
|
|
|
6
|
-
|
|
6
|
+
Orchestrator for the hands-free implementation pipeline. One subagent per phase, file-based handoff, unified fix loop on evaluation feedback until the work passes or `max_rounds` is reached. The orchestrator itself does not write code — it parses input, spawns phases, reads handoff artifacts, runs git commands, branches on verdicts, and emits the final report.
|
|
7
7
|
|
|
8
8
|
<pipeline_config>
|
|
9
9
|
$ARGUMENTS
|
|
10
10
|
</pipeline_config>
|
|
11
11
|
|
|
12
|
-
<pipeline_workflow>
|
|
13
|
-
|
|
14
12
|
<orchestrator_context>
|
|
15
|
-
|
|
16
|
-
|
|
17
|
-
- Your context window is auto-compacted as it approaches its limit, so do not stop tasks early due to token-budget concerns. Keep the run going.
|
|
18
|
-
- All durable state lives in `.devlyn/*.md` (done-criteria, BUILD-GATE, EVAL-FINDINGS, BROWSER-RESULTS, CHALLENGE-FINDINGS) and in git commits. If your context is cleared mid-run, the next instance can resume from those files plus `git log`. Keep them up to date.
|
|
19
|
-
- Best results come from `xhigh` effort. If you are running on lower effort and notice shallow reasoning during phase decisions, escalate.
|
|
13
|
+
Long-horizon agentic work. Context auto-compacts — do not stop early on token-budget concerns. All durable state lives in `.devlyn/pipeline.state.json` (control plane: pointers, criteria, verdicts) plus `<phase>.findings.jsonl` + `<phase>.log.md` for phases that emit findings. `state.json` is the **single authoritative verdict source** — branch on `phases.<name>.verdict` directly, never parse artifact files. At PHASE 5, the run's `.devlyn/*` artifacts are **archived** to `.devlyn/runs/<run_id>/` (last 10 kept, best-effort). Schemas: `references/pipeline-state.md`, `references/findings-schema.md`. Best results with `xhigh` reasoning.
|
|
20
14
|
</orchestrator_context>
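A minimal sketch of the "single authoritative verdict source" rule, assuming the state file is plain JSON shaped as `phases.<name>.verdict`. The helper names `load_state` and `read_verdict` are hypothetical and not part of the package.

```python
import json
from pathlib import Path
from typing import Optional

# Path named in the orchestrator context above.
STATE_PATH = Path(".devlyn/pipeline.state.json")


def load_state(path: Path = STATE_PATH) -> dict:
    """Load the control-plane file; a missing file is treated as an empty state."""
    if not path.exists():
        return {}
    return json.loads(path.read_text(encoding="utf-8"))


def read_verdict(state: dict, phase: str) -> Optional[str]:
    """Return phases.<phase>.verdict, or None if the phase has not reported yet.

    Hypothetical helper: it reads the state dict only and never parses
    findings or log artifacts.
    """
    phases = state.get("phases") or {}
    return (phases.get(phase) or {}).get("verdict")
```

The orchestrator would branch on the returned string directly, treating `None` as "phase has not run".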
|
|
21
15
|
|
|
22
16
|
<autonomy_contract>
|
|
23
|
-
This pipeline runs hands-free.
|
|
17
|
+
This pipeline runs hands-free. Success is measured by how far it gets without human intervention.
|
|
24
18
|
|
|
25
|
-
1. **
|
|
26
|
-
2. **
|
|
27
|
-
3. **
|
|
28
|
-
4. **
|
|
29
|
-
5. **
|
|
19
|
+
1. **Never prompt the user mid-pipeline.** When you'd otherwise ask, pick the safe default, proceed, and log it in the final report.
|
|
20
|
+
2. **Codex availability**: on `--engine auto`/`codex`, call `mcp__codex-cli__ping` at the start. On failure, silently fall back to `--engine claude` and log `engine downgraded: codex-ping failed` in the final report. Do NOT present a menu. Do NOT abort.
|
|
21
|
+
3. **Run only the phases defined below, in order.** Doc updates belong in PHASE 4 (DOCS). Don't insert them earlier.
|
|
22
|
+
4. **Delegate all file changes to spawned subagents.** Orchestrator actions: parse input, spawn phase agents, read handoff files, run `git`, branch on verdicts, emit report, archive.
|
|
23
|
+
5. **Continue by default.** Stop only for: (a) unrecoverable subagent failure, (b) PHASE 1 producing zero code changes, (c) build-gate / browser fix-loop exhausting `max_rounds` (halt → FINAL REPORT). EVAL/CRITIC exhaustion proceeds with warning — never halts.
|
|
30
24
|
</autonomy_contract>
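Rule 2 of the contract (silent fallback when Codex is unreachable) could be sketched as follows. The `ping` callable is a hypothetical stand-in for the `mcp__codex-cli__ping` tool call, and `resolve_engine` is an illustrative name, not a function from the package.

```python
from typing import Callable, List, Tuple


def resolve_engine(requested: str, ping: Callable[[], bool]) -> Tuple[str, List[str]]:
    """Return (effective_engine, notes_for_final_report) without prompting.

    `ping` is a hypothetical wrapper around the Codex availability check;
    it returns True when the server answers.
    """
    notes: List[str] = []
    if requested == "claude":
        return requested, notes  # Codex is not used, so no ping is needed
    try:
        available = ping()
    except Exception:
        available = False  # any ping error counts as "unavailable"
    if not available:
        # Downgrade silently and record it for the final report.
        notes.append("engine downgraded: codex-ping failed")
        return "claude", notes
    return requested, notes
```

The function returns a value in every case and never raises or asks for input, which mirrors the "no menu, no abort" requirement.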
|
|
31
25
|
|
|
32
|
-
<
|
|
33
|
-
|
|
26
|
+
<harness_principles>
|
|
27
|
+
Goal-first. Verify state, source integrity, diff base, artifact contracts. Prefer deletion or reuse over new machinery. Change only files the task requires. Each phase optimizes for its declared success criteria, not a checklist. Fix root causes only — no `any`, `@ts-ignore`, silent catches, hardcoded values. Label hypotheses explicitly; back claims with file:line evidence.
|
|
28
|
+
</harness_principles>
|
|
34
29
|
|
|
35
|
-
|
|
36
|
-
|
|
37
|
-
- For phases routed to **Claude**, spawn an Agent subagent with `mode: "bypassPermissions"` and pass the prompt body verbatim.
|
|
38
|
-
- `--engine claude` forces all phases to Claude. `--engine codex` forces implementation/analysis to Codex (Claude still handles orchestration and Chrome MCP). `--engine auto` (default) uses the routing table per phase.
|
|
30
|
+
<engine_routing_convention>
|
|
31
|
+
Every phase routes to the optimal model per `references/engine-routing.md`:
|
|
39
32
|
|
|
40
|
-
|
|
33
|
+
- Phase prompt bodies (in `references/phases/`) are engine-agnostic.
|
|
34
|
+
- Phases routed to **Codex**: call `mcp__codex-cli__codex` per spawn patterns in `engine-routing.md`.
|
|
35
|
+
- Phases routed to **Claude**: spawn an `Agent` subagent with `mode: "bypassPermissions"`, passing the phase body verbatim.
|
|
36
|
+
- **Dual** (CRITIC security sub-pass on `--engine auto`): spawn both in parallel; orchestrator merges findings.
|
|
37
|
+
- `--engine claude` forces all phases to Claude. `--engine codex` forces implementation to Codex, orchestration/Chrome MCP stays Claude. `--engine auto` (default) uses the routing table.
|
|
41
38
|
</engine_routing_convention>
|
|
42
39
|
|
|
43
|
-
|
|
44
|
-
|
|
45
|
-
1. Extract the task/issue description from `<pipeline_config>`.
|
|
46
|
-
2. Determine optional flags from the input (defaults in parentheses):
|
|
47
|
-
- `--max-rounds N` (4) — max evaluate-fix loops before stopping with a report
|
|
48
|
-
- `--skip-review` (false) — skip team-review phase
|
|
49
|
-
- `--security-review` (auto) — run dedicated security audit. Auto-detects: runs when changes touch auth, secrets, user data, API endpoints, env/config, or crypto. Force with `--security-review always` or skip with `--security-review skip`
|
|
50
|
-
- `--skip-clean` (false) — skip clean phase
|
|
51
|
-
- `--skip-browser` (false) — skip browser validation phase (auto-skipped for non-web changes)
|
|
52
|
-
- `--skip-docs` (false) — skip update-docs phase
|
|
53
|
-
- `--skip-build-gate` (false) — skip the deterministic build gate (Phase 1.4). Not recommended — the build gate is the primary defense against "tests pass locally, breaks in CI/Docker/production" class of bugs.
|
|
54
|
-
- `--build-gate MODE` (auto) — controls build gate behavior. `auto`: detect project type and run appropriate build/typecheck/lint commands; if Dockerfile(s) are present, Docker builds are included automatically. `strict`: auto + treat warnings as errors. `no-docker`: auto but skip Docker builds even if Dockerfiles exist (for faster iteration). `skip`: same as --skip-build-gate.
|
|
55
|
-
- `--engine MODE` (auto) — controls which model handles each pipeline phase and team role. Modes:
|
|
56
|
-
- `auto` (default): each phase and team role routes to the optimal model based on benchmark data. Requires Codex MCP server. Codex handles BUILD/FIX (SWE-bench Pro lead) and several team roles; Claude handles EVALUATE, CHALLENGE, BROWSER, and orchestration — creating a GAN-like dynamic where the builder and critic are always different models.
|
|
57
|
-
- `codex`: Codex handles implementation/analysis phases, Claude handles orchestration, evaluation, and Chrome MCP.
|
|
58
|
-
- `claude`: all phases use Claude subagents. No Codex calls.
|
|
59
|
-
|
|
60
|
-
Flags can be passed naturally: `/devlyn:auto-resolve fix the auth bug --max-rounds 3 --skip-docs`
|
|
61
|
-
Engine examples: `--engine auto`, `--engine codex`, `--engine claude`
|
|
62
|
-
If no flags are present, use defaults. The default engine is `auto` — if the user does not pass `--engine`, treat it as `--engine auto`.
|
|
63
|
-
|
|
64
|
-
**Consolidated flag**: `--with-codex` (and its variants `evaluate`/`review`/`both`) was rolled into the smarter `--engine auto` default. If the user passes it, inform them once and proceed with `--engine auto`: "Note: `--with-codex` was consolidated into `--engine auto` (default), which provides broader Codex coverage — Codex now handles BUILD, FIX, and several team roles automatically. No flag needed. Continuing with `--engine auto`."
|
|
65
|
-
|
|
66
|
-
3. **Engine pre-flight** (runs unless `--engine claude` was explicitly passed):
|
|
67
|
-
- The default engine is `auto`. If the user did not pass `--engine`, the engine is `auto` — not `claude`.
|
|
68
|
-
- Read `references/engine-routing.md` for the full routing table.
|
|
69
|
-
- Call `mcp__codex-cli__ping` to verify the Codex MCP server is available. If ping fails, warn the user and offer: [1] Continue with `--engine claude` (fallback), [2] Abort.
|
|
70
|
-
|
|
71
|
-
4. Announce the pipeline plan:
|
|
72
|
-
```
|
|
73
|
-
Auto-resolve pipeline starting
|
|
74
|
-
Task: [extracted task description]
|
|
75
|
-
Engine: [auto / codex / claude]
|
|
76
|
-
Phases: Build → Build Gate → [Browser] → Evaluate → [Fix loop if needed] → Simplify → [Review] → Challenge → [Security] → [Clean] → [Docs]
|
|
77
|
-
Max evaluation rounds: [N]
|
|
78
|
-
```
|
|
79
|
-
|
|
80
|
-
## PHASE 1: BUILD
|
|
81
|
-
|
|
82
|
-
**Engine**: BUILD row of the routing table — Codex on `auto`/`codex`, Claude on `claude`. Per `<engine_routing_convention>` above. Subagents do not have access to skills, so the prompt below includes everything they need inline.
|
|
83
|
-
|
|
84
|
-
Agent prompt — pass this to the spawned executor:
|
|
85
|
-
|
|
86
|
-
Investigate and implement the following task. Work through these phases in order:
|
|
87
|
-
|
|
88
|
-
**Phase A — Understand the task**: Read the task description carefully. Classify the task type:
|
|
89
|
-
- **Bug fix**: trace from symptom to root cause. Read error logs and affected code paths.
|
|
90
|
-
- **Feature**: explore the codebase to find existing patterns, integration points, and relevant modules.
|
|
91
|
-
- **Refactor/Chore**: understand current implementation, identify what needs to change and why.
|
|
92
|
-
- **UI/UX**: review existing components, design system, and user flows.
|
|
93
|
-
Read relevant files in parallel. Build a clear picture of what exists and what needs to change.
|
|
94
|
-
|
|
95
|
-
**Phase B — Define done criteria**: Before writing any code, create `.devlyn/done-criteria.md` with testable success criteria. Each criterion must be verifiable (a test can assert it or a human can observe it in under 30 seconds), specific (not vague like "handles errors correctly"), and scoped to this task. Include an "Out of Scope" section and a "Verification Method" section. This file is required — downstream evaluation depends on it.
|
|
96
|
-
|
|
97
|
-
**Phase C — Assemble a team**: Use TeamCreate to create a team. Select teammates based on task type:
|
|
98
|
-
- Bug fix: root-cause-analyst + test-engineer (+ security-auditor, performance-engineer as needed)
|
|
99
|
-
- Feature: implementation-planner + test-engineer (+ ux-designer, architecture-reviewer, api-designer as needed)
|
|
100
|
-
- Refactor: architecture-reviewer + test-engineer
|
|
101
|
-
- UI/UX: product-designer + ux-designer + ui-designer (+ accessibility-auditor as needed)
|
|
102
|
-
Each teammate investigates from their perspective and sends findings back. Per-role engine routing follows the team-resolve table in `references/engine-routing.md`; Dual roles run both models in parallel.
|
|
103
|
-
|
|
104
|
-
**Phase D — Synthesize and implement**: After all teammates report, compile findings into a unified plan. Implement the solution — no workarounds, no hardcoded values, no silent error swallowing. For bugs: write a failing test first, then fix. For features: implement following existing patterns, then write tests. For refactors: ensure tests pass before and after.
|
|
105
|
-
|
|
106
|
-
**Phase E — Update done criteria**: Mark each criterion in `.devlyn/done-criteria.md` as satisfied. Run the full test suite.
|
|
107
|
-
|
|
108
|
-
**Phase F — Cleanup**: Shut down all teammates and delete the team.
|
|
109
|
-
|
|
110
|
-
The task is: [paste the task description here]
|
|
111
|
-
|
|
112
|
-
**After the agent completes**:
|
|
113
|
-
1. Verify `.devlyn/done-criteria.md` exists — if missing, create a basic one from the agent's output summary
|
|
114
|
-
2. Run `git diff --stat` to confirm code was actually changed
|
|
115
|
-
3. If no changes were made, report failure and stop
|
|
116
|
-
4. **Checkpoint**: Run `git add -A && git commit -m "chore(pipeline): phase 1 — build complete"` to create a rollback point
|
|
117
|
-
|
|
118
|
-
## PHASE 1.4: BUILD GATE
|
|
119
|
-
|
|
120
|
-
Skip if `--skip-build-gate` or `--build-gate skip` was set.
|
|
121
|
-
|
|
122
|
-
This phase runs the project's real build, typecheck, and lint commands — the same ones CI, Docker, and production environments will run. It catches the entire class of bugs that LLM-based evaluation and test suites cannot: type errors in un-tested files, cross-package type drift in monorepos, lint violations, missing production dependencies, and Dockerfile copy mismatches.
|
|
123
|
-
|
|
124
|
-
This is deterministic — if the compiler says no, the pipeline stops. No LLM judgment involved.
|
|
125
|
-
|
|
126
|
-
Spawn a subagent using the Agent tool with `mode: "bypassPermissions"`.
|
|
40
|
+
<post_eval_invariant>
|
|
41
|
+
Once `state.eval_passed_sha` is non-null (PHASE 2 returned PASS or PASS_WITH_ISSUES), the post-EVAL phases (CRITIC, DOCS) run **findings-only / doc-only** — they never write code. DOCS is the only phase allowed to commit after EVAL, and only for doc files.
|
|
127
42
|
|
|
128
|
-
|
|
43
|
+
**Orchestrator enforcement (per-phase, NOT cumulative)**: before each post-EVAL phase, capture `state.phases.<phase>.pre_sha = git rev-parse HEAD`. After the subagent completes, run `git diff --name-only <pre_sha> -- ':!.devlyn/**'`:
|
|
44
|
+
- CRITIC (findings-only) → any diff → `git reset --hard <pre_sha>`, emit `rule_id: "invariant.post-eval-code-mutation"` + `severity: HIGH` into `.devlyn/invariant.findings.jsonl`, route to FIX LOOP with `triggered_by: "critic"`.
|
|
45
|
+
- DOCS → check against allowlist; non-allowlisted paths trigger the revert-and-find flow.
|
|
129
46
|
|
|
130
|
-
|
|
131
|
-
|
|
132
|
-
Your job: detect every project type in this repo, run their build/typecheck/lint commands, and report results. You do NOT reason about code quality — you run commands and faithfully report what they output.
|
|
133
|
-
|
|
134
|
-
1. Read the detection matrix in `references/build-gate.md`
|
|
135
|
-
2. Scan the repo to detect all matching project types (a monorepo may match several)
|
|
136
|
-
3. Detect the package manager (npm/pnpm/yarn/bun) per the rules in the reference file
|
|
137
|
-
4. Run all gate commands. Sequential within a project type, parallel across unrelated types.
|
|
138
|
-
5. If `--build-gate strict` is set, apply strict-mode flags per the reference file
|
|
139
|
-
6. Run Dockerfile builds if Dockerfiles are detected, UNLESS `--build-gate no-docker` is set (see reference file)
|
|
140
|
-
7. Write results to `.devlyn/BUILD-GATE.md` following the output format in the reference file
|
|
141
|
-
|
|
142
|
-
For failures: include the FULL error output (not truncated) and extract root file:line references with concrete fix guidance so the fix agent knows exactly where to look.
|
|
143
|
-
|
|
144
|
-
**After the agent completes**:
|
|
145
|
-
1. Read `.devlyn/BUILD-GATE.md`
|
|
146
|
-
2. Extract verdict
|
|
147
|
-
3. Branch:
|
|
148
|
-
- `PASS` → continue to PHASE 1.5
|
|
149
|
-
- `FAIL` → go to PHASE 1.4-fix (build gate fix loop)
|
|
150
|
-
|
|
151
|
-
## PHASE 1.4-fix: BUILD GATE FIX LOOP
|
|
152
|
-
|
|
153
|
-
Triggered only when PHASE 1.4 returns FAIL.
|
|
154
|
-
|
|
155
|
-
Track a round counter. The build-gate fix loop and the main evaluate fix loop share **one global round counter** capped at `max-rounds` — increments from this loop and from PHASE 2.5 both count against the same total. If `round >= max-rounds`, stop with a clear failure report and do not continue to evaluate/browser/etc. Code that doesn't build cannot be meaningfully evaluated or tested.
|
|
156
|
-
|
|
157
|
-
**Engine**: FIX LOOP row of the routing table.
|
|
158
|
-
|
|
159
|
-
Agent prompt — pass this to the spawned executor:
|
|
160
|
-
|
|
161
|
-
Read `.devlyn/BUILD-GATE.md` — it contains deterministic build/typecheck/lint failures from real compiler output. These are not opinions; the compiler rejected this code. Fix every listed failure at the root cause level.
|
|
162
|
-
|
|
163
|
-
For each failure:
|
|
164
|
-
1. Read the referenced file:line and enough surrounding context to understand the error
|
|
165
|
-
2. For type errors: check BOTH sides of the type contract — the consumer AND the type definition. The fix may belong to either side. Do NOT suppress errors with `any`, `@ts-ignore`, `as unknown as`, `// eslint-disable`, or equivalent escape hatches.
|
|
166
|
-
3. For lint errors: fix the underlying issue, do not disable the rule.
|
|
167
|
-
4. For missing module/dependency errors: investigate the cause — it may be a missing dep in package.json, a typo in the import path, or a tsconfig paths misconfiguration.
|
|
168
|
-
5. After fixing, do NOT re-run the build yourself. The orchestrator re-runs PHASE 1.4.
|
|
169
|
-
|
|
170
|
-
**After the agent completes**:
|
|
171
|
-
1. **Checkpoint**: `git add -A && git commit -m "chore(pipeline): build gate fix round [N]"`
|
|
172
|
-
2. Increment the global round counter (shared with PHASE 2.5)
|
|
173
|
-
3. Go back to PHASE 1.4 (re-run the gate)
|
|
174
|
-
|
|
175
|
-
## PHASE 1.5: BROWSER VALIDATE (conditional)
|
|
176
|
-
|
|
177
|
-
Skip if `--skip-browser` was set.
|
|
178
|
-
|
|
179
|
-
1. **Check relevance**: Run `git diff --name-only` and check for web-relevant files (`*.tsx`, `*.jsx`, `*.vue`, `*.svelte`, `*.css`, `*.html`, `page.*`, `layout.*`, `route.*`). If none found, skip and note "Browser validation skipped — no web changes detected."
|
|
180
|
-
|
|
181
|
-
2. **Run validation**: Spawn a subagent using the Agent tool with `mode: "bypassPermissions"`.
|
|
182
|
-
|
|
183
|
-
Agent prompt — pass this to the Agent tool:
|
|
184
|
-
|
|
185
|
-
You are a browser validation agent. Read the skill instructions at `.claude/skills/devlyn:browser-validate/SKILL.md` and follow the full workflow to validate this web application. The dev server should be started, tested, and left running (pass `--keep-server` internally) — the pipeline will clean it up later. Write your findings to `.devlyn/BROWSER-RESULTS.md`.
|
|
186
|
-
|
|
187
|
-
**After the agent completes**:
|
|
188
|
-
1. Read `.devlyn/BROWSER-RESULTS.md`
|
|
189
|
-
2. Extract the verdict
|
|
190
|
-
3. **Validate the verdict is real**: If the verdict says "code-level pass" or indicates no actual browser interaction occurred (no screenshots taken, no pages navigated, no DOM inspected), the validation did NOT happen. Treat this as if no browser validation ran — re-run PHASE 1.5 with `--tier 2` to force Playwright, or `--tier 3` for HTTP smoke. A "PARTIALLY VERIFIED" based on reading source code is not browser validation.
|
|
191
|
-
4. Branch on verdict:
|
|
192
|
-
- `PASS` → continue to PHASE 2
|
|
193
|
-
- `PASS WITH ISSUES` → continue to PHASE 2 (evaluator reads browser results as extra context)
|
|
194
|
-
- `PARTIALLY VERIFIED` → continue to PHASE 2, but flag to the evaluator that browser coverage was incomplete — unverified features should be weighted more heavily. This verdict is only valid when features were actually tested in a browser and some couldn't be verified due to environment limitations (missing API keys, external services). It is NOT valid as a substitute for "browser tools didn't work."
|
|
195
|
-
- `NEEDS WORK` → features don't work in the browser. Go to PHASE 2.5 fix loop. Fix agent reads `.devlyn/BROWSER-RESULTS.md` for which criterion failed, at what step, with what error. After fixing, re-run PHASE 1.5 to verify the fix before proceeding to Evaluate.
|
|
196
|
-
- `BLOCKED` → app doesn't render. Go to PHASE 2.5 fix loop. After fixing, re-run PHASE 1.5.
|
|
197
|
-
|
|
198
|
-
## PHASE 2: EVALUATE
|
|
47
|
+
Per-phase (not cumulative) baseline is correct because fix-loop commits between one post-EVAL phase and the next are legitimate.
|
|
199
48
|
|
|
200
|
-
|
|
49
|
+
Doc-file allowlist (DOCS): `*.md`, `*.mdx`, files under `docs/`, `README*`, `CHANGELOG*`, `CLAUDE.md`, frontmatter in spec files under `docs/roadmap/phase-*/`. Any other path triggers revert-and-find.
|
|
50
|
+
</post_eval_invariant>
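The per-phase check above can be illustrated with a small sketch. It takes the output of `git diff --name-only <pre_sha>` as a list of paths and returns the paths that violate the invariant. Matching the glob entries against the file's base name is an assumption (the text does not say how they are anchored), and `is_doc_path` and `violations` are hypothetical names.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath
from typing import Iterable, List

# Allowlist entries from the text, applied to the base name (assumption).
# MDX files are matched as *.mdx.
BASENAME_PATTERNS = ("*.md", "*.mdx", "README*", "CHANGELOG*", "CLAUDE.md")


def is_doc_path(path: str) -> bool:
    """True if the path is a doc file under the allowlist."""
    p = PurePosixPath(path)
    if len(p.parts) > 1 and p.parts[0] == "docs":
        return True  # anything under docs/
    return any(fnmatchcase(p.name, pat) for pat in BASENAME_PATTERNS)


def violations(phase: str, changed: Iterable[str]) -> List[str]:
    """Return changed paths that break the post-EVAL invariant for `phase`."""
    # .devlyn/ artifacts are excluded, as in the ':!.devlyn/**' pathspec.
    paths = [c for c in changed if c and not c.startswith(".devlyn/")]
    if phase == "critic":
        return paths  # findings-only: any diff is a violation
    if phase == "docs":
        return [c for c in paths if not is_doc_path(c)]
    raise ValueError(f"not a post-EVAL phase: {phase}")
```

A non-empty result would be the trigger for the revert-and-find flow; an empty list means the phase respected the invariant.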
|
|
201
51
|
|
|
202
|
-
|
|
52
|
+
<perf_opt_in>
|
|
53
|
+
Optional: pass `--perf` to record per-phase `{wall_ms, tokens, engine, round, triggered_by}` into `state.perf.per_phase` and totals at PHASE 5. Off by default. Harness efficiency claims can be measured when needed; mandatory meta-measurement was retired in v3.4.
|
|
54
|
+
</perf_opt_in>
|
|
203
55
|
|
|
204
|
-
|
|
56
|
+
## PHASE 0: PARSE + PREFLIGHT + ROUTE
|
|
205
57
|
|
|
206
|
-
|
|
58
|
+
1. **Parse flags** from `<pipeline_config>`:
|
|
59
|
+
- `--max-rounds N` (4)
|
|
60
|
+
- `--route MODE` (auto) — per `references/pipeline-routing.md`
|
|
61
|
+
- `--engine MODE` (auto) — per `references/engine-routing.md`
|
|
62
|
+
- `--team` — force team-assembled BUILD even on non-strict routes (default: solo).
|
|
63
|
+
- `--bypass <phase>[,<phase>...]` — skip specific phases. Valid: `build-gate`, `browser`, `critic`, `docs`. Deprecated aliases (`--skip-*`, `--security-review skip`, `--bypass simplify|review|clean|security|challenge`) map to `--bypass critic` where applicable; log `deprecated flag — use --bypass <phase>` once.
|
|
64
|
+
- `--build-gate MODE` (auto) — `auto` / `strict` / `no-docker`.
|
|
65
|
+
- `--perf` — opt in to per-phase timing/token accounting.
|
|
207
66
|
|
|
208
|
-
|
|
209
|
-
Never claim a file:line or assert a behavior you have not opened and read. The done-criteria file is the rubric — read it first. Then read every changed/new file in full before marking anything VERIFIED or FAILED. Findings without a real file:line behind them are speculation; exclude them.
|
|
210
|
-
</investigate_before_answering>
|
|
67
|
+
2. **Engine pre-flight** (unless `--engine claude`): call `mcp__codex-cli__ping`. On failure, silent fallback to `--engine claude`, log `engine downgraded`. Never prompt.
|
|
211
68
|
|
|
212
|
-
|
|
213
|
-
|
|
69
|
+
3. **Initialize `pipeline.state.json`** per `references/pipeline-state.md`:
|
|
70
|
+
- `version: "1.2"`, `run_id: "ar-$(date -u +%Y%m%dT%H%M%SZ)-<12-hex>"`, `started_at`, `engine`, `base_ref.{branch, sha}`, `rounds.max_rounds`, `eval_passed_sha: null`, `route.bypasses: [...]`, empty `phases`, `criteria`, `route.selected`.
|
|
214
71
|
|
|
215
|
-
|
|
216
|
-
|
|
72
|
+
4. **Spec preflight** (if `<pipeline_config>` contains `docs/roadmap/phase-\d+/[^\s"'`)]+\.md`):
|
|
73
|
+
- Read the spec. Missing → `BLOCKED`.
|
|
74
|
+
- Verify internal deps (each entry under `## Dependencies → Internal` resolves to a `status: done` spec). Unmet → `BLOCKED`.
|
|
75
|
+
- Populate `state.source`: `type: "spec"`, `spec_path`, `spec_sha256 = sha256(spec)`, `criteria_anchors: ["spec://requirements", "spec://out-of-scope", "spec://verification", "spec://constraints", "spec://architecture-notes", "spec://dependencies"]`.
|
|
76
|
+
- Populate `state.criteria[]`: one per `- [ ]` in `## Requirements`, `status: pending`.
|
|
217
77
|
|
|
218
|
-
|
|
78
|
+
No spec path found → `source.type: "generated"`, `source.criteria_path: ".devlyn/criteria.generated.md"` (PHASE 1 creates it), `criteria_anchors: ["criteria.generated://requirements", "criteria.generated://out-of-scope", "criteria.generated://verification"]`, `criteria: []`.
|
|
219
79
|
|
|
220
|
-
**
|
|
221
|
-
|
|
222
|
-
**Step 3 — Evaluate**: For each changed file, check:
|
|
223
|
-
- Correctness: logic errors, silent failures, null access, incorrect API contracts
|
|
224
|
-
- Architecture: pattern violations, duplication, missing integration
|
|
225
|
-
- Security (if auth/secrets/user-data touched): injection, hardcoded credentials, missing validation
|
|
226
|
-
- Frontend (if UI changed): missing error/loading/empty states, React anti-patterns, server/client boundaries
|
|
227
|
-
- Test coverage: untested modules, missing edge cases
|
|
228
|
-
|
|
229
|
-
**Step 4 — Grade against done criteria**: For each criterion in done-criteria.md, mark VERIFIED (with evidence) or FAILED (with file:line and what's wrong).
|
|
230
|
-
|
|
231
|
-
**Step 5 — Write findings**: Write `.devlyn/EVAL-FINDINGS.md` with this exact structure:
|
|
80
|
+
5. **Compute Stage A route** per `references/pipeline-routing.md#stage-a`. Write to `state.route.{selected, user_override, stage_a}`.
|
|
232
81
|
|
|
82
|
+
6. **Announce** (single line):
|
|
233
83
|
```
|
|
234
|
-
|
|
235
|
-
|
|
236
|
-
## Done Criteria Results
|
|
237
|
-
- [x] criterion — VERIFIED: evidence
|
|
238
|
-
- [ ] criterion — FAILED: what's wrong, file:line
|
|
239
|
-
## Findings Requiring Action
|
|
240
|
-
### CRITICAL
|
|
241
|
-
- `file:line` — description — Confidence: high/med/low — Fix: suggested approach
|
|
242
|
-
### HIGH
|
|
243
|
-
- `file:line` — description — Confidence: high/med/low — Fix: suggested approach
|
|
244
|
-
### MEDIUM / LOW
|
|
245
|
-
- `file:line` — description — Confidence: high/med/low — Fix: suggested approach
|
|
246
|
-
## Cross-Cutting Patterns
|
|
247
|
-
- pattern description
|
|
84
|
+
Auto-resolve starting — run <run_id> — task: <desc>
|
|
85
|
+
Engine: <engine>, Route: <selected> (<stage_a_reasons>), Bypasses: <bypasses|none>, Max rounds: <N>
|
|
248
86
|
```
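Two of the PHASE 0 steps above lend themselves to a short sketch: building a `run_id` in the `ar-<UTC timestamp>-<12-hex>` shape and detecting a spec path in the pipeline input. How the 12 hex characters are derived is not stated, so random bytes are used here as an assumption. `new_run_id` and `find_spec_path` are hypothetical names.

```python
import re
import secrets
from datetime import datetime, timezone
from typing import Optional

# Pattern quoted in the spec-preflight step.
SPEC_PATH_RE = re.compile(r"docs/roadmap/phase-\d+/[^\s\"'`)]+\.md")


def new_run_id(now: Optional[datetime] = None) -> str:
    """Build 'ar-YYYYMMDDTHHMMSSZ-<12 hex>'; the hex suffix is random (assumption)."""
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    return f"ar-{now.strftime('%Y%m%dT%H%M%SZ')}-{secrets.token_hex(6)}"


def find_spec_path(pipeline_config: str) -> Optional[str]:
    """Return the first roadmap spec path in the input, or None if absent."""
    match = SPEC_PATH_RE.search(pipeline_config)
    return match.group(0) if match else None
```

A `None` result from `find_spec_path` corresponds to the `source.type: "generated"` branch; a match corresponds to `source.type: "spec"`.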
|
|
249
87
|
|
|
250
|
-
|
|
251
|
-
- `BLOCKED` — any CRITICAL issues
|
|
252
|
-
- `NEEDS WORK` — HIGH or MEDIUM issues
|
|
253
|
-
- `PASS WITH ISSUES` — only LOW cosmetic notes
|
|
254
|
-
- `PASS` — clean
|
|
255
|
-
|
|
256
|
-
Findings labeled "pre-existing" or "out of scope" still count if they relate to the done criteria. The goal is working software, not blame attribution.
|
|
88
|
+
## PHASE 1: BUILD
|
|
257
89
|
|
|
258
|
-
|
|
259
|
-
- A catch block that logs but doesn't surface the error to the user → HIGH (not MEDIUM). Logging is not error handling.
|
|
260
|
-
- A `let` that could be `const` → LOW. Linters catch this.
|
|
261
|
-
- "The error handling is generally quite good" is not a finding. Count the instances and name the files. "3 of 7 async ops have error states. 4 are missing: file:line, file:line…"
|
|
90
|
+
**Engine**: BUILD row. Spawn per `<engine_routing_convention>`. Prompt body: **`references/phases/phase-1-build.md`** (verbatim) + task description.
|
|
262
91
|
|
|
263
|
-
|
|
92
|
+
**Team assembly rule** (simplified from v3.2): BUILD spawns as **team** ONLY when `--team` flag passed OR `state.route.selected == "strict"`. Otherwise solo. Keyword-match auto-trigger removed — Claude/Codex base SWE capability is the default.
|
|
264
93
|
|
|
265
94
|
**After the agent completes**:
|
|
266
|
-
1.
|
|
267
|
-
2.
|
|
268
|
-
3.
|
|
269
|
-
- `PASS` → skip to PHASE 3
|
|
270
|
-
- `PASS WITH ISSUES` → go to PHASE 2.5 (fix loop) — LOW-only issues are still issues; fix them
|
|
271
|
-
- `NEEDS WORK` → go to PHASE 2.5 (fix loop)
|
|
272
|
-
- `BLOCKED` → go to PHASE 2.5 (fix loop)
|
|
273
|
-
4. If `.devlyn/EVAL-FINDINGS.md` was not created, treat as NEEDS WORK and log a warning — absence of evidence is not evidence of absence
|
|
274
|
-
|
|
275
|
-
## PHASE 2.5: FIX LOOP (conditional)
|
|
276
|
-
|
|
277
|
-
Track the current round number. If `round >= max-rounds`, stop the loop and proceed to PHASE 3 with a warning that unresolved findings remain.
|
|
95
|
+
1. Verify `criteria[]` has ≥1 entry with `status != "pending"`. If not, re-spawn with reminder.
|
|
96
|
+
2. `git diff --stat` — if no changes, halt with failure.
|
|
97
|
+
3. Checkpoint: `git add -A && git commit -m "chore(pipeline): phase 1 — build complete"`.
|
|
278
98
|
|
|
279
|
-
|
|
280
|
-
|
|
281
|
-
Agent prompt — pass this to the spawned executor:
|
|
282
|
-
|
|
283
|
-
Read every findings file present in `.devlyn/`:
|
|
284
|
-
- `.devlyn/EVAL-FINDINGS.md` — issues from the independent evaluator (PHASE 2)
|
|
285
|
-
- `.devlyn/BROWSER-RESULTS.md` — issues from browser validation (PHASE 1.5), if present and the verdict is `NEEDS WORK` or `BLOCKED`
|
|
286
|
-
|
|
287
|
-
Fix every finding regardless of severity (CRITICAL, HIGH, MEDIUM, and LOW). The pipeline loops until the relevant verdict returns PASS — there is no "shippable with issues" shortcut.
|
|
99
|
+
## PHASE 1.4: BUILD GATE
|
|
288
100
|
|
|
289
|
-
|
|
101
|
+
Skip if `build-gate` in `state.route.bypasses`. Deterministic — same commands CI/Docker/production run.
|
|
290
102
|
|
|
291
|
-
|
|
103
|
+
Spawn Claude `Agent` (`mode: "bypassPermissions"`): "Read `references/build-gate.md` (detection matrix, commands, package manager, monorepo, strict, Docker) and `references/findings-schema.md`. Run all matched gates. Apply strict flags if `--build-gate strict` OR `state.route.selected == "strict"`. Run Docker unless `--build-gate no-docker`. Emit `.devlyn/build_gate.findings.jsonl` + `.devlyn/build_gate.log.md`; update `state.phases.build_gate`."
|
|
292
104
|
|
|
293
105
|
**After the agent completes**:
|
|
294
|
-
1.
|
|
295
|
-
2.
|
|
296
|
-
3.
|
|
297
|
-
- If invoked from PHASE 2 (eval failure) → go back to PHASE 2 to re-evaluate
|
|
298
|
-
- If invoked from PHASE 1.5 (browser failure) → go back to PHASE 1.5 to re-validate the browser, then proceed to PHASE 2 only if browser passes
|
|
299
|
-
|
|
300
|
-
## PHASE 3: SIMPLIFY
|
|
106
|
+
1. Read `state.phases.build_gate.verdict`.
|
|
107
|
+
2. **Stage B LITE** (only if `verdict == "PASS"` AND `state.route.user_override == false`): apply the single escalation rule from `references/pipeline-routing.md#stage-b-lite`. If it fires, write `state.route.stage_b.{at, escalated_from, reasons}`.
|
|
108
|
+
3. Branch: `PASS` → PHASE 1.5; `FAIL` → PHASE 2.5 with `triggered_by: "build_gate"`.
|
|
301
109
|
|
|
302
|
-
|
|
110
|
+
## PHASE 1.5: BROWSER VALIDATE (conditional)
|
|
303
111
|
|
|
304
|
-
|
|
112
|
+
Skip if `browser` in `state.route.bypasses`. Skip if `git diff --name-only <state.base_ref.sha>` has no `*.tsx`, `*.jsx`, `*.vue`, `*.svelte`, `*.css`, `*.html`, `page.*`, `layout.*`, `route.*` matches.
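As a rough illustration of the second skip condition, the sketch below checks a list of changed paths (for example, the output of `git diff --name-only`) against the patterns listed above. The function name `touches_web_files` is hypothetical, and matching on the file name only (rather than the full path) is an assumption; the skill itself does not say how the match is performed.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Patterns copied from the skip rule above.
WEB_PATTERNS = (
    "*.tsx", "*.jsx", "*.vue", "*.svelte", "*.css", "*.html",
    "page.*", "layout.*", "route.*",
)


def touches_web_files(changed_paths):
    """Return True if any changed path has a file name matching a web pattern.

    Assumption: patterns are applied to the final path component only.
    """
    for path in changed_paths:
        name = PurePosixPath(path).name
        if any(fnmatchcase(name, pattern) for pattern in WEB_PATTERNS):
            return True
    return False
```

Under these assumptions, PHASE 1.5 would be skipped whenever `touches_web_files(...)` returns `False` for the diff against `state.base_ref.sha`.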
|
|
305
113
|
|
|
306
|
-
|
|
114
|
+
Spawn Claude `Agent` (`mode: "bypassPermissions"`): "Read `.claude/skills/devlyn:browser-validate/SKILL.md` (tiered Chrome MCP → Playwright → curl) and `references/findings-schema.md`. Start dev server, test the implemented feature end-to-end against `pipeline.state.json:criteria[]`, leave server running (`--keep-server`). Emit `.devlyn/browser_validate.findings.jsonl` + `.devlyn/browser_validate.log.md`; update `state.phases.browser_validate`."
|
|
307
115
|
|
|
308
116
|
**After the agent completes**:
|
|
309
|
-
1. **
|
|
310
|
-
|
|
311
|
-
## PHASE 4: REVIEW (skippable)
|
|
312
|
-
|
|
313
|
-
Skip if `--skip-review` was set.
|
|
314
|
-
|
|
315
|
-
**Engine**: REVIEW (team) — per-role routing per the team-review table in `references/engine-routing.md`. Dual roles run both models in parallel and merge findings.
|
|
316
|
-
|
|
317
|
-
Spawn a subagent using the Agent tool with `mode: "bypassPermissions"`.
|
|
318
|
-
|
|
319
|
-
Agent prompt — pass this to the spawned executor:
|
|
320
|
-
|
|
321
|
-
Review all recent changes in this codebase (use `git diff main` and `git status` to determine scope). Assemble a review team using TeamCreate with specialized reviewers: security reviewer, quality reviewer, test analyst. Add UX reviewer, performance reviewer, or API reviewer based on the changes. Per-role engine routing follows the team-review table in `references/engine-routing.md`; Dual roles run both models in parallel and merge findings.
|
|
322
|
-
|
|
323
|
-
Each reviewer reports findings with file:line evidence grouped by severity (CRITICAL, HIGH, MEDIUM, LOW) and a confidence level. After all reviewers report, synthesize findings, deduplicate, and fix any CRITICAL issues directly. For HIGH issues, fix if straightforward.
|
|
324
|
-
|
|
325
|
-
Clean up the team after completion.
|
|
326
|
-
|
|
327
|
-
**After the review phase completes**:
|
|
328
|
-
1. If CRITICAL issues remain unfixed, log a warning in the final report
|
|
329
|
-
2. **Checkpoint**: Run `git add -A && git commit -m "chore(pipeline): review fixes complete"` if there are changes
|
|
330
|
-
|
|
331
|
-
## PHASE 4.5: CHALLENGE
|
|
332
|
-
|
|
333
|
-
Every prior phase used checklists, done-criteria, or structured categories. This phase is deliberately different — it's a fresh pair of eyes with no checklist, no prior context, and a skeptical mandate. The subagent hasn't seen the done-criteria, the eval findings, or the review results. It reads the raw diff cold and asks: "would I mass-ship this?"
|
|
334
|
-
|
|
335
|
-
This is what catches the things structured reviews miss — subtle logic that technically works but isn't the right approach, assumptions nobody questioned, patterns that are fine but not best-practice, and integration seams that look correct in isolation but feel wrong when you read the whole changeset.
|
|
336
|
-
|
|
337
|
-
**Engine**: CHALLENGE row — Claude on every engine. The diff was likely produced by Codex on `--engine auto`; Claude reading it cold preserves the cross-model dynamic.
|
|
117
|
+
1. **Sanity check**: if verdict is `PASS`/`PASS_WITH_ISSUES` but log shows zero screenshots AND zero navigations, treat as unverified — re-run at `--tier 2`/`3`. Code-level verdict is not browser validation.
|
|
118
|
+
2. Branch: `PASS`/`PASS_WITH_ISSUES`/`PARTIALLY_VERIFIED` → PHASE 2; `NEEDS_WORK`/`BLOCKED` → PHASE 2.5 with `triggered_by: "browser_validate"`.
|
|
338
119
|
|
|
339
|
-
|
|
340
|
-
|
|
341
|
-
Agent prompt — pass this to the spawned executor:
|
|
342
|
-
|
|
343
|
-
You are a senior engineer doing a final skeptical review before this code ships to production. You have not seen any prior reviews, test results, or design docs — read the code cold.
|
|
344
|
-
|
|
345
|
-
<investigate_before_answering>
|
|
346
|
-
Anchor every finding in code you have actually opened. Run `git diff main` for the change surface, then read each changed file in full (not just the hunks — surrounding context matters). Findings without a real file:line and a quote from the code are speculation; exclude them.
|
|
347
|
-
</investigate_before_answering>
|
|
348
|
-
|
|
349
|
-
Your job is not to check boxes. Your job is to find the things that would make a staff engineer say "hold on, let's talk about this before we ship." Think about:
|
|
350
|
-
|
|
351
|
-
- Would this approach survive a 10x traffic spike? A midnight oncall page? A junior dev maintaining it 6 months from now?
|
|
352
|
-
- Are there assumptions baked in that nobody stated out loud? Hardcoded limits, implicit ordering, missing edge cases in business logic?
|
|
353
|
-
- Is the error handling actually helpful, or does it just prevent crashes while leaving the user confused?
|
|
354
|
-
- Are there simpler, more idiomatic ways to do what this code does? Not "clever" alternatives — genuinely better approaches?
|
|
355
|
-
- Would you confidently approve this PR, or would you leave comments?
|
|
356
|
-
|
|
357
|
-
Be direct and concrete. Do not open with praise. Every finding must include `file:line` and a concrete fix — not "consider improving" but "change X to Y because Z."
|
|
358
|
-
|
|
359
|
-
Write `.devlyn/CHALLENGE-FINDINGS.md`:
|
|
360
|
-
|
|
361
|
-
```
|
|
362
|
-
# Challenge Findings
|
|
363
|
-
## Verdict: [PASS / NEEDS WORK]
|
|
364
|
-
## Findings
|
|
365
|
-
### [severity: CRITICAL / HIGH / MEDIUM]
|
|
366
|
-
- `file:line` — what's wrong — Fix: concrete change
|
|
367
|
-
```
|
|
120
|
+
## PHASE 2: EVALUATE
|
|
368
121
|
|
|
369
|
-
|
|
370
|
-
<example index="1">
|
|
371
|
-
GOOD finding (anchored, specific, fixable):
|
|
372
|
-
### CRITICAL
|
|
373
|
-
- `src/api/orders/cancel.ts:42` — `await db.transaction(...)` is missing — the read of `order.status` and the write of `order.status = "cancelled"` are not atomic, so two concurrent cancellations both succeed and the inventory hook fires twice. Fix: wrap the read+write in `db.transaction()` and re-check `order.status === "pending"` inside the transaction before the update.
|
|
374
|
-
</example>
|
|
375
|
-
<example index="2">
|
|
376
|
-
BAD finding (vague, unanchored, not actionable):
|
|
377
|
-
### HIGH
|
|
378
|
-
- The error handling could be improved. Consider being more defensive throughout.
|
|
379
|
-
|
|
380
|
-
Why this is bad: no file:line, no specific failure, no concrete fix. Either delete the finding or replace it with a real one anchored to a specific call site.
|
|
381
|
-
</example>
|
|
382
|
-
<example index="3">
|
|
383
|
-
GOOD finding (idiom / approach issue):
|
|
384
|
-
### MEDIUM
|
|
385
|
-
- `src/components/UserList.tsx:18-34` — fetching `/api/users` inside `useEffect` and managing loading/error state by hand re-implements what the project already does with the `useFetch` hook in `src/hooks/useFetch.ts`. Fix: replace the manual `useState`+`useEffect` with `useFetch('/api/users')` so this list inherits retry, cache, and abort handling.
|
|
386
|
-
</example>
|
|
387
|
-
</examples>
|
|
388
|
-
|
|
389
|
-
Verdict: `PASS` only if you would confidently ship this code with your name on it. If you found anything CRITICAL or HIGH, verdict is `NEEDS WORK`.
|
|
122
|
+
**Engine**: EVAL row — always Claude. Prompt body: **`references/phases/phase-2-evaluate.md`**.
|
|
390
123
|
|
|
391
124
|
**After the agent completes**:
|
|
392
|
-
1. Read `.
|
|
393
|
-
2.
|
|
125
|
+
1. Read `state.phases.evaluate.verdict`.
|
|
126
|
+
2. **First-time PASS or PASS_WITH_ISSUES** with `state.eval_passed_sha == null` → set `state.eval_passed_sha = git rev-parse HEAD` (activates `<post_eval_invariant>`).
|
|
394
127
|
3. Branch:
|
|
395
|
-
- `PASS` →
|
|
396
|
-
- `
|
|
397
|
-
|
|
398
|
-
Read `.devlyn/CHALLENGE-FINDINGS.md` — it contains findings from a fresh skeptical review. Fix every CRITICAL and HIGH finding at the root cause. For MEDIUM findings, fix if straightforward. After fixing, run the test suite to verify nothing broke.
|
|
399
|
-
|
|
400
|
-
After the fix agent completes:
|
|
401
|
-
1. **Checkpoint**: Run `git add -A && git commit -m "chore(pipeline): challenge fixes complete"`
|
|
402
|
-
2. Continue to PHASE 5 (do NOT re-run the challenge — one pass is sufficient to avoid infinite loops)
|
|
128
|
+
- `PASS` → PHASE 3 (CRITIC) per route; `fast` → PHASE 5 (FINAL REPORT).
|
|
129
|
+
- `PASS_WITH_ISSUES` → **terminal for this phase** (LOW-only findings do not re-trigger fix loop). Proceed to next phase.
|
|
130
|
+
- `NEEDS_WORK` / `BLOCKED` → PHASE 2.5 with `triggered_by: "evaluate"`.
|
|
403
131
|
|
|
404
|
-
## PHASE 5:
|
|
132
|
+
## PHASE 2.5: UNIFIED FIX LOOP
|
|
405
133
|
|
|
406
|
-
|
|
407
|
-
- If `--security-review always` → run
|
|
408
|
-
- If `--security-review skip` → skip
|
|
409
|
-
- If `--security-review auto` (default) → auto-detect by scanning changed files for security-sensitive patterns:
|
|
410
|
-
- Run `git diff main --name-only` and check for files matching: `*auth*`, `*login*`, `*session*`, `*token*`, `*secret*`, `*crypt*`, `*password*`, `*api*`, `*middleware*`, `*env*`, `*config*`, `*permission*`, `*role*`, `*access*`
|
|
411
|
-
- Also run `git diff main` and scan for patterns: `API_KEY`, `SECRET`, `TOKEN`, `PASSWORD`, `PRIVATE_KEY`, `Bearer`, `jwt`, `bcrypt`, `crypto`, `env.`, `process.env`
|
|
412
|
-
- If any match → run. If no matches → skip and note "Security review skipped — no security-sensitive changes detected."
|
|
134
|
+
Single fix loop for every trigger (`build_gate` / `browser_validate` / `evaluate` / `critic`). `state.rounds.global` shared counter.
|
|
413
135
|
|
|
414
|
-
|
|
136
|
+
**Exhaustion check first**: if `state.rounds.global >= state.rounds.max_rounds`:
|
|
137
|
+
- `build_gate` / `browser_validate` → **halt** → PHASE 5 with exhaustion banner.
|
|
138
|
+
- `evaluate` / `critic` → **proceed_with_warning** → skip to next phase; final report shows banner.
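The exhaustion rule above can be sketched as a small decision function. This is only an illustration: the function name `exhaustion_action` and the `"continue"` return value are hypothetical labels, while `"halt"` and `"proceed_with_warning"` mirror the two outcomes named in the list.

```python
HALT_TRIGGERS = {"build_gate", "browser_validate"}
WARN_TRIGGERS = {"evaluate", "critic"}


def exhaustion_action(triggered_by, rounds_global, max_rounds):
    """Decide what the fix loop does before starting another round.

    Returns "continue" (hypothetical label) while rounds remain,
    otherwise "halt" or "proceed_with_warning" depending on the trigger.
    """
    if rounds_global < max_rounds:
        return "continue"
    if triggered_by in HALT_TRIGGERS:
        return "halt"
    if triggered_by in WARN_TRIGGERS:
        return "proceed_with_warning"
    raise ValueError(f"unknown trigger: {triggered_by!r}")
```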
|
|
415
139
|
|
|
416
|
-
|
|
417
|
-
|
|
418
|
-
|
|
419
|
-
|
|
420
|
-
|
|
140
|
+
**Fix-batch packet assembly**: read the trigger's `.findings.jsonl`. If `triggered_by` is `"evaluate"` or `"browser_validate"` and browser validation has open findings, also read the browser_validate findings (see pipeline-routing.md). Keep only entries with `status == "open"` and write `.devlyn/fix-batch.round-<N>.json`:
|
|
141
|
+
```json
|
|
142
|
+
{
|
|
143
|
+
"round": <N>, "max_rounds": <N>, "base_ref_sha": "...", "criteria_source": "...",
|
|
144
|
+
"triggered_by": "<trigger>", "findings": [ /* id, rule_id, severity, file, line, message, fix_hint, criterion_ref */ ],
|
|
145
|
+
"failed_criteria": ["<C ids>"], "acceptance": {"build_gate_cmd": "...", "test_cmd": "..."}
|
|
146
|
+
}
|
|
147
|
+
```
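A minimal sketch of the filtering step, assuming each line of a `.findings.jsonl` file is one JSON object with a `status` field. The helper names `open_findings` and `build_fix_batch` are hypothetical, and the packet built here carries only a subset of the fields shown above (`base_ref_sha`, `criteria_source`, `failed_criteria`, and `acceptance` are left out for brevity).

```python
import json


def open_findings(jsonl_text):
    """Parse JSONL text and keep only records whose status is "open"."""
    findings = []
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        if record.get("status") == "open":
            findings.append(record)
    return findings


def build_fix_batch(round_number, max_rounds, triggered_by, jsonl_text):
    """Assemble a partial fix-batch packet from one findings file."""
    return {
        "round": round_number,
        "max_rounds": max_rounds,
        "triggered_by": triggered_by,
        "findings": open_findings(jsonl_text),
    }
```

The resulting dictionary could then be serialized with `json.dump` to `.devlyn/fix-batch.round-<N>.json`.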
|
|
421
148
|
|
|
422
|
-
|
|
423
|
-
2. **Authentication & authorization**: Are new endpoints properly protected? Are auth checks consistent with existing patterns? Any privilege escalation paths?
|
|
424
|
-
3. **Secrets & credentials**: Grep for hardcoded API keys, tokens, passwords, private keys. Check that secrets come from env vars, not source code. Verify .gitignore covers sensitive files.
|
|
425
|
-
4. **Data exposure**: Are error messages leaking internal details? Are logs capturing sensitive data? Are API responses returning more data than needed?
|
|
426
|
-
5. **Dependencies**: If package.json/requirements.txt changed, run the package manager's audit command (npm audit, pip-audit, etc.).
|
|
427
|
-
6. **CSRF/CORS**: For new endpoints with side effects, verify CSRF protection. Check CORS configuration for overly permissive origins.
|
|
149
|
+
**Engine**: FIX LOOP row (Codex on `auto`/`codex`, Claude on `claude`). Fresh Codex call each round (no `sessionId` reuse).
|
|
428
150
|
|
|
429
|
-
|
|
151
|
+
Spawn per `<engine_routing_convention>`. Prompt:
|
|
430
152
|
|
|
431
|
-
|
|
153
|
+
> Read `.devlyn/fix-batch.round-<N>.json` and `pipeline.state.json`.
|
|
154
|
+
>
|
|
155
|
+
> **First, re-ground on the contract.** Open `source.spec_path` (or `source.criteria_path`) and read the sections/anchors referenced by each finding's `criterion_ref`. **Spec/criteria are higher authority than findings** — do not narrow or reinterpret required behavior to satisfy a finding. If a finding hint conflicts with explicit spec text (e.g., a glob/pattern like `**/SKILL.md`, a cardinality, a flag's documented behavior), preserve the spec semantics and fix only the implementation defect. Non-contradictory, backward-compatible enhancements that preserve required default behavior are allowed (e.g., respecting `NO_COLOR` while still defaulting to colored when unset). If a finding **truly contradicts** the spec, halt that finding's fix, log the conflict in `.devlyn/fix-batch.round-<N>.log.md`, and leave the finding `open` — the conflict surfaces in the final report rather than silently narrowing the contract.
|
|
156
|
+
>
|
|
157
|
+
> **Then fix every listed finding at the root cause.** If multiple findings touch the same symbol, produce **one consolidated change**. Prefer editing/replacing existing code over adding new machinery; **do not leave parallel near-duplicate helpers/functions**. When return-shape pressure appears (one finding needs a richer return value than another), broaden the existing helper's return object — don't create a second variant.
|
|
158
|
+
>
|
|
159
|
+
> Read each referenced `file:line`, implement the fix, run tests. No workarounds (`any`, `@ts-ignore`, silent catches, hardcoded values). Raw failure detail: `.devlyn/build_gate.log.md` / `.devlyn/browser_validate.log.md`. When a previously-failed criterion is now satisfied, clear `failed_by_finding_ids`, set `status: "implemented"`, append an `evidence` record.
|
|
432
160
|
|
|
433
161
|
**After the agent completes**:
|
|
434
|
-
1.
|
|
435
|
-
2.
|
|
436
|
-
3.
|
|
437
|
-
|
|
438
|
-
## PHASE 6: CLEAN (skippable)
|
|
439
|
-
|
|
440
|
-
Skip if `--skip-clean` was set.
|
|
441
|
-
|
|
442
|
-
Spawn a subagent using the Agent tool with `mode: "bypassPermissions"`.
|
|
162
|
+
1. Checkpoint: `git add -A && git commit -m "chore(pipeline): fix round <N> (<triggered_by>)"`.
|
|
163
|
+
2. Increment `state.rounds.global`.
|
|
164
|
+
3. Route back: `build_gate` → PHASE 1.4; `browser_validate` → PHASE 1.5; **`evaluate` / `critic` → PHASE 2 (re-EVAL)**. All post-EVAL findings flow back through EVAL.
|
|
165
|
+
4. **After re-EVAL returns PASS/PASS_WITH_ISSUES with `triggered_by == "critic"`**: re-run PHASE 3 CRITIC once before proceeding to DOCS. This verifies the fix didn't introduce new design/security issues the first CRITIC would have caught. Subsequent fix-loop rounds triggered from this re-CRITIC follow the same rule (bounded by `state.rounds.max_rounds`).
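Steps 3 and 4 amount to a lookup plus one extra condition. The sketch below is illustrative only; `ROUTE_BACK`, `next_phase_after_fix`, and `needs_recritic` are hypothetical names, not identifiers from the skill.

```python
# Where the pipeline resumes after a fix round, keyed by trigger (step 3).
ROUTE_BACK = {
    "build_gate": "PHASE 1.4",
    "browser_validate": "PHASE 1.5",
    "evaluate": "PHASE 2",
    "critic": "PHASE 2",  # critic findings flow back through EVAL first
}


def next_phase_after_fix(triggered_by):
    """Return the phase to re-enter after a fix round."""
    return ROUTE_BACK[triggered_by]


def needs_recritic(triggered_by, reeval_verdict):
    """Step 4: re-run CRITIC once when a critic-triggered fix passes re-EVAL."""
    return triggered_by == "critic" and reeval_verdict in {"PASS", "PASS_WITH_ISSUES"}
```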
|
|
443
166
|
|
|
444
|
-
|
|
167
|
+
## PHASE 3: CRITIC (findings-only, route-gated)
|
|
445
168
|
|
|
446
|
-
|
|
169
|
+
Skip if `state.route.selected == "fast"` OR `critic` in `state.route.bypasses`.
|
|
447
170
|
|
|
448
|
-
|
|
449
|
-
|
|
171
|
+
One post-EVAL critic pass with two sub-concerns:
|
|
172
|
+
- **Design sub-pass** — "would a staff engineer block this PR?" (cold read, any finding → `NEEDS_WORK`). Always Claude.
|
|
173
|
+
- **Security sub-pass** — OWASP-style audit with mandatory dependency audit when any dep manifest OR lockfile changed (`package.json`, `requirements.txt`, `package-lock.json`, `pnpm-lock.yaml`, `yarn.lock`, `Pipfile.lock`, `poetry.lock`, `Cargo.toml`, `Cargo.lock`, `go.mod`, `go.sum`). On `--engine auto`: **Dual** (Claude + Codex parallel, merged). On others: single model per route.
|
|
450
174
|
|
|
451
|
-
|
|
175
|
+
Hygiene concerns (unused imports, dead code) live in EVAL's `hygiene.*` findings at LOW severity, not a separate sub-pass here.
|
|
452
176
|
|
|
453
|
-
|
|
177
|
+
**Before spawn**: capture `phase_pre_sha = git rev-parse HEAD` → `state.phases.critic.pre_sha`.
|
|
454
178
|
|
|
455
|
-
Spawn
|
|
179
|
+
**Spawn**: per `<engine_routing_convention>`. Prompt body: **`references/phases/phase-3-critic.md`**.
|
|
456
180
|
|
|
457
|
-
|
|
181
|
+
**After the agent completes**:
|
|
182
|
+
1. Enforce `<post_eval_invariant>`: `git diff --name-only <phase_pre_sha> -- ':!.devlyn/**'` — non-empty → revert + emit invariant finding + route to fix loop.
|
|
183
|
+
2. Read `state.phases.critic.verdict` (WORSE of design/security sub-verdicts):
|
|
184
|
+
- `PASS` → PHASE 4.
|
|
185
|
+
- `PASS_WITH_ISSUES` (security LOW only; design must be zero) → terminal; PHASE 4.
|
|
186
|
+
- `NEEDS_WORK` / `BLOCKED` → PHASE 2.5 with `triggered_by: "critic"`.
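The "WORSE of design/security sub-verdicts" rule in step 2 can be sketched as a maximum over a severity ordering. The ordering used here (`PASS` < `PASS_WITH_ISSUES` < `NEEDS_WORK` < `BLOCKED`) is an assumption inferred from the order in which the verdicts are listed elsewhere in this skill; `worse_verdict` is a hypothetical name.

```python
# Assumed ordering, from least to most severe.
SEVERITY_ORDER = ("PASS", "PASS_WITH_ISSUES", "NEEDS_WORK", "BLOCKED")


def worse_verdict(design, security):
    """Return whichever sub-verdict ranks later in SEVERITY_ORDER."""
    return max(design, security, key=SEVERITY_ORDER.index)
```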
|
|
458
187
|
|
|
459
|
-
|
|
188
|
+
## PHASE 4: DOCS (doc-file mutations only)
|
|
460
189
|
|
|
461
|
-
|
|
190
|
+
Skip if `docs` in `state.route.bypasses` OR `state.route.selected == "fast"`.
|
|
462
191
|
|
|
463
|
-
|
|
192
|
+
Spawn Claude `Agent` (`mode: "bypassPermissions"`). Include original task description. Prompt: "Two jobs:
|
|
464
193
|
|
|
465
|
-
|
|
466
|
-
|
|
467
|
-
|
|
468
|
-
|
|
469
|
-
|
|
470
|
-
- Do not change any other fields, and do not touch the body of the spec.
|
|
471
|
-
4. **Update `docs/ROADMAP.md`.** Find the row whose `#` column matches the spec's `id` (e.g., row starting `| 2.3 |`). Change its Status column to `Done`. Do not touch any other row, and do not reformat the table.
|
|
472
|
-
5. **Check whether the phase is now fully Done.** Read every row of the phase's table (the one containing the just-flipped row). If every row's Status is `Done`, archive the phase:
|
|
473
|
-
- Cut the phase's `## Phase N: …` heading and table out of the active section of ROADMAP.md.
|
|
474
|
-
- If no `## Completed` section exists at the bottom of the file, create one just above end-of-file (below Decisions if Decisions exists).
|
|
475
|
-
- Add a `<details>` block for the phase inside Completed, using the format defined in the devlyn:ideate skill's Context Archiving section. Pull each item's completion date from its spec file's `completed:` frontmatter; if a spec has none, use today's date.
|
|
476
|
-
- Item spec files stay on disk — do not delete them. Only the index row moves.
|
|
477
|
-
6. **Report.** In your summary, say explicitly what you did: "Flipped spec 2.3 to done, updated ROADMAP.md row." And if applicable: "Phase 2 was fully Done — archived to Completed block."
|
|
194
|
+
**Job 1 — Roadmap sync**: if task matched `docs/roadmap/phase-\d+/[^\s\"']+\.md` and `git diff <state.base_ref.sha> --stat` touches non-doc files:
|
|
195
|
+
1. Read the spec. If `status: done` already, skip to Job 2.
|
|
196
|
+
2. Set `status: done` + `completed: <today>` in frontmatter. Do not touch body.
|
|
197
|
+
3. Update `docs/ROADMAP.md`: find row matching spec id; change Status to `Done`.
|
|
198
|
+
4. If phase now fully Done: archive to `## Completed <details>` block at bottom (format per `devlyn:ideate#context-archiving`). Item spec files stay on disk.
|
|
478
199
|
|
|
479
|
-
**
|
|
480
|
-
- Never flip a spec to `done` without a non-empty `git diff` touching non-doc files.
|
|
481
|
-
- Never flip multiple specs in one run — one task, one spec.
|
|
482
|
-
- Never edit a row whose `#` doesn't exactly match the spec's `id`.
|
|
483
|
-
- Never delete spec files.
|
|
200
|
+
**Job 2 — General doc sync**: update docs referencing changed APIs/features/behaviors. Use `git log --oneline -20` + `git diff <state.base_ref.sha>`. Preserve forward-looking content.
|
|
484
201
|
|
|
485
|
-
**
|
|
202
|
+
**Safety**: never flip a spec `done` without a non-empty non-doc diff; never flip multiple specs in one run; never touch files outside the doc-file allowlist."
|
|
486
203
|
|
|
487
|
-
|
|
204
|
+
**Before spawn**: capture `phase_pre_sha = git rev-parse HEAD` → `state.phases.docs.pre_sha`.
|
|
488
205
|
|
|
489
206
|
**After the agent completes**:
|
|
490
|
-
1.
|
|
491
|
-
|
|
492
|
-
## PHASE 8: FINAL REPORT
|
|
207
|
+
1. Enforce allowlist: `git diff --name-only <phase_pre_sha> -- ':!.devlyn/**'` — any non-allowlisted path → revert + emit `invariant.post-eval-code-mutation` + route to PHASE 2.5 with `triggered_by: "docs"`.
|
|
208
|
+
2. If allowlist honored and diff non-empty: `git add -A && git commit -m "chore(pipeline): docs updated"`.
|
|
493
209
|
|
|
494
|
-
|
|
210
|
+
## PHASE 5: FINAL REPORT + ARCHIVE
|
|
495
211
|
|
|
496
|
-
1.
|
|
497
|
-
- Delete the `.devlyn/` directory entirely (contains done-criteria.md, BUILD-GATE.md, EVAL-FINDINGS.md, BROWSER-RESULTS.md, CHALLENGE-FINDINGS.md, screenshots/, playwright temp files)
|
|
498
|
-
- Kill any dev server process still running from browser validation
|
|
499
|
-
|
|
500
|
-
2. Run `git log --oneline -10` to show commits made during the pipeline
|
|
501
|
-
|
|
502
|
-
3. Present the report:
|
|
212
|
+
1. **Terminal verdict**: run `python3 scripts/terminal_verdict.py` (implements the precedence in `references/pipeline-routing.md#terminal-state-algorithm`; prints verdict, exits 0/1/2/3 for PASS/PASS_WITH_ISSUES/NEEDS_WORK/BLOCKED).
|
|
503
213
|
|
|
214
|
+
2. **Render report**:
|
|
504
215
|
```
|
|
505
|
-
### Auto-Resolve
|
|
506
|
-
|
|
507
|
-
|
|
508
|
-
|
|
509
|
-
|
|
510
|
-
|
|
511
|
-
|
|
512
|
-
|
|
513
|
-
|
|
514
|
-
|
|
515
|
-
|
|
516
|
-
|
|
517
|
-
|
|
|
518
|
-
|
|
519
|
-
|
|
|
520
|
-
|
|
|
521
|
-
|
|
|
522
|
-
|
|
|
523
|
-
|
|
|
524
|
-
|
|
525
|
-
|
|
526
|
-
|
|
527
|
-
|
|
528
|
-
|
|
529
|
-
|
|
530
|
-
|
|
531
|
-
|
|
532
|
-
|
|
533
|
-
|
|
534
|
-
-
|
|
535
|
-
-
|
|
216
|
+
### Auto-Resolve Complete — run <run_id>
|
|
217
|
+
|
|
218
|
+
Task: <original task>
|
|
219
|
+
Engine: <engine> (downgraded: <reason or no>)
|
|
220
|
+
Route: <selected> (user_override: <t/f>)
|
|
221
|
+
Stage A: <reasons>
|
|
222
|
+
Stage B LITE: <no escalation | escalated from X — reason>
|
|
223
|
+
|
|
224
|
+
Terminal verdict: <PASS / PASS_WITH_ISSUES / NEEDS_WORK / BLOCKED>
|
|
225
|
+
<banner if applicable: "⚠ BUILD GATE EXHAUSTED" / "⚠ EVAL EXHAUSTED — open findings: <list file:line>" />
|
|
226
|
+
|
|
227
|
+
Pipeline summary:
|
|
228
|
+
| Phase | Verdict | Notes |
|
|
229
|
+
|-------|---------|-------|
|
|
230
|
+
| BUILD | <v> | <engine, solo/team> |
|
|
231
|
+
| BUILD GATE | <v> | <project types, commands> |
|
|
232
|
+
| BROWSER | <v / skipped — no web> | <tier, flow> |
|
|
233
|
+
| EVAL (round <N>) | <v> | <finding count by severity> |
|
|
234
|
+
| FIX ROUNDS | <N of max> | <triggered_by history> |
|
|
235
|
+
| CRITIC | <v / skipped-route / skipped-bypass> | <design: N, security: N, dep-audit: ran/skipped> |
|
|
236
|
+
| DOCS | <completed / skipped> | <specs flipped, roadmap archived> |
|
|
237
|
+
|
|
238
|
+
Guardrails bypassed: <state.route.bypasses or "none">
|
|
239
|
+
|
|
240
|
+
Commits: <git log --oneline from state.base_ref.sha>
|
|
241
|
+
|
|
242
|
+
Audit trail: .devlyn/runs/<run_id>/
|
|
243
|
+
|
|
244
|
+
Next steps:
|
|
245
|
+
- Review: git diff <base_ref.sha>
|
|
246
|
+
- Squash: git rebase -i <base_ref.sha>
|
|
247
|
+
- Re-run fixes: /devlyn:auto-resolve "<narrower task>"
|
|
536
248
|
```
|
|
537
249
|
|
|
538
|
-
|
|
250
|
+
3. **Archive**: run `python3 scripts/archive_run.py` (implements `references/pipeline-state.md#archive-contract`; moves per-run artifacts into `.devlyn/runs/<run_id>/`, best-effort prunes to last 10 completed runs).
|
|
251
|
+
|
|
252
|
+
4. Kill dev server from PHASE 1.5 if still running.
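For step 1, a caller that only sees the process exit status could map it back to a verdict as sketched below. The mapping follows the 0/1/2/3 contract stated in step 1; the function name `verdict_from_exit_code` is hypothetical, and treating any other code as an error is an assumption.

```python
# Exit-code contract from step 1: 0/1/2/3 for the four terminal verdicts.
EXIT_CODE_TO_VERDICT = {
    0: "PASS",
    1: "PASS_WITH_ISSUES",
    2: "NEEDS_WORK",
    3: "BLOCKED",
}


def verdict_from_exit_code(code):
    """Translate an exit status into a verdict string.

    Assumption: any code outside 0..3 means the script itself failed.
    """
    try:
        return EXIT_CODE_TO_VERDICT[code]
    except KeyError:
        raise ValueError(f"unexpected exit code: {code}") from None
```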
|