devlyn-cli 1.15.0 → 2.1.0
- package/AGENTS.md +104 -0
- package/CLAUDE.md +135 -21
- package/README.md +43 -125
- package/benchmark/auto-resolve/BENCHMARK-DESIGN.md +272 -0
- package/benchmark/auto-resolve/README.md +114 -0
- package/benchmark/auto-resolve/RUBRIC.md +162 -0
- package/benchmark/auto-resolve/fixtures/F1-cli-trivial-flag/NOTES.md +30 -0
- package/benchmark/auto-resolve/fixtures/F1-cli-trivial-flag/expected.json +68 -0
- package/benchmark/auto-resolve/fixtures/F1-cli-trivial-flag/metadata.json +10 -0
- package/benchmark/auto-resolve/fixtures/F1-cli-trivial-flag/setup.sh +4 -0
- package/benchmark/auto-resolve/fixtures/F1-cli-trivial-flag/spec.md +45 -0
- package/benchmark/auto-resolve/fixtures/F1-cli-trivial-flag/task.txt +8 -0
- package/benchmark/auto-resolve/fixtures/F2-cli-medium-subcommand/NOTES.md +54 -0
- package/benchmark/auto-resolve/fixtures/F2-cli-medium-subcommand/expected-pair-plan-registry.json +170 -0
- package/benchmark/auto-resolve/fixtures/F2-cli-medium-subcommand/expected.json +84 -0
- package/benchmark/auto-resolve/fixtures/F2-cli-medium-subcommand/metadata.json +21 -0
- package/benchmark/auto-resolve/fixtures/F2-cli-medium-subcommand/pair-plan.sample-fail.json +214 -0
- package/benchmark/auto-resolve/fixtures/F2-cli-medium-subcommand/pair-plan.sample-pass.json +223 -0
- package/benchmark/auto-resolve/fixtures/F2-cli-medium-subcommand/setup.sh +5 -0
- package/benchmark/auto-resolve/fixtures/F2-cli-medium-subcommand/spec.md +56 -0
- package/benchmark/auto-resolve/fixtures/F2-cli-medium-subcommand/task.txt +14 -0
- package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/NOTES.md +28 -0
- package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/expected-pair-plan-registry.json +162 -0
- package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/expected.json +65 -0
- package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/metadata.json +19 -0
- package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/setup.sh +4 -0
- package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/spec.md +56 -0
- package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/task.txt +9 -0
- package/benchmark/auto-resolve/fixtures/F4-web-browser-design/NOTES.md +40 -0
- package/benchmark/auto-resolve/fixtures/F4-web-browser-design/expected.json +57 -0
- package/benchmark/auto-resolve/fixtures/F4-web-browser-design/metadata.json +10 -0
- package/benchmark/auto-resolve/fixtures/F4-web-browser-design/setup.sh +6 -0
- package/benchmark/auto-resolve/fixtures/F4-web-browser-design/spec.md +49 -0
- package/benchmark/auto-resolve/fixtures/F4-web-browser-design/task.txt +9 -0
- package/benchmark/auto-resolve/fixtures/F5-fix-loop-red-green/NOTES.md +38 -0
- package/benchmark/auto-resolve/fixtures/F5-fix-loop-red-green/expected.json +65 -0
- package/benchmark/auto-resolve/fixtures/F5-fix-loop-red-green/metadata.json +10 -0
- package/benchmark/auto-resolve/fixtures/F5-fix-loop-red-green/setup.sh +55 -0
- package/benchmark/auto-resolve/fixtures/F5-fix-loop-red-green/spec.md +49 -0
- package/benchmark/auto-resolve/fixtures/F5-fix-loop-red-green/task.txt +7 -0
- package/benchmark/auto-resolve/fixtures/F6-dep-audit-native-module/NOTES.md +38 -0
- package/benchmark/auto-resolve/fixtures/F6-dep-audit-native-module/expected.json +77 -0
- package/benchmark/auto-resolve/fixtures/F6-dep-audit-native-module/metadata.json +10 -0
- package/benchmark/auto-resolve/fixtures/F6-dep-audit-native-module/setup.sh +4 -0
- package/benchmark/auto-resolve/fixtures/F6-dep-audit-native-module/spec.md +49 -0
- package/benchmark/auto-resolve/fixtures/F6-dep-audit-native-module/task.txt +10 -0
- package/benchmark/auto-resolve/fixtures/F7-out-of-scope-trap/NOTES.md +50 -0
- package/benchmark/auto-resolve/fixtures/F7-out-of-scope-trap/expected.json +76 -0
- package/benchmark/auto-resolve/fixtures/F7-out-of-scope-trap/metadata.json +10 -0
- package/benchmark/auto-resolve/fixtures/F7-out-of-scope-trap/setup.sh +36 -0
- package/benchmark/auto-resolve/fixtures/F7-out-of-scope-trap/spec.md +46 -0
- package/benchmark/auto-resolve/fixtures/F7-out-of-scope-trap/task.txt +7 -0
- package/benchmark/auto-resolve/fixtures/F8-known-limit-ambiguous/NOTES.md +50 -0
- package/benchmark/auto-resolve/fixtures/F8-known-limit-ambiguous/expected.json +63 -0
- package/benchmark/auto-resolve/fixtures/F8-known-limit-ambiguous/metadata.json +10 -0
- package/benchmark/auto-resolve/fixtures/F8-known-limit-ambiguous/setup.sh +4 -0
- package/benchmark/auto-resolve/fixtures/F8-known-limit-ambiguous/spec.md +48 -0
- package/benchmark/auto-resolve/fixtures/F8-known-limit-ambiguous/task.txt +1 -0
- package/benchmark/auto-resolve/fixtures/F9-e2e-ideate-to-resolve/NOTES.md +93 -0
- package/benchmark/auto-resolve/fixtures/F9-e2e-ideate-to-resolve/expected.json +74 -0
- package/benchmark/auto-resolve/fixtures/F9-e2e-ideate-to-resolve/metadata.json +10 -0
- package/benchmark/auto-resolve/fixtures/F9-e2e-ideate-to-resolve/setup.sh +28 -0
- package/benchmark/auto-resolve/fixtures/F9-e2e-ideate-to-resolve/spec.md +62 -0
- package/benchmark/auto-resolve/fixtures/F9-e2e-ideate-to-resolve/task.txt +5 -0
- package/benchmark/auto-resolve/fixtures/SCHEMA.md +130 -0
- package/benchmark/auto-resolve/fixtures/test-repo/README.md +27 -0
- package/benchmark/auto-resolve/fixtures/test-repo/bin/cli.js +63 -0
- package/benchmark/auto-resolve/fixtures/test-repo/package-lock.json +823 -0
- package/benchmark/auto-resolve/fixtures/test-repo/package.json +22 -0
- package/benchmark/auto-resolve/fixtures/test-repo/playwright.config.js +17 -0
- package/benchmark/auto-resolve/fixtures/test-repo/server/index.js +37 -0
- package/benchmark/auto-resolve/fixtures/test-repo/tests/cli.test.js +25 -0
- package/benchmark/auto-resolve/fixtures/test-repo/tests/server.test.js +58 -0
- package/benchmark/auto-resolve/fixtures/test-repo/web/index.html +37 -0
- package/benchmark/auto-resolve/scripts/build-pair-eligible-manifest.py +174 -0
- package/benchmark/auto-resolve/scripts/check-f9-artifacts.py +256 -0
- package/benchmark/auto-resolve/scripts/compile-report.py +331 -0
- package/benchmark/auto-resolve/scripts/iter-0033c-compare.py +552 -0
- package/benchmark/auto-resolve/scripts/judge-opus-pass.sh +430 -0
- package/benchmark/auto-resolve/scripts/judge.sh +359 -0
- package/benchmark/auto-resolve/scripts/oracle-scope-tier-a.py +260 -0
- package/benchmark/auto-resolve/scripts/oracle-scope-tier-b.py +274 -0
- package/benchmark/auto-resolve/scripts/oracle-test-fidelity.py +328 -0
- package/benchmark/auto-resolve/scripts/pair-plan-idgen.py +401 -0
- package/benchmark/auto-resolve/scripts/pair-plan-lint.py +468 -0
- package/benchmark/auto-resolve/scripts/run-fixture.sh +691 -0
- package/benchmark/auto-resolve/scripts/run-iter-0033c.sh +234 -0
- package/benchmark/auto-resolve/scripts/run-suite.sh +214 -0
- package/benchmark/auto-resolve/scripts/ship-gate.py +222 -0
- package/bin/devlyn.js +175 -17
- package/config/skills/_shared/adapters/README.md +64 -0
- package/config/skills/_shared/adapters/gpt-5-5.md +29 -0
- package/config/skills/_shared/adapters/opus-4-7.md +29 -0
- package/config/skills/{devlyn:auto-resolve/scripts → _shared}/archive_run.py +26 -0
- package/config/skills/_shared/codex-config.md +54 -0
- package/config/skills/_shared/codex-monitored.sh +141 -0
- package/config/skills/_shared/engine-preflight.md +35 -0
- package/config/skills/_shared/expected.schema.json +93 -0
- package/config/skills/_shared/pair-plan-schema.md +298 -0
- package/config/skills/_shared/runtime-principles.md +110 -0
- package/config/skills/_shared/spec-verify-check.py +519 -0
- package/config/skills/devlyn:ideate/SKILL.md +99 -429
- package/config/skills/devlyn:ideate/references/elicitation.md +97 -0
- package/config/skills/devlyn:ideate/references/from-spec-mode.md +54 -0
- package/config/skills/devlyn:ideate/references/project-mode.md +76 -0
- package/config/skills/devlyn:ideate/references/spec-template.md +102 -0
- package/config/skills/devlyn:resolve/SKILL.md +172 -184
- package/config/skills/devlyn:resolve/references/free-form-mode.md +68 -0
- package/config/skills/devlyn:resolve/references/phases/build-gate.md +45 -0
- package/config/skills/devlyn:resolve/references/phases/cleanup.md +39 -0
- package/config/skills/devlyn:resolve/references/phases/implement.md +42 -0
- package/config/skills/devlyn:resolve/references/phases/plan.md +42 -0
- package/config/skills/devlyn:resolve/references/phases/verify.md +69 -0
- package/config/skills/devlyn:resolve/references/state-schema.md +106 -0
- package/{config/skills → optional-skills}/devlyn:design-system/SKILL.md +1 -0
- package/{config/skills → optional-skills}/devlyn:reap/SKILL.md +1 -0
- package/{config/skills → optional-skills}/devlyn:team-design-ui/SKILL.md +5 -0
- package/package.json +12 -2
- package/scripts/lint-skills.sh +431 -0
- package/config/skills/devlyn:auto-resolve/SKILL.md +0 -252
- package/config/skills/devlyn:auto-resolve/evals/evals.json +0 -21
- package/config/skills/devlyn:auto-resolve/evals/task-doctor-subcommand.md +0 -42
- package/config/skills/devlyn:auto-resolve/references/build-gate.md +0 -130
- package/config/skills/devlyn:auto-resolve/references/engine-routing.md +0 -82
- package/config/skills/devlyn:auto-resolve/references/findings-schema.md +0 -103
- package/config/skills/devlyn:auto-resolve/references/phases/phase-1-build.md +0 -54
- package/config/skills/devlyn:auto-resolve/references/phases/phase-2-evaluate.md +0 -45
- package/config/skills/devlyn:auto-resolve/references/phases/phase-3-critic.md +0 -84
- package/config/skills/devlyn:auto-resolve/references/pipeline-routing.md +0 -114
- package/config/skills/devlyn:auto-resolve/references/pipeline-state.md +0 -201
- package/config/skills/devlyn:auto-resolve/scripts/terminal_verdict.py +0 -96
- package/config/skills/devlyn:browser-validate/SKILL.md +0 -164
- package/config/skills/devlyn:browser-validate/references/flow-testing.md +0 -118
- package/config/skills/devlyn:browser-validate/references/tier1-chrome.md +0 -137
- package/config/skills/devlyn:browser-validate/references/tier2-playwright.md +0 -195
- package/config/skills/devlyn:browser-validate/references/tier3-curl.md +0 -57
- package/config/skills/devlyn:clean/SKILL.md +0 -285
- package/config/skills/devlyn:design-ui/SKILL.md +0 -351
- package/config/skills/devlyn:discover-product/SKILL.md +0 -124
- package/config/skills/devlyn:evaluate/SKILL.md +0 -564
- package/config/skills/devlyn:feature-spec/SKILL.md +0 -630
- package/config/skills/devlyn:ideate/references/challenge-rubric.md +0 -122
- package/config/skills/devlyn:ideate/references/codex-critic-template.md +0 -42
- package/config/skills/devlyn:ideate/references/templates/item-spec.md +0 -90
- package/config/skills/devlyn:implement-ui/SKILL.md +0 -466
- package/config/skills/devlyn:preflight/SKILL.md +0 -355
- package/config/skills/devlyn:preflight/references/auditors/browser-auditor.md +0 -32
- package/config/skills/devlyn:preflight/references/auditors/code-auditor.md +0 -86
- package/config/skills/devlyn:preflight/references/auditors/docs-auditor.md +0 -38
- package/config/skills/devlyn:product-spec/SKILL.md +0 -603
- package/config/skills/devlyn:recommend-features/SKILL.md +0 -286
- package/config/skills/devlyn:review/SKILL.md +0 -161
- package/config/skills/devlyn:team-resolve/SKILL.md +0 -631
- package/config/skills/devlyn:team-review/SKILL.md +0 -493
- package/config/skills/devlyn:update-docs/SKILL.md +0 -463
- package/config/skills/workflow-routing/SKILL.md +0 -73
- package/{config/skills → optional-skills}/devlyn:reap/scripts/reap.sh +0 -0
- package/{config/skills → optional-skills}/devlyn:reap/scripts/scan.sh +0 -0

@@ -1,252 +0,0 @@

---
name: devlyn:auto-resolve
description: Fully automated build-evaluate-ship pipeline for any task type — bug fixes, new features, refactors, chores. Use this as the default starting point when the user wants hands-free implementation with zero human intervention. Runs a minimal goal-driven loop — build, evaluate, fix, critic, docs — as a single command. Use when the user says "auto resolve", "build this", "implement this feature", "fix this", "run the full pipeline", "refactor this", or wants to walk away and come back to finished work.
---

Orchestrator for the hands-free implementation pipeline. One subagent per phase, file-based handoff, unified fix loop on evaluation feedback until the work passes or `max_rounds` is reached. The orchestrator itself does not write code — it parses input, spawns phases, reads handoff artifacts, runs git commands, branches on verdicts, and emits the final report.

<pipeline_config>
$ARGUMENTS
</pipeline_config>

<orchestrator_context>
Long-horizon agentic work. Context auto-compacts — do not stop early on token-budget concerns. All durable state lives in `.devlyn/pipeline.state.json` (control plane: pointers, criteria, verdicts) plus `<phase>.findings.jsonl` + `<phase>.log.md` for phases that emit findings. `pipeline.state.json` is the **single authoritative verdict source** — branch on `phases.<name>.verdict` directly, never parse artifact files. At PHASE 5, the run's `.devlyn/*` artifacts are **archived** to `.devlyn/runs/<run_id>/` (last 10 kept, best-effort). Schemas: `references/pipeline-state.md`, `references/findings-schema.md`. Best results with `xhigh` reasoning.
</orchestrator_context>

<autonomy_contract>
This pipeline runs hands-free; its success is measured by how far it gets without human intervention.

1. **Never prompt the user mid-pipeline.** When you'd otherwise ask, pick the safe default, proceed, and log it in the final report.
2. **Codex availability**: on `--engine auto`/`codex`, call `mcp__codex-cli__ping` at the start. On failure, silently fall back to `--engine claude` and log `engine downgraded: codex-ping failed` in the final report. Do NOT present a menu. Do NOT abort.
3. **Run only the phases defined below, in order.** Doc updates belong in PHASE 4 (DOCS). Don't insert them earlier.
4. **Delegate all file changes to spawned subagents.** Orchestrator actions: parse input, spawn phase agents, read handoff files, run `git`, branch on verdicts, emit report, archive.
5. **Continue by default.** Stop only for: (a) unrecoverable subagent failure, (b) PHASE 1 producing zero code changes, (c) build-gate / browser fix-loop exhausting `max_rounds` (halt → FINAL REPORT). EVAL/CRITIC exhaustion proceeds with warning — never halts.
</autonomy_contract>

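The engine fallback in item 2 can be sketched as a small pure function. This is a minimal Python sketch, not the orchestrator's actual code. `ping` is a hypothetical callable standing in for the `mcp__codex-cli__ping` tool call, and `report` is an assumed list that collects final-report notes.

```python
from typing import Callable, List


def resolve_engine(requested: str, ping: Callable[[], bool], report: List[str]) -> str:
    """Return the engine to run with; never prompts, never aborts.

    `ping` is a hypothetical stand-in for the mcp__codex-cli__ping tool call.
    """
    if requested == "claude":
        return "claude"  # no Codex pre-flight needed
    try:
        reachable = ping()
    except Exception:
        reachable = False  # a failing tool call counts as "ping failed"; it is logged below
    if reachable:
        return requested
    report.append("engine downgraded: codex-ping failed")
    return "claude"
```

For example, `resolve_engine("auto", lambda: False, notes)` returns `"claude"` and leaves the downgrade note in `notes` for the final report.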
<harness_principles>
Goal-first. Verify state, source integrity, diff base, artifact contracts. Prefer deletion or reuse over new machinery. Change only files the task requires. Each phase optimizes for its declared success criteria, not a checklist. Fix root causes only — no `any`, `@ts-ignore`, silent catches, hardcoded values. Label hypotheses explicitly; back claims with file:line evidence.
</harness_principles>

<engine_routing_convention>
Every phase routes to the optimal model per `references/engine-routing.md`:

- Phase prompt bodies (in `references/phases/`) are engine-agnostic.
- Phases routed to **Codex**: call `mcp__codex-cli__codex` per spawn patterns in `engine-routing.md`.
- Phases routed to **Claude**: spawn an `Agent` subagent with `mode: "bypassPermissions"`, passing the phase body verbatim.
- **Dual** (CRITIC security sub-pass on `--engine auto`): spawn both in parallel; orchestrator merges findings.
- `--engine claude` forces all phases to Claude. `--engine codex` forces implementation phases to Codex, while orchestration and Chrome MCP stay on Claude. `--engine auto` (default) uses the routing table.
</engine_routing_convention>

<post_eval_invariant>
Once `state.eval_passed_sha` is non-null (PHASE 2 returned PASS or PASS_WITH_ISSUES), the post-EVAL phases (CRITIC, DOCS) run **findings-only / doc-only** — they never write code. DOCS is the only post-EVAL phase allowed to commit, and only for doc files.

**Orchestrator enforcement (per-phase, NOT cumulative)**: before each post-EVAL phase, capture `state.phases.<phase>.pre_sha = git rev-parse HEAD`. After the subagent completes, run `git diff --name-only <pre_sha> -- ':!.devlyn/**'`:
- CRITIC (findings-only) → any diff → `git reset --hard <pre_sha>`, emit `rule_id: "invariant.post-eval-code-mutation"` + `severity: HIGH` into `.devlyn/invariant.findings.jsonl`, route to FIX LOOP with `triggered_by: "critic"`.
- DOCS → check against allowlist; non-allowlisted paths trigger the revert-and-find flow.

Per-phase (not cumulative) baseline is correct because fix-loop commits between one post-EVAL phase and the next are legitimate.

Doc-file allowlist (DOCS): `*.md`, `*.mdx`, files under `docs/`, `README*`, `CHANGELOG*`, `CLAUDE.md`, frontmatter in spec files under `docs/roadmap/phase-*/`. Any other path triggers revert-and-find.
</post_eval_invariant>

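The per-phase check can be sketched in Python. The sketch assumes the caller has already run `git diff --name-only <pre_sha> -- ':!.devlyn/**'` and split its output into paths. `is_doc_path` and `invariant_violations` are hypothetical names.

The sketch checks paths only, so it cannot tell whether a spec edit stayed inside the frontmatter. It also inherits a gap from its input: `git diff --name-only` does not list untracked files, so a subagent that only creates new files would pass a check fed by it alone. `git status --porcelain` reports those files too.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath
from typing import Iterable, List

# Basename patterns from the doc-file allowlist (CLAUDE.md is also covered by *.md).
DOC_NAME_PATTERNS = ("*.md", "*.mdx", "README*", "CHANGELOG*", "CLAUDE.md")


def is_doc_path(path: str) -> bool:
    p = PurePosixPath(path)
    if p.parts and p.parts[0] == "docs":
        return True  # anything under docs/, including docs/roadmap/phase-*/ specs
    return any(fnmatchcase(p.name, pat) for pat in DOC_NAME_PATTERNS)


def invariant_violations(phase: str, changed: Iterable[str]) -> List[str]:
    """Paths that break the post-EVAL invariant for a single phase."""
    paths = [c for c in changed if c]  # drop blank lines from git output
    if phase == "critic":
        return paths  # findings-only: any non-.devlyn change is a violation
    if phase == "docs":
        return [c for c in paths if not is_doc_path(c)]
    raise ValueError(f"not a post-EVAL phase: {phase}")
```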
<perf_opt_in>
Optional: pass `--perf` to record per-phase `{wall_ms, tokens, engine, round, triggered_by}` into `state.perf.per_phase` and totals at PHASE 5. Off by default. Harness efficiency claims can be measured when needed; mandatory meta-measurement was retired in v3.4.
</perf_opt_in>

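This Python sketch shows one possible shape for the `--perf` records. The five fields come from the text above. The extra `phase` field, the `totals` key, and the function name are assumptions; the real layout is defined in `references/pipeline-state.md`.

```python
from dataclasses import asdict, dataclass
from typing import Dict, List, Optional


@dataclass
class PhasePerf:
    phase: str                          # assumed extra field naming the phase
    wall_ms: int
    tokens: int
    engine: str
    round: int
    triggered_by: Optional[str] = None  # None outside the fix loop


def perf_block(per_phase: List[PhasePerf]) -> Dict[str, object]:
    """Shape a state.perf value: the per-phase records plus summed totals."""
    return {
        "per_phase": [asdict(p) for p in per_phase],
        "totals": {
            "wall_ms": sum(p.wall_ms for p in per_phase),
            "tokens": sum(p.tokens for p in per_phase),
        },
    }
```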
## PHASE 0: PARSE + PREFLIGHT + ROUTE

1. **Parse flags** from `<pipeline_config>`:
   - `--max-rounds N` (default: 4)
   - `--route MODE` (default: auto) — per `references/pipeline-routing.md`
   - `--engine MODE` (default: auto) — per `references/engine-routing.md`
   - `--team` — force team-assembled BUILD even on non-strict routes (default: solo).
   - `--bypass <phase>[,<phase>...]` — skip specific phases. Valid: `build-gate`, `browser`, `critic`, `docs`. Deprecated aliases (`--skip-*`, `--security-review skip`, `--bypass simplify|review|clean|security|challenge`) map to `--bypass critic` where applicable; log `deprecated flag — use --bypass <phase>` once.
   - `--build-gate MODE` (default: auto) — `auto` / `strict` / `no-docker`.
   - `--perf` — opt in to per-phase timing/token accounting.

2. **Engine pre-flight** (unless `--engine claude`): call `mcp__codex-cli__ping`. On failure, silently fall back to `--engine claude` and log `engine downgraded`. Never prompt.

3. **Initialize `pipeline.state.json`** per `references/pipeline-state.md`:
   - `version: "1.2"`, `run_id: "ar-$(date -u +%Y%m%dT%H%M%SZ)-<12-hex>"`, `started_at`, `engine`, `base_ref.{branch, sha}`, `rounds.max_rounds`, `eval_passed_sha: null`, `route.bypasses: [...]`, empty `phases`, `criteria`, `route.selected`.

4. **Spec preflight** (if `<pipeline_config>` contains a path matching `` docs/roadmap/phase-\d+/[^\s"'`)]+\.md ``):
   - Read the spec. Missing → `BLOCKED`.
   - Verify internal deps (each entry under `## Dependencies → Internal` resolves to a `status: done` spec). Unmet → `BLOCKED`.
   - Populate `state.source`: `type: "spec"`, `spec_path`, `spec_sha256 = sha256(spec)`, `criteria_anchors: ["spec://requirements", "spec://out-of-scope", "spec://verification", "spec://constraints", "spec://architecture-notes", "spec://dependencies"]`.
   - Populate `state.criteria[]`: one entry per `- [ ]` item in `## Requirements`, `status: pending`.

   No spec path found → `source.type: "generated"`, `source.criteria_path: ".devlyn/criteria.generated.md"` (PHASE 1 creates it), `criteria_anchors: ["criteria.generated://requirements", "criteria.generated://out-of-scope", "criteria.generated://verification"]`, `criteria: []`.

5. **Compute Stage A route** per `references/pipeline-routing.md#stage-a`. Write to `state.route.{selected, user_override, stage_a}`.

6. **Announce** (one brief message):
```
Auto-resolve starting — run <run_id> — task: <desc>
Engine: <engine>, Route: <selected> (<stage_a_reasons>), Bypasses: <bypasses|none>, Max rounds: <N>
```

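The state-initialization details in steps 3 and 4 can be sketched in Python. This is an illustrative sketch under stated assumptions. The regex is the one quoted in step 4. The random `secrets` suffix, the `C<n>` criterion ids, and all function names are hypothetical. `spec_sha256` is assumed to hash the spec's UTF-8 text.

```python
import hashlib
import re
import secrets
from datetime import datetime, timezone
from typing import Dict, List, Optional

# The spec-path pattern quoted in step 4.
SPEC_PATH_RE = re.compile(r"""docs/roadmap/phase-\d+/[^\s"'`)]+\.md""")
# An unchecked Markdown task item: "- [ ] text".
CHECKBOX_RE = re.compile(r"^\s*- \[ \]\s+(.+?)\s*$")


def new_run_id(now: Optional[datetime] = None) -> str:
    """ar-<UTC timestamp>-<12 random hex chars>, like `date -u +%Y%m%dT%H%M%SZ`."""
    now = now or datetime.now(timezone.utc)
    return f"ar-{now.strftime('%Y%m%dT%H%M%SZ')}-{secrets.token_hex(6)}"


def find_spec_path(pipeline_config: str) -> Optional[str]:
    match = SPEC_PATH_RE.search(pipeline_config)
    return match.group(0) if match else None


def spec_source(spec_path: str, spec_text: str) -> Dict[str, str]:
    digest = hashlib.sha256(spec_text.encode("utf-8")).hexdigest()
    return {"type": "spec", "spec_path": spec_path, "spec_sha256": digest}


def requirement_criteria(spec_text: str) -> List[Dict[str, str]]:
    """One pending criterion per unchecked box under '## Requirements'."""
    criteria: List[Dict[str, str]] = []
    in_requirements = False
    for line in spec_text.splitlines():
        if line.startswith("## "):  # any level-2 heading starts a new section
            in_requirements = line.strip() == "## Requirements"
            continue
        match = CHECKBOX_RE.match(line) if in_requirements else None
        if match:
            criteria.append({
                "id": f"C{len(criteria) + 1}",  # hypothetical id scheme
                "text": match.group(1),
                "status": "pending",
            })
    return criteria
```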
## PHASE 1: BUILD

**Engine**: BUILD row. Spawn per `<engine_routing_convention>`. Prompt body: **`references/phases/phase-1-build.md`** (verbatim) + task description.

**Team assembly rule** (simplified from v3.2): BUILD spawns as a **team** ONLY when the `--team` flag is passed OR `state.route.selected == "strict"`. Otherwise solo. Keyword-match auto-trigger removed — Claude/Codex base SWE capability is the default.

**After the agent completes**:
1. Verify `criteria[]` has ≥1 entry with `status != "pending"`. If not, re-spawn with a reminder.
2. `git diff --stat` — if no changes, halt with failure.
3. Checkpoint: `git add -A && git commit -m "chore(pipeline): phase 1 — build complete"`.

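The team rule and the post-BUILD checks reduce to two small decisions. In this Python sketch the function and action names are hypothetical.

The sketch departs from step 2 in one place. `git diff --stat` does not show untracked files, so a BUILD that only adds new files would look like zero changes. The sketch therefore assumes the caller passes `git status --porcelain` output instead, which lists both modified and untracked files.

```python
from typing import Dict, List


def build_mode(team_flag: bool, route_selected: str) -> str:
    """Team BUILD only with --team or on the strict route; solo otherwise."""
    return "team" if team_flag or route_selected == "strict" else "solo"


def post_build_action(criteria: List[Dict[str, str]], porcelain: str) -> str:
    """Decide the orchestrator's next step once the BUILD agent returns.

    `porcelain` is assumed to be `git status --porcelain` output, which, unlike
    `git diff --stat`, also lists untracked (newly created) files.
    """
    if not any(c.get("status") != "pending" for c in criteria):
        return "respawn"     # no criterion touched: re-spawn with a reminder
    if not porcelain.strip():
        return "halt"        # zero code changes: stop with failure
    return "checkpoint"      # git add -A && git commit (phase 1 checkpoint)
```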
## PHASE 1.4: BUILD GATE

Skip if `build-gate` is in `state.route.bypasses`. The gate is deterministic: it runs the same commands CI, Docker, and production run.

Spawn Claude `Agent` (`mode: "bypassPermissions"`): "Read `references/build-gate.md` (detection matrix, commands, package manager, monorepo, strict, Docker) and `references/findings-schema.md`. Run all matched gates. Apply strict flags if `--build-gate strict` OR `state.route.selected == "strict"`. Run Docker unless `--build-gate no-docker`. Emit `.devlyn/build_gate.findings.jsonl` + `.devlyn/build_gate.log.md`; update `state.phases.build_gate`."

**After the agent completes**:
1. Read `state.phases.build_gate.verdict`.
2. **Stage B LITE** (only if `verdict == "PASS"` AND `state.route.user_override == false`): apply the single escalation rule from `references/pipeline-routing.md#stage-b-lite`. If it fires, write `state.route.stage_b.{at, escalated_from, reasons}`.
3. Branch: `PASS` → PHASE 1.5; `FAIL` → PHASE 2.5 with `triggered_by: "build_gate"`.

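This minimal Python sketch shows how the two build-gate switches and the PHASE 1.4 branch combine. The function names and the returned dictionary keys are assumptions. The detection matrix in `references/build-gate.md` and the Stage B LITE rule are not modeled.

```python
from typing import Dict, Optional, Tuple

BUILD_GATE_MODES = ("auto", "strict", "no-docker")


def build_gate_options(mode: str, route_selected: str) -> Dict[str, bool]:
    """Which switches the build-gate agent should apply."""
    if mode not in BUILD_GATE_MODES:
        raise ValueError(f"unknown --build-gate mode: {mode}")
    return {
        "strict_flags": mode == "strict" or route_selected == "strict",
        "docker": mode != "no-docker",
    }


def next_after_build_gate(verdict: str) -> Tuple[str, Optional[str]]:
    """(next phase, triggered_by) for the PHASE 1.4 branch."""
    if verdict == "PASS":
        return ("PHASE 1.5", None)
    if verdict == "FAIL":
        return ("PHASE 2.5", "build_gate")
    raise ValueError(f"unexpected build-gate verdict: {verdict}")
```

Note that the strict route plus `--build-gate no-docker` yields strict flags without Docker, because the route and the flag are independent inputs.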
## PHASE 1.5: BROWSER VALIDATE (conditional)

Skip if `browser` is in `state.route.bypasses`. Also skip if `git diff --name-only <state.base_ref.sha>` has no `*.tsx`, `*.jsx`, `*.vue`, `*.svelte`, `*.css`, `*.html`, `page.*`, `layout.*`, or `route.*` matches.

Spawn Claude `Agent` (`mode: "bypassPermissions"`): "Read `.claude/skills/devlyn:browser-validate/SKILL.md` (tiered Chrome MCP → Playwright → curl) and `references/findings-schema.md`. Start dev server, test the implemented feature end-to-end against `pipeline.state.json:criteria[]`, leave server running (`--keep-server`). Emit `.devlyn/browser_validate.findings.jsonl` + `.devlyn/browser_validate.log.md`; update `state.phases.browser_validate`."

**After the agent completes**:
1. **Sanity check**: if the verdict is `PASS`/`PASS_WITH_ISSUES` but the log shows zero screenshots AND zero navigations, treat the run as unverified and re-run at `--tier 2`/`3`. A code-level verdict is not browser validation.
2. Branch: `PASS`/`PASS_WITH_ISSUES`/`PARTIALLY_VERIFIED` → PHASE 2; `NEEDS_WORK`/`BLOCKED` → PHASE 2.5 with `triggered_by: "browser_validate"`.

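This Python sketch implements the skip rule above. It matches the patterns against each path's file name, so `app/dashboard/page.ts` matches `page.*`; the text does not say so explicitly, so treat that as an assumption. `needs_browser_validation` is a hypothetical name.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath
from typing import Iterable

WEB_PATTERNS = ("*.tsx", "*.jsx", "*.vue", "*.svelte", "*.css", "*.html",
                "page.*", "layout.*", "route.*")


def needs_browser_validation(changed_paths: Iterable[str], bypasses: Iterable[str] = ()) -> bool:
    """True when PHASE 1.5 should run for this diff."""
    if "browser" in set(bypasses):
        return False
    names = [PurePosixPath(p).name for p in changed_paths]
    return any(fnmatchcase(name, pat) for name in names for pat in WEB_PATTERNS)
```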
## PHASE 2: EVALUATE

**Engine**: EVAL row — always Claude. Prompt body: **`references/phases/phase-2-evaluate.md`**.

**After the agent completes**:
1. Read `state.phases.evaluate.verdict`.
2. **First-time PASS or PASS_WITH_ISSUES** with `state.eval_passed_sha == null` → set `state.eval_passed_sha = git rev-parse HEAD` (activates `<post_eval_invariant>`).
3. Branch:
   - `PASS` → PHASE 3 (CRITIC), or PHASE 5 (FINAL REPORT) on the `fast` route.
   - `PASS_WITH_ISSUES` → **terminal for this phase** (LOW-only findings do not re-trigger the fix loop). Proceed to the next phase, as for `PASS`.
   - `NEEDS_WORK` / `BLOCKED` → PHASE 2.5 with `triggered_by: "evaluate"`.

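The PHASE 2 bookkeeping and branch can be sketched as follows, assuming `state` is the parsed `pipeline.state.json` as a dict. The function name and the phase labels it returns are illustrative. PHASE 3's own skip rules, such as a `critic` bypass, are left to PHASE 3.

```python
from typing import Dict, Optional, Tuple


def after_evaluate(state: Dict, verdict: str, head_sha: str) -> Tuple[str, Optional[str]]:
    """Apply the PHASE 2 branch; returns (next_phase, triggered_by)."""
    passed = verdict in ("PASS", "PASS_WITH_ISSUES")
    if passed and state.get("eval_passed_sha") is None:
        state["eval_passed_sha"] = head_sha  # first pass activates <post_eval_invariant>
    if passed:
        # PASS_WITH_ISSUES is terminal for EVAL: LOW-only findings never re-enter the fix loop.
        if state["route"]["selected"] == "fast":
            return ("PHASE 5", None)
        return ("PHASE 3", None)
    if verdict in ("NEEDS_WORK", "BLOCKED"):
        return ("PHASE 2.5", "evaluate")
    raise ValueError(f"unknown EVAL verdict: {verdict}")
```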
## PHASE 2.5: UNIFIED FIX LOOP

A single fix loop serves every trigger (`build_gate` / `browser_validate` / `evaluate` / `critic`), with one shared round counter, `state.rounds.global`.

**Exhaustion check first**: if `state.rounds.global >= state.rounds.max_rounds`:
- `build_gate` / `browser_validate` → **halt** → PHASE 5 with exhaustion banner.
- `evaluate` / `critic` → **proceed_with_warning** → skip to next phase; final report shows banner.

**Fix-batch packet assembly**: read the trigger's `.findings.jsonl`, plus the browser_validate findings when `triggered_by` is `"evaluate"` or `"browser_validate"` and browser validation has open findings (see `references/pipeline-routing.md`). Keep only `status == "open"` findings and write `.devlyn/fix-batch.round-<N>.json`:
```json
{
  "round": <N>, "max_rounds": <N>, "base_ref_sha": "...", "criteria_source": "...",
  "triggered_by": "<trigger>", "findings": [ /* id, rule_id, severity, file, line, message, fix_hint, criterion_ref */ ],
  "failed_criteria": ["<C ids>"], "acceptance": {"build_gate_cmd": "...", "test_cmd": "..."}
}
```

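This is a sketch of fix-batch assembly in Python. All of the following assumptions are hypothetical:

- Findings files are named `<trigger>.findings.jsonl` under `.devlyn/`.
- Round `N` is `state.rounds.global + 1`, because the counter is incremented only after the fix agent finishes.
- `failed_criteria` lists criteria whose `failed_by_finding_ids` is non-empty.
- A `critic` trigger also reads `.devlyn/invariant.findings.jsonl`, since invariant violations are routed with `triggered_by: "critic"`.

```python
import json
from pathlib import Path
from typing import Dict, List

FINDING_FIELDS = ("id", "rule_id", "severity", "file", "line",
                  "message", "fix_hint", "criterion_ref")


def read_open_findings(path: Path) -> List[Dict]:
    """Open findings from one JSONL file, trimmed to the packet's fields."""
    if not path.exists():
        return []
    found = []
    for raw in path.read_text(encoding="utf-8").splitlines():
        if not raw.strip():
            continue
        finding = json.loads(raw)
        if finding.get("status") == "open":
            found.append({k: finding.get(k) for k in FINDING_FIELDS})
    return found


def build_fix_batch(devlyn: Path, state: Dict, triggered_by: str,
                    acceptance: Dict[str, str]) -> Path:
    sources = [triggered_by]
    if triggered_by == "evaluate":
        sources.append("browser_validate")  # carry open browser findings along
    if triggered_by == "critic":
        sources.append("invariant")         # assumption: invariant findings ride on "critic"
    findings = [f for s in sources for f in read_open_findings(devlyn / f"{s}.findings.jsonl")]
    n = state["rounds"]["global"] + 1       # assumption: rounds are numbered from 1
    source = state["source"]
    packet = {
        "round": n,
        "max_rounds": state["rounds"]["max_rounds"],
        "base_ref_sha": state["base_ref"]["sha"],
        "criteria_source": source.get("spec_path") or source.get("criteria_path"),
        "triggered_by": triggered_by,
        "findings": findings,
        "failed_criteria": [c["id"] for c in state.get("criteria", [])
                            if c.get("failed_by_finding_ids")],
        "acceptance": acceptance,
    }
    out = devlyn / f"fix-batch.round-{n}.json"
    out.write_text(json.dumps(packet, indent=2), encoding="utf-8")
    return out
```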
**Engine**: FIX LOOP row (Codex on `auto`/`codex`, Claude on `claude`). Use a fresh Codex call each round (no `sessionId` reuse).

Spawn per `<engine_routing_convention>`. Prompt:

> Read `.devlyn/fix-batch.round-<N>.json` and `pipeline.state.json`.
>
> **First, re-ground on the contract.** Open `source.spec_path` (or `source.criteria_path`) and read the sections/anchors referenced by each finding's `criterion_ref`. **Spec/criteria are higher authority than findings** — do not narrow or reinterpret required behavior to satisfy a finding. If a finding hint conflicts with explicit spec text (e.g., a glob/pattern like `**/SKILL.md`, a cardinality, a flag's documented behavior), preserve the spec semantics and fix only the implementation defect. Non-contradictory, backward-compatible enhancements that preserve required default behavior are allowed (e.g., respecting `NO_COLOR` while still defaulting to colored output when unset). If a finding **truly contradicts** the spec, halt that finding's fix, log the conflict in `.devlyn/fix-batch.round-<N>.log.md`, and leave the finding `open` — the conflict surfaces in the final report rather than silently narrowing the contract.
>
> **Then fix every listed finding at the root cause.** If multiple findings touch the same symbol, produce **one consolidated change**. Prefer editing/replacing existing code over adding new machinery; **do not leave parallel near-duplicate helpers/functions**. When return-shape pressure appears (one finding needs a richer return value than another), broaden the existing helper's return object — don't create a second variant.
>
> Read each referenced `file:line`, implement the fix, run tests. No workarounds (`any`, `@ts-ignore`, silent catches, hardcoded values). Raw failure detail: `.devlyn/build_gate.log.md` / `.devlyn/browser_validate.log.md`. When a previously-failed criterion is now satisfied, clear `failed_by_finding_ids`, set `status: "implemented"`, append an `evidence` record.

**After the agent completes**:
1. Checkpoint: `git add -A && git commit -m "chore(pipeline): fix round <N> (<triggered_by>)"`.
2. Increment `state.rounds.global`.
3. Route back: `build_gate` → PHASE 1.4; `browser_validate` → PHASE 1.5; **`evaluate` / `critic` → PHASE 2 (re-EVAL)**. All post-EVAL findings flow back through EVAL.
4. **After re-EVAL returns PASS/PASS_WITH_ISSUES with `triggered_by == "critic"`**: re-run PHASE 3 CRITIC once before proceeding to DOCS. This verifies the fix didn't introduce new design/security issues the first CRITIC would have caught. Subsequent fix-loop rounds triggered from this re-CRITIC follow the same rule (bounded by `state.rounds.max_rounds`).

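This Python sketch covers the round accounting: the exhaustion check before each round, the counter bump and route-back after it, and the one-time CRITIC re-run. All names are hypothetical. The sketch only knows the four triggers listed for this loop. PHASE 4 also routes a `docs` trigger here, which would need its own route-back entry.

```python
from typing import Dict, Optional

HALT_TRIGGERS = {"build_gate", "browser_validate"}
ROUTE_BACK = {
    "build_gate": "PHASE 1.4",
    "browser_validate": "PHASE 1.5",
    "evaluate": "PHASE 2",  # post-EVAL findings always go back through EVAL
    "critic": "PHASE 2",
}


def exhaustion_outcome(rounds: Dict[str, int], triggered_by: str) -> Optional[str]:
    """None while rounds remain; else 'halt' or 'proceed_with_warning'."""
    if triggered_by not in ROUTE_BACK:
        raise ValueError(f"unknown trigger: {triggered_by}")
    if rounds["global"] < rounds["max_rounds"]:
        return None
    return "halt" if triggered_by in HALT_TRIGGERS else "proceed_with_warning"


def after_fix_round(rounds: Dict[str, int], triggered_by: str) -> str:
    """Bump the shared counter and return the phase to re-enter."""
    rounds["global"] += 1
    return ROUTE_BACK[triggered_by]


def needs_critic_rerun(triggered_by: str, reeval_verdict: str) -> bool:
    """A critic-triggered round re-runs CRITIC once re-EVAL passes."""
    return triggered_by == "critic" and reeval_verdict in ("PASS", "PASS_WITH_ISSUES")
```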
## PHASE 3: CRITIC (findings-only, route-gated)

Skip if `state.route.selected == "fast"` OR `critic` is in `state.route.bypasses`.

One post-EVAL critic pass with two sub-concerns:
- **Design sub-pass** — "would a staff engineer block this PR?" (cold read, any finding → `NEEDS_WORK`). Always Claude.
- **Security sub-pass** — OWASP-style audit with mandatory dependency audit when any dep manifest OR lockfile changed (`package.json`, `requirements.txt`, `package-lock.json`, `pnpm-lock.yaml`, `yarn.lock`, `Pipfile.lock`, `poetry.lock`, `Cargo.toml`, `Cargo.lock`, `go.mod`, `go.sum`). On `--engine auto`: **Dual** (Claude + Codex in parallel, findings merged). On other engines: a single model per the engine routing.

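The dependency-audit trigger can be sketched as a file-name lookup. Matching on the file name, so that manifests in nested workspaces also count, is an assumption. The set is exactly the list above; for example, `pyproject.toml` is not in it.

```python
from pathlib import PurePosixPath
from typing import Iterable

DEP_FILES = frozenset({
    "package.json", "requirements.txt", "package-lock.json", "pnpm-lock.yaml",
    "yarn.lock", "Pipfile.lock", "poetry.lock", "Cargo.toml", "Cargo.lock",
    "go.mod", "go.sum",
})


def dependency_audit_required(changed_paths: Iterable[str]) -> bool:
    """True if any changed file is a listed manifest or lockfile."""
    return any(PurePosixPath(p).name in DEP_FILES for p in changed_paths)
```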
Hygiene concerns (unused imports, dead code) live in EVAL's `hygiene.*` findings at LOW severity, not in a separate sub-pass here.

**Before spawn**: capture `phase_pre_sha = git rev-parse HEAD` → `state.phases.critic.pre_sha`.

**Spawn**: per `<engine_routing_convention>`. Prompt body: **`references/phases/phase-3-critic.md`**.

**After the agent completes**:
1. Enforce `<post_eval_invariant>`: `git diff --name-only <phase_pre_sha> -- ':!.devlyn/**'` — non-empty → revert + emit invariant finding + route to fix loop.
2. Read `state.phases.critic.verdict` (the WORSE of the design and security sub-verdicts):
   - `PASS` → PHASE 4.
   - `PASS_WITH_ISSUES` (only LOW security findings; zero design findings) → terminal; PHASE 4.
   - `NEEDS_WORK` / `BLOCKED` → PHASE 2.5 with `triggered_by: "critic"`.

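Taking the worse of two sub-verdicts needs an ordering. This Python sketch assumes the same order as the PHASE 5 exit codes: PASS < PASS_WITH_ISSUES < NEEDS_WORK < BLOCKED. The function names are hypothetical, and the invariant check in step 1 is assumed to have run already.

```python
from typing import Optional, Tuple

VERDICT_RANK = {"PASS": 0, "PASS_WITH_ISSUES": 1, "NEEDS_WORK": 2, "BLOCKED": 3}


def critic_verdict(design: str, security: str) -> str:
    """The worse of the two sub-verdicts (unknown verdicts raise KeyError)."""
    return max(design, security, key=lambda v: VERDICT_RANK[v])


def next_after_critic(verdict: str) -> Tuple[str, Optional[str]]:
    """(next phase, triggered_by) for the PHASE 3 branch."""
    if VERDICT_RANK[verdict] <= VERDICT_RANK["PASS_WITH_ISSUES"]:
        return ("PHASE 4", None)
    return ("PHASE 2.5", "critic")
```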
## PHASE 4: DOCS (doc-file mutations only)
|
|
189
|
-
|
|
190
|
-
Skip if `docs` in `state.route.bypasses` OR `state.route.selected == "fast"`.
|
|
191
|
-
|
|
192
|
-
Spawn Claude `Agent` (`mode: "bypassPermissions"`). Include original task description. Prompt: "Two jobs:
|
|
193
|
-
|
|
194
|
-
**Job 1 — Roadmap sync**: if task matched `docs/roadmap/phase-\d+/[^\s\"']+\.md` and `git diff <state.base_ref.sha> --stat` touches non-doc files:
|
|
195
|
-
1. Read the spec. If `status: done` already, skip to Job 2.
|
|
196
|
-
2. Set `status: done` + `completed: <today>` in frontmatter. Do not touch body.
|
|
197
|
-
3. Update `docs/ROADMAP.md`: find row matching spec id; change Status to `Done`.
|
|
198
|
-
4. If phase now fully Done: archive to `## Completed <details>` block at bottom (format per `devlyn:ideate#context-archiving`). Item spec files stay on disk.
|
|
199
|
-
|
|
200
|
-
**Job 2 — General doc sync**: update docs referencing changed APIs/features/behaviors. Use `git log --oneline -20` + `git diff <state.base_ref.sha>`. Preserve forward-looking content.
|
|
201
|
-
|
|
202
|
-
**Safety**: never flip a spec `done` without a non-empty non-doc diff; never flip multiple specs in one run; never touch files outside the doc-file allowlist."
|
|
**Before spawn**: capture `phase_pre_sha = git rev-parse HEAD` → `state.phases.docs.pre_sha`.

**After the agent completes**:
1. Enforce allowlist: `git diff --name-only <phase_pre_sha> -- ':!.devlyn/**'` — any non-allowlisted path → revert + emit `invariant.post-eval-code-mutation` + route to PHASE 2.5 with `triggered_by: "docs"`.
2. If allowlist honored and diff non-empty: `git add -A && git commit -m "chore(pipeline): docs updated"`.
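A Python sketch of the post-agent check, taking the path list that `git diff --name-only` prints. The allowlist patterns are hypothetical, since the actual doc-file allowlist is not spelled out in this section, and the helper names are illustrative only.

```python
from fnmatch import fnmatch

# Hypothetical allowlist; the real doc-file allowlist is defined elsewhere.
# fnmatch's "*" also matches "/", so "*.md" covers nested Markdown files.
DOC_ALLOWLIST = ["*.md", "docs/*", "CHANGELOG*"]


def allowlist_violations(changed: list[str], allowlist: list[str] = DOC_ALLOWLIST) -> list[str]:
    """Changed paths (as printed by `git diff --name-only`) that no pattern allows."""
    return [
        path for path in changed
        if not path.startswith(".devlyn/")  # mirrors the ':!.devlyn/**' exclusion
        and not any(fnmatch(path, pattern) for pattern in allowlist)
    ]


def after_docs_agent(changed: list[str]) -> str:
    """Decide between the revert path, a docs commit, or doing nothing."""
    if allowlist_violations(changed):
        return "revert + invariant.post-eval-code-mutation + PHASE 2.5 (triggered_by: docs)"
    return "commit" if changed else "nothing to commit"
```

One caveat worth checking in a real implementation: `git diff --name-only` does not list untracked files, while `git add -A` stages them, so a stricter check may also want the output of `git ls-files --others --exclude-standard`.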
|
## PHASE 5: FINAL REPORT + ARCHIVE

1. **Terminal verdict**: run `python3 scripts/terminal_verdict.py` (implements the precedence in `references/pipeline-routing.md#terminal-state-algorithm`; prints verdict, exits 0/1/2/3 for PASS/PASS_WITH_ISSUES/NEEDS_WORK/BLOCKED).

2. **Render report**:
   ```
   ### Auto-Resolve Complete — run <run_id>

   Task: <original task>
   Engine: <engine> (downgraded: <reason or no>)
   Route: <selected> (user_override: <t/f>)
   Stage A: <reasons>
   Stage B LITE: <no escalation | escalated from X — reason>

   Terminal verdict: <PASS / PASS_WITH_ISSUES / NEEDS_WORK / BLOCKED>
   <banner if applicable: "⚠ BUILD GATE EXHAUSTED" / "⚠ EVAL EXHAUSTED — open findings: <list file:line>" />

   Pipeline summary:
   | Phase | Verdict | Notes |
   |-------|---------|-------|
   | BUILD | <v> | <engine, solo/team> |
   | BUILD GATE | <v> | <project types, commands> |
   | BROWSER | <v / skipped — no web> | <tier, flow> |
   | EVAL (round <N>) | <v> | <finding count by severity> |
   | FIX ROUNDS | <N of max> | <triggered_by history> |
   | CRITIC | <v / skipped-route / skipped-bypass> | <design: N, security: N, dep-audit: ran/skipped> |
   | DOCS | <completed / skipped> | <specs flipped, roadmap archived> |

   Guardrails bypassed: <state.route.bypasses or "none">

   Commits: <git log --oneline from state.base_ref.sha>

   Audit trail: .devlyn/runs/<run_id>/

   Next steps:
   - Review: git diff <base_ref.sha>
   - Squash: git rebase -i <base_ref.sha>
   - Re-run fixes: /devlyn:auto-resolve "<narrower task>"
   ```

3. **Archive**: run `python3 scripts/archive_run.py` (implements `references/pipeline-state.md#archive-contract`; moves per-run artifacts into `.devlyn/runs/<run_id>/`, best-effort prunes to last 10 completed runs).

4. Kill dev server from PHASE 1.5 if still running.
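Steps 1 and 3 can be sketched in Python. This does not reproduce `terminal_verdict.py` or `archive_run.py`; it only shows how a caller might read the documented exit-code contract and how "keep the last 10 completed runs" selects what to prune. It assumes the script prints the verdict as its last stdout line and that each run has a completion timestamp; the helper names are hypothetical.

```python
import subprocess
import sys

# Exit-code contract documented for scripts/terminal_verdict.py.
EXIT_CODES = {"PASS": 0, "PASS_WITH_ISSUES": 1, "NEEDS_WORK": 2, "BLOCKED": 3}
VERDICT_BY_CODE = {code: verdict for verdict, code in EXIT_CODES.items()}


def read_terminal_verdict(script: str = "scripts/terminal_verdict.py") -> str:
    """Run the script and cross-check its printed verdict against its exit code.

    An uncaught Python exception also exits with status 1, so the exit code
    alone cannot distinguish a crash from PASS_WITH_ISSUES.
    """
    proc = subprocess.run([sys.executable, script], capture_output=True, text=True)
    lines = proc.stdout.strip().splitlines()
    printed = lines[-1].strip() if lines else ""  # assumes the verdict is the last line
    expected = VERDICT_BY_CODE.get(proc.returncode)
    if expected is None or printed != expected:
        raise RuntimeError(f"unexpected result (exit {proc.returncode}, output {printed!r}): {proc.stderr}")
    return expected


def runs_to_prune(completed: list[tuple[str, float]], keep: int = 10) -> list[str]:
    """Given (run_id, completed_at) pairs, return run ids older than the newest `keep`."""
    newest_first = sorted(completed, key=lambda run: run[1], reverse=True)
    return [run_id for run_id, _ in newest_first[keep:]]
```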
|
@@ -1,21 +0,0 @@
{
  "skill_name": "devlyn:auto-resolve",
  "note": "Real regression fixtures from measured A/B runs. Each eval is a task we have actually driven through the pipeline and scored blind against a bare-prompt baseline. Expand only with real runs, not synthesized prompts.",
  "evals": [
    {
      "id": 1,
      "name": "doctor-subcommand",
      "route": "standard",
      "complexity": "medium",
      "prompt": "Implement per spec at evals/task-doctor-subcommand.md",
      "spec": "evals/task-doctor-subcommand.md",
      "expected_output": "Single-file CLI change adding a `doctor` subcommand to bin/devlyn.js with node-version, HOME/.claude, plugins, and skills checks, TTY-gated color, exit-code semantics, help integration. Zero new npm dependencies. No silent error catches.",
      "baseline_source": "bare prompt (no skill pipeline) — see benchmark history commit 2d9a5f0..51a8aeb",
      "historical_scores": {
        "v3.4": { "skill": 57, "bare": 45, "margin": 12, "judge": "codex gpt-5.3-codex blind" },
        "v3.4.1": { "skill": 59, "bare": 43, "margin": 16, "judge": "codex gpt-5.3-codex blind" }
      },
      "files": []
    }
  ]
}
|
@@ -1,42 +0,0 @@
# Task: `devlyn doctor` subcommand

Add a new `doctor` subcommand to `bin/devlyn.js`. When the user runs `npx devlyn-cli doctor` (or `node bin/devlyn.js doctor`), it diagnoses the local devlyn-cli installation and prints a status report.

## Requirements

1. **Node version check** — `process.version` must be at least `v18.0.0`, compared numerically by major/minor/patch (a plain string comparison ranks `v9.x` above `v18`). Emit a status line. If below, mark FAIL.
2. **`$HOME/.claude/` check** — exists as a directory AND is writable. Missing → FAIL. Exists but not writable (EACCES) → FAIL with a distinct "permission" message.
3. **Installed plugins scan** — read subdirectories of `$HOME/.claude/plugins/cache/` and print a summary line with the count. `--verbose` lists names.
4. **Installed skills scan** — count files matching `$HOME/.claude/skills/**/SKILL.md`. Print count; `--verbose` lists relative paths.
5. **Colored output** — each line prefixed with `[OK]` (green), `[WARN]` (yellow), `[FAIL]` (red) using ANSI escape codes, **only when `process.stdout.isTTY` is true**. Non-TTY → no color codes.
6. **Summary line** — e.g., `doctor: 3 ok, 1 warn, 0 fail`.
7. **Exit code** — `0` if zero fails, `1` if any fail.
8. **`--verbose` flag** — expands details for plugins/skills scans.
9. **Help integration** —
   - `node bin/devlyn.js doctor --help` prints a short help block and exits 0.
   - `node bin/devlyn.js --help` / `node bin/devlyn.js help` lists `doctor` as an available subcommand.
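The spec targets Node.js, but requirements 1, 5, 6, and 7 are language-neutral rules. Here they are as a Python sketch with hypothetical helper names; in the real CLI, `process.stdout.isTTY` would supply the `tty` argument (in Python, `sys.stdout.isatty()` plays that role).

```python
import re

# ANSI SGR color codes: 32 green, 33 yellow, 31 red; 0 resets.
COLORS = {"OK": "\x1b[32m", "WARN": "\x1b[33m", "FAIL": "\x1b[31m"}
RESET = "\x1b[0m"


def node_version_ok(version: str, minimum: tuple = (18, 0, 0)) -> bool:
    """Compare a version like 'v20.11.1' numerically against the minimum."""
    match = re.fullmatch(r"v(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized Node version string: {version!r}")
    return tuple(int(part) for part in match.groups()) >= minimum


def format_line(status: str, text: str, tty: bool) -> str:
    """Prefix with [OK]/[WARN]/[FAIL]; emit color codes only on a TTY."""
    tag = f"[{status}]"
    return f"{COLORS[status]}{tag}{RESET} {text}" if tty else f"{tag} {text}"


def summarize(statuses: list[str]) -> tuple[str, int]:
    """Build the summary line and the exit code (1 if any FAIL)."""
    ok, warn, fail = (statuses.count(s) for s in ("OK", "WARN", "FAIL"))
    return f"doctor: {ok} ok, {warn} warn, {fail} fail", 1 if fail else 0
```

The numeric parse matters because `"v9.11.2" >= "v18.0.0"` is true as a string comparison (`"9"` sorts after `"1"`), which would wrongly pass an old Node.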
|
## Constraints

- **Zero new dependencies.** Use only Node.js built-ins (`fs`, `path`, `os`, `process`).
- **No silent error catches.** Per the project's CLAUDE.md error-handling philosophy, do not wrap operations in `try { … } catch { return fallbackValue }`. All errors must be visible to the user, with actionable messages.
- **HOME guard.** If `process.env.HOME` is undefined or empty, emit a clear FAIL line ("HOME environment variable is not set") and exit 1. Do not attempt to read arbitrary paths.
- **EACCES handling.** If `readdirSync` fails with EACCES, emit a permission-specific message quoting the offending path. Do not silently return an empty list.
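A Python sketch of the HOME guard and the permission-specific scan error. `check_home` and `list_plugins` are hypothetical names, and reporting a missing cache directory as WARN rather than FAIL is an assumption, since the spec does not say. Every caught error is turned into a visible status line, never a silent fallback.

```python
import errno
import os


def check_home(env: dict) -> tuple[str, str]:
    """HOME guard: unset and empty are both FAIL, and no path is read."""
    home = env.get("HOME")
    if not home:
        return "FAIL", "HOME environment variable is not set"
    return "OK", home


def list_plugins(cache_dir: str) -> tuple[str, str, list[str]]:
    """Scan plugin subdirectories; every error becomes a visible status line."""
    try:
        with os.scandir(cache_dir) as entries:
            names = sorted(entry.name for entry in entries if entry.is_dir())
    except FileNotFoundError:
        return "WARN", f"{cache_dir} does not exist", []
    except PermissionError as exc:  # EACCES or EPERM
        code = errno.errorcode.get(exc.errno, "EACCES")
        return "FAIL", f"permission denied ({code}) reading {cache_dir}", []
    return "OK", f"{len(names)} plugin(s) in {cache_dir}", names
```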
|
## Acceptance verification

Run each of these; each must behave as described:

- `node bin/devlyn.js doctor` — produces the status report, exits 0 on a clean machine.
- `HOME=/nonexistent node bin/devlyn.js doctor` — prints a FAIL line clearly referencing the missing `/nonexistent/.claude`, exits 1.
- `node bin/devlyn.js doctor | cat` — piped output contains no ANSI escape codes (`\x1b[`).
- `node bin/devlyn.js doctor --help` — prints help, exits 0.
- `node bin/devlyn.js --help` — mentions `doctor` in the list of subcommands.
- `git diff -- package.json` — no new entries under `dependencies`.
- `node bin/devlyn.js doctor --verbose` — lists each plugin directory and each skill path.

## Out of Scope

- Auto-repair (don't offer to fix detected problems; just report).
- Checking remote/registry state (npm, GitHub).
- Any feature requiring a new npm dependency.
|
@@ -1,130 +0,0 @@
# Build Gate — Project Type Detection & Commands

Reference for PHASE 1.4 (Build Gate). The build gate agent reads this file to determine which commands to run.

---

## Project Type Detection Matrix

Inspect the repository root and subdirectories (up to 2 levels). A repo can match **multiple** signals — run ALL matching gates. Do not pick "the main one"; a monorepo with a Next.js dashboard + Rust service needs both.

| Signal file(s) | Project type | Gate commands (run in order) |
|---|---|---|
| `package.json` with `next` dep | Next.js | `npx tsc --noEmit` → `npx next build` |
| `package.json` with `nuxt` dep | Nuxt | `npx nuxi typecheck` → `npx nuxi build` |
| `package.json` with `vite` + `tsconfig.json` | Vite+TS | `npx tsc --noEmit` → `npm run build` (if script exists) |
| `package.json` with `expo` dep | Expo (React Native) | `npx tsc --noEmit` → `npx expo-doctor` |
| `package.json` with `react-native` (no expo) | React Native | `npx tsc --noEmit` |
| `package.json` with `svelte` + `@sveltejs/kit` | SvelteKit | `npm run check` → `npm run build` |
| `package.json` only, has `build` script | Generic Node | `npm run build` |
| `package.json` only, has `tsconfig.json` but no `build` | TS library | `npx tsc --noEmit` |
| `pnpm-workspace.yaml` / `turbo.json` / `lerna.json` | Monorepo | `pnpm -r build` or `turbo run build typecheck lint` — **workspace-wide**, NOT just the changed package |
| `Cargo.toml` | Rust | `cargo check --all-targets` → `cargo clippy -- -D warnings` |
| `go.mod` | Go | `go build ./...` → `go vet ./...` |
| `foundry.toml` | Foundry (Solidity) | `forge build` |
| `hardhat.config.{js,ts,cjs}` | Hardhat (Solidity) | `npx hardhat compile` |
| `Anchor.toml` | Anchor (Solana) | `anchor build` |
| `Move.toml` | Move (Sui/Aptos) | `sui move build` or `aptos move compile` |
| `pyproject.toml` / `setup.py` + mypy config | Python+mypy | `mypy .` |
| `pyproject.toml` with `ruff` | Python+Ruff | `ruff check .` |
| `Package.swift` | Swift package | `swift build` |
| `*.xcodeproj` / `*.xcworkspace` | iOS/macOS (Xcode) | Skip by default — log "Xcode project detected, manual build gate recommended". Too project-specific without knowing the scheme. |
| `build.gradle*` / `settings.gradle*` | Gradle/Android | `./gradlew assembleDebug` (debug, not release — keep it fast) |
| `CMakeLists.txt` | C/C++ (CMake) | `cmake -B build && cmake --build build` |
| `Makefile` (with no other signals) | Generic Make | `make` (only if no other type matched — Makefiles are too generic) |
| `Unity/ProjectSettings/` or `ProjectSettings/ProjectVersion.txt` | Unity | Skip by default — log "Unity project detected, manual build gate recommended" |
| `project.godot` | Godot | Skip by default — log "Godot project detected, manual build gate recommended" |
| `Dockerfile*` | Docker | `docker build -f <dockerfile> -t pipeline_gate_test .` (Docker rejects image names that begin with `_`) — included by default in `auto` mode. Skip with `--build-gate no-docker`. |
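A Python sketch of the multi-match rule for file-name signals, scanning the root plus at most two directory levels below it. Only a hypothetical subset of the matrix is encoded; the `package.json` dependency checks (Next.js, Nuxt, and so on) would need JSON parsing and are left out, and skipping `node_modules` and `.git` is an added assumption.

```python
import os

# Hypothetical subset of the matrix: signal file name -> project type.
FILE_SIGNALS = {
    "Cargo.toml": "Rust",
    "go.mod": "Go",
    "foundry.toml": "Foundry (Solidity)",
    "Anchor.toml": "Anchor (Solana)",
    "Move.toml": "Move (Sui/Aptos)",
    "Package.swift": "Swift package",
    "CMakeLists.txt": "C/C++ (CMake)",
}
SKIP_DIRS = {".git", "node_modules"}  # assumption: never scan these trees


def detect_project_types(root: str, max_depth: int = 2) -> list[tuple[str, str]]:
    """Return every (project type, directory) match; all matches get gated."""
    hits = []
    for dirpath, dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        depth = 0 if rel == "." else rel.count(os.sep) + 1
        # Prune in place so os.walk never descends past max_depth or into SKIP_DIRS.
        dirnames[:] = [] if depth >= max_depth else [d for d in dirnames if d not in SKIP_DIRS]
        for name in filenames:
            if name in FILE_SIGNALS:
                hits.append((FILE_SIGNALS[name], rel))
    # The Makefile row applies only when nothing else matched.
    if not hits and os.path.isfile(os.path.join(root, "Makefile")):
        hits.append(("Generic Make", "."))
    return sorted(hits)
```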
|
## Package Manager Detection

Respect the project's package manager. Check in order:
1. `packageManager` field in root `package.json` → use that
2. `pnpm-lock.yaml` exists → `pnpm`
3. `yarn.lock` exists → `yarn`
4. `bun.lockb` / `bun.lock` exists → `bun`
5. Default → `npm`

Replace `npm run build` / `npx` accordingly: `pnpm build` / `pnpm exec`, `yarn build` / `yarn`, `bun run build` / `bunx`.
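The detection order and command mapping above, as a Python sketch with hypothetical helper names. The `packageManager` field uses Corepack's `name@version` form (for example `pnpm@9.1.0`), so the sketch keeps only the part before `@`.

```python
import json
import os

# Lockfile checks, in the order listed above.
LOCKFILES = [("pnpm-lock.yaml", "pnpm"), ("yarn.lock", "yarn"),
             ("bun.lockb", "bun"), ("bun.lock", "bun")]

# Manager -> (build command, npx equivalent), from the replacement rule above.
COMMANDS = {
    "npm": ("npm run build", "npx"),
    "pnpm": ("pnpm build", "pnpm exec"),
    "yarn": ("yarn build", "yarn"),
    "bun": ("bun run build", "bunx"),
}


def detect_package_manager(root: str) -> str:
    """Apply the five-step order: packageManager field, lockfiles, then npm."""
    pkg_path = os.path.join(root, "package.json")
    if os.path.isfile(pkg_path):
        with open(pkg_path, encoding="utf-8") as fh:
            field = json.load(fh).get("packageManager")
        if field:
            return field.split("@", 1)[0]  # "pnpm@9.1.0" -> "pnpm"
    for lockfile, manager in LOCKFILES:
        if os.path.isfile(os.path.join(root, lockfile)):
            return manager
    return "npm"
```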
|
## Monorepo Handling

Monorepos are the most critical case — cross-package type drift is the #1 source of "tests pass locally, build fails in CI."

1. Detect workspace root markers: `pnpm-workspace.yaml`, `turbo.json`, `lerna.json`, `workspaces` in root `package.json`
2. Run gates at the **workspace root** level, not per-changed-package:
   - Turbo: `turbo run build typecheck lint` (respects dependency graph)
   - pnpm: `pnpm -r build` (runs in topological order)
   - yarn workspaces: `yarn workspaces foreach -A run build`
   - npm workspaces: `npm run build --workspaces`
3. This ensures Package A's type change that breaks Package B's consumer is caught, even if only Package A was directly modified.
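One way to pick the workspace-wide gate command, sketched in Python. The precedence (Turbo, then pnpm, then yarn, then npm) is an assumption, and `lerna.json` on its own returns `None` because the list above gives no Lerna command; the function name is hypothetical.

```python
import json
import os
from typing import Optional


def workspace_gate_command(root: str) -> Optional[str]:
    """Pick one workspace-wide build command for a monorepo root, or None."""
    def has(name: str) -> bool:
        return os.path.isfile(os.path.join(root, name))

    declares_workspaces = False
    if has("package.json"):
        with open(os.path.join(root, "package.json"), encoding="utf-8") as fh:
            declares_workspaces = "workspaces" in json.load(fh)

    if has("turbo.json"):
        return "turbo run build typecheck lint"  # respects the dependency graph
    if has("pnpm-workspace.yaml"):
        return "pnpm -r build"  # topological order
    if declares_workspaces and has("yarn.lock"):
        return "yarn workspaces foreach -A run build"
    if declares_workspaces:
        return "npm run build --workspaces"
    return None  # e.g. lerna.json alone: no command is listed for it
```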
|
## Strict Mode (`--build-gate strict`)

When strict mode is set, treat warnings as failures:
- TypeScript: add `--strict` if not already in tsconfig (or verify it's set)
- Clippy: `-D warnings` (already default in the matrix)
- ESLint: `--max-warnings 0`
- Go vet: already exits non-zero whenever it reports an issue
- Foundry: `--deny-warnings`

In default (auto) mode, only hard errors (non-zero exit code from the tool's perspective) block.
|
## Docker Build (default in `auto` mode)

When `Dockerfile*` files are detected AND `--build-gate no-docker` is NOT set:
1. Run all non-Docker gates first (they're faster and catch most errors before the slow Docker step)
2. Then run `docker build -f <dockerfile> -t pipeline_gate_test .` for each Dockerfile found in the repo root and subdirectories (up to 2 levels)
3. If the Docker daemon is not available, log the skip with a warning but do NOT fail — developers without Docker should not be blocked. The warning should note: "Docker builds were skipped because the Docker daemon is unavailable. Use `--build-gate no-docker` to suppress this warning, or ensure Docker is running to catch Dockerfile-specific issues."
4. This catches Dockerfile-specific issues that no other gate can: COPY paths referencing files excluded by .dockerignore, multi-stage build failures, production-only dependency resolution, and environment differences between dev and container builds

Use `--build-gate no-docker` to skip Docker builds for faster iteration during development — the language-level gates (tsc, cargo check, etc.) still run and catch the majority of issues. Docker builds are most valuable as a final gate before shipping.
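A Python sketch of steps 2 and 3: find `Dockerfile*` files within two levels, and probe the daemon with `docker info`, which typically exits non-zero when the CLI cannot reach a daemon. Helper names are hypothetical. The image tag is written as `pipeline_gate_test` because Docker rejects repository names that begin with `_`.

```python
import os
import shutil
import subprocess


def find_dockerfiles(root: str, max_depth: int = 2) -> list[str]:
    """Repo-relative paths of Dockerfile* files in the root and up to two levels below."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        depth = 0 if rel == "." else rel.count(os.sep) + 1
        if depth >= max_depth:
            dirnames[:] = []  # stop os.walk from descending further
        found += [os.path.normpath(os.path.join(rel, f)) for f in filenames if f.startswith("Dockerfile")]
    return sorted(found)


def docker_daemon_available() -> bool:
    """`docker info` typically exits non-zero when no daemon is reachable."""
    if shutil.which("docker") is None:
        return False
    return subprocess.run(["docker", "info"], capture_output=True).returncode == 0


def docker_gate_commands(root: str) -> list[list[str]]:
    """One build per Dockerfile, using the repo root as context (run with cwd=root)."""
    return [["docker", "build", "-f", path, "-t", "pipeline_gate_test", "."]
            for path in find_dockerfiles(root)]
```

When `docker_daemon_available()` is false, the gate would log the step 3 warning and move on instead of failing.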
|
## Output Format

Emit two files plus one state update (schemas: `references/findings-schema.md`, `references/pipeline-state.md`).

### 1. `.devlyn/build_gate.findings.jsonl`

One JSON line per failing command's extracted root cause. Do NOT emit findings for PASSING commands. Each line follows the canonical findings schema:

```jsonl
{"id":"BGATE-0001","rule_id":"build.type-error","level":"error","severity":"HIGH","confidence":0.99,"message":"Property 'config' does not exist on type 'SettingsTabsProps'","file":"dashboard/app/(dashboard)/settings/page.tsx","line":90,"phase":"build_gate","criterion_ref":null,"fix_hint":"Read dashboard/app/(dashboard)/settings/page.tsx:88-93 and dashboard/components/settings/SettingsTabs.tsx (the SettingsTabsProps type definition). Either add 'config' to SettingsTabsProps or remove the prop from the parent.","blocking":true,"status":"open"}
```

Dedup key is `(rule_id, file, line)` per `findings-schema.md` — no fingerprint bookkeeping (removed in v3.4).

Suggested `rule_id` values: `build.type-error`, `build.lint-violation`, `build.dep-missing`, `build.docker-copy-mismatch`, `build.module-not-found`, `build.compile-error`.

### 2. `.devlyn/build_gate.log.md`

Human-readable run log. This is where the FULL raw stderr/stdout lives — not in the JSONL `message`. Structure:

````markdown
# Build Gate Run Log
## Detected Project Types
- [type] ([path/])

## Commands
| # | Command | Dir | Exit | Time |
|---|---|---|---|---|
| 1 | `npx tsc --noEmit` | dashboard/ | 0 | 4.2s |
| 2 | `npx next build` | dashboard/ | 1 | 9.8s |

## Raw Output — failing commands only

### #2: `npx next build` (dashboard/, exit 1)
```
[full raw output — keep for debugging]
```
````
|
### 3. Update `pipeline.state.json`

Set `phases.build_gate`:
- `verdict`: `PASS` if all exit codes == 0, else `FAIL`. If no gates detected, `PASS` with a note in log.md ("No build gate detected — project type unknown; consider adding `--build-gate deploy` if Dockerfiles are present.")
- `engine: "bash"`, `model: null`
- `started_at`, `completed_at`, `duration_ms`, `round`
- `artifacts.findings_file: ".devlyn/build_gate.findings.jsonl"`, `artifacts.log_file: ".devlyn/build_gate.log.md"`

The orchestrator branches on `phases.build_gate.verdict` — it does NOT re-read the findings or log file for routing decisions.
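A Python sketch of the `phases.build_gate` update and the findings file. The field names come from the list above; deriving `duration_ms` from the two ISO-8601 timestamps is an assumption, as are the helper names.

```python
import json
from datetime import datetime


def build_gate_state(exit_codes: list[int], started_at: str, completed_at: str, round_no: int) -> dict:
    """The phases.build_gate block; field names follow the list above."""
    elapsed = datetime.fromisoformat(completed_at) - datetime.fromisoformat(started_at)
    return {
        # all([]) is True, so "no gate detected" also yields PASS, as specified.
        "verdict": "PASS" if all(code == 0 for code in exit_codes) else "FAIL",
        "engine": "bash",
        "model": None,
        "started_at": started_at,
        "completed_at": completed_at,
        "duration_ms": int(elapsed.total_seconds() * 1000),
        "round": round_no,
        "artifacts": {
            "findings_file": ".devlyn/build_gate.findings.jsonl",
            "log_file": ".devlyn/build_gate.log.md",
        },
    }


def findings_jsonl(findings: list[dict]) -> str:
    """One compact JSON object per line; passing commands contribute no lines."""
    return "".join(json.dumps(f, separators=(",", ":")) + "\n" for f in findings)
```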
|
@@ -1,82 +0,0 @@
# Engine Routing: Intelligent Model Selection

Routing rules for Claude / Codex / Dual per role and phase. Only read when `--engine` is `auto` or `codex`.

Codex call defaults: `model: "gpt-5.4"`, `reasoningEffort: "xhigh"`, `sandbox` per role (below), `workingDirectory: <project root>`.

---

## Pipeline Phase Routing (auto-resolve)

| Phase | `--engine auto` | `--engine codex` | `--engine claude` |
|-------|--------------|----------------|-----------------|
| BUILD | **Codex** | Codex | Claude |
| BUILD GATE | bash | bash | bash |
| BROWSER VALIDATE | Claude (Chrome MCP) | Claude | Claude |
| EVALUATE | **Claude** | Claude | Claude |
| FIX LOOP | **Codex** | Codex | Claude |
| CRITIC (design sub-pass) | Claude | Claude | Claude |
| CRITIC (security sub-pass) | **Dual** | Codex | Claude |
| DOCS | Claude | Codex | Claude |

Rationale:
- BUILD/FIX: Codex — SWE-bench Pro advantage on hard coding tasks.
- EVALUATE/CRITIC design sub-pass: Claude — long-context retrieval + skeptical reasoning; a different model family from the builder keeps the adversarial (GAN-style) dynamic.
- BROWSER: Claude — Chrome MCP tools are session-bound.
- CRITIC security: Dual on `auto` — Semgrep study shows each model finds unique vulnerabilities; for security, coverage trumps cost.
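The auto-resolve table can be encoded as data, which makes the routing and the builder-versus-evaluator split easy to check. A Python sketch; the dictionary mirrors the table rows (bold markers dropped) and the helper names are hypothetical.

```python
# Phase -> engine per --engine value, copied from the auto-resolve table.
AUTO_RESOLVE_ROUTING = {
    "BUILD": {"auto": "Codex", "codex": "Codex", "claude": "Claude"},
    "BUILD GATE": {"auto": "bash", "codex": "bash", "claude": "bash"},
    "BROWSER VALIDATE": {"auto": "Claude", "codex": "Claude", "claude": "Claude"},
    "EVALUATE": {"auto": "Claude", "codex": "Claude", "claude": "Claude"},
    "FIX LOOP": {"auto": "Codex", "codex": "Codex", "claude": "Claude"},
    "CRITIC (design sub-pass)": {"auto": "Claude", "codex": "Claude", "claude": "Claude"},
    "CRITIC (security sub-pass)": {"auto": "Dual", "codex": "Codex", "claude": "Claude"},
    "DOCS": {"auto": "Claude", "codex": "Codex", "claude": "Claude"},
}

CODEX_DEFAULTS = {"model": "gpt-5.4", "reasoningEffort": "xhigh"}


def engine_for(phase: str, engine_flag: str = "auto") -> str:
    """Look up which engine runs a phase under the given --engine value."""
    return AUTO_RESOLVE_ROUTING[phase][engine_flag]


def codex_call_args(sandbox: str, project_root: str) -> dict:
    """Codex call defaults plus the per-role sandbox and working directory."""
    return {**CODEX_DEFAULTS, "sandbox": sandbox, "workingDirectory": project_root}
```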
|
---

## Pipeline Phase Routing (ideate)

| Phase | `--engine auto` | `--engine codex` | `--engine claude` |
|-------|--------------|----------------|-----------------|
| FRAME | Claude | Codex | Claude |
| EXPLORE | Claude | Codex | Claude |
| CONVERGE | Claude | Codex | Claude |
| CHALLENGE | **Codex** (rubric critic) | Claude (role reversal) | Claude |
| DOCUMENT | Claude | Codex | Claude |

CHALLENGE: when `--engine auto`, Codex runs the rubric as critic (builder and critic always different models).

---

## Pipeline Phase Routing (preflight)

| Phase | `--engine auto` | `--engine codex` | `--engine claude` |
|-------|--------------|----------------|-----------------|
| EXTRACT | Claude | Codex | Claude |
| AUDIT (code) | **Codex** | Codex | Claude |
| AUDIT (docs) | **Claude** | Claude | Claude |
| AUDIT (browser) | Claude | Claude | Claude |
| SYNTHESIZE | Claude | Claude | Claude |

Docs auditor is always Claude (writing-quality strength for prose-drift detection). Browser is always Claude (Chrome MCP session-bound).

---

## Team-role routing (when `--team` is on OR route == strict in auto-resolve, OR team-review in standalone)

| Role | Engine | Sandbox | Rationale |
|------|--------|---------|-----------|
| root-cause-analyst | Claude | — | Git-history traversal + tool access beats SWE-bench Pro for this role |
| test-engineer | Codex | workspace-write | HumanEval edge, needs file write |
| security-auditor | Dual | read-only | Semgrep: each finds unique vulns |
| implementation-planner | Codex | read-only | SWE-bench Pro +11.7pp |
| architecture-reviewer | Claude | — | Codebase-wide pattern review = MRCR strength |
| performance-engineer | Codex | read-only | Terminal-Bench edge |
| api-designer / api-reviewer | Dual | read-only | Both find unique API issues |
| quality-reviewer | Dual | read-only | Measured ~36–73% coverage gain from dual |
| ux/ui/accessibility-* | Claude | — | Ambiguity handling + WCAG domain depth |

For Codex roles: `mcp__codex-cli__codex` with `model: "gpt-5.4"`, `reasoningEffort: "xhigh"`, sandbox per table. Include full role prompt inline; Codex has no access to TeamCreate/SendMessage/TaskCreate.

For Dual roles: run both in parallel, merge findings. Same finding from both → keep more detailed wording, mark "confirmed by both". Codex-only → prefix `[codex]`. Conflicts → keep both.
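A Python sketch of the Dual merge rule. It assumes findings carry the `(rule_id, file, line)` key from the findings schema and treats the longer message as the "more detailed wording"; both choices are assumptions, and `merge_dual` is a hypothetical name. Findings with different keys (including different rule_ids at the same spot) are all kept, which covers keeping both sides of a conflict.

```python
def merge_dual(claude: list[dict], codex: list[dict]) -> list[dict]:
    """Merge Dual-role findings from both engines."""
    def key(f: dict) -> tuple:
        return (f["rule_id"], f["file"], f["line"])

    codex_by_key = {key(f): f for f in codex}
    merged = []
    for f in claude:
        other = codex_by_key.pop(key(f), None)
        if other is None:
            merged.append(dict(f))  # Claude-only: kept as-is
        else:
            richer = max(f, other, key=lambda x: len(x["message"]))  # "more detailed" = longer
            merged.append({**richer, "message": richer["message"] + " (confirmed by both)"})
    for f in codex_by_key.values():  # Codex-only: prefixed
        merged.append({**f, "message": "[codex] " + f["message"]})
    return merged
```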
|
---

## Override behavior

- `--engine claude` → all roles/phases use Claude (no Codex calls).
- `--engine codex` → Codex handles implementation/analysis; Claude still handles orchestration, Chrome MCP, and the phases the tables above pin to Claude under `--engine codex` (EVALUATE and the CRITIC design sub-pass in auto-resolve, CHALLENGE in ideate, AUDIT (docs) and SYNTHESIZE in preflight).
- `--engine auto` (default) → each role/phase routes per the tables above.
|
@@ -1,103 +0,0 @@
# Findings Schema — `.devlyn/<phase>.findings.jsonl`

Structured findings format for phase outputs. One JSON object per line (JSONL). Written by Evaluate, Build Gate, Browser Validate, Critic, and any other phase that emits structured findings.

## Purpose

Separate structured findings from prose summaries. The orchestrator and fix-loop need machine-readable data for:
- Filtering by severity / blocking / status.
- Cross-round dedup (single primary key — see below).
- Packing into fix-batch packets for the fix-loop subagent.
- Final report aggregation.

## Canonical schema (per line)

```json
{
  "id": "<PHASE>-<4digit>",
  "rule_id": "<category>.<kebab-name>",
  "level": "note" | "warning" | "error",
  "severity": "LOW" | "MEDIUM" | "HIGH" | "CRITICAL",
  "confidence": <float 0.0..1.0>,
  "message": "<one-line human description>",
  "file": "<path relative to repo root>",
  "line": <int, 1-based>,
  "phase": "<phase name, e.g. 'evaluate'>",
  "criterion_ref": "<anchor, e.g. 'spec://requirements/2'>" | null,
  "fix_hint": "<concrete action quoting file:line to change>",
  "blocking": true | false,
  "status": "open" | "resolved" | "suppressed"
}
```
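A minimal validator for one parsed line, written in Python against the canonical schema. The regexes encode the `<PHASE>-<4digit>` and `<category>.<kebab-case-name>` formats from the field notes; the function name and error wording are hypothetical, and fields without hard format rules (`phase`, `criterion_ref`, `fix_hint`) are not checked.

```python
import re

ID_RE = re.compile(r"[A-Z]+-\d{4}")
RULE_RE = re.compile(r"[a-z]+\.[a-z0-9]+(?:-[a-z0-9]+)*")
ENUMS = {
    "level": {"note", "warning", "error"},
    "severity": {"LOW", "MEDIUM", "HIGH", "CRITICAL"},
    "status": {"open", "resolved", "suppressed"},
}


def validate_finding(f: dict) -> list[str]:
    """Return a list of schema problems; an empty list means the line is valid."""
    problems = []
    if not ID_RE.fullmatch(str(f.get("id", ""))):
        problems.append("id must look like EVAL-0007")
    if not RULE_RE.fullmatch(str(f.get("rule_id", ""))):
        problems.append("rule_id must be <category>.<kebab-case-name>")
    for name, allowed in ENUMS.items():
        if f.get(name) not in allowed:
            problems.append(f"{name} must be one of {sorted(allowed)}")
    conf = f.get("confidence")
    if not isinstance(conf, (int, float)) or isinstance(conf, bool) or not 0.0 <= conf <= 1.0:
        problems.append("confidence must be a number in [0.0, 1.0]")
    if "\n" in str(f.get("message", "")):
        problems.append("message must be one line")
    path = str(f.get("file", ""))
    if not path or path.startswith(("./", "/")) or "\\" in path:
        problems.append("file must be repo-relative with forward slashes")
    line = f.get("line")
    if not isinstance(line, int) or isinstance(line, bool) or line < 1:
        problems.append("line must be a 1-based integer")
    if not isinstance(f.get("blocking"), bool):
        problems.append("blocking must be true or false")
    return problems
```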
|
## Field semantics

### Identity

- `id` — stable within a single run. Format: `<PHASE>-<4digit>` zero-padded. Examples: `EVAL-0007`, `BUILD-0001`, `CRIT-0003`, `BGATE-0004`.
- `rule_id` — stable across runs. Format: `<category>.<kebab-case-name>`. Use existing rule_ids before inventing new ones — keeps dedup working. Common categories:
  - `correctness.*` — logic errors, silent failures, null access, wrong API contracts
  - `design.*` — staff-engineer ship/no-ship concerns (non-atomic transactions, hidden assumptions, unidiomatic patterns)
  - `security.*` — OWASP-anchored (sql-injection, xss, hardcoded-credential, missing-input-validation, missing-auth-check, insecure-dependency, permissive-cors, missing-csrf, privilege-escalation, data-exposure, path-traversal, ssrf)
  - `ux.*` — missing error/loading/empty states
  - `architecture.*` — pattern violations, duplication, missing integration
  - `hygiene.*` — unused imports, dead code, unused deps (typically LOW)
  - `types.*` — any-cast escapes, unsafe casts
  - `scope.*` — out-of-scope violations (e.g. `scope.out-of-scope-violation`)
  - `performance.*`, `style.*` — typically LOW

### Severity and level

- `level` — SARIF-style coarse bucket: `error` blocks ship, `warning` should fix, `note` informational.
- `severity` — finer granularity for pipeline logic.

| severity | level | blocking default |
|----------|-------|------------------|
| `CRITICAL` | `error` | always true |
| `HIGH` | `error` | usually true |
| `MEDIUM` | `warning` | true (stricter for `security.*` — see pipeline-routing terminal state) |
| `LOW` | `note` | false |

- `confidence` — reporter's self-rating, `0.0`–`1.0`. Fix-loop prioritizes high-confidence HIGH findings first.
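The severity table and the fix-loop ordering can be sketched in Python. Placing `CRITICAL` ahead of `HIGH` in the fix order, and reading HIGH's "usually true" as `True`, are assumptions; the helper names are hypothetical.

```python
# severity -> (level, default blocking), from the table above.
SEVERITY_DEFAULTS = {
    "CRITICAL": ("error", True),
    "HIGH": ("error", True),  # "usually true"
    "MEDIUM": ("warning", True),
    "LOW": ("note", False),
}
SEVERITY_RANK = {"CRITICAL": 0, "HIGH": 1, "MEDIUM": 2, "LOW": 3}


def apply_defaults(finding: dict) -> dict:
    """Fill level/blocking from severity; values already present win."""
    level, blocking = SEVERITY_DEFAULTS[finding["severity"]]
    return {"level": level, "blocking": blocking, **finding}


def fix_order(findings: list[dict]) -> list[dict]:
    """Most severe first; within a severity, higher confidence first."""
    return sorted(findings, key=lambda f: (SEVERITY_RANK[f["severity"]], -f["confidence"]))
```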
|
### Message and location

- `message` — one line. NAME the issue, not the symptom. Good: `"Token validated on read path but not write path"`. Bad: `"Potential security issue"`.
- `file` — repo-relative path. No leading `./`, no absolute paths. Forward slashes.
- `line` — 1-based line number. Multi-line spans → primary (first) line.

### Dedup primary key

The primary key is `(rule_id, file, line)`. Two findings with identical coordinates are the same issue. If EVAL runs again after a fix and the finding reappears at the same spot, it's still unresolved. If the line shifted after a fix, the next EVAL regenerates the finding with the new line: under the cross-round reconciliation rules, the old record is marked `resolved` and the issue continues under a new `id` at the new key. Cross-round drift heals naturally via re-evaluation, with no hash-normalization bookkeeping.
|
### Pipeline linkage

- `phase` — the phase that produced this finding (redundant with filename but keeps records self-describing when concatenated for fix-batch packets).
- `criterion_ref` — anchor to the criterion this finding affects, or `null` if cross-cutting.
- `fix_hint` — concrete action with file:line. Not "improve error handling" — e.g. `"Wrap read+write in db.transaction() at src/auth/session.ts:84-92; re-check order.status === 'pending' inside transaction before updating"`.

### Lifecycle

- `blocking` — does this block ship?
- `status`:
  - `open` — reported, awaiting action
  - `resolved` — a fix-loop round applied a fix AND subsequent evaluation confirms the finding is gone (via `(rule_id, file, line)` absence in the new EVAL run)
  - `suppressed` — intentionally not fixed with justification in the phase's `log.md`; requires user override or Out-of-Scope mapping in the spec

## Dedup and round handling

Each phase writes a fresh `.findings.jsonl` on each execution. Fix rounds re-run EVAL, which produces a new file.

**Cross-round reconciliation** (fix-loop rounds only):
1. Read prior round's `<phase>.findings.jsonl` (if any).
2. For each finding in the new file: if prior open finding has same `(rule_id, file, line)` → reuse the prior `id`, keep `status: open`. Otherwise → new `id`.
3. For each prior open finding not matched by the new file → set `status: resolved` in the prior file.
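Steps 2 and 3 of the reconciliation, sketched in Python over already-parsed findings (step 1, reading the prior file, is left to the caller). Numbering fresh ids after the highest prior id, so ids stay unique within the run, is an assumption; so is the helper name.

```python
def reconcile(prior: list[dict], new: list[dict], prefix: str) -> tuple[list[dict], list[dict]]:
    """Return (updated prior findings, new findings with ids assigned)."""
    def key(f: dict) -> tuple:
        return (f["rule_id"], f["file"], f["line"])

    prior_open = {key(f): f for f in prior if f["status"] == "open"}
    next_n = max((int(f["id"].rsplit("-", 1)[1]) for f in prior), default=0) + 1
    matched = set()
    assigned = []
    for f in new:
        k = key(f)
        if k in prior_open:  # step 2: same key, so reuse the prior id and stay open
            assigned.append({**f, "id": prior_open[k]["id"], "status": "open"})
            matched.add(k)
        else:  # step 2: otherwise, a fresh id
            assigned.append({**f, "id": f"{prefix}-{next_n:04d}", "status": "open"})
            next_n += 1
    # Step 3: prior open findings the new run did not reproduce are resolved.
    updated_prior = [
        {**f, "status": "resolved"} if f["status"] == "open" and key(f) not in matched else f
        for f in prior
    ]
    return updated_prior, assigned
```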
|
Fix-batch packet: the orchestrator concatenates all phases' `.findings.jsonl` files, keeps only `status == "open"`, drops `blocking == false` findings when the round budget is tight, and writes the result to `.devlyn/fix-batch.round-<N>.json`.
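A Python sketch of building the fix-batch packet. How "the round budget is tight" gets decided is not specified here, so it is a boolean parameter; writing the packet as a JSON array is also an assumption, and `build_fix_batch` is a hypothetical name.

```python
import glob
import json
import os


def build_fix_batch(devlyn_dir: str, round_no: int, budget_tight: bool) -> str:
    """Concatenate every phase's findings, filter, and write the round's packet."""
    findings = []
    for path in sorted(glob.glob(os.path.join(devlyn_dir, "*.findings.jsonl"))):
        with open(path, encoding="utf-8") as fh:
            findings += [json.loads(line) for line in fh if line.strip()]
    # Keep open findings; drop non-blocking ones only when the budget is tight.
    batch = [f for f in findings if f["status"] == "open" and (f["blocking"] or not budget_tight)]
    out = os.path.join(devlyn_dir, f"fix-batch.round-{round_no}.json")
    with open(out, "w", encoding="utf-8") as fh:
        json.dump(batch, fh, indent=2)
    return out
```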
|
## Non-goals

- Full SARIF 2.1.0 export (can derive later by mapping our fields to SARIF).
- Cross-project aggregation.
- Rule metadata catalogs.
- Code fix patches — `fix_hint` is prose; fix-loop subagent writes the patch.