devlyn-cli 1.15.0 → 2.0.0

Files changed (158)
  1. package/AGENTS.md +104 -0
  2. package/CLAUDE.md +135 -21
  3. package/README.md +43 -125
  4. package/benchmark/auto-resolve/BENCHMARK-DESIGN.md +272 -0
  5. package/benchmark/auto-resolve/README.md +114 -0
  6. package/benchmark/auto-resolve/RUBRIC.md +162 -0
  7. package/benchmark/auto-resolve/fixtures/F1-cli-trivial-flag/NOTES.md +30 -0
  8. package/benchmark/auto-resolve/fixtures/F1-cli-trivial-flag/expected.json +68 -0
  9. package/benchmark/auto-resolve/fixtures/F1-cli-trivial-flag/metadata.json +10 -0
  10. package/benchmark/auto-resolve/fixtures/F1-cli-trivial-flag/setup.sh +4 -0
  11. package/benchmark/auto-resolve/fixtures/F1-cli-trivial-flag/spec.md +45 -0
  12. package/benchmark/auto-resolve/fixtures/F1-cli-trivial-flag/task.txt +8 -0
  13. package/benchmark/auto-resolve/fixtures/F2-cli-medium-subcommand/NOTES.md +54 -0
  14. package/benchmark/auto-resolve/fixtures/F2-cli-medium-subcommand/expected-pair-plan-registry.json +170 -0
  15. package/benchmark/auto-resolve/fixtures/F2-cli-medium-subcommand/expected.json +84 -0
  16. package/benchmark/auto-resolve/fixtures/F2-cli-medium-subcommand/metadata.json +21 -0
  17. package/benchmark/auto-resolve/fixtures/F2-cli-medium-subcommand/pair-plan.sample-fail.json +214 -0
  18. package/benchmark/auto-resolve/fixtures/F2-cli-medium-subcommand/pair-plan.sample-pass.json +223 -0
  19. package/benchmark/auto-resolve/fixtures/F2-cli-medium-subcommand/setup.sh +5 -0
  20. package/benchmark/auto-resolve/fixtures/F2-cli-medium-subcommand/spec.md +56 -0
  21. package/benchmark/auto-resolve/fixtures/F2-cli-medium-subcommand/task.txt +14 -0
  22. package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/NOTES.md +28 -0
  23. package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/expected-pair-plan-registry.json +162 -0
  24. package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/expected.json +65 -0
  25. package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/metadata.json +19 -0
  26. package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/setup.sh +4 -0
  27. package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/spec.md +56 -0
  28. package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/task.txt +9 -0
  29. package/benchmark/auto-resolve/fixtures/F4-web-browser-design/NOTES.md +40 -0
  30. package/benchmark/auto-resolve/fixtures/F4-web-browser-design/expected.json +57 -0
  31. package/benchmark/auto-resolve/fixtures/F4-web-browser-design/metadata.json +10 -0
  32. package/benchmark/auto-resolve/fixtures/F4-web-browser-design/setup.sh +6 -0
  33. package/benchmark/auto-resolve/fixtures/F4-web-browser-design/spec.md +49 -0
  34. package/benchmark/auto-resolve/fixtures/F4-web-browser-design/task.txt +9 -0
  35. package/benchmark/auto-resolve/fixtures/F5-fix-loop-red-green/NOTES.md +38 -0
  36. package/benchmark/auto-resolve/fixtures/F5-fix-loop-red-green/expected.json +65 -0
  37. package/benchmark/auto-resolve/fixtures/F5-fix-loop-red-green/metadata.json +10 -0
  38. package/benchmark/auto-resolve/fixtures/F5-fix-loop-red-green/setup.sh +55 -0
  39. package/benchmark/auto-resolve/fixtures/F5-fix-loop-red-green/spec.md +49 -0
  40. package/benchmark/auto-resolve/fixtures/F5-fix-loop-red-green/task.txt +7 -0
  41. package/benchmark/auto-resolve/fixtures/F6-dep-audit-native-module/NOTES.md +38 -0
  42. package/benchmark/auto-resolve/fixtures/F6-dep-audit-native-module/expected.json +77 -0
  43. package/benchmark/auto-resolve/fixtures/F6-dep-audit-native-module/metadata.json +10 -0
  44. package/benchmark/auto-resolve/fixtures/F6-dep-audit-native-module/setup.sh +4 -0
  45. package/benchmark/auto-resolve/fixtures/F6-dep-audit-native-module/spec.md +49 -0
  46. package/benchmark/auto-resolve/fixtures/F6-dep-audit-native-module/task.txt +10 -0
  47. package/benchmark/auto-resolve/fixtures/F7-out-of-scope-trap/NOTES.md +50 -0
  48. package/benchmark/auto-resolve/fixtures/F7-out-of-scope-trap/expected.json +76 -0
  49. package/benchmark/auto-resolve/fixtures/F7-out-of-scope-trap/metadata.json +10 -0
  50. package/benchmark/auto-resolve/fixtures/F7-out-of-scope-trap/setup.sh +36 -0
  51. package/benchmark/auto-resolve/fixtures/F7-out-of-scope-trap/spec.md +46 -0
  52. package/benchmark/auto-resolve/fixtures/F7-out-of-scope-trap/task.txt +7 -0
  53. package/benchmark/auto-resolve/fixtures/F8-known-limit-ambiguous/NOTES.md +50 -0
  54. package/benchmark/auto-resolve/fixtures/F8-known-limit-ambiguous/expected.json +63 -0
  55. package/benchmark/auto-resolve/fixtures/F8-known-limit-ambiguous/metadata.json +10 -0
  56. package/benchmark/auto-resolve/fixtures/F8-known-limit-ambiguous/setup.sh +4 -0
  57. package/benchmark/auto-resolve/fixtures/F8-known-limit-ambiguous/spec.md +48 -0
  58. package/benchmark/auto-resolve/fixtures/F8-known-limit-ambiguous/task.txt +1 -0
  59. package/benchmark/auto-resolve/fixtures/F9-e2e-ideate-to-resolve/NOTES.md +93 -0
  60. package/benchmark/auto-resolve/fixtures/F9-e2e-ideate-to-resolve/expected.json +74 -0
  61. package/benchmark/auto-resolve/fixtures/F9-e2e-ideate-to-resolve/metadata.json +10 -0
  62. package/benchmark/auto-resolve/fixtures/F9-e2e-ideate-to-resolve/setup.sh +28 -0
  63. package/benchmark/auto-resolve/fixtures/F9-e2e-ideate-to-resolve/spec.md +62 -0
  64. package/benchmark/auto-resolve/fixtures/F9-e2e-ideate-to-resolve/task.txt +5 -0
  65. package/benchmark/auto-resolve/fixtures/SCHEMA.md +130 -0
  66. package/benchmark/auto-resolve/fixtures/test-repo/README.md +27 -0
  67. package/benchmark/auto-resolve/fixtures/test-repo/bin/cli.js +63 -0
  68. package/benchmark/auto-resolve/fixtures/test-repo/package-lock.json +823 -0
  69. package/benchmark/auto-resolve/fixtures/test-repo/package.json +22 -0
  70. package/benchmark/auto-resolve/fixtures/test-repo/playwright.config.js +17 -0
  71. package/benchmark/auto-resolve/fixtures/test-repo/server/index.js +37 -0
  72. package/benchmark/auto-resolve/fixtures/test-repo/tests/cli.test.js +25 -0
  73. package/benchmark/auto-resolve/fixtures/test-repo/tests/server.test.js +58 -0
  74. package/benchmark/auto-resolve/fixtures/test-repo/web/index.html +37 -0
  75. package/benchmark/auto-resolve/scripts/build-pair-eligible-manifest.py +174 -0
  76. package/benchmark/auto-resolve/scripts/check-f9-artifacts.py +256 -0
  77. package/benchmark/auto-resolve/scripts/compile-report.py +331 -0
  78. package/benchmark/auto-resolve/scripts/iter-0033c-compare.py +552 -0
  79. package/benchmark/auto-resolve/scripts/judge-opus-pass.sh +430 -0
  80. package/benchmark/auto-resolve/scripts/judge.sh +359 -0
  81. package/benchmark/auto-resolve/scripts/oracle-scope-tier-a.py +260 -0
  82. package/benchmark/auto-resolve/scripts/oracle-scope-tier-b.py +274 -0
  83. package/benchmark/auto-resolve/scripts/oracle-test-fidelity.py +328 -0
  84. package/benchmark/auto-resolve/scripts/pair-plan-idgen.py +401 -0
  85. package/benchmark/auto-resolve/scripts/pair-plan-lint.py +468 -0
  86. package/benchmark/auto-resolve/scripts/run-fixture.sh +691 -0
  87. package/benchmark/auto-resolve/scripts/run-iter-0033c.sh +234 -0
  88. package/benchmark/auto-resolve/scripts/run-suite.sh +214 -0
  89. package/benchmark/auto-resolve/scripts/ship-gate.py +222 -0
  90. package/bin/devlyn.js +129 -17
  91. package/config/skills/_shared/adapters/README.md +64 -0
  92. package/config/skills/_shared/adapters/gpt-5-5.md +29 -0
  93. package/config/skills/_shared/adapters/opus-4-7.md +29 -0
  94. package/config/skills/{devlyn:auto-resolve/scripts → _shared}/archive_run.py +26 -0
  95. package/config/skills/_shared/codex-config.md +54 -0
  96. package/config/skills/_shared/codex-monitored.sh +141 -0
  97. package/config/skills/_shared/engine-preflight.md +35 -0
  98. package/config/skills/_shared/expected.schema.json +93 -0
  99. package/config/skills/_shared/pair-plan-schema.md +298 -0
  100. package/config/skills/_shared/runtime-principles.md +110 -0
  101. package/config/skills/_shared/spec-verify-check.py +519 -0
  102. package/config/skills/devlyn:ideate/SKILL.md +99 -429
  103. package/config/skills/devlyn:ideate/references/elicitation.md +97 -0
  104. package/config/skills/devlyn:ideate/references/from-spec-mode.md +54 -0
  105. package/config/skills/devlyn:ideate/references/project-mode.md +76 -0
  106. package/config/skills/devlyn:ideate/references/spec-template.md +102 -0
  107. package/config/skills/devlyn:resolve/SKILL.md +172 -184
  108. package/config/skills/devlyn:resolve/references/free-form-mode.md +68 -0
  109. package/config/skills/devlyn:resolve/references/phases/build-gate.md +45 -0
  110. package/config/skills/devlyn:resolve/references/phases/cleanup.md +39 -0
  111. package/config/skills/devlyn:resolve/references/phases/implement.md +42 -0
  112. package/config/skills/devlyn:resolve/references/phases/plan.md +42 -0
  113. package/config/skills/devlyn:resolve/references/phases/verify.md +69 -0
  114. package/config/skills/devlyn:resolve/references/state-schema.md +106 -0
  115. package/{config/skills → optional-skills}/devlyn:design-system/SKILL.md +1 -0
  116. package/{config/skills → optional-skills}/devlyn:reap/SKILL.md +1 -0
  117. package/{config/skills → optional-skills}/devlyn:team-design-ui/SKILL.md +5 -0
  118. package/package.json +12 -2
  119. package/scripts/lint-skills.sh +431 -0
  120. package/config/skills/devlyn:auto-resolve/SKILL.md +0 -252
  121. package/config/skills/devlyn:auto-resolve/evals/evals.json +0 -21
  122. package/config/skills/devlyn:auto-resolve/evals/task-doctor-subcommand.md +0 -42
  123. package/config/skills/devlyn:auto-resolve/references/build-gate.md +0 -130
  124. package/config/skills/devlyn:auto-resolve/references/engine-routing.md +0 -82
  125. package/config/skills/devlyn:auto-resolve/references/findings-schema.md +0 -103
  126. package/config/skills/devlyn:auto-resolve/references/phases/phase-1-build.md +0 -54
  127. package/config/skills/devlyn:auto-resolve/references/phases/phase-2-evaluate.md +0 -45
  128. package/config/skills/devlyn:auto-resolve/references/phases/phase-3-critic.md +0 -84
  129. package/config/skills/devlyn:auto-resolve/references/pipeline-routing.md +0 -114
  130. package/config/skills/devlyn:auto-resolve/references/pipeline-state.md +0 -201
  131. package/config/skills/devlyn:auto-resolve/scripts/terminal_verdict.py +0 -96
  132. package/config/skills/devlyn:browser-validate/SKILL.md +0 -164
  133. package/config/skills/devlyn:browser-validate/references/flow-testing.md +0 -118
  134. package/config/skills/devlyn:browser-validate/references/tier1-chrome.md +0 -137
  135. package/config/skills/devlyn:browser-validate/references/tier2-playwright.md +0 -195
  136. package/config/skills/devlyn:browser-validate/references/tier3-curl.md +0 -57
  137. package/config/skills/devlyn:clean/SKILL.md +0 -285
  138. package/config/skills/devlyn:design-ui/SKILL.md +0 -351
  139. package/config/skills/devlyn:discover-product/SKILL.md +0 -124
  140. package/config/skills/devlyn:evaluate/SKILL.md +0 -564
  141. package/config/skills/devlyn:feature-spec/SKILL.md +0 -630
  142. package/config/skills/devlyn:ideate/references/challenge-rubric.md +0 -122
  143. package/config/skills/devlyn:ideate/references/codex-critic-template.md +0 -42
  144. package/config/skills/devlyn:ideate/references/templates/item-spec.md +0 -90
  145. package/config/skills/devlyn:implement-ui/SKILL.md +0 -466
  146. package/config/skills/devlyn:preflight/SKILL.md +0 -355
  147. package/config/skills/devlyn:preflight/references/auditors/browser-auditor.md +0 -32
  148. package/config/skills/devlyn:preflight/references/auditors/code-auditor.md +0 -86
  149. package/config/skills/devlyn:preflight/references/auditors/docs-auditor.md +0 -38
  150. package/config/skills/devlyn:product-spec/SKILL.md +0 -603
  151. package/config/skills/devlyn:recommend-features/SKILL.md +0 -286
  152. package/config/skills/devlyn:review/SKILL.md +0 -161
  153. package/config/skills/devlyn:team-resolve/SKILL.md +0 -631
  154. package/config/skills/devlyn:team-review/SKILL.md +0 -493
  155. package/config/skills/devlyn:update-docs/SKILL.md +0 -463
  156. package/config/skills/workflow-routing/SKILL.md +0 -73
  157. package/{config/skills → optional-skills}/devlyn:reap/scripts/reap.sh +0 -0
  158. package/{config/skills → optional-skills}/devlyn:reap/scripts/scan.sh +0 -0
package/bin/devlyn.js CHANGED
@@ -17,6 +17,7 @@ const CLI_TARGETS = {
    codex: {
      name: 'Codex CLI (OpenAI)',
      instructionsFile: 'AGENTS.md',
+     baseInstructionsFile: 'AGENTS.md',
      configDir: null, // Codex uses AGENTS.md at project root
      detect: () => fs.existsSync(path.join(process.cwd(), 'AGENTS.md')) || fs.existsSync(path.join(process.cwd(), '.codex')),
    },
@@ -68,8 +69,15 @@ const DEPRECATED_FILES = [
    'commands/devlyn.pencil-push.md', // migrated to skills/devlyn:pencil-push
  ];

- // Skill directories renamed from devlyn-* to devlyn:* in v0.7.x
+ // Skill directories renamed from devlyn-* to devlyn:* in v0.7.x, plus
+ // iter-0034 Phase 4 cutover (2026-05-03): 15 user skills deleted and 3 moved
+ // to optional-skills/. Listed here so post-cutover `npx devlyn-cli` upgrades
+ // force-remove stale legacy skill dirs from downstream `~/.claude/skills/`
+ // even though the source dirs no longer exist (cleanManagedSkillDirs only
+ // removes target dirs that still exist in source — without this list,
+ // deleted-from-source skills persist in user installs forever).
  const DEPRECATED_DIRS = [
+   // v0.7.x rename: devlyn-* → devlyn:*
    'skills/devlyn-clean',
    'skills/devlyn-design-system',
    'skills/devlyn-design-ui',
@@ -87,6 +95,28 @@ const DEPRECATED_DIRS = [
    'skills/devlyn-update-docs',
    'skills/devlyn-pencil-pull',
    'skills/devlyn-pencil-push',
+   // iter-0034 Phase 4 cutover: deleted user skills
+   'skills/devlyn:auto-resolve',
+   'skills/devlyn:browser-validate',
+   'skills/devlyn:clean',
+   'skills/devlyn:design-ui',
+   'skills/devlyn:discover-product',
+   'skills/devlyn:evaluate',
+   'skills/devlyn:feature-spec',
+   'skills/devlyn:implement-ui',
+   'skills/devlyn:preflight',
+   'skills/devlyn:product-spec',
+   'skills/devlyn:recommend-features',
+   'skills/devlyn:review',
+   'skills/devlyn:team-resolve',
+   'skills/devlyn:team-review',
+   'skills/devlyn:update-docs',
+   // iter-0034 Phase 4 cutover: moved to optional-skills/. Force-removed on
+   // upgrade so users only have them if they opt in via the interactive
+   // installer (matches the pencil-pull / pencil-push pattern).
+   'skills/devlyn:reap',
+   'skills/devlyn:design-system',
+   'skills/devlyn:team-design-ui',
  ];

  function getTargetDir() {
@@ -148,6 +178,9 @@ const OPTIONAL_ADDONS = [
    { name: 'dokkit', desc: 'Document template filling for DOCX/HWPX — ingest, fill, review, export', type: 'local' },
    { name: 'devlyn:pencil-pull', desc: 'Pull Pencil designs into code with exact visual fidelity', type: 'local' },
    { name: 'devlyn:pencil-push', desc: 'Push codebase UI to Pencil canvas for design sync', type: 'local' },
+   { name: 'devlyn:reap', desc: 'Safely reap orphaned MCP / codex / Superset child processes left behind by long Claude sessions', type: 'local' },
+   { name: 'devlyn:design-system', desc: 'Extract design tokens from a chosen UI style for exact reproduction (creative power-user)', type: 'local' },
+   { name: 'devlyn:team-design-ui', desc: '5 distinct UI style explorations from a full design team (creative power-user)', type: 'local' },
    // External skill packs (installed via npx skills add)
    { name: 'vercel-labs/agent-skills', desc: 'React, Next.js, React Native best practices', type: 'external' },
    { name: 'supabase/agent-skills', desc: 'Supabase integration patterns', type: 'external' },
@@ -155,8 +188,10 @@ const OPTIONAL_ADDONS = [
    { name: 'anthropics/skills', desc: 'Official Anthropic skill-creator with eval framework and description optimizer', type: 'external' },
    { name: 'Leonxlnx/taste-skill', desc: 'Premium frontend design skills — modern layouts, animations, and visual refinement', type: 'external' },
    // MCP servers (installed via claude mcp add)
-   { name: 'codex-cli', desc: 'Codex MCP server for cross-model evaluation via OpenAI Codex', type: 'mcp', command: 'npx -y codex-mcp-server' },
-   { name: 'playwright', desc: 'Playwright MCP for browser testing powers devlyn:browser-validate Tier 2', type: 'mcp', command: 'npx -y @anthropic-ai/mcp-playwright' },
+   // Note: the Codex integration uses the local `codex` CLI binary (not MCP).
+   // Install the CLI separately per https://platform.openai.com/docs/codex; the
+   // harness auto-detects availability and downgrades to Claude-only on failure.
+   { name: 'playwright', desc: 'Playwright MCP for browser testing — powers /devlyn:resolve BUILD_GATE browser tier', type: 'mcp', command: 'npx -y @anthropic-ai/mcp-playwright' },
  ];

  function log(msg, color = 'reset') {
@@ -262,7 +297,7 @@ function cleanupDeprecated(targetDir) {
      const fullPath = path.join(targetDir, relPath);
      if (fs.existsSync(fullPath)) {
        fs.rmSync(fullPath, { recursive: true });
-       log(` ✕ ${relPath}/ (renamed)`, 'dim');
+       log(` ✕ ${relPath}/ (removed)`, 'dim');
        removed++;
      }
    }
@@ -273,6 +308,8 @@ function copyRecursive(src, dest, baseDir) {
    const stats = fs.statSync(src);

    if (stats.isDirectory()) {
+     // Never install dev workspaces, even when running from source repo.
+     if (UNSHIPPED_SKILL_DIRS.has(path.basename(src))) return;
      if (!fs.existsSync(dest)) {
        fs.mkdirSync(dest, { recursive: true });
      }
@@ -290,6 +327,37 @@ function copyRecursive(src, dest, baseDir) {
    }
  }

+ // Dev artifacts that live under config/skills/ but must never ship or install.
+ // Mirrors the `!` exclusions in package.json files[].
+ const UNSHIPPED_SKILL_DIRS = new Set([
+   'devlyn:auto-resolve-workspace',
+   'devlyn:ideate-workspace',
+   'preflight-workspace',
+   'roadmap-archival-workspace',
+ ]);
+
+ // Clean managed skill directories before copy to prevent stale-file drift.
+ // copyRecursive is a pure overlay: if a file was removed or renamed in source,
+ // the installed mirror keeps the old copy. For each top-level dir under
+ // config/skills/, remove its counterpart in target/skills/ before the copy so
+ // each managed skill is fully replaced on every sync. User-installed skills
+ // (e.g. skill-creator from optional addons) are left alone because they have
+ // no counterpart in source. Dev workspaces are skipped entirely.
+ function cleanManagedSkillDirs(sourceSkillsDir, targetSkillsDir) {
+   if (!fs.existsSync(sourceSkillsDir) || !fs.existsSync(targetSkillsDir)) return 0;
+   let cleaned = 0;
+   for (const entry of fs.readdirSync(sourceSkillsDir, { withFileTypes: true })) {
+     if (!entry.isDirectory()) continue;
+     if (UNSHIPPED_SKILL_DIRS.has(entry.name)) continue;
+     const targetPath = path.join(targetSkillsDir, entry.name);
+     if (fs.existsSync(targetPath)) {
+       fs.rmSync(targetPath, { recursive: true, force: true });
+       cleaned++;
+     }
+   }
+   return cleaned;
+ }
+
  function multiSelect(items) {
    return new Promise((resolve) => {
      const selected = new Set();
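Reviewer note: the comments in this hunk and in the `DEPRECATED_DIRS` hunk describe two cleanup passes that cover different cases. The sketch below illustrates that reasoning under stated assumptions: `cleanManagedSkillDirs` is copied from the hunk without the `UNSHIPPED_SKILL_DIRS` check, `removeDeprecatedDirs` is a hypothetical stand-in for the package's `cleanupDeprecated`, and the skill names `skill-kept` and `skill-deleted` are made up (colon-free so the demo also runs on filesystems that reject `:`).

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Copied from the hunk (minus the dev-workspace skip): a target skill dir is
// removed only when a same-named directory still exists in source.
function cleanManagedSkillDirs(sourceSkillsDir, targetSkillsDir) {
  if (!fs.existsSync(sourceSkillsDir) || !fs.existsSync(targetSkillsDir)) return 0;
  let cleaned = 0;
  for (const entry of fs.readdirSync(sourceSkillsDir, { withFileTypes: true })) {
    if (!entry.isDirectory()) continue;
    const targetPath = path.join(targetSkillsDir, entry.name);
    if (fs.existsSync(targetPath)) {
      fs.rmSync(targetPath, { recursive: true, force: true });
      cleaned++;
    }
  }
  return cleaned;
}

// Hypothetical stand-in for cleanupDeprecated(): removes every listed path
// under the install root, whether or not source still ships it.
function removeDeprecatedDirs(targetDir, deprecatedDirs) {
  let removed = 0;
  for (const relPath of deprecatedDirs) {
    const fullPath = path.join(targetDir, relPath);
    if (fs.existsSync(fullPath)) {
      fs.rmSync(fullPath, { recursive: true, force: true });
      removed++;
    }
  }
  return removed;
}

function demoStaleSkillCleanup() {
  const root = fs.mkdtempSync(path.join(os.tmpdir(), 'devlyn-demo-'));
  const sourceSkills = path.join(root, 'source', 'skills');
  const targetDir = path.join(root, 'target');
  const targetSkills = path.join(targetDir, 'skills');

  // Source still ships skill-kept; skill-deleted exists only in the old install.
  fs.mkdirSync(path.join(sourceSkills, 'skill-kept'), { recursive: true });
  fs.mkdirSync(path.join(targetSkills, 'skill-kept'), { recursive: true });
  fs.mkdirSync(path.join(targetSkills, 'skill-deleted'), { recursive: true });

  const cleaned = cleanManagedSkillDirs(sourceSkills, targetSkills);
  const staleAfterClean = fs.existsSync(path.join(targetSkills, 'skill-deleted'));
  const removed = removeDeprecatedDirs(targetDir, ['skills/skill-deleted']);
  const staleAfterDeprecated = fs.existsSync(path.join(targetSkills, 'skill-deleted'));

  fs.rmSync(root, { recursive: true, force: true });
  return { cleaned, staleAfterClean, removed, staleAfterDeprecated };
}

console.log(demoStaleSkillCleanup());
```

In this sketch the source-driven pass leaves `skill-deleted` in place because nothing in source names it, which is the gap the new `DEPRECATED_DIRS` entries are meant to close.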
@@ -310,8 +378,8 @@ function multiSelect(items) {
        const checkbox = selected.has(i) ? `${COLORS.green}◉${COLORS.reset}` : `${COLORS.dim}○${COLORS.reset}`;
        const pointer = i === cursor ? `${COLORS.cyan}❯${COLORS.reset}` : ' ';
        const name = i === cursor ? `${COLORS.cyan}${item.name}${COLORS.reset}` : item.name;
-       const tagLabel = item.type === 'mcp' ? 'mcp' : item.type === 'local' ? 'skill' : 'pack';
-       const tagColor = item.type === 'mcp' ? COLORS.green : item.type === 'local' ? COLORS.magenta : COLORS.cyan;
+       const tagLabel = item.type === 'mcp' ? 'mcp' : item.type === 'local' ? 'skill' : item.type === 'cli' ? 'cli' : 'pack';
+       const tagColor = item.type === 'mcp' ? COLORS.green : item.type === 'local' ? COLORS.magenta : item.type === 'cli' ? COLORS.blue : COLORS.cyan;
        const tag = `${tagColor}${tagLabel}${COLORS.reset}`;
        console.log(`${pointer} ${checkbox} ${name} ${COLORS.dim}[${tag}${COLORS.dim}]${COLORS.reset}`);
        console.log(` ${COLORS.dim}${item.desc}${COLORS.reset}`);
@@ -482,6 +550,11 @@ function installAgentsForCLI(cliKey) {
        const sepIdx = existing.lastIndexOf('---', markerIdx);
        existing = existing.slice(0, sepIdx > 0 ? sepIdx : markerIdx).trimEnd();
      }
+   } else if (cli.baseInstructionsFile) {
+     const baseInstructionsSrc = path.join(__dirname, '..', cli.baseInstructionsFile);
+     if (fs.existsSync(baseInstructionsSrc)) {
+       existing = fs.readFileSync(baseInstructionsSrc, 'utf8').trimEnd();
+     }
    }

    fs.writeFileSync(destFile, existing + separator + agentContent + '\n');
@@ -514,6 +587,13 @@ async function init(skipPrompts = false) {
    // Install core config
    const targetDir = getTargetDir();
    log('\n📁 Installing core config to .claude/', 'green');
+   const refreshed = cleanManagedSkillDirs(
+     path.join(CONFIG_SOURCE, 'skills'),
+     path.join(targetDir, 'skills'),
+   );
+   if (refreshed > 0) {
+     log(` 🔄 Refreshing ${refreshed} managed skill director${refreshed === 1 ? 'y' : 'ies'}`, 'dim');
+   }
    copyRecursive(CONFIG_SOURCE, targetDir, targetDir);

    // Remove deprecated files from previous versions
@@ -522,7 +602,8 @@ async function init(skipPrompts = false) {
      log(`\n🧹 Cleaned up ${removed} deprecated file${removed > 1 ? 's' : ''}`, 'yellow');
    }

-   // Copy CLAUDE.md to project root
+   // Copy Claude project instructions to project root. Other CLI instruction
+   // files are installed only when explicitly selected below or via `agents`.
    const claudeMdSrc = path.join(__dirname, '..', 'CLAUDE.md');
    const claudeMdDest = path.join(process.cwd(), 'CLAUDE.md');
    if (fs.existsSync(claudeMdSrc)) {
@@ -609,26 +690,39 @@ async function init(skipPrompts = false) {
      log(' → ~/.claude/settings.json (disabled adaptive thinking, enabled 1h prompt caching)', 'dim');
    }

-   // Install agents for other detected CLIs
-   const detected = detectOtherCLIs();
-   if (detected.length > 0) {
-     log(`\n🔍 Detected other AI CLIs: ${detected.map((k) => CLI_TARGETS[k].name).join(', ')}`, 'blue');
-     const agentsInstalled = installAgentsForAllDetected();
-     if (agentsInstalled > 0) {
-       log(` ✅ Agent instructions installed for ${agentsInstalled} CLI${agentsInstalled > 1 ? 's' : ''}`, 'green');
-     }
-   }
-
    log('\n✅ Core config installed!', 'green');

    // Skip prompts if -y flag or non-interactive
    if (skipPrompts || !process.stdin.isTTY) {
      log('\n💡 Add optional addons later: run `npx devlyn-cli` without -y', 'dim');
+     log(' Add Codex instructions later: run `npx devlyn-cli agents codex`', 'dim');
      log(`\n${COLORS.dim} Enjoying devlyn? Star it on GitHub — it helps others find it:${COLORS.reset}`);
      log(` ${COLORS.purple}→ https://github.com/fysoul17/devlyn-cli${COLORS.reset}\n`);
      return;
    }

+   // Ask which non-Claude CLIs should receive instruction files.
+   log('\n🤖 Optional AI CLI instructions:\n', 'blue');
+   const cliOptions = Object.entries(CLI_TARGETS).map(([key, cli]) => ({
+     key,
+     name: cli.name,
+     desc: cli.configDir
+       ? `Install agents into ${cli.configDir}/`
+       : `Install ${cli.instructionsFile}`,
+     type: 'cli',
+   }));
+   const selectedClis = await multiSelect(cliOptions);
+   if (selectedClis.length > 0) {
+     let agentsInstalled = 0;
+     for (const selectedCli of selectedClis) {
+       if (installAgentsForCLI(selectedCli.key)) agentsInstalled++;
+     }
+     log(` ✅ Agent instructions installed for ${agentsInstalled} CLI${agentsInstalled !== 1 ? 's' : ''}`, 'green');
+   } else {
+     log('💡 No additional CLI instructions selected', 'dim');
+     log(' Run `npx devlyn-cli agents codex` later to install Codex AGENTS.md', 'dim');
+   }
+
    // Ask about optional addons (local skills + external packs)
    log('\n📚 Optional skills & packs:\n', 'blue');

@@ -657,6 +751,9 @@ function showHelp() {
    log(' npx devlyn-cli -y Install without prompts');
    log(' npx devlyn-cli agents Install agents for detected CLIs');
    log(' npx devlyn-cli agents all Install agents for all supported CLIs');
+   log(' npx devlyn-cli benchmark Run the full A/B benchmark suite vs bare');
+   log(' npx devlyn-cli benchmark --n 3 --bless Ship-decision run + promote baseline if pass');
+   log(' npx devlyn-cli benchmark --dry-run Validate suite setup without model invocation');
    log(' npx devlyn-cli --help Show this help\n');
    log('Optional skills (select during install):', 'green');
    OPTIONAL_ADDONS.filter((a) => a.type === 'local').forEach((skill) => {
@@ -694,6 +791,21 @@ switch (command) {
    case 'ls':
      listContents();
      break;
+   case 'benchmark':
+   case 'bench': {
+     // Delegate to benchmark/auto-resolve/scripts/run-suite.sh with all remaining args.
+     const runSuite = path.join(__dirname, '..', 'benchmark', 'auto-resolve', 'scripts', 'run-suite.sh');
+     if (!fs.existsSync(runSuite)) {
+       log('❌ Benchmark suite runner missing — is this a clean devlyn-cli checkout?', 'yellow');
+       log(` Expected: ${runSuite}`, 'dim');
+       process.exit(1);
+     }
+     const { spawnSync } = require('child_process');
+     const forwardedArgs = args.slice(1);
+     const res = spawnSync('bash', [runSuite, ...forwardedArgs], { stdio: 'inherit' });
+     process.exit(res.status ?? 1);
+     break;
+   }
    case 'agents': {
      showLogo();
      log('─'.repeat(44), 'dim');
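Reviewer note: the new `benchmark` case forwards its arguments to a shell script and exits with `res.status ?? 1`. A minimal sketch of that exit-code mapping follows, assuming a hypothetical helper `exitCodeFor` and using the running Node binary (`process.execPath`) as a portable stand-in for `bash run-suite.sh`; `devlyn-demo-no-such-binary` is a made-up command name chosen so the spawn fails.

```javascript
const { spawnSync } = require('child_process');

// Hypothetical helper mirroring `process.exit(res.status ?? 1)` from the hunk:
// return the child's exit status, or 1 when spawnSync reports a null status
// (the child was terminated by a signal, or could not be spawned at all).
function exitCodeFor(command, args) {
  const res = spawnSync(command, args, { stdio: 'ignore' });
  return res.status ?? 1;
}

// The child's own exit code is passed through unchanged.
const passthrough = exitCodeFor(process.execPath, ['-e', 'process.exit(3)']);

// A command that cannot be spawned yields status null, which maps to 1.
const missing = exitCodeFor('devlyn-demo-no-such-binary', []);

console.log({ passthrough, missing });
```

Since `process.exit` does not return, the `break` that follows it in the hunk is never reached; it looks harmless and may simply be there to satisfy a no-fallthrough lint rule.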
package/config/skills/_shared/adapters/README.md ADDED
@@ -0,0 +1,64 @@
+ # Per-engine prompt adapters
+
+ This folder is the LLM-specific delta layer. The harness's canonical phase prompts (in each skill's `references/phases/<phase>.md`) stay model-neutral and outcome-first. Each adapter file in this folder is a **small delta header** that gets injected BEFORE the canonical body when the phase runs against that specific engine.
+
+ ## Why adapters exist
+
+ Anthropic and OpenAI publish official prompt-engineering guides for their flagship models. The two guides converge on outcome-first + decision rules + mechanical validation but **diverge on tactics** (XML structure vs stop-rules format, literal interpretation vs decision-rule phrasing, self-check pattern vs validation-tool primacy). A single canonical prompt can't hit both ceilings.
+
+ The split:
+ - **Canonical body** (in `<skill>/references/phases/`) = the contract: goal, output format, invariants, common-ground rules from both guides.
+ - **Adapter header** (here) = the per-engine elaboration: model-specific guidance from that engine's official guide.
+
+ This is also the load-bearing piece for **multi-LLM evolution**. When Qwen / Gemini / Gemma are added (Mission 2/3), each gets its own adapter file here. The canonical body never moves.
+
+ ## Format
+
+ Each adapter is a single markdown file named `<model-id>.md` (e.g. `opus-4-7.md`, `gpt-5-5.md`). Structure:
+
+ ```markdown
+ # <Model name> adapter
+
+ > Source: <official-prompt-engineering-guide URL>
+
+ ## Identity
+ 1-2 lines telling the model who it is + which guide governs.
+
+ ## Output discipline
+ Verbosity, formatting, length conventions specific to this model.
+
+ ## Tool-use posture
+ When to use tools, when to reason, parallel/sequential preferences.
+
+ ## Validation pattern
+ How this model verifies its work — mechanical-first vs self-check, etc.
+
+ ## Anti-patterns
+ Specific patterns the official guide warns about for this model.
+ ```
+
+ Keep each section to ≤ 8 lines. Adapters are deltas, not full prompts. If an adapter grows past ~80 lines, the content probably belongs in canonical body.
+
+ ## When to add a new adapter
+
+ A new adapter file ships when:
+ 1. A new LLM is integrated into the pipeline (the engine is now invocable).
+ 2. An official prompt-engineering guide for that LLM exists (or a vendor-recommended pattern set).
+ 3. An empirical A/B shows the adapter's specific guidance lifts that model's performance over the canonical body alone.
+
+ Not all models need adapters. If a model performs well on the canonical body without delta, ship without one.
+
+ ## What NOT to put here
+
+ - ❌ Universal rules (those go in canonical body or `_shared/runtime-principles.md`).
+ - ❌ Iter-history annotations (`*(iter-0020: F4 evidence...)*` style).
+ - ❌ Full phase prompts (defeats the decoupling).
+ - ❌ Per-task or per-spec content (adapters are model-scope, not task-scope).
+
+ ## Runtime injection
+
+ A skill's phase invocation prepends the resolved engine's adapter file to the canonical body before sending. Mechanism is left to each skill (a `_shared/adapter-inject.sh` helper may land in a later iter); for now, skills consume the adapter file by direct read at phase-spawn time.
+
+ ## Standing rule
+
+ Any iter that touches an adapter file MUST cite the corresponding official guide as part of acceptance: "guide section X.Y says Z, this change applies Z." Generic preferences ("feels cleaner") are rejected.
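Reviewer note: the "Runtime injection" section of the adapters README leaves the mechanism to each skill. One way the direct-read approach could look is sketched below. This is an assumption-laden illustration, not the package's implementation: `buildPhasePrompt`, the blank-line separator, and the temp-directory demo are all hypothetical, and the fallback to the bare canonical body reflects the README's statement that not all models need adapters.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper: prepend `<adaptersDir>/<modelId>.md` to the canonical
// phase body, or return the body unchanged when no adapter file exists.
function buildPhasePrompt(adaptersDir, modelId, canonicalBody) {
  const adapterPath = path.join(adaptersDir, `${modelId}.md`);
  if (!fs.existsSync(adapterPath)) return canonicalBody;
  const header = fs.readFileSync(adapterPath, 'utf8').trimEnd();
  return `${header}\n\n${canonicalBody}`;
}

function demoAdapterInjection() {
  // Throwaway adapters folder containing a single one-line adapter file.
  const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'adapters-demo-'));
  fs.writeFileSync(path.join(dir, 'opus-4-7.md'), '# Claude Opus 4.7 adapter\n');

  const body = '# PLAN phase\nGoal: produce a plan.';
  const withAdapter = buildPhasePrompt(dir, 'opus-4-7', body);
  const withoutAdapter = buildPhasePrompt(dir, 'model-without-adapter', body);

  fs.rmSync(dir, { recursive: true, force: true });
  return { withAdapter, withoutAdapter };
}

console.log(demoAdapterInjection());
```

Keeping the adapter lookup keyed on the file name matches the `<model-id>.md` naming rule from the Format section, so adding a new engine would only require dropping in a new file.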
package/config/skills/_shared/adapters/gpt-5-5.md ADDED
@@ -0,0 +1,29 @@
+ # OpenAI GPT-5.5 adapter
+
+ > Source: <https://developers.openai.com/api/docs/guides/prompt-guidance?model=gpt-5.5>
+
+ ## Identity
+
+ You are GPT-5.5 by OpenAI. OpenAI's prompt-guidance for this model governs your behavior on top of the canonical phase prompt below. When the canonical body and this header conflict on tactics, the canonical body wins on what to deliver; this header wins on how to deliver it.
+
+ ## Output discipline
+
+ Your default is efficient, direct, task-oriented. The canonical body specifies the outcome and constraints; you choose the efficient path. Do not over-specify process steps when an outcome is clearly stated. Use headers, bullets, and bold sparingly — favor short paragraphs and natural transitions unless the canonical body or user requests structure. When `text.verbosity` is `low`, prefer even shorter responses.
+
+ ## Tool-use posture
+
+ Resolve the request in the fewest useful tool loops without sacrificing correctness. For retrieval tasks: start with one broad search using short discriminative keywords; make another retrieval call only when the top results don't answer the core question or a required fact / parameter / source is missing. For tool-heavy tasks, start with a brief preamble: a one-line acknowledgment of the request and the first step you'll take.
+
+ ## Validation pattern
+
+ Validation is concrete commands and tools, not self-belief. When the canonical body lists verification commands, execute them and trust their output. Do not substitute your judgment for a deterministic check the harness has provided. When validation tools are available (test runners, lint, type-check, the harness's `spec-verify-check.py`), run them before declaring success. The minimum evidence sufficient to answer correctly, cited precisely — then stop.
+
+ ## Anti-patterns
+
+ The official guide warns explicitly about carrying over instructions from older prompt stacks — earlier models needed more help, and process-heavy directives now narrow GPT-5.5's search space.
+
+ 1. **Avoid absolute imperatives for judgment calls.** ALWAYS / NEVER / must / only are reserved for true safety invariants and required output fields. For judgment calls, use decision rules with conditions ("when X, do Y"). The canonical body uses this style; do not promote softer guidance to absolute rules.
+ 2. **Don't over-specify process when the destination is clear.** If the canonical body names the outcome, choose the path; do not narrate every step.
+ 3. **Stop rules are explicit.** When the canonical body or the harness asks you to stop / abstain / ask, follow the stop rule rather than retrying loops indefinitely. Loop-minimization does not outrank correctness or required citation.
+
+ Do not narrate internal deliberation. State results and decisions directly.
@@ -0,0 +1,29 @@
1
+ # Claude Opus 4.7 adapter
2
+
3
+ > Source: <https://platform.claude.com/docs/en/build-with-claude/prompt-engineering/claude-prompting-best-practices>
4
+
5
+ ## Identity
6
+
7
+ You are Claude Opus 4.7 by Anthropic. Anthropic's prompt-engineering guide for this model governs your behavior on top of the canonical phase prompt below. When the canonical body and this header conflict on tactics, the canonical body wins on what to deliver; this header wins on how to deliver it.
8
+
9
+ ## Output discipline
10
+
11
+ You calibrate response length to task complexity automatically — keep simple lookups short, scale up only when the task warrants it. Do NOT pad with context the user didn't ask for. When the canonical body sets a structural format (XML, JSON, sections), follow it literally; do not silently restructure.
12
+
13
+ ## Tool-use posture
14
+
15
+ You default to fewer tool calls than prior Claude generations. When the canonical body lists tools, use them when their result would change your answer. Make independent tool calls in parallel; chain only when one depends on another's output. Do not narrate "I'll now call X" preambles unless the canonical body requests progress updates.
16
+
17
+ ## Validation pattern
18
+
19
+ When the canonical body asks you to verify your output before declaring done ("self-check" instructions), execute that step literally — re-read the spec's acceptance criteria, run the listed verification commands if available, list any gap. This is not optional. Mechanical gates owned by the harness (spec-verify-check.py, build-gate.py) are the primary correctness guard; your self-check is the secondary layer that catches what regex cannot.
20
+
21
+ ## Anti-patterns
22
+
23
+ You interpret instructions more literally than prior Claude versions. The official guide is explicit about three failure modes:
24
+
25
+ 1. **Review-prompt self-filtering**: when the canonical body asks for findings, report every issue you find — including low-severity and low-confidence ones. Do NOT pre-filter for importance; the harness has a separate filter step.
26
+ 2. **Subagent over-spawning**: do NOT spawn a subagent for work you can complete in a single response. Spawn only when the canonical body explicitly requests it OR when fanning out across independent items.
27
+ 3. **Overengineering**: do NOT add files, abstractions, error handling, validation, or "future flexibility" beyond what the spec asks. A bug fix doesn't need surrounding cleanup. The right complexity is the minimum needed for the current task.
28
+
29
+ You do NOT need stronger imperatives ("CRITICAL!", "YOU MUST!") to follow rules. Normal phrasing is sufficient.
@@ -26,6 +26,32 @@ PER_RUN_PATTERNS = (
26
26
  "*.log.md",
27
27
  "fix-batch.round-*.json",
28
28
  "criteria.generated.md",
29
+ # iter-0019.8: spec-verify carrier artifacts get archived alongside
30
+ # other per-run state. Cleanup after a killed run is enforced separately
31
+ # by spec-verify-check.py main() — when source markdown has no json
32
+ # block AND BENCH_WORKDIR is unset (real-user mode), the script drops
33
+ # any pre-existing .devlyn/spec-verify.json so a stale orphan from a
34
+ # killed prior run cannot poison this run's gate.
35
+ "spec-verify.json",
36
+ "spec-verify.results.json",
37
+ "spec-verify-findings.jsonl",
38
+ # iter-0033a/2026-04-30 archive-fix iter: NEW /devlyn:resolve emits
39
+ # plan.md (PLAN output) + final-report.md (PHASE 6 render) +
40
+ # cumulative.patch (cumulative diff). Smoke 2's archive listing
41
+ # captured all three; archive_run.py was missing them because the
42
+ # patterns predated the new skill's artifact set. Added explicitly
43
+ # so the move is deterministic.
44
+ "plan.md",
45
+ "final-report.md",
46
+ "cumulative.patch",
47
+ # iter-0033c (Codex R-final-smoke Q2): pair-mode VERIFY emits per-judge
48
+ # deliberation transcripts (verify-judge-claude.md / verify-judge-codex.md
49
+ # — and any future-engine analogue via wildcard). Smoke 1a (F2 l2_forced)
50
+ # surfaced the gap: the orchestrator wrote them and listed them as
51
+ # artifacts, but archive_run.py left them in .devlyn/. Gate 8
52
+ # ("pair_judge findings archive distinguishable") would false-fail on
53
+ # every paired fixture without this glob.
54
+ "verify-judge-*.md",
29
55
  )
30
56
 
31
57
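The hunk above shows only the pattern tuple, not how `archive_run.py` applies it. As a minimal sketch, assuming shell-style glob matching through Python's `fnmatch` (an assumption; the real matching code is not part of this diff), the newly added entries would select artifacts as follows. `NEW_PATTERNS` and `matches_per_run` are hypothetical names used only for illustration.

```python
from fnmatch import fnmatch

# Patterns added in this hunk, copied from the diff above.
NEW_PATTERNS = (
    "spec-verify.json",
    "spec-verify.results.json",
    "spec-verify-findings.jsonl",
    "plan.md",
    "final-report.md",
    "cumulative.patch",
    "verify-judge-*.md",
)


def matches_per_run(filename: str, patterns=NEW_PATTERNS) -> bool:
    """Return True if filename matches any per-run glob (hypothetical helper)."""
    return any(fnmatch(filename, pattern) for pattern in patterns)


if __name__ == "__main__":
    # The wildcard entry covers both current judges and any future engine.
    for name in ("verify-judge-claude.md", "verify-judge-codex.md", "notes.md"):
        print(name, matches_per_run(name))
```

The wildcard is what lets a future judge transcript be archived without another edit to the tuple, which is the motivation the iter-0033c comment gives.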
 
@@ -0,0 +1,54 @@
1
+ # Shared — Codex Invocation
2
+
3
+ Single source of truth for how every skill calls Codex. **MCP is not used.** Skills shell out via the wrapper at `_shared/codex-monitored.sh`, which fronts the local Codex CLI (shipped by the `openai-codex` Claude Code plugin).
4
+
5
+ ## Canonical invocations
6
+
7
+ All long-running Codex calls go through `codex-monitored.sh`, a thin wrapper that closes stdin (codex 0.124.0 hangs when stdin is open and a prompt arg is also given), streams Codex stdout in full (no `tail -n` truncation), and prints a `[codex-monitored] heartbeat` line to stderr every 30s so the outer `claude -p` byte-watchdog stays fed during long reasoning gaps. The wrapper passes its arguments through verbatim to the underlying CLI, so the canonical flag set is unchanged from a raw call; only the launcher differs.
8
+
9
+ **Read-only critique / adversarial review / debate** (ideate CHALLENGE phase, `/devlyn:resolve` VERIFY pair-mode when triggered). Security review is delegated to the native `security-review` Claude Code skill, invoked from `/devlyn:resolve` BUILD_GATE rather than from Codex.
10
+
11
+ ```bash
12
+ bash .claude/skills/_shared/codex-monitored.sh \
13
+ -C <project-root> \
14
+ -s read-only \
15
+ -c model_reasoning_effort=xhigh \
16
+ "<inlined-prompt>"
17
+ ```
18
+
19
+ **Workspace-write implementation** (`/devlyn:resolve` IMPLEMENT phase when `--engine codex` or `--engine auto` routes to Codex, plus codex-routed `/devlyn:ideate` phases):
20
+
21
+ ```bash
22
+ bash .claude/skills/_shared/codex-monitored.sh \
23
+ -C <project-root> \
24
+ --full-auto \
25
+ -c model_reasoning_effort=xhigh \
26
+ "<inlined-prompt>"
27
+ ```
28
+
29
+ Notes:
30
+ - `-C` — project root so Codex's working directory matches.
31
+ - `-s read-only` / `--full-auto` — sandbox policy. `--full-auto` = `-s workspace-write` with auto-approval of sandboxed commands.
32
+ - `-c model_reasoning_effort=xhigh` — config override for reasoning depth. Required for deep critique; skills may choose `high` or `medium` when thoroughness doesn't warrant xhigh.
33
+ - **Omit `-m <model>`** — Codex CLI uses its configured flagship (currently `gpt-5.5`, automatically whatever ships next). This is the zero-touch mechanism. Only name `-m` when a role explicitly needs a different model (e.g., `gpt-5.3-codex` for SWE-bench-heavy coding tasks, `gpt-5.3-codex-spark` for speed).
34
+ - Raw `codex exec ...` invocations are **forbidden** in skill prompts. The benchmark variant arm runs a PATH shim (`scripts/codex-shim/codex`) that transparently re-routes any raw `codex exec` to the wrapper as a safety net, but skills should always emit the wrapper form directly so the orchestrator's first attempt has the right shape. Two prior iterations (iter-0006 universal foreground ban, iter-0008 prompt-level kill-shape contract) failed because the orchestrator picked starvation-prone shapes (`codex exec ... 2>&1 | tail -200`) from its own pattern prior. The wrapper plus the shim is the runtime binding layer those iterations lacked. See `autoresearch/iterations/0009-wrapper-and-hook.md`.
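The two canonical invocations differ only in the sandbox flag. A minimal sketch of assembling the argument list in Python, assuming a caller that launches the wrapper through a subprocess-style argv; `build_codex_argv` and its parameters are hypothetical and not part of devlyn-cli. Only flags named in the notes above are used.

```python
from typing import List

# Wrapper path as written in the canonical invocations above.
WRAPPER = ".claude/skills/_shared/codex-monitored.sh"


def build_codex_argv(project_root: str, prompt: str, *,
                     write: bool = False, effort: str = "xhigh") -> List[str]:
    """Build the wrapper argv (hypothetical helper, illustration only).

    write=False -> read-only critique shape (-s read-only)
    write=True  -> workspace-write implementation shape (--full-auto)
    """
    sandbox = ["--full-auto"] if write else ["-s", "read-only"]
    # No -m flag: the notes say to omit it so the CLI picks its flagship.
    return [
        "bash", WRAPPER,
        "-C", project_root,
        *sandbox,
        "-c", f"model_reasoning_effort={effort}",
        prompt,
    ]


if __name__ == "__main__":
    print(build_codex_argv("/path/to/repo", "<inlined-prompt>"))
    print(build_codex_argv("/path/to/repo", "<inlined-prompt>", write=True))
```

Passing the prompt as a single list element avoids shell quoting issues, which matters because the prompt is inlined rather than read from stdin.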
35
+
36
+ ## Availability check
37
+
38
+ Before the first Codex call in a run, verify the CLI is on PATH:
39
+
40
+ ```bash
41
+ command -v codex >/dev/null 2>&1
42
+ ```
43
+
44
+ If the check fails, the skill follows the `_shared/engine-preflight.md` downgrade rule — silently switch to Claude for this run and log `engine downgraded: codex-unavailable` in the final report. Never prompt, never abort.
45
+
46
+ ## Why CLI over other paths
47
+
48
+ The local Codex CLI (fronted by `codex-monitored.sh`) is the only integration. It beats the alternatives on three dimensions: the model is inherited from the CLI's own default, so no skill edits are needed when OpenAI ships a new flagship; flags compose on the command line and the skill docs stay grep-friendly; and the invocation has one failure mode (the binary is either on PATH or it isn't), which the shared availability check covers cleanly.
49
+
50
+ ## Invocation from inside a skill prompt
51
+
52
+ Skills write the invocation as a Bash command the runtime executes. Example shape from `/devlyn:resolve` PHASE 2 IMPLEMENT when routed to Codex:
53
+
54
+ > Run `bash .claude/skills/_shared/codex-monitored.sh -C <state.base_ref.repo_root> --full-auto -c model_reasoning_effort=xhigh "<IMPLEMENT prompt>"`. Omit `-m` so the CLI flagship is auto-selected. Capture stdout as the IMPLEMENT reply; non-zero exit → treat as subagent failure. The wrapper emits `[codex-monitored]` heartbeat and lifecycle lines on **stderr** — stdout stays clean for Codex output, so the orchestrator can parse the reply without filtering. Heartbeat-on-stderr keeps the orchestrator's combined-output stream non-silent (defeats the iter-0008 byte-watchdog kill) without polluting the codex-reply view of stdout.
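Because the wrapper's own lines go to stderr with a fixed prefix, an orchestrator that captures stderr separately can tell them apart from Codex's diagnostics. A minimal sketch, assuming the `[codex-monitored]` prefix is the only marker needed; `split_wrapper_stderr` is a hypothetical helper and the sample text mirrors the message shapes the wrapper prints.

```python
from typing import List, Tuple

PREFIX = "[codex-monitored]"


def split_wrapper_stderr(stderr_text: str) -> Tuple[List[str], List[str]]:
    """Separate wrapper lifecycle/heartbeat lines from other stderr lines.

    Hypothetical helper: returns (wrapper_lines, codex_lines).
    """
    wrapper_lines: List[str] = []
    codex_lines: List[str] = []
    for line in stderr_text.splitlines():
        if line.startswith(PREFIX):
            wrapper_lines.append(line)
        else:
            codex_lines.append(line)
    return wrapper_lines, codex_lines


if __name__ == "__main__":
    sample = (
        "[codex-monitored] codex pid=4242\n"
        "some codex diagnostic\n"
        "[codex-monitored] heartbeat: elapsed=30s\n"
    )
    wrapper, codex = split_wrapper_stderr(sample)
    print(len(wrapper), "wrapper lines;", len(codex), "codex lines")
```

Stdout needs no such filtering, since the wrapper writes nothing of its own there.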
@@ -0,0 +1,141 @@
1
+ #!/usr/bin/env bash
2
+ # codex-monitored.sh — run `codex exec` in a monitored shape that keeps the
3
+ # outer claude -p API stream from going silent during long Codex calls.
4
+ #
5
+ # WHY (iter-0009, post iter-0006/0007/0008):
6
+ # • iter-0007 isolation proved a single foreground `codex exec` Bash dispatch
7
+ # can starve the outer API stream of bytes during a 10+ min run; Anthropic's
8
+ # byte-level idle watchdog fires (~300s) and kills the orchestrator.
9
+ # • iter-0008 saw the orchestrator pick `codex exec ... 2>&1 | tail -200` from
10
+ # its own pattern prior — `tail` on a pipe buffers until EOF, suppressing
11
+ # ALL bytes. Same starvation, amplified.
12
+ # • iter-0008 also documented codex 0.124.0 reads stdin as a `<stdin>` block
13
+ # when the prompt is passed as an arg AND stdin is open; without
14
+ # `< /dev/null` the call hangs indefinitely.
15
+ #
16
+ # WHAT THIS WRAPPER DOES:
17
+ # 1. Refuses to run if stdout is a pipe. Piping wrapper output to text tools
18
+ # (tail/head/awk/sed/grep without --line-buffered) re-introduces the
19
+ # iter-0008 starvation mechanism — the downstream tool buffers until EOF
20
+ # and the outer claude -p byte-watchdog never sees bytes. Exits 64 with a
21
+ # clear message so the orchestrator can self-correct on retry.
22
+ # (Round 2 finding #1 fix: shim alone does not defeat `| tail`; the
23
+ # wrapper must reject the pipe shape directly.)
24
+ # 2. Closes stdin (`< /dev/null`) — kills the codex 0.124.0 stdin hang.
25
+ # 3. Streams codex stdout to OUR stdout line-by-line — the orchestrator reads
26
+ # stdout as the subagent reply (per `_shared/codex-config.md`) so we MUST
27
+ # NOT swallow it (e.g. `tail -n 200`). codex stderr forwards to OUR stderr.
28
+ # 4. Emits a `[codex-monitored] heartbeat` line every CODEX_MONITORED_HEARTBEAT
29
+ # seconds (default 30s) on STDERR while codex is alive. Heartbeat-on-stderr
30
+ # keeps the orchestrator's combined-output stream non-silent without
31
+ # polluting the codex-reply view of stdout.
32
+ # 5. Forwards SIGTERM/SIGINT from the outer watchdog to the codex child so a
33
+ # timeout actually reaps codex (otherwise process group kill races with
34
+ # backgrounded codex).
35
+ # 6. Preserves codex's exact exit code.
36
+ #
37
+ # USAGE:
38
+ # bash codex-monitored.sh -C <repo> -s read-only -c model_reasoning_effort=xhigh "<prompt>"
39
+ # bash codex-monitored.sh resume --last
40
+ # (Args after the script name are passed verbatim to `codex exec`.)
41
+ #
42
+ # ENV OVERRIDES:
43
+ # CODEX_MONITORED_HEARTBEAT — heartbeat interval seconds (default 30).
44
+ # CODEX_BIN — real codex binary path. Default: `codex`.
45
+ # Set this when the shim has put us first
46
+ # on PATH.
47
+ # CODEX_MONITORED_ALLOW_PIPED — set non-empty to skip the pipe-stdout
48
+ # refusal. Reserved for tests; don't use
49
+ # in skill prompts.
50
+
51
+ set -uo pipefail
52
+
53
+ # iter-0019 — solo_claude (L1) arm enforcement (defense in depth alongside
54
+ # scripts/codex-shim/codex). If this env is set, the wrapper refuses to invoke
55
+ # codex at all, regardless of how it was reached. Two enforcement points
56
+ # protect against the case where one is bypassed: the shim catches PATH-based
57
+ # resolution, and this wrapper catches direct-path invocations of
58
+ # codex-monitored.sh that don't go through the shim.
59
+ if [ -n "${CODEX_BLOCKED:-}" ]; then
60
+ printf '[codex-monitored] CODEX_BLOCKED=%s — refusing codex invocation (solo_claude / L1 arm enforcement). args: %s\n' \
61
+ "${CODEX_BLOCKED}" "$*" >&2
62
+ exit 126
63
+ fi
64
+
65
+ HEARTBEAT_SEC="${CODEX_MONITORED_HEARTBEAT:-30}"
66
+ CODEX_BIN="${CODEX_BIN:-codex}"
67
+ START=$(date +%s)
68
+
69
+ # --- Pipe-stdout refusal (iter-0009 R2 finding #1) -------------------------
70
+ # `[ -p /dev/stdout ]` is the POSIX test for "is fd 1 a FIFO/pipe". Verified
71
+ # correct on macOS via lsof: distinguishes piped (`| cat`) from redirected
72
+ # (`> file`) and from claude-bash-tool capture (regular file). Without this
73
+ # refusal, `bash WRAPPER ... 2>&1 | tail -200` would buffer wrapper output —
74
+ # including the heartbeat on stderr after `2>&1` — until EOF, reproducing
75
+ # the iter-0008 byte-watchdog kill.
76
+ if [ -z "${CODEX_MONITORED_ALLOW_PIPED:-}" ] && [ -p /dev/stdout ]; then
77
+ cat >&2 <<'EOF'
78
+ [codex-monitored] error: stdout is a pipe.
79
+
80
+ Piping the wrapper to tail/head/awk/sed/grep buffers wrapper output until EOF,
81
+ which starves the outer claude -p byte-watchdog (iter-0008 starvation mechanism)
82
+ and kills the run after ~300s with empty transcript.
83
+
84
+ Fix: invoke the wrapper directly so the bash tool captures its stdout. The
85
+ wrapper streams full Codex output and emits a heartbeat on stderr; you do NOT
86
+ need to truncate.
87
+
88
+ WRONG: bash codex-monitored.sh ... 2>&1 | tail -200
89
+ RIGHT: bash codex-monitored.sh ...
90
+
91
+ If you absolutely must filter, use a line-buffered tool (e.g. `grep --line-buffered`)
92
+ and set CODEX_MONITORED_ALLOW_PIPED=1 in the wrapper's environment.
93
+ EOF
94
+ exit 64
95
+ fi
96
+
97
+ # --- Heartbeat + signal forwarding ----------------------------------------
98
+ heartbeat_loop() {
99
+ local pid="$1"
100
+ while kill -0 "$pid" 2>/dev/null; do
101
+ sleep "$HEARTBEAT_SEC"
102
+ if kill -0 "$pid" 2>/dev/null; then
103
+ local elapsed=$(( $(date +%s) - START ))
104
+ printf '[codex-monitored] heartbeat: elapsed=%ds\n' "$elapsed" >&2
105
+ fi
106
+ done
107
+ }
108
+
109
+ forward_signal() {
110
+ local sig="$1"
111
+ if [ -n "${CODEX_PID:-}" ] && kill -0 "$CODEX_PID" 2>/dev/null; then
112
+ kill -"$sig" "$CODEX_PID" 2>/dev/null || true
113
+ fi
114
+ if [ -n "${HB_PID:-}" ] && kill -0 "$HB_PID" 2>/dev/null; then
115
+ kill -TERM "$HB_PID" 2>/dev/null || true
116
+ fi
117
+ }
118
+
119
+ trap 'forward_signal TERM' TERM
120
+ trap 'forward_signal INT' INT
121
+
122
+ printf '[codex-monitored] start: ts=%s heartbeat=%ds bin=%s\n' \
123
+ "$(date -u +%FT%TZ)" "$HEARTBEAT_SEC" "$CODEX_BIN" >&2
124
+
125
+ # Launch codex with stdin closed; output streams directly to OUR stdout/stderr.
126
+ "$CODEX_BIN" exec "$@" < /dev/null &
127
+ CODEX_PID=$!
128
+ printf '[codex-monitored] codex pid=%d\n' "$CODEX_PID" >&2
129
+
130
+ heartbeat_loop "$CODEX_PID" &
131
+ HB_PID=$!
132
+
133
+ wait "$CODEX_PID"
134
+ EXIT=$?
135
+
136
+ kill -TERM "$HB_PID" 2>/dev/null || true
137
+ wait "$HB_PID" 2>/dev/null || true
138
+
139
+ printf '[codex-monitored] codex exited: code=%d elapsed=%ds\n' \
140
+ "$EXIT" $(( $(date +%s) - START )) >&2
141
+ exit "$EXIT"
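The pipe-stdout refusal in the script rests on one question: is file descriptor 1 a FIFO? The wrapper answers it with `[ -p /dev/stdout ]`. As a rough Python analogue of the same check (a sketch for illustration, not the wrapper's code; `fd_is_pipe` is a hypothetical name), the file mode reported by `os.fstat` can be tested with `stat.S_ISFIFO`:

```python
import os
import stat
import tempfile


def fd_is_pipe(fd: int) -> bool:
    """Return True if the descriptor refers to a FIFO/pipe (hypothetical helper)."""
    return stat.S_ISFIFO(os.fstat(fd).st_mode)


if __name__ == "__main__":
    # A freshly created pipe is a FIFO; a regular temp file is not.
    read_end, write_end = os.pipe()
    try:
        print("pipe write end:", fd_is_pipe(write_end))
    finally:
        os.close(read_end)
        os.close(write_end)

    with tempfile.TemporaryFile() as handle:
        print("temp file:", fd_is_pipe(handle.fileno()))

    # fd 1 is the case the wrapper cares about.
    print("stdout:", fd_is_pipe(1))
```

Running the script directly and then again with `| cat` appended shows the last line flip, which is the distinction the wrapper uses to reject `| tail -200` shapes while still allowing redirection to a file.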
@@ -0,0 +1,35 @@
1
+ # Shared — `--engine` Pre-flight
2
+
3
+ Used by `/devlyn:resolve` and `/devlyn:ideate`. One shared availability rule so every skill routes identically.
4
+
5
+ ## Rule
6
+
7
+ Each skill resolves the effective engine from its own SKILL.md default plus any explicit `--engine` flag passed by the user. This pre-flight runs **only when the resolved engine is `auto` or `codex`** — when the resolved engine is `claude` (whether by skill default or explicit flag), the Codex check is skipped entirely.
8
+
9
+ When the resolved engine is `auto` or `codex`, on entry (before spawning any phase that could route to Codex):
10
+
11
+ 1. Check if the Codex CLI is installed: `command -v codex >/dev/null 2>&1` (or equivalent bash test).
12
+ 2. On failure → silently set `engine = "claude"` for the remainder of this run AND log `engine downgraded: codex-unavailable` into the skill's final summary/report header.
13
+ 3. On success → proceed with the original engine value.
14
+
15
+ Never prompt the user. Never abort the run on missing CLI.
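The rule above can be sketched in a few lines. This is an illustration under stated assumptions, not the skills' actual code: `preflight_engine` is a hypothetical name, `shutil.which` stands in for `command -v codex`, and raising on an unknown engine value is a choice made for the sketch only.

```python
import shutil
from typing import Callable, Optional, Tuple


def preflight_engine(
    requested: str,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Tuple[str, bool]:
    """Return (effective_engine, downgraded) for a resolved engine value.

    Hypothetical helper. `which` is injectable so the check can be tested
    without depending on what is installed on the machine.
    """
    if requested == "claude":
        # The Codex check is skipped entirely for claude.
        return "claude", False
    if requested in ("auto", "codex"):
        if which("codex") is None:
            # Silent downgrade; the caller must still log it in the report.
            return "claude", True
        return requested, False
    # Assumption of this sketch: anything else is a caller error.
    raise ValueError(f"unknown engine: {requested!r}")


if __name__ == "__main__":
    print(preflight_engine("auto"))
    print(preflight_engine("claude"))
```

Returning the downgrade flag alongside the engine keeps the "never prompt, never abort" behaviour in one place while leaving the reporting to the caller.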
16
+
17
+ Per-skill defaults: `/devlyn:resolve` defaults to `claude` (post iter-0020 close-out — Codex BUILD/IMPLEMENT below quality floor; iter-0033g + iter-0034 close-out — PLAN-pair research-only until container/sandbox infra justifies a measurement); `/devlyn:ideate` defaults to `auto` for the CHALLENGE phase's cross-model GAN-critic dynamic. Each skill's SKILL.md flag block is the source of truth for that skill's default.
18
+
19
+ ## Why this is the one permitted silent fallback
20
+
21
+ `CLAUDE.md` sets the no-silent-fallback rule for this repo. This downgrade is documented there as the single explicit exception because the hands-free contract — skills the user walks away from — would otherwise fail every run whenever the Codex CLI is absent. The user-visible behavior is identical to an explicit `--engine claude` invocation, and the banner in the final report removes the silence. Any other silent fallback in skills code is a bug.
22
+
23
+ ## What a skill must log after downgrade
24
+
25
+ When the resolved engine was `auto` / `codex` and the Codex CLI was absent, the final user-facing report/summary shows both the requested and effective mode:
26
+
27
+ ```
28
+ Engine: claude (downgraded from auto — codex-unavailable)
29
+ ```
30
+
31
+ If no downgrade happened (either Codex was available, or the resolved engine was already `claude`), omit the parenthetical. That single line is the contract — the user can always see why Codex did or did not participate.
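A minimal sketch of rendering that report line, assuming the caller already knows the requested and effective engines; `engine_report_line` is a hypothetical helper, and the output string is copied from the example in this section.

```python
def engine_report_line(requested: str, effective: str) -> str:
    """Render the engine line for the final report (hypothetical helper).

    The parenthetical appears only when a downgrade actually happened.
    """
    if requested != effective:
        # Format copied from the documented example line.
        return f"Engine: {effective} (downgraded from {requested} — codex-unavailable)"
    return f"Engine: {effective}"


if __name__ == "__main__":
    print(engine_report_line("auto", "claude"))
    print(engine_report_line("claude", "claude"))
```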
32
+
33
+ ## Canonical Codex invocation
34
+
35
+ See `config/skills/_shared/codex-config.md` for the canonical wrapper invocation and flag set skills should use after the availability check passes.