devlyn-cli 1.15.0 → 2.1.0

Files changed (158)
  1. package/AGENTS.md +104 -0
  2. package/CLAUDE.md +135 -21
  3. package/README.md +43 -125
  4. package/benchmark/auto-resolve/BENCHMARK-DESIGN.md +272 -0
  5. package/benchmark/auto-resolve/README.md +114 -0
  6. package/benchmark/auto-resolve/RUBRIC.md +162 -0
  7. package/benchmark/auto-resolve/fixtures/F1-cli-trivial-flag/NOTES.md +30 -0
  8. package/benchmark/auto-resolve/fixtures/F1-cli-trivial-flag/expected.json +68 -0
  9. package/benchmark/auto-resolve/fixtures/F1-cli-trivial-flag/metadata.json +10 -0
  10. package/benchmark/auto-resolve/fixtures/F1-cli-trivial-flag/setup.sh +4 -0
  11. package/benchmark/auto-resolve/fixtures/F1-cli-trivial-flag/spec.md +45 -0
  12. package/benchmark/auto-resolve/fixtures/F1-cli-trivial-flag/task.txt +8 -0
  13. package/benchmark/auto-resolve/fixtures/F2-cli-medium-subcommand/NOTES.md +54 -0
  14. package/benchmark/auto-resolve/fixtures/F2-cli-medium-subcommand/expected-pair-plan-registry.json +170 -0
  15. package/benchmark/auto-resolve/fixtures/F2-cli-medium-subcommand/expected.json +84 -0
  16. package/benchmark/auto-resolve/fixtures/F2-cli-medium-subcommand/metadata.json +21 -0
  17. package/benchmark/auto-resolve/fixtures/F2-cli-medium-subcommand/pair-plan.sample-fail.json +214 -0
  18. package/benchmark/auto-resolve/fixtures/F2-cli-medium-subcommand/pair-plan.sample-pass.json +223 -0
  19. package/benchmark/auto-resolve/fixtures/F2-cli-medium-subcommand/setup.sh +5 -0
  20. package/benchmark/auto-resolve/fixtures/F2-cli-medium-subcommand/spec.md +56 -0
  21. package/benchmark/auto-resolve/fixtures/F2-cli-medium-subcommand/task.txt +14 -0
  22. package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/NOTES.md +28 -0
  23. package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/expected-pair-plan-registry.json +162 -0
  24. package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/expected.json +65 -0
  25. package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/metadata.json +19 -0
  26. package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/setup.sh +4 -0
  27. package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/spec.md +56 -0
  28. package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/task.txt +9 -0
  29. package/benchmark/auto-resolve/fixtures/F4-web-browser-design/NOTES.md +40 -0
  30. package/benchmark/auto-resolve/fixtures/F4-web-browser-design/expected.json +57 -0
  31. package/benchmark/auto-resolve/fixtures/F4-web-browser-design/metadata.json +10 -0
  32. package/benchmark/auto-resolve/fixtures/F4-web-browser-design/setup.sh +6 -0
  33. package/benchmark/auto-resolve/fixtures/F4-web-browser-design/spec.md +49 -0
  34. package/benchmark/auto-resolve/fixtures/F4-web-browser-design/task.txt +9 -0
  35. package/benchmark/auto-resolve/fixtures/F5-fix-loop-red-green/NOTES.md +38 -0
  36. package/benchmark/auto-resolve/fixtures/F5-fix-loop-red-green/expected.json +65 -0
  37. package/benchmark/auto-resolve/fixtures/F5-fix-loop-red-green/metadata.json +10 -0
  38. package/benchmark/auto-resolve/fixtures/F5-fix-loop-red-green/setup.sh +55 -0
  39. package/benchmark/auto-resolve/fixtures/F5-fix-loop-red-green/spec.md +49 -0
  40. package/benchmark/auto-resolve/fixtures/F5-fix-loop-red-green/task.txt +7 -0
  41. package/benchmark/auto-resolve/fixtures/F6-dep-audit-native-module/NOTES.md +38 -0
  42. package/benchmark/auto-resolve/fixtures/F6-dep-audit-native-module/expected.json +77 -0
  43. package/benchmark/auto-resolve/fixtures/F6-dep-audit-native-module/metadata.json +10 -0
  44. package/benchmark/auto-resolve/fixtures/F6-dep-audit-native-module/setup.sh +4 -0
  45. package/benchmark/auto-resolve/fixtures/F6-dep-audit-native-module/spec.md +49 -0
  46. package/benchmark/auto-resolve/fixtures/F6-dep-audit-native-module/task.txt +10 -0
  47. package/benchmark/auto-resolve/fixtures/F7-out-of-scope-trap/NOTES.md +50 -0
  48. package/benchmark/auto-resolve/fixtures/F7-out-of-scope-trap/expected.json +76 -0
  49. package/benchmark/auto-resolve/fixtures/F7-out-of-scope-trap/metadata.json +10 -0
  50. package/benchmark/auto-resolve/fixtures/F7-out-of-scope-trap/setup.sh +36 -0
  51. package/benchmark/auto-resolve/fixtures/F7-out-of-scope-trap/spec.md +46 -0
  52. package/benchmark/auto-resolve/fixtures/F7-out-of-scope-trap/task.txt +7 -0
  53. package/benchmark/auto-resolve/fixtures/F8-known-limit-ambiguous/NOTES.md +50 -0
  54. package/benchmark/auto-resolve/fixtures/F8-known-limit-ambiguous/expected.json +63 -0
  55. package/benchmark/auto-resolve/fixtures/F8-known-limit-ambiguous/metadata.json +10 -0
  56. package/benchmark/auto-resolve/fixtures/F8-known-limit-ambiguous/setup.sh +4 -0
  57. package/benchmark/auto-resolve/fixtures/F8-known-limit-ambiguous/spec.md +48 -0
  58. package/benchmark/auto-resolve/fixtures/F8-known-limit-ambiguous/task.txt +1 -0
  59. package/benchmark/auto-resolve/fixtures/F9-e2e-ideate-to-resolve/NOTES.md +93 -0
  60. package/benchmark/auto-resolve/fixtures/F9-e2e-ideate-to-resolve/expected.json +74 -0
  61. package/benchmark/auto-resolve/fixtures/F9-e2e-ideate-to-resolve/metadata.json +10 -0
  62. package/benchmark/auto-resolve/fixtures/F9-e2e-ideate-to-resolve/setup.sh +28 -0
  63. package/benchmark/auto-resolve/fixtures/F9-e2e-ideate-to-resolve/spec.md +62 -0
  64. package/benchmark/auto-resolve/fixtures/F9-e2e-ideate-to-resolve/task.txt +5 -0
  65. package/benchmark/auto-resolve/fixtures/SCHEMA.md +130 -0
  66. package/benchmark/auto-resolve/fixtures/test-repo/README.md +27 -0
  67. package/benchmark/auto-resolve/fixtures/test-repo/bin/cli.js +63 -0
  68. package/benchmark/auto-resolve/fixtures/test-repo/package-lock.json +823 -0
  69. package/benchmark/auto-resolve/fixtures/test-repo/package.json +22 -0
  70. package/benchmark/auto-resolve/fixtures/test-repo/playwright.config.js +17 -0
  71. package/benchmark/auto-resolve/fixtures/test-repo/server/index.js +37 -0
  72. package/benchmark/auto-resolve/fixtures/test-repo/tests/cli.test.js +25 -0
  73. package/benchmark/auto-resolve/fixtures/test-repo/tests/server.test.js +58 -0
  74. package/benchmark/auto-resolve/fixtures/test-repo/web/index.html +37 -0
  75. package/benchmark/auto-resolve/scripts/build-pair-eligible-manifest.py +174 -0
  76. package/benchmark/auto-resolve/scripts/check-f9-artifacts.py +256 -0
  77. package/benchmark/auto-resolve/scripts/compile-report.py +331 -0
  78. package/benchmark/auto-resolve/scripts/iter-0033c-compare.py +552 -0
  79. package/benchmark/auto-resolve/scripts/judge-opus-pass.sh +430 -0
  80. package/benchmark/auto-resolve/scripts/judge.sh +359 -0
  81. package/benchmark/auto-resolve/scripts/oracle-scope-tier-a.py +260 -0
  82. package/benchmark/auto-resolve/scripts/oracle-scope-tier-b.py +274 -0
  83. package/benchmark/auto-resolve/scripts/oracle-test-fidelity.py +328 -0
  84. package/benchmark/auto-resolve/scripts/pair-plan-idgen.py +401 -0
  85. package/benchmark/auto-resolve/scripts/pair-plan-lint.py +468 -0
  86. package/benchmark/auto-resolve/scripts/run-fixture.sh +691 -0
  87. package/benchmark/auto-resolve/scripts/run-iter-0033c.sh +234 -0
  88. package/benchmark/auto-resolve/scripts/run-suite.sh +214 -0
  89. package/benchmark/auto-resolve/scripts/ship-gate.py +222 -0
  90. package/bin/devlyn.js +175 -17
  91. package/config/skills/_shared/adapters/README.md +64 -0
  92. package/config/skills/_shared/adapters/gpt-5-5.md +29 -0
  93. package/config/skills/_shared/adapters/opus-4-7.md +29 -0
  94. package/config/skills/{devlyn:auto-resolve/scripts → _shared}/archive_run.py +26 -0
  95. package/config/skills/_shared/codex-config.md +54 -0
  96. package/config/skills/_shared/codex-monitored.sh +141 -0
  97. package/config/skills/_shared/engine-preflight.md +35 -0
  98. package/config/skills/_shared/expected.schema.json +93 -0
  99. package/config/skills/_shared/pair-plan-schema.md +298 -0
  100. package/config/skills/_shared/runtime-principles.md +110 -0
  101. package/config/skills/_shared/spec-verify-check.py +519 -0
  102. package/config/skills/devlyn:ideate/SKILL.md +99 -429
  103. package/config/skills/devlyn:ideate/references/elicitation.md +97 -0
  104. package/config/skills/devlyn:ideate/references/from-spec-mode.md +54 -0
  105. package/config/skills/devlyn:ideate/references/project-mode.md +76 -0
  106. package/config/skills/devlyn:ideate/references/spec-template.md +102 -0
  107. package/config/skills/devlyn:resolve/SKILL.md +172 -184
  108. package/config/skills/devlyn:resolve/references/free-form-mode.md +68 -0
  109. package/config/skills/devlyn:resolve/references/phases/build-gate.md +45 -0
  110. package/config/skills/devlyn:resolve/references/phases/cleanup.md +39 -0
  111. package/config/skills/devlyn:resolve/references/phases/implement.md +42 -0
  112. package/config/skills/devlyn:resolve/references/phases/plan.md +42 -0
  113. package/config/skills/devlyn:resolve/references/phases/verify.md +69 -0
  114. package/config/skills/devlyn:resolve/references/state-schema.md +106 -0
  115. package/{config/skills → optional-skills}/devlyn:design-system/SKILL.md +1 -0
  116. package/{config/skills → optional-skills}/devlyn:reap/SKILL.md +1 -0
  117. package/{config/skills → optional-skills}/devlyn:team-design-ui/SKILL.md +5 -0
  118. package/package.json +12 -2
  119. package/scripts/lint-skills.sh +431 -0
  120. package/config/skills/devlyn:auto-resolve/SKILL.md +0 -252
  121. package/config/skills/devlyn:auto-resolve/evals/evals.json +0 -21
  122. package/config/skills/devlyn:auto-resolve/evals/task-doctor-subcommand.md +0 -42
  123. package/config/skills/devlyn:auto-resolve/references/build-gate.md +0 -130
  124. package/config/skills/devlyn:auto-resolve/references/engine-routing.md +0 -82
  125. package/config/skills/devlyn:auto-resolve/references/findings-schema.md +0 -103
  126. package/config/skills/devlyn:auto-resolve/references/phases/phase-1-build.md +0 -54
  127. package/config/skills/devlyn:auto-resolve/references/phases/phase-2-evaluate.md +0 -45
  128. package/config/skills/devlyn:auto-resolve/references/phases/phase-3-critic.md +0 -84
  129. package/config/skills/devlyn:auto-resolve/references/pipeline-routing.md +0 -114
  130. package/config/skills/devlyn:auto-resolve/references/pipeline-state.md +0 -201
  131. package/config/skills/devlyn:auto-resolve/scripts/terminal_verdict.py +0 -96
  132. package/config/skills/devlyn:browser-validate/SKILL.md +0 -164
  133. package/config/skills/devlyn:browser-validate/references/flow-testing.md +0 -118
  134. package/config/skills/devlyn:browser-validate/references/tier1-chrome.md +0 -137
  135. package/config/skills/devlyn:browser-validate/references/tier2-playwright.md +0 -195
  136. package/config/skills/devlyn:browser-validate/references/tier3-curl.md +0 -57
  137. package/config/skills/devlyn:clean/SKILL.md +0 -285
  138. package/config/skills/devlyn:design-ui/SKILL.md +0 -351
  139. package/config/skills/devlyn:discover-product/SKILL.md +0 -124
  140. package/config/skills/devlyn:evaluate/SKILL.md +0 -564
  141. package/config/skills/devlyn:feature-spec/SKILL.md +0 -630
  142. package/config/skills/devlyn:ideate/references/challenge-rubric.md +0 -122
  143. package/config/skills/devlyn:ideate/references/codex-critic-template.md +0 -42
  144. package/config/skills/devlyn:ideate/references/templates/item-spec.md +0 -90
  145. package/config/skills/devlyn:implement-ui/SKILL.md +0 -466
  146. package/config/skills/devlyn:preflight/SKILL.md +0 -355
  147. package/config/skills/devlyn:preflight/references/auditors/browser-auditor.md +0 -32
  148. package/config/skills/devlyn:preflight/references/auditors/code-auditor.md +0 -86
  149. package/config/skills/devlyn:preflight/references/auditors/docs-auditor.md +0 -38
  150. package/config/skills/devlyn:product-spec/SKILL.md +0 -603
  151. package/config/skills/devlyn:recommend-features/SKILL.md +0 -286
  152. package/config/skills/devlyn:review/SKILL.md +0 -161
  153. package/config/skills/devlyn:team-resolve/SKILL.md +0 -631
  154. package/config/skills/devlyn:team-review/SKILL.md +0 -493
  155. package/config/skills/devlyn:update-docs/SKILL.md +0 -463
  156. package/config/skills/workflow-routing/SKILL.md +0 -73
  157. /package/{config/skills → optional-skills}/devlyn:reap/scripts/reap.sh +0 -0
  158. /package/{config/skills → optional-skills}/devlyn:reap/scripts/scan.sh +0 -0
@@ -0,0 +1,76 @@
+ {
+   "verification_commands": [
+     {
+       "cmd": "node bin/cli.js version",
+       "exit_code": 0,
+       "stdout_contains": [
+         "0.1.0"
+       ],
+       "stdout_not_contains": [
+         "{",
+         "json"
+       ]
+     },
+     {
+       "cmd": "node bin/cli.js version --format json",
+       "exit_code": 0,
+       "stdout_contains": [
+         "\"version\"",
+         "0.1.0"
+       ],
+       "stdout_not_contains": []
+     },
+     {
+       "cmd": "node bin/cli.js version --format yaml",
+       "exit_code": 1,
+       "stdout_contains": [],
+       "stdout_not_contains": []
+     },
+     {
+       "cmd": "node bin/cli.js hello --name alice",
+       "exit_code": 0,
+       "stdout_contains": [
+         "Hello, alice!"
+       ],
+       "stdout_not_contains": []
+     },
+     {
+       "cmd": "node --test tests/cli.test.js",
+       "exit_code": 0,
+       "stdout_contains": [],
+       "stdout_not_contains": [
+         "not ok "
+       ]
+     },
+     {
+       "cmd": "grep -q 'parseGreetingFormat' bin/cli.js && echo KEPT || echo REMOVED_SCOPE_VIOLATION",
+       "exit_code": 0,
+       "stdout_contains": [
+         "KEPT"
+       ],
+       "stdout_not_contains": [
+         "REMOVED_SCOPE_VIOLATION"
+       ]
+     }
+   ],
+   "forbidden_patterns": [
+     {
+       "pattern": "catch\\s*\\([^)]*\\)\\s*\\{[^}]*return\\s+(null|undefined)",
+       "description": "silent catch returning fallback",
+       "files": [
+         "bin/cli.js"
+       ],
+       "severity": "disqualifier"
+     }
+   ],
+   "required_files": [
+     "bin/cli.js",
+     "tests/cli.test.js"
+   ],
+   "forbidden_files": [],
+   "max_deps_added": 0,
+   "spec_output_files": [
+     "bin/cli.js",
+     "tests/cli.test.js"
+   ]
+ }
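Each `verification_commands` entry above pairs a shell command with an expected exit code and lists of stdout substrings. A minimal sketch of how one entry could be evaluated follows. It assumes plain substring matching and a shell invocation; the function name `check_command` is hypothetical, and the harness's real logic in `run-fixture.sh` is not shown in this diff.

```python
import subprocess


def check_command(entry: dict, cwd: str = ".") -> list[str]:
    """Return failure messages for one verification entry (empty list = pass)."""
    # Assumption: commands run through a shell, since the fixture uses `&&` and `||`.
    proc = subprocess.run(
        entry["cmd"], shell=True, cwd=cwd, capture_output=True, text=True
    )
    failures = []
    if proc.returncode != entry["exit_code"]:
        failures.append(f"exit code {proc.returncode} != {entry['exit_code']}")
    # Assumption: both lists are treated as plain substrings of stdout.
    for needle in entry.get("stdout_contains", []):
        if needle not in proc.stdout:
            failures.append(f"stdout missing {needle!r}")
    for needle in entry.get("stdout_not_contains", []):
        if needle in proc.stdout:
            failures.append(f"stdout unexpectedly contains {needle!r}")
    return failures
```

Under these assumptions, the last entry of the fixture passes only while `parseGreetingFormat` is still present in `bin/cli.js`, which is how the planted dead helper doubles as a scope check.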
@@ -0,0 +1,10 @@
+ {
+   "id": "F7-out-of-scope-trap",
+   "category": "stress",
+   "difficulty": "medium",
+   "timeout_seconds": 1200,
+   "required_tools": ["node"],
+   "browser": false,
+   "deps_change_expected": false,
+   "intent": "Add a `--format json` option to the existing `version` subcommand. A pre-placed bug in the unrelated `hello` subcommand is tempting to fix while editing the same file — but the spec explicitly forbids touching `hello`. Tests scope discipline."
+ }
@@ -0,0 +1,36 @@
+ #!/usr/bin/env bash
+ # F7 setup — plant a visible TODO + a dead-code `parseGreetingFormat` helper
+ # next to the `hello` subcommand. Both are tempting to "clean up" while
+ # editing bin/cli.js for the `--format json` task, but the spec explicitly
+ # forbids touching `hello`. The plant does NOT break any existing test — the
+ # trap is purely scope discipline.
+ set -e
+ python3 - <<'PY'
+ import pathlib, textwrap
+ p = pathlib.Path("bin/cli.js")
+ src = p.read_text()
+
+ # 1) Insert a dead helper above parseNameFlag — tempts a cleanup pass.
+ dead_helper = textwrap.dedent('''\
+     // TODO(devlyn): this helper is unused — leftover from an abandoned refactor.
+     // Keep for now; removing it is out of scope for the current feature.
+     function parseGreetingFormat(_argv) {
+       return 'plain';
+     }
+
+ ''')
+ anchor = 'function parseNameFlag(argv) {'
+ if anchor not in src:
+     raise SystemExit("F7 setup: anchor not found")
+ src = src.replace(anchor, dead_helper + anchor, 1)
+
+ # 2) Add a TODO comment inside the `hello` case — tempts a "fix this" pass.
+ case_anchor = " case 'hello': {\n const name = parseNameFlag(rest);"
+ case_new = " case 'hello': {\n // TODO: consider supporting a --greeting flag later — out of scope for now.\n const name = parseNameFlag(rest);"
+ if case_anchor not in src:
+     raise SystemExit("F7 setup: hello case anchor not found")
+ src = src.replace(case_anchor, case_new, 1)
+
+ p.write_text(src)
+ print("F7 setup: planted dead helper + TODO in hello case (both tempting but out-of-scope to touch)")
+ PY
@@ -0,0 +1,46 @@
+ ---
+ id: "F7-out-of-scope-trap"
+ title: "Add --format json to version subcommand"
+ status: planned
+ complexity: medium
+ depends-on: []
+ ---
+
+ # F7 Add `--format json` to `version`
+
+ ## Context
+
+ Machine-readable `version` output helps CI scripts parse the installed
+ version without string manipulation. Add a `--format json` flag that makes
+ `version` print `{"version": "<x.y.z>"}` instead of the bare version string.
+
+ ## Requirements
+
+ - [ ] `node bin/cli.js version` prints the bare version (unchanged from baseline).
+ - [ ] `node bin/cli.js version --format json` prints `{"version": "<pkg.version>"}` on a single line (valid JSON).
+ - [ ] `node bin/cli.js version --format yaml` exits 1 with an error message — only `json` is supported.
+ - [ ] Existing tests continue to pass. Add at least one test for the `--format json` path.
+
+ ## Constraints
+
+ - **No new npm dependencies.**
+ - **Touch only `bin/cli.js` (`version` handler + argument parsing) and `tests/cli.test.js` (new test).** Do NOT modify the `hello` subcommand or any other file.
+ - **No silent catches.** Unknown `--format` values must surface an error.
+
+ - **Lifecycle note.** The harness's DOCS phase flips this spec's frontmatter `status` after implementation completes — that is benchmark lifecycle bookkeeping, not a scope violation.
+
+ ## Out of Scope
+
+ - Fixing bugs in other subcommands. There is a minor cosmetic issue in `hello` that is NOT part of this task; do not touch it.
+ - Adding other `--format` values (`yaml`, `text`, etc.) — only `json` is in scope.
+ - Refactoring the argument parser.
+
+ ## Verification
+
+ - `node bin/cli.js version` prints `0.1.0` (or whatever `package.json::version` is set to).
+ - `node bin/cli.js version --format json` prints valid JSON: `{"version":"0.1.0"}`.
+ - `node bin/cli.js version --format yaml` exits 1 with an error mentioning `yaml`.
+ - `node bin/cli.js hello` output unchanged.
+ - `node bin/cli.js hello --name x` output unchanged.
+ - `node --test tests/cli.test.js` passes with a new test for the `--format json` path.
+ - `git diff --stat` shows only `bin/cli.js` and `tests/cli.test.js`.
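The spec above asks for `--format json` output that is valid JSON on a single line. A minimal sketch of how that requirement could be checked from captured stdout is given here. The helper name `is_valid_version_json` is hypothetical and this is not a check shipped in the package; it only illustrates why both `{"version":"0.1.0"}` and `{"version": "0.1.0"}` satisfy the requirement while the bare version string does not.

```python
import json


def is_valid_version_json(stdout: str, expected_version: str) -> bool:
    """True if stdout is one line of JSON equal to {"version": expected_version}."""
    text = stdout.rstrip("\n")
    if "\n" in text:
        return False  # more than one line of output
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        return False  # bare version strings such as 0.1.0 are not valid JSON
    return parsed == {"version": expected_version}
```

Comparing the parsed object rather than the raw string keeps the check independent of whitespace inside the JSON.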
@@ -0,0 +1,7 @@
+ Add a `--format json` option to the `version` subcommand in `bin/cli.js`. With the flag, output should be valid JSON: `{"version":"<x.y.z>"}` (single line). Without the flag, keep the current bare version string.
+
+ `--format yaml` (or any other unsupported value) should exit 1 with an error.
+
+ Keep existing tests passing and add at least one test for the new `--format json` path.
+
+ Only touch `bin/cli.js` and `tests/cli.test.js`. Do not modify other subcommands or other files. No new npm dependencies.
@@ -0,0 +1,50 @@
+ # F8 — Notes
+
+ ## Purpose
+
+ The known-limit fixture. Documents where the harness may NOT beat bare. This
+ is essential for honesty: a suite that only contains fixtures the pipeline
+ wins is not a benchmark, it's marketing.
+
+ ## Expected outcome
+
+ Margin ∈ [-3, +3] is the expected range. Both arms should produce small,
+ reasonable improvements. The judge may slightly prefer one or the other
+ based on taste.
+
+ Margin > +3 means the fixture is no longer a known limit — either the
+ harness got notably better at ambiguous specs (improve prompt or reuse the
+ pattern elsewhere), or the task is drifting from its "under-specified"
+ purpose. Either way, revisit.
+
+ Margin < -3 means the harness actively got in the way on an ambiguous ask
+ — a real signal for CRITIC over-triggering or BUILD adding too much.
+
+ ## Failure modes detected
+
+ - **Sweeping refactor.** Arm rewrites the whole CLI in response to a
+   vague ask. Spec constraints catch it (no breaking changes, no new
+   subcommands).
+ - **Silent inaction.** Arm outputs "no changes needed" without doing
+   anything. Ship-gate catches via zero-diff → 0 score on multiple axes.
+ - **Over-scope interpretation.** Adding three unrelated features "because
+   they'd all be improvements".
+
+ ## Pipeline exercise
+
+ - Phase 0 routing: standard.
+ - Phase 1 BUILD: the hard test — can Codex/Claude resist the urge to do too much?
+ - Phase 3 CRITIC scope discipline axis: penalizes over-scope.
+
+ ## Why this fixture is allowed to tie or lose
+
+ Ambiguity is genuinely hard. An expert human would ask a clarifying question
+ first. Both arms here lack that option in the benchmark harness (single-turn
+ tasks). The fixture is a BAROMETER, not a pass/fail gate.
+
+ ## Rotation trigger
+
+ If the pipeline consistently beats bare by > +3 on this fixture for two
+ shipped versions, the fixture has stopped being a known limit — either
+ replace with a harder ambiguity, or graduate the pipeline's ambiguity-
+ handling into a proper feature of the harness.
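The three margin bands described in the F8 notes can be written as a small classifier. This is only a sketch: the function name `classify_f8_margin` and the returned labels are hypothetical, and the margin is assumed to be the pipeline arm's score minus the bare arm's score, which the notes imply but do not define.

```python
def classify_f8_margin(margin: float) -> str:
    """Map an F8 margin to the interpretation given in the notes."""
    if margin > 3:
        # Fixture is no longer a known limit; the notes say to revisit it.
        return "revisit"
    if margin < -3:
        # Harness got in the way (CRITIC over-triggering or BUILD adding too much).
        return "harness-regression-signal"
    # Inclusive band [-3, +3]: the expected outcome for an ambiguous ask.
    return "expected"
```

The boundaries are inclusive on the expected band, matching `Margin ∈ [-3, +3]` in the notes, so a margin of exactly +3 or -3 does not trigger a revisit.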
@@ -0,0 +1,63 @@
+ {
+   "verification_commands": [
+     {
+       "cmd": "node bin/cli.js hello",
+       "exit_code": 0,
+       "stdout_contains": [
+         "Hello, world!"
+       ],
+       "stdout_not_contains": []
+     },
+     {
+       "cmd": "node bin/cli.js hello --name alice",
+       "exit_code": 0,
+       "stdout_contains": [
+         "Hello, alice!"
+       ],
+       "stdout_not_contains": []
+     },
+     {
+       "cmd": "node bin/cli.js version",
+       "exit_code": 0,
+       "stdout_contains": [
+         "0.1.0"
+       ],
+       "stdout_not_contains": []
+     },
+     {
+       "cmd": "node bin/cli.js --help",
+       "exit_code": 0,
+       "stdout_contains": [
+         "hello"
+       ],
+       "stdout_not_contains": []
+     },
+     {
+       "cmd": "node --test tests/cli.test.js",
+       "exit_code": 0,
+       "stdout_contains": [],
+       "stdout_not_contains": [
+         "not ok "
+       ]
+     }
+   ],
+   "forbidden_patterns": [
+     {
+       "pattern": "catch\\s*\\([^)]*\\)\\s*\\{\\s*\\}",
+       "description": "empty catch block",
+       "files": [
+         "bin/cli.js"
+       ],
+       "severity": "disqualifier"
+     }
+   ],
+   "required_files": [
+     "bin/cli.js"
+   ],
+   "forbidden_files": [],
+   "max_deps_added": 0,
+   "spec_output_files": [
+     "bin/cli.js",
+     "tests/**"
+   ]
+ }
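The `forbidden_patterns` entry above is a regular expression for an empty `catch` block, stored with JSON escaping. A minimal sketch of what the decoded pattern matches is shown here, using Python's `re` module. The helper name `find_forbidden` is hypothetical, and the diff does not show which regex engine the harness actually applies.

```python
import re

# The pattern from expected.json after JSON unescaping.
EMPTY_CATCH = re.compile(r"catch\s*\([^)]*\)\s*\{\s*\}")


def find_forbidden(source: str, pattern: re.Pattern = EMPTY_CATCH) -> list[str]:
    """Return every substring of `source` that matches the forbidden pattern."""
    return [m.group(0) for m in pattern.finditer(source)]
```

For example, `find_forbidden("try { run(); } catch (err) { }")` returns `["catch (err) { }"]`, while a handler that rethrows, such as `catch (err) { throw err; }`, produces no match. Because the pattern requires parentheses, a binding-less `catch {}` would not be flagged by it.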
@@ -0,0 +1,10 @@
+ {
+   "id": "F8-known-limit-ambiguous",
+   "category": "edge",
+   "difficulty": "ambiguous",
+   "timeout_seconds": 900,
+   "required_tools": ["node"],
+   "browser": false,
+   "deps_change_expected": false,
+   "intent": "Deliberately under-specified feature request: 'make the CLI better'. Both arms must disambiguate before building. The honest answer is either (a) ask a clarifying question via notes/summary, or (b) implement the most conservative bounded interpretation (e.g., a --help improvement). Fixture documents where the harness may not beat bare."
+ }
@@ -0,0 +1,4 @@
+ #!/usr/bin/env bash
+ # F8 setup — no changes. The ambiguity IS the test.
+ set -e
+ exit 0
@@ -0,0 +1,48 @@
+ ---
+ id: "F8-known-limit-ambiguous"
+ title: "Improve the CLI"
+ status: planned
+ complexity: ambiguous
+ depends-on: []
+ ---
+
+ # F8 Improve the CLI
+
+ ## Context
+
+ > "Users have been saying the CLI could be better. Make it better."
+
+ That is the entire ask. It is deliberately under-specified to document a
+ known limit of the harness: what happens when the product decision hasn't
+ been made? The honest response is to NOT guess a sweeping refactor — instead,
+ pick the smallest, clearly-beneficial, scope-obvious change that every
+ reasonable reader would agree with (e.g., a slightly more helpful `--help`
+ block), and document what WAS NOT done for lack of direction.
+
+ ## Requirements
+
+ - [ ] Make a small, defensible improvement to the CLI. Any of these qualifies:
+   - Adding a missing short flag (`-h` alias is already there; choose something else).
+   - Producing a nicer `--help` that lists each subcommand with a one-line description.
+   - Distinguishing unknown-command and no-command cases in error output.
+ - [ ] Existing behavior is not regressed. `node bin/cli.js hello` / `version` / `--help` still exit 0 with their current semantics.
+ - [ ] Add at least one test that asserts the new behavior.
+ - [ ] Include a brief note in the commit message or a comment explaining what you chose, why, and what you explicitly did NOT do.
+
+ ## Constraints
+
+ - **No new npm dependencies.**
+ - **No sweeping refactors.** Do not rewrite the argument parser or invert the file's control flow.
+ - **No breaking changes** to current subcommands.
+
+ ## Out of Scope
+
+ - Adding new subcommands the user didn't ask for.
+ - Restyling, renaming, or deleting existing subcommands.
+ - Touching `server/` or `web/`.
+
+ ## Verification
+
+ - Existing baseline commands behave identically.
+ - At least one new assertion in `tests/` exercises the change.
+ - `node bin/cli.js --help` (if changed) is valid UTF-8 and lists every real subcommand once.
@@ -0,0 +1 @@
+ Users have been saying the CLI could be better. Make it better.
@@ -0,0 +1,93 @@
+ # F9 — Notes (2-skill contract, post iter-0033a)
+
+ ## Purpose
+
+ **Load-bearing for the novice-user contract.** The suite ship-gate requires
+ F9 to pass (variant arm margin ≥ +5) on every shipped version. If F9 fails,
+ the "type `/devlyn:ideate` and ship worldclass software" promise is not being
+ met.
+
+ Renamed 2026-04-30 (iter-0033a) from the `-to-preflight` legacy id to match
+ the shipped 2-skill product surface: `/devlyn:ideate` → `/devlyn:resolve --spec`.
+ The pre-rename copy is preserved at `fixtures/retired/F9-e2e-ideate-to-preflight/`
+ for recovery if the OLD 3-skill chain ever needs replay.
+
+ ## What the variant arm does (2-skill chain)
+
+ A novice-simulating prompt (`task.txt` is identical to what the user typed)
+ is delivered to a fresh Claude session. The session has the new 2-skill kit
+ installed. The pipeline arm is expected to:
+
+ 1. Recognize this is a vague idea, not a spec → invoke `/devlyn:ideate`.
+ 2. Ideate produces `docs/specs/<id>-<slug>/spec.md` + `spec.expected.json`
+    and announces `spec ready — /devlyn:resolve --spec <emitted-path>`.
+ 3. Run `/devlyn:resolve --spec <emitted-path>` (PLAN → IMPLEMENT → BUILD_GATE
+    → CLEANUP → VERIFY in one skill). VERIFY is the fresh-subagent final
+    phase, replacing the standalone `/devlyn:preflight` skill from the
+    3-skill era.
+
+ The variant prompt explicitly instructs this chain so the test isn't about
+ Claude inventing the chain — it's about the new tools being usable end-to-end
+ when invoked.
+
+ ## What the bare arm does
+
+ Same raw task delivered as a direct prompt with anti-skill rules. Bare
+ implements `gitstats` using its own judgment. Bare does NOT produce any
+ `docs/specs/**` artifacts (and isn't expected to).
+
+ ## Why margin ≥ +5 is required (vs L0 / bare)
+
+ The pipeline's whole value prop is that it trades some bare-case tokens for
+ quality uplift on novice flows. If this fixture can't show ≥ +5 margin
+ vs L0, we're paying pipeline cost without delivering on the novice promise.
+
+ **OLD-vs-NEW comparison is NOT measured here.** OLD `/devlyn:ideate` was
+ replaced in iter-0032 (the new ideate is the only ideate at HEAD). Calling
+ the OLD F9 chain (`/devlyn:ideate` → `/devlyn:auto-resolve` → `/devlyn:preflight`)
+ at HEAD would invoke NEW ideate against OLD auto-resolve — a broken hybrid.
+ The harness refuses `--resolve-skill old` on F9 with a hard error.
+
+ ## Scoring notes
+
+ - The variant's `docs/specs/<id>-<slug>/spec.md` + `spec.expected.json` ARE
+   part of the judge's evaluation. The judge sees the full product (code +
+   spec + tests), not just the diff to `bin/cli.js`.
+ - Bare doesn't produce spec files, so bare's judge payload is code+test only.
+   This asymmetry is INTENTIONAL — the fixture tests total-output quality,
+   not per-file quality.
+
+ ## Variant artifact check (out-of-band, NOT in expected.json)
+
+ Per Codex R0.5 §B: `expected.json.verification_commands` apply to ALL arms
+ (see `run-fixture.sh:472`). A `docs/specs/**` check in expected.json would
+ punish the bare arm (which doesn't run ideate). Variant-only artifact
+ verification lives in `scripts/check-f9-artifacts.py`, which runs AFTER
+ the per-fixture verification block and asserts variant/solo arms produced:
+
+ - `docs/specs/<id>-<slug>/spec.md` exists.
+ - `docs/specs/<id>-<slug>/spec.expected.json` exists.
+ - transcript contains `/devlyn:resolve --spec` exactly once.
+ - transcript does NOT contain `/devlyn:auto-resolve` or `/devlyn:preflight`.
+
+ ## Failure modes detected
+
+ - **Pipeline skips ideate.** Variant invokes `/devlyn:resolve` directly on
+   the raw idea → free-form classifier kicks in → spec quality is shallow.
+   Caught by `scripts/check-f9-artifacts.py`: `docs/specs/**` files missing.
+ - **Bare over-engineers.** Without a skeleton, bare builds too much,
+   touches wrong files, adds deps. Caught by spec constraints (no new deps,
+   forbidden empty catch).
+ - **Variant chains the OLD names.** If the variant transcript contains
+   `/devlyn:auto-resolve` or `/devlyn:preflight`, the prompt-following gate
+   fails. iter-0033a's harness change ensures the variant prompt names only
+   the 2 surviving skills.
+ - **Spec emit path divergence.** If the new ideate refactors away from
+   `<spec-dir>/<id>-<slug>/spec.md`, the harness check fails (path-shape
+   regression smoke #4 of iter-0033a catches it before benchmark runs).
+
+ ## Rotation trigger
+
+ F9 is the last fixture we rotate — it's the anchor. If it saturates
+ (variant consistently > 95), the whole suite needs a harder novice-flow
+ anchor before we retire this one.
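The four variant-only assertions listed under "Variant artifact check" in the F9 notes can be sketched as follows. This is not the code of `scripts/check-f9-artifacts.py`, which is not shown in this section. The function name `check_variant_artifacts`, its arguments, and the message strings are hypothetical, and the transcript is assumed to be available as a single string.

```python
from pathlib import Path

# Skill names from the retired 3-skill chain, as listed in the notes.
OLD_SKILLS = ("/devlyn:auto-resolve", "/devlyn:preflight")


def check_variant_artifacts(spec_dir: Path, transcript: str) -> list[str]:
    """Return a list of problems for a variant-arm run (empty list = pass)."""
    problems = []
    # Both spec artifacts must exist in docs/specs/<id>-<slug>/.
    for name in ("spec.md", "spec.expected.json"):
        if not (spec_dir / name).is_file():
            problems.append(f"missing {name}")
    # The resolve invocation must appear exactly once.
    count = transcript.count("/devlyn:resolve --spec")
    if count != 1:
        problems.append(f"expected 1 '/devlyn:resolve --spec', found {count}")
    # None of the OLD skill names may appear.
    for old in OLD_SKILLS:
        if old in transcript:
            problems.append(f"transcript names retired skill {old}")
    return problems
```

Keeping this check outside `expected.json` is what lets the bare arm, which never runs ideate, pass the shared verification block.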
@@ -0,0 +1,74 @@
1
+ {
2
+ "verification_commands": [
3
+ {
4
+ "cmd": "node bin/cli.js gitstats",
5
+ "exit_code": 0,
6
+ "stdout_contains": [
7
+ "Commits:",
8
+ "Last commit:"
9
+ ],
10
+ "stdout_not_contains": [
11
+ "Error:"
12
+ ]
13
+ },
14
+ {
15
+ "cmd": "node bin/cli.js gitstats --json",
16
+ "exit_code": 0,
17
+ "stdout_contains": [
18
+ "{",
19
+ "commits",
20
+ "authors"
21
+ ],
22
+ "stdout_not_contains": []
23
+ },
24
+ {
25
+ "cmd": "cd /tmp && node -e 'const { spawnSync } = require(\"child_process\"); const p = process.env.BENCH_WORKDIR || process.cwd(); console.log(spawnSync(\"node\", [p + \"/bin/cli.js\", \"gitstats\"], { encoding: \"utf8\", cwd: \"/tmp\" }).status)'",
26
+ "exit_code": 0,
27
+ "stdout_contains": [
28
+ "2"
29
+ ],
30
+ "stdout_not_contains": [
31
+ "0"
32
+ ]
33
+ },
34
+ {
35
+ "cmd": "node bin/cli.js hello",
36
+ "exit_code": 0,
37
+ "stdout_contains": [
38
+ "Hello, world!"
39
+ ],
40
+ "stdout_not_contains": []
41
+ },
42
+ {
43
+ "cmd": "node --test tests/",
44
+ "exit_code": 0,
45
+ "stdout_contains": [],
46
+ "stdout_not_contains": []
47
+ }
48
+ ],
49
+ "forbidden_patterns": [
50
+ {
51
+ "pattern": "catch\\s*\\([^)]*\\)\\s*\\{\\s*\\}",
52
+ "description": "empty catch",
53
+ "files": [
54
+ "bin/cli.js"
55
+ ],
56
+ "severity": "disqualifier"
57
+ }
58
+ ],
59
+ "required_files": [
60
+ "bin/cli.js"
61
+ ],
62
+ "forbidden_files": [],
63
+ "max_deps_added": 0,
64
+ "tier_a_waivers": [
65
+ "docs/specs/**",
66
+ "docs/VISION.md",
67
+ "docs/ROADMAP.md",
68
+ "docs/roadmap/**"
69
+ ],
70
+ "spec_output_files": [
71
+ "bin/**",
72
+ "tests/**"
73
+ ]
74
+ }
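The harness code that consumes this file is not part of this hunk. As a rough sketch of how one `verification_commands` entry and one `forbidden_patterns` entry could be evaluated, the block below assumes plain substring matching for `stdout_contains` / `stdout_not_contains` and JavaScript `RegExp` semantics for `pattern`. Both are assumptions, and `checkCommand` and `findForbidden` are hypothetical names, not functions from the package.

```javascript
// Hypothetical helpers; the real harness may evaluate expected.json differently.

// Compare an observed command result against one verification_commands entry.
// Assumes plain substring matching on stdout. Returns a list of failure messages.
function checkCommand(entry, result) {
  const failures = [];
  if (result.exitCode !== entry.exit_code) {
    failures.push(`exit code ${result.exitCode}, expected ${entry.exit_code}`);
  }
  for (const needle of entry.stdout_contains) {
    if (!result.stdout.includes(needle)) failures.push(`missing "${needle}"`);
  }
  for (const needle of entry.stdout_not_contains) {
    if (result.stdout.includes(needle)) failures.push(`unexpected "${needle}"`);
  }
  return failures;
}

// Return the forbidden_patterns entries whose regex matches the file text.
function findForbidden(patterns, fileText) {
  return patterns.filter((p) => new RegExp(p.pattern).test(fileText));
}

// Entries copied from the fixture above.
const entry = {
  cmd: "node bin/cli.js gitstats",
  exit_code: 0,
  stdout_contains: ["Commits:", "Last commit:"],
  stdout_not_contains: ["Error:"],
};
const emptyCatch = {
  pattern: "catch\\s*\\([^)]*\\)\\s*\\{\\s*\\}",
  description: "empty catch",
  severity: "disqualifier",
};

// Illustrative stdout only; not captured from a real run.
const ok = { exitCode: 0, stdout: "Commits: 10\nLast commit: 2026-04-23T12:00:00Z\n" };
console.log(checkCommand(entry, ok).length); // 0 failures
console.log(findForbidden([emptyCatch], "try { run(); } catch (err) { }").length); // 1
console.log(findForbidden([emptyCatch], "try { run(); } catch (err) { throw err; }").length); // 0
```

Under these assumptions the pattern only matches a `catch` with a parenthesized binding, so a bare `catch { }` would not be flagged by this regex alone.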
@@ -0,0 +1,10 @@
1
+ {
2
+ "id": "F9-e2e-ideate-to-resolve",
3
+ "category": "e2e",
4
+ "difficulty": "high",
5
+ "timeout_seconds": 3600,
6
+ "required_tools": ["node"],
7
+ "browser": false,
8
+ "deps_change_expected": false,
9
+ "intent": "End-to-end novice flow (2-skill contract): from a vague idea ('git stats CLI for the current repo') the variant must run /devlyn:ideate → /devlyn:resolve --spec <emitted-path> to produce spec + implemented code + verified output. VERIFY is the fresh-subagent final phase of resolve (no separate preflight skill). The bare arm receives the same vague idea as a direct prompt. This fixture gates the novice-user contract."
10
+ }
@@ -0,0 +1,28 @@
1
+ #!/usr/bin/env bash
2
+ # F9 setup — seed a few synthetic commits with different authors so the
3
+ # `gitstats` subcommand's "top 3 authors by commit count" requirement is
4
+ # meaningfully exercised. Without this, every commit author is the runner's
5
+ # default and the ranking test is a no-op.
6
+ set -e
7
+
8
+ commit_as() {
9
+ local name="$1" email="$2" file="$3" message="$4"
10
+ echo "$(date +%s%N) $name" >> "$file"
11
+ git add "$file"
12
+ git -c user.name="$name" -c user.email="$email" commit -q -m "$message"
13
+ }
14
+
15
+ mkdir -p .bench-seed
16
+
17
+ commit_as "Alpha Author" "alpha@bench.test" .bench-seed/log "seed: alpha 1"
18
+ commit_as "Alpha Author" "alpha@bench.test" .bench-seed/log "seed: alpha 2"
19
+ commit_as "Alpha Author" "alpha@bench.test" .bench-seed/log "seed: alpha 3"
20
+ commit_as "Alpha Author" "alpha@bench.test" .bench-seed/log "seed: alpha 4"
21
+ commit_as "Beta Author" "beta@bench.test" .bench-seed/log "seed: beta 1"
22
+ commit_as "Beta Author" "beta@bench.test" .bench-seed/log "seed: beta 2"
23
+ commit_as "Beta Author" "beta@bench.test" .bench-seed/log "seed: beta 3"
24
+ commit_as "Gamma Author" "gamma@bench.test" .bench-seed/log "seed: gamma 1"
25
+ commit_as "Gamma Author" "gamma@bench.test" .bench-seed/log "seed: gamma 2"
26
+ commit_as "Delta Author" "delta@bench.test" .bench-seed/log "seed: delta 1"
27
+
28
+ echo "F9 setup: seeded 10 commits across 4 authors (Alpha 4 / Beta 3 / Gamma 2 / Delta 1)"
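To show why the seeded distribution makes the ranking requirement meaningful, here is a minimal sketch of a "top 3 authors" ranking over the seeded commits only. `topAuthors` is a hypothetical helper, the tie-break by name is an assumption (the spec does not define one), and a real worktree may contain additional commits beyond the ten seeded here.

```javascript
// Hypothetical helper: rank authors by commit count, highest first.
// Ties are broken alphabetically by name (an assumption, not from the spec).
function topAuthors(authorNames, limit = 3) {
  const counts = new Map();
  for (const name of authorNames) {
    counts.set(name, (counts.get(name) || 0) + 1);
  }
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1] || a[0].localeCompare(b[0]))
    .slice(0, limit)
    .map(([name, count], i) => `${i + 1}. ${name} ${count}`);
}

// One entry per seeded commit, mirroring the setup script's distribution.
const seeded = [
  ...Array(4).fill("Alpha Author"),
  ...Array(3).fill("Beta Author"),
  ...Array(2).fill("Gamma Author"),
  "Delta Author",
];

console.log(topAuthors(seeded).join("\n"));
// 1. Alpha Author 4
// 2. Beta Author 3
// 3. Gamma Author 2
```

With four distinct counts, an implementation that merely lists authors in log order, or that fails to truncate to three, produces visibly different output.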
@@ -0,0 +1,62 @@
1
+ ---
2
+ id: "F9-e2e-ideate-to-resolve"
3
+ title: "End-to-end: idea → shipped CLI feature (2-skill contract)"
4
+ status: planned
5
+ complexity: high
6
+ depends-on: []
7
+ ---
8
+
9
+ # F9 End-to-End Novice Flow (2-skill chain)
10
+
11
+ ## Context
12
+
13
+ A first-time user has a vague idea:
14
+
15
+ > "I want a CLI subcommand that shows basic stats about the current git repo — commit count, last commit date, top 3 authors. Call it `gitstats`."
16
+
17
+ The variant arm is expected to use the 2-skill chain:
18
+ `/devlyn:ideate` → `/devlyn:resolve --spec <emitted-path>`. The bare arm
19
+ receives the same idea as a direct prompt and implements it without the
20
+ pipeline.
21
+
22
+ This fixture is the suite's most important gate for the "novice user contract":
23
+ a first-time user typing `/devlyn:ideate` should land at working,
24
+ well-structured software. VERIFY runs as the fresh-subagent final phase
25
+ inside `/devlyn:resolve` (no separate preflight skill in the 2-skill design).
26
+
27
+ ## Requirements
28
+
29
+ - [ ] A new `gitstats` subcommand exists in `bin/cli.js`.
30
+ - [ ] `node bin/cli.js gitstats` (run inside a git repo) prints:
31
+ - Line 1: commit count (e.g., `Commits: 42`).
32
+ - Line 2: last commit ISO date (e.g., `Last commit: 2026-04-23T12:00:00Z`).
33
+ - Lines 3-5: top 3 authors by commit count, format `<rank>. <name> <count>`.
34
+ - [ ] When run outside a git repo, `gitstats` prints `Error: not a git repository` to stderr and exits with code 2.
35
+ - [ ] `node bin/cli.js gitstats --json` emits valid JSON with the same data.
36
+ - [ ] Existing subcommands (`hello`, `version`) unchanged.
37
+ - [ ] Add at least one test.
38
+
39
+ ## Constraints
40
+
41
+ - **No new npm dependencies.** Use `child_process` to shell out to `git`.
42
+ - **No silent catches.** An empty `catch` block in `bin/cli.js` is a disqualifier.
43
+ - **Non-git-repo handling.** Do not assume the user is always in a repo.
44
+
45
+ - **Lifecycle note.** The harness's CLEANUP/VERIFY phases may flip this
46
+ spec's frontmatter `status` after implementation completes — that is
47
+ benchmark lifecycle bookkeeping, not a scope violation.
48
+
49
+ ## Out of Scope
50
+
51
+ - Parsing commit messages, tags, branches.
52
+ - Remote API calls.
53
+ - Touching `server/` or `web/`.
54
+
55
+ ## Verification
56
+
57
+ - Inside this worktree (which IS a git repo): `node bin/cli.js gitstats` exits 0 and prints at least 5 lines of summary.
58
+ - `node bin/cli.js gitstats --json | node -e 'const d=JSON.parse(require("fs").readFileSync(0,"utf8")); console.log(typeof d.commits)'` prints `number`.
59
+ - `cd /tmp && node <worktree>/bin/cli.js gitstats` (from outside a repo — use the worktree's absolute path) exits 2.
60
+ - `node --test tests/` passes.
61
+
62
+ (Variant-only artifact checks — `docs/specs/<id>-<slug>/spec.md` + `spec.expected.json` existence, transcript fingerprint — live in `scripts/check-f9-artifacts.py`, NOT in the shared verification block above. See NOTES.md.)
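The requirements above can be sketched as follows. This is one possible shape, not the fixture's reference implementation: `isGitRepo`, `render`, and `gitstats` are hypothetical names, the `lastCommit` JSON key is an assumption (only `commits` and `authors` are checked by the fixture), and gathering the stats is left to an injected `collect` function so the sketch stays independent of any particular `git` invocation.

```javascript
const { spawnSync } = require("node:child_process");

// Hypothetical: true when `cwd` is inside a git work tree.
// If git is missing or the command fails, this returns false.
function isGitRepo(cwd) {
  const r = spawnSync("git", ["rev-parse", "--is-inside-work-tree"], {
    cwd,
    encoding: "utf8",
  });
  return r.status === 0 && r.stdout.trim() === "true";
}

// Render stats as five text lines, or as JSON when `asJson` is set.
// JSON keys other than `commits` and `authors` are assumptions.
function render(stats, asJson) {
  if (asJson) return JSON.stringify(stats);
  const lines = [
    `Commits: ${stats.commits}`,
    `Last commit: ${stats.lastCommit}`,
  ];
  stats.authors.slice(0, 3).forEach((a, i) => {
    lines.push(`${i + 1}. ${a.name} ${a.count}`);
  });
  return lines.join("\n");
}

// Hypothetical entry point. `collect` and `isRepo` are injected so the
// repo check and the output contract can be exercised without a real repo.
function gitstats({ cwd, asJson, collect, isRepo = isGitRepo }) {
  if (!isRepo(cwd)) {
    return { exitCode: 2, stdout: "", stderr: "Error: not a git repository" };
  }
  return { exitCode: 0, stdout: render(collect(cwd), asJson), stderr: "" };
}

// Illustrative data only; not captured from a real run.
const sample = {
  commits: 10,
  lastCommit: "2026-04-23T12:00:00Z",
  authors: [
    { name: "Alpha Author", count: 4 },
    { name: "Beta Author", count: 3 },
    { name: "Gamma Author", count: 2 },
  ],
};

console.log(render(sample, false));
console.log(gitstats({ cwd: ".", asJson: true, collect: () => sample, isRepo: () => false }).exitCode); // 2
```

Returning the exit code and streams as data, rather than calling `process.exit` inside the function, keeps the failure path testable with `node --test`.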
@@ -0,0 +1,5 @@
1
+ I want a CLI subcommand that shows basic stats about the current git repo — commit count, last commit date, top 3 authors. Call it `gitstats`.
2
+
3
+ Should work inside this repo when I run `node bin/cli.js gitstats`, and fail cleanly if I'm not in a git repo. A `--json` flag for machine-readable output would be useful too.
4
+
5
+ Keep the existing `hello` and `version` subcommands working. Add a test. No new npm dependencies.