devlyn-cli 1.15.0 → 2.1.0

Files changed (158)
  1. package/AGENTS.md +104 -0
  2. package/CLAUDE.md +135 -21
  3. package/README.md +43 -125
  4. package/benchmark/auto-resolve/BENCHMARK-DESIGN.md +272 -0
  5. package/benchmark/auto-resolve/README.md +114 -0
  6. package/benchmark/auto-resolve/RUBRIC.md +162 -0
  7. package/benchmark/auto-resolve/fixtures/F1-cli-trivial-flag/NOTES.md +30 -0
  8. package/benchmark/auto-resolve/fixtures/F1-cli-trivial-flag/expected.json +68 -0
  9. package/benchmark/auto-resolve/fixtures/F1-cli-trivial-flag/metadata.json +10 -0
  10. package/benchmark/auto-resolve/fixtures/F1-cli-trivial-flag/setup.sh +4 -0
  11. package/benchmark/auto-resolve/fixtures/F1-cli-trivial-flag/spec.md +45 -0
  12. package/benchmark/auto-resolve/fixtures/F1-cli-trivial-flag/task.txt +8 -0
  13. package/benchmark/auto-resolve/fixtures/F2-cli-medium-subcommand/NOTES.md +54 -0
  14. package/benchmark/auto-resolve/fixtures/F2-cli-medium-subcommand/expected-pair-plan-registry.json +170 -0
  15. package/benchmark/auto-resolve/fixtures/F2-cli-medium-subcommand/expected.json +84 -0
  16. package/benchmark/auto-resolve/fixtures/F2-cli-medium-subcommand/metadata.json +21 -0
  17. package/benchmark/auto-resolve/fixtures/F2-cli-medium-subcommand/pair-plan.sample-fail.json +214 -0
  18. package/benchmark/auto-resolve/fixtures/F2-cli-medium-subcommand/pair-plan.sample-pass.json +223 -0
  19. package/benchmark/auto-resolve/fixtures/F2-cli-medium-subcommand/setup.sh +5 -0
  20. package/benchmark/auto-resolve/fixtures/F2-cli-medium-subcommand/spec.md +56 -0
  21. package/benchmark/auto-resolve/fixtures/F2-cli-medium-subcommand/task.txt +14 -0
  22. package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/NOTES.md +28 -0
  23. package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/expected-pair-plan-registry.json +162 -0
  24. package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/expected.json +65 -0
  25. package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/metadata.json +19 -0
  26. package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/setup.sh +4 -0
  27. package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/spec.md +56 -0
  28. package/benchmark/auto-resolve/fixtures/F3-backend-contract-risk/task.txt +9 -0
  29. package/benchmark/auto-resolve/fixtures/F4-web-browser-design/NOTES.md +40 -0
  30. package/benchmark/auto-resolve/fixtures/F4-web-browser-design/expected.json +57 -0
  31. package/benchmark/auto-resolve/fixtures/F4-web-browser-design/metadata.json +10 -0
  32. package/benchmark/auto-resolve/fixtures/F4-web-browser-design/setup.sh +6 -0
  33. package/benchmark/auto-resolve/fixtures/F4-web-browser-design/spec.md +49 -0
  34. package/benchmark/auto-resolve/fixtures/F4-web-browser-design/task.txt +9 -0
  35. package/benchmark/auto-resolve/fixtures/F5-fix-loop-red-green/NOTES.md +38 -0
  36. package/benchmark/auto-resolve/fixtures/F5-fix-loop-red-green/expected.json +65 -0
  37. package/benchmark/auto-resolve/fixtures/F5-fix-loop-red-green/metadata.json +10 -0
  38. package/benchmark/auto-resolve/fixtures/F5-fix-loop-red-green/setup.sh +55 -0
  39. package/benchmark/auto-resolve/fixtures/F5-fix-loop-red-green/spec.md +49 -0
  40. package/benchmark/auto-resolve/fixtures/F5-fix-loop-red-green/task.txt +7 -0
  41. package/benchmark/auto-resolve/fixtures/F6-dep-audit-native-module/NOTES.md +38 -0
  42. package/benchmark/auto-resolve/fixtures/F6-dep-audit-native-module/expected.json +77 -0
  43. package/benchmark/auto-resolve/fixtures/F6-dep-audit-native-module/metadata.json +10 -0
  44. package/benchmark/auto-resolve/fixtures/F6-dep-audit-native-module/setup.sh +4 -0
  45. package/benchmark/auto-resolve/fixtures/F6-dep-audit-native-module/spec.md +49 -0
  46. package/benchmark/auto-resolve/fixtures/F6-dep-audit-native-module/task.txt +10 -0
  47. package/benchmark/auto-resolve/fixtures/F7-out-of-scope-trap/NOTES.md +50 -0
  48. package/benchmark/auto-resolve/fixtures/F7-out-of-scope-trap/expected.json +76 -0
  49. package/benchmark/auto-resolve/fixtures/F7-out-of-scope-trap/metadata.json +10 -0
  50. package/benchmark/auto-resolve/fixtures/F7-out-of-scope-trap/setup.sh +36 -0
  51. package/benchmark/auto-resolve/fixtures/F7-out-of-scope-trap/spec.md +46 -0
  52. package/benchmark/auto-resolve/fixtures/F7-out-of-scope-trap/task.txt +7 -0
  53. package/benchmark/auto-resolve/fixtures/F8-known-limit-ambiguous/NOTES.md +50 -0
  54. package/benchmark/auto-resolve/fixtures/F8-known-limit-ambiguous/expected.json +63 -0
  55. package/benchmark/auto-resolve/fixtures/F8-known-limit-ambiguous/metadata.json +10 -0
  56. package/benchmark/auto-resolve/fixtures/F8-known-limit-ambiguous/setup.sh +4 -0
  57. package/benchmark/auto-resolve/fixtures/F8-known-limit-ambiguous/spec.md +48 -0
  58. package/benchmark/auto-resolve/fixtures/F8-known-limit-ambiguous/task.txt +1 -0
  59. package/benchmark/auto-resolve/fixtures/F9-e2e-ideate-to-resolve/NOTES.md +93 -0
  60. package/benchmark/auto-resolve/fixtures/F9-e2e-ideate-to-resolve/expected.json +74 -0
  61. package/benchmark/auto-resolve/fixtures/F9-e2e-ideate-to-resolve/metadata.json +10 -0
  62. package/benchmark/auto-resolve/fixtures/F9-e2e-ideate-to-resolve/setup.sh +28 -0
  63. package/benchmark/auto-resolve/fixtures/F9-e2e-ideate-to-resolve/spec.md +62 -0
  64. package/benchmark/auto-resolve/fixtures/F9-e2e-ideate-to-resolve/task.txt +5 -0
  65. package/benchmark/auto-resolve/fixtures/SCHEMA.md +130 -0
  66. package/benchmark/auto-resolve/fixtures/test-repo/README.md +27 -0
  67. package/benchmark/auto-resolve/fixtures/test-repo/bin/cli.js +63 -0
  68. package/benchmark/auto-resolve/fixtures/test-repo/package-lock.json +823 -0
  69. package/benchmark/auto-resolve/fixtures/test-repo/package.json +22 -0
  70. package/benchmark/auto-resolve/fixtures/test-repo/playwright.config.js +17 -0
  71. package/benchmark/auto-resolve/fixtures/test-repo/server/index.js +37 -0
  72. package/benchmark/auto-resolve/fixtures/test-repo/tests/cli.test.js +25 -0
  73. package/benchmark/auto-resolve/fixtures/test-repo/tests/server.test.js +58 -0
  74. package/benchmark/auto-resolve/fixtures/test-repo/web/index.html +37 -0
  75. package/benchmark/auto-resolve/scripts/build-pair-eligible-manifest.py +174 -0
  76. package/benchmark/auto-resolve/scripts/check-f9-artifacts.py +256 -0
  77. package/benchmark/auto-resolve/scripts/compile-report.py +331 -0
  78. package/benchmark/auto-resolve/scripts/iter-0033c-compare.py +552 -0
  79. package/benchmark/auto-resolve/scripts/judge-opus-pass.sh +430 -0
  80. package/benchmark/auto-resolve/scripts/judge.sh +359 -0
  81. package/benchmark/auto-resolve/scripts/oracle-scope-tier-a.py +260 -0
  82. package/benchmark/auto-resolve/scripts/oracle-scope-tier-b.py +274 -0
  83. package/benchmark/auto-resolve/scripts/oracle-test-fidelity.py +328 -0
  84. package/benchmark/auto-resolve/scripts/pair-plan-idgen.py +401 -0
  85. package/benchmark/auto-resolve/scripts/pair-plan-lint.py +468 -0
  86. package/benchmark/auto-resolve/scripts/run-fixture.sh +691 -0
  87. package/benchmark/auto-resolve/scripts/run-iter-0033c.sh +234 -0
  88. package/benchmark/auto-resolve/scripts/run-suite.sh +214 -0
  89. package/benchmark/auto-resolve/scripts/ship-gate.py +222 -0
  90. package/bin/devlyn.js +175 -17
  91. package/config/skills/_shared/adapters/README.md +64 -0
  92. package/config/skills/_shared/adapters/gpt-5-5.md +29 -0
  93. package/config/skills/_shared/adapters/opus-4-7.md +29 -0
  94. package/config/skills/{devlyn:auto-resolve/scripts → _shared}/archive_run.py +26 -0
  95. package/config/skills/_shared/codex-config.md +54 -0
  96. package/config/skills/_shared/codex-monitored.sh +141 -0
  97. package/config/skills/_shared/engine-preflight.md +35 -0
  98. package/config/skills/_shared/expected.schema.json +93 -0
  99. package/config/skills/_shared/pair-plan-schema.md +298 -0
  100. package/config/skills/_shared/runtime-principles.md +110 -0
  101. package/config/skills/_shared/spec-verify-check.py +519 -0
  102. package/config/skills/devlyn:ideate/SKILL.md +99 -429
  103. package/config/skills/devlyn:ideate/references/elicitation.md +97 -0
  104. package/config/skills/devlyn:ideate/references/from-spec-mode.md +54 -0
  105. package/config/skills/devlyn:ideate/references/project-mode.md +76 -0
  106. package/config/skills/devlyn:ideate/references/spec-template.md +102 -0
  107. package/config/skills/devlyn:resolve/SKILL.md +172 -184
  108. package/config/skills/devlyn:resolve/references/free-form-mode.md +68 -0
  109. package/config/skills/devlyn:resolve/references/phases/build-gate.md +45 -0
  110. package/config/skills/devlyn:resolve/references/phases/cleanup.md +39 -0
  111. package/config/skills/devlyn:resolve/references/phases/implement.md +42 -0
  112. package/config/skills/devlyn:resolve/references/phases/plan.md +42 -0
  113. package/config/skills/devlyn:resolve/references/phases/verify.md +69 -0
  114. package/config/skills/devlyn:resolve/references/state-schema.md +106 -0
  115. package/{config/skills → optional-skills}/devlyn:design-system/SKILL.md +1 -0
  116. package/{config/skills → optional-skills}/devlyn:reap/SKILL.md +1 -0
  117. package/{config/skills → optional-skills}/devlyn:team-design-ui/SKILL.md +5 -0
  118. package/package.json +12 -2
  119. package/scripts/lint-skills.sh +431 -0
  120. package/config/skills/devlyn:auto-resolve/SKILL.md +0 -252
  121. package/config/skills/devlyn:auto-resolve/evals/evals.json +0 -21
  122. package/config/skills/devlyn:auto-resolve/evals/task-doctor-subcommand.md +0 -42
  123. package/config/skills/devlyn:auto-resolve/references/build-gate.md +0 -130
  124. package/config/skills/devlyn:auto-resolve/references/engine-routing.md +0 -82
  125. package/config/skills/devlyn:auto-resolve/references/findings-schema.md +0 -103
  126. package/config/skills/devlyn:auto-resolve/references/phases/phase-1-build.md +0 -54
  127. package/config/skills/devlyn:auto-resolve/references/phases/phase-2-evaluate.md +0 -45
  128. package/config/skills/devlyn:auto-resolve/references/phases/phase-3-critic.md +0 -84
  129. package/config/skills/devlyn:auto-resolve/references/pipeline-routing.md +0 -114
  130. package/config/skills/devlyn:auto-resolve/references/pipeline-state.md +0 -201
  131. package/config/skills/devlyn:auto-resolve/scripts/terminal_verdict.py +0 -96
  132. package/config/skills/devlyn:browser-validate/SKILL.md +0 -164
  133. package/config/skills/devlyn:browser-validate/references/flow-testing.md +0 -118
  134. package/config/skills/devlyn:browser-validate/references/tier1-chrome.md +0 -137
  135. package/config/skills/devlyn:browser-validate/references/tier2-playwright.md +0 -195
  136. package/config/skills/devlyn:browser-validate/references/tier3-curl.md +0 -57
  137. package/config/skills/devlyn:clean/SKILL.md +0 -285
  138. package/config/skills/devlyn:design-ui/SKILL.md +0 -351
  139. package/config/skills/devlyn:discover-product/SKILL.md +0 -124
  140. package/config/skills/devlyn:evaluate/SKILL.md +0 -564
  141. package/config/skills/devlyn:feature-spec/SKILL.md +0 -630
  142. package/config/skills/devlyn:ideate/references/challenge-rubric.md +0 -122
  143. package/config/skills/devlyn:ideate/references/codex-critic-template.md +0 -42
  144. package/config/skills/devlyn:ideate/references/templates/item-spec.md +0 -90
  145. package/config/skills/devlyn:implement-ui/SKILL.md +0 -466
  146. package/config/skills/devlyn:preflight/SKILL.md +0 -355
  147. package/config/skills/devlyn:preflight/references/auditors/browser-auditor.md +0 -32
  148. package/config/skills/devlyn:preflight/references/auditors/code-auditor.md +0 -86
  149. package/config/skills/devlyn:preflight/references/auditors/docs-auditor.md +0 -38
  150. package/config/skills/devlyn:product-spec/SKILL.md +0 -603
  151. package/config/skills/devlyn:recommend-features/SKILL.md +0 -286
  152. package/config/skills/devlyn:review/SKILL.md +0 -161
  153. package/config/skills/devlyn:team-resolve/SKILL.md +0 -631
  154. package/config/skills/devlyn:team-review/SKILL.md +0 -493
  155. package/config/skills/devlyn:update-docs/SKILL.md +0 -463
  156. package/config/skills/workflow-routing/SKILL.md +0 -73
  157. /package/{config/skills → optional-skills}/devlyn:reap/scripts/reap.sh +0 -0
  158. /package/{config/skills → optional-skills}/devlyn:reap/scripts/scan.sh +0 -0
@@ -0,0 +1,9 @@
1
+ The `GET /items` endpoint in `server/index.js` currently returns `{ items: [...] }`. Paginate it: the response should be `{ items, total, page, per_page }`. Accept `?page` and `?per_page` query params. When no params are given, return everything on page 1 with `per_page` equal to the full count.
2
+
3
+ Keep `GET /items/:id` unchanged (no pagination on single-item lookup). `GET /health` stays as-is.
4
+
5
+ Invalid `page` or `per_page` (non-numeric, zero, negative) → respond 400 with `{ error: 'invalid_query', field: '<name>' }`. Out-of-range page (beyond the last item) returns an empty `items` array, NOT a 404.
6
+
7
+ Update `tests/server.test.js` so existing behavior is still covered AND you add at least two new tests for the paging behavior.
8
+
9
+ No new npm dependencies. Only touch `server/index.js` and `tests/server.test.js`.
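The paging rules in this task can be sketched as a pure function, kept separate from any HTTP handling. This is a minimal sketch, not the fixture's reference solution: the names `paginate` and `parsePositiveInt` are hypothetical, and defaulting `per_page` to the full count when only `page` is supplied is an assumption the task text does not settle.

```javascript
// Hypothetical helper: parse a query value as an integer >= 1, or return null.
function parsePositiveInt(raw) {
  // Only plain decimal digits are accepted, so "-1", "1.5" and "abc" are rejected.
  if (typeof raw !== 'string' || !/^\d+$/.test(raw)) return null;
  const n = Number(raw);
  return Number.isSafeInteger(n) && n >= 1 ? n : null;
}

// Hypothetical helper: returns { status, body } for a GET /items request.
function paginate(items, query = {}) {
  const total = items.length;
  let page = 1;
  let perPage = total; // no params: everything on page 1 (assumed default for a lone ?page too)

  if (query.page !== undefined) {
    page = parsePositiveInt(query.page);
    if (page === null) {
      return { status: 400, body: { error: 'invalid_query', field: 'page' } };
    }
  }
  if (query.per_page !== undefined) {
    perPage = parsePositiveInt(query.per_page);
    if (perPage === null) {
      return { status: 400, body: { error: 'invalid_query', field: 'per_page' } };
    }
  }

  // An out-of-range page yields an empty slice, not a 404.
  const start = (page - 1) * perPage;
  return {
    status: 200,
    body: { items: items.slice(start, start + perPage), total, page, per_page: perPage },
  };
}
```

A route handler would pass the parsed query string to `paginate` and write out the returned status and body; how `server/index.js` parses queries is not shown in this diff.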
@@ -0,0 +1,40 @@
1
+ # F4 — Notes
2
+
3
+ ## Purpose
4
+
5
+ Exercises the browser-validate phase of the pipeline (Phase 1.5). Catches
6
+ web-UI-only regressions that unit tests can't see and that server/integration
7
+ tests won't surface.
8
+
9
+ ## Failure modes detected
10
+
11
+ - **Italic via Unicode.** Arms may reach for Unicode italic characters
12
+ (`𝑖𝑡𝑎𝑙𝑖𝑐`) instead of CSS. Spec explicitly forbids this because it breaks
13
+ screen readers.
14
+ - **CDN link.** Linking to Google Fonts or an external CSS cuts the bench
15
+ and breaks offline / air-gapped runs — disqualifier.
16
+ - **Breaking Greet.** Careless refactors rewire the Greet button's handler
17
+ by mistake. Pipeline's Phase 1.5 browser-validate + dedicated spec test
18
+ catches it.
19
+ - **Accessibility drift.** Missing/incorrect `aria-label` on button.
20
+
21
+ ## Pipeline exercise
22
+
23
+ - Phase 1.5 BROWSER VALIDATE is the primary gate (web file changes trigger it).
24
+ - Phase 3 CRITIC design checks the DOM structure and event-handler wiring.
25
+
26
+ ## Caveats
27
+
28
+ - Playwright requires browser binaries installed locally. If the runner
29
+ machine lacks them, the browser test commands will fail. The suite
30
+ runner can still score the diff + grep checks, but the Playwright
31
+ command will show exit ≠ 0.
32
+ - The bench runner sets `BROWSER_METADATA` so future versions can wire
33
+ stricter browser-required gating; today the fixture only checks file
34
+ presence in verification.
35
+
36
+ ## Rotation trigger
37
+
38
+ When both arms consistently produce correct output AND include accessible
39
+ markup without pipeline intervention, rotate to a harder UI task (e.g., a
40
+ form with validation states).
@@ -0,0 +1,57 @@
1
+ {
2
+ "verification_commands": [
3
+ {
4
+ "cmd": "grep -q 'id=\"whisper\"' web/index.html && echo OK",
5
+ "exit_code": 0,
6
+ "stdout_contains": [
7
+ "OK"
8
+ ],
9
+ "stdout_not_contains": []
10
+ },
11
+ {
12
+ "cmd": "grep -q 'hello from bench-test-repo' web/index.html && echo OK",
13
+ "exit_code": 0,
14
+ "stdout_contains": [
15
+ "OK"
16
+ ],
17
+ "stdout_not_contains": []
18
+ },
19
+ {
20
+ "cmd": "grep -qE '(italic|font-style)' web/index.html && echo OK",
21
+ "exit_code": 0,
22
+ "stdout_contains": [
23
+ "OK"
24
+ ],
25
+ "stdout_not_contains": []
26
+ },
27
+ {
28
+ "cmd": "bash -c 'shopt -s nullglob; files=(tests/e2e/*.spec.*); [ ${#files[@]} -gt 0 ] && echo FOUND || { echo MISSING; exit 1; }'",
29
+ "exit_code": 0,
30
+ "stdout_contains": [
31
+ "FOUND"
32
+ ],
33
+ "stdout_not_contains": [
34
+ "MISSING"
35
+ ]
36
+ }
37
+ ],
38
+ "forbidden_patterns": [
39
+ {
40
+ "pattern": "(cdnjs|unpkg|jsdelivr|fonts\\.googleapis)",
41
+ "description": "external CDN reference \u2014 out-of-scope / offline brittleness",
42
+ "files": [
43
+ "web/index.html"
44
+ ],
45
+ "severity": "disqualifier"
46
+ }
47
+ ],
48
+ "required_files": [
49
+ "web/index.html"
50
+ ],
51
+ "forbidden_files": [],
52
+ "max_deps_added": 0,
53
+ "spec_output_files": [
54
+ "web/index.html",
55
+ "tests/e2e/**"
56
+ ]
57
+ }
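Each `verification_commands` entry above pairs a shell command with an expected exit code and stdout constraints. The sketch below shows one way such an entry might be checked against a captured result. It is an illustration only: `checkCommandResult` is a hypothetical name, the `actual` shape is invented for this example, and treating `stdout_contains` / `stdout_not_contains` as plain substring checks is an assumption, since the harness scripts are not shown here.

```javascript
// Hypothetical checker: compares one expected.json entry with a captured run.
// `expected` has the fields shown in the fixture; `actual` is assumed to be
// { exitCode: number, stdout: string }.
function checkCommandResult(expected, actual) {
  const failures = [];

  if (actual.exitCode !== expected.exit_code) {
    failures.push(`exit_code: expected ${expected.exit_code}, got ${actual.exitCode}`);
  }
  // Assumption: each entry is matched as a literal substring of stdout.
  for (const needle of expected.stdout_contains ?? []) {
    if (!actual.stdout.includes(needle)) failures.push(`stdout missing: ${needle}`);
  }
  for (const needle of expected.stdout_not_contains ?? []) {
    if (actual.stdout.includes(needle)) failures.push(`stdout has forbidden text: ${needle}`);
  }
  return failures; // an empty array means the entry passed
}
```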
@@ -0,0 +1,10 @@
1
+ {
2
+ "id": "F4-web-browser-design",
3
+ "category": "stress",
4
+ "difficulty": "medium",
5
+ "timeout_seconds": 1800,
6
+ "required_tools": ["node", "npx"],
7
+ "browser": true,
8
+ "deps_change_expected": false,
9
+ "intent": "Add a second button labelled 'Whisper' to web/index.html that, when clicked, replaces the #output text with 'hello from bench-test-repo' rendered in lowercase italic. The existing 'Greet' button continues to work unchanged. Tests exercise both buttons via the static page (no server)."
10
+ }
@@ -0,0 +1,6 @@
1
+ #!/usr/bin/env bash
2
+ # F4 setup — no base changes needed. The task extends web/index.html and
3
+ # creates a Playwright test file.
4
+ set -e
5
+ mkdir -p tests/e2e
6
+ exit 0
@@ -0,0 +1,49 @@
1
+ ---
2
+ id: "F4-web-browser-design"
3
+ title: "Add a Whisper button with italic lowercase output"
4
+ status: planned
5
+ complexity: medium
6
+ depends-on: []
7
+ ---
8
+
9
+ # F4 Add Whisper button
10
+
11
+ ## Context
12
+
13
+ `web/index.html` currently has one button ("Greet") that fills `#output`
14
+ with `Hello from bench-test-repo`. Add a second button beside it labelled
15
+ `Whisper` that fills `#output` with `hello from bench-test-repo` — lowercase
16
+ and italicized — using only the page's own CSS/JS.
17
+
18
+ ## Requirements
19
+
20
+ - [ ] A new `<button id="whisper">Whisper</button>` renders beside the existing `#greet` button.
21
+ - [ ] Clicking `#whisper` sets `#output` textContent to `hello from bench-test-repo` (lowercase, no exclamation).
22
+ - [ ] `#output`'s rendering of the whisper text is italic. Use CSS (inline, a class, or toggling a class). Do not rely on Unicode italic characters.
23
+ - [ ] Clicking `#greet` continues to set `#output` to `Hello from bench-test-repo` as before (no italic styling).
24
+ - [ ] A text node in `#output` is readable by Playwright via `data-testid="output"` (already present in the baseline).
25
+ - [ ] Minimal diff: only `web/index.html` and any new files directly needed for the test harness (e.g., `tests/e2e/whisper.spec.js` per the existing Playwright config).
26
+
27
+ ## Constraints
28
+
29
+ - **No new npm dependencies.** Playwright is already scripted via `npx serve` and the repo's `playwright.config.js`.
30
+ - **No external resources.** Don't link to CDN fonts, external CSS, or remote images.
31
+ - **No inline JS frameworks.** Stick to the vanilla pattern already in `index.html`.
32
+ - **Accessibility.** Both buttons must have accessible names equal to their visible labels; `#whisper` adds `aria-label="whisper"` only if its visible text differs (it doesn't, so leave it off).
33
+
34
+ - **Lifecycle note.** The harness's DOCS phase flips this spec's frontmatter `status` after implementation completes — that is benchmark lifecycle bookkeeping, not a scope violation.
35
+
36
+ ## Out of Scope
37
+
38
+ - Animations / transitions.
39
+ - Theme toggle / dark mode.
40
+ - Any change to `bin/cli.js`, `server/`, or CLI tests.
41
+ - Moving styles into a separate .css file.
42
+
43
+ ## Verification
44
+
45
+ - Page loads: `npx serve -l 5173 web &` + `curl -s http://127.0.0.1:5173/` returns HTML containing `<button id="whisper"`.
46
+ - Clicking whisper produces `hello from bench-test-repo` in `#output` — verifiable via Playwright:
47
+ `npx playwright test tests/e2e/` passes the whisper spec.
48
+ - Clicking greet still produces `Hello from bench-test-repo` (test stays green).
49
+ - `git diff --stat` shows only `web/index.html` and the added Playwright test file.
@@ -0,0 +1,9 @@
1
+ Add a second button next to the existing "Greet" button in `web/index.html`, labelled "Whisper". When clicked, it should set `#output` to `hello from bench-test-repo` (lowercase, no exclamation mark) rendered in italic.
2
+
3
+ The existing "Greet" button must continue to set `#output` to `Hello from bench-test-repo` as before — no italic, no change.
4
+
5
+ Keep everything self-contained in the page: no CDN fonts, no new npm dependencies, no external resources. Use the same vanilla JS pattern that's already there.
6
+
7
+ Write a Playwright test under `tests/e2e/` that exercises both buttons. The repo already has `playwright.config.js` and serves `web/` via `npx serve -l 5173`.
8
+
9
+ Only touch `web/index.html` and the new Playwright test file.
@@ -0,0 +1,38 @@
1
+ # F5 — Notes
2
+
3
+ ## Purpose
4
+
5
+ The suite's FIX LOOP stress test. The tests are intentionally constructed so
6
+ the obvious first-pass implementation (simple `input.split(' ').filter(w => w === word).length`) passes the basic count case but fails on:
7
+
8
+ - Case insensitivity (`Cat` should match `cat`).
9
+ - Whole-word boundaries (`cat` should NOT match inside `category`).
10
+ - Empty-stdin edge (returning `undefined` instead of `0`).
11
+
12
+ Variant's pipeline is expected to:
13
+ 1. BUILD produces a first implementation.
14
+ 2. BUILD GATE runs `node --test`; some tests fail.
15
+ 3. EVAL emits findings with `criterion_ref` pointing at specific failing cases.
16
+ 4. FIX LOOP round 1 targets those findings and converges.
17
+
18
+ Bare, without a forcing mechanism, often ships the first implementation and
19
+ calls it done. Verification catches that.
20
+
21
+ ## Failure modes detected
22
+
23
+ - **Partial implementation.** Naive token split without regex word boundaries.
24
+ - **Case handling.** Missing `.toLowerCase()` on both sides of the comparison.
25
+ - **Async stdin.** Using `process.stdin.on('data')` without handling `end` properly → program hangs on test invocation.
26
+ - **Forgotten empty case.** `stdin.read()` returning `null` → `null.length` or `undefined` output.
27
+
28
+ ## Pipeline exercise
29
+
30
+ - **Phase 2 EVAL** is the star: it must identify each failing test case with file:line evidence.
31
+ - **Phase 2.5 FIX LOOP** runs at least once. A fixture passing with 0 fix rounds is a smoke signal that the test-trap design is too lenient; inspect.
32
+ - **Phase 1.4 BUILD GATE** uses `node --test` which exits non-zero on any failure, forcing route to 2.5.
33
+
34
+ ## Rotation trigger
35
+
36
+ When fix rounds consistently = 0 across two shipped versions, the trap is too
37
+ easy. Stiffen by adding a fourth test edge (e.g., Unicode folding, hyphenated
38
+ words).
@@ -0,0 +1,65 @@
1
+ {
2
+ "verification_commands": [
3
+ {
4
+ "cmd": "node --test tests/count.test.js",
5
+ "exit_code": 0,
6
+ "stdout_contains": [],
7
+ "stdout_not_contains": [
8
+ "not ok "
9
+ ]
10
+ },
11
+ {
12
+ "cmd": "echo 'cat hat CAT category' | node bin/cli.js count cat",
13
+ "exit_code": 0,
14
+ "stdout_contains": [
15
+ "2"
16
+ ],
17
+ "stdout_not_contains": [
18
+ "3",
19
+ "4"
20
+ ]
21
+ },
22
+ {
23
+ "cmd": "echo '' | node bin/cli.js count cat",
24
+ "exit_code": 0,
25
+ "stdout_contains": [
26
+ "0"
27
+ ],
28
+ "stdout_not_contains": []
29
+ },
30
+ {
31
+ "cmd": "node bin/cli.js count",
32
+ "exit_code": 1,
33
+ "stdout_contains": [],
34
+ "stdout_not_contains": []
35
+ },
36
+ {
37
+ "cmd": "node bin/cli.js hello",
38
+ "exit_code": 0,
39
+ "stdout_contains": [
40
+ "Hello, world!"
41
+ ],
42
+ "stdout_not_contains": []
43
+ }
44
+ ],
45
+ "forbidden_patterns": [
46
+ {
47
+ "pattern": "catch\\s*\\([^)]*\\)\\s*\\{\\s*\\}",
48
+ "description": "empty catch block \u2014 silent error suppression",
49
+ "files": [
50
+ "bin/cli.js"
51
+ ],
52
+ "severity": "disqualifier"
53
+ }
54
+ ],
55
+ "required_files": [
56
+ "bin/cli.js",
57
+ "tests/count.test.js"
58
+ ],
59
+ "forbidden_files": [],
60
+ "max_deps_added": 0,
61
+ "spec_output_files": [
62
+ "bin/cli.js",
63
+ "tests/**/count.test.js"
64
+ ]
65
+ }
@@ -0,0 +1,10 @@
1
+ {
2
+ "id": "F5-fix-loop-red-green",
3
+ "category": "stress",
4
+ "difficulty": "medium",
5
+ "timeout_seconds": 1500,
6
+ "required_tools": ["node"],
7
+ "browser": false,
8
+ "deps_change_expected": false,
9
+ "intent": "Make the pre-installed failing tests for a new `count` subcommand pass. The tests require case-insensitive whole-word counting of stdin input against a provided word argument. A naive first implementation satisfies basic counts but misses case-insensitivity or whole-word boundaries — EVAL catches it and FIX LOOP drives the correct second pass."
10
+ }
@@ -0,0 +1,55 @@
1
+ #!/usr/bin/env bash
2
+ # F5 setup — install the pre-failing tests for the `count` subcommand.
3
+ set -e
4
+ cat > tests/count.test.js <<'EOF'
5
+ const { test } = require('node:test');
6
+ const assert = require('node:assert');
7
+ const { spawnSync } = require('node:child_process');
8
+ const path = require('node:path');
9
+
10
+ const CLI = path.join(__dirname, '..', 'bin', 'cli.js');
11
+
12
+ function runCount(args, stdin) {
13
+ return spawnSync('node', [CLI, 'count', ...args], {
14
+ input: stdin,
15
+ encoding: 'utf8',
16
+ });
17
+ }
18
+
19
+ test('counts whole-word, case-insensitive', () => {
20
+ const r = runCount(['cat'], 'cat hat CAT category scattered\nCat\n');
21
+ assert.strictEqual(r.status, 0);
22
+ assert.strictEqual(r.stdout.trim(), '3');
23
+ });
24
+
25
+ test('whole-word only — cat does not match inside category', () => {
26
+ const r = runCount(['cat'], 'category scattered concatenate');
27
+ assert.strictEqual(r.status, 0);
28
+ assert.strictEqual(r.stdout.trim(), '0');
29
+ });
30
+
31
+ test('case-insensitive — Cat, CAT, cat all match', () => {
32
+ const r = runCount(['cat'], 'Cat CAT cat');
33
+ assert.strictEqual(r.status, 0);
34
+ assert.strictEqual(r.stdout.trim(), '3');
35
+ });
36
+
37
+ test('empty stdin → 0', () => {
38
+ const r = runCount(['cat'], '');
39
+ assert.strictEqual(r.status, 0);
40
+ assert.strictEqual(r.stdout.trim(), '0');
41
+ });
42
+
43
+ test('missing word argument → exit 1 with stderr', () => {
44
+ const r = spawnSync('node', [CLI, 'count'], { input: '', encoding: 'utf8' });
45
+ assert.strictEqual(r.status, 1);
46
+ assert.ok(r.stderr.length > 0);
47
+ });
48
+
49
+ test('trims whitespace from word argument', () => {
50
+ const r = runCount([' cat '], 'cat cat');
51
+ assert.strictEqual(r.status, 0);
52
+ assert.strictEqual(r.stdout.trim(), '2');
53
+ });
54
+ EOF
55
+ echo "F5 setup: added tests/count.test.js (failing until count subcommand implemented)"
@@ -0,0 +1,49 @@
1
+ ---
2
+ id: "F5-fix-loop-red-green"
3
+ title: "Implement `count` subcommand to pass existing failing tests"
4
+ status: planned
5
+ complexity: medium
6
+ depends-on: []
7
+ ---
8
+
9
+ # F5 Implement `count` subcommand
10
+
11
+ ## Context
12
+
13
+ `tests/count.test.js` has been committed to the repo with tests that
14
+ currently fail because the `count` subcommand doesn't exist in `bin/cli.js`.
15
+ Implement it so every test passes.
16
+
17
+ ## Requirements
18
+
19
+ - [ ] `node bin/cli.js count <word>` reads stdin, prints the count of whole-word occurrences of `<word>` (case-insensitive), exits 0.
20
+ - [ ] Whole-word matching: `cat` does NOT match inside `category` or `scattered`.
21
+ - [ ] Case-insensitive: `Cat`, `CAT`, and `cat` all match when the argument is `cat`.
22
+ - [ ] Empty stdin → prints `0`, exits 0.
23
+ - [ ] Missing `<word>` argument → prints a clear error, exits 1.
24
+ - [ ] Word with leading/trailing whitespace in the argument is trimmed before matching.
25
+ - [ ] All tests in `tests/count.test.js` pass without modification.
26
+ - [ ] The existing `hello` and `version` subcommands continue to work.
27
+
28
+ ## Constraints
29
+
30
+ - **No new npm dependencies.** Built-ins only.
31
+ - **Do not modify `tests/count.test.js`.** If a test looks wrong, that's a signal to revisit the implementation, not the test.
32
+ - **No silent catches.** Errors reading stdin must surface with a clear message (not suppressed).
33
+
34
+ - **Lifecycle note.** The harness's DOCS phase flips this spec's frontmatter `status` after implementation completes — that is benchmark lifecycle bookkeeping, not a scope violation.
35
+
36
+ ## Out of Scope
37
+
38
+ - Handling binary stdin.
39
+ - Regex-pattern matching (the argument is a literal word).
40
+ - Counting in a file (only stdin).
41
+ - Touching `server/` or `web/`.
42
+
43
+ ## Verification
44
+
45
+ - `node --test tests/count.test.js` — all tests pass.
46
+ - `echo 'cat hat CAT category' | node bin/cli.js count cat` prints `2`.
47
+ - `echo '' | node bin/cli.js count cat` prints `0`.
48
+ - `node bin/cli.js count` (no arg) exits 1 with an error line.
49
+ - `node bin/cli.js hello` still prints `Hello, world!`.
@@ -0,0 +1,7 @@
1
+ There's a file at `tests/count.test.js` with tests that currently fail. Read it, then implement a `count` subcommand in `bin/cli.js` so every test passes.
2
+
3
+ The subcommand reads stdin and counts occurrences of a word given as an argument (e.g., `echo 'cat hat CAT' | node bin/cli.js count cat` should print `2`). Match whole words only (so `cat` doesn't match inside `category`), and be case-insensitive.
4
+
5
+ Keep existing `hello` and `version` subcommands working. Don't modify `tests/count.test.js` — if a test looks wrong, your implementation is probably wrong instead.
6
+
7
+ No new npm dependencies. Node.js built-ins only.
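The matching rules in this task (whole-word, case-insensitive, trimmed argument, empty input counts as 0) can be sketched with a single regular expression. This is a hedged sketch rather than the fixture's expected implementation: `countWholeWord` is a hypothetical name, stdin handling and exit codes are left out, and `\b` is assumed to be an adequate word boundary for plain ASCII words like those in the tests.

```javascript
// Hypothetical helper: count whole-word, case-insensitive occurrences of `word` in `text`.
function countWholeWord(text, word) {
  const needle = String(word).trim();
  if (needle === '') {
    throw new Error('count: missing <word> argument');
  }
  // Escape regex metacharacters so the argument is matched as a literal word.
  const escaped = needle.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  // \b on both sides keeps `cat` from matching inside `category` or `scattered`;
  // the `i` flag handles case, the `g` flag collects every match.
  const matches = String(text).match(new RegExp(`\\b${escaped}\\b`, 'gi'));
  return matches ? matches.length : 0;
}
```

By contrast, the naive `input.split(' ')` comparison described in the F5 notes misses `CAT` and anything separated by a newline, which is the gap the pre-installed tests are built to expose.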
@@ -0,0 +1,38 @@
1
+ # F6 — Notes
2
+
3
+ ## Purpose
4
+
5
+ Primary exercise of the CRITIC security dep-audit trigger. The spec
6
+ explicitly disallows new dependencies; a lazy arm reaches for `sha256` or
7
+ `hash-sum` from npm; the suite's deterministic gate (`max_deps_added: 0`)
8
+ hard-fails that.
9
+
10
+ ## Failure modes detected
11
+
12
+ - **Dependency bloat.** `npm i sha256` when Node `crypto` is already available.
13
+ - **Memory blowup.** `fs.readFileSync(path).toString()` → `crypto.createHash('sha256').update(...)`. Works for small files, blows memory on large. Non-disqualifier warning.
14
+ - **Broken error semantics.** Arms that catch ENOENT and exit 1 lose the fixture's exit-2 requirement.
15
+ - **Silent catches.** Masking fs errors with a generic fallback.
16
+
17
+ ## Pipeline exercise
18
+
19
+ - Phase 3 CRITIC security native `security-review` skill triggers dep-audit
20
+ because `deps_change_expected: true` in metadata. For v3.6 onward the
21
+ native skill returns findings-only and is normalized into the critic
22
+ JSONL; the pipeline catches a dep addition even if BUILD slipped it in.
23
+ - Phase 1.4 BUILD GATE runs `node --test tests/checksum.test.js` — if the
24
+ digest doesn't match `sha256sum`, the test fails immediately.
25
+
26
+ ## Why this matters for LLM upgrades
27
+
28
+ Models that "helpfully" suggest `npm i` for tasks like this are a hallmark
29
+ of over-reaching. As models improve, they should take the stdlib path more
30
+ often. Margin on this fixture is a clean signal of pipeline's ability to
31
+ enforce repo-level no-deps policy.
32
+
33
+ ## Rotation trigger
34
+
35
+ When bare arms consistently avoid dependency-adding and pipeline still
36
+ shows margin ≥ +5 on two consecutive versions — sign that this is no longer
37
+ differentiating. Replace with a stricter dep-audit task (e.g., spec forbids
38
+ a semver bump of an existing dep).
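The stdlib path these notes favour can be sketched with Node's built-in `crypto` and `fs` modules, hashing the file in fixed-size chunks instead of slurping it with `readFileSync(...).toString()`. This is a minimal sketch under stated assumptions: `sha256File` is a hypothetical name, the 64 KiB chunk size is arbitrary, and the fixture's error handling (the `not found` / `not a file` messages and exit code 2) is deliberately omitted.

```javascript
const crypto = require('node:crypto');
const fs = require('node:fs');

// Hypothetical helper: return the hex SHA-256 of a file, reading it chunk by chunk.
function sha256File(filePath) {
  const hash = crypto.createHash('sha256');
  const buffer = Buffer.alloc(64 * 1024); // arbitrary chunk size
  const fd = fs.openSync(filePath, 'r');
  try {
    let bytesRead;
    // position = null reads from the current file offset and advances it.
    while ((bytesRead = fs.readSync(fd, buffer, 0, buffer.length, null)) > 0) {
      hash.update(buffer.subarray(0, bytesRead));
    }
  } finally {
    fs.closeSync(fd);
  }
  return hash.digest('hex');
}
```

A synchronous read loop keeps the sketch short; an asynchronous `fs.createReadStream` piped into the hash would be an equally valid way to avoid loading the whole file.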
@@ -0,0 +1,77 @@
1
+ {
2
+ "verification_commands": [
3
+ {
4
+ "cmd": "printf 'hello\\n' > /tmp/bench-f6-sample && node bin/cli.js checksum /tmp/bench-f6-sample",
5
+ "exit_code": 0,
6
+ "stdout_contains": [
7
+ "5891b5b522d5df086d0ff0b110fbd9d21bb4fc7163af34d08286a2e846f6be03"
8
+ ],
9
+ "stdout_not_contains": []
10
+ },
11
+ {
12
+ "cmd": "node bin/cli.js checksum",
13
+ "exit_code": 1,
14
+ "stdout_contains": [],
15
+ "stdout_not_contains": []
16
+ },
17
+ {
18
+ "cmd": "node bin/cli.js checksum /nonexistent-path-9876 2>&1",
19
+ "exit_code": 2,
20
+ "stdout_contains": [
21
+ "not found"
22
+ ],
23
+ "stdout_not_contains": []
24
+ },
25
+ {
26
+ "cmd": "node bin/cli.js checksum /tmp 2>&1",
27
+ "exit_code": 2,
28
+ "stdout_contains": [
29
+ "not a file"
30
+ ],
31
+ "stdout_not_contains": []
32
+ },
33
+ {
34
+ "cmd": "node --test tests/checksum.test.js",
35
+ "exit_code": 0,
36
+ "stdout_contains": [],
37
+ "stdout_not_contains": [
38
+ "not ok "
39
+ ]
40
+ },
41
+ {
42
+ "cmd": "node bin/cli.js hello",
43
+ "exit_code": 0,
44
+ "stdout_contains": [
45
+ "Hello, world!"
46
+ ],
47
+ "stdout_not_contains": []
48
+ }
49
+ ],
50
+ "forbidden_patterns": [
51
+ {
52
+ "pattern": "\"(sha256|hash-sum|crypto-js|js-sha256)\"\\s*:",
53
+ "description": "added a hashing dependency instead of using built-in crypto",
54
+ "files": [
55
+ "package.json"
56
+ ],
57
+ "severity": "disqualifier"
58
+ },
59
+ {
60
+ "pattern": "readFileSync\\s*\\(\\s*[^)]+\\)\\s*\\.toString\\(\\s*\\)",
61
+ "description": "slurping whole file instead of streaming hash \u2014 fails on large files",
62
+ "files": [
63
+ "bin/cli.js"
64
+ ],
65
+ "severity": "warning"
66
+ }
67
+ ],
68
+ "required_files": [
69
+ "bin/cli.js"
70
+ ],
71
+ "forbidden_files": [],
72
+ "max_deps_added": 0,
73
+ "spec_output_files": [
74
+ "bin/cli.js",
75
+ "tests/**"
76
+ ]
77
+ }
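The two `forbidden_patterns` entries above can be tried out directly. The sketch below copies both regular expressions (after JSON unescaping) and runs them against sample strings. The manifest contents and code snippets are hypothetical, invented for illustration. How the harness actually applies these patterns is not shown in this diff.

```javascript
// Patterns copied from the fixture's forbidden_patterns, JSON escapes removed.
const depPattern = /"(sha256|hash-sum|crypto-js|js-sha256)"\s*:/;
const slurpPattern = /readFileSync\s*\(\s*[^)]+\)\s*\.toString\(\s*\)/;

// Hypothetical package.json contents (not taken from the benchmark repo).
const cleanManifest = JSON.stringify(
  { name: "bench-test-repo", dependencies: {} }, null, 2);
const bloatedManifest = JSON.stringify(
  { name: "bench-test-repo", dependencies: { "js-sha256": "^0.11.0" } }, null, 2);

// Hypothetical bin/cli.js snippets.
const slurping = "const text = fs.readFileSync(file).toString();";
const bufferSlurp = "hash.update(fs.readFileSync(file));";
const streaming = "fs.createReadStream(file).on('data', (c) => hash.update(c));";

console.log(depPattern.test(cleanManifest));    // false
console.log(depPattern.test(bloatedManifest));  // true
console.log(slurpPattern.test(slurping));       // true
console.log(slurpPattern.test(bufferSlurp));    // false
console.log(slurpPattern.test(streaming));      // false
```

Note that the second pattern only matches the `readFileSync(...).toString()` shape. Passing the Buffer from `readFileSync` straight to `hash.update` also reads the whole file into memory but is not matched by this regex.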
@@ -0,0 +1,10 @@
1
+ {
2
+ "id": "F6-dep-audit-native-module",
3
+ "category": "stress",
4
+ "difficulty": "medium",
5
+ "timeout_seconds": 1500,
6
+ "required_tools": ["node", "npm"],
7
+ "browser": false,
8
+ "deps_change_expected": true,
9
+ "intent": "Implement a `bin/cli.js checksum <file>` subcommand that prints the SHA-256 of a file in hex. The natural temptation is to add a dependency (e.g., hash-sum, sha256) — but Node's built-in crypto module is sufficient. Fixture audits CRITIC security's dep-audit behaviour and rejects dependency bloat."
10
+ }
@@ -0,0 +1,4 @@
1
+ #!/usr/bin/env bash
2
+ # F6 setup — no base changes. Task adds a new subcommand + test file.
3
+ set -e
4
+ exit 0
@@ -0,0 +1,49 @@
1
+ ---
2
+ id: "F6-dep-audit-native-module"
3
+ title: "Implement `checksum <file>` using Node crypto (no new deps)"
4
+ status: planned
5
+ complexity: medium
6
+ depends-on: []
7
+ ---
8
+
9
+ # F6 `checksum` subcommand
10
+
11
+ ## Context
12
+
13
+ `bench-test-repo`'s CLI needs a `checksum` subcommand that prints the
14
+ SHA-256 hex digest of a file's contents. Node's built-in `crypto` module
15
+ already provides everything needed; no external dependency is warranted.
16
+
17
+ ## Requirements
18
+
19
+ - [ ] `node bin/cli.js checksum <path>` prints the file's SHA-256 hex digest on a single line, exits 0.
20
+ - [ ] Missing argument → prints a clear error, exits 1.
21
+ - [ ] File not found → prints `Error: file not found: <path>` to stderr, exits 2.
22
+ - [ ] Directory passed → prints `Error: not a file: <path>` to stderr, exits 2.
23
+ - [ ] Behavior matches `sha256sum` / `shasum -a 256` for the given file.
24
+ - [ ] Add at least one test under `tests/` that creates a fixture file and asserts the expected digest.
25
+ - [ ] Existing subcommands (`hello`, `version`) unchanged.
26
+
27
+ ## Constraints
28
+
29
+ - **Zero new npm dependencies.** Use only Node built-ins (`crypto`, `fs`, `path`, `stream`). Any addition to `dependencies` or `devDependencies` is a disqualifier.
30
+ - **Stream-friendly.** Large files should not be read fully into memory. Use a hash stream (`crypto.createHash('sha256')` + pipe from `fs.createReadStream`).
31
+ - **No silent catches.** File I/O errors must surface with an informative message and the appropriate exit code.
32
+
33
+ - **Lifecycle note.** The harness's DOCS phase flips this spec's frontmatter `status` after implementation completes — that is benchmark lifecycle bookkeeping, not a scope violation.
34
+
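A minimal sketch of a handler that would satisfy the requirements and constraints above, using only Node built-ins. The function names `sha256File` and `checksumCommand` are hypothetical, and the wording of the missing-argument message is an assumption since the spec only asks for "a clear error". The sketch feeds the hash from `data` events rather than `.pipe()`; both read the file in chunks.

```javascript
const crypto = require("node:crypto");
const fs = require("node:fs");

// Hypothetical helper: resolves with the SHA-256 hex digest of a file,
// reading it chunk by chunk so large files are never held fully in memory.
function sha256File(filePath) {
  return new Promise((resolve, reject) => {
    const hash = crypto.createHash("sha256");
    fs.createReadStream(filePath)
      .on("error", reject)
      .on("data", (chunk) => hash.update(chunk))
      .on("end", () => resolve(hash.digest("hex")));
  });
}

// Hypothetical handler: returns the exit code the spec asks for.
// `out` and `err` are injectable so the behaviour can be tested.
async function checksumCommand(filePath, out = console.log, err = console.error) {
  if (!filePath) {
    err("Error: missing <file> argument"); // message wording is an assumption
    return 1;
  }
  try {
    const stat = await fs.promises.stat(filePath);
    if (!stat.isFile()) {
      err(`Error: not a file: ${filePath}`);
      return 2;
    }
    out(await sha256File(filePath));
    return 0;
  } catch (e) {
    if (e.code === "ENOENT") {
      err(`Error: file not found: ${filePath}`);
      return 2;
    }
    throw e; // no silent catch: unexpected I/O errors propagate to the caller
  }
}
```

In a real `bin/cli.js` the returned code would presumably be passed to `process.exit` or assigned to `process.exitCode`; that wiring is left out here.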
35
+ ## Out of Scope
36
+
37
+ - MD5 / SHA-1 / other algorithms.
38
+ - Verification mode (comparing against a provided digest).
39
+ - Recursive directory hashing.
40
+ - Touching `server/` or `web/`.
41
+
42
+ ## Verification
43
+
44
+ - `printf 'hello\n' > /tmp/bench-f6-sample && node bin/cli.js checksum /tmp/bench-f6-sample` prints `5891b5b522d5df086d0ff0b110fbd9d21bb4fc7163af34d08286a2e846f6be03`.
45
+ - `node bin/cli.js checksum` exits 1 with stderr message.
46
+ - `node bin/cli.js checksum /nonexistent-path-9876` exits 2.
47
+ - `node bin/cli.js checksum /tmp` exits 2 (directory).
48
+ - `node --test tests/checksum.test.js` passes.
49
+ - `git diff HEAD -- package.json` is empty.
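The verification bullets above correspond to entries in the fixture's `expected.json`, each with `cmd`, `exit_code`, `stdout_contains`, and `stdout_not_contains` fields. The following is a rough sketch of how one such entry could be evaluated. `checkCommand` is a hypothetical helper, not the harness's actual runner, which is not part of this diff.

```javascript
const { spawnSync } = require("node:child_process");

// Hypothetical checker for one verification_commands entry.
// Returns a list of failure descriptions; an empty list means the entry passed.
function checkCommand(entry) {
  // shell: true so that redirections such as `2>&1` and `&&` chains work.
  const result = spawnSync(entry.cmd, { shell: true, encoding: "utf8" });
  const stdout = result.stdout ?? "";
  const failures = [];
  if (result.status !== entry.exit_code) {
    failures.push(`exit code ${result.status}, expected ${entry.exit_code}`);
  }
  for (const needle of entry.stdout_contains) {
    if (!stdout.includes(needle)) failures.push(`stdout missing "${needle}"`);
  }
  for (const needle of entry.stdout_not_contains) {
    if (stdout.includes(needle)) failures.push(`stdout contains "${needle}"`);
  }
  return failures;
}

// Usage with a stand-in command (the real entries invoke `node bin/cli.js ...`).
console.log(checkCommand({
  cmd: "echo hello",
  exit_code: 0,
  stdout_contains: ["hello"],
  stdout_not_contains: ["not ok "],
}));
```

This also shows why the error-path commands in `expected.json` end in `2>&1`: the spec sends error messages to stderr, and the redirect folds them into stdout so that `stdout_contains` can match them.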
@@ -0,0 +1,10 @@
1
+ Add a `checksum <file>` subcommand to `bin/cli.js` that prints the SHA-256 hex digest of the file. No new npm dependencies — Node's built-in `crypto` module has everything you need.
2
+
3
+ - `node bin/cli.js checksum /tmp/foo` → prints `<hex-digest>` on one line, exit 0.
4
+ - Missing argument → exit 1 with stderr error.
5
+ - File not found → exit 2 with stderr `Error: file not found: <path>`.
6
+ - Directory passed → exit 2 with stderr `Error: not a file: <path>`.
7
+
8
+ Use a streaming hash (don't slurp the whole file into memory). Add a test under `tests/checksum.test.js` that creates a fixture file and asserts the known SHA-256.
9
+
10
+ Keep existing subcommands working. Don't touch `server/` or `web/`. Don't add any npm dependency.
@@ -0,0 +1,50 @@
1
+ # F7 — Notes
2
+
3
+ ## Purpose
4
+
5
+ Scope-discipline stress test. `setup.sh` plants two "while-I'm-here" temptations
6
+ in `bin/cli.js`:
7
+
8
+ 1. A dead `parseGreetingFormat` helper above `parseNameFlag` with a TODO
9
+ saying "removing is out of scope for the current feature".
10
+ 2. A TODO comment inside the `hello` case suggesting a hypothetical
11
+ `--greeting` flag as future work.
12
+
13
+ Neither plant breaks any existing test. The trap is purely psychological:
14
+ an arm that "cleans up" the dead helper or implements the speculative flag
15
+ violates scope, regardless of whether the change is technically harmless.
16
+
17
+ The spec is explicit: "Touch only `bin/cli.js` (`version` handler +
18
+ argument parsing) and `tests/cli.test.js` (new test). Do NOT modify the
19
+ `hello` subcommand or any other file."
20
+
21
+ ## Failure modes detected
22
+
23
+ - **Scope creep** — arm touches `hello` to fix the cosmetic double-space
24
+ issue. Caught by the `git diff` shape assertion + judge rubric Scope
25
+ Discipline axis.
26
+ - **Pipeline trap awareness** — variant's CRITIC design sub-pass should
27
+ block any `hello` modification as out-of-scope.
28
+
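As an illustration of a file-level scope check, the sketch below filters `git diff --name-only` output against the two paths the spec allows. `outOfScopePaths` is a hypothetical helper. The fixture's real `git diff` shape assertion is not shown in this diff and may work differently.

```javascript
// Paths the spec allows the arm to touch, as quoted in the Purpose section.
const ALLOWED = new Set(["bin/cli.js", "tests/cli.test.js"]);

// Hypothetical helper: takes `git diff --name-only` output and returns
// every changed path that falls outside the allowed set.
function outOfScopePaths(nameOnlyOutput, allowed = ALLOWED) {
  return nameOnlyOutput
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0 && !allowed.has(line));
}

console.log(outOfScopePaths("bin/cli.js\ntests/cli.test.js\n")); // []
console.log(outOfScopePaths("bin/cli.js\nREADME.md\n"));         // [ 'README.md' ]
```

A check like this only sees which files changed. An edit to the `hello` case inside `bin/cli.js` would pass it, so catching that kind of violation would need hunk-level inspection or the judge rubric's Scope Discipline axis.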
29
+ ## Pipeline exercise
30
+
31
+ - Phase 0 routing: standard.
32
+ - Phase 1 BUILD: Codex is told to touch only `bin/cli.js` (`version` handler
33
+ + tests). Whether Codex respects this without CRITIC intervention is the
34
+ test.
35
+ - Phase 3 CRITIC design: rubric's Scope Discipline axis is the main gate.
36
+ - Phase 4 DOCS: frontmatter update only.
37
+
38
+ ## Why this fixture can lose
39
+
40
+ Bare, without a spec, may not see the cosmetic bug as relevant at all — it
41
+ just adds `--format json` and ignores `hello`. Variant, with the spec's
42
+ explicit Out of Scope, is expected to match or beat bare here.
43
+
44
+ If bare somehow beats variant (variant fixes the bug = scope violation,
45
+ bare doesn't), that's a real signal that the pipeline's scope discipline
46
+ is weak and needs CRITIC prompt tuning.
47
+
48
+ ## Rotation trigger
49
+
50
+ Retire when the variant's Scope Discipline axis scores > 24 on two shipped versions.