autonomous-coding-toolkit 1.0.0

Files changed (324)
  1. package/.claude-plugin/marketplace.json +22 -0
  2. package/.claude-plugin/plugin.json +13 -0
  3. package/LICENSE +21 -0
  4. package/Makefile +21 -0
  5. package/README.md +140 -0
  6. package/SECURITY.md +28 -0
  7. package/agents/bash-expert.md +113 -0
  8. package/agents/dependency-auditor.md +138 -0
  9. package/agents/integration-tester.md +120 -0
  10. package/agents/lesson-scanner.md +149 -0
  11. package/agents/python-expert.md +179 -0
  12. package/agents/service-monitor.md +141 -0
  13. package/agents/shell-expert.md +147 -0
  14. package/benchmarks/runner.sh +147 -0
  15. package/benchmarks/tasks/01-rest-endpoint/rubric.sh +29 -0
  16. package/benchmarks/tasks/01-rest-endpoint/task.md +17 -0
  17. package/benchmarks/tasks/02-refactor-module/task.md +8 -0
  18. package/benchmarks/tasks/03-fix-integration-bug/task.md +8 -0
  19. package/benchmarks/tasks/04-add-test-coverage/task.md +8 -0
  20. package/benchmarks/tasks/05-multi-file-feature/task.md +8 -0
  21. package/bin/act.js +238 -0
  22. package/commands/autocode.md +6 -0
  23. package/commands/cancel-ralph.md +18 -0
  24. package/commands/code-factory.md +53 -0
  25. package/commands/create-prd.md +55 -0
  26. package/commands/ralph-loop.md +18 -0
  27. package/commands/run-plan.md +117 -0
  28. package/commands/submit-lesson.md +122 -0
  29. package/docs/ARCHITECTURE.md +630 -0
  30. package/docs/CONTRIBUTING.md +125 -0
  31. package/docs/lessons/0001-bare-exception-swallowing.md +34 -0
  32. package/docs/lessons/0002-async-def-without-await.md +28 -0
  33. package/docs/lessons/0003-create-task-without-callback.md +28 -0
  34. package/docs/lessons/0004-hardcoded-test-counts.md +28 -0
  35. package/docs/lessons/0005-sqlite-without-closing.md +33 -0
  36. package/docs/lessons/0006-venv-pip-path.md +27 -0
  37. package/docs/lessons/0007-runner-state-self-rejection.md +35 -0
  38. package/docs/lessons/0008-quality-gate-blind-spot.md +33 -0
  39. package/docs/lessons/0009-parser-overcount-empty-batches.md +36 -0
  40. package/docs/lessons/0010-local-outside-function-bash.md +33 -0
  41. package/docs/lessons/0011-batch-tests-for-unimplemented-code.md +36 -0
  42. package/docs/lessons/0012-api-markdown-unescaped-chars.md +33 -0
  43. package/docs/lessons/0013-export-prefix-env-parsing.md +33 -0
  44. package/docs/lessons/0014-decorator-registry-import-side-effect.md +43 -0
  45. package/docs/lessons/0015-frontend-backend-schema-drift.md +43 -0
  46. package/docs/lessons/0016-event-driven-cold-start-seeding.md +44 -0
  47. package/docs/lessons/0017-copy-paste-logic-diverges.md +43 -0
  48. package/docs/lessons/0018-layer-passes-pipeline-broken.md +45 -0
  49. package/docs/lessons/0019-systemd-envfile-ignores-export.md +41 -0
  50. package/docs/lessons/0020-persist-state-incrementally.md +44 -0
  51. package/docs/lessons/0021-dual-axis-testing.md +48 -0
  52. package/docs/lessons/0022-jsx-factory-shadowing.md +43 -0
  53. package/docs/lessons/0023-static-analysis-spiral.md +51 -0
  54. package/docs/lessons/0024-shared-pipeline-implementation.md +55 -0
  55. package/docs/lessons/0025-defense-in-depth-all-entry-points.md +65 -0
  56. package/docs/lessons/0026-linter-no-rules-false-enforcement.md +54 -0
  57. package/docs/lessons/0027-jsx-silent-prop-drop.md +64 -0
  58. package/docs/lessons/0028-no-infrastructure-in-client-code.md +49 -0
  59. package/docs/lessons/0029-never-write-secrets-to-files.md +61 -0
  60. package/docs/lessons/0030-cache-merge-not-replace.md +62 -0
  61. package/docs/lessons/0031-verify-units-at-boundaries.md +66 -0
  62. package/docs/lessons/0032-module-lifecycle-subscribe-unsubscribe.md +89 -0
  63. package/docs/lessons/0033-async-iteration-mutable-snapshot.md +72 -0
  64. package/docs/lessons/0034-caller-missing-await-silent-discard.md +65 -0
  65. package/docs/lessons/0035-duplicate-registration-silent-overwrite.md +85 -0
  66. package/docs/lessons/0036-websocket-dirty-disconnect.md +33 -0
  67. package/docs/lessons/0037-parallel-agents-worktree-corruption.md +31 -0
  68. package/docs/lessons/0038-subscribe-no-stored-ref.md +36 -0
  69. package/docs/lessons/0039-fallback-or-default-hides-bugs.md +34 -0
  70. package/docs/lessons/0040-event-firehose-filter-first.md +36 -0
  71. package/docs/lessons/0041-ambiguous-base-dir-path-nesting.md +32 -0
  72. package/docs/lessons/0042-spec-compliance-insufficient.md +36 -0
  73. package/docs/lessons/0043-exact-count-extensible-collections.md +32 -0
  74. package/docs/lessons/0044-relative-file-deps-worktree.md +39 -0
  75. package/docs/lessons/0045-iterative-design-improvement.md +33 -0
  76. package/docs/lessons/0046-plan-assertion-math-bugs.md +38 -0
  77. package/docs/lessons/0047-pytest-single-threaded-default.md +37 -0
  78. package/docs/lessons/0048-integration-wiring-batch.md +40 -0
  79. package/docs/lessons/0049-ab-verification.md +41 -0
  80. package/docs/lessons/0050-editing-sourced-files-during-execution.md +33 -0
  81. package/docs/lessons/0051-infrastructure-fixes-cant-self-heal.md +30 -0
  82. package/docs/lessons/0052-uncommitted-changes-poison-quality-gates.md +31 -0
  83. package/docs/lessons/0053-jq-compact-flag-inconsistency.md +31 -0
  84. package/docs/lessons/0054-parser-matches-inside-code-blocks.md +30 -0
  85. package/docs/lessons/0055-agents-compensate-for-garbled-prompts.md +31 -0
  86. package/docs/lessons/0056-grep-count-exit-code-on-zero.md +42 -0
  87. package/docs/lessons/0057-new-artifacts-break-git-clean-gates.md +42 -0
  88. package/docs/lessons/0058-dead-config-keys-never-consumed.md +49 -0
  89. package/docs/lessons/0059-contract-test-shared-structures.md +53 -0
  90. package/docs/lessons/0060-set-e-silent-death-in-runners.md +53 -0
  91. package/docs/lessons/0061-context-injection-dirty-state.md +50 -0
  92. package/docs/lessons/0062-sibling-bug-neighborhood-scan.md +29 -0
  93. package/docs/lessons/0063-one-flag-two-lifetimes.md +31 -0
  94. package/docs/lessons/0064-test-passes-wrong-reason.md +31 -0
  95. package/docs/lessons/0065-pipefail-grep-count-double-output.md +39 -0
  96. package/docs/lessons/0066-local-keyword-outside-function.md +37 -0
  97. package/docs/lessons/0067-stdin-hang-non-interactive-shell.md +36 -0
  98. package/docs/lessons/0068-agent-builds-wrong-thing-correctly.md +31 -0
  99. package/docs/lessons/0069-plan-quality-dominates-execution.md +30 -0
  100. package/docs/lessons/0070-spec-echo-back-prevents-drift.md +31 -0
  101. package/docs/lessons/0071-positive-instructions-outperform-negative.md +30 -0
  102. package/docs/lessons/0072-lost-in-the-middle-context-placement.md +30 -0
  103. package/docs/lessons/0073-unscoped-lessons-cause-false-positives.md +30 -0
  104. package/docs/lessons/0074-stale-context-injection-wrong-batch.md +32 -0
  105. package/docs/lessons/0075-research-artifacts-must-persist.md +32 -0
  106. package/docs/lessons/0076-wrong-decomposition-contaminates-downstream.md +30 -0
  107. package/docs/lessons/0077-cherry-pick-merges-need-manual-resolution.md +30 -0
  108. package/docs/lessons/0078-static-review-without-live-test.md +30 -0
  109. package/docs/lessons/0079-integration-wiring-batch-required.md +32 -0
  110. package/docs/lessons/FRAMEWORK.md +161 -0
  111. package/docs/lessons/SUMMARY.md +201 -0
  112. package/docs/lessons/TEMPLATE.md +85 -0
  113. package/docs/plans/2026-02-21-code-factory-v2-design.md +204 -0
  114. package/docs/plans/2026-02-21-code-factory-v2-implementation-plan.md +2189 -0
  115. package/docs/plans/2026-02-21-code-factory-v2-phase4-design.md +537 -0
  116. package/docs/plans/2026-02-21-code-factory-v2-phase4-implementation-plan.md +2012 -0
  117. package/docs/plans/2026-02-21-hardening-pass-design.md +108 -0
  118. package/docs/plans/2026-02-21-hardening-pass-plan.md +1378 -0
  119. package/docs/plans/2026-02-21-mab-research-report.md +406 -0
  120. package/docs/plans/2026-02-21-marketplace-restructure-design.md +240 -0
  121. package/docs/plans/2026-02-21-marketplace-restructure-plan.md +832 -0
  122. package/docs/plans/2026-02-21-phase4-completion-plan.md +697 -0
  123. package/docs/plans/2026-02-21-validator-suite-design.md +148 -0
  124. package/docs/plans/2026-02-21-validator-suite-plan.md +540 -0
  125. package/docs/plans/2026-02-22-mab-research-round2.md +556 -0
  126. package/docs/plans/2026-02-22-mab-run-design.md +462 -0
  127. package/docs/plans/2026-02-22-mab-run-plan.md +2046 -0
  128. package/docs/plans/2026-02-22-operations-design-methodology-research.md +681 -0
  129. package/docs/plans/2026-02-22-research-agent-failure-taxonomy.md +532 -0
  130. package/docs/plans/2026-02-22-research-code-guideline-policies.md +886 -0
  131. package/docs/plans/2026-02-22-research-codebase-audit-refactoring.md +908 -0
  132. package/docs/plans/2026-02-22-research-coding-standards-documentation.md +541 -0
  133. package/docs/plans/2026-02-22-research-competitive-landscape.md +687 -0
  134. package/docs/plans/2026-02-22-research-comprehensive-testing.md +1076 -0
  135. package/docs/plans/2026-02-22-research-context-utilization.md +459 -0
  136. package/docs/plans/2026-02-22-research-cost-quality-tradeoff.md +548 -0
  137. package/docs/plans/2026-02-22-research-lesson-transferability.md +508 -0
  138. package/docs/plans/2026-02-22-research-multi-agent-coordination.md +312 -0
  139. package/docs/plans/2026-02-22-research-phase-integration.md +602 -0
  140. package/docs/plans/2026-02-22-research-plan-quality.md +428 -0
  141. package/docs/plans/2026-02-22-research-prompt-engineering.md +558 -0
  142. package/docs/plans/2026-02-22-research-unconventional-perspectives.md +528 -0
  143. package/docs/plans/2026-02-22-research-user-adoption.md +638 -0
  144. package/docs/plans/2026-02-22-research-verification-effectiveness.md +433 -0
  145. package/docs/plans/2026-02-23-agent-suite-design.md +299 -0
  146. package/docs/plans/2026-02-23-agent-suite-plan.md +578 -0
  147. package/docs/plans/2026-02-23-phase3-cost-infrastructure-design.md +148 -0
  148. package/docs/plans/2026-02-23-phase3-cost-infrastructure-plan.md +1062 -0
  149. package/docs/plans/2026-02-23-research-bash-expert-agent.md +543 -0
  150. package/docs/plans/2026-02-23-research-dependency-auditor-agent.md +564 -0
  151. package/docs/plans/2026-02-23-research-improving-existing-agents.md +503 -0
  152. package/docs/plans/2026-02-23-research-integration-tester-agent.md +454 -0
  153. package/docs/plans/2026-02-23-research-python-expert-agent.md +429 -0
  154. package/docs/plans/2026-02-23-research-service-monitor-agent.md +425 -0
  155. package/docs/plans/2026-02-23-research-shell-expert-agent.md +533 -0
  156. package/docs/plans/2026-02-23-roadmap-to-completion.md +530 -0
  157. package/docs/plans/2026-02-24-headless-module-split-design.md +98 -0
  158. package/docs/plans/2026-02-24-headless-module-split.md +443 -0
  159. package/docs/plans/2026-02-24-lesson-scope-metadata-design.md +228 -0
  160. package/docs/plans/2026-02-24-lesson-scope-metadata-plan.md +968 -0
  161. package/docs/plans/2026-02-24-npm-packaging-design.md +841 -0
  162. package/docs/plans/2026-02-24-npm-packaging-plan.md +1965 -0
  163. package/docs/plans/audit-findings.md +186 -0
  164. package/docs/telegram-notification-format.md +98 -0
  165. package/examples/example-plan.md +51 -0
  166. package/examples/example-prd.json +72 -0
  167. package/examples/example-roadmap.md +33 -0
  168. package/examples/quickstart-plan.md +63 -0
  169. package/hooks/hooks.json +26 -0
  170. package/hooks/setup-symlinks.sh +48 -0
  171. package/hooks/stop-hook.sh +135 -0
  172. package/package.json +47 -0
  173. package/policies/bash.md +71 -0
  174. package/policies/python.md +71 -0
  175. package/policies/testing.md +61 -0
  176. package/policies/universal.md +60 -0
  177. package/scripts/analyze-report.sh +97 -0
  178. package/scripts/architecture-map.sh +145 -0
  179. package/scripts/auto-compound.sh +273 -0
  180. package/scripts/batch-audit.sh +42 -0
  181. package/scripts/batch-test.sh +101 -0
  182. package/scripts/entropy-audit.sh +221 -0
  183. package/scripts/failure-digest.sh +51 -0
  184. package/scripts/generate-ast-rules.sh +96 -0
  185. package/scripts/init.sh +112 -0
  186. package/scripts/lesson-check.sh +428 -0
  187. package/scripts/lib/common.sh +61 -0
  188. package/scripts/lib/cost-tracking.sh +153 -0
  189. package/scripts/lib/ollama.sh +60 -0
  190. package/scripts/lib/progress-writer.sh +128 -0
  191. package/scripts/lib/run-plan-context.sh +215 -0
  192. package/scripts/lib/run-plan-echo-back.sh +231 -0
  193. package/scripts/lib/run-plan-headless.sh +396 -0
  194. package/scripts/lib/run-plan-notify.sh +57 -0
  195. package/scripts/lib/run-plan-parser.sh +81 -0
  196. package/scripts/lib/run-plan-prompt.sh +215 -0
  197. package/scripts/lib/run-plan-quality-gate.sh +132 -0
  198. package/scripts/lib/run-plan-routing.sh +315 -0
  199. package/scripts/lib/run-plan-sampling.sh +170 -0
  200. package/scripts/lib/run-plan-scoring.sh +146 -0
  201. package/scripts/lib/run-plan-state.sh +142 -0
  202. package/scripts/lib/run-plan-team.sh +199 -0
  203. package/scripts/lib/telegram.sh +54 -0
  204. package/scripts/lib/thompson-sampling.sh +176 -0
  205. package/scripts/license-check.sh +74 -0
  206. package/scripts/mab-run.sh +575 -0
  207. package/scripts/module-size-check.sh +146 -0
  208. package/scripts/patterns/async-no-await.yml +5 -0
  209. package/scripts/patterns/bare-except.yml +6 -0
  210. package/scripts/patterns/empty-catch.yml +6 -0
  211. package/scripts/patterns/hardcoded-localhost.yml +9 -0
  212. package/scripts/patterns/retry-loop-no-backoff.yml +12 -0
  213. package/scripts/pipeline-status.sh +197 -0
  214. package/scripts/policy-check.sh +226 -0
  215. package/scripts/prior-art-search.sh +133 -0
  216. package/scripts/promote-mab-lessons.sh +126 -0
  217. package/scripts/prompts/agent-a-superpowers.md +29 -0
  218. package/scripts/prompts/agent-b-ralph.md +29 -0
  219. package/scripts/prompts/judge-agent.md +61 -0
  220. package/scripts/prompts/planner-agent.md +44 -0
  221. package/scripts/pull-community-lessons.sh +90 -0
  222. package/scripts/quality-gate.sh +266 -0
  223. package/scripts/research-gate.sh +90 -0
  224. package/scripts/run-plan.sh +329 -0
  225. package/scripts/scope-infer.sh +159 -0
  226. package/scripts/setup-ralph-loop.sh +155 -0
  227. package/scripts/telemetry.sh +230 -0
  228. package/scripts/tests/run-all-tests.sh +52 -0
  229. package/scripts/tests/test-act-cli.sh +46 -0
  230. package/scripts/tests/test-agents-md.sh +87 -0
  231. package/scripts/tests/test-analyze-report.sh +114 -0
  232. package/scripts/tests/test-architecture-map.sh +89 -0
  233. package/scripts/tests/test-auto-compound.sh +169 -0
  234. package/scripts/tests/test-batch-test.sh +65 -0
  235. package/scripts/tests/test-benchmark-runner.sh +25 -0
  236. package/scripts/tests/test-common.sh +168 -0
  237. package/scripts/tests/test-cost-tracking.sh +158 -0
  238. package/scripts/tests/test-echo-back.sh +180 -0
  239. package/scripts/tests/test-entropy-audit.sh +146 -0
  240. package/scripts/tests/test-failure-digest.sh +66 -0
  241. package/scripts/tests/test-generate-ast-rules.sh +145 -0
  242. package/scripts/tests/test-helpers.sh +82 -0
  243. package/scripts/tests/test-init.sh +47 -0
  244. package/scripts/tests/test-lesson-check.sh +278 -0
  245. package/scripts/tests/test-lesson-local.sh +55 -0
  246. package/scripts/tests/test-license-check.sh +109 -0
  247. package/scripts/tests/test-mab-run.sh +182 -0
  248. package/scripts/tests/test-ollama-lib.sh +49 -0
  249. package/scripts/tests/test-ollama.sh +60 -0
  250. package/scripts/tests/test-pipeline-status.sh +198 -0
  251. package/scripts/tests/test-policy-check.sh +124 -0
  252. package/scripts/tests/test-prior-art-search.sh +96 -0
  253. package/scripts/tests/test-progress-writer.sh +140 -0
  254. package/scripts/tests/test-promote-mab-lessons.sh +110 -0
  255. package/scripts/tests/test-pull-community-lessons.sh +149 -0
  256. package/scripts/tests/test-quality-gate.sh +241 -0
  257. package/scripts/tests/test-research-gate.sh +132 -0
  258. package/scripts/tests/test-run-plan-cli.sh +86 -0
  259. package/scripts/tests/test-run-plan-context.sh +305 -0
  260. package/scripts/tests/test-run-plan-e2e.sh +153 -0
  261. package/scripts/tests/test-run-plan-headless.sh +424 -0
  262. package/scripts/tests/test-run-plan-notify.sh +124 -0
  263. package/scripts/tests/test-run-plan-parser.sh +217 -0
  264. package/scripts/tests/test-run-plan-prompt.sh +254 -0
  265. package/scripts/tests/test-run-plan-quality-gate.sh +222 -0
  266. package/scripts/tests/test-run-plan-routing.sh +178 -0
  267. package/scripts/tests/test-run-plan-scoring.sh +148 -0
  268. package/scripts/tests/test-run-plan-state.sh +261 -0
  269. package/scripts/tests/test-run-plan-team.sh +157 -0
  270. package/scripts/tests/test-scope-infer.sh +150 -0
  271. package/scripts/tests/test-setup-ralph-loop.sh +63 -0
  272. package/scripts/tests/test-telegram-env.sh +38 -0
  273. package/scripts/tests/test-telegram.sh +121 -0
  274. package/scripts/tests/test-telemetry.sh +46 -0
  275. package/scripts/tests/test-thompson-sampling.sh +139 -0
  276. package/scripts/tests/test-validate-all.sh +60 -0
  277. package/scripts/tests/test-validate-commands.sh +89 -0
  278. package/scripts/tests/test-validate-hooks.sh +98 -0
  279. package/scripts/tests/test-validate-lessons.sh +150 -0
  280. package/scripts/tests/test-validate-plan-quality.sh +235 -0
  281. package/scripts/tests/test-validate-plans.sh +187 -0
  282. package/scripts/tests/test-validate-plugin.sh +106 -0
  283. package/scripts/tests/test-validate-prd.sh +184 -0
  284. package/scripts/tests/test-validate-skills.sh +134 -0
  285. package/scripts/validate-all.sh +57 -0
  286. package/scripts/validate-commands.sh +67 -0
  287. package/scripts/validate-hooks.sh +89 -0
  288. package/scripts/validate-lessons.sh +98 -0
  289. package/scripts/validate-plan-quality.sh +369 -0
  290. package/scripts/validate-plans.sh +120 -0
  291. package/scripts/validate-plugin.sh +86 -0
  292. package/scripts/validate-policies.sh +42 -0
  293. package/scripts/validate-prd.sh +118 -0
  294. package/scripts/validate-skills.sh +96 -0
  295. package/skills/autocode/SKILL.md +285 -0
  296. package/skills/autocode/ab-verification.md +51 -0
  297. package/skills/autocode/code-quality-standards.md +37 -0
  298. package/skills/autocode/competitive-mode.md +364 -0
  299. package/skills/brainstorming/SKILL.md +97 -0
  300. package/skills/capture-lesson/SKILL.md +187 -0
  301. package/skills/check-lessons/SKILL.md +116 -0
  302. package/skills/dispatching-parallel-agents/SKILL.md +110 -0
  303. package/skills/executing-plans/SKILL.md +85 -0
  304. package/skills/finishing-a-development-branch/SKILL.md +201 -0
  305. package/skills/receiving-code-review/SKILL.md +72 -0
  306. package/skills/requesting-code-review/SKILL.md +59 -0
  307. package/skills/requesting-code-review/code-reviewer.md +82 -0
  308. package/skills/research/SKILL.md +145 -0
  309. package/skills/roadmap/SKILL.md +115 -0
  310. package/skills/subagent-driven-development/SKILL.md +98 -0
  311. package/skills/subagent-driven-development/code-quality-reviewer-prompt.md +18 -0
  312. package/skills/subagent-driven-development/implementer-prompt.md +73 -0
  313. package/skills/subagent-driven-development/spec-reviewer-prompt.md +57 -0
  314. package/skills/systematic-debugging/SKILL.md +134 -0
  315. package/skills/systematic-debugging/condition-based-waiting.md +64 -0
  316. package/skills/systematic-debugging/defense-in-depth.md +32 -0
  317. package/skills/systematic-debugging/root-cause-tracing.md +55 -0
  318. package/skills/test-driven-development/SKILL.md +167 -0
  319. package/skills/using-git-worktrees/SKILL.md +219 -0
  320. package/skills/using-superpowers/SKILL.md +54 -0
  321. package/skills/verification-before-completion/SKILL.md +140 -0
  322. package/skills/verify/SKILL.md +82 -0
  323. package/skills/writing-plans/SKILL.md +128 -0
  324. package/skills/writing-skills/SKILL.md +93 -0
@@ -0,0 +1,886 @@
+ # Research: Code Guideline Policies for AI Coding Agents
+
+ > **Date:** 2026-02-22
+ > **Status:** Research complete
+ > **Method:** Web research + competitive analysis + academic literature
+ > **Confidence:** High on landscape/format findings, Medium on optimal instruction counts, Low on long-term measurability
+
+ ## Executive Summary
+
+ The autonomous-coding-toolkit has strong negative enforcement (lessons, hookify, quality gates) but no positive policy system. This research examines how to add one.
+
+ **Key findings:**
+
+ 1. **Positive instructions outperform negative ones.** LLMs follow "always do X" significantly better than "never do Y" due to how token generation works. The toolkit's anti-pattern system (negative) should be complemented by a positive policy system, not replaced.
+
+ 2. **There is an emerging cross-tool standard.** AGENTS.md (Linux Foundation/Agentic AI Foundation) is the closest thing to a universal policy file, supported by 20+ tools. CLAUDE.md remains Claude Code's native format. The toolkit should support both.
+
+ 3. **Instruction saturation is real and quantifiable.** Frontier LLMs can follow ~150-200 instructions with reasonable consistency. Claude Code's system prompt consumes ~50, leaving ~100-150 for user instructions. Policy files must be ruthlessly pruned to stay within this budget.
+
+ 4. **Scoped, file-triggered policies beat monolithic rule files.** Cursor's `.mdc` format (glob-triggered rules) and GitHub Copilot's path-specific instructions demonstrate that policies attached to relevant files outperform global rule dumps. The signal-to-noise ratio matters more than rule count.
+
+ 5. **The enforcement spectrum has three tiers.** Hard (gate/block) for safety and correctness, soft (prompt injection) for style and conventions, and post-hoc (review/audit) for subjective quality. The toolkit already covers hard enforcement; the gap is in soft and post-hoc.
+
+ 6. **Policies and lessons are complementary, not overlapping.** Lessons capture "what went wrong" (reactive, negative). Policies capture "how we work" (proactive, positive). They serve different functions and should remain separate systems with cross-references.
+
+ ---
+
+ ## 1. Policy File Landscape
+
+ ### Findings
+
+ The AI coding agent ecosystem has converged on markdown-based instruction files placed in the repository, with tool-specific naming conventions:
+
+ | Tool | File(s) | Format | Scoping | Injection Point |
+ |------|---------|--------|---------|-----------------|
+ | **Claude Code** | `CLAUDE.md` (root, parents, children, `~/.claude/`) | Markdown, freeform | Directory hierarchy | Every session start |
+ | **Cursor** | `.cursor/rules/*.mdc` (replaces legacy `.cursorrules`) | Markdown with YAML frontmatter (globs, alwaysApply, description) | Glob patterns, always, agent-requested, manual | Per-file or always, based on type |
+ | **GitHub Copilot** | `.github/copilot-instructions.md` + `.github/instructions/**/*.instructions.md` | Markdown with `applyTo` glob frontmatter | Repository-wide + path-specific | Attached to every chat/inline request |
+ | **Windsurf** | `.windsurf/rules/*.md` (replaces legacy `.windsurfrules`) | Markdown, topic-organized | Global (user) + project | Per-session |
+ | **Amazon Q** | `.amazonq/rules/*.md` | Markdown | Project-level | Scanned on first interaction, evaluated per request |
+ | **Aider** | `CONVENTIONS.md` (or any markdown via `--read`) | Markdown | Project-level | Loaded as read-only context |
+ | **JetBrains Junie** | `.junie/guidelines.md` | Markdown | Project-level | Per-task |
+ | **AGENTS.md** | `AGENTS.md` (root + subdirectories) | Markdown, freeform | Directory hierarchy (closest wins) | Cross-tool standard, 20+ agents |
+ | **ESLint** | `eslint.config.js` | JavaScript/JSON | Rule-level with overrides | Build-time enforcement |
+ | **Ruff/Black** | `pyproject.toml [tool.ruff]` | TOML | File/directory patterns | Build-time enforcement |
+ | **EditorConfig** | `.editorconfig` | INI with globs | File patterns | Editor-level |
+ | **Prettier** | `.prettierrc` | JSON/YAML/JS | File patterns | Build-time |
+
+ ### Evidence
+
+ Every major AI coding tool has independently arrived at the same pattern: **a markdown file in the repository that gets injected into the agent's context**. The format is always natural language markdown, not structured config. This is because:
+
+ 1. LLMs consume natural language natively -- no parsing overhead.
+ 2. Markdown is human-readable and version-controllable.
+ 3. The instruction set is inherently fuzzy (style, conventions, preferences) and resists formalization into structured schemas.
+
+ Traditional tools (ESLint, Ruff, Prettier, EditorConfig) use structured config because they perform deterministic enforcement. AI agent tools use natural language because the enforcement is probabilistic.
+
+ **AGENTS.md** is emerging as the cross-tool standard, stewarded by the Agentic AI Foundation under the Linux Foundation. It is supported by OpenAI Codex, Google Jules/Gemini CLI, GitHub Copilot, Cursor, Windsurf, Factory, Aider, and 15+ others. Its key design principle: "The closest AGENTS.md to the edited file wins; explicit user chat prompts override everything."
+
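The "closest AGENTS.md wins" rule quoted above can be illustrated with a short Python sketch. This is only one possible reading of the rule, not how any particular tool implements it: the function name `nearest_agents_md` and the idea of stopping the search at a repository root are assumptions made for illustration.

```python
from pathlib import Path
from typing import Optional


def nearest_agents_md(edited_file: Path, repo_root: Path) -> Optional[Path]:
    """Return the AGENTS.md closest to edited_file, searching upward to repo_root.

    Hypothetical helper: real tools may resolve precedence differently.
    """
    repo_root = repo_root.resolve()
    directory = edited_file.resolve().parent
    while True:
        candidate = directory / "AGENTS.md"
        if candidate.is_file():
            return candidate
        # Stop at the repository root (or the filesystem root as a safety net).
        if directory == repo_root or directory == directory.parent:
            return None
        directory = directory.parent
```

Under this reading, a file in `src/api/` picks up `src/api/AGENTS.md` when it exists and falls back to the root `AGENTS.md` otherwise; the edited file itself does not need to exist yet.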
+ ### Implications for the Toolkit
+
+ The toolkit should:
+ - **Support AGENTS.md** as the cross-tool policy format (broadest compatibility).
+ - **Continue using CLAUDE.md** for Claude Code-specific instructions (deepest integration).
+ - **Adopt scoped policies** (directory-level or glob-triggered) rather than a single monolithic file.
+ - **Use markdown** as the policy format -- not YAML, not JSON, not structured config. Natural language is the native format for LLM consumption.
+
+ **Confidence: High.** The convergence across 10+ independent tools is strong evidence that markdown-in-repo is the right format.
+
+ ---
+
+ ## 2. Positive vs. Negative Enforcement
+
+ ### Findings
+
+ Research and practitioner evidence consistently show that LLMs respond better to positive instructions ("do X") than negative constraints ("don't do Y"):
+
+ 1. **Token generation is inherently positive-selective.** LLMs predict the next most likely token -- they boost probabilities of desired outputs rather than suppressing undesired ones. Negative prompts only slightly reduce probabilities of unwanted tokens, while positive prompts actively boost desired outcomes.
+
+ 2. **InstructGPT performance degrades with negative prompts at scale.** Research on the NeQA benchmark shows that negation understanding does not reliably improve as models get larger. Models like GPT-3, GPT-Neo, and InstructGPT "consistently struggle with negation across multiple benchmarks."
+
+ 3. **Practitioner evidence is consistent.** Users report that LLMs "seem to produce worse output" the more "DO NOTs" appear in prompts. Specific example: Claude Code continued creating duplicate files despite explicit rules stating "NEVER create duplicate files."
+
+ 4. **The "Pink Elephant Problem."** Analogous to Ironic Process Theory in psychology -- trying to suppress a specific thought makes it more likely to surface. When you tell an LLM "don't use mock data," the tokens "mock" and "data" get activated in the attention mechanism, potentially increasing their probability.
+
+ **Reframing examples that work:**
+
+ | Negative (less effective) | Positive (more effective) |
+ |--------------------------|--------------------------|
+ | "Don't use mock data" | "Only use real-world data" |
+ | "Don't uppercase names" | "Always lowercase names" |
+ | "Avoid creating new files" | "Apply all fixes to existing files" |
+ | "Don't include fields with no value" | "Only include fields that have a value" |
+ | "Never use bare except" | "Always catch specific exception types and log them" |
+ | "Don't hardcode test counts" | "Assert test discovery dynamically using `len(collected)`" |
+
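To make one row of the table concrete, here is a minimal Python sketch of what "Always catch specific exception types and log them" asks an agent to produce. The function `parse_port` and its default value are hypothetical, chosen only to show the pattern.

```python
import logging

logger = logging.getLogger(__name__)


def parse_port(raw, default: int = 8080) -> int:
    """Parse a port number, falling back to a default on bad input.

    Hypothetical example of the positive policy, not code from the toolkit.
    """
    try:
        port = int(raw)
    except (TypeError, ValueError) as exc:
        # Specific exception types, logged before the fallback is applied.
        logger.warning("Invalid port %r, using default %d: %s", raw, default, exc)
        return default
    return port
```

The positive phrasing names the exact shape of the handler (typed `except`, a log call, then the fallback), which is harder to infer from "never use bare except" alone.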
+ ### Evidence
+
+ - Research papers: NeQA benchmark studies on InstructGPT negation performance.
+ - Practitioner reports: Multiple Reddit threads and blog posts documenting negative instruction failures in Cursor, Claude Code, and GPT-4.
+ - Theoretical basis: Token generation mechanics (positive selection), Ironic Process Theory applied to neural networks.
+
+ ### Implications for the Toolkit
+
+ The toolkit's existing lesson system is primarily negative ("don't do X" -- bare exceptions, async without await, create_task without callback). This is **correct for the enforcement layer** (quality gates should block known-bad patterns) but **incomplete for the guidance layer** (agents need to know what TO do, not just what NOT to do).
+
+ **Recommendation:** Create a policy system that complements lessons:
+
+ | System | Framing | Function | Enforcement |
+ |--------|---------|----------|-------------|
+ | Lessons | Negative ("don't do X") | Catch known bugs | Hard gate (lesson-check.sh) |
+ | Policies | Positive ("always do Y") | Guide correct patterns | Soft injection (prompt context) |
+
+ Every lesson should have an optional `positive_alternative` field that gets injected into agent prompts as a policy. Example: Lesson 0001 (bare except) maps to policy "Always catch specific exception types (ValueError, KeyError, etc.) and log the exception before any fallback behavior."
+
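A rough Python sketch of the `positive_alternative` idea follows. The record layout (`id`, `title`, `positive_alternative` as dictionary keys) is an assumption for illustration; the toolkit's actual lesson schema is not shown in this section and may differ.

```python
from typing import Dict, List

# Hypothetical lesson records; the toolkit's real lesson schema may differ.
LESSONS: List[Dict[str, str]] = [
    {
        "id": "0001",
        "title": "Bare exception swallowing",
        "positive_alternative": (
            "Always catch specific exception types (ValueError, KeyError, etc.) "
            "and log the exception before any fallback behavior."
        ),
    },
    # A lesson that has not been given a positive alternative yet.
    {"id": "0002", "title": "async def without await"},
]


def policies_from_lessons(lessons: List[Dict[str, str]]) -> List[str]:
    """Render one policy bullet per lesson that defines positive_alternative."""
    bullets = []
    for lesson in lessons:
        alternative = lesson.get("positive_alternative", "").strip()
        if alternative:
            # Keep a cross-reference back to the originating lesson.
            bullets.append(f"- {alternative} (lesson {lesson['id']})")
    return bullets
```

Because the field is optional, lessons without it are simply skipped, so the hard gate and the soft-injected policy list can grow at different rates.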
+ **Confidence: High** on the principle. **Medium** on the magnitude of improvement -- the research is qualitative, not quantified with compliance percentages.
+
+ ---
+
+ ## 3. Policy Injection Mechanics
+
+ ### Findings
+
+ #### Where in the prompt should policies go?
+
+ Claude Code's CLAUDE.md content is injected as a system-level context block that Claude reads at the start of every conversation. Based on Anthropic's best practices documentation:
+
+ - CLAUDE.md goes into **every single session**, so contents must be universally applicable.
+ - Claude Code's system prompt contains ~50 individual instructions.
+ - Frontier thinking LLMs can follow **~150-200 instructions** with reasonable consistency.
+ - This leaves a budget of **~100-150 instructions** for user-defined policies.
+
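The budget arithmetic in the list above (roughly 150-200 followable instructions minus about 50 used by the system prompt) can be turned into a quick check. The sketch below is a heuristic only: it assumes each markdown bullet or numbered item counts as one instruction and uses the lower bound of the estimated range. The constant and function names are hypothetical.

```python
import re
from typing import List

# Assumed figures taken from the estimates above; both are rough.
MODEL_INSTRUCTION_LIMIT = 150   # lower bound of the ~150-200 range
SYSTEM_PROMPT_INSTRUCTIONS = 50

# Heuristic: one markdown bullet or numbered item equals one instruction.
RULE_LINE = re.compile(r"^\s*(?:[-*]|\d+\.)\s+\S")


def count_rules(markdown: str) -> int:
    """Count bullet and numbered-list lines in a policy file's text."""
    return sum(1 for line in markdown.splitlines() if RULE_LINE.match(line))


def remaining_budget(policy_texts: List[str]) -> int:
    """Instructions left after the system prompt and all given policy files."""
    used = sum(count_rules(text) for text in policy_texts)
    return MODEL_INSTRUCTION_LIMIT - SYSTEM_PROMPT_INSTRUCTIONS - used
```

A negative result would suggest the combined policy files exceed the conservative estimate and are candidates for pruning or scoping.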
+ Position matters: research shows models retrieve best from the **beginning or end** of long context, and degrade for information buried in the middle (the "lost in the middle" phenomenon).
+
+ #### How much policy text can an LLM absorb?
+
+ - **Optimal operating range:** 70-80% of context window capacity. Beyond this, accuracy drops regardless of remaining token capacity.
+ - **Smaller models show exponential decay** in instruction-following as count increases. Frontier models show **linear decay** -- more graceful but still real.
+ - Long system prompts increase prefill latency (time to first token) and bloat the KV cache for the entire turn.
+
+ #### What format works best?
+
+ Based on cross-tool convergence and Anthropic's official guidance:
+
+ | Format | LLM Effectiveness | Human Readability | Tooling Support |
+ |--------|-------------------|-------------------|-----------------|
+ | Prose paragraphs | Low -- buried signal | Medium | Universal |
+ | Bullet lists | High -- scannable | High | Universal |
+ | Numbered rules | High -- ordered, referenceable | High | Universal |
+ | Code examples | Highest -- concrete, unambiguous | Medium | Universal |
+ | YAML/JSON | Medium -- parseable but noisy | Low | Requires parser |
+ | Tables | High for comparisons | High | Markdown renderers |
+
155
+ **Best practice from Anthropic's documentation:**
156
+
157
+ ```markdown
158
+ # Code style
159
+ - Use ES modules (import/export) syntax, not CommonJS (require)
160
+ - Destructure imports when possible (eg. import { foo } from 'bar')
161
+
162
+ # Workflow
163
+ - Be sure to typecheck when you're done making a series of code changes
164
+ - Prefer running single tests, not the whole test suite, for performance
165
+ ```
166
+
167
+ Short, imperative bullet points. No prose. No explanation unless the rule is non-obvious. Code examples when the pattern is complex.
168
+
169
+ **Emphasis markers can improve adherence:** Anthropic's documentation suggests adding "IMPORTANT" or "YOU MUST" to critical rules.
170
+
171
+ ### Implications for the Toolkit
172
+
173
+ Policy injection should:
174
+ 1. **Use bullet-list format** -- one rule per line, imperative voice, positive framing.
175
+ 2. **Include code examples** for non-obvious patterns (the single highest-fidelity format for LLMs).
176
+ 3. **Stay under 100 rules total** across all injected policy sources for a given session.
177
+ 4. **Position critical rules first and last** in the policy block (primacy and recency effects).
178
+ 5. **Inject scoped policies only when relevant** (glob-triggered, like Cursor's `.mdc`) to avoid wasting the instruction budget on irrelevant rules.
179
+ 6. **Never duplicate what linters enforce.** Anthropic's own guidance: "Never send an LLM to do a linter's job."
180
+
181
+ **Confidence: High** on format and positioning. **Medium** on the 150-200 instruction limit (single source, no independent replication found).
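The positioning and budget rules above can be sketched in a few lines. This is a minimal illustration, not the toolkit's implementation: the function name `order_rules`, the `(text, is_critical)` tuple shape, and the choice to split critical rules evenly between the top and bottom of the block are all assumptions made for the example.

```python
def order_rules(rules, max_rules=100):
    """Place critical rules at the start and end of the policy block.

    `rules` is a list of (text, is_critical) tuples (a hypothetical shape).
    Raises ValueError when the block exceeds the rule budget.
    """
    if len(rules) > max_rules:
        raise ValueError(f"{len(rules)} rules exceeds budget of {max_rules}")
    critical = [text for text, is_critical in rules if is_critical]
    normal = [text for text, is_critical in rules if not is_critical]
    # First half of the critical rules leads the block, the rest closes it.
    head = critical[: (len(critical) + 1) // 2]
    tail = critical[len(head):]
    return head + normal + tail


RULES = [
    ("Use f-strings", False),
    ("Never commit secrets", True),
    ("Prefer pathlib", False),
    ("Catch specific exceptions", True),
    ("Run tests before committing", True),
]
ORDERED = order_rules(RULES)
```

With the sample input, the two non-critical rules end up in the middle of the block, where retrieval is weakest.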
182
+
183
+ ---
184
+
185
+ ## 4. Policy Scoping
186
+
187
+ ### Findings
188
+
189
+ Policies naturally fall into four scopes, each with different update frequency and blast radius:
190
+
191
+ | Scope | Examples | Update Frequency | Applies To |
192
+ |-------|----------|------------------|------------|
193
+ | **Universal** | "Always handle errors explicitly," "Include type hints" | Rarely | All projects |
194
+ | **Language** | "Use pathlib over os.path," "Prefer f-strings over .format()" | Occasionally | All Python projects |
195
+ | **Framework** | "Use Pydantic models for API schemas," "Use pytest fixtures over setUp/tearDown" | Per-project | Projects using that framework |
196
+ | **Project** | "State management is in src/stores/," "Use the `ApiClient` class for all HTTP calls" | Frequently | Single project |
197
+
198
+ #### How to prevent policy bloat
199
+
200
+ The Cursor `.mdc` format provides the best model for scoped policies:
201
+
202
+ - **Always rules** (`alwaysApply: true`): Injected into every context. Reserve for universal and language-level policies. Budget: 20-30 rules max.
203
+ - **Auto-attached rules** (glob-triggered): Injected only when working with matching files. Example: `*.test.py` triggers testing conventions. Budget: 10-15 rules per scope.
204
+ - **Agent-requested rules** (description-based): The agent reads descriptions and pulls in rules it deems relevant. Lowest injection overhead.
205
+ - **Manual rules**: Never auto-injected. Reference material only.
206
+
207
+ GitHub Copilot's approach is similar but simpler: `.github/copilot-instructions.md` for global, `.github/instructions/**/*.instructions.md` with `applyTo` globs for scoped.
208
+
209
+ AGENTS.md uses directory hierarchy: place `AGENTS.md` in each subdirectory, and the closest one to the file being edited wins. CLAUDE.md in Claude Code follows a similar directory-based hierarchy.
210
+
211
+ #### Evidence for scoping effectiveness
212
+
213
+ Cursor users report that migrating from a single `.cursorrules` file (often 500+ lines) to scoped `.mdc` files dramatically improved compliance: "Isn't 20 mdc files too much information for Cursor? No. This is what .mdc files solve." The key insight: the agent sees only the rules relevant to its current task, keeping the effective instruction count low even as the total policy library grows.
214
+
215
+ ### Implications for the Toolkit
216
+
217
+ The toolkit should implement a **three-tier scoping model**:
218
+
219
+ ```
220
+ policies/
221
+   universal.md       # Always injected (20-30 rules max)
222
+   python.md          # Auto-attached for *.py files
223
+   javascript.md      # Auto-attached for *.js/*.ts files
224
+   testing.md         # Auto-attached for test files
225
+   project.md         # Project-specific conventions
226
+ ```
227
+
228
+ Each policy file includes frontmatter specifying scope:
229
+
230
+ ```markdown
231
+ ---
232
+ scope: auto-attach
233
+ globs: ["*.py", "*.pyi"]
234
+ ---
235
+ # Python Conventions
236
+ - Use pathlib for all file path operations
237
+ - Type-hint all function signatures (parameters and return)
238
+ - Use dataclasses or Pydantic for structured data, not plain dicts
239
+ ```
240
+
241
+ The injection mechanism assembles the relevant policy set per batch/task/session:
242
+ 1. Always include `universal.md`.
243
+ 2. Auto-attach policies matching the files in the current batch.
244
+ 3. Count total rules. If >100, warn and suggest pruning.
245
+
246
+ **Confidence: High.** Multiple tools have converged on scoped policies independently.
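The three assembly steps above could look roughly like the following. It is a sketch under stated assumptions: `parse_policy` and `select_policies` are hypothetical names, the frontmatter parser handles only the flat `key: value` form shown in this document (not full YAML), `universal.md` is assumed to declare `scope: always`, and a "rule" is assumed to be any line starting with `- `.

```python
import fnmatch
import json
from pathlib import PurePosixPath


def parse_policy(text):
    """Split a policy file into (frontmatter dict, list of rule strings)."""
    meta, body = {}, text
    if text.startswith("---\n"):
        header, _, body = text[4:].partition("\n---\n")
        for line in header.splitlines():
            key, _, value = line.partition(":")
            value = value.split(" #")[0].strip()  # drop trailing comments
            # Glob lists in this document are JSON-compatible arrays.
            meta[key.strip()] = json.loads(value) if value.startswith("[") else value
    rules = [line[2:].strip() for line in body.splitlines() if line.startswith("- ")]
    return meta, rules


def select_policies(policies, batch_files, max_rules=100):
    """Return (selected policy names, total rule count, over-budget flag)."""
    names = [PurePosixPath(path).name for path in batch_files]
    selected, total = [], 0
    for policy_name, text in policies.items():
        meta, rules = parse_policy(text)
        scope = meta.get("scope", "always")
        globs = meta.get("globs", [])
        matched = any(fnmatch.fnmatchcase(n, g) for n in names for g in globs)
        if scope == "always" or (scope == "auto-attach" and matched):
            selected.append(policy_name)
            total += len(rules)
    return selected, total, total > max_rules


POLICIES = {
    "universal.md": "---\nscope: always\n---\n- Handle errors explicitly\n- Include type hints\n",
    "python.md": '---\nscope: auto-attach\nglobs: ["*.py", "*.pyi"]\n---\n'
                 "# Python Conventions\n- Use pathlib for all file path operations\n",
    "javascript.md": '---\nscope: auto-attach\nglobs: ["*.js", "*.ts"]\n---\n- Use ES modules\n',
}
RESULT = select_policies(POLICIES, ["src/app.py", "README.md"])
```

Matching on the file's base name keeps globs like `*.py` working for nested paths. A real implementation would probably need path-aware patterns for cases such as test directories.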
247
+
248
+ ---
249
+
250
+ ## 5. Enforcement Spectrum
251
+
252
+ ### Findings
253
+
254
+ Policies exist on an enforcement spectrum from advisory to blocking. The appropriate level depends on two axes: **cost of violation** and **detectability**.
255
+
256
+ | Enforcement Level | Mechanism | When to Use | Examples |
257
+ |-------------------|-----------|-------------|---------|
258
+ | **Hard gate (block)** | Quality gate, hookify rule, linter | Violation causes bugs, data loss, or security issues. Pattern is syntactically detectable with near-zero false positives. | Bare exceptions, secrets in code, force-push |
259
+ | **Soft injection (advisory)** | Prompt context via CLAUDE.md / AGENTS.md / policy files | Violation degrades quality but isn't catastrophic. Pattern is stylistic or contextual. | Naming conventions, import organization, docstring format, preferred libraries |
260
+ | **Post-hoc review (audit)** | Code review agent, entropy audit, manual review | Violation is subjective or requires broad context to evaluate. | Architecture decisions, API design quality, test coverage adequacy |
261
+
262
+ #### Criteria for choosing enforcement level
263
+
264
+ Adapted from the Open Policy Agent (OPA) model of separating policy logic from enforcement:
265
+
266
+ 1. **Is it syntactically detectable with >95% precision?** Hard gate.
267
+ 2. **Is it a clear positive convention that an LLM can follow?** Soft injection.
268
+ 3. **Does it require understanding the full codebase context?** Post-hoc review.
269
+ 4. **Is the cost of a false positive higher than the cost of a missed violation?** Lower the enforcement level.
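One way to read the four criteria is as a small decision procedure. The sketch below is an interpretation, not something the source prescribes: the function name, its parameters, and the cascade order (criteria 1 to 3 pick a level, criterion 4 lowers it by one step) are assumptions.

```python
# Ordered from weakest to strongest enforcement.
LEVELS = ["post-hoc review", "soft injection", "hard gate"]


def choose_enforcement(precision, clear_convention, needs_full_context,
                       false_positive_costlier=False):
    """Map the four criteria onto one of the three enforcement levels."""
    if precision > 0.95:                      # criterion 1
        level = 2
    elif clear_convention and not needs_full_context:  # criterion 2
        level = 1
    else:                                     # criterion 3
        level = 0
    if false_positive_costlier:               # criterion 4: step down one level
        level = max(level - 1, 0)
    return LEVELS[level]
```

For example, a bare-`except` check with near-perfect precision lands on the hard gate, while the same check in a codebase where false positives are costly steps down to soft injection.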
270
+
271
+ #### What the toolkit already covers
272
+
273
+ | Tier | Current Coverage | Gap |
274
+ |------|-----------------|-----|
275
+ | Hard gate | lesson-check.sh (6 syntactic patterns), hookify (5 rules), ast-grep (5 patterns), test suite, test count monotonicity | Well-covered for anti-patterns |
276
+ | Soft injection | CLAUDE.md instructions, skill prompts, AGENTS.md per worktree | **No structured positive policy system** |
277
+ | Post-hoc review | lesson-scanner agent, entropy-audit.sh, code review skill | Partially covered |
278
+
279
+ ### Implications for the Toolkit
280
+
281
+ The gap is entirely in the **soft injection** tier. The toolkit has excellent hard gates and reasonable post-hoc review, but no systematic way to inject positive coding conventions into agent context.
282
+
283
+ The proposed policy system fills exactly this gap:
284
+ - Policies are **soft-injected** into agent prompts (CLAUDE.md, AGENTS.md, or dedicated policy files).
285
+ - They complement hard gates (lessons) rather than replacing them.
286
+ - They can optionally **graduate to hard enforcement** if a policy violation causes enough bugs (policy -> lesson -> hookify rule).
287
+
288
+ **Confidence: High.** The three-tier model maps cleanly to the toolkit's existing architecture.
289
+
290
+ ---
291
+
292
+ ## 6. Existing Implementations: Competitive Analysis
293
+
294
+ ### Claude Code (CLAUDE.md)
295
+
296
+ **Strengths:**
297
+ - Directory hierarchy (root, parent, child) enables natural scoping.
298
+ - `@import` syntax lets CLAUDE.md reference other files without duplicating content.
299
+ - `/init` command auto-generates starter CLAUDE.md from project analysis.
300
+ - Skills (`.claude/skills/`) provide on-demand policy loading without bloating every session.
301
+ - Hooks provide deterministic enforcement for must-happen rules.
302
+
303
+ **Weaknesses:**
304
+ - No glob-based auto-attachment (unlike Cursor's `.mdc`).
305
+ - No structured frontmatter -- all freeform markdown.
306
+ - No built-in mechanism to measure instruction compliance.
307
+ - Official guidance says to "ruthlessly prune" but provides no tools to identify stale rules.
308
+
309
+ **Key quote from Anthropic:** "If your CLAUDE.md is too long, Claude ignores half of it because important rules get lost in the noise."
310
+
311
+ ### Cursor (.mdc rules)
312
+
313
+ **Strengths:**
314
+ - Four rule types (Always, Auto-Attach, Agent-Requested, Manual) provide fine-grained injection control.
315
+ - Glob-based attachment means agents see only relevant rules.
316
+ - YAML frontmatter enables machine-parseable metadata.
317
+ - `.cursor/rules/` directory keeps rules organized by topic.
318
+
319
+ **Weaknesses:**
320
+ - Proprietary format (`.mdc`) not supported by other tools.
321
+ - Legacy `.cursorrules` migration path is confusing.
322
+ - No enforcement mechanism -- purely advisory.
323
+ - Rule effectiveness is not measurable.
324
+
325
+ ### GitHub Copilot
326
+
327
+ **Strengths:**
328
+ - Path-specific instructions (`.github/instructions/**/*.instructions.md` with `applyTo` globs) provide an elegant scoping model.
329
+ - Committed to repo, shared with team via git.
330
+ - Instructions attached to both chat and inline suggestions.
331
+
332
+ **Weaknesses:**
333
+ - Limited to Copilot ecosystem.
334
+ - No enforcement -- purely advisory.
335
+ - Relatively new feature, limited community examples.
336
+
337
+ ### Amazon Q Developer
338
+
339
+ **Strengths:**
340
+ - Rules explicitly designed for coding standards enforcement.
341
+ - Scans `.amazonq/rules/` on first interaction, evaluates per request.
342
+ - Supports language-specific style guidelines with concrete examples.
343
+
344
+ **Weaknesses:**
345
+ - AWS ecosystem lock-in.
346
+ - No scoping beyond project-level.
347
+ - Limited community sharing.
348
+
349
+ ### Aider (CONVENTIONS.md)
350
+
351
+ **Strengths:**
352
+ - Simplest model: one markdown file, loaded as read-only context.
353
+ - Community conventions repository for sharing.
354
+ - Integrates with post-edit linting (errors sent back to LLM for fixing).
355
+
356
+ **Weaknesses:**
357
+ - No scoping -- entire file loaded every time.
358
+ - No frontmatter or metadata.
359
+ - Relies on the LLM to decide relevance.
360
+
361
+ ### JetBrains Junie
362
+
363
+ **Strengths:**
364
+ - `.junie/guidelines.md` can be auto-generated by prompting Junie to explore the project.
365
+ - Community guidelines catalog (GitHub: JetBrains/junie-guidelines).
366
+
367
+ **Weaknesses:**
368
+ - Single file, no scoping.
369
+ - JetBrains ecosystem only.
370
+
371
+ ### AGENTS.md (Cross-Tool Standard)
372
+
373
+ **Strengths:**
374
+ - Supported by 20+ tools (broadest compatibility).
375
+ - Directory hierarchy scoping (closest file wins).
376
+ - No required fields -- flexible structure.
377
+ - Linux Foundation stewardship improves the odds of long-term maintenance.
378
+
379
+ **Weaknesses:**
380
+ - No frontmatter standard for glob patterns or rule types.
381
+ - No enforcement mechanism.
382
+ - Still early -- limited community policy libraries.
383
+
384
+ ### ESLint Shareable Configs
385
+
386
+ **Strengths:**
387
+ - Best example of **policy distribution at scale**: npm packages with versioned configs.
388
+ - `eslint-config-airbnb` has 3M+ weekly downloads -- proof that shared conventions work.
389
+ - Extends/overrides model for layered policies.
390
+
391
+ **Weaknesses:**
392
+ - Deterministic enforcement only (no fuzzy style guidance).
393
+ - JavaScript/TypeScript ecosystem only.
394
+ - Not consumed by LLMs.
395
+
396
+ ### Implications for the Toolkit
397
+
398
+ The toolkit should:
399
+ 1. **Generate AGENTS.md** in worktrees (already done for plan metadata -- extend with policies).
400
+ 2. **Support a `policies/` directory** with scoped markdown files.
401
+ 3. **Inject policies into the prompt assembly pipeline** (`scripts/lib/prompt.sh` or equivalent) during headless execution.
402
+ 4. **Adopt Cursor's glob-trigger model** for auto-attachment, implemented in the toolkit's own prompt assembly rather than relying on Cursor.
403
+ 5. **Build a policy distribution model** inspired by ESLint shareable configs -- community policy packs as git repos or directories.
404
+
405
+ **Confidence: High.** Analysis based on official documentation from all listed tools.
406
+
407
+ ---
408
+
409
+ ## 7. Interaction with Existing Systems
410
+
411
+ ### Findings
412
+
413
+ The toolkit currently has three enforcement layers:
414
+
415
+ | Layer | System | Timing | Nature |
416
+ |-------|--------|--------|--------|
417
+ | Pre-write | Hookify rules | Before file write | Behavioral enforcement (block/warn) |
418
+ | Post-batch | lesson-check.sh + quality-gate.sh | Between batches | Anti-pattern detection (block) |
419
+ | Post-implementation | lesson-scanner agent, entropy-audit.sh | At verification | Semantic analysis (advisory) |
420
+
421
+ Policies would add a fourth layer:
422
+
423
+ | Layer | System | Timing | Nature |
424
+ |-------|--------|--------|--------|
425
+ | **Pre-execution** | Policy injection | Before agent starts each batch | **Positive guidance (advisory)** |
426
+
427
+ ### How policies interact with each system
428
+
429
+ **Policies and Lessons:**
430
+ - Complementary, not overlapping. Lessons are reactive (capture past failures). Policies are proactive (define desired behavior).
431
+ - Cross-reference: Each lesson's `positive_alternative` field generates a corresponding policy entry.
432
+ - Example: Lesson 0001 ("bare except swallowing") cross-references policy "Always catch specific exception types and log them."
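A minimal sketch of the lesson-to-policy bridge described above, assuming lessons are available as dictionaries with `id` and an optional `positive_alternative` key. The function name `lessons_to_policy`, the output heading, and the choice of `scope: always` are illustrative assumptions. The frontmatter keys mirror the ones used elsewhere in this document.

```python
def lessons_to_policy(lessons):
    """Render a lesson-derived policy file from lessons that carry a
    `positive_alternative`. Lessons without the field are skipped."""
    with_alt = [lesson for lesson in lessons if lesson.get("positive_alternative")]
    ids = [lesson["id"] for lesson in with_alt]
    lines = [
        "---",
        "scope: always",
        "source: lesson-derived",
        f"related_lessons: {ids}",
        "---",
        "# Lesson-Derived Policies",
        "",
    ]
    lines += [f"- {lesson['positive_alternative']}" for lesson in with_alt]
    return "\n".join(lines) + "\n"


LESSONS = [
    {"id": 1, "title": "bare except swallowing",
     "positive_alternative": "Always catch specific exception types and log them."},
    {"id": 4, "title": "hardcoded test counts"},  # no positive alternative yet
]
POLICY_TEXT = lessons_to_policy(LESSONS)
```

Lessons without the field are simply left out, which keeps the bridge optional per lesson.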
433
+
434
+ **Policies and Hookify:**
435
+ - Non-overlapping enforcement targets. Hookify enforces behavioral rules (no force-push, no secrets). Policies guide stylistic conventions (naming, patterns, preferred libraries).
436
+ - Exception: If a policy is consistently violated despite soft injection, it may indicate the need for escalation to hookify (policy graduation).
437
+
438
+ **Policies and Quality Gates:**
439
+ - Quality gates verify after the fact. Policies guide before the fact.
440
+ - Quality gates can optionally check policy compliance by running a lightweight audit of generated code against active policies (post-hoc tier).
441
+ - New gate step: `policy-check.sh` -- a grep-based scanner for positive pattern presence (e.g., "all new Python functions have type hints").
442
+
443
+ **Policies and Skills:**
444
+ - Skills define HOW to execute stages. Policies define WHAT conventions to follow during execution.
445
+ - Skills reference policies: "Follow the policies in `policies/python.md` for all Python code in this task."
446
+ - Skills are rigid process templates. Policies are flexible convention sets.
447
+
448
+ **Policies and AGENTS.md:**
449
+ - AGENTS.md is already generated per worktree with plan metadata.
450
+ - Extend it to include relevant policies assembled from the `policies/` directory.
451
+ - This makes policies visible to non-Claude agents that read AGENTS.md.
452
+
453
+ ### Implications for the Toolkit
454
+
455
+ The policy system slots cleanly into the existing architecture without duplicating any existing system:
456
+
457
+ ```
458
+ Policy injection (pre-execution, positive, advisory)
459
+
460
+ Agent executes batch
461
+
462
+ Hookify (pre-write, behavioral, block/warn)
463
+
464
+ lesson-check.sh (post-batch, anti-pattern, block)
465
+
466
+ quality-gate.sh (post-batch, composite, block)
467
+
468
+ policy-check.sh (post-batch, convention, advisory) [NEW]
469
+
470
+ lesson-scanner (post-implementation, semantic, advisory)
471
+ ```
472
+
473
+ **Confidence: High.** The mapping is clean and non-overlapping.
474
+
475
+ ---
476
+
477
+ ## 8. Policy Lifecycle
478
+
479
+ ### Findings
480
+
481
+ Policies, like code, need a lifecycle: creation, testing, versioning, and retirement. Without this, stale or contradictory policies accumulate and degrade agent performance.
482
+
483
+ #### Creation
484
+
485
+ Based on patterns from ESLint shareable configs and the toolkit's lesson system:
486
+
487
+ 1. **Discovery:** A team member identifies a recurring pattern that should be standardized (not a bug -- that's a lesson).
488
+ 2. **Drafting:** Write the policy as a positive instruction with an optional code example.
489
+ 3. **Testing:** Run the policy through at least one batch execution and verify the agent follows it.
490
+ 4. **Review:** Peer review (or `/counter` adversarial review) to check for ambiguity, conflicts with existing policies, and enforceability.
491
+ 5. **Merge:** Add to `policies/` directory.
492
+
493
+ #### Testing
494
+
495
+ Policies are harder to test than lessons (which have grep-detectable patterns). Testing approaches:
496
+
497
+ - **Behavioral test:** Run a controlled batch with and without the policy. Diff the output. Does the policy produce measurably different code?
498
+ - **Compliance audit:** After a batch, grep for evidence of policy compliance (e.g., all new functions have type hints).
499
+ - **Contradiction check:** Automated scan for policies that conflict with each other or with existing lessons.
500
+
501
+ #### Versioning
502
+
503
+ - Policies live in git alongside code. Changes are tracked via commits.
504
+ - Each policy file has a `last_reviewed` date in frontmatter.
505
+ - Policies not reviewed in 90 days get flagged by `entropy-audit.sh`.
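The 90-day staleness flag could be computed along these lines. The toolkit's `entropy-audit.sh` is a shell script, so this Python version is only an illustration of the date arithmetic. The function name `stale_policies` and the input mapping are assumptions.

```python
from datetime import date


def stale_policies(last_reviewed, today, max_age_days=90):
    """Return policy names whose `last_reviewed` date is more than
    `max_age_days` before `today`. Dates are ISO strings (YYYY-MM-DD)."""
    return sorted(
        name
        for name, reviewed in last_reviewed.items()
        if (today - date.fromisoformat(reviewed)).days > max_age_days
    )


REVIEWED = {
    "python.md": "2026-02-22",
    "testing.md": "2025-10-01",
    "project.md": "2025-12-01",  # exactly 90 days before the sample date
}
STALE = stale_policies(REVIEWED, today=date(2026, 3, 1))
```

Passing `today` explicitly keeps the check deterministic, which makes the threshold easy to recalibrate and test.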
506
+
507
+ #### Retirement
508
+
509
+ A policy should be retired when:
510
+ 1. It has been superseded by a linter rule (deterministic enforcement > probabilistic).
511
+ 2. The convention it enforces has become default LLM behavior (Claude already does it without being told).
512
+ 3. It consistently produces false positives or conflicts with other policies.
513
+ 4. The technology it targets is no longer used in the project.
514
+
515
+ **Retirement process:** Move to `policies/archived/` with a note explaining why. Never delete -- stale policies may become relevant again.
516
+
517
+ ### Implications for the Toolkit
518
+
519
+ Add to `entropy-audit.sh`:
520
+ - Check for policies with `last_reviewed` > 90 days ago.
521
+ - Check for policies that reference files or patterns no longer in the codebase.
522
+ - Check for policy-lesson gaps (a negative lesson says "don't X" but no positive policy says "do Y instead").
523
+
524
+ Policy template:
525
+
526
+ ```markdown
527
+ ---
528
+ scope: auto-attach
529
+ globs: ["*.py"]
530
+ last_reviewed: 2026-02-22
531
+ source: team-convention # or: lesson-derived, community, framework-default
532
+ ---
533
+ # Python Error Handling
534
+
535
+ - Always catch specific exception types (ValueError, KeyError, ConnectionError), never bare `except:`
536
+ - Log the exception with `logger.exception()` before any fallback behavior
537
+ - Use `contextlib.suppress()` only when the suppression is intentional and documented with a comment
538
+
539
+ ## Example
540
+ ```python
541
+ # Correct
542
+ try:
543
+     result = parse_config(path)
544
+ except (FileNotFoundError, json.JSONDecodeError) as e:
545
+     logger.exception("Config parse failed for %s", path)
546
+     result = DEFAULT_CONFIG
547
+ ```
548
+ ```
549
+
550
+ **Confidence: High** on the lifecycle model. **Medium** on the 90-day review cadence (arbitrary, needs calibration).
551
+
552
+ ---
553
+
554
+ ## 9. Measurability
555
+
556
+ ### Findings
557
+
558
+ Measuring policy effectiveness is the weakest area across all tools studied. No tool provides built-in policy compliance metrics. The industry relies on:
559
+
560
+ 1. **Task compliance rate:** How often the agent produces code that follows the policy. Measured via post-hoc audit of generated code. Industry recommendation: 80% automated evaluation + 20% expert review.
561
+
562
+ 2. **Policy violation rate over time:** Track how often `policy-check.sh` flags violations. A declining trend suggests the policy is working. A flat trend suggests the agent is ignoring it.
563
+
564
+ 3. **Policy-triggered lesson rate:** If a policy's subject area keeps generating new lessons, the policy isn't effective enough. The rate of new lessons in policy-covered areas should trend toward zero.
565
+
566
+ 4. **Before/after code quality metrics:** Run the same batch with and without policies. Measure: test pass rate, lint violations, code review findings, time to completion. This is the gold standard but expensive to run.
567
+
568
+ 5. **Agent self-report:** Ask the agent to report which policies it consulted during execution. Low-cost signal, but unreliable (agents may hallucinate compliance).
569
+
570
+ #### What metrics matter most
571
+
572
+ From the DX Research "Measuring AI Code Assistants and Agents" framework:
573
+
574
+ | Metric | What It Measures | Cost to Collect |
575
+ |--------|-----------------|-----------------|
576
+ | Utilization | Is the policy being injected? | Low (log injection events) |
577
+ | Compliance | Does the output follow the policy? | Medium (grep/audit post-batch) |
578
+ | Impact | Does the policy improve quality outcomes? | High (A/B testing, longitudinal tracking) |
579
+
580
+ #### Practical approach for the toolkit
581
+
582
+ Given the toolkit's existing infrastructure:
583
+
584
+ 1. **Log policy injection.** When `run-plan.sh` assembles a prompt, log which policies were injected. Stored in `logs/policy-injection.log`.
585
+ 2. **Grep-audit compliance.** Add optional `compliance_check` field to policy frontmatter: a grep pattern that should appear in compliant code. `policy-check.sh` runs these after each batch.
586
+ 3. **Track violations in failure-patterns.json.** Extend the existing failure pattern learning to include policy violations. If a policy-related issue recurs, escalate to lesson.
587
+ 4. **Monthly review.** During `/reflect`, review policy compliance logs. Retire ineffective policies.
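Steps 1 and 2 above can be illustrated with a small compliance audit. This is a sketch, not the proposed `policy-check.sh`: it uses Python's `re` module rather than grep (the two regex dialects differ in places), and the function name, the input shapes, and the result fields are assumptions.

```python
import json
import re


def check_compliance(compliance_checks, changed_files):
    """For each policy, report the changed files that lack its pattern.

    `compliance_checks` maps policy name -> regex that should appear in
    compliant code. `changed_files` maps path -> file contents.
    """
    results = []
    for policy, pattern in compliance_checks.items():
        regex = re.compile(pattern)
        missing = sorted(
            path for path, source in changed_files.items()
            if not regex.search(source)
        )
        results.append({"policy": policy, "compliant": not missing,
                        "missing_in": missing})
    return results


CHECKS = {"python-error-handling": r"logger\.exception\("}
FILES = {
    "a.py": "try:\n    run()\nexcept ValueError:\n    logger.exception('failed')\n",
    "b.py": "try:\n    run()\nexcept ValueError:\n    pass\n",
}
RESULTS = check_compliance(CHECKS, FILES)
LOG_LINE = json.dumps(RESULTS[0])  # one JSON record per policy, ready to append to a log
```

A presence-only pattern is a coarse signal: a file that needs no error handling would also be reported, so results are better treated as advisory.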
588
+
589
+ ### Implications for the Toolkit
590
+
591
+ Build measurability into the policy system from day one, but keep it lightweight:
592
+
593
+ ```
594
+ Policy injection → log which policies applied
595
+
596
+ Batch execution
597
+
598
+ policy-check.sh → grep for compliance patterns
599
+
600
+ Log results to logs/policy-compliance.json
601
+
602
+ Monthly: /reflect reviews compliance trends
603
+
604
+ Decision: keep / revise / retire / escalate to lesson
605
+ ```
606
+
607
+ **Confidence: Medium.** No tool has solved this well. The proposed approach is practical but unvalidated.
608
+
609
+ ---
610
+
611
+ ## 10. Community Policies
612
+
613
+ ### Findings
614
+
615
+ The lesson system already supports community contribution (`/submit-lesson` -> PR). Can policies be shared similarly?
616
+
617
+ #### Transferability comparison
618
+
619
+ | Characteristic | Anti-Pattern Lessons | Positive Policies |
620
+ |---------------|---------------------|-------------------|
621
+ | Transferability | High -- bugs are universal | Medium -- conventions are context-dependent |
622
+ | Example | "Bare except swallows errors" (true everywhere) | "Use Pydantic for API schemas" (only if you use Pydantic) |
623
+ | Specificity | Narrow (one pattern per lesson) | Broad (multiple conventions per policy) |
624
+ | Overlap risk | Low (bugs are distinct) | High (my "clean code" != your "clean code") |
625
+ | Distribution model | Single files, additive | Sets/packs, composable |
626
+
627
+ #### Community policy distribution models
628
+
629
+ 1. **ESLint model (npm packages):** Versioned, named configs. `eslint-config-airbnb` sets a standard that millions use. Proven at scale. Requires a package manager.
630
+
631
+ 2. **Cursor community model (awesome-cursorrules):** Git repos with categorized rule files. Users copy what they need. No versioning, no dependency management. Simple but fragile.
632
+
633
+ 3. **Junie model (guidelines catalog):** Official repo with technology-specific guideline files. Community contributes via PR. Curated but slow to update.
634
+
635
+ 4. **Aider model (conventions repo):** `github.com/Aider-AI/conventions` -- shared conventions files. Simple directory of markdown files.
636
+
637
+ #### Proposed model for the toolkit
638
+
639
+ **Policy packs** -- curated sets of policies for specific technology stacks, distributed as directories:
640
+
641
+ ```
642
+ community-policies/
643
+   python-standard/
644
+     error-handling.md
645
+     type-hints.md
646
+     testing.md
647
+     imports.md
648
+   typescript-standard/
649
+     error-handling.md
650
+     types.md
651
+     testing.md
652
+   fastapi/
653
+     api-conventions.md
654
+     pydantic-models.md
655
+   react/
656
+     component-patterns.md
657
+     state-management.md
658
+ ```
659
+
660
+ Users install a pack:
661
+
662
+ ```bash
663
+ # Copy a policy pack into your project
664
+ cp -r community-policies/python-standard/ policies/
665
+
666
+ # Or symlink for auto-updates
667
+ ln -s path/to/community-policies/python-standard/ policies/python
668
+ ```
669
+
670
+ Each pack has a `manifest.md` describing what it covers, dependencies, and compatibility.
671
+
672
+ ### Implications for the Toolkit
673
+
674
+ Community policies are viable but require more curation than lessons:
675
+ - **Lessons are additive** (each lesson catches one specific bug -- no conflicts).
676
+ - **Policies can conflict** ("use dataclasses" vs. "use Pydantic" vs. "use TypedDict").
677
+ - **Solution:** Policy packs declare what they cover. Users choose one pack per domain. Conflict detection in `entropy-audit.sh`.
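The "one pack per domain" rule suggests a simple conflict check. The sketch below assumes each pack's manifest can be reduced to a list of domain names. The function name `find_conflicts` and the domain labels are hypothetical.

```python
from collections import defaultdict


def find_conflicts(pack_coverage):
    """Return domains claimed by more than one installed pack.

    `pack_coverage` maps pack name -> list of domains it declares.
    """
    owners = defaultdict(list)
    for pack, domains in pack_coverage.items():
        for domain in domains:
            owners[domain].append(pack)
    return {domain: sorted(packs) for domain, packs in owners.items()
            if len(packs) > 1}


INSTALLED = {
    "python-standard": ["error-handling", "structured-data"],
    "fastapi": ["api-conventions", "structured-data"],
    "react": ["component-patterns"],
}
CONFLICTS = find_conflicts(INSTALLED)
```

Here both `python-standard` and `fastapi` claim `structured-data`, which is the kind of "dataclasses vs. Pydantic" overlap the audit would surface for a human to resolve.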
678
+
679
+ **Confidence: Medium.** The ESLint model proves community standards work at scale. Whether this translates to LLM-consumed natural language policies is unproven.
680
+
681
+ ---
682
+
683
+ ## Policy System Design Recommendation
684
+
685
+ ### Architecture
686
+
687
+ ```
688
+ policies/                  # Policy directory (per-project)
689
+   universal.md             # Always injected (cross-language)
690
+   python.md                # Auto-attached for *.py
691
+   javascript.md            # Auto-attached for *.js/*.ts
692
+   testing.md               # Auto-attached for test files
693
+   project.md               # Project-specific conventions
694
+   archived/                # Retired policies (never delete)
695
+
696
+ scripts/
697
+   policy-check.sh          # Post-batch compliance audit [NEW]
698
+   lib/policy-inject.sh     # Policy assembly for prompt injection [NEW]
699
+ ```
700
+
701
+ ### Policy File Format
702
+
703
+ ```markdown
704
+ ---
705
+ scope: auto-attach # always | auto-attach | on-demand
706
+ globs: ["*.py", "*.pyi"] # file patterns (for auto-attach scope)
707
+ last_reviewed: 2026-02-22
708
+ source: team-convention # team-convention | lesson-derived | community | framework
709
+ related_lessons: [1, 7] # cross-reference to lesson IDs
710
+ ---
711
+ # Python Error Handling
712
+
713
+ - Always catch specific exception types, never bare `except:`
714
+ - Log exceptions with `logger.exception()` before any fallback
715
+ - Use `contextlib.suppress()` only with an explanatory comment
716
+
717
+ ## Example
718
+
719
+ ```python
720
+ try:
721
+     result = parse_config(path)
722
+ except (FileNotFoundError, json.JSONDecodeError) as e:
723
+     logger.exception("Config parse failed for %s", path)
724
+     result = DEFAULT_CONFIG
725
+ ```
726
+
727
+ ## Compliance Check
728
+
729
+ ```bash
730
+ # Verify no bare except in changed files
731
+ ! grep -n 'except:' "$FILE" || echo "POLICY VIOLATION: Use specific exception types"
732
+ ```
733
+ ```
734
+
735
+ ### Injection Pipeline
736
+
737
+ During headless execution (`run-plan.sh`), before each batch:
738
+
739
+ 1. Read `policies/universal.md` (always).
740
+ 2. Identify file types in the current batch.
741
+ 3. Auto-attach matching policies based on globs.
742
+ 4. Count total instruction lines across all injected policies.
743
+ 5. If >100 instructions, warn and truncate least-relevant (on-demand scope first).
744
+ 6. Append assembled policies to the batch prompt (after task description, before CLAUDE.md general instructions).
745
+ 7. Log injected policies to `logs/policy-injection.log`.
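Steps 4 and 5 of the pipeline (count, then trim when over budget) might be sketched as follows. Dropping whole policy files, rather than individual rules, is a simplification chosen for this example. The function name `fit_budget`, the dictionary shape, and the drop order after on-demand policies are assumptions.

```python
# Scopes are dropped in this order; "always" policies are never dropped.
DROP_ORDER = ["on-demand", "auto-attach"]


def fit_budget(policies, max_rules=100):
    """Drop whole policies, least-relevant scope first, until the total
    rule count fits the budget. Returns (kept policies, dropped names)."""
    kept = list(policies)
    dropped = []
    for scope in DROP_ORDER:
        candidates = [p for p in kept if p["scope"] == scope]
        for policy in reversed(candidates):  # drop later entries first
            if sum(len(p["rules"]) for p in kept) <= max_rules:
                break
            kept.remove(policy)
            dropped.append(policy["name"])
    return kept, dropped


ASSEMBLED = [
    {"name": "universal.md", "scope": "always", "rules": ["r"] * 60},
    {"name": "python.md", "scope": "auto-attach", "rules": ["r"] * 30},
    {"name": "reference.md", "scope": "on-demand", "rules": ["r"] * 20},
]
KEPT, DROPPED = fit_budget(ASSEMBLED)
```

In the sample, 110 rules exceed the budget, so the on-demand file is dropped and the remaining 90 rules are injected. The dropped names are what step 7 would record alongside the injected ones.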
746
+
747
+ For interactive sessions, policies are referenced via CLAUDE.md `@import`:
748
+
749
+ ```markdown
750
+ # CLAUDE.md
751
+ @policies/universal.md
752
+ @policies/python.md
753
+ ```
754
+
755
+ ### Enforcement Tiers
756
+
757
+ | Tier | Mechanism | Timing | Action |
758
+ |------|-----------|--------|--------|
759
+ | **Guidance** | Prompt injection | Pre-execution | Advisory -- agent sees policies as instructions |
760
+ | **Audit** | policy-check.sh | Post-batch | Warning -- logs violations, does not block |
761
+ | **Escalation** | Manual review + lesson creation | On repeated violation | Policy becomes lesson, soft becomes hard |
762
+
763
+ ### Integration Points
764
+
765
+ | Existing System | Integration |
766
+ |----------------|-------------|
767
+ | `run-plan.sh` | Call `lib/policy-inject.sh` to assemble policies per batch |
768
+ | `quality-gate.sh` | Add optional `policy-check.sh` step (advisory, non-blocking) |
769
+ | AGENTS.md generation | Include assembled policies in generated AGENTS.md |
770
+ | `entropy-audit.sh` | Add policy staleness check (last_reviewed > 90 days) |
771
+ | Lesson files | Add optional `positive_alternative` field -> auto-generates policy |
772
+ | `/submit-lesson` | Prompt for positive alternative when submitting a lesson |
773
+
774
+ ### Policy Lifecycle
775
+
776
+ ```
777
+ Convention identified
778
+
779
+ Draft policy (positive framing, code example, compliance check)
780
+
781
+ Test: run batch with policy, verify compliance
782
+
783
+ Review: /counter or peer review for ambiguity/conflicts
784
+
785
+ Merge to policies/ directory
786
+
787
+ Monitor: policy-check.sh logs compliance rate
788
+
789
+ Monthly /reflect: review compliance trends
790
+
791
+ Decision: keep | revise | retire | escalate to lesson+hookify
792
+ ```
793
+
794
+ ### Implementation Plan (Suggested Batches)
795
+
796
+ | Batch | Scope | Deliverables |
797
+ |-------|-------|-------------|
798
+ | 1 | Foundation | `policies/` directory, policy file format, `universal.md` with 10-15 starter rules |
799
+ | 2 | Injection | `lib/policy-inject.sh`, integration with `run-plan.sh` prompt assembly |
800
+ | 3 | Audit | `policy-check.sh`, integration with `quality-gate.sh` (advisory mode) |
801
+ | 4 | Lesson bridge | `positive_alternative` field in lesson template, auto-generation of policy entries from lessons |
802
+ | 5 | AGENTS.md | Extend AGENTS.md generation to include assembled policies |
803
+ | 6 | Measurability | `logs/policy-compliance.json`, compliance trend reporting |
804
+ | 7 | Community | Policy pack format, `manifest.md`, conflict detection in `entropy-audit.sh` |
805
+
+ ### Starter Policies (Batch 1)
+
+ Based on the toolkit's existing lessons and common cross-project conventions:
+
+ **universal.md:**
+ 1. Handle errors explicitly -- catch specific exception types and log before fallback
+ 2. Include type annotations on all function signatures
+ 3. Write docstrings for public functions and classes
+ 4. Use descriptive variable names -- no single-letter names except loop indices
+ 5. Keep functions under 50 lines -- extract helpers when they grow
+ 6. Return early for error conditions -- happy path last
+ 7. Use constants for magic numbers and strings
+ 8. Commit after each logical unit of work with a descriptive message
+ 9. Write the test first, then the implementation
+ 10. When importing, prefer explicit imports over wildcards
+
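As an illustration of how several of the universal rules combine in practice (rules 1, 2, 3, 4, 6, and 7), here is a small sketch. The function `parse_port` and its constants are hypothetical examples, not part of the toolkit.

```python
import logging
from typing import Optional

logger = logging.getLogger(__name__)

# Rule 7: named constants instead of magic numbers.
DEFAULT_PORT = 8080
MIN_PORT = 1
MAX_PORT = 65535


def parse_port(raw_value: Optional[str]) -> int:
    """Parse a TCP port from text, falling back to DEFAULT_PORT on bad input."""
    # Rule 6: return early for error conditions; the happy path comes last.
    if raw_value is None:
        return DEFAULT_PORT
    try:
        port = int(raw_value)
    except ValueError:
        # Rule 1: catch a specific exception type and log before the fallback.
        logger.warning("Invalid port %r; using default %d", raw_value, DEFAULT_PORT)
        return DEFAULT_PORT
    if not MIN_PORT <= port <= MAX_PORT:
        logger.warning("Port %d out of range; using default %d", port, DEFAULT_PORT)
        return DEFAULT_PORT
    return port
```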
+ **python.md:**
+ 1. Use pathlib for all file path operations
+ 2. Use f-strings for string formatting
+ 3. Use dataclasses or Pydantic for structured data, not plain dicts
+ 4. Use `contextlib.suppress()` for intentional exception suppression, with a comment
+ 5. Use `logging.exception()` in except blocks to capture tracebacks
+
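The five Python rules above might look like this when applied together. This is a sketch under assumptions: `Settings`, `load_settings`, and `remove_cache` are hypothetical names chosen for illustration.

```python
import contextlib
import json
import logging
from dataclasses import dataclass
from pathlib import Path
from typing import Optional

logger = logging.getLogger(__name__)


@dataclass
class Settings:
    """Structured settings (rule 3: a dataclass instead of a plain dict)."""

    name: str
    retries: int

    def describe(self) -> str:
        # Rule 2: f-string formatting.
        return f"{self.name} (retries={self.retries})"


def load_settings(config_path: Path) -> Optional[Settings]:
    """Load settings from a JSON file, or return None if loading fails."""
    try:
        # Rule 1: pathlib handles the file access.
        raw = json.loads(config_path.read_text(encoding="utf-8"))
        return Settings(name=raw["name"], retries=int(raw["retries"]))
    except (OSError, ValueError, KeyError, TypeError):
        # Rule 5: logging.exception() records the traceback.
        logger.exception("Could not load settings from %s", config_path)
        return None


def remove_cache(cache_path: Path) -> None:
    """Delete a cache file if it exists."""
    # Rule 4: a missing cache file is fine; there is nothing to delete.
    with contextlib.suppress(FileNotFoundError):
        cache_path.unlink()
```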
+ **testing.md:**
+ 1. Use pytest fixtures over setUp/tearDown methods
+ 2. Name tests descriptively: `test_<function>_<scenario>_<expected_result>`
+ 3. Assert specific values, not truthiness
+ 4. Use `pytest.raises()` for expected exceptions, not try/except in tests
+ 5. One logical assertion per test function
+
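A short test module following the testing rules above could look like the sketch below. It assumes pytest is installed; `parse_ratio` is a hypothetical function under test, included only so the example is self-contained.

```python
import pytest


def parse_ratio(text: str) -> float:
    """Parse a string such as "1/2" into a float (hypothetical code under test)."""
    numerator, denominator = text.split("/")
    return int(numerator) / int(denominator)


@pytest.fixture
def half_ratio_text() -> str:
    """Rule 1: shared input provided by a fixture instead of setUp."""
    return "1/2"


# Rule 2: test_<function>_<scenario>_<expected_result>
def test_parse_ratio_simple_fraction_returns_quotient(half_ratio_text: str) -> None:
    # Rules 3 and 5: a single assertion on a specific value.
    assert parse_ratio(half_ratio_text) == 0.5


def test_parse_ratio_zero_denominator_raises_zero_division_error() -> None:
    # Rule 4: pytest.raises() instead of try/except.
    with pytest.raises(ZeroDivisionError):
        parse_ratio("1/0")
```

Running `pytest` on a file containing this module collects both tests and injects the fixture by parameter name.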
+ ---
+
+ ## Sources
+
+ ### Official Documentation
+ - [Anthropic: Best Practices for Claude Code](https://code.claude.com/docs/en/best-practices)
+ - [Anthropic: Demystifying Evals for AI Agents](https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents)
+ - [GitHub: Adding Repository Custom Instructions for Copilot](https://docs.github.com/copilot/customizing-copilot/adding-custom-instructions-for-github-copilot)
+ - [Amazon Q Developer: Project Rules](https://docs.aws.amazon.com/amazonq/latest/qdeveloper-ug/command-line-project-rules.html)
+ - [Amazon Q Developer: Creating Project Rules](https://docs.aws.amazon.com/amazonq/latest/qdeveloper-ug/context-project-rules.html)
+ - [Aider: Specifying Coding Conventions](https://aider.chat/docs/usage/conventions.html)
+ - [ESLint: Shareable Configs](https://eslint.org/docs/latest/extend/shareable-configs)
+ - [AGENTS.md Specification](https://agents.md/)
+ - [AGENTS.md GitHub Repository](https://github.com/agentsmd/agents.md)
+ - [OpenAI Codex: Custom Instructions with AGENTS.md](https://developers.openai.com/codex/guides/agents-md/)
+
+ ### Research and Analysis
+ - [The Pink Elephant Problem: Why "Don't Do That" Fails with LLMs](https://eval.16x.engineer/blog/the-pink-elephant-negative-instructions-llms-effectiveness-analysis)
+ - [Why Positive Prompts Outperform Negative Ones with LLMs](https://gadlet.com/posts/negative-prompting/)
+ - [Understanding the Relationship Between LLMs and Negation (Swimm)](https://swimm.io/blog/understanding-llms-and-negation)
+ - [Prompt Length vs. Context Window: The Real Limits of LLM Performance (HackerNoon)](https://hackernoon.com/prompt-length-vs-context-window-the-real-limits-of-llm-performance)
+ - [Why Long System Prompts Hurt Context Windows](https://medium.com/data-science-collective/why-long-system-prompts-hurt-context-windows-and-how-to-fix-it-7a3696e1cdf9)
+ - [DX Research: Measuring AI Code Assistants and Agents](https://getdx.com/research/measuring-ai-code-assistants-and-agents/)
+ - [Three Metrics for Measuring the Impact of AI on Code Quality](https://getdx.com/blog/3-metrics-for-measuring-the-impact-of-ai-on-code-quality/)
+
+ ### Practitioner Guides
+ - [Writing a Good CLAUDE.md (HumanLayer)](https://www.humanlayer.dev/blog/writing-a-good-claude-md)
+ - [Creating the Perfect CLAUDE.md (Dometrain)](https://dometrain.com/blog/creating-the-perfect-claudemd-for-claude-code/)
+ - [How to Write Great Cursor Rules (Trigger.dev)](https://trigger.dev/blog/cursor-rules)
+ - [Top Cursor Rules for Coding Agents (PromptHub)](https://www.prompthub.us/blog/top-cursor-rules-for-coding-agents)
+ - [Windsurf AI Rules Guide](https://uibakery.io/blog/windsurf-ai-rules)
+ - [Improve Your AI Code Output with AGENTS.md (Builder.io)](https://www.builder.io/blog/agents-md)
+ - [A Complete Guide to AGENTS.md (AI Hero)](https://www.aihero.dev/a-complete-guide-to-agents-md)
+ - [Coding Guidelines for Your AI Agents (JetBrains)](https://blog.jetbrains.com/idea/2025/05/coding-guidelines-for-your-ai-agents/)
+ - [JetBrains Junie Guidelines Catalog](https://github.com/JetBrains/junie-guidelines)
+ - [Mastering Amazon Q Developer with Rules (AWS Blog)](https://aws.amazon.com/blogs/devops/mastering-amazon-q-developer-with-rules/)
+
+ ### Community Resources
+ - [awesome-cursorrules (GitHub)](https://github.com/PatrickJS/awesome-cursorrules)
+ - [awesome-cursor-rules-mdc (GitHub)](https://github.com/sanjeed5/awesome-cursor-rules-mdc)
+ - [Aider Conventions Repository](https://github.com/Aider-AI/conventions)
+ - [awesome-claude-code (GitHub)](https://github.com/hesreallyhim/awesome-claude-code)
+ - [dotcursorrules.com](https://dotcursorrules.com/)
+
+ ### Policy as Code
+ - [Open Policy Agent (OPA) Documentation](https://www.openpolicyagent.org/docs/latest/)
+ - [Policy as Code: Introduction to Open Policy Agent (GitGuardian)](https://blog.gitguardian.com/what-is-policy-as-code-an-introduction-to-open-policy-agent/)
+
+ ### Academic
+ - [LLMBar: Evaluating LLMs at Evaluating Instruction Following (ICLR 2024)](https://github.com/princeton-nlp/LLMBar)
+ - [Source Framing Triggers Systematic Bias in LLMs (Science Advances, 2025)](https://www.science.org/doi/10.1126/sciadv.adz2924)