shipwright-cli 3.2.0 → 3.3.0

Files changed (279)
  1. package/.claude/agents/code-reviewer.md +2 -0
  2. package/.claude/agents/devops-engineer.md +2 -0
  3. package/.claude/agents/doc-fleet-agent.md +2 -0
  4. package/.claude/agents/pipeline-agent.md +2 -0
  5. package/.claude/agents/shell-script-specialist.md +2 -0
  6. package/.claude/agents/test-specialist.md +2 -0
  7. package/.claude/hooks/agent-crash-capture.sh +32 -0
  8. package/.claude/hooks/post-tool-use.sh +3 -2
  9. package/.claude/hooks/pre-tool-use.sh +35 -3
  10. package/README.md +4 -4
  11. package/claude-code/hooks/config-change.sh +18 -0
  12. package/claude-code/hooks/instructions-reloaded.sh +7 -0
  13. package/claude-code/hooks/worktree-create.sh +25 -0
  14. package/claude-code/hooks/worktree-remove.sh +20 -0
  15. package/config/code-constitution.json +130 -0
  16. package/dashboard/middleware/auth.ts +134 -0
  17. package/dashboard/middleware/constants.ts +21 -0
  18. package/dashboard/public/index.html +2 -6
  19. package/dashboard/public/styles.css +100 -97
  20. package/dashboard/routes/auth.ts +38 -0
  21. package/dashboard/server.ts +66 -25
  22. package/dashboard/services/config.ts +26 -0
  23. package/dashboard/services/db.ts +118 -0
  24. package/dashboard/src/canvas/pixel-agent.ts +298 -0
  25. package/dashboard/src/canvas/pixel-sprites.ts +440 -0
  26. package/dashboard/src/canvas/shipyard-effects.ts +367 -0
  27. package/dashboard/src/canvas/shipyard-scene.ts +616 -0
  28. package/dashboard/src/canvas/submarine-layout.ts +267 -0
  29. package/dashboard/src/components/header.ts +8 -7
  30. package/dashboard/src/core/router.ts +1 -0
  31. package/dashboard/src/design/submarine-theme.ts +253 -0
  32. package/dashboard/src/main.ts +2 -0
  33. package/dashboard/src/types/api.ts +2 -1
  34. package/dashboard/src/views/activity.ts +2 -1
  35. package/dashboard/src/views/shipyard.ts +39 -0
  36. package/dashboard/types/index.ts +166 -0
  37. package/docs/plans/2026-02-28-compound-audit-and-shipyard-design.md +186 -0
  38. package/docs/plans/2026-02-28-skipper-shipwright-implementation-plan.md +1182 -0
  39. package/docs/plans/2026-02-28-skipper-shipwright-integration-design.md +531 -0
  40. package/docs/plans/2026-03-01-ai-powered-skill-injection-design.md +298 -0
  41. package/docs/plans/2026-03-01-ai-powered-skill-injection-plan.md +1109 -0
  42. package/docs/plans/2026-03-01-capabilities-cleanup-plan.md +658 -0
  43. package/docs/plans/2026-03-01-clean-architecture-plan.md +924 -0
  44. package/docs/plans/2026-03-01-compound-audit-cascade-design.md +191 -0
  45. package/docs/plans/2026-03-01-compound-audit-cascade-plan.md +921 -0
  46. package/docs/plans/2026-03-01-deep-integration-plan.md +851 -0
  47. package/docs/plans/2026-03-01-pipeline-audit-trail-design.md +145 -0
  48. package/docs/plans/2026-03-01-pipeline-audit-trail-plan.md +770 -0
  49. package/docs/plans/2026-03-01-refined-depths-brand-design.md +382 -0
  50. package/docs/plans/2026-03-01-refined-depths-implementation.md +599 -0
  51. package/docs/plans/2026-03-01-skipper-kernel-integration-design.md +203 -0
  52. package/docs/plans/2026-03-01-unified-platform-design.md +272 -0
  53. package/docs/plans/2026-03-07-claude-code-feature-integration-design.md +189 -0
  54. package/docs/plans/2026-03-07-claude-code-feature-integration-plan.md +1165 -0
  55. package/docs/research/BACKLOG_QUICK_REFERENCE.md +352 -0
  56. package/docs/research/CUTTING_EDGE_RESEARCH_2026.md +546 -0
  57. package/docs/research/RESEARCH_INDEX.md +439 -0
  58. package/docs/research/RESEARCH_SOURCES.md +440 -0
  59. package/docs/research/RESEARCH_SUMMARY.txt +275 -0
  60. package/docs/superpowers/specs/2026-03-10-pipeline-quality-revolution-design.md +341 -0
  61. package/package.json +2 -2
  62. package/scripts/lib/adaptive-model.sh +427 -0
  63. package/scripts/lib/adaptive-timeout.sh +316 -0
  64. package/scripts/lib/audit-trail.sh +309 -0
  65. package/scripts/lib/auto-recovery.sh +471 -0
  66. package/scripts/lib/bandit-selector.sh +431 -0
  67. package/scripts/lib/bootstrap.sh +104 -2
  68. package/scripts/lib/causal-graph.sh +455 -0
  69. package/scripts/lib/compat.sh +126 -0
  70. package/scripts/lib/compound-audit.sh +337 -0
  71. package/scripts/lib/constitutional.sh +454 -0
  72. package/scripts/lib/context-budget.sh +359 -0
  73. package/scripts/lib/convergence.sh +594 -0
  74. package/scripts/lib/cost-optimizer.sh +634 -0
  75. package/scripts/lib/daemon-adaptive.sh +10 -0
  76. package/scripts/lib/daemon-dispatch.sh +106 -17
  77. package/scripts/lib/daemon-failure.sh +34 -4
  78. package/scripts/lib/daemon-patrol.sh +23 -2
  79. package/scripts/lib/daemon-poll-github.sh +361 -0
  80. package/scripts/lib/daemon-poll-health.sh +299 -0
  81. package/scripts/lib/daemon-poll.sh +27 -611
  82. package/scripts/lib/daemon-state.sh +112 -66
  83. package/scripts/lib/daemon-triage.sh +10 -0
  84. package/scripts/lib/dod-scorecard.sh +442 -0
  85. package/scripts/lib/error-actionability.sh +300 -0
  86. package/scripts/lib/formal-spec.sh +461 -0
  87. package/scripts/lib/helpers.sh +177 -4
  88. package/scripts/lib/intent-analysis.sh +409 -0
  89. package/scripts/lib/loop-convergence.sh +350 -0
  90. package/scripts/lib/loop-iteration.sh +682 -0
  91. package/scripts/lib/loop-progress.sh +48 -0
  92. package/scripts/lib/loop-restart.sh +185 -0
  93. package/scripts/lib/memory-effectiveness.sh +506 -0
  94. package/scripts/lib/mutation-executor.sh +352 -0
  95. package/scripts/lib/outcome-feedback.sh +521 -0
  96. package/scripts/lib/pipeline-cli.sh +336 -0
  97. package/scripts/lib/pipeline-commands.sh +1216 -0
  98. package/scripts/lib/pipeline-detection.sh +100 -2
  99. package/scripts/lib/pipeline-execution.sh +897 -0
  100. package/scripts/lib/pipeline-github.sh +28 -3
  101. package/scripts/lib/pipeline-intelligence-compound.sh +431 -0
  102. package/scripts/lib/pipeline-intelligence-scoring.sh +407 -0
  103. package/scripts/lib/pipeline-intelligence-skip.sh +181 -0
  104. package/scripts/lib/pipeline-intelligence.sh +100 -1136
  105. package/scripts/lib/pipeline-quality-bash-compat.sh +182 -0
  106. package/scripts/lib/pipeline-quality-checks.sh +17 -715
  107. package/scripts/lib/pipeline-quality-gates.sh +563 -0
  108. package/scripts/lib/pipeline-stages-build.sh +730 -0
  109. package/scripts/lib/pipeline-stages-delivery.sh +965 -0
  110. package/scripts/lib/pipeline-stages-intake.sh +1133 -0
  111. package/scripts/lib/pipeline-stages-monitor.sh +407 -0
  112. package/scripts/lib/pipeline-stages-review.sh +1022 -0
  113. package/scripts/lib/pipeline-stages.sh +59 -2929
  114. package/scripts/lib/pipeline-state.sh +36 -5
  115. package/scripts/lib/pipeline-util.sh +487 -0
  116. package/scripts/lib/policy-learner.sh +438 -0
  117. package/scripts/lib/process-reward.sh +493 -0
  118. package/scripts/lib/project-detect.sh +649 -0
  119. package/scripts/lib/quality-profile.sh +334 -0
  120. package/scripts/lib/recruit-commands.sh +885 -0
  121. package/scripts/lib/recruit-learning.sh +739 -0
  122. package/scripts/lib/recruit-roles.sh +648 -0
  123. package/scripts/lib/reward-aggregator.sh +458 -0
  124. package/scripts/lib/rl-optimizer.sh +362 -0
  125. package/scripts/lib/root-cause.sh +427 -0
  126. package/scripts/lib/scope-enforcement.sh +445 -0
  127. package/scripts/lib/session-restart.sh +493 -0
  128. package/scripts/lib/skill-memory.sh +300 -0
  129. package/scripts/lib/skill-registry.sh +775 -0
  130. package/scripts/lib/spec-driven.sh +476 -0
  131. package/scripts/lib/test-helpers.sh +18 -7
  132. package/scripts/lib/test-holdout.sh +429 -0
  133. package/scripts/lib/test-optimizer.sh +511 -0
  134. package/scripts/shipwright-file-suggest.sh +45 -0
  135. package/scripts/skills/adversarial-quality.md +61 -0
  136. package/scripts/skills/api-design.md +44 -0
  137. package/scripts/skills/architecture-design.md +50 -0
  138. package/scripts/skills/brainstorming.md +43 -0
  139. package/scripts/skills/data-pipeline.md +44 -0
  140. package/scripts/skills/deploy-safety.md +64 -0
  141. package/scripts/skills/documentation.md +38 -0
  142. package/scripts/skills/frontend-design.md +45 -0
  143. package/scripts/skills/generated/.gitkeep +0 -0
  144. package/scripts/skills/generated/_refinements/.gitkeep +0 -0
  145. package/scripts/skills/generated/_refinements/adversarial-quality.patch.md +3 -0
  146. package/scripts/skills/generated/_refinements/architecture-design.patch.md +3 -0
  147. package/scripts/skills/generated/_refinements/brainstorming.patch.md +3 -0
  148. package/scripts/skills/generated/cli-version-management.md +29 -0
  149. package/scripts/skills/generated/collection-system-validation.md +99 -0
  150. package/scripts/skills/generated/large-scale-c-refactoring-coordination.md +97 -0
  151. package/scripts/skills/generated/pattern-matching-similarity-scoring.md +195 -0
  152. package/scripts/skills/generated/test-parallelization-detection.md +65 -0
  153. package/scripts/skills/observability.md +79 -0
  154. package/scripts/skills/performance.md +48 -0
  155. package/scripts/skills/pr-quality.md +49 -0
  156. package/scripts/skills/product-thinking.md +43 -0
  157. package/scripts/skills/security-audit.md +49 -0
  158. package/scripts/skills/systematic-debugging.md +40 -0
  159. package/scripts/skills/testing-strategy.md +47 -0
  160. package/scripts/skills/two-stage-review.md +52 -0
  161. package/scripts/skills/validation-thoroughness.md +55 -0
  162. package/scripts/sw +9 -3
  163. package/scripts/sw-activity.sh +9 -2
  164. package/scripts/sw-adaptive.sh +2 -1
  165. package/scripts/sw-adversarial.sh +2 -1
  166. package/scripts/sw-architecture-enforcer.sh +3 -1
  167. package/scripts/sw-auth.sh +12 -2
  168. package/scripts/sw-autonomous.sh +5 -1
  169. package/scripts/sw-changelog.sh +4 -1
  170. package/scripts/sw-checkpoint.sh +2 -1
  171. package/scripts/sw-ci.sh +5 -1
  172. package/scripts/sw-cleanup.sh +4 -26
  173. package/scripts/sw-code-review.sh +10 -4
  174. package/scripts/sw-connect.sh +2 -1
  175. package/scripts/sw-context.sh +2 -1
  176. package/scripts/sw-cost.sh +48 -3
  177. package/scripts/sw-daemon.sh +66 -9
  178. package/scripts/sw-dashboard.sh +3 -1
  179. package/scripts/sw-db.sh +59 -16
  180. package/scripts/sw-decide.sh +8 -2
  181. package/scripts/sw-decompose.sh +360 -17
  182. package/scripts/sw-deps.sh +4 -1
  183. package/scripts/sw-developer-simulation.sh +4 -1
  184. package/scripts/sw-discovery.sh +325 -2
  185. package/scripts/sw-doc-fleet.sh +4 -1
  186. package/scripts/sw-docs-agent.sh +3 -1
  187. package/scripts/sw-docs.sh +2 -1
  188. package/scripts/sw-doctor.sh +453 -2
  189. package/scripts/sw-dora.sh +4 -1
  190. package/scripts/sw-durable.sh +4 -3
  191. package/scripts/sw-e2e-orchestrator.sh +17 -16
  192. package/scripts/sw-eventbus.sh +7 -1
  193. package/scripts/sw-evidence.sh +364 -12
  194. package/scripts/sw-feedback.sh +550 -9
  195. package/scripts/sw-fix.sh +20 -1
  196. package/scripts/sw-fleet-discover.sh +6 -2
  197. package/scripts/sw-fleet-viz.sh +4 -1
  198. package/scripts/sw-fleet.sh +5 -1
  199. package/scripts/sw-github-app.sh +16 -3
  200. package/scripts/sw-github-checks.sh +3 -2
  201. package/scripts/sw-github-deploy.sh +3 -2
  202. package/scripts/sw-github-graphql.sh +18 -7
  203. package/scripts/sw-guild.sh +5 -1
  204. package/scripts/sw-heartbeat.sh +5 -30
  205. package/scripts/sw-hello.sh +67 -0
  206. package/scripts/sw-hygiene.sh +6 -1
  207. package/scripts/sw-incident.sh +265 -1
  208. package/scripts/sw-init.sh +18 -2
  209. package/scripts/sw-instrument.sh +10 -2
  210. package/scripts/sw-intelligence.sh +42 -6
  211. package/scripts/sw-jira.sh +5 -1
  212. package/scripts/sw-launchd.sh +2 -1
  213. package/scripts/sw-linear.sh +4 -1
  214. package/scripts/sw-logs.sh +4 -1
  215. package/scripts/sw-loop.sh +432 -1128
  216. package/scripts/sw-memory.sh +356 -2
  217. package/scripts/sw-mission-control.sh +6 -1
  218. package/scripts/sw-model-router.sh +481 -26
  219. package/scripts/sw-otel.sh +13 -4
  220. package/scripts/sw-oversight.sh +14 -5
  221. package/scripts/sw-patrol-meta.sh +334 -0
  222. package/scripts/sw-pipeline-composer.sh +5 -1
  223. package/scripts/sw-pipeline-vitals.sh +2 -1
  224. package/scripts/sw-pipeline.sh +53 -2664
  225. package/scripts/sw-pm.sh +12 -5
  226. package/scripts/sw-pr-lifecycle.sh +2 -1
  227. package/scripts/sw-predictive.sh +7 -1
  228. package/scripts/sw-prep.sh +185 -2
  229. package/scripts/sw-ps.sh +5 -25
  230. package/scripts/sw-public-dashboard.sh +15 -3
  231. package/scripts/sw-quality.sh +2 -1
  232. package/scripts/sw-reaper.sh +8 -25
  233. package/scripts/sw-recruit.sh +156 -2303
  234. package/scripts/sw-regression.sh +19 -12
  235. package/scripts/sw-release-manager.sh +3 -1
  236. package/scripts/sw-release.sh +4 -1
  237. package/scripts/sw-remote.sh +3 -1
  238. package/scripts/sw-replay.sh +7 -1
  239. package/scripts/sw-retro.sh +158 -1
  240. package/scripts/sw-review-rerun.sh +3 -1
  241. package/scripts/sw-scale.sh +10 -3
  242. package/scripts/sw-security-audit.sh +6 -1
  243. package/scripts/sw-self-optimize.sh +6 -3
  244. package/scripts/sw-session.sh +9 -3
  245. package/scripts/sw-setup.sh +3 -1
  246. package/scripts/sw-stall-detector.sh +406 -0
  247. package/scripts/sw-standup.sh +15 -7
  248. package/scripts/sw-status.sh +3 -1
  249. package/scripts/sw-strategic.sh +4 -1
  250. package/scripts/sw-stream.sh +7 -1
  251. package/scripts/sw-swarm.sh +18 -6
  252. package/scripts/sw-team-stages.sh +13 -6
  253. package/scripts/sw-templates.sh +5 -29
  254. package/scripts/sw-testgen.sh +7 -1
  255. package/scripts/sw-tmux-pipeline.sh +4 -1
  256. package/scripts/sw-tmux-role-color.sh +2 -0
  257. package/scripts/sw-tmux-status.sh +1 -1
  258. package/scripts/sw-tmux.sh +3 -1
  259. package/scripts/sw-trace.sh +3 -1
  260. package/scripts/sw-tracker-github.sh +3 -0
  261. package/scripts/sw-tracker-jira.sh +3 -0
  262. package/scripts/sw-tracker-linear.sh +3 -0
  263. package/scripts/sw-tracker.sh +3 -1
  264. package/scripts/sw-triage.sh +2 -1
  265. package/scripts/sw-upgrade.sh +3 -1
  266. package/scripts/sw-ux.sh +5 -2
  267. package/scripts/sw-webhook.sh +3 -1
  268. package/scripts/sw-widgets.sh +3 -1
  269. package/scripts/sw-worktree.sh +15 -3
  270. package/scripts/test-skill-injection.sh +1233 -0
  271. package/templates/pipelines/autonomous.json +27 -3
  272. package/templates/pipelines/cost-aware.json +34 -8
  273. package/templates/pipelines/deployed.json +12 -0
  274. package/templates/pipelines/enterprise.json +12 -0
  275. package/templates/pipelines/fast.json +6 -0
  276. package/templates/pipelines/full.json +27 -3
  277. package/templates/pipelines/hotfix.json +6 -0
  278. package/templates/pipelines/standard.json +12 -0
  279. package/templates/pipelines/tdd.json +12 -0
@@ -0,0 +1,352 @@
1
+ # Shipwright Backlog: Quick Reference (20-Item Priority List)
2
+
3
+ ## At-a-Glance Priority Matrix
4
+
5
+ | Priority | ID | Feature | Impact | Effort | ROI | Category |
6
+ | -------- | --- | ---------------------------------------------------- | -------- | -------- | --------------- | ------------- |
7
+ | 🔴 P0 | #1 | Semantic trajectory analysis + convergence detection | 🟢🟢🟢 | 🟡🟡 | **EXCEPTIONAL** | Loop Patterns |
8
+ | 🔴 P0 | #2 | Intent Specification Engine (business → outcomes) | 🟢🟢🟢🟢 | 🔴🔴🔴 | **EXCEPTIONAL** | Dark Factory |
9
+ | 🔴 P0 | #3 | Vulnerability Reward Model + online RL | 🟢🟢🟢 | 🟡🟡 | **EXCEPTIONAL** | RL/Security |
10
+ | 🔴 P0 | #5 | Speculative Cascade Model Routing | 🟢🟢🟢🟢 | 🟡🟡 | **VERY HIGH** | Cost |
11
+ | 🟡 P1 | #4 | Episodic Memory Layer | 🟢🟢🟢 | 🔴🔴🔴 | **HIGH** | Memory |
12
+ | 🟡 P1 | #6 | Mutation Testing Feedback Loop | 🟢🟢🟢 | 🟡🟡 | **HIGH** | Testing |
13
+ | 🟡 P1 | #7 | CI Repair Agent | 🟢🟢🟢 | 🔴🔴🔴 | **HIGH** | Self-Healing |
14
+ | 🟡 P1 | #8 | LLM-as-a-Judge validation | 🟢🟢 | 🟡🟡 | **HIGH** | Quality |
15
+ | 🟢 P2 | #9 | Explicit Conflict Detection + DAG Scheduling | 🟢🟢 | 🟡🟡 | **MEDIUM** | Multi-Agent |
16
+ | 🟢 P2 | #10 | Intelligent Reasoning Budget Allocation | 🟢🟢 | 🟡🟡 | **MEDIUM** | Reasoning |
17
+ | 🟢 P2 | #11 | Formal Verification Integration (Dafny/Lean) | 🟢🟢 | 🔴🔴🔴🔴 | **MEDIUM** | Verification |
18
+ | 🟢 P2 | #12 | Active Context Compression + Semantic Memory | 🟢🟢🟢 | 🔴🔴🔴 | **MEDIUM** | Memory |
19
+ | 🟢 P2 | #13 | Multi-Pass Mutation Generation (LLM-based) | 🟢🟢 | 🔴🔴🔴 | **MEDIUM** | Testing |
20
+ | 🟢 P2 | #14 | Anomaly Detection + Predictive Repair | 🟢🟢 | 🔴🔴🔴 | **MEDIUM** | Self-Healing |
21
+ | 🟢 P2 | #15 | Cross-Repo Fleet Learning | 🟢🟢 | 🔴🔴🔴 | **MEDIUM** | Memory/Fleet |
22
+ | 🟢 P3 | #16 | Quorum-Based Merge Decisions | 🟢 | 🟡 | **LOW** | Quality |
23
+ | 🟢 P3 | #17 | Privacy-Hardening Mutations | 🟢 | 🔴🔴 | **LOW** | Compliance |
24
+ | 🟢 P3 | #18 | Dependency-Aware Task Scheduling (DAG) | 🟢 | 🟡 | **LOW** | Multi-Agent |
25
+ | 🟢 P3 | #19 | Symbol Caching + Semantic Search | 🟢 | 🟡 | **LOW** | Performance |
26
+ | 🟢 P3 | #20 | WebSocket Real-Time Loop Monitoring | 🟢 | 🟡 | **LOW** | Observability |
27
+
28
+ ---
29
+
30
+ ## PHASE 1 (Weeks 1-4): Convergence & Cost
31
+
32
+ ### #1 Semantic Trajectory Analysis + Convergence Detection
33
+
34
+ **What it does:** Tracks embedding-space distance of consecutive agent outputs; detects stuck (contractive) vs wandering (exploratory) loops
35
+
36
+ **Why it matters:**
37
+
38
+ - Current: Hard iteration limit (5 iterations) wastes compute on stuck loops
39
+ - SOTA: Geometric Dynamics paper (arxiv 2512.10350) shows regime detection enables early exit
40
+ - Impact: 25-40% iteration waste reduction
41
+
42
+ **How to implement:**
43
+
44
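+ 1. On each loop iteration: encode agent output to embedding space (via an embedding model; Anthropic does not offer its own embedding model and its documentation points to third-party providers such as Voyage AI)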
45
+ 2. Compute cosine distance to previous iteration's embedding
46
+ 3. Track distance trend (contracting = converging, diverging = exploring)
47
+ 4. Early exit if contracting + distance < threshold
48
+ 5. Escalate to longer thinking if diverging unbounded
49
+
50
+ **Effort:** Medium (embedding integration, vector math, tracking state)
51
+ **Blocking:** Nothing (can implement in isolation)
52
+ **Files to modify:** `sw-loop.sh`, `sw-convergence-test.sh`
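
The five steps above can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the function names, the `exit_threshold` default of 0.05, and the rule that a trend must be strictly monotonic are all assumptions, and the embeddings are taken as given rather than produced by any particular provider.

```python
import math
from typing import List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity; 0.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cannot compare a zero-length embedding")
    return 1.0 - dot / (norm_a * norm_b)


def classify_regime(
    embeddings: List[Sequence[float]],
    exit_threshold: float = 0.05,  # assumed value; the backlog gives no number
) -> str:
    """Classify a loop from the distances between consecutive iterations.

    Returns "converged" (shrinking and below threshold: early exit),
    "contracting" (shrinking but not yet below threshold),
    "diverging" (growing: candidate for escalation), or "undetermined".
    """
    distances = [
        cosine_distance(prev, cur) for prev, cur in zip(embeddings, embeddings[1:])
    ]
    if len(distances) < 2:
        return "undetermined"  # a trend needs at least two distances
    pairs = list(zip(distances, distances[1:]))
    shrinking = all(later < earlier for earlier, later in pairs)
    growing = all(later > earlier for earlier, later in pairs)
    if shrinking and distances[-1] < exit_threshold:
        return "converged"
    if shrinking:
        return "contracting"
    if growing:
        return "diverging"
    return "undetermined"
```

A loop runner would call `classify_regime` after each iteration with the embeddings collected so far, stop on `"converged"`, and escalate on `"diverging"`. A production version would likely smooth the trend over a window instead of requiring strict monotonicity.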
53
+
54
+ ---
55
+
56
+ ### #5 Speculative Cascade Model Routing
57
+
58
+ **What it does:** Try Haiku first (short timeout), escalate to Sonnet → Opus on failure
59
+
60
+ **Why it matters:**
61
+
62
+ - Current: Pick model upfront (per `--effort` flag), no escalation
63
+ - SOTA: Google Speculative Cascades paper; 30-60% cost reduction on median tasks
64
+ - Impact: 40-60% cost reduction while maintaining quality
65
+
66
+ **How to implement:**
67
+
68
+ 1. Build failure prediction model: (query_type, difficulty) → success_rate on Haiku
69
+ 2. For new query: estimate difficulty via embedding similarity
70
+ 3. Route to Haiku with timeout (e.g., 30s)
71
+ 4. If timeout/failure (tests fail), cascade to Sonnet, then Opus
72
+ 5. Track cascade effectiveness per query type in memory
73
+
74
+ **Effort:** Medium (timeout management, cascade orchestration, tracking)
75
+ **Blocking:** Nothing
76
+ **Files to modify:** `sw-model-router.sh`, `sw-loop.sh`, new: `sw-cascade-router.sh`
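
As a rough sketch of steps 3 and 4 above, the cascade can be modelled as an ordered list of tiers tried in turn. Everything named here is hypothetical: the tier labels are plain strings rather than real model identifiers, only the 30s Haiku timeout comes from the backlog text, and `attempt` stands in for whatever actually runs the model and the tests.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical tiers: (label, timeout in seconds). Only the 30s Haiku timeout
# appears in the backlog; the other two values are placeholders.
DEFAULT_TIERS: Sequence[Tuple[str, float]] = (
    ("haiku", 30.0),
    ("sonnet", 120.0),
    ("opus", 300.0),
)


def run_cascade(
    task: str,
    attempt: Callable[[str, str, float], bool],
    tiers: Sequence[Tuple[str, float]] = DEFAULT_TIERS,
) -> Tuple[Optional[str], List[str]]:
    """Try each tier in order and stop at the first success.

    `attempt(model, task, timeout_s)` should return True when the task
    succeeded (for example, tests pass) and may raise TimeoutError.
    Returns (winning model or None, list of models tried).
    """
    tried: List[str] = []
    for model, timeout_s in tiers:
        tried.append(model)
        try:
            if attempt(model, task, timeout_s):
                return model, tried
        except TimeoutError:
            # A timeout counts as a failure; fall through to the next tier.
            continue
    return None, tried
```

The returned list of tried models is what step 5 would record per query type, so that later routing decisions can skip tiers that rarely succeed.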
77
+
78
+ ---
79
+
80
+ ## PHASE 2 (Weeks 5-8): Security & Testing
81
+
82
+ ### #3 Vulnerability Reward Model + Online RL Hardening
83
+
84
+ **What it does:** Add security signals (detected vulnerabilities, CWE patterns) to reward model; enable vulnerability-aware RL
85
+
86
+ **Why it matters:**
87
+
88
+ - Current: Reward signals are functional-only (test pass, coverage)
89
+ - SOTA: Meta's SecCoderX, Anthropic's security research
90
+ - Impact: 30-40% security issue reduction; compliance-ready code
91
+
92
+ **How to implement:**
93
+
94
+ 1. Integrate lightweight SAST (e.g., Semgrep, Bandit, Trivy)
95
+ 2. Run on generated code; extract (vulnerability_count, cwe_classes)
96
+ 3. Add to reward signal as negative reward: reward -= vulnerability_count \* weight
97
+ 4. Store effective security fixes in episodic memory
98
+ 5. Fine-tune on secure code examples
99
+
100
+ **Effort:** Medium (scanner integration, signal weighting, RL loop)
101
+ **Blocking:** Nothing
102
+ **Files to modify:** `sw-reward-aggregator.sh`, `sw-rl-optimizer.sh`, new: `sw-security-reward.sh`
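
The penalty in step 3 above can be illustrated as below. This is a sketch under assumptions: `ScanResult`, the function name, and the default `weight` of 0.1 are invented for the example, and the scanner output is assumed to have already been reduced to a count plus per-CWE tallies.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ScanResult:
    """Hypothetical summary of one scanner run over generated code."""

    vulnerability_count: int
    cwe_classes: Dict[str, int] = field(default_factory=dict)  # e.g. {"CWE-79": 2}


def security_adjusted_reward(
    functional_reward: float,
    scan: ScanResult,
    weight: float = 0.1,  # assumed value; the backlog does not specify a weight
) -> float:
    """Apply the backlog's rule: reward -= vulnerability_count * weight."""
    if weight < 0:
        raise ValueError("weight must be non-negative")
    if scan.vulnerability_count < 0:
        raise ValueError("vulnerability_count must be non-negative")
    return functional_reward - scan.vulnerability_count * weight
```

A flat per-finding weight is the simplest reading of the formula. Weighting by CWE class or severity would be a natural extension, but the backlog does not describe one.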
103
+
104
+ ---
105
+
106
+ ### #6 Mutation Testing Feedback Loop
107
+
108
+ **What it does:** Validate test quality by checking % of mutants killed; regenerate tests if score low
109
+
110
+ **Why it matters:**
111
+
112
+ - Current: Coverage metrics inflated; 45% of LLM-generated tests are ineffective
113
+ - SOTA: Meta ACH, MutGen papers show mutation feedback improves test quality
114
+ - Impact: 30-40% better test effectiveness; catches subtle bugs
115
+
116
+ **How to implement:**
117
+
118
+ 1. After test generation: run mutation tool (Major, PIT) on code
119
+ 2. Run generated tests against mutants; compute mutation_score = killed / total
120
+ 3. If score < threshold (e.g., 80%): add feedback to testgen prompt
121
+ 4. Regenerate tests with mutation feedback
122
+ 5. Store effective test patterns for reuse
123
+
124
+ **Effort:** Medium (mutation tool integration, feedback loop)
125
+ **Blocking:** Nothing
126
+ **Files to modify:** `sw-testgen.sh`, new: `sw-mutation-validator.sh`
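
Steps 2 and 3 above reduce to a ratio and a threshold check, sketched here. The function names and the feedback wording are assumptions for illustration; only the `killed / total` formula and the 80% example threshold come from the backlog.

```python
from typing import List


def mutation_score(killed: int, total: int) -> float:
    """mutation_score = killed / total, as defined in the backlog."""
    if total <= 0:
        raise ValueError("total mutants must be positive")
    if not 0 <= killed <= total:
        raise ValueError("killed must be between 0 and total")
    return killed / total


def needs_regeneration(killed: int, total: int, threshold: float = 0.80) -> bool:
    """True when the score falls below the threshold (80% in the backlog's example)."""
    return mutation_score(killed, total) < threshold


def build_feedback(surviving_mutants: List[str]) -> str:
    """Hypothetical prompt fragment listing the mutants the tests failed to kill."""
    lines = ["The following mutants survived the current tests:"]
    lines.extend(f"- {description}" for description in surviving_mutants)
    lines.append("Add or strengthen tests so that each of these mutants fails.")
    return "\n".join(lines)
```

For example, 45 killed out of 60 gives a score of 0.75, which is below 0.80 and would trigger regeneration with the surviving mutants appended to the testgen prompt.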
127
+
128
+ ---
129
+
130
+ ### #13 Multi-Pass Mutation Generation (LLM-based)
131
+
132
+ **What it does:** Use LLM to generate diverse mutants (not just rule-based); Meta-style compliance
133
+
134
+ **Why it matters:**
135
+
136
+ - Current: Traditional mutation tools (Major) have limited operators
137
+ - SOTA: GPT-4o/DeepSeek-R1 generate 57 different AST node types vs 2 for rules
138
+ - Impact: Better mutation diversity; more confident test validation
139
+
140
+ **How to implement:**
141
+
142
+ 1. Take source code + list of mutation types
143
+ 2. Prompt LLM: "Generate N mutants that change behavior but keep syntax valid"
144
+ 3. Validate mutants compile + are distinct from originals
145
+ 4. Run tests; track mutation score
146
+ 5. Feed back into testgen loop if coverage is low
147
+
148
+ **Effort:** High (prompt engineering, mutation validation)
149
+ **Blocking:** Nothing
150
+ **Files to modify:** new: `sw-llm-mutant-generator.sh`
151
+
152
+ ---
153
+
154
+ ## PHASE 3 (Weeks 9-12): Memory & Self-Healing
155
+
156
+ ### #4 Episodic Memory Layer
157
+
158
+ **What it does:** Store complete execution traces (inputs, actions, outcomes); enable case-based reasoning
159
+
160
+ **Why it matters:**
161
+
162
+ - Current: Memory is pattern-based ("when X fails, do Y")
163
+ - SOTA: Mem0, EM-LLM, MemRL papers show episodic learning 20-35% faster
164
+ - Impact: Case-based analogy; long-horizon self-improvement
165
+
166
+ **How to implement:**
167
+
168
+ 1. On each pipeline run: capture episode JSON (inputs, agent_actions, outputs, duration, cost, test_results)
169
+ 2. Store in episodic DB (SQLite + JSON or Postgres)
170
+ 3. Query: "Find 3 similar past episodes" (via embedding similarity)
171
+ 4. Inject retrieved cases as few-shot examples into new agent prompts
172
+ 5. Active compression: every 10 episodes, consolidate → semantic facts
173
+
174
+ **Effort:** High (episode storage, retrieval, compression)
175
+ **Blocking:** Nothing
176
+ **Files to modify:** `sw-memory.sh`, new: `sw-episodic-memory.sh`
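
Steps 1 to 3 above could look roughly like this with SQLite and JSON, one of the two storage options the backlog mentions. The schema, the function names, and the brute-force similarity scan are assumptions made for the sketch, and embeddings are supplied by the caller rather than computed here.

```python
import json
import math
import sqlite3
from typing import List, Sequence, Tuple


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a hypothetical episode store."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS episodes ("
        "id INTEGER PRIMARY KEY, episode TEXT NOT NULL, embedding TEXT NOT NULL)"
    )
    return conn


def add_episode(conn: sqlite3.Connection, episode: dict, embedding: Sequence[float]) -> int:
    """Store one episode as JSON alongside its embedding; return the row id."""
    cur = conn.execute(
        "INSERT INTO episodes (episode, embedding) VALUES (?, ?)",
        (json.dumps(episode), json.dumps(list(embedding))),
    )
    conn.commit()
    return cur.lastrowid


def _cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def find_similar(
    conn: sqlite3.Connection, query: Sequence[float], k: int = 3
) -> List[Tuple[float, dict]]:
    """Return the k most similar past episodes as (similarity, episode) pairs."""
    rows = conn.execute("SELECT episode, embedding FROM episodes").fetchall()
    scored = [(_cosine(query, json.loads(emb)), json.loads(ep)) for ep, emb in rows]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

Scanning every row is fine for a few thousand episodes. Beyond that, a vector index or the consolidation in step 5 would be needed to keep retrieval fast.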
177
+
178
+ ---
179
+
180
+ ### #7 CI Repair Agent
181
+
182
+ **What it does:** When test/check fails, spawn repair agent to diagnose & fix root cause
183
+
184
+ **Why it matters:**
185
+
186
+ - Current: Retries on failure; no diagnosis
187
+ - SOTA: Pipeline Doctor pattern (2026 AIOps trend); 67% MTTR drop
188
+ - Impact: 50% fewer retries; faster merge times
189
+
190
+ **How to implement:**
191
+
192
+ 1. Detect test/check failure (via CI logs)
193
+ 2. Classify failure: timeout, race condition, assertion, resource, flaky
194
+ 3. Spawn repair agent with failure context (logs, git diff, error)
195
+ 4. Agent proposes fix (increase timeout, add sync, skip flaky test, etc.)
196
+ 5. Re-run test; if passes, commit repair
197
+ 6. Track effective repairs in memory
198
+
199
+ **Effort:** High (log parsing, classification, repair proposals, commit management)
200
+ **Blocking:** Nothing
201
+ **Files to modify:** `sw-ci.sh`, new: `sw-repair-agent.sh`
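
Step 2 above (failure classification) might start as a keyword pass over the log, sketched below. The categories are the ones listed in the backlog, but the regular expressions, the `passed_on_rerun` flag used to detect flakiness, and the `"unknown"` fallback are assumptions for illustration only.

```python
import re
from typing import List, Pattern, Tuple

# Hypothetical keyword rules, checked in order; first match wins.
RULES: List[Tuple[str, Pattern[str]]] = [
    ("timeout", re.compile(r"timed? ?out|deadline exceeded", re.IGNORECASE)),
    ("race condition", re.compile(r"data race|race condition|deadlock", re.IGNORECASE)),
    (
        "resource",
        re.compile(
            r"out of memory|no space left on device|too many open files",
            re.IGNORECASE,
        ),
    ),
    ("assertion", re.compile(r"\bassert", re.IGNORECASE)),
]


def classify_failure(log_text: str, passed_on_rerun: bool = False) -> str:
    """Map a CI log to one of the backlog's failure classes.

    A failure that passes on an unchanged re-run is treated as flaky,
    regardless of what the log says.
    """
    if passed_on_rerun:
        return "flaky"
    for label, pattern in RULES:
        if pattern.search(log_text):
            return label
    return "unknown"
```

The label would then be passed to the repair agent together with the logs and diff from step 3. Keyword rules are brittle, so an `"unknown"` result is a reasonable point to fall back to model-based diagnosis.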
202
+
203
+ ---
204
+
205
+ ### #8 LLM-as-a-Judge Validation
206
+
207
+ **What it does:** Secondary model evaluates primary agent output; triggers repair if needed
208
+
209
+ **Why it matters:**
210
+
211
+ - Current: Quality gates are rule-based (coverage > X%, no ASan)
212
+ - SOTA: 2026 standard design pattern for agentic systems
213
+ - Impact: 10-15% fewer merge regressions; catches issues rules miss
214
+
215
+ **How to implement:**
216
+
217
+ 1. After primary agent completes task: send code + acceptance criteria to Judge model
218
+ 2. Judge evaluates: "Does this code meet requirements? Any issues?"
219
+ 3. If Judge flags issues: auto-trigger repair agent or escalate
220
+ 4. Log Judge decisions for learning
221
+ 5. Track Judge accuracy (via post-merge bug rates)
222
+
223
+ **Effort:** Medium (prompt engineering, logic orchestration)
224
+ **Blocking:** Nothing
225
+ **Files to modify:** `sw-quality.sh`, new: `sw-judge.sh`
226
+
227
+ ---
228
+
229
+ ## TIER 2 Items (Brief Summary)
230
+
231
+ | # | Feature | Quick Implementation Path |
232
+ | --- | ------------------------------------- | ------------------------------------------------------------------------------ |
233
+ | #2 | Intent Specification Engine | Research phase; build DSL for constraints; integrate formal spec generation |
234
+ | #9 | Conflict Detection + DAG | Track file locks per agent; build task DAG scheduler; merge conflict resolver |
235
+ | #10 | Reasoning Budget Allocation | Track thinking cost vs outcome; build (task_type, complexity) → tokens lookup |
236
+ | #11 | Formal Verification (Dafny/Lean) | Integrate theorem prover APIs; generate specs; gate merge on proof completion |
237
+ | #12 | Active Context Compression | EM-LLM approach: Bayesian surprise + graph refinement for episode boundaries |
238
+ | #14 | Anomaly Detection + Predictive Repair | Time-series analysis on logs; ML model for failure prediction; repair triggers |
239
+ | #15 | Cross-Repo Fleet Learning | Share patterns via fleet event bus; rank patterns by repo similarity |
240
+
241
+ ---
242
+
243
+ ## Implementation Checklist
244
+
245
+ ### PHASE 1 (Target: 2 weeks per item)
246
+
247
+ - [ ] #1 Semantic trajectory analysis
248
+ - [ ] Embedding integration
249
+ - [ ] Distance tracking + regime classification
250
+ - [ ] Early exit logic
251
+ - [ ] Tests + monitoring
252
+ - [ ] #5 Speculative cascade routing
253
+ - [ ] Failure prediction model
254
+ - [ ] Cascade orchestration
255
+ - [ ] Timeout management
256
+ - [ ] Tracking + learning
257
+
258
+ ### PHASE 2 (Target: 1.5-2 weeks per item)
259
+
260
+ - [ ] #3 Vulnerability reward model
261
+ - [ ] #6 Mutation testing loop
262
+ - [ ] #13 LLM-based mutants
263
+
264
+ ### PHASE 3 (Target: 2-3 weeks per item)
265
+
266
+ - [ ] #4 Episodic memory layer
267
+ - [ ] #7 CI repair agent
268
+ - [ ] #8 LLM-as-a-Judge
269
+
270
+ ---
271
+
272
+ ## Success Metrics (Post-Implementation)
273
+
274
+ | Feature | Metric | Target | Current |
275
+ | ------------------- | ----------------------------------- | ------- | -------- |
276
+ | #1 Loop convergence | Iteration waste reduction | -25-40% | Baseline |
277
+ | #5 Cascade routing | Cost reduction on median tasks | -40-60% | Baseline |
278
+ | #3 Security rewards | Bug reduction | -30-40% | Baseline |
279
+ | #6 Mutation testing | Test effectiveness (mutation score) | >80% | ~60% |
280
+ | #4 Episodic memory | Solution time on similar tasks | -20-35% | Baseline |
281
+ | #7 CI repair | Retry cycles | -50% | Baseline |
282
+ | Overall | Pipeline success rate | >85% | ~77% |
283
+
284
+ ---
285
+
286
+ ## Dependencies & Blocking Relationships
287
+
288
+ ```
289
+ #1 (trajectory) ─────┐
290
+ ├──→ #5 (cascade) ──→ Cost optimization ✓
291
+
292
+ #2 (intent) [research phase; no immediate blocks]
293
+
294
+ #3 (vulnerability) ──┐
295
+ #6 (mutations) ├──→ Security + Testing quality
296
+ #13 (LLM mutants) ───┘
297
+
298
+ #4 (episodic) ───────┐
299
+ #12 (compression) ───┤
300
+ #15 (fleet learning) ┤ All feed each other; can implement in parallel
301
+ └──→ Long-horizon learning
302
+
303
+ #7 (CI repair) ──┐
304
+ #8 (judge) └──→ Quality gates
305
+
306
+ No critical blocking path: all items can start immediately, at some risk.
307
+ Recommend: Start #1 + #5 in week 1, #3 + #6 in week 5, #4 + #7 in week 9.
308
+ ```
309
+
310
+ ---
311
+
312
+ ## Cost-Benefit Analysis
313
+
314
+ ### Immediate ROI (Phase 1-2, Weeks 1-8)
315
+
316
+ **Investment:**
317
+
318
+ - 2 engineers × 8 weeks @ $200K/year = ~$60K engineering cost
319
+ - Compute for research + prototyping = ~$5K
320
+
321
+ **Returns (Annual):**
322
+
323
+ - Cost reduction via cascade: 40-60% savings on compute (current $50K/month → $20-30K) = **$240-360K/year**
324
+ - Faster iteration: 30% speedup on 200 pipelines/month × $5/pipeline = **$30K/year**
325
+ - Security improvement: 30-40% fewer CVEs → reduced incident response = **$50K+ saved**
326
+
327
+ **Total Annual ROI: $320-440K on $65K investment = 5-7x**
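
The Phase 1-2 total above can be re-derived from the listed figures. The sketch below only sums the stated annual returns and divides by the stated investment; the helper name is made up, and the "$50K+" security figure is treated as exactly $50K.

```python
from typing import List, Tuple


def roi_range(
    returns_k: List[Tuple[float, float]], investment_k: float
) -> Tuple[float, float, float, float]:
    """Sum (low, high) annual returns in $K and divide by the investment in $K.

    Returns (low total, high total, low multiple, high multiple).
    """
    low = sum(lo for lo, _ in returns_k)
    high = sum(hi for _, hi in returns_k)
    return low, high, low / investment_k, high / investment_k


# Figures as stated in the backlog, in $K per year.
phase_1_2_returns = [
    (240, 360),  # cascade cost reduction
    (30, 30),    # faster iteration
    (50, 50),    # security improvement ("$50K+", taken as 50)
]
investment = 60 + 5  # engineering + compute

low, high, low_x, high_x = roi_range(phase_1_2_returns, investment)
print(f"${low:.0f}-{high:.0f}K, {low_x:.1f}x-{high_x:.1f}x")
```

The totals come to $320K and $440K, and the multiples to roughly 4.9 and 6.8, which the backlog rounds to 5-7x.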
328
+
329
+ ### Long-Term ROI (Phase 3 + Beyond, Weeks 9-26)
330
+
331
+ **Additional returns:**
332
+
333
+ - Episodic memory: 20-35% faster solutions × 200 pipelines = **$50-85K/year**
334
+ - Self-healing CI: 50% fewer retries = **$30K/year** (fewer human reviews)
335
+ - Fleet learning: 20% faster on new projects = **$40K/year**
336
+
337
+ **Total Long-Term ROI: $440-555K on $120K investment = 3-4x**
338
+
339
+ ---
340
+
341
+ ## Next Steps
342
+
343
+ 1. **This week:** Review [CUTTING_EDGE_RESEARCH_2026.md](./CUTTING_EDGE_RESEARCH_2026.md) for full details on each feature
344
+ 2. **Next week:** Spike on #1 (trajectory analysis) — prototype embedding-space distance tracking
345
+ 3. **Following week:** Begin #5 (cascade routing) and #3 (vulnerability rewards) in parallel
346
+ 4. **Week 4+:** Ramp up to PHASE 2 items as Phase 1 items ship
347
+
348
+ ---
349
+
350
+ **Generated:** April 4, 2026
351
+ **Total research effort:** 50+ sources, 25+ papers, 8 research areas
352
+ **Full report:** See CUTTING_EDGE_RESEARCH_2026.md (comprehensive analysis)