shipwright-cli 3.2.0 → 3.3.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (279)
  1. package/.claude/agents/code-reviewer.md +2 -0
  2. package/.claude/agents/devops-engineer.md +2 -0
  3. package/.claude/agents/doc-fleet-agent.md +2 -0
  4. package/.claude/agents/pipeline-agent.md +2 -0
  5. package/.claude/agents/shell-script-specialist.md +2 -0
  6. package/.claude/agents/test-specialist.md +2 -0
  7. package/.claude/hooks/agent-crash-capture.sh +32 -0
  8. package/.claude/hooks/post-tool-use.sh +3 -2
  9. package/.claude/hooks/pre-tool-use.sh +35 -3
  10. package/README.md +4 -4
  11. package/claude-code/hooks/config-change.sh +18 -0
  12. package/claude-code/hooks/instructions-reloaded.sh +7 -0
  13. package/claude-code/hooks/worktree-create.sh +25 -0
  14. package/claude-code/hooks/worktree-remove.sh +20 -0
  15. package/config/code-constitution.json +130 -0
  16. package/dashboard/middleware/auth.ts +134 -0
  17. package/dashboard/middleware/constants.ts +21 -0
  18. package/dashboard/public/index.html +2 -6
  19. package/dashboard/public/styles.css +100 -97
  20. package/dashboard/routes/auth.ts +38 -0
  21. package/dashboard/server.ts +66 -25
  22. package/dashboard/services/config.ts +26 -0
  23. package/dashboard/services/db.ts +118 -0
  24. package/dashboard/src/canvas/pixel-agent.ts +298 -0
  25. package/dashboard/src/canvas/pixel-sprites.ts +440 -0
  26. package/dashboard/src/canvas/shipyard-effects.ts +367 -0
  27. package/dashboard/src/canvas/shipyard-scene.ts +616 -0
  28. package/dashboard/src/canvas/submarine-layout.ts +267 -0
  29. package/dashboard/src/components/header.ts +8 -7
  30. package/dashboard/src/core/router.ts +1 -0
  31. package/dashboard/src/design/submarine-theme.ts +253 -0
  32. package/dashboard/src/main.ts +2 -0
  33. package/dashboard/src/types/api.ts +2 -1
  34. package/dashboard/src/views/activity.ts +2 -1
  35. package/dashboard/src/views/shipyard.ts +39 -0
  36. package/dashboard/types/index.ts +166 -0
  37. package/docs/plans/2026-02-28-compound-audit-and-shipyard-design.md +186 -0
  38. package/docs/plans/2026-02-28-skipper-shipwright-implementation-plan.md +1182 -0
  39. package/docs/plans/2026-02-28-skipper-shipwright-integration-design.md +531 -0
  40. package/docs/plans/2026-03-01-ai-powered-skill-injection-design.md +298 -0
  41. package/docs/plans/2026-03-01-ai-powered-skill-injection-plan.md +1109 -0
  42. package/docs/plans/2026-03-01-capabilities-cleanup-plan.md +658 -0
  43. package/docs/plans/2026-03-01-clean-architecture-plan.md +924 -0
  44. package/docs/plans/2026-03-01-compound-audit-cascade-design.md +191 -0
  45. package/docs/plans/2026-03-01-compound-audit-cascade-plan.md +921 -0
  46. package/docs/plans/2026-03-01-deep-integration-plan.md +851 -0
  47. package/docs/plans/2026-03-01-pipeline-audit-trail-design.md +145 -0
  48. package/docs/plans/2026-03-01-pipeline-audit-trail-plan.md +770 -0
  49. package/docs/plans/2026-03-01-refined-depths-brand-design.md +382 -0
  50. package/docs/plans/2026-03-01-refined-depths-implementation.md +599 -0
  51. package/docs/plans/2026-03-01-skipper-kernel-integration-design.md +203 -0
  52. package/docs/plans/2026-03-01-unified-platform-design.md +272 -0
  53. package/docs/plans/2026-03-07-claude-code-feature-integration-design.md +189 -0
  54. package/docs/plans/2026-03-07-claude-code-feature-integration-plan.md +1165 -0
  55. package/docs/research/BACKLOG_QUICK_REFERENCE.md +352 -0
  56. package/docs/research/CUTTING_EDGE_RESEARCH_2026.md +546 -0
  57. package/docs/research/RESEARCH_INDEX.md +439 -0
  58. package/docs/research/RESEARCH_SOURCES.md +440 -0
  59. package/docs/research/RESEARCH_SUMMARY.txt +275 -0
  60. package/docs/superpowers/specs/2026-03-10-pipeline-quality-revolution-design.md +341 -0
  61. package/package.json +2 -2
  62. package/scripts/lib/adaptive-model.sh +427 -0
  63. package/scripts/lib/adaptive-timeout.sh +316 -0
  64. package/scripts/lib/audit-trail.sh +309 -0
  65. package/scripts/lib/auto-recovery.sh +471 -0
  66. package/scripts/lib/bandit-selector.sh +431 -0
  67. package/scripts/lib/bootstrap.sh +104 -2
  68. package/scripts/lib/causal-graph.sh +455 -0
  69. package/scripts/lib/compat.sh +126 -0
  70. package/scripts/lib/compound-audit.sh +337 -0
  71. package/scripts/lib/constitutional.sh +454 -0
  72. package/scripts/lib/context-budget.sh +359 -0
  73. package/scripts/lib/convergence.sh +594 -0
  74. package/scripts/lib/cost-optimizer.sh +634 -0
  75. package/scripts/lib/daemon-adaptive.sh +10 -0
  76. package/scripts/lib/daemon-dispatch.sh +106 -17
  77. package/scripts/lib/daemon-failure.sh +34 -4
  78. package/scripts/lib/daemon-patrol.sh +23 -2
  79. package/scripts/lib/daemon-poll-github.sh +361 -0
  80. package/scripts/lib/daemon-poll-health.sh +299 -0
  81. package/scripts/lib/daemon-poll.sh +27 -611
  82. package/scripts/lib/daemon-state.sh +112 -66
  83. package/scripts/lib/daemon-triage.sh +10 -0
  84. package/scripts/lib/dod-scorecard.sh +442 -0
  85. package/scripts/lib/error-actionability.sh +300 -0
  86. package/scripts/lib/formal-spec.sh +461 -0
  87. package/scripts/lib/helpers.sh +177 -4
  88. package/scripts/lib/intent-analysis.sh +409 -0
  89. package/scripts/lib/loop-convergence.sh +350 -0
  90. package/scripts/lib/loop-iteration.sh +682 -0
  91. package/scripts/lib/loop-progress.sh +48 -0
  92. package/scripts/lib/loop-restart.sh +185 -0
  93. package/scripts/lib/memory-effectiveness.sh +506 -0
  94. package/scripts/lib/mutation-executor.sh +352 -0
  95. package/scripts/lib/outcome-feedback.sh +521 -0
  96. package/scripts/lib/pipeline-cli.sh +336 -0
  97. package/scripts/lib/pipeline-commands.sh +1216 -0
  98. package/scripts/lib/pipeline-detection.sh +100 -2
  99. package/scripts/lib/pipeline-execution.sh +897 -0
  100. package/scripts/lib/pipeline-github.sh +28 -3
  101. package/scripts/lib/pipeline-intelligence-compound.sh +431 -0
  102. package/scripts/lib/pipeline-intelligence-scoring.sh +407 -0
  103. package/scripts/lib/pipeline-intelligence-skip.sh +181 -0
  104. package/scripts/lib/pipeline-intelligence.sh +100 -1136
  105. package/scripts/lib/pipeline-quality-bash-compat.sh +182 -0
  106. package/scripts/lib/pipeline-quality-checks.sh +17 -715
  107. package/scripts/lib/pipeline-quality-gates.sh +563 -0
  108. package/scripts/lib/pipeline-stages-build.sh +730 -0
  109. package/scripts/lib/pipeline-stages-delivery.sh +965 -0
  110. package/scripts/lib/pipeline-stages-intake.sh +1133 -0
  111. package/scripts/lib/pipeline-stages-monitor.sh +407 -0
  112. package/scripts/lib/pipeline-stages-review.sh +1022 -0
  113. package/scripts/lib/pipeline-stages.sh +59 -2929
  114. package/scripts/lib/pipeline-state.sh +36 -5
  115. package/scripts/lib/pipeline-util.sh +487 -0
  116. package/scripts/lib/policy-learner.sh +438 -0
  117. package/scripts/lib/process-reward.sh +493 -0
  118. package/scripts/lib/project-detect.sh +649 -0
  119. package/scripts/lib/quality-profile.sh +334 -0
  120. package/scripts/lib/recruit-commands.sh +885 -0
  121. package/scripts/lib/recruit-learning.sh +739 -0
  122. package/scripts/lib/recruit-roles.sh +648 -0
  123. package/scripts/lib/reward-aggregator.sh +458 -0
  124. package/scripts/lib/rl-optimizer.sh +362 -0
  125. package/scripts/lib/root-cause.sh +427 -0
  126. package/scripts/lib/scope-enforcement.sh +445 -0
  127. package/scripts/lib/session-restart.sh +493 -0
  128. package/scripts/lib/skill-memory.sh +300 -0
  129. package/scripts/lib/skill-registry.sh +775 -0
  130. package/scripts/lib/spec-driven.sh +476 -0
  131. package/scripts/lib/test-helpers.sh +18 -7
  132. package/scripts/lib/test-holdout.sh +429 -0
  133. package/scripts/lib/test-optimizer.sh +511 -0
  134. package/scripts/shipwright-file-suggest.sh +45 -0
  135. package/scripts/skills/adversarial-quality.md +61 -0
  136. package/scripts/skills/api-design.md +44 -0
  137. package/scripts/skills/architecture-design.md +50 -0
  138. package/scripts/skills/brainstorming.md +43 -0
  139. package/scripts/skills/data-pipeline.md +44 -0
  140. package/scripts/skills/deploy-safety.md +64 -0
  141. package/scripts/skills/documentation.md +38 -0
  142. package/scripts/skills/frontend-design.md +45 -0
  143. package/scripts/skills/generated/.gitkeep +0 -0
  144. package/scripts/skills/generated/_refinements/.gitkeep +0 -0
  145. package/scripts/skills/generated/_refinements/adversarial-quality.patch.md +3 -0
  146. package/scripts/skills/generated/_refinements/architecture-design.patch.md +3 -0
  147. package/scripts/skills/generated/_refinements/brainstorming.patch.md +3 -0
  148. package/scripts/skills/generated/cli-version-management.md +29 -0
  149. package/scripts/skills/generated/collection-system-validation.md +99 -0
  150. package/scripts/skills/generated/large-scale-c-refactoring-coordination.md +97 -0
  151. package/scripts/skills/generated/pattern-matching-similarity-scoring.md +195 -0
  152. package/scripts/skills/generated/test-parallelization-detection.md +65 -0
  153. package/scripts/skills/observability.md +79 -0
  154. package/scripts/skills/performance.md +48 -0
  155. package/scripts/skills/pr-quality.md +49 -0
  156. package/scripts/skills/product-thinking.md +43 -0
  157. package/scripts/skills/security-audit.md +49 -0
  158. package/scripts/skills/systematic-debugging.md +40 -0
  159. package/scripts/skills/testing-strategy.md +47 -0
  160. package/scripts/skills/two-stage-review.md +52 -0
  161. package/scripts/skills/validation-thoroughness.md +55 -0
  162. package/scripts/sw +9 -3
  163. package/scripts/sw-activity.sh +9 -2
  164. package/scripts/sw-adaptive.sh +2 -1
  165. package/scripts/sw-adversarial.sh +2 -1
  166. package/scripts/sw-architecture-enforcer.sh +3 -1
  167. package/scripts/sw-auth.sh +12 -2
  168. package/scripts/sw-autonomous.sh +5 -1
  169. package/scripts/sw-changelog.sh +4 -1
  170. package/scripts/sw-checkpoint.sh +2 -1
  171. package/scripts/sw-ci.sh +5 -1
  172. package/scripts/sw-cleanup.sh +4 -26
  173. package/scripts/sw-code-review.sh +10 -4
  174. package/scripts/sw-connect.sh +2 -1
  175. package/scripts/sw-context.sh +2 -1
  176. package/scripts/sw-cost.sh +48 -3
  177. package/scripts/sw-daemon.sh +66 -9
  178. package/scripts/sw-dashboard.sh +3 -1
  179. package/scripts/sw-db.sh +59 -16
  180. package/scripts/sw-decide.sh +8 -2
  181. package/scripts/sw-decompose.sh +360 -17
  182. package/scripts/sw-deps.sh +4 -1
  183. package/scripts/sw-developer-simulation.sh +4 -1
  184. package/scripts/sw-discovery.sh +325 -2
  185. package/scripts/sw-doc-fleet.sh +4 -1
  186. package/scripts/sw-docs-agent.sh +3 -1
  187. package/scripts/sw-docs.sh +2 -1
  188. package/scripts/sw-doctor.sh +453 -2
  189. package/scripts/sw-dora.sh +4 -1
  190. package/scripts/sw-durable.sh +4 -3
  191. package/scripts/sw-e2e-orchestrator.sh +17 -16
  192. package/scripts/sw-eventbus.sh +7 -1
  193. package/scripts/sw-evidence.sh +364 -12
  194. package/scripts/sw-feedback.sh +550 -9
  195. package/scripts/sw-fix.sh +20 -1
  196. package/scripts/sw-fleet-discover.sh +6 -2
  197. package/scripts/sw-fleet-viz.sh +4 -1
  198. package/scripts/sw-fleet.sh +5 -1
  199. package/scripts/sw-github-app.sh +16 -3
  200. package/scripts/sw-github-checks.sh +3 -2
  201. package/scripts/sw-github-deploy.sh +3 -2
  202. package/scripts/sw-github-graphql.sh +18 -7
  203. package/scripts/sw-guild.sh +5 -1
  204. package/scripts/sw-heartbeat.sh +5 -30
  205. package/scripts/sw-hello.sh +67 -0
  206. package/scripts/sw-hygiene.sh +6 -1
  207. package/scripts/sw-incident.sh +265 -1
  208. package/scripts/sw-init.sh +18 -2
  209. package/scripts/sw-instrument.sh +10 -2
  210. package/scripts/sw-intelligence.sh +42 -6
  211. package/scripts/sw-jira.sh +5 -1
  212. package/scripts/sw-launchd.sh +2 -1
  213. package/scripts/sw-linear.sh +4 -1
  214. package/scripts/sw-logs.sh +4 -1
  215. package/scripts/sw-loop.sh +432 -1128
  216. package/scripts/sw-memory.sh +356 -2
  217. package/scripts/sw-mission-control.sh +6 -1
  218. package/scripts/sw-model-router.sh +481 -26
  219. package/scripts/sw-otel.sh +13 -4
  220. package/scripts/sw-oversight.sh +14 -5
  221. package/scripts/sw-patrol-meta.sh +334 -0
  222. package/scripts/sw-pipeline-composer.sh +5 -1
  223. package/scripts/sw-pipeline-vitals.sh +2 -1
  224. package/scripts/sw-pipeline.sh +53 -2664
  225. package/scripts/sw-pm.sh +12 -5
  226. package/scripts/sw-pr-lifecycle.sh +2 -1
  227. package/scripts/sw-predictive.sh +7 -1
  228. package/scripts/sw-prep.sh +185 -2
  229. package/scripts/sw-ps.sh +5 -25
  230. package/scripts/sw-public-dashboard.sh +15 -3
  231. package/scripts/sw-quality.sh +2 -1
  232. package/scripts/sw-reaper.sh +8 -25
  233. package/scripts/sw-recruit.sh +156 -2303
  234. package/scripts/sw-regression.sh +19 -12
  235. package/scripts/sw-release-manager.sh +3 -1
  236. package/scripts/sw-release.sh +4 -1
  237. package/scripts/sw-remote.sh +3 -1
  238. package/scripts/sw-replay.sh +7 -1
  239. package/scripts/sw-retro.sh +158 -1
  240. package/scripts/sw-review-rerun.sh +3 -1
  241. package/scripts/sw-scale.sh +10 -3
  242. package/scripts/sw-security-audit.sh +6 -1
  243. package/scripts/sw-self-optimize.sh +6 -3
  244. package/scripts/sw-session.sh +9 -3
  245. package/scripts/sw-setup.sh +3 -1
  246. package/scripts/sw-stall-detector.sh +406 -0
  247. package/scripts/sw-standup.sh +15 -7
  248. package/scripts/sw-status.sh +3 -1
  249. package/scripts/sw-strategic.sh +4 -1
  250. package/scripts/sw-stream.sh +7 -1
  251. package/scripts/sw-swarm.sh +18 -6
  252. package/scripts/sw-team-stages.sh +13 -6
  253. package/scripts/sw-templates.sh +5 -29
  254. package/scripts/sw-testgen.sh +7 -1
  255. package/scripts/sw-tmux-pipeline.sh +4 -1
  256. package/scripts/sw-tmux-role-color.sh +2 -0
  257. package/scripts/sw-tmux-status.sh +1 -1
  258. package/scripts/sw-tmux.sh +3 -1
  259. package/scripts/sw-trace.sh +3 -1
  260. package/scripts/sw-tracker-github.sh +3 -0
  261. package/scripts/sw-tracker-jira.sh +3 -0
  262. package/scripts/sw-tracker-linear.sh +3 -0
  263. package/scripts/sw-tracker.sh +3 -1
  264. package/scripts/sw-triage.sh +2 -1
  265. package/scripts/sw-upgrade.sh +3 -1
  266. package/scripts/sw-ux.sh +5 -2
  267. package/scripts/sw-webhook.sh +3 -1
  268. package/scripts/sw-widgets.sh +3 -1
  269. package/scripts/sw-worktree.sh +15 -3
  270. package/scripts/test-skill-injection.sh +1233 -0
  271. package/templates/pipelines/autonomous.json +27 -3
  272. package/templates/pipelines/cost-aware.json +34 -8
  273. package/templates/pipelines/deployed.json +12 -0
  274. package/templates/pipelines/enterprise.json +12 -0
  275. package/templates/pipelines/fast.json +6 -0
  276. package/templates/pipelines/full.json +27 -3
  277. package/templates/pipelines/hotfix.json +6 -0
  278. package/templates/pipelines/standard.json +12 -0
  279. package/templates/pipelines/tdd.json +12 -0
@@ -0,0 +1,440 @@
1
+ # Research Sources: Autonomous Coding Systems (April 2026)
2
+
3
+ ## Complete Bibliography with URLs
4
+
5
+ ### Dark Factory & Autonomous Delivery
6
+
7
+ **BCG Platinion: The Dark Software Factory** (March 2026)
8
+
9
+ - https://www.bcgplatinion.com/insights/the-dark-software-factory
10
+ - **Key findings:** 3-5 engineers running factories; Spotify 650+ PRs/month; OpenAI 1M-line product in 5 months
11
+ - **Disciplines:** Harness Engineering, Intent Thinking
12
+ - **Report PDF:** https://cdn.prod.website-files.com/655cded084fee2e958faaffc/69b8331d6141dc7278866f9c_Dark_Software_Factory_BCG_Platinion_AI_report_March2026.pdf
13
+
14
+ **Anthropic 2026 Agentic Coding Trends Report**
15
+
16
+ - https://resources.anthropic.com/hubfs/2026%20Agentic%20Coding%20Trends%20Report.pdf
17
+ - **Coverage:** Loop convergence triggers, prompt design impact, multi-agent coordination patterns
18
+ - **Timeline:** 40% of enterprise apps will have agents by 2026 (vs <5% in 2025)
19
+
20
+ **GitHub Copilot: Agent Mode & Project Padawan**
21
+
22
+ - https://github.com/newsroom/press-releases/agent-mode
23
+ - https://githubnext.com/projects/copilot-workspace
24
+ - **Capabilities:** Issue-to-PR workflow, autonomous issue completion, asynchronous execution
25
+ - **Status:** GA since September 2025; Project Padawan in development
26
+
27
+ ---
28
+
29
+ ### Autonomous Loop Patterns & Convergence Detection
30
+
31
+ **SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering** (NeurIPS 2024)
32
+
33
+ - https://arxiv.org/abs/2405.15793
34
+ - **PDF:** https://arxiv.org/pdf/2405.15793
35
+ - **Repo:** https://github.com/SWE-agent/SWE-agent
36
+ - **Key innovation:** Custom ACI with repository primitives (find_file, search_dir, edit_tool)
37
+ - **Benchmark:** 12.5% pass@1 on SWE-bench (GPT-4 Turbo, per the paper's abstract)
38
+
39
+ **Geometric Dynamics of Agentic Loops in Large Language Models** (Jan 2026)
40
+
41
+ - https://arxiv.org/abs/2512.10350
42
+ - **Key finding:** Contractive vs exploratory loop regimes; prompt design governs dynamical behavior
43
+ - **Applications:** Early exit on convergence, escalation on divergence
44
+
45
+ **SWE-Bench & SWE-Bench Pro**
46
+
47
+ - Benchmark: https://www.vals.ai/benchmarks/swebench
48
+ - SWE-Bench Pro: https://scale.com/blog/swe-bench-pro
49
+ - **Status:** SWE-bench Verified flagged as contaminated (OpenAI finding); SWE-Bench Pro (1,865 tasks) is the new standard
50
+ - **Leaderboard:** https://llm-stats.com/benchmarks/swe-bench-verified-(agentic-coding)
51
+
52
+ **SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution**
53
+
54
+ - https://arxiv.org/pdf/2512.18470
55
+ - **Scope:** Multi-step modifications, release note interpretation, large-scale repos
56
+
57
+ ---
58
+
59
+ ### Reinforcement Learning for Code Generation
60
+
61
+ **FunPRM: Function-as-Step Process Reward Model with Meta Reward Correction**
62
+
63
+ - https://arxiv.org/abs/2601.22249
64
+ - **Innovation:** Treats functions as PRM steps; meta-learning reward correction via unit tests
65
+ - **Performance:** +15-20% completion rate vs outcome-only rewards
66
+
67
+ **SecCoderX: Secure Code Generation via Online Reinforcement Learning with Vulnerability Reward Model**
68
+
69
+ - https://arxiv.org/abs/2602.07422
70
+ - **Key contribution:** Vulnerability detection → reward model → RL loop
71
+ - **Application:** Security-hardened code generation
72
+
73
+ **Enhancing Code LLMs with Reinforcement Learning in Code Generation: A Survey**
74
+
75
+ - https://arxiv.org/abs/2412.20367
76
+ - **Coverage:** PPO standard, preference data → reward model → policy optimization
77
+ - **Scope:** RLHF, RLIF, online RL approaches
78
+
79
+ **Mutation-Guided LLM-based Test Generation at Meta**
80
+
81
+ - https://arxiv.org/abs/2501.12862
82
+ - **System:** ACH (Automated Compliance Hardening)
83
+ - **Scale:** 10,795 Android classes; 9,095 mutants; 571 test cases generated
84
+
85
+ ---
86
+
87
+ ### Reasoning Models & Extended Thinking
88
+
89
+ **Claude Opus 4.6: Adaptive Thinking** (Anthropic 2026)
90
+
91
+ - https://platform.claude.com/docs/en/build-with-claude/adaptive-thinking
92
+ - **Key feature:** Dynamically decides when/how much to think (replaces extended thinking)
93
+ - **Capability:** Think between tool calls; 1M context window
94
+
95
+ **OpenAI o1-pro: Complete Guide**
96
+
97
+ - https://openai.com/index/introducing-openai-o1-preview/
98
+ - https://openai.com/index/learning-to-reason-with-llms/
99
+ - **Specs:** 200K context, 100K output tokens, $150/$600 per 1M input/output tokens
100
+ - **Performance:** 86% AIME (vs 78% o1), 89th percentile Codeforces
101
+
102
+ **DeepSeek-R1: Incentivizing Reasoning Capability via RL**
103
+
104
+ - https://arxiv.org/abs/2501.12948
105
+ - **Repo:** https://github.com/deepseek-ai/DeepSeek-R1
106
+ - **Architecture:** 671B total parameters, 37B activated per token (Mixture of Experts)
107
+ - **Performance:** 2,029 Codeforces Elo (Candidate Master)
108
+ - **Training:** DeepSeek-R1-Zero uses pure RL without SFT; DeepSeek-R1 uses multi-stage training with cold-start SFT and RL
109
+
110
+ **Reasoning Models Don't Always Say What They Think** (Anthropic Alignment Science)
111
+
112
+ - https://www.anthropic.com/research/reasoning-models-dont-say-think
113
+ - **Finding:** Chain-of-thought reasoning may not be faithful (models mentioned the hints they used only ~25% of the time)
114
+
115
+ ---
116
+
117
+ ### Memory Systems & Episodic Learning
118
+
119
+ **Memory in the Age of AI Agents: A Survey**
120
+
121
+ - https://arxiv.org/abs/2512.13564
122
+ - **Paper list:** https://github.com/Shichun-Liu/Agent-Memory-Paper-List
123
+ - **Coverage:** Episodic, semantic, working memory; implementations across agents
124
+
125
+ **EM-LLM: Human-inspired Episodic Memory for Infinite Context LLMs**
126
+
127
+ - https://arxiv.org/abs/2407.09450
128
+ - **Innovation:** Bayesian surprise + graph refinement for event boundaries
129
+ - **Application:** Online episode segmentation
130
+
131
+ **Mem0: AI Memory Platform**
132
+
133
+ - https://mem0.ai
134
+ - **Technology:** Hybrid storage (Postgres), episodic summaries, continuous learning
135
+ - **Status:** Most mature long-term memory system (2026)
136
+
137
+ **Active Context Compression: Autonomous Memory Management in LLM Agents**
138
+
139
+ - https://arxiv.org/abs/2601.07190
140
+ - **Pattern:** Focus agent autonomously consolidates learnings into knowledge blocks
141
+ - **Technique:** Selective pruning of raw history
142
+
143
+ **Multi-Layered Memory Architectures for LLM Agents: Experimental Evaluation**
144
+
145
+ - https://arxiv.org/abs/2603.29194
146
+ - **Approach:** Working + episodic + semantic layers with adaptive retrieval gating
147
+
148
+ ---
149
+
150
+ ### Formal Verification & Specification
151
+
152
+ **DafnyPro: LLM-Assisted Automated Verification for Dafny Programs** (POPL 2026)
153
+
154
+ - https://popl26.sigplan.org/details/dafny-2026-papers/12/DafnyPro-LLM-Assisted-Automated-Verification-for-Dafny-Programs
155
+ - **Performance:** 86% on DafnyBench (Claude Sonnet 3.5)
156
+ - **Advance:** +16pp over previous SOTA
157
+
158
+ **MiniF2F-Dafny: LLM-Guided Mathematical Theorem Proving** (POPL 2026)
159
+
160
+ - https://popl26.sigplan.org/details/dafny-2026-papers/16/MiniF2F-Dafny-LLM-Guided-Mathematical-Theorem-Proving-via-Auto-Active-Verification
161
+ - **Coverage:** 40.6% of the test set and 44.7% of the validation set verify with empty proofs
162
+
163
+ **A Benchmark for Vericoding: Formally Verified Program Synthesis**
164
+
165
+ - https://arxiv.org/abs/2509.22908
166
+ - **Baseline:** 27% Lean, 44% Verus/Rust, 82% Dafny success rates
167
+
168
+ **ATLAS: Automated Toolkit for Large-Scale Verified Code Synthesis**
169
+
170
+ - https://arxiv.org/abs/2512.10173
171
+ - **Pipeline:** Synthesize 2.7K verified Dafny programs → 19K training examples
172
+ - **Results:** +23pp on DafnyBench, +50pp on DafnySynthesis via fine-tuning
173
+
174
+ **DafnyBench: A Benchmark for Formal Software Verification**
175
+
176
+ - https://openreview.net/pdf?id=yBgTVWccIx
177
+ - **Scope:** 412 verification problems; covers inductive invariants, loop specifications
178
+
179
+ ---
180
+
181
+ ### Test Generation & Mutation Testing
182
+
183
+ **Meta: Revolutionizing Software Testing with LLM-powered Bug Catchers**
184
+
185
+ - https://engineering.fb.com/2025/02/05/security/revolutionizing-software-testing-llm-powered-bug-catchers-meta-ach/
186
+ - **System:** ACH (Automated Compliance Hardening)
187
+ - **Scale:** 10,795 Android Kotlin classes; 9,095 mutants + 571 test cases
188
+
189
+ **Evaluating LLM-Based Test Generation Under Software Evolution**
190
+
191
+ - https://arxiv.org/abs/2603.23443
192
+ - **Challenge:** Test effectiveness degrades with code evolution
193
+
194
+ **Effective Test Generation Using Pre-Trained LLMs and Mutation Testing**
195
+
196
+ - https://www.sciencedirect.com/article/abs/pii/S0950584924000739
197
+ - **Approach:** Combine LLM generation + mutation validation
198
+
199
+ **LLMorpheus: LLM-based Mutation Testing**
200
+
201
+ - https://github.com/githubnext/llmorpheus
202
+ - **Tool:** Open-source implementation on GitHub Next
203
+
204
+ **MutGen: Mutation-Guided LLM-based Test Generation**
205
+
206
+ - **Performance:** 89.5% mutation score on HumanEval-Java (vs EvoSuite baseline)
207
+
208
+ ---
209
+
210
+ ### Cost Optimization & Model Routing
211
+
212
+ **Google: Speculative Cascades — A Hybrid Approach for Smarter, Faster LLM Inference**
213
+
214
+ - https://research.google/blog/speculative-cascades-a-hybrid-approach-for-smarter-faster-llm-inference/
215
+ - **Finding:** 30-60% cost reduction; hybrid routing + cascading
216
+ - **Benchmark:** 92% cost savings on open-source cascading
217
+
218
+ **A Unified Approach to Routing and Cascading for LLMs**
219
+
220
+ - https://arxiv.org/abs/2410.10347
221
+ - **Innovation:** Theoretically optimal integration of routing + cascading
222
+ - **Framework:** Unified decision tree for both strategies
223
+
224
+ **Dynamic Model Routing and Cascading for Efficient LLM Inference: A Survey**
225
+
226
+ - https://arxiv.org/abs/2603.04445
227
+ - **Coverage:** Routing vs cascading paradigms, cost-quality tradeoffs
228
+
229
+ **CoSine: Clustering-Based Routing for LLM Inference Optimization**
230
+
231
+ - **Results:** 23% latency reduction, 32% throughput increase
232
+
233
+ **Smurfs: Adaptive Speculative Decoding**
234
+
235
+ - **Technique:** Dynamic speculation length optimization per query
236
+
237
+ ---
238
+
239
+ ### Self-Healing CI/CD & AIOps
240
+
241
+ **Agentic SRE: How Self-Healing Infrastructure Is Redefining Enterprise AIOps** (2026)
242
+
243
+ - https://www.unite.ai/agentic-sre-how-self-healing-infrastructure-is-redefining-enterprise-aiops-in-2026/
244
+ - **Pattern:** Telemetry → reasoning → controlled automation (closed loop)
245
+ - **Adoption:** 60% of enterprises (Gartner, 2026)
246
+
247
+ **Building Self-Healing CI/CD Pipelines for Agentic AI Systems**
248
+
249
+ - https://optimumpartners.com/insight/how-to-architect-self-healing-ci/cd-for-agentic-ai/
250
+ - **Pattern:** Pipeline Doctor / Interceptor — repair agent on build failure
251
+
252
+ **From AIOps Hype to Reality: Building Self-Healing Infrastructure** (2026)
253
+
254
+ - https://techstrong.it/features/from-aiops-hype-to-reality-building-self-healing-infrastructure-in-2026
255
+ - **Results:** 67% MTTR drop; 40-60% in high-performing orgs
256
+
257
+ **AIOps: Guide to AI in IT Operations** (2026)
258
+
259
+ - https://www.ir.com/guides/what-is-aiops-guide-to-ai-in-operations-2026
260
+ - **Scope:** Anomaly detection, incident prediction, automated remediation
261
+
262
+ **LLM-as-a-Judge Pattern** (2026 standard)
263
+
264
+ - **Concept:** Secondary model evaluates primary agent output
265
+ - **Application:** Quality gates, merge decision support
266
+
267
+ ---
268
+
269
+ ### Multi-Agent Coordination & Orchestration
270
+
271
+ **How to Build Multi-Agent Systems: Complete 2026 Guide**
272
+
273
+ - https://dev.to/eira-wexford/how-to-build-multi-agent-systems-complete-2026-guide-1io6
274
+ - **Patterns:** 3-role (Planner, Worker, Judge); git worktrees for isolation
275
+ - **Status:** 40% of enterprise apps will have agents by 2026
276
+
277
+ **The Code Agent Orchestra: What Makes Multi-Agent Coding Work**
278
+
279
+ - https://addyosmani.com/blog/code-agent-orchestra/
280
+ - **Insight:** Coordination > autonomy; orchestration is the key lever
281
+
282
+ **Multi-Agent Frameworks Explained for Enterprise AI** (2026)
283
+
284
+ - https://www.adopt.ai/blog/multi-agent-frameworks
285
+ - **Frameworks:** CrewAI, LangGraph, AutoGen, MetaGPT
286
+ - **Winner:** LangGraph for complex workflows; CrewAI for rapid deployment
287
+
288
+ **MetaGPT: Multi-Agent Framework for Software Development**
289
+
290
+ - **Approach:** Simulates full product team (PM, TL, Dev, QA)
291
+ - **Specialization:** Standardized engineering workflows
292
+
293
+ **Google DORA 2025: AI Adoption & Bug Rates**
294
+
295
+ - **Finding:** 20-30% faster workflows, but 9% bug rate climb with multi-agent
296
+ - **Lesson:** Coordination + quality gates are critical
297
+
298
+ ---
299
+
300
+ ### Competitive Analysis & Benchmarks
301
+
302
+ **We Tested 15 AI Coding Agents (2026): Only 3 Changed How We Ship**
303
+
304
+ - https://www.morphllm.com/ai-coding-agent
305
+ - **Leaders:** Claude Code (80.9%), Aider (49.2%), Cline (500K downloads)
306
+
307
+ **Cline vs Aider: Which AI Coding Assistant is Best in 2026?**
308
+
309
+ - https://is4.ai/blog/our-blog-1/cline-vs-aider-comparison-2026-313
310
+ - **Comparison:** Architecture, integration, cost efficiency, workflow
311
+ - **Winner:** Aider for cost; Claude Code for complex tasks
312
+
313
+ **Aider Uses 4.2x Fewer Tokens Than Claude Code**
314
+
315
+ - https://www.morphllm.com/comparisons/morph-vs-aider-diff
316
+ - **Reason:** Diff-based editing vs search-replace
317
+
318
+ **SWE-Agent vs SWE-Bench Leaderboard**
319
+
320
+ - Leaderboard: https://llm-stats.com/benchmarks/swe-bench-verified-(agentic-coding)
321
+ - **Status:** Claude Code has the highest reported score (80.9%), but it was not officially submitted
322
+
323
+ **AI Coding Benchmarks 2026: Every Major Eval Explained**
324
+
325
+ - https://www.morphllm.com/ai-coding-benchmarks-2026
326
+ - **Coverage:** SWE-bench, SWE-bench Pro, SWE-bench Verified, Codeforces, AIME
327
+
328
+ ---
329
+
330
+ ### Additional Research & Surveys
331
+
332
+ **Agentic AI Resource Exhaustion & Infinite Loop Attacks** (Feb 2026)
333
+
334
+ - https://medium.com/@instatunnel/agentic-resource-exhaustion-the-infinite-loop-attack-of-the-ai-era-76a3f58c62e3
335
+ - **Finding:** 45% of 220 loops had problems (stagnation, stuck loops)
336
+
337
+ **How to Tell If Your AI Agent Is Stuck (Real Data From 220 Loops)**
338
+
339
+ - https://dev.to/boucle2026/how-to-tell-if-your-ai-agent-is-stuck-with-real-data-from-220-loops-4d4h
340
+ - **Techniques:** De-duplication, semantic similarity, state tracking
341
+
342
+ **Agents: Loop Control** (Vercel AI SDK)
343
+
344
+ - https://ai-sdk.dev/docs/agents/loop-control
345
+ - **Patterns:** Max iterations, timeout management, stop conditions
346
+
347
+ **120+ Agentic AI Tools Mapped Across 11 Categories** (2026)
348
+
349
+ - https://www.stackone.com/blog/ai-agent-tools-landscape-2026
350
+ - **Categories:** Frameworks, platforms, monitoring, integrations
351
+
352
+ ---
353
+
354
+ ### Industry Trends & Forecasts
355
+
356
+ **7 Agentic AI Trends to Watch in 2026**
357
+
358
+ - https://machinelearningmastery.com/7-agentic-ai-trends-to-watch-in-2026
359
+ - **Topics:** Loop control, reliability, security, cost optimization
360
+
361
+ **The Next Frontier of RAG: How Enterprise Knowledge Systems Will Evolve**
362
+
363
+ - https://nstarxinc.com/blog/the-next-frontier-of-rag-how-enterprise-knowledge-systems-will-evolve-2026-2030
364
+ - **Timeline:** 2026-2030; RAG as knowledge runtime; verification + access control
365
+
366
+ **Agentic GraphRAG for Capital Markets** (Amazon Web Services)
367
+
368
+ - https://aws.amazon.com/blogs/industries/agentic-graphrag-for-capital-markets/
369
+ - **Pattern:** Agentic RAG with specialized agents (research, verification, synthesis)
370
+
371
+ **Why GraphRAG and MCP Are the New Standard for Agentic Data Architecture**
372
+
373
+ - https://hyperight.com/agentic-data-architecture-graphrag-mcp-2026/
374
+ - **Trend:** MCP (Model Context Protocol) + GraphRAG for structured context
375
+
376
+ ---
377
+
378
+ ## Quick Link Summary by Topic
379
+
380
+ ### Dark Factory & Intent (Backlog #2)
381
+
382
+ - BCG Platinion report (above)
383
+ - Anthropic trends report (above)
384
+ - GitHub Agent Mode / Project Padawan
385
+
386
+ ### Loop Convergence (Backlog #1)
387
+
388
+ - SWE-agent NeurIPS 2024
389
+ - Geometric Dynamics of Agentic Loops (arxiv 2512.10350)
390
+ - How to Tell If Your AI Agent Is Stuck (220 loops data)
391
+
392
+ ### Vulnerability & RL (Backlog #3)
393
+
394
+ - SecCoderX (arxiv 2602.07422)
395
+ - Meta ACH system (engineering.fb.com)
396
+ - Mutation-Guided LLM-based Test Generation at Meta (arxiv 2501.12862)
397
+
398
+ ### Episodic Memory (Backlog #4)
399
+
400
+ - Mem0 (mem0.ai)
401
+ - EM-LLM (arxiv 2407.09450)
402
+ - Memory in the Age of AI Agents survey (arxiv 2512.13564)
403
+
404
+ ### Cost Optimization / Cascade (Backlog #5)
405
+
406
+ - Google Speculative Cascades (research.google)
407
+ - Unified Routing + Cascading (arxiv 2410.10347)
408
+ - CoSine, Smurfs papers
409
+
410
+ ### Mutation Testing (Backlog #6, #13)
411
+
412
+ - Meta ACH (engineering.fb.com)
413
+ - MutGen paper
414
+ - LLMorpheus (GitHub Next)
415
+
416
+ ### CI Repair & AIOps (Backlog #7)
417
+
418
+ - Agentic SRE (unite.ai)
419
+ - Pipeline Doctor pattern (optimumpartners.com)
420
+ - From AIOps Hype to Reality (techstrong.it)
421
+
422
+ ### Multi-Agent Coordination (Backlog #9)
423
+
424
+ - 2026 Multi-Agent Systems Guide (dev.to)
425
+ - The Code Agent Orchestra (addyosmani.com)
426
+ - MetaGPT, CrewAI, LangGraph frameworks
427
+
428
+ ### Formal Verification (Backlog #11)
429
+
430
+ - DafnyPro (POPL 2026)
431
+ - ATLAS (arxiv 2512.10173)
432
+ - DafnyBench (openreview)
433
+
434
+ ---
435
+
436
+ **Total sources cited:** 60+
437
+ **Papers:** 25+
438
+ **Companies/Organizations:** 15+ (Anthropic, OpenAI, DeepSeek, Meta, Google, BCG, GitHub, etc.)
439
+ **Research date:** April 4, 2026
440
+ **Coverage:** Autonomous software engineering, dark factories, RL systems, multi-agent coordination, formal verification, memory systems, cost optimization, self-healing CI/CD