shipwright-cli 3.1.0 → 3.3.0
- package/.claude/agents/code-reviewer.md +2 -0
- package/.claude/agents/devops-engineer.md +2 -0
- package/.claude/agents/doc-fleet-agent.md +2 -0
- package/.claude/agents/pipeline-agent.md +2 -0
- package/.claude/agents/shell-script-specialist.md +2 -0
- package/.claude/agents/test-specialist.md +2 -0
- package/.claude/hooks/agent-crash-capture.sh +32 -0
- package/.claude/hooks/post-tool-use.sh +3 -2
- package/.claude/hooks/pre-tool-use.sh +35 -3
- package/README.md +22 -8
- package/claude-code/hooks/config-change.sh +18 -0
- package/claude-code/hooks/instructions-reloaded.sh +7 -0
- package/claude-code/hooks/worktree-create.sh +25 -0
- package/claude-code/hooks/worktree-remove.sh +20 -0
- package/config/code-constitution.json +130 -0
- package/config/defaults.json +25 -2
- package/config/policy.json +1 -1
- package/dashboard/middleware/auth.ts +134 -0
- package/dashboard/middleware/constants.ts +21 -0
- package/dashboard/public/index.html +8 -6
- package/dashboard/public/styles.css +176 -97
- package/dashboard/routes/auth.ts +38 -0
- package/dashboard/server.ts +117 -25
- package/dashboard/services/config.ts +26 -0
- package/dashboard/services/db.ts +118 -0
- package/dashboard/src/canvas/pixel-agent.ts +298 -0
- package/dashboard/src/canvas/pixel-sprites.ts +440 -0
- package/dashboard/src/canvas/shipyard-effects.ts +367 -0
- package/dashboard/src/canvas/shipyard-scene.ts +616 -0
- package/dashboard/src/canvas/submarine-layout.ts +267 -0
- package/dashboard/src/components/header.ts +8 -7
- package/dashboard/src/core/api.ts +5 -0
- package/dashboard/src/core/router.ts +1 -0
- package/dashboard/src/design/submarine-theme.ts +253 -0
- package/dashboard/src/main.ts +2 -0
- package/dashboard/src/types/api.ts +12 -1
- package/dashboard/src/views/activity.ts +2 -1
- package/dashboard/src/views/metrics.ts +69 -1
- package/dashboard/src/views/shipyard.ts +39 -0
- package/dashboard/types/index.ts +166 -0
- package/docs/plans/2026-02-28-compound-audit-and-shipyard-design.md +186 -0
- package/docs/plans/2026-02-28-skipper-shipwright-implementation-plan.md +1182 -0
- package/docs/plans/2026-02-28-skipper-shipwright-integration-design.md +531 -0
- package/docs/plans/2026-03-01-ai-powered-skill-injection-design.md +298 -0
- package/docs/plans/2026-03-01-ai-powered-skill-injection-plan.md +1109 -0
- package/docs/plans/2026-03-01-capabilities-cleanup-plan.md +658 -0
- package/docs/plans/2026-03-01-clean-architecture-plan.md +924 -0
- package/docs/plans/2026-03-01-compound-audit-cascade-design.md +191 -0
- package/docs/plans/2026-03-01-compound-audit-cascade-plan.md +921 -0
- package/docs/plans/2026-03-01-deep-integration-plan.md +851 -0
- package/docs/plans/2026-03-01-pipeline-audit-trail-design.md +145 -0
- package/docs/plans/2026-03-01-pipeline-audit-trail-plan.md +770 -0
- package/docs/plans/2026-03-01-refined-depths-brand-design.md +382 -0
- package/docs/plans/2026-03-01-refined-depths-implementation.md +599 -0
- package/docs/plans/2026-03-01-skipper-kernel-integration-design.md +203 -0
- package/docs/plans/2026-03-01-unified-platform-design.md +272 -0
- package/docs/plans/2026-03-07-claude-code-feature-integration-design.md +189 -0
- package/docs/plans/2026-03-07-claude-code-feature-integration-plan.md +1165 -0
- package/docs/research/BACKLOG_QUICK_REFERENCE.md +352 -0
- package/docs/research/CUTTING_EDGE_RESEARCH_2026.md +546 -0
- package/docs/research/RESEARCH_INDEX.md +439 -0
- package/docs/research/RESEARCH_SOURCES.md +440 -0
- package/docs/research/RESEARCH_SUMMARY.txt +275 -0
- package/docs/superpowers/specs/2026-03-10-pipeline-quality-revolution-design.md +341 -0
- package/package.json +2 -2
- package/scripts/lib/adaptive-model.sh +427 -0
- package/scripts/lib/adaptive-timeout.sh +316 -0
- package/scripts/lib/audit-trail.sh +309 -0
- package/scripts/lib/auto-recovery.sh +471 -0
- package/scripts/lib/bandit-selector.sh +431 -0
- package/scripts/lib/bootstrap.sh +104 -2
- package/scripts/lib/causal-graph.sh +455 -0
- package/scripts/lib/compat.sh +126 -0
- package/scripts/lib/compound-audit.sh +337 -0
- package/scripts/lib/constitutional.sh +454 -0
- package/scripts/lib/context-budget.sh +359 -0
- package/scripts/lib/convergence.sh +594 -0
- package/scripts/lib/cost-optimizer.sh +634 -0
- package/scripts/lib/daemon-adaptive.sh +14 -2
- package/scripts/lib/daemon-dispatch.sh +106 -17
- package/scripts/lib/daemon-failure.sh +34 -4
- package/scripts/lib/daemon-patrol.sh +25 -4
- package/scripts/lib/daemon-poll-github.sh +361 -0
- package/scripts/lib/daemon-poll-health.sh +299 -0
- package/scripts/lib/daemon-poll.sh +27 -611
- package/scripts/lib/daemon-state.sh +119 -66
- package/scripts/lib/daemon-triage.sh +10 -0
- package/scripts/lib/dod-scorecard.sh +442 -0
- package/scripts/lib/error-actionability.sh +300 -0
- package/scripts/lib/formal-spec.sh +461 -0
- package/scripts/lib/helpers.sh +180 -5
- package/scripts/lib/intent-analysis.sh +409 -0
- package/scripts/lib/loop-convergence.sh +350 -0
- package/scripts/lib/loop-iteration.sh +682 -0
- package/scripts/lib/loop-progress.sh +48 -0
- package/scripts/lib/loop-restart.sh +185 -0
- package/scripts/lib/memory-effectiveness.sh +506 -0
- package/scripts/lib/mutation-executor.sh +352 -0
- package/scripts/lib/outcome-feedback.sh +521 -0
- package/scripts/lib/pipeline-cli.sh +336 -0
- package/scripts/lib/pipeline-commands.sh +1216 -0
- package/scripts/lib/pipeline-detection.sh +101 -3
- package/scripts/lib/pipeline-execution.sh +897 -0
- package/scripts/lib/pipeline-github.sh +28 -3
- package/scripts/lib/pipeline-intelligence-compound.sh +431 -0
- package/scripts/lib/pipeline-intelligence-scoring.sh +407 -0
- package/scripts/lib/pipeline-intelligence-skip.sh +181 -0
- package/scripts/lib/pipeline-intelligence.sh +104 -1138
- package/scripts/lib/pipeline-quality-bash-compat.sh +182 -0
- package/scripts/lib/pipeline-quality-checks.sh +17 -711
- package/scripts/lib/pipeline-quality-gates.sh +563 -0
- package/scripts/lib/pipeline-stages-build.sh +730 -0
- package/scripts/lib/pipeline-stages-delivery.sh +965 -0
- package/scripts/lib/pipeline-stages-intake.sh +1133 -0
- package/scripts/lib/pipeline-stages-monitor.sh +407 -0
- package/scripts/lib/pipeline-stages-review.sh +1022 -0
- package/scripts/lib/pipeline-stages.sh +161 -2901
- package/scripts/lib/pipeline-state.sh +36 -5
- package/scripts/lib/pipeline-util.sh +487 -0
- package/scripts/lib/policy-learner.sh +438 -0
- package/scripts/lib/process-reward.sh +493 -0
- package/scripts/lib/project-detect.sh +649 -0
- package/scripts/lib/quality-profile.sh +334 -0
- package/scripts/lib/recruit-commands.sh +885 -0
- package/scripts/lib/recruit-learning.sh +739 -0
- package/scripts/lib/recruit-roles.sh +648 -0
- package/scripts/lib/reward-aggregator.sh +458 -0
- package/scripts/lib/rl-optimizer.sh +362 -0
- package/scripts/lib/root-cause.sh +427 -0
- package/scripts/lib/scope-enforcement.sh +445 -0
- package/scripts/lib/session-restart.sh +493 -0
- package/scripts/lib/skill-memory.sh +300 -0
- package/scripts/lib/skill-registry.sh +775 -0
- package/scripts/lib/spec-driven.sh +476 -0
- package/scripts/lib/test-helpers.sh +18 -7
- package/scripts/lib/test-holdout.sh +429 -0
- package/scripts/lib/test-optimizer.sh +511 -0
- package/scripts/shipwright-file-suggest.sh +45 -0
- package/scripts/skills/adversarial-quality.md +61 -0
- package/scripts/skills/api-design.md +44 -0
- package/scripts/skills/architecture-design.md +50 -0
- package/scripts/skills/brainstorming.md +43 -0
- package/scripts/skills/data-pipeline.md +44 -0
- package/scripts/skills/deploy-safety.md +64 -0
- package/scripts/skills/documentation.md +38 -0
- package/scripts/skills/frontend-design.md +45 -0
- package/scripts/skills/generated/.gitkeep +0 -0
- package/scripts/skills/generated/_refinements/.gitkeep +0 -0
- package/scripts/skills/generated/_refinements/adversarial-quality.patch.md +3 -0
- package/scripts/skills/generated/_refinements/architecture-design.patch.md +3 -0
- package/scripts/skills/generated/_refinements/brainstorming.patch.md +3 -0
- package/scripts/skills/generated/cli-version-management.md +29 -0
- package/scripts/skills/generated/collection-system-validation.md +99 -0
- package/scripts/skills/generated/large-scale-c-refactoring-coordination.md +97 -0
- package/scripts/skills/generated/pattern-matching-similarity-scoring.md +195 -0
- package/scripts/skills/generated/test-parallelization-detection.md +65 -0
- package/scripts/skills/observability.md +79 -0
- package/scripts/skills/performance.md +48 -0
- package/scripts/skills/pr-quality.md +49 -0
- package/scripts/skills/product-thinking.md +43 -0
- package/scripts/skills/security-audit.md +49 -0
- package/scripts/skills/systematic-debugging.md +40 -0
- package/scripts/skills/testing-strategy.md +47 -0
- package/scripts/skills/two-stage-review.md +52 -0
- package/scripts/skills/validation-thoroughness.md +55 -0
- package/scripts/sw +9 -3
- package/scripts/sw-activity.sh +9 -8
- package/scripts/sw-adaptive.sh +8 -7
- package/scripts/sw-adversarial.sh +2 -1
- package/scripts/sw-architecture-enforcer.sh +3 -1
- package/scripts/sw-auth.sh +12 -2
- package/scripts/sw-autonomous.sh +5 -1
- package/scripts/sw-changelog.sh +4 -1
- package/scripts/sw-checkpoint.sh +2 -1
- package/scripts/sw-ci.sh +15 -6
- package/scripts/sw-cleanup.sh +4 -26
- package/scripts/sw-code-review.sh +45 -20
- package/scripts/sw-connect.sh +2 -1
- package/scripts/sw-context.sh +2 -1
- package/scripts/sw-cost.sh +107 -5
- package/scripts/sw-daemon.sh +71 -11
- package/scripts/sw-dashboard.sh +3 -1
- package/scripts/sw-db.sh +71 -20
- package/scripts/sw-decide.sh +8 -2
- package/scripts/sw-decompose.sh +360 -17
- package/scripts/sw-deps.sh +4 -1
- package/scripts/sw-developer-simulation.sh +4 -1
- package/scripts/sw-discovery.sh +378 -5
- package/scripts/sw-doc-fleet.sh +4 -1
- package/scripts/sw-docs-agent.sh +3 -1
- package/scripts/sw-docs.sh +2 -1
- package/scripts/sw-doctor.sh +453 -2
- package/scripts/sw-dora.sh +4 -1
- package/scripts/sw-durable.sh +12 -7
- package/scripts/sw-e2e-orchestrator.sh +17 -16
- package/scripts/sw-eventbus.sh +13 -4
- package/scripts/sw-evidence.sh +364 -12
- package/scripts/sw-feedback.sh +550 -9
- package/scripts/sw-fix.sh +20 -1
- package/scripts/sw-fleet-discover.sh +6 -2
- package/scripts/sw-fleet-viz.sh +9 -4
- package/scripts/sw-fleet.sh +5 -1
- package/scripts/sw-github-app.sh +18 -4
- package/scripts/sw-github-checks.sh +3 -2
- package/scripts/sw-github-deploy.sh +3 -2
- package/scripts/sw-github-graphql.sh +18 -7
- package/scripts/sw-guild.sh +5 -1
- package/scripts/sw-heartbeat.sh +5 -30
- package/scripts/sw-hello.sh +67 -0
- package/scripts/sw-hygiene.sh +10 -3
- package/scripts/sw-incident.sh +273 -5
- package/scripts/sw-init.sh +18 -2
- package/scripts/sw-instrument.sh +10 -2
- package/scripts/sw-intelligence.sh +44 -7
- package/scripts/sw-jira.sh +5 -1
- package/scripts/sw-launchd.sh +2 -1
- package/scripts/sw-linear.sh +4 -1
- package/scripts/sw-logs.sh +4 -1
- package/scripts/sw-loop.sh +436 -1076
- package/scripts/sw-memory.sh +357 -3
- package/scripts/sw-mission-control.sh +6 -1
- package/scripts/sw-model-router.sh +483 -27
- package/scripts/sw-otel.sh +15 -4
- package/scripts/sw-oversight.sh +14 -5
- package/scripts/sw-patrol-meta.sh +334 -0
- package/scripts/sw-pipeline-composer.sh +7 -1
- package/scripts/sw-pipeline-vitals.sh +12 -6
- package/scripts/sw-pipeline.sh +54 -2653
- package/scripts/sw-pm.sh +16 -8
- package/scripts/sw-pr-lifecycle.sh +2 -1
- package/scripts/sw-predictive.sh +17 -5
- package/scripts/sw-prep.sh +185 -2
- package/scripts/sw-ps.sh +5 -25
- package/scripts/sw-public-dashboard.sh +17 -4
- package/scripts/sw-quality.sh +14 -6
- package/scripts/sw-reaper.sh +8 -25
- package/scripts/sw-recruit.sh +156 -2303
- package/scripts/sw-regression.sh +19 -12
- package/scripts/sw-release-manager.sh +3 -1
- package/scripts/sw-release.sh +4 -1
- package/scripts/sw-remote.sh +3 -1
- package/scripts/sw-replay.sh +7 -1
- package/scripts/sw-retro.sh +158 -1
- package/scripts/sw-review-rerun.sh +3 -1
- package/scripts/sw-scale.sh +14 -5
- package/scripts/sw-security-audit.sh +6 -1
- package/scripts/sw-self-optimize.sh +173 -6
- package/scripts/sw-session.sh +9 -3
- package/scripts/sw-setup.sh +3 -1
- package/scripts/sw-stall-detector.sh +406 -0
- package/scripts/sw-standup.sh +15 -7
- package/scripts/sw-status.sh +3 -1
- package/scripts/sw-strategic.sh +14 -6
- package/scripts/sw-stream.sh +13 -4
- package/scripts/sw-swarm.sh +20 -7
- package/scripts/sw-team-stages.sh +13 -6
- package/scripts/sw-templates.sh +7 -31
- package/scripts/sw-testgen.sh +17 -6
- package/scripts/sw-tmux-pipeline.sh +4 -1
- package/scripts/sw-tmux-role-color.sh +2 -0
- package/scripts/sw-tmux-status.sh +1 -1
- package/scripts/sw-tmux.sh +37 -1
- package/scripts/sw-trace.sh +3 -1
- package/scripts/sw-tracker-github.sh +3 -0
- package/scripts/sw-tracker-jira.sh +3 -0
- package/scripts/sw-tracker-linear.sh +3 -0
- package/scripts/sw-tracker.sh +3 -1
- package/scripts/sw-triage.sh +3 -2
- package/scripts/sw-upgrade.sh +3 -1
- package/scripts/sw-ux.sh +5 -2
- package/scripts/sw-webhook.sh +5 -2
- package/scripts/sw-widgets.sh +9 -4
- package/scripts/sw-worktree.sh +15 -3
- package/scripts/test-skill-injection.sh +1233 -0
- package/templates/pipelines/autonomous.json +27 -3
- package/templates/pipelines/cost-aware.json +34 -8
- package/templates/pipelines/deployed.json +12 -0
- package/templates/pipelines/enterprise.json +12 -0
- package/templates/pipelines/fast.json +6 -0
- package/templates/pipelines/full.json +27 -3
- package/templates/pipelines/hotfix.json +6 -0
- package/templates/pipelines/standard.json +12 -0
- package/templates/pipelines/tdd.json +12 -0
@@ -0,0 +1,440 @@
+# Research Sources: Autonomous Coding Systems (April 2026)
+
+## Complete Bibliography with URLs
+
+### Dark Factory & Autonomous Delivery
+
+**BCG Platinion: The Dark Software Factory** (March 2026)
+
+- https://www.bcgplatinion.com/insights/the-dark-software-factory
+- **Key findings:** 3-5 engineers running factories; Spotify 650+ PRs/month; OpenAI 1M-line product in 5 months
+- **Disciplines:** Harness Engineering, Intent Thinking
+- **Report PDF:** https://cdn.prod.website-files.com/655cded084fee2e958faaffc/69b8331d6141dc7278866f9c_Dark_Software_Factory_BCG_Platinion_AI_report_March2026.pdf
+
+**Anthropic 2026 Agentic Coding Trends Report**
+
+- https://resources.anthropic.com/hubfs/2026%20Agentic%20Coding%20Trends%20Report.pdf
+- **Coverage:** Loop convergence triggers, prompt design impact, multi-agent coordination patterns
+- **Timeline:** 40% of enterprise apps will have agents by 2026 (vs <5% in 2025)
+
+**GitHub Copilot: Agent Mode & Project Padawan**
+
+- https://github.com/newsroom/press-releases/agent-mode
+- https://githubnext.com/projects/copilot-workspace
+- **Capabilities:** Issue-to-PR workflow, autonomous issue completion, asynchronous execution
+- **Status:** GA since September 2025; Project Padawan in development
+
+---
+
+### Autonomous Loop Patterns & Convergence Detection
+
+**SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering** (NeurIPS 2024)
+
+- https://arxiv.org/abs/2405.15793
+- **PDF:** https://arxiv.org/pdf/2405.15793
+- **Repo:** https://github.com/SWE-agent/SWE-agent
+- **Key innovation:** Custom ACI with repository primitives (find_file, search_dir, edit_tool)
+- **Benchmark:** 40.6% on SWE-bench
+
+**Geometric Dynamics of Agentic Loops in Large Language Models** (Jan 2026)
+
+- https://arxiv.org/abs/2512.10350
+- **Key finding:** Contractive vs exploratory loop regimes; prompt design governs dynamical behavior
+- **Applications:** Early exit on convergence, escalation on divergence
+
+**SWE-Bench & SWE-Bench Pro**
+
+- Benchmark: https://www.vals.ai/benchmarks/swebench
+- SWE-Bench Pro: https://scale.com/blog/swe-bench-pro
+- **Status:** Verified flagged as contaminated (OpenAI finding); Pro (1,865 tasks) is new standard
+- **Leaderboard:** https://llm-stats.com/benchmarks/swe-bench-verified-(agentic-coding)
+
+**SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution**
+
+- https://arxiv.org/pdf/2512.18470
+- **Scope:** Multi-step modifications, release note interpretation, large-scale repos
+
+---
+
+### Reinforcement Learning for Code Generation
+
+**FunPRM: Function-as-Step Process Reward Model with Meta Reward Correction**
+
+- https://arxiv.org/abs/2601.22249
+- **Innovation:** Treats functions as PRM steps; meta-learning reward correction via unit tests
+- **Performance:** +15-20% completion rate vs outcome-only rewards
+
+**SecCoderX: Secure Code Generation via Online Reinforcement Learning with Vulnerability Reward Model**
+
+- https://arxiv.org/abs/2602.07422
+- **Key contribution:** Vulnerability detection → reward model → RL loop
+- **Application:** Security-hardened code generation
+
+**Enhancing Code LLMs with Reinforcement Learning in Code Generation: A Survey**
+
+- https://arxiv.org/abs/2412.20367
+- **Coverage:** PPO standard, preference data → reward model → policy optimization
+- **Scope:** RLHF, RLIF, online RL approaches
+
+**Mutation-Guided LLM-based Test Generation at Meta**
+
+- https://arxiv.org/abs/2501.12862
+- **System:** ACH (Automated Compliance Hardening)
+- **Scale:** 10,795 Android classes; 9,095 mutants; 571 test cases generated
+
+---
+
+### Reasoning Models & Extended Thinking
+
+**Claude Opus 4.6: Adaptive Thinking** (Anthropic 2026)
+
+- https://platform.claude.com/docs/en/build-with-claude/adaptive-thinking
+- **Key feature:** Dynamically decides when/how much to think (replaces extended thinking)
+- **Capability:** Think between tool calls; 1M context window
+
+**OpenAI o1-pro: Complete Guide**
+
+- https://openai.com/index/introducing-openai-o1-preview/
+- https://openai.com/index/learning-to-reason-with-llms/
+- **Specs:** 200K context, 100K output tokens, $150/$600 pricing
+- **Performance:** 86% AIME (vs 78% o1), 89th percentile Codeforces
+
+**DeepSeek-R1: Incentivizing Reasoning Capability via RL**
+
+- https://arxiv.org/abs/2501.12948
+- **Repo:** https://github.com/deepseek-ai/DeepSeek-R1
+- **Architecture:** 671B @ 37B inference cost via Mixture of Experts
+- **Performance:** 2,029 Codeforces Elo (Candidate Master)
+- **Training:** Pure RL without SFT; multi-stage RL + SFT
+
+**Reasoning Models Don't Always Say What They Think** (Anthropic Alignment Science)
+
+- https://www.anthropic.com/research/reasoning-models-dont-say-think
+- **Finding:** Chain-of-thought reasoning may not be faithful (~25% of hints mentioned)
+
+---
+
+### Memory Systems & Episodic Learning
+
+**Memory in the Age of AI Agents: A Survey**
+
+- https://arxiv.org/abs/2512.13564
+- **Paper list:** https://github.com/Shichun-Liu/Agent-Memory-Paper-List
+- **Coverage:** Episodic, semantic, working memory; implementations across agents
+
+**EM-LLM: Human-inspired Episodic Memory for Infinite Context LLMs**
+
+- https://arxiv.org/abs/2407.09450
+- **Innovation:** Bayesian surprise + graph refinement for event boundaries
+- **Application:** Online episode segmentation
+
+**Mem0: AI Memory Platform**
+
+- https://mem0.ai
+- **Technology:** Hybrid storage (Postgres), episodic summaries, continuous learning
+- **Status:** Most mature long-term memory system (2026)
+
+**Active Context Compression: Autonomous Memory Management in LLM Agents**
+
+- https://arxiv.org/abs/2601.07190
+- **Pattern:** Focus agent autonomously consolidates learnings into knowledge blocks
+- **Technique:** Selective pruning of raw history
+
+**Multi-Layered Memory Architectures for LLM Agents: Experimental Evaluation**
+
+- https://arxiv.org/abs/2603.29194
+- **Approach:** Working + episodic + semantic layers with adaptive retrieval gating
+
+---
+
+### Formal Verification & Specification
+
+**DafnyPro: LLM-Assisted Automated Verification for Dafny Programs** (POPL 2026)
+
+- https://popl26.sigplan.org/details/dafny-2026-papers/12/DafnyPro-LLM-Assisted-Automated-Verification-for-Dafny-Programs
+- **Performance:** 86% on DafnyBench (Claude Sonnet 3.5)
+- **Advance:** +16pp over previous SOTA
+
+**MiniF2F-Dafny: LLM-Guided Mathematical Theorem Proving** (POPL 2026)
+
+- https://popl26.sigplan.org/details/dafny-2026-papers/16/MiniF2F-Dafny-LLM-Guided-Mathematical-Theorem-Proving-via-Auto-Active-Verification
+- **Coverage:** 40.6% test set, 44.7% validation set with empty proofs
+
+**A Benchmark for Vericoding: Formally Verified Program Synthesis**
+
+- https://arxiv.org/abs/2509.22908
+- **Baseline:** 27% Lean, 44% Verus/Rust, 82% Dafny success rates
+
+**ATLAS: Automated Toolkit for Large-Scale Verified Code Synthesis**
+
+- https://arxiv.org/abs/2512.10173
+- **Pipeline:** Synthesize 2.7K verified Dafny programs → 19K training examples
+- **Results:** +23pp on DafnyBench, +50pp on DafnySynthesis via fine-tuning
+
+**DafnyBench: A Benchmark for Formal Software Verification**
+
+- https://openreview.net/pdf?id=yBgTVWccIx
+- **Scope:** 412 verification problems; covers inductive invariants, loop specifications
+
+---
+
+### Test Generation & Mutation Testing
+
+**Meta: Revolutionizing Software Testing with LLM-powered Bug Catchers**
+
+- https://engineering.fb.com/2025/02/05/security/revolutionizing-software-testing-llm-powered-bug-catchers-meta-ach/
+- **System:** ACH (Automated Compliance Hardening)
+- **Scale:** 10,795 Android Kotlin classes; 9,095 mutants + 571 test cases
+
+**Evaluating LLM-Based Test Generation Under Software Evolution**
+
+- https://arxiv.org/abs/2603.23443
+- **Challenge:** Test effectiveness degrades with code evolution
+
+**Effective Test Generation Using Pre-Trained LLMs and Mutation Testing**
+
+- https://www.sciencedirect.com/article/abs/pii/S0950584924000739
+- **Approach:** Combine LLM generation + mutation validation
+
+**LLMorpheus: LLM-based Mutation Testing**
+
+- https://github.com/githubnext/llmorpheus
+- **Tool:** Open-source implementation on GitHub Next
+
+**MutGen: Mutation-Guided LLM-based Test Generation**
+
+- **Performance:** 89.5% mutation score on HumanEval-Java (vs EvoSuite baseline)
+
+---
+
+### Cost Optimization & Model Routing
+
+**Google: Speculative Cascades — A Hybrid Approach for Smarter, Faster LLM Inference**
+
+- https://research.google/blog/speculative-cascades-a-hybrid-approach-for-smarter-faster-llm-inference/
+- **Finding:** 30-60% cost reduction; hybrid routing + cascading
+- **Benchmark:** 92% cost savings on open-source cascading
+
+**A Unified Approach to Routing and Cascading for LLMs**
+
+- https://arxiv.org/abs/2410.10347
+- **Innovation:** Theoretically optimal integration of routing + cascading
+- **Framework:** Unified decision tree for both strategies
+
+**Dynamic Model Routing and Cascading for Efficient LLM Inference: A Survey**
+
+- https://arxiv.org/abs/2603.04445
+- **Coverage:** Routing vs cascading paradigms, cost-quality tradeoffs
+
+**CoSine: Clustering-Based Routing for LLM Inference Optimization**
+
+- **Results:** 23% latency reduction, 32% throughput increase
+
+**Smurfs: Adaptive Speculative Decoding**
+
+- **Technique:** Dynamic speculation length optimization per query
+
+---
+
+### Self-Healing CI/CD & AIOps
+
+**Agentic SRE: How Self-Healing Infrastructure Is Redefining Enterprise AIOps** (2026)
+
+- https://www.unite.ai/agentic-sre-how-self-healing-infrastructure-is-redefining-enterprise-aiops-in-2026/
+- **Pattern:** Telemetry → reasoning → controlled automation (closed loop)
+- **Adoption:** 60% of enterprises by Gartner 2026
+
+**Building Self-Healing CI/CD Pipelines for Agentic AI Systems**
+
+- https://optimumpartners.com/insight/how-to-architect-self-healing-ci/cd-for-agentic-ai/
+- **Pattern:** Pipeline Doctor / Interceptor — repair agent on build failure
+
+**From AIOps Hype to Reality: Building Self-Healing Infrastructure** (2026)
+
+- https://techstrong.it/features/from-aiops-hype-to-reality-building-self-healing-infrastructure-in-2026
+- **Results:** 67% MTTR drop; 40-60% in high-performing orgs
+
+**AIOps: Guide to AI in IT Operations** (2026)
+
+- https://www.ir.com/guides/what-is-aiops-guide-to-ai-in-operations-2026
+- **Scope:** Anomaly detection, incident prediction, automated remediation
+
+**LLM-as-a-Judge Pattern** (2026 standard)
+
+- **Concept:** Secondary model evaluates primary agent output
+- **Application:** Quality gates, merge decision support
+
+---
+
+### Multi-Agent Coordination & Orchestration
+
+**How to Build Multi-Agent Systems: Complete 2026 Guide**
+
+- https://dev.to/eira-wexford/how-to-build-multi-agent-systems-complete-2026-guide-1io6
+- **Patterns:** 3-role (Planner, Worker, Judge); git worktrees for isolation
+- **Status:** 40% of enterprise apps will have agents by 2026
+
+**The Code Agent Orchestra: What Makes Multi-Agent Coding Work**
+
+- https://addyosmani.com/blog/code-agent-orchestra/
+- **Insight:** Coordination > autonomy; orchestration is the key lever
+
+**Multi-Agent Frameworks Explained for Enterprise AI** (2026)
+
+- https://www.adopt.ai/blog/multi-agent-frameworks
+- **Frameworks:** CrewAI, LangGraph, AutoGen, MetaGPT
+- **Winner:** LangGraph for complex workflows; CrewAI for rapid deployment
+
+**MetaGPT: Multi-Agent Framework for Software Development**
+
+- **Approach:** Simulates full product team (PM, TL, Dev, QA)
+- **Specialization:** Standardized engineering workflows
+
+**Google DORA 2025: AI Adoption & Bug Rates**
+
+- **Finding:** 20-30% faster workflows, but 9% bug rate climb with multi-agent
+- **Lesson:** Coordination + quality gates are critical
+
+---
+
+### Competitive Analysis & Benchmarks
+
+**We Tested 15 AI Coding Agents (2026): Only 3 Changed How We Ship**
+
+- https://www.morphllm.com/ai-coding-agent
+- **Leaders:** Claude Code (80.9%), Aider (49.2%), Cline (500K downloads)
+
+**Cline vs Aider: Which AI Coding Assistant is Best in 2026?**
+
+- https://is4.ai/blog/our-blog-1/cline-vs-aider-comparison-2026-313
+- **Comparison:** Architecture, integration, cost efficiency, workflow
+- **Winner:** Aider for cost; Claude Code for complex tasks
+
+**Aider Uses 4.2x Fewer Tokens Than Claude Code**
+
+- https://www.morphllm.com/comparisons/morph-vs-aider-diff
+- **Reason:** Diff-based editing vs search-replace
+
+**SWE-Agent vs SWE-Bench Leaderboard**
+
+- Leaderboard: https://llm-stats.com/benchmarks/swe-bench-verified-(agentic-coding)
+- **Status:** Claude Code highest reported (80.9%), but unsubmitted officially
+
+**AI Coding Benchmarks 2026: Every Major Eval Explained**
+
+- https://www.morphllm.com/ai-coding-benchmarks-2026
+- **Coverage:** SWE-bench, SWE-bench Pro, SWE-Bench Verified, Codeforces, AIME
+
+---
+
+### Additional Research & Surveys
+
+**Agentic AI Resource Exhaustion & Infinite Loop Attacks** (Feb 2026)
+
+- https://medium.com/@instatunnel/agentic-resource-exhaustion-the-infinite-loop-attack-of-the-ai-era-76a3f58c62e3
+- **Finding:** 45% of 220 loops had problems (stagnation, stuck loops)
+
+**How to Tell If Your AI Agent Is Stuck (Real Data From 220 Loops)**
+
+- https://dev.to/boucle2026/how-to-tell-if-your-ai-agent-is-stuck-with-real-data-from-220-loops-4d4h
+- **Techniques:** De-duplication, semantic similarity, state tracking
+
+**Agents: Loop Control** (Vercel AI SDK)
+
+- https://ai-sdk.dev/docs/agents/loop-control
+- **Patterns:** Max iterations, timeout management, stop conditions
+
+**120+ Agentic AI Tools Mapped Across 11 Categories** (2026)
+
+- https://www.stackone.com/blog/ai-agent-tools-landscape-2026
+- **Categories:** Frameworks, platforms, monitoring, integrations
+
+---
+
+### Industry Trends & Forecasts
+
+**7 Agentic AI Trends to Watch in 2026**
+
+- https://machinelearningmastery.com/7-agentic-ai-trends-to-watch-in-2026
+- **Topics:** Loop control, reliability, security, cost optimization
+
+**The Next Frontier of RAG: How Enterprise Knowledge Systems Will Evolve**
+
+- https://nstarxinc.com/blog/the-next-frontier-of-rag-how-enterprise-knowledge-systems-will-evolve-2026-2030
+- **Timeline:** 2026-2030; RAG as knowledge runtime; verification + access control
+
+**Agentic GraphRAG for Capital Markets** (Amazon Web Services)
+
+- https://aws.amazon.com/blogs/industries/agentic-graphrag-for-capital-markets/
+- **Pattern:** Agentic RAG with specialized agents (research, verification, synthesis)
+
+**Why GraphRAG and MCP Are the New Standard for Agentic Data Architecture**
+
+- https://hyperight.com/agentic-data-architecture-graphrag-mcp-2026/
+- **Trend:** MCP (Model Context Protocol) + GraphRAG for structured context
+
+---
+
+## Quick Link Summary by Topic
+
+### Dark Factory & Intent (Backlog #2)
+
+- BCG Platinion report (above)
+- Anthropic trends report (above)
+- GitHub Agent Mode / Project Padawan
+
+### Loop Convergence (Backlog #1)
+
+- SWE-agent NeurIPS 2024
+- Geometric Dynamics of Agentic Loops (arxiv 2512.10350)
+- How to Tell If Your AI Agent Is Stuck (220 loops data)
+
+### Vulnerability & RL (Backlog #3)
+
+- SecCoderX (arxiv 2602.07422)
+- Meta ACH system (engineering.fb.com)
+- Mutation-Guided LLM at Meta (arxiv 2501.12862)
+
+### Episodic Memory (Backlog #4)
+
+- Mem0 (mem0.ai)
+- EM-LLM (arxiv 2407.09450)
+- Memory in the Age of AI Agents survey (arxiv 2512.13564)
+
+### Cost Optimization / Cascade (Backlog #5)
+
+- Google Speculative Cascades (research.google)
+- Unified Routing + Cascading (arxiv 2410.10347)
+- CoSine, Smurfs papers
+
+### Mutation Testing (Backlog #6, #13)
+
+- Meta ACH (engineering.fb.com)
+- MutGen paper
+- LLMorpheus (GitHub Next)
+
+### CI Repair & AIOps (Backlog #7)
+
+- Agentic SRE (unite.ai)
+- Pipeline Doctor pattern (optimumpartners.com)
+- From AIOps Hype to Reality (techstrong.it)
+
+### Multi-Agent Coordination (Backlog #9)
+
+- 2026 Multi-Agent Systems Guide (dev.to)
+- The Code Agent Orchestra (addyosmani.com)
+- MetaGPT, CrewAI, LangGraph frameworks
+
+### Formal Verification (Backlog #11)
+
+- DafnyPro (POPL 2026)
+- ATLAS (arxiv 2512.10173)
+- DafnyBench (openreview)
+
+---
+
+**Total sources cited:** 60+
+**Papers:** 25+
+**Companies/Organizations:** 15+ (Anthropic, OpenAI, DeepSeek, Meta, Google, BCG, GitHub, etc.)
+**Research date:** April 4, 2026
+**Coverage:** Autonomous software engineering, dark factories, RL systems, multi-agent coordination, formal verification, memory systems, cost optimization, self-healing CI/CD