shipwright-cli 3.2.0 → 3.3.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/.claude/agents/code-reviewer.md +2 -0
- package/.claude/agents/devops-engineer.md +2 -0
- package/.claude/agents/doc-fleet-agent.md +2 -0
- package/.claude/agents/pipeline-agent.md +2 -0
- package/.claude/agents/shell-script-specialist.md +2 -0
- package/.claude/agents/test-specialist.md +2 -0
- package/.claude/hooks/agent-crash-capture.sh +32 -0
- package/.claude/hooks/post-tool-use.sh +3 -2
- package/.claude/hooks/pre-tool-use.sh +35 -3
- package/README.md +4 -4
- package/claude-code/hooks/config-change.sh +18 -0
- package/claude-code/hooks/instructions-reloaded.sh +7 -0
- package/claude-code/hooks/worktree-create.sh +25 -0
- package/claude-code/hooks/worktree-remove.sh +20 -0
- package/config/code-constitution.json +130 -0
- package/dashboard/middleware/auth.ts +134 -0
- package/dashboard/middleware/constants.ts +21 -0
- package/dashboard/public/index.html +2 -6
- package/dashboard/public/styles.css +100 -97
- package/dashboard/routes/auth.ts +38 -0
- package/dashboard/server.ts +66 -25
- package/dashboard/services/config.ts +26 -0
- package/dashboard/services/db.ts +118 -0
- package/dashboard/src/canvas/pixel-agent.ts +298 -0
- package/dashboard/src/canvas/pixel-sprites.ts +440 -0
- package/dashboard/src/canvas/shipyard-effects.ts +367 -0
- package/dashboard/src/canvas/shipyard-scene.ts +616 -0
- package/dashboard/src/canvas/submarine-layout.ts +267 -0
- package/dashboard/src/components/header.ts +8 -7
- package/dashboard/src/core/router.ts +1 -0
- package/dashboard/src/design/submarine-theme.ts +253 -0
- package/dashboard/src/main.ts +2 -0
- package/dashboard/src/types/api.ts +2 -1
- package/dashboard/src/views/activity.ts +2 -1
- package/dashboard/src/views/shipyard.ts +39 -0
- package/dashboard/types/index.ts +166 -0
- package/docs/plans/2026-02-28-compound-audit-and-shipyard-design.md +186 -0
- package/docs/plans/2026-02-28-skipper-shipwright-implementation-plan.md +1182 -0
- package/docs/plans/2026-02-28-skipper-shipwright-integration-design.md +531 -0
- package/docs/plans/2026-03-01-ai-powered-skill-injection-design.md +298 -0
- package/docs/plans/2026-03-01-ai-powered-skill-injection-plan.md +1109 -0
- package/docs/plans/2026-03-01-capabilities-cleanup-plan.md +658 -0
- package/docs/plans/2026-03-01-clean-architecture-plan.md +924 -0
- package/docs/plans/2026-03-01-compound-audit-cascade-design.md +191 -0
- package/docs/plans/2026-03-01-compound-audit-cascade-plan.md +921 -0
- package/docs/plans/2026-03-01-deep-integration-plan.md +851 -0
- package/docs/plans/2026-03-01-pipeline-audit-trail-design.md +145 -0
- package/docs/plans/2026-03-01-pipeline-audit-trail-plan.md +770 -0
- package/docs/plans/2026-03-01-refined-depths-brand-design.md +382 -0
- package/docs/plans/2026-03-01-refined-depths-implementation.md +599 -0
- package/docs/plans/2026-03-01-skipper-kernel-integration-design.md +203 -0
- package/docs/plans/2026-03-01-unified-platform-design.md +272 -0
- package/docs/plans/2026-03-07-claude-code-feature-integration-design.md +189 -0
- package/docs/plans/2026-03-07-claude-code-feature-integration-plan.md +1165 -0
- package/docs/research/BACKLOG_QUICK_REFERENCE.md +352 -0
- package/docs/research/CUTTING_EDGE_RESEARCH_2026.md +546 -0
- package/docs/research/RESEARCH_INDEX.md +439 -0
- package/docs/research/RESEARCH_SOURCES.md +440 -0
- package/docs/research/RESEARCH_SUMMARY.txt +275 -0
- package/docs/superpowers/specs/2026-03-10-pipeline-quality-revolution-design.md +341 -0
- package/package.json +2 -2
- package/scripts/lib/adaptive-model.sh +427 -0
- package/scripts/lib/adaptive-timeout.sh +316 -0
- package/scripts/lib/audit-trail.sh +309 -0
- package/scripts/lib/auto-recovery.sh +471 -0
- package/scripts/lib/bandit-selector.sh +431 -0
- package/scripts/lib/bootstrap.sh +104 -2
- package/scripts/lib/causal-graph.sh +455 -0
- package/scripts/lib/compat.sh +126 -0
- package/scripts/lib/compound-audit.sh +337 -0
- package/scripts/lib/constitutional.sh +454 -0
- package/scripts/lib/context-budget.sh +359 -0
- package/scripts/lib/convergence.sh +594 -0
- package/scripts/lib/cost-optimizer.sh +634 -0
- package/scripts/lib/daemon-adaptive.sh +10 -0
- package/scripts/lib/daemon-dispatch.sh +106 -17
- package/scripts/lib/daemon-failure.sh +34 -4
- package/scripts/lib/daemon-patrol.sh +23 -2
- package/scripts/lib/daemon-poll-github.sh +361 -0
- package/scripts/lib/daemon-poll-health.sh +299 -0
- package/scripts/lib/daemon-poll.sh +27 -611
- package/scripts/lib/daemon-state.sh +112 -66
- package/scripts/lib/daemon-triage.sh +10 -0
- package/scripts/lib/dod-scorecard.sh +442 -0
- package/scripts/lib/error-actionability.sh +300 -0
- package/scripts/lib/formal-spec.sh +461 -0
- package/scripts/lib/helpers.sh +177 -4
- package/scripts/lib/intent-analysis.sh +409 -0
- package/scripts/lib/loop-convergence.sh +350 -0
- package/scripts/lib/loop-iteration.sh +682 -0
- package/scripts/lib/loop-progress.sh +48 -0
- package/scripts/lib/loop-restart.sh +185 -0
- package/scripts/lib/memory-effectiveness.sh +506 -0
- package/scripts/lib/mutation-executor.sh +352 -0
- package/scripts/lib/outcome-feedback.sh +521 -0
- package/scripts/lib/pipeline-cli.sh +336 -0
- package/scripts/lib/pipeline-commands.sh +1216 -0
- package/scripts/lib/pipeline-detection.sh +100 -2
- package/scripts/lib/pipeline-execution.sh +897 -0
- package/scripts/lib/pipeline-github.sh +28 -3
- package/scripts/lib/pipeline-intelligence-compound.sh +431 -0
- package/scripts/lib/pipeline-intelligence-scoring.sh +407 -0
- package/scripts/lib/pipeline-intelligence-skip.sh +181 -0
- package/scripts/lib/pipeline-intelligence.sh +100 -1136
- package/scripts/lib/pipeline-quality-bash-compat.sh +182 -0
- package/scripts/lib/pipeline-quality-checks.sh +17 -715
- package/scripts/lib/pipeline-quality-gates.sh +563 -0
- package/scripts/lib/pipeline-stages-build.sh +730 -0
- package/scripts/lib/pipeline-stages-delivery.sh +965 -0
- package/scripts/lib/pipeline-stages-intake.sh +1133 -0
- package/scripts/lib/pipeline-stages-monitor.sh +407 -0
- package/scripts/lib/pipeline-stages-review.sh +1022 -0
- package/scripts/lib/pipeline-stages.sh +59 -2929
- package/scripts/lib/pipeline-state.sh +36 -5
- package/scripts/lib/pipeline-util.sh +487 -0
- package/scripts/lib/policy-learner.sh +438 -0
- package/scripts/lib/process-reward.sh +493 -0
- package/scripts/lib/project-detect.sh +649 -0
- package/scripts/lib/quality-profile.sh +334 -0
- package/scripts/lib/recruit-commands.sh +885 -0
- package/scripts/lib/recruit-learning.sh +739 -0
- package/scripts/lib/recruit-roles.sh +648 -0
- package/scripts/lib/reward-aggregator.sh +458 -0
- package/scripts/lib/rl-optimizer.sh +362 -0
- package/scripts/lib/root-cause.sh +427 -0
- package/scripts/lib/scope-enforcement.sh +445 -0
- package/scripts/lib/session-restart.sh +493 -0
- package/scripts/lib/skill-memory.sh +300 -0
- package/scripts/lib/skill-registry.sh +775 -0
- package/scripts/lib/spec-driven.sh +476 -0
- package/scripts/lib/test-helpers.sh +18 -7
- package/scripts/lib/test-holdout.sh +429 -0
- package/scripts/lib/test-optimizer.sh +511 -0
- package/scripts/shipwright-file-suggest.sh +45 -0
- package/scripts/skills/adversarial-quality.md +61 -0
- package/scripts/skills/api-design.md +44 -0
- package/scripts/skills/architecture-design.md +50 -0
- package/scripts/skills/brainstorming.md +43 -0
- package/scripts/skills/data-pipeline.md +44 -0
- package/scripts/skills/deploy-safety.md +64 -0
- package/scripts/skills/documentation.md +38 -0
- package/scripts/skills/frontend-design.md +45 -0
- package/scripts/skills/generated/.gitkeep +0 -0
- package/scripts/skills/generated/_refinements/.gitkeep +0 -0
- package/scripts/skills/generated/_refinements/adversarial-quality.patch.md +3 -0
- package/scripts/skills/generated/_refinements/architecture-design.patch.md +3 -0
- package/scripts/skills/generated/_refinements/brainstorming.patch.md +3 -0
- package/scripts/skills/generated/cli-version-management.md +29 -0
- package/scripts/skills/generated/collection-system-validation.md +99 -0
- package/scripts/skills/generated/large-scale-c-refactoring-coordination.md +97 -0
- package/scripts/skills/generated/pattern-matching-similarity-scoring.md +195 -0
- package/scripts/skills/generated/test-parallelization-detection.md +65 -0
- package/scripts/skills/observability.md +79 -0
- package/scripts/skills/performance.md +48 -0
- package/scripts/skills/pr-quality.md +49 -0
- package/scripts/skills/product-thinking.md +43 -0
- package/scripts/skills/security-audit.md +49 -0
- package/scripts/skills/systematic-debugging.md +40 -0
- package/scripts/skills/testing-strategy.md +47 -0
- package/scripts/skills/two-stage-review.md +52 -0
- package/scripts/skills/validation-thoroughness.md +55 -0
- package/scripts/sw +9 -3
- package/scripts/sw-activity.sh +9 -2
- package/scripts/sw-adaptive.sh +2 -1
- package/scripts/sw-adversarial.sh +2 -1
- package/scripts/sw-architecture-enforcer.sh +3 -1
- package/scripts/sw-auth.sh +12 -2
- package/scripts/sw-autonomous.sh +5 -1
- package/scripts/sw-changelog.sh +4 -1
- package/scripts/sw-checkpoint.sh +2 -1
- package/scripts/sw-ci.sh +5 -1
- package/scripts/sw-cleanup.sh +4 -26
- package/scripts/sw-code-review.sh +10 -4
- package/scripts/sw-connect.sh +2 -1
- package/scripts/sw-context.sh +2 -1
- package/scripts/sw-cost.sh +48 -3
- package/scripts/sw-daemon.sh +66 -9
- package/scripts/sw-dashboard.sh +3 -1
- package/scripts/sw-db.sh +59 -16
- package/scripts/sw-decide.sh +8 -2
- package/scripts/sw-decompose.sh +360 -17
- package/scripts/sw-deps.sh +4 -1
- package/scripts/sw-developer-simulation.sh +4 -1
- package/scripts/sw-discovery.sh +325 -2
- package/scripts/sw-doc-fleet.sh +4 -1
- package/scripts/sw-docs-agent.sh +3 -1
- package/scripts/sw-docs.sh +2 -1
- package/scripts/sw-doctor.sh +453 -2
- package/scripts/sw-dora.sh +4 -1
- package/scripts/sw-durable.sh +4 -3
- package/scripts/sw-e2e-orchestrator.sh +17 -16
- package/scripts/sw-eventbus.sh +7 -1
- package/scripts/sw-evidence.sh +364 -12
- package/scripts/sw-feedback.sh +550 -9
- package/scripts/sw-fix.sh +20 -1
- package/scripts/sw-fleet-discover.sh +6 -2
- package/scripts/sw-fleet-viz.sh +4 -1
- package/scripts/sw-fleet.sh +5 -1
- package/scripts/sw-github-app.sh +16 -3
- package/scripts/sw-github-checks.sh +3 -2
- package/scripts/sw-github-deploy.sh +3 -2
- package/scripts/sw-github-graphql.sh +18 -7
- package/scripts/sw-guild.sh +5 -1
- package/scripts/sw-heartbeat.sh +5 -30
- package/scripts/sw-hello.sh +67 -0
- package/scripts/sw-hygiene.sh +6 -1
- package/scripts/sw-incident.sh +265 -1
- package/scripts/sw-init.sh +18 -2
- package/scripts/sw-instrument.sh +10 -2
- package/scripts/sw-intelligence.sh +42 -6
- package/scripts/sw-jira.sh +5 -1
- package/scripts/sw-launchd.sh +2 -1
- package/scripts/sw-linear.sh +4 -1
- package/scripts/sw-logs.sh +4 -1
- package/scripts/sw-loop.sh +432 -1128
- package/scripts/sw-memory.sh +356 -2
- package/scripts/sw-mission-control.sh +6 -1
- package/scripts/sw-model-router.sh +481 -26
- package/scripts/sw-otel.sh +13 -4
- package/scripts/sw-oversight.sh +14 -5
- package/scripts/sw-patrol-meta.sh +334 -0
- package/scripts/sw-pipeline-composer.sh +5 -1
- package/scripts/sw-pipeline-vitals.sh +2 -1
- package/scripts/sw-pipeline.sh +53 -2664
- package/scripts/sw-pm.sh +12 -5
- package/scripts/sw-pr-lifecycle.sh +2 -1
- package/scripts/sw-predictive.sh +7 -1
- package/scripts/sw-prep.sh +185 -2
- package/scripts/sw-ps.sh +5 -25
- package/scripts/sw-public-dashboard.sh +15 -3
- package/scripts/sw-quality.sh +2 -1
- package/scripts/sw-reaper.sh +8 -25
- package/scripts/sw-recruit.sh +156 -2303
- package/scripts/sw-regression.sh +19 -12
- package/scripts/sw-release-manager.sh +3 -1
- package/scripts/sw-release.sh +4 -1
- package/scripts/sw-remote.sh +3 -1
- package/scripts/sw-replay.sh +7 -1
- package/scripts/sw-retro.sh +158 -1
- package/scripts/sw-review-rerun.sh +3 -1
- package/scripts/sw-scale.sh +10 -3
- package/scripts/sw-security-audit.sh +6 -1
- package/scripts/sw-self-optimize.sh +6 -3
- package/scripts/sw-session.sh +9 -3
- package/scripts/sw-setup.sh +3 -1
- package/scripts/sw-stall-detector.sh +406 -0
- package/scripts/sw-standup.sh +15 -7
- package/scripts/sw-status.sh +3 -1
- package/scripts/sw-strategic.sh +4 -1
- package/scripts/sw-stream.sh +7 -1
- package/scripts/sw-swarm.sh +18 -6
- package/scripts/sw-team-stages.sh +13 -6
- package/scripts/sw-templates.sh +5 -29
- package/scripts/sw-testgen.sh +7 -1
- package/scripts/sw-tmux-pipeline.sh +4 -1
- package/scripts/sw-tmux-role-color.sh +2 -0
- package/scripts/sw-tmux-status.sh +1 -1
- package/scripts/sw-tmux.sh +3 -1
- package/scripts/sw-trace.sh +3 -1
- package/scripts/sw-tracker-github.sh +3 -0
- package/scripts/sw-tracker-jira.sh +3 -0
- package/scripts/sw-tracker-linear.sh +3 -0
- package/scripts/sw-tracker.sh +3 -1
- package/scripts/sw-triage.sh +2 -1
- package/scripts/sw-upgrade.sh +3 -1
- package/scripts/sw-ux.sh +5 -2
- package/scripts/sw-webhook.sh +3 -1
- package/scripts/sw-widgets.sh +3 -1
- package/scripts/sw-worktree.sh +15 -3
- package/scripts/test-skill-injection.sh +1233 -0
- package/templates/pipelines/autonomous.json +27 -3
- package/templates/pipelines/cost-aware.json +34 -8
- package/templates/pipelines/deployed.json +12 -0
- package/templates/pipelines/enterprise.json +12 -0
- package/templates/pipelines/fast.json +6 -0
- package/templates/pipelines/full.json +27 -3
- package/templates/pipelines/hotfix.json +6 -0
- package/templates/pipelines/standard.json +12 -0
- package/templates/pipelines/tdd.json +12 -0

@@ -0,0 +1,546 @@

# Cutting Edge Research: Autonomous Coding Systems, Dark Factories & RL (April 2026)

**Research Date:** April 4, 2026
**Scope:** 10 research areas across autonomous software engineering, dark factories, RL systems, and multi-agent coordination
**Format:** Competitive analysis (SOTA systems vs Shipwright), specific gaps, and an actionable 20-item backlog prioritized by impact/effort ratio

---

## Executive Summary

The autonomous software engineering landscape has consolidated around four operating models by early 2026:

1. **Dark Factory Model** (BCG Platinion) — 3-5 engineers running fully automated factories shipping 650+ PRs/month
2. **Reasoning-First Agents** (OpenAI o1-pro, DeepSeek-R1) — Extended thinking with cost-optimal cascade routing
3. **Tool-Use Optimization** (SWE-agent, Claude Code, Aider) — Agent-Computer Interface (ACI) design + diffing strategies
4. **Memory-Driven Learning** (Mem0, EM-LLM, episodic memory) — Self-improving agents via persistent episodic traces

**Shipwright's Current Position:** Strong foundation in pipeline orchestration, multi-agent coordination, and RL reward aggregation. **Key gaps:** episodic memory for cross-session learning, formal verification integration, context distillation, and advanced loop convergence detection.

---
## 1. Autonomous Loop Patterns & Convergence Detection

### SOTA Systems Doing This

- **SWE-agent** (NeurIPS 2024, [arxiv.org/abs/2405.15793](https://arxiv.org/abs/2405.15793)) — Custom Agent-Computer Interface (ACI) with repository navigation primitives (find_file, search_dir, search_file)
- **SWE-bench Verified + SWE-bench Pro** — 1,865+ tasks with verified test suites; Verified is now flagged as contaminated, so Pro is the SOTA benchmark
- **Geometric Dynamics of Agentic Loops** (arxiv 2512.10350) — Formal characterization of contractive vs exploratory loop regimes
- **2026 Agentic Coding Trends Report** (Anthropic, [resources.anthropic.com](https://resources.anthropic.com/hubfs/2026%20Agentic%20Coding%20Trends%20Report.pdf)) — Loop convergence triggers based on prompt design

### What Shipwright Has

- ✓ `sw-loop.sh` (2561 lines) with multi-iteration harness and context exhaustion detection
- ✓ `sw-convergence-test.sh` with convergence detection unit tests
- ✓ `sw-stall-detector.sh` identifying pipeline stalls and deadlocks
- ✓ Iteration budgets with `--max-restarts` escalation
- ✓ Session restart with progress memory injection

### Specific Gap

**Stuck detection is heuristic; there is no formal detection of contractive vs exploratory regimes.** Shipwright's loop iteration cap is a hard limit (default 5 iterations), but SOTA systems use regime detection to decide between early exit and escalation. SWE-agent and Anthropic's findings show that prompt design (e.g., "summarize and negate" vs "refine incrementally") governs whether a loop converges or diverges. Shipwright lacks the **semantic trajectory analysis** needed to classify loop behavior geometrically.

### Actionable Gap

Implement regime detection by tracking the embedding-space distance between consecutive outputs. When agent output vectors stop moving (contractive regime), terminate early. When they diverge without bound (exploratory regime), escalate to longer chains of thought or switch to a reasoning model (o1-pro, DeepSeek-R1).

**Impact:** 25-40% reduction in iteration waste on stuck loops; early exit on convergence.
**Effort:** Medium (requires embedding-space tracking, vector distance computation).
**Priority Rank:** 1 (foundational for cost optimization)
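
As a rough sketch, the regime check could look like the shell helper below. The distance series is assumed to come from an external embedding call; `classify_regime` and both ratio thresholds are illustrative, not existing Shipwright code or settings:

```shell
#!/usr/bin/env bash
# classify_regime (hypothetical helper): given a space-separated series of
# embedding distances between consecutive agent outputs, decide whether the
# loop is settling (contractive), blowing up (exploratory), or still moving.
classify_regime() {
  local distances="$1" shrink_ratio="${2:-0.5}" growth_ratio="${3:-1.5}"
  printf '%s\n' "$distances" | awk -v shrink="$shrink_ratio" -v growth="$growth_ratio" '
    {
      n = NF
      if (n < 3 || $1 == 0) { print "undetermined"; exit }
      # Compare the most recent inter-output distance to the first one.
      ratio = $n / $1
      if (ratio <= shrink)      print "contractive"   # converging: exit early
      else if (ratio >= growth) print "exploratory"   # diverging: escalate model
      else                      print "undetermined"  # keep iterating within budget
    }'
}
```

A caller would terminate the loop on `contractive` and switch models on `exploratory`, only falling back to the hard iteration cap when the series is still `undetermined`.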

---

## 2. Dark Factory / Lights-Out Delivery

### SOTA Systems Doing This

- **BCG Platinion Dark Software Factory** ([bcgplatinion.com/insights/the-dark-software-factory](https://www.bcgplatinion.com/insights/the-dark-software-factory), March 2026 report) — 3-5 engineers merging 650+ PRs/month; Spotify shipped migrations 90% faster; OpenAI built a 1M-line product in 5 months with 3 engineers
- **Two critical disciplines identified:**
  - **Harness Engineering** — designing and refining the factory; feeding information to assembly lines
  - **Intent Thinking** — translating business needs into testable outcome descriptions
- **GitHub Copilot Workspace / Agent Mode** — Issue-to-PR workflow with asynchronous execution; Project Padawan for fully autonomous issue completion

### What Shipwright Has

- ✓ Full 12-stage pipeline (intake → monitor) running autonomously
- ✓ Daemon with auto-scaling (up to 8 workers), worker pool distribution across repos
- ✓ Fleet orchestration (multi-repo; 650+ PRs/month feasible at current throughput)
- ✓ Intent classification in triage and decomposition stages
- ✓ Self-optimization via DORA metrics (lead time, deployment frequency, CFR, MTTR)
- ✗ **Missing:** human intent capture → outcome specification transformation

### Specific Gap

**Intent Thinking capability.** BCG identifies that human effort shifts from code production to intent specification. Shipwright's triage and decompose stages use heuristic scoring but lack a formal **intent translator** that converts business descriptions into testable, machine-verifiable outcome definitions. There is no explicit "outcome specification language" or constraint DSL.

### Actionable Gap

Build an **Intent Specification Engine** that:

1. Parses GitHub issue natural language → structured intent with constraints (latency, cost, safety)
2. Generates acceptance criteria in a machine-verifiable format (e.g., Dafny preconditions, formal spec)
3. Routes to the appropriate agent type based on intent complexity (simple PRs → Aider/Haiku, complex → Claude Code/Opus)

**Impact:** Enables true 3-5 engineer factories; reduces human design time by 40-60%.
**Effort:** High (new DSL, formal spec generation, multi-stage processing).
**Priority Rank:** 2 (strategic, high ROI)
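
Steps 1 and 3 above could be prototyped in shell with `jq`. Everything here is hypothetical: `issue_to_intent`, the field names, and the keyword-based complexity heuristic are placeholders for an LLM-backed parser, not an existing Shipwright format:

```shell
#!/usr/bin/env bash
# issue_to_intent (hypothetical sketch): turn a GitHub issue title plus a
# comma-separated label list into a structured intent record with a model route.
issue_to_intent() {
  local title="$1" labels="$2"
  local complexity="simple"
  # Crude stand-in for real intent classification: keywords imply a larger model.
  case "$title" in
    *refactor*|*migrate*|*architecture*) complexity="complex" ;;
  esac
  jq -n --arg title "$title" --arg labels "$labels" --arg cx "$complexity" '
    {
      intent: $title,
      constraints: ($labels | split(",")),
      complexity: $cx,
      route: (if $cx == "complex" then "claude-opus" else "claude-haiku" end)
    }'
}
```

The real engine would replace the `case` heuristic with a model call, but the output contract (structured intent plus a route) is the part downstream stages would depend on.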

---

## 3. Reinforcement Learning for Code Generation & Policy Learning

### SOTA Systems Doing This

- **FunPRM: Function-as-Step Process Reward Model** ([arxiv.org/abs/2601.22249](https://arxiv.org/abs/2601.22249)) — Treats code functions as PRM steps; meta-reward correction via unit-test feedback
- **SecCoderX** ([arxiv.org/abs/2602.07422](https://arxiv.org/abs/2602.07422)) — Vulnerability reward model + secure code generation via online RL
- **Enhancing Code LLMs with RL Survey** ([arxiv.org/abs/2412.20367](https://arxiv.org/abs/2412.20367)) — PPO as standard post-training; preference data → reward model → policy optimization
- **DeepSeek-R1** ([github.com/deepseek-ai/DeepSeek-R1](https://github.com/deepseek-ai/DeepSeek-R1)) — Pure RL without SFT; Codeforces 2,029 Elo (Candidate Master); 671B model at 37B inference cost via MoE

### What Shipwright Has

- ✓ `sw-reward-aggregator.sh` — Multi-signal reward composition (test pass, coverage, latency, cost)
- ✓ `sw-bandit-selector.sh` — Multi-armed bandit for agent selection based on historical rewards
- ✓ `sw-policy-learner.sh` — Policy gradient learning to improve model routing
- ✓ `sw-rl-optimizer.sh` — Full RL loop with PPO-style optimization
- ✓ `sw-process-reward-test.sh` — Unit tests for process reward model
- ✓ Reward signal captures: test success, coverage, latency, cost, rule violations
- ✗ **Missing:** Formal vulnerability reward model; online RL with vulnerability detection feedback

### Specific Gap

**No vulnerability-aware RL.** Shipwright's reward model optimizes for test pass + coverage, but SOTA systems (SecCoderX) add security-specific signals: detected vulnerabilities, CWE patterns, fuzzing results. Code generated by Shipwright agents is not explicitly hardened against common attack vectors.

Also: **process rewards vs outcome rewards.** Shipwright uses outcome rewards (test pass/fail) but lacks intermediate process rewards that guide reasoning steps within a single solution attempt. FunPRM shows this yields 15-20% better completion rates.

### Actionable Gap

Integrate a **Vulnerability Reward Model (VRM)** that:

1. Runs lightweight security scanning on generated code (SAST, dependency check, CWE patterns)
2. Feeds vulnerability count as a negative reward signal into the RL loop
3. Fine-tunes on secure code examples stored in the memory system

**Impact:** 30-40% reduction in security issues; enables security-hardened pipelines.
**Effort:** Medium (security scanner integration, signal architecture).
**Priority Rank:** 3 (high compliance value)
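
The signal shape of step 2 is simple enough to sketch directly. `vrm_reward` is a hypothetical name, and the per-finding penalty weight is an assumed default, not a tuned value:

```shell
#!/usr/bin/env bash
# vrm_reward (hypothetical sketch): start from the existing outcome reward
# (e.g. test pass rate in [0,1]) and subtract a fixed penalty per finding
# reported by a SAST scan, clamping so insecure code is never rewarded.
vrm_reward() {
  local base_reward="$1" vuln_count="$2" penalty="${3:-0.15}"
  awk -v r="$base_reward" -v v="$vuln_count" -v p="$penalty" '
    BEGIN {
      reward = r - v * p
      if (reward < 0) reward = 0   # clamp at zero
      printf "%.2f\n", reward
    }'
}
```

In the actual loop this value would replace the raw test-pass signal fed to the reward aggregator, so the policy learner sees vulnerabilities as a direct cost.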

---

## 4. Long-Context Agent Memory & Episodic Traces

### SOTA Systems Doing This

- **Mem0** ([https://mem0.ai](https://mem0.ai)) — Mature long-term memory: hybrid storage (Postgres), episodic summaries, continuous updates from interactions
- **EM-LLM: Episodic Memory for Infinite Context** ([arxiv.org/abs/2407.09450](https://arxiv.org/abs/2407.09450)) — Bayesian surprise + graph refinement to segment event boundaries online
- **Memory in the Age of AI Agents: Survey** ([arxiv.org/abs/2512.13564](https://arxiv.org/abs/2512.13564)) — Episodic (specific events), semantic (facts), and working memory layers
- **MemRL: Self-Evolving Agents via Runtime RL on Episodic Memory** (Jan 2026) — Agents improve by learning from stored episode traces
- **Active Context Compression** ([arxiv.org/abs/2601.07190](https://arxiv.org/abs/2601.07190)) — Autonomous consolidation of key learnings into persistent knowledge blocks; raw history pruning

### What Shipwright Has

- ✓ `sw-memory.sh` (2240 lines) — Persistent failure patterns, cross-pipeline learning
- ✓ `~/.claude/agent-memory/` with lessons, patterns, and codebase conventions
- ✓ Memory injection into loop prompts (context window ~1M via Claude Opus)
- ✓ Learned rules and conventions persist across sessions
- ✗ **Missing:** True episodic memory (storing execution traces, not just patterns)
- ✗ **Missing:** Active compression of multi-session histories
- ✗ **Missing:** Semantic memory layer (distilled facts vs raw traces)

### Specific Gap

**Memory is pattern-based, not episode-based.** Shipwright's memory system captures high-level lessons ("when X fails, do Y") but not complete execution traces (what happened, what actions were taken, what results occurred). This prevents agents from doing **case-based reasoning** — learning from similar past episodes to predict future outcomes.

Also: no **active compression.** As the agent runs across days and weeks, memory grows unbounded. SOTA systems consolidate old episodes into semantic facts, freeing context window.

### Actionable Gap

Implement an **Episodic Memory Layer** that stores and retrieves full execution traces:

1. Each pipeline run → episode JSON (inputs, actions, outcomes, duration, cost)
2. Query: "show me 3 similar past episodes" for case-based reasoning
3. Active compression: after every 10 episodes, consolidate into semantic facts
4. Distillation: extract key patterns (e.g., "this error always follows this sequence")

**Impact:** 20-35% faster solution time via case-based analogy; reduced context bloat.
**Effort:** High (episode storage, retrieval, compression, distillation).
**Priority Rank:** 4 (medium-term, unlocks long-horizon learning)
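
Steps 1 and 2 above reduce to a small storage/retrieval pair. The `episodes/` subdirectory, function names, and JSON fields are hypothetical, and the keyword lookup is a naive stand-in for embedding-based similarity search:

```shell
#!/usr/bin/env bash
# Episodic memory sketch (hypothetical layout under the existing agent-memory dir).
EPISODE_DIR="${EPISODE_DIR:-$HOME/.claude/agent-memory/episodes}"

# record_episode TASK OUTCOME COST_USD DURATION_S — persist one trace as JSON.
record_episode() {
  mkdir -p "$EPISODE_DIR"
  jq -n --arg task "$1" --arg outcome "$2" \
        --argjson cost "$3" --argjson dur "$4" \
    '{task: $task, outcome: $outcome, cost_usd: $cost, duration_s: $dur,
      ts: (now | floor)}' > "$EPISODE_DIR/$(date +%s)-$RANDOM.json"
}

# similar_episodes KEYWORD [N] — naive retrieval: match task text, newest first.
similar_episodes() {
  grep -l "$1" "$EPISODE_DIR"/*.json 2>/dev/null | sort -r | head -n "${2:-3}"
}
```

The compression step (consolidating every 10 episodes into semantic facts) would run as a periodic job over the same directory, summarizing and then pruning the raw files.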

---

## 5. Formal Verification & Specification-Driven Pipeline

### SOTA Systems Doing This

- **DafnyPro: LLM-Assisted Automated Verification** (POPL 2026, [popl26.sigplan.org](https://popl26.sigplan.org)) — 86% correct proofs on DafnyBench using Claude Sonnet 3.5
- **ATLAS: Automated Toolkit for Large-Scale Verified Code Synthesis** ([arxiv.org/abs/2512.10173](https://arxiv.org/abs/2512.10173)) — Synthesizes 2.7K verified Dafny programs; 19K training examples; +23% improvement via fine-tuning
- **MiniF2F-Dafny: Mathematical Theorem Proving via Auto-Active Verification** (POPL 2026) — 40.6% on the test set, 44.7% on the validation set via empty proofs
- **Vericoding Benchmark** ([arxiv.org/abs/2509.22908](https://arxiv.org/abs/2509.22908)) — Success rates: 27% Lean, 44% Verus/Rust, 82% Dafny
- **CLEVER: Curated Benchmark for Formally Verified Code Generation** ([arxiv.org/abs/2505.13938](https://arxiv.org/abs/2505.13938))

### What Shipwright Has

- ✓ Test generation and validation (testgen stage)
- ✓ Architecture enforcement via `sw-architecture-enforcer.sh`
- ✓ Quality gates checking for memory safety, bounds, idioms
- ✗ **Missing:** Formal specification language integration (Dafny, Lean, Z3)
- ✗ **Missing:** Automated invariant generation
- ✗ **Missing:** Spec-driven pipeline where agents prove correctness before merge

### Specific Gap

**No formal verification integration.** Shipwright validates code via tests and linting, but SOTA systems (DafnyPro, ATLAS) formally verify correctness properties using theorem provers. For critical code paths (payment, auth, crypto), formal verification catches classes of bugs that tests miss.

### Actionable Gap

Add a **Formal Verification Stage** to the pipeline:

1. For security-critical modules, generate Dafny/Lean specifications from natural-language intent
2. Agent produces proof sketches or hints for the theorem prover
3. Gate merge on proof completion (not just test pass)
4. Cache proofs for reuse across similar functions

**Impact:** 99.99%+ confidence on critical paths (vs 95-97% with tests alone).
**Effort:** Very High (theorem prover integration, spec generation, proof automation).
**Priority Rank:** 5 (high stakes, niche use case — crypto, payments)
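
The merge gate in step 3 only needs the prover's exit-code contract, so it can be sketched without committing to a specific tool. `verification_gate` and the `VERIFY_CMD` hook are assumptions; a concrete deployment would point the hook at the chosen prover's CLI:

```shell
#!/usr/bin/env bash
# verification_gate (hypothetical sketch): block merge unless the configured
# prover command accepts the spec file. Only the exit code is relied on,
# not any particular prover's output format.
verification_gate() {
  local spec_file="$1"
  if ${VERIFY_CMD:-false} "$spec_file" >/dev/null 2>&1; then
    echo "PROOF_OK"       # merge may proceed
    return 0
  fi
  echo "PROOF_FAILED"     # pipeline loops back with prover feedback
  return 1
}
```

Keeping the gate tool-agnostic also makes the proof cache in step 4 straightforward: a successful `PROOF_OK` result can be keyed by the spec file's content hash.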

---

## 6. Test Generation with Mutation Testing & Coverage Optimization

### SOTA Systems Doing This

- **Meta ACH: Automated Compliance Hardening** (2026, [engineering.fb.com](https://engineering.fb.com/2025/02/05/security/)) — LLM-based test generation + LLM-based mutation generation; 9,095 mutants and 571 test cases across 10,795 Android classes
- **MutGen: Mutation-Guided Test Generation** — 89.5% mutation score on HumanEval-Java; outperforms EvoSuite
- **LLM4SoftwareTesting Framework** ([github.com/LLM-Testing/LLM4SoftwareTesting](https://github.com/LLM-Testing/LLM4SoftwareTesting))
- **Mutation-Guided LLM-based Test Generation at Meta** ([arxiv.org/abs/2501.12862](https://arxiv.org/abs/2501.12862))

### What Shipwright Has

- ✓ `sw-testgen.sh` — Autonomous test generation and coverage maintenance
- ✓ Test harness patterns in agent definitions (test-specialist.md)
- ✓ Coverage tracking via pytest/vitest
- ✗ **Missing:** Mutation testing feedback loop
- ✗ **Missing:** LLM-based mutant generation
- ✗ **Missing:** Privacy-hardening mutation targets

### Specific Gap

**No mutation testing.** Shipwright generates tests but doesn't validate test quality via mutation. Meta's finding: 45% of LLM-generated tests are ineffective at catching mutations. Without mutation feedback, test coverage numbers are inflated.

Also: **no privacy-hardening mutants.** Meta's approach generates mutants that simulate privacy attacks (e.g., data leakage patterns), then hardens tests to detect them. Shipwright's testgen is functional-only.

### Actionable Gap

Integrate a **Mutation Testing Loop**:

1. Generate tests via the testgen stage (current)
2. Run mutations (e.g., Major, PIT) on the generated code
3. Score tests by mutation score (% of mutants killed)
4. If the score is below threshold, regenerate tests with mutation feedback
5. Store effective test patterns in memory for reuse

**Impact:** 30-40% better test effectiveness; catches subtle bugs.
**Effort:** Medium (mutation tool integration, feedback loop).
**Priority Rank:** 6 (medium priority, quality improvement)
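
Steps 3 and 4 form a gate that can be sketched independently of the mutation tool. `mutation_gate` and the 0.80 threshold are assumptions; killed/total counts would come from the chosen tool's report:

```shell
#!/usr/bin/env bash
# mutation_gate (hypothetical sketch): compute killed/total and signal via
# exit code whether the generated tests clear the bar or need regeneration.
mutation_gate() {
  local killed="$1" total="$2" threshold="${3:-0.80}"
  awk -v k="$killed" -v t="$total" -v th="$threshold" '
    BEGIN {
      score = (t > 0) ? k / t : 0
      printf "mutation_score=%.2f\n", score
      if (score >= th) exit 0
      exit 1                        # nonzero exit: regenerate with feedback
    }'
}
```

A nonzero exit would route the surviving-mutant list back into the testgen prompt, which is where the 30-40% effectiveness gain in Meta's results comes from.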

---

## 7. Cost-Optimized Model Routing & Cascade/Speculative Decoding
|
|
249
|
+
|
|
250
|
+
### SOTA Systems Doing This
|
|
251
|
+
|
|
252
|
+
- **Google Speculative Cascades** (Google Research 2026, [research.google/blog](https://research.google/blog/speculative-cascades-a-hybrid-approach-for-smarter-faster-llm-inference/)) — Hybrid routing + cascading; 30-60% cost reduction with 92% cost savings on benchmarks
|
|
253
|
+
- **Unified Cascade Routing Framework** ([arxiv.org/abs/2410.10347](https://arxiv.org/abs/2410.10347)) — Theoretically optimal integration of routing + cascading
|
|
254
|
+
- **CoSine: Adaptive Clustering-Based Routing** — 23% latency reduction, 32% throughput increase
|
|
255
|
+
- **Smurfs: Adaptive Speculative Decoding** — Dynamic speculation length optimization
|
|
256
|
+
- **Model Routing in Code Generation** — Haiku for simple fixes, Sonnet for medium, Opus for complex reasoning
### What Shipwright Has

- ✓ `sw-model-router.sh` — Intelligent model routing by task type
- ✓ `sw-cost-aware` pipeline template with cost gates
- ✓ Budget enforcement and cost tracking
- ✓ Adaptive timeouts based on DORA metrics
- ✓ Per-stage effort level (low/medium/high)
- ✗ **Missing:** Speculative cascading (try Haiku, escalate to Sonnet on failure)
- ✗ **Missing:** Semantic query clustering for routing decisions
- ✗ **Missing:** Adaptive token budgets per query type

### Specific Gap

**No speculative cascade.** Shipwright picks a single model per stage upfront, with no re-evaluation mid-execution. SOTA systems try small (Haiku) first and cascade to larger (Sonnet → Opus) only if the small model fails. This saves 60% of cost on simple tasks.
### Actionable Gap

Implement **Speculative Cascade Routing**:

1. Classify query difficulty (via embeddings)
2. Route to Haiku-class model with short timeout (e.g., 30s)
3. If timeout/failure, immediately cascade to Sonnet with larger context
4. Cascade again to Opus if Sonnet fails
5. Track success rates per difficulty tier → inform future routing

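The cascade can be sketched as follows; `run_stage` is a hypothetical stand-in for invoking a model with a timeout, and the tier table is illustrative:

```python
# (model tier, timeout in seconds): cheapest first; values are illustrative
TIERS = [("haiku", 30), ("sonnet", 120), ("opus", 600)]

def speculative_cascade(run_stage, stats: dict):
    """Try tiers cheapest-first; record per-tier outcomes to inform routing."""
    for model, timeout in TIERS:
        tier = stats.setdefault(model, {"tries": 0, "wins": 0})
        tier["tries"] += 1
        if run_stage(model, timeout):  # True = stage succeeded in time
            tier["wins"] += 1          # step 5: feeds future routing decisions
            return model
    return None  # all tiers failed; surface the failure to the pipeline

stats = {}
# Example: a task only the mid tier (or better) can handle
winner = speculative_cascade(lambda model, timeout: model != "haiku", stats)
# haiku is tried and fails; sonnet succeeds, so winner == "sonnet"
```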
**Impact:** 40-60% cost reduction on median tasks; same quality on hard tasks.
**Effort:** Medium (timeout management, cascade state, monitoring).
**Priority Rank:** 7 (high-leverage, near-term ROI)

---

## 8. Self-Healing CI/CD & AIOps Pipeline Repair

### SOTA Systems Doing This

- **Agentic SRE Pattern** (2026, [unite.ai](https://www.unite.ai/agentic-sre-how-self-healing-infrastructure-is-redefining-enterprise-aiops-in-2026/)) — Telemetry → reasoning → controlled automation closed loop
- **Pipeline Doctor / Interceptor Pattern** — When a build fails, a specialized "Repair Agent" reads logs, analyzes errors, and commits fixes
- **LLM-as-a-Judge** (standard 2026 pattern) — Secondary model evaluates primary agent output; triggers repair if needed
- **60% enterprise adoption of self-healing infrastructure** (Gartner 2026)
- **67% drop in MTTR** with AIOps; 40-60% reduction in high-performing orgs
### What Shipwright Has

- ✓ `sw-stall-detector.sh` — Pipeline stall detection
- ✓ Retry logic with escalation (--max-restarts)
- ✓ Error classification and pattern matching
- ✓ Session restart with progress briefing
- ✓ CI integration (GitHub Actions dispatch, patrol)
- ✗ **Missing:** Automated repair of CI failures (flaky tests, race conditions, timeouts)
- ✗ **Missing:** LLM-as-a-Judge validation before merge
- ✗ **Missing:** Log anomaly detection + predictive repair

### Specific Gap

**No automated CI repair.** When GitHub Actions fails (flaky test, timeout, network error), Shipwright retries but doesn't diagnose or fix the root cause. SOTA systems spawn a "Repair Agent" that reads logs, identifies the pattern (e.g., "test flakes due to timing"), and commits a fix (e.g., add a sleep, increase a timeout).

Also: **No LLM-as-a-Judge.** Shipwright's quality gates are rule-based (coverage > X%, no ASan errors). SOTA adds a secondary LLM to evaluate "is this code actually good?" — catching issues rules miss.

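A minimal sketch of such a gate, layered on an existing rule-based check; `ask_judge` is a hypothetical wrapper around a secondary model call, stubbed here so the control flow is concrete:

```python
def ask_judge(diff: str) -> dict:
    # Stub: a real implementation would prompt a secondary model for a
    # verdict and rationale on the proposed change.
    if "TODO" in diff:
        return {"verdict": "fail", "reason": "unfinished work left in diff"}
    return {"verdict": "pass", "reason": "ok"}

def merge_gate(diff: str, coverage: float, min_coverage: float = 0.8) -> bool:
    """Rule-based gate first, then the LLM judge's holistic verdict."""
    if coverage < min_coverage:
        return False
    return ask_judge(diff)["verdict"] == "pass"

merge_gate("fix: handle nil response", coverage=0.9)  # True
merge_gate("TODO: finish error path", coverage=0.9)   # False
```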
### Actionable Gap

Add **CI Repair Agent** stage:

1. When test/check fails: parse error logs
2. Classify failure (timeout, race condition, assertion, resource, flaky)
3. Spawn repair agent with failure context
4. Agent proposes fix (increase timeout, add synchronization, skip flaky test, etc.)
5. Re-run test; if passes, commit repair
6. Track effective repairs in memory for reuse

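Step 2 could begin as plain log-pattern matching before any model is involved; the patterns and labels below are illustrative, not Shipwright's actual rules:

```python
import re

# Ordered (label, regex) pairs; first match wins. Patterns are illustrative.
FAILURE_PATTERNS = [
    ("timeout", r"timed out|deadline exceeded"),
    ("race_condition", r"data race|race detected|concurrent map"),
    ("resource", r"out of memory|no space left|ENOSPC"),
    ("flaky", r"flaky|passed on retry"),
    ("assertion", r"assert(ion)?\s*(failed|error)"),
]

def classify_failure(log: str) -> str:
    """Map a raw CI log to a failure class for the repair agent's context."""
    for label, pattern in FAILURE_PATTERNS:
        if re.search(pattern, log, re.IGNORECASE):
            return label
    return "unknown"

classify_failure("ERROR: test timed out after 300s")   # "timeout"
classify_failure("WARNING: DATA RACE in worker pool")  # "race_condition"
```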
**Impact:** 50% reduction in retry cycles; faster time-to-merge.
**Effort:** High (log parsing, classification, repair proposals).
**Priority Rank:** 8 (medium-term, high quality impact)

---

## 9. Multi-Agent Orchestration & Coordination Patterns

### SOTA Systems Doing This

- **2026 Multi-Agent Trends** (40% of enterprise apps will have agents by 2026, up from <5% in 2025)
- **Standard 3-Role Pattern:** Planner (explore codebase, create tasks), Worker (execute without coordination), Judge (decide continue/stop)
- **Git Worktree Isolation** — Multiple agents work simultaneously without conflicts (now standard)
- **MetaGPT / CrewAI / LangGraph / AutoGen** — Four dominant frameworks; each converges on a similar architecture
- **Role Specialization:** Builders, Reviewers, Testers, Optimizers (Google 2025 DORA study: 20-30% faster workflows, but a 9% climb in bug rates)
### What Shipwright Has

- ✓ Multi-agent fleet with specialized agents (builder, reviewer, tester, optimizer)
- ✓ Distributed task list coordination via TaskCreate/TaskUpdate
- ✓ Worktree isolation per agent (`--worktree`)
- ✓ Idle state detection and wait-for-work patterns
- ✓ Cross-agent message delivery (SendMessage)
- ✓ Role specialization via agent definitions
- ✗ **Missing:** Explicit conflict resolution for competing agent changes
- ✗ **Missing:** Real-time dependency tracking (Agent A blocks Agent B)
- ✗ **Missing:** Quorum-based merge decisions across reviewers

### Specific Gap

**No explicit conflict detection for concurrent changes.** Shipwright uses worktrees to isolate agents, but if two agents modify the same file, the merge can fail silently. There is no explicit conflict-detection and resolution protocol.

Also: **No dependency-aware scheduling.** If Agent A (API changes) must complete before Agent B (client changes), Shipwright relies on manual task ordering. SOTA systems use DAG-based task scheduling.
### Actionable Gap

Implement **Explicit Conflict Resolution** and **Dependency-Aware Scheduling**:

1. Track file-level locks per agent
2. Detect read-write conflicts before merging worktrees
3. Build DAG of task dependencies (task X blocks task Y)
4. Schedule agents respecting DAG (don't start Y until X complete)
5. On merge conflict: spawn conflict-resolver agent to rebase/merge intelligently

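Steps 1-4 can be sketched with the standard library's `graphlib`: tasks run in dependency order, and two tasks touching the same file never share a wave. Task names, dependencies, and file sets are illustrative:

```python
from graphlib import TopologicalSorter

tasks = {
    "api_change":    {"deps": [],             "files": {"src/api.ts"}},
    "client_change": {"deps": ["api_change"], "files": {"src/client.ts"}},
    "docs":          {"deps": [],             "files": {"README.md"}},
}

def schedule(task_table: dict) -> list[list[str]]:
    """Group tasks into waves: dependency-safe and free of file conflicts."""
    ts = TopologicalSorter({name: t["deps"] for name, t in task_table.items()})
    ts.prepare()
    waves, pending = [], []
    while ts.is_active():
        pending += list(ts.get_ready())
        wave, used_files = [], set()
        for name in list(pending):
            if task_table[name]["files"] & used_files:
                continue  # same-file conflict: defer this task to a later wave
            wave.append(name)
            used_files |= task_table[name]["files"]
        for name in wave:
            pending.remove(name)
        waves.append(sorted(wave))
        ts.done(*wave)
    return waves

schedule(tasks)
# → [['api_change', 'docs'], ['client_change']]
```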
**Impact:** Eliminates silent merge failures; enables more aggressive parallelism.
**Effort:** Medium (file tracking, DAG scheduler, conflict resolver).
**Priority Rank:** 9 (medium priority, prevents errors)

---

## 10. Reasoning-First Code Generation with Extended/Adaptive Thinking

### SOTA Systems Doing This

- **Claude Opus 4.6 / Sonnet 4.6 Adaptive Thinking** (Anthropic 2026) — Dynamically decides when and how much to think; replaces extended thinking
- **OpenAI o1-pro** ([openai.com/index/learning-to-reason-with-llms](https://openai.com/index/learning-to-reason-with-llms)) — 200K context window, 100K output tokens, $150/$600 pricing; ranks in the 89th percentile on Codeforces
- **DeepSeek-R1** ([github.com/deepseek-ai/DeepSeek-R1](https://github.com/deepseek-ai/DeepSeek-R1)) — Pure RL-based reasoning; 2,029 Codeforces Elo; 671B model at 37B cost via MoE
- **Claude Mythos (unreleased)** — Next Anthropic model; recursive self-correction without intermediate human input
- **Reasoning faithfulness research** (Anthropic, Alignment Science) — Even with thinking enabled, models mention the hints they used only 25% of the time; chain-of-thought reasoning may not be faithful
### What Shipwright Has

- ✓ `--effort high` routing to Opus for complex stages
- ✓ Extended thinking support (currently built in to Claude Opus)
- ✓ Adaptive thinking (via the Claude SDK, auto-enabled)
- ✓ Per-stage effort configuration
- ✓ Fallback models for overload
- ✗ **Missing:** Explicit reasoning budget allocation per query type
- ✗ **Missing:** Interleaved reasoning + tool calls (think → observe → think cycle)
- ✗ **Missing:** o1-pro / DeepSeek-R1 support (closed APIs)

### Specific Gap

**Reasoning allocation is coarse-grained.** Shipwright's `--effort high` tells Claude "think hard," but gets no feedback on whether thinking actually helped. SOTA systems track thinking effectiveness (e.g., "does thinking improve accuracy from X% to Y%?") and allocate thinking dynamically per query.

Also: **No interleaved reasoning.** Shipwright asks Claude to think, then calls tools. SOTA systems let reasoning happen mid-tool-sequence: think → read file → think → call API → think. This is harder to implement but yields better results on multi-step problems.
### Actionable Gap

Implement **Intelligent Reasoning Budget Allocation**:

1. Track reasoning cost vs outcome quality for each task type
2. For new task: estimate complexity → allocate thinking budget
3. If task fails: increase thinking budget on retry
4. Build lookup table: (task_type, complexity) → thinking_tokens
5. Interleave reasoning and tool calls for multi-step tasks (requires SDK support)

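Steps 2-4 reduce to a lookup table with retry escalation; the task types and token counts below are illustrative defaults, not documented values:

```python
# (task_type, complexity) → thinking-token budget; numbers are illustrative
BUDGETS = {
    ("bugfix", "low"): 0,          # skip thinking entirely
    ("bugfix", "high"): 8_000,
    ("refactor", "low"): 2_000,
    ("refactor", "high"): 16_000,
}

def thinking_budget(task_type: str, complexity: str, attempt: int = 1) -> int:
    """Allocate a thinking budget, doubling it after each failed attempt."""
    base = BUDGETS.get((task_type, complexity), 4_000)  # default middle tier
    if base == 0:
        # Tasks that normally skip thinking earn a budget only after failing.
        return 2_000 * (attempt - 1)
    return base * 2 ** (attempt - 1)

thinking_budget("refactor", "high")             # 16000 on the first attempt
thinking_budget("refactor", "high", attempt=2)  # 32000 after one failure
```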
**Impact:** 15-25% better success on hard tasks; cheaper on easy tasks.
**Effort:** Medium (tracking, learning, budget logic).
**Priority Rank:** 10 (quality improvement, medium effort)

---

## Shipwright: What You Already Have (Strengths to Preserve)

This research confirms Shipwright's strong foundation:

1. **RL Architecture** — Multi-signal rewards, bandit selection, policy learning (sw-rl-optimizer.sh, sw-policy-learner.sh)
2. **Pipeline Orchestration** — 12-stage flow with quality gates, evidence capture, artifact management
3. **Multi-Agent Coordination** — Fleet support, task list coordination, idle detection, role specialization
4. **Cost Intelligence** — Budget tracking, model routing, DORA metrics, cost-per-issue
5. **Memory System** — Cross-session learning, failure patterns, codebase conventions
6. **CI Integration** — GitHub Actions, webhook receiver, Checks API, Deployments API
7. **Daemon & Auto-Scaling** — Worker pool, load balancing, adaptive configuration
8. **Testing & Evidence** — 121+ test suites, evidence capture system, pre-PR validation

**These are differentiated. Build on them, don't replace them.**

---

## 20-Item Backlog: Ranked by Impact/Effort Ratio

| Rank | Feature | Impact | Effort | ROI | Category |
| ---- | ------- | ------ | ------ | --- | -------- |
| 1 | Semantic trajectory analysis + convergence detection (geometric loop regimes) | 30% iteration waste reduction | Medium | **High** | Loop Patterns |
| 2 | Intent Specification Engine (business → testable outcomes) | 40-60% design time; 3-5 person factories | High | **Exceptional** | Dark Factory |
| 3 | Vulnerability Reward Model + online RL hardening | 30-40% security issue reduction | Medium | **High** | RL/Security |
| 4 | Episodic Memory Layer (execution traces, case-based reasoning) | 20-35% faster solutions via analogy | High | **Medium** | Memory |
| 5 | Speculative Cascade Model Routing (Haiku → Sonnet → Opus) | 40-60% cost reduction on median tasks | Medium | **Very High** | Cost Optimization |
| 6 | Mutation Testing Feedback Loop (validate test effectiveness) | 30-40% better test quality | Medium | **High** | Testing |
| 7 | CI Repair Agent (automatic fix for flaky tests, timeouts) | 50% fewer retries; faster merge | High | **High** | Self-Healing |
| 8 | LLM-as-a-Judge validation stage (secondary reviewer) | 10-15% fewer merge regressions | Medium | **Medium** | Quality |
| 9 | Explicit File Conflict Detection + DAG Scheduling | Prevents merge failures; enables parallelism | Medium | **Medium** | Multi-Agent |
| 10 | Intelligent Reasoning Budget Allocation | 15-25% harder-task success; cheaper easy tasks | Medium | **Medium** | Reasoning |
| 11 | Formal Verification Integration (Dafny/Lean stage) | 99.99% confidence on critical code | Very High | **Medium** (niche) | Verification |
| 12 | Active Context Compression + Semantic Memory Layer | Unbounded context bloat fixed; 30% better compression | High | **Medium** | Memory |
| 13 | Multi-Pass Mutation Generation (LLM-based mutants) | Diversified test coverage; Meta-style compliance | High | **Medium** | Testing |
| 14 | Anomaly Detection + Predictive Repair (log analysis) | Earlier failure prevention; MTTR ↓ 40% | High | **Medium** | Self-Healing |
| 15 | Cross-Repo Fleet Learning (pattern sharing across repos) | 20% faster on new repo types | High | **Medium** | Memory/Fleet |
| 16 | Quorum-Based Merge Decisions (multiple reviewers) | 5-10% fewer bugs; more confident merges | Medium | **Low** | Multi-Agent |
| 17 | Privacy-Hardening Mutations (Meta ACH-style) | Compliance + security in test suite | High | **Medium** | Testing/Security |
| 18 | Dependency-Aware Task Scheduling (DAG executor) | Smarter agent coordination; prevents deadlocks | Medium | **Low** | Multi-Agent |
| 19 | Symbol Caching + Semantic Search (fast repo understanding) | 20-30% faster codebase navigation | Medium | **Low** | Performance |
| 20 | WebSocket Real-Time Loop Monitoring (dashboard streaming) | Live visibility into agentic loops | Medium | **Low** | Observability |

---

## Implementation Roadmap (Next 12 Weeks)

### Phase 1: Convergence & Cost (Weeks 1-4)

- ✅ **Semantic trajectory analysis** (backlog #1) → faster early exit
- ✅ **Speculative cascade routing** (backlog #5) → 40-60% cost reduction
- Start Intent Specification Engine (backlog #2) — research phase

### Phase 2: Security & Testing (Weeks 5-8)

- ✅ **Vulnerability Reward Model** (backlog #3) → security-aware RL
- ✅ **Mutation Testing Loop** (backlog #6) → validate test quality
- ✅ **Multi-Pass Mutation Generation** (backlog #13)

### Phase 3: Memory & Self-Healing (Weeks 9-12)

- ✅ **Episodic Memory Layer** (backlog #4) → case-based reasoning
- ✅ **CI Repair Agent** (backlog #7) → automatic fix generation
- ✅ **LLM-as-a-Judge** (backlog #8) → secondary validation

---

## Key Research Sources

### Benchmarks & Standards

- [SWE-bench](https://www.vals.ai/benchmarks/swebench) — 500+ real GitHub issues
- [SWE-bench Pro](https://scale.com/blog/swe-bench-pro) — 1,865 tasks (recommended)
- [Codeforces Rating](https://codeforces.com/) — Competitive programming (DeepSeek-R1: 2,029 Elo)
- [AIME Math Benchmark](https://www.maa.org/math-competitions/american-invitational-mathematics-examination) — o1-pro 86% vs o1 78%

### Models

- [Claude Opus 4.6](https://platform.claude.com) — Adaptive thinking, 1M context
- [OpenAI o1-pro](https://openai.com/index/introducing-openai-o1-preview/) — 200K context, 89th percentile on Codeforces
- [DeepSeek-R1](https://github.com/deepseek-ai/DeepSeek-R1) — 671B @ 37B cost; RL-first approach

### Key Papers

- [SWE-agent NeurIPS 2024](https://arxiv.org/abs/2405.15793)
- [Geometric Dynamics of Agentic Loops](https://arxiv.org/abs/2512.10350)
- [DafnyPro POPL 2026](https://popl26.sigplan.org)
- [FunPRM: Function-as-Step Process Reward](https://arxiv.org/abs/2601.22249)
- [DeepSeek-R1 RL Architecture](https://arxiv.org/abs/2501.12948)
- [Active Context Compression](https://arxiv.org/abs/2601.07190)

### Industry Reports

- [BCG Platinion Dark Software Factory](https://www.bcgplatinion.com/insights/the-dark-software-factory) (March 2026)
- [Anthropic 2026 Agentic Coding Trends](https://resources.anthropic.com/hubfs/2026%20Agentic%20Coding%20Trends%20Report.pdf)
- [GitHub Copilot Workspace → Agent Mode](https://github.com/newsroom/press-releases/agent-mode)
- [Meta Mutation Testing at Scale](https://engineering.fb.com/2025/02/05/security/)

---

## Competitive Positioning

| Dimension | Shipwright | SWE-agent | GitHub Copilot | Aider |
| --------- | ---------- | --------- | -------------- | ----- |
| **SOTA Benchmark** | (not submitted) | 40.6% SWE-bench | ~55% SWE-bench | 49.2% SWE-bench Verified |
| **Multi-Agent** | ✅ Fleet, 5+ agents | ❌ Single agent | ✅ Agent Mode (2025+) | ❌ Single agent |
| **Self-Improving RL** | ✅ Reward aggregation, policy learning | ❌ | ❌ | ❌ |
| **Cost Optimization** | ✅ Model routing, budget | ❌ | ✅ Cascade (partial) | ✅ Token-efficient diffing |
| **Memory Across Sessions** | ✅ Pattern-based | ❌ | ❌ | ❌ |
| **Pipeline Stages** | ✅ 12-stage with gates | ❌ (single-pass) | ✅ Issue-to-PR | ❌ (editing only) |
| **Dark Factory Ready** | ⚠️ 80% there (needs Intent Engine) | ❌ | ✅ (Project Padawan) | ❌ |

---

## Conclusion

Shipwright is positioned as a **platform-grade autonomous software factory** — the right abstraction level between human intent and code. The next wave of differentiation comes from:

1. **Predictive intelligence** (convergence detection, loop regimes) → cost & time reduction
2. **Learning across episodes** (episodic memory) → faster on similar problems
3. **Formal guarantees** (verification, formal specs) → safety/compliance for critical code
4. **Self-healing** (CI repair, automated fixes) → resilience and uptime

The 20-item backlog reflects industry momentum (BCG Dark Factories, DeepSeek-R1, DafnyPro POPL, Meta mutation testing) and fills Shipwright's remaining gaps. Implementation order prioritizes highest ROI (cost, learning, quality).

---

**Generated:** April 4, 2026 | **Research Effort:** Deep dives across 20+ sources (papers, blogs, GitHub, industry reports)