shipwright-cli 3.1.0 → 3.3.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/.claude/agents/code-reviewer.md +2 -0
- package/.claude/agents/devops-engineer.md +2 -0
- package/.claude/agents/doc-fleet-agent.md +2 -0
- package/.claude/agents/pipeline-agent.md +2 -0
- package/.claude/agents/shell-script-specialist.md +2 -0
- package/.claude/agents/test-specialist.md +2 -0
- package/.claude/hooks/agent-crash-capture.sh +32 -0
- package/.claude/hooks/post-tool-use.sh +3 -2
- package/.claude/hooks/pre-tool-use.sh +35 -3
- package/README.md +22 -8
- package/claude-code/hooks/config-change.sh +18 -0
- package/claude-code/hooks/instructions-reloaded.sh +7 -0
- package/claude-code/hooks/worktree-create.sh +25 -0
- package/claude-code/hooks/worktree-remove.sh +20 -0
- package/config/code-constitution.json +130 -0
- package/config/defaults.json +25 -2
- package/config/policy.json +1 -1
- package/dashboard/middleware/auth.ts +134 -0
- package/dashboard/middleware/constants.ts +21 -0
- package/dashboard/public/index.html +8 -6
- package/dashboard/public/styles.css +176 -97
- package/dashboard/routes/auth.ts +38 -0
- package/dashboard/server.ts +117 -25
- package/dashboard/services/config.ts +26 -0
- package/dashboard/services/db.ts +118 -0
- package/dashboard/src/canvas/pixel-agent.ts +298 -0
- package/dashboard/src/canvas/pixel-sprites.ts +440 -0
- package/dashboard/src/canvas/shipyard-effects.ts +367 -0
- package/dashboard/src/canvas/shipyard-scene.ts +616 -0
- package/dashboard/src/canvas/submarine-layout.ts +267 -0
- package/dashboard/src/components/header.ts +8 -7
- package/dashboard/src/core/api.ts +5 -0
- package/dashboard/src/core/router.ts +1 -0
- package/dashboard/src/design/submarine-theme.ts +253 -0
- package/dashboard/src/main.ts +2 -0
- package/dashboard/src/types/api.ts +12 -1
- package/dashboard/src/views/activity.ts +2 -1
- package/dashboard/src/views/metrics.ts +69 -1
- package/dashboard/src/views/shipyard.ts +39 -0
- package/dashboard/types/index.ts +166 -0
- package/docs/plans/2026-02-28-compound-audit-and-shipyard-design.md +186 -0
- package/docs/plans/2026-02-28-skipper-shipwright-implementation-plan.md +1182 -0
- package/docs/plans/2026-02-28-skipper-shipwright-integration-design.md +531 -0
- package/docs/plans/2026-03-01-ai-powered-skill-injection-design.md +298 -0
- package/docs/plans/2026-03-01-ai-powered-skill-injection-plan.md +1109 -0
- package/docs/plans/2026-03-01-capabilities-cleanup-plan.md +658 -0
- package/docs/plans/2026-03-01-clean-architecture-plan.md +924 -0
- package/docs/plans/2026-03-01-compound-audit-cascade-design.md +191 -0
- package/docs/plans/2026-03-01-compound-audit-cascade-plan.md +921 -0
- package/docs/plans/2026-03-01-deep-integration-plan.md +851 -0
- package/docs/plans/2026-03-01-pipeline-audit-trail-design.md +145 -0
- package/docs/plans/2026-03-01-pipeline-audit-trail-plan.md +770 -0
- package/docs/plans/2026-03-01-refined-depths-brand-design.md +382 -0
- package/docs/plans/2026-03-01-refined-depths-implementation.md +599 -0
- package/docs/plans/2026-03-01-skipper-kernel-integration-design.md +203 -0
- package/docs/plans/2026-03-01-unified-platform-design.md +272 -0
- package/docs/plans/2026-03-07-claude-code-feature-integration-design.md +189 -0
- package/docs/plans/2026-03-07-claude-code-feature-integration-plan.md +1165 -0
- package/docs/research/BACKLOG_QUICK_REFERENCE.md +352 -0
- package/docs/research/CUTTING_EDGE_RESEARCH_2026.md +546 -0
- package/docs/research/RESEARCH_INDEX.md +439 -0
- package/docs/research/RESEARCH_SOURCES.md +440 -0
- package/docs/research/RESEARCH_SUMMARY.txt +275 -0
- package/docs/superpowers/specs/2026-03-10-pipeline-quality-revolution-design.md +341 -0
- package/package.json +2 -2
- package/scripts/lib/adaptive-model.sh +427 -0
- package/scripts/lib/adaptive-timeout.sh +316 -0
- package/scripts/lib/audit-trail.sh +309 -0
- package/scripts/lib/auto-recovery.sh +471 -0
- package/scripts/lib/bandit-selector.sh +431 -0
- package/scripts/lib/bootstrap.sh +104 -2
- package/scripts/lib/causal-graph.sh +455 -0
- package/scripts/lib/compat.sh +126 -0
- package/scripts/lib/compound-audit.sh +337 -0
- package/scripts/lib/constitutional.sh +454 -0
- package/scripts/lib/context-budget.sh +359 -0
- package/scripts/lib/convergence.sh +594 -0
- package/scripts/lib/cost-optimizer.sh +634 -0
- package/scripts/lib/daemon-adaptive.sh +14 -2
- package/scripts/lib/daemon-dispatch.sh +106 -17
- package/scripts/lib/daemon-failure.sh +34 -4
- package/scripts/lib/daemon-patrol.sh +25 -4
- package/scripts/lib/daemon-poll-github.sh +361 -0
- package/scripts/lib/daemon-poll-health.sh +299 -0
- package/scripts/lib/daemon-poll.sh +27 -611
- package/scripts/lib/daemon-state.sh +119 -66
- package/scripts/lib/daemon-triage.sh +10 -0
- package/scripts/lib/dod-scorecard.sh +442 -0
- package/scripts/lib/error-actionability.sh +300 -0
- package/scripts/lib/formal-spec.sh +461 -0
- package/scripts/lib/helpers.sh +180 -5
- package/scripts/lib/intent-analysis.sh +409 -0
- package/scripts/lib/loop-convergence.sh +350 -0
- package/scripts/lib/loop-iteration.sh +682 -0
- package/scripts/lib/loop-progress.sh +48 -0
- package/scripts/lib/loop-restart.sh +185 -0
- package/scripts/lib/memory-effectiveness.sh +506 -0
- package/scripts/lib/mutation-executor.sh +352 -0
- package/scripts/lib/outcome-feedback.sh +521 -0
- package/scripts/lib/pipeline-cli.sh +336 -0
- package/scripts/lib/pipeline-commands.sh +1216 -0
- package/scripts/lib/pipeline-detection.sh +101 -3
- package/scripts/lib/pipeline-execution.sh +897 -0
- package/scripts/lib/pipeline-github.sh +28 -3
- package/scripts/lib/pipeline-intelligence-compound.sh +431 -0
- package/scripts/lib/pipeline-intelligence-scoring.sh +407 -0
- package/scripts/lib/pipeline-intelligence-skip.sh +181 -0
- package/scripts/lib/pipeline-intelligence.sh +104 -1138
- package/scripts/lib/pipeline-quality-bash-compat.sh +182 -0
- package/scripts/lib/pipeline-quality-checks.sh +17 -711
- package/scripts/lib/pipeline-quality-gates.sh +563 -0
- package/scripts/lib/pipeline-stages-build.sh +730 -0
- package/scripts/lib/pipeline-stages-delivery.sh +965 -0
- package/scripts/lib/pipeline-stages-intake.sh +1133 -0
- package/scripts/lib/pipeline-stages-monitor.sh +407 -0
- package/scripts/lib/pipeline-stages-review.sh +1022 -0
- package/scripts/lib/pipeline-stages.sh +161 -2901
- package/scripts/lib/pipeline-state.sh +36 -5
- package/scripts/lib/pipeline-util.sh +487 -0
- package/scripts/lib/policy-learner.sh +438 -0
- package/scripts/lib/process-reward.sh +493 -0
- package/scripts/lib/project-detect.sh +649 -0
- package/scripts/lib/quality-profile.sh +334 -0
- package/scripts/lib/recruit-commands.sh +885 -0
- package/scripts/lib/recruit-learning.sh +739 -0
- package/scripts/lib/recruit-roles.sh +648 -0
- package/scripts/lib/reward-aggregator.sh +458 -0
- package/scripts/lib/rl-optimizer.sh +362 -0
- package/scripts/lib/root-cause.sh +427 -0
- package/scripts/lib/scope-enforcement.sh +445 -0
- package/scripts/lib/session-restart.sh +493 -0
- package/scripts/lib/skill-memory.sh +300 -0
- package/scripts/lib/skill-registry.sh +775 -0
- package/scripts/lib/spec-driven.sh +476 -0
- package/scripts/lib/test-helpers.sh +18 -7
- package/scripts/lib/test-holdout.sh +429 -0
- package/scripts/lib/test-optimizer.sh +511 -0
- package/scripts/shipwright-file-suggest.sh +45 -0
- package/scripts/skills/adversarial-quality.md +61 -0
- package/scripts/skills/api-design.md +44 -0
- package/scripts/skills/architecture-design.md +50 -0
- package/scripts/skills/brainstorming.md +43 -0
- package/scripts/skills/data-pipeline.md +44 -0
- package/scripts/skills/deploy-safety.md +64 -0
- package/scripts/skills/documentation.md +38 -0
- package/scripts/skills/frontend-design.md +45 -0
- package/scripts/skills/generated/.gitkeep +0 -0
- package/scripts/skills/generated/_refinements/.gitkeep +0 -0
- package/scripts/skills/generated/_refinements/adversarial-quality.patch.md +3 -0
- package/scripts/skills/generated/_refinements/architecture-design.patch.md +3 -0
- package/scripts/skills/generated/_refinements/brainstorming.patch.md +3 -0
- package/scripts/skills/generated/cli-version-management.md +29 -0
- package/scripts/skills/generated/collection-system-validation.md +99 -0
- package/scripts/skills/generated/large-scale-c-refactoring-coordination.md +97 -0
- package/scripts/skills/generated/pattern-matching-similarity-scoring.md +195 -0
- package/scripts/skills/generated/test-parallelization-detection.md +65 -0
- package/scripts/skills/observability.md +79 -0
- package/scripts/skills/performance.md +48 -0
- package/scripts/skills/pr-quality.md +49 -0
- package/scripts/skills/product-thinking.md +43 -0
- package/scripts/skills/security-audit.md +49 -0
- package/scripts/skills/systematic-debugging.md +40 -0
- package/scripts/skills/testing-strategy.md +47 -0
- package/scripts/skills/two-stage-review.md +52 -0
- package/scripts/skills/validation-thoroughness.md +55 -0
- package/scripts/sw +9 -3
- package/scripts/sw-activity.sh +9 -8
- package/scripts/sw-adaptive.sh +8 -7
- package/scripts/sw-adversarial.sh +2 -1
- package/scripts/sw-architecture-enforcer.sh +3 -1
- package/scripts/sw-auth.sh +12 -2
- package/scripts/sw-autonomous.sh +5 -1
- package/scripts/sw-changelog.sh +4 -1
- package/scripts/sw-checkpoint.sh +2 -1
- package/scripts/sw-ci.sh +15 -6
- package/scripts/sw-cleanup.sh +4 -26
- package/scripts/sw-code-review.sh +45 -20
- package/scripts/sw-connect.sh +2 -1
- package/scripts/sw-context.sh +2 -1
- package/scripts/sw-cost.sh +107 -5
- package/scripts/sw-daemon.sh +71 -11
- package/scripts/sw-dashboard.sh +3 -1
- package/scripts/sw-db.sh +71 -20
- package/scripts/sw-decide.sh +8 -2
- package/scripts/sw-decompose.sh +360 -17
- package/scripts/sw-deps.sh +4 -1
- package/scripts/sw-developer-simulation.sh +4 -1
- package/scripts/sw-discovery.sh +378 -5
- package/scripts/sw-doc-fleet.sh +4 -1
- package/scripts/sw-docs-agent.sh +3 -1
- package/scripts/sw-docs.sh +2 -1
- package/scripts/sw-doctor.sh +453 -2
- package/scripts/sw-dora.sh +4 -1
- package/scripts/sw-durable.sh +12 -7
- package/scripts/sw-e2e-orchestrator.sh +17 -16
- package/scripts/sw-eventbus.sh +13 -4
- package/scripts/sw-evidence.sh +364 -12
- package/scripts/sw-feedback.sh +550 -9
- package/scripts/sw-fix.sh +20 -1
- package/scripts/sw-fleet-discover.sh +6 -2
- package/scripts/sw-fleet-viz.sh +9 -4
- package/scripts/sw-fleet.sh +5 -1
- package/scripts/sw-github-app.sh +18 -4
- package/scripts/sw-github-checks.sh +3 -2
- package/scripts/sw-github-deploy.sh +3 -2
- package/scripts/sw-github-graphql.sh +18 -7
- package/scripts/sw-guild.sh +5 -1
- package/scripts/sw-heartbeat.sh +5 -30
- package/scripts/sw-hello.sh +67 -0
- package/scripts/sw-hygiene.sh +10 -3
- package/scripts/sw-incident.sh +273 -5
- package/scripts/sw-init.sh +18 -2
- package/scripts/sw-instrument.sh +10 -2
- package/scripts/sw-intelligence.sh +44 -7
- package/scripts/sw-jira.sh +5 -1
- package/scripts/sw-launchd.sh +2 -1
- package/scripts/sw-linear.sh +4 -1
- package/scripts/sw-logs.sh +4 -1
- package/scripts/sw-loop.sh +436 -1076
- package/scripts/sw-memory.sh +357 -3
- package/scripts/sw-mission-control.sh +6 -1
- package/scripts/sw-model-router.sh +483 -27
- package/scripts/sw-otel.sh +15 -4
- package/scripts/sw-oversight.sh +14 -5
- package/scripts/sw-patrol-meta.sh +334 -0
- package/scripts/sw-pipeline-composer.sh +7 -1
- package/scripts/sw-pipeline-vitals.sh +12 -6
- package/scripts/sw-pipeline.sh +54 -2653
- package/scripts/sw-pm.sh +16 -8
- package/scripts/sw-pr-lifecycle.sh +2 -1
- package/scripts/sw-predictive.sh +17 -5
- package/scripts/sw-prep.sh +185 -2
- package/scripts/sw-ps.sh +5 -25
- package/scripts/sw-public-dashboard.sh +17 -4
- package/scripts/sw-quality.sh +14 -6
- package/scripts/sw-reaper.sh +8 -25
- package/scripts/sw-recruit.sh +156 -2303
- package/scripts/sw-regression.sh +19 -12
- package/scripts/sw-release-manager.sh +3 -1
- package/scripts/sw-release.sh +4 -1
- package/scripts/sw-remote.sh +3 -1
- package/scripts/sw-replay.sh +7 -1
- package/scripts/sw-retro.sh +158 -1
- package/scripts/sw-review-rerun.sh +3 -1
- package/scripts/sw-scale.sh +14 -5
- package/scripts/sw-security-audit.sh +6 -1
- package/scripts/sw-self-optimize.sh +173 -6
- package/scripts/sw-session.sh +9 -3
- package/scripts/sw-setup.sh +3 -1
- package/scripts/sw-stall-detector.sh +406 -0
- package/scripts/sw-standup.sh +15 -7
- package/scripts/sw-status.sh +3 -1
- package/scripts/sw-strategic.sh +14 -6
- package/scripts/sw-stream.sh +13 -4
- package/scripts/sw-swarm.sh +20 -7
- package/scripts/sw-team-stages.sh +13 -6
- package/scripts/sw-templates.sh +7 -31
- package/scripts/sw-testgen.sh +17 -6
- package/scripts/sw-tmux-pipeline.sh +4 -1
- package/scripts/sw-tmux-role-color.sh +2 -0
- package/scripts/sw-tmux-status.sh +1 -1
- package/scripts/sw-tmux.sh +37 -1
- package/scripts/sw-trace.sh +3 -1
- package/scripts/sw-tracker-github.sh +3 -0
- package/scripts/sw-tracker-jira.sh +3 -0
- package/scripts/sw-tracker-linear.sh +3 -0
- package/scripts/sw-tracker.sh +3 -1
- package/scripts/sw-triage.sh +3 -2
- package/scripts/sw-upgrade.sh +3 -1
- package/scripts/sw-ux.sh +5 -2
- package/scripts/sw-webhook.sh +5 -2
- package/scripts/sw-widgets.sh +9 -4
- package/scripts/sw-worktree.sh +15 -3
- package/scripts/test-skill-injection.sh +1233 -0
- package/templates/pipelines/autonomous.json +27 -3
- package/templates/pipelines/cost-aware.json +34 -8
- package/templates/pipelines/deployed.json +12 -0
- package/templates/pipelines/enterprise.json +12 -0
- package/templates/pipelines/fast.json +6 -0
- package/templates/pipelines/full.json +27 -3
- package/templates/pipelines/hotfix.json +6 -0
- package/templates/pipelines/standard.json +12 -0
- package/templates/pipelines/tdd.json +12 -0
@@ -0,0 +1,352 @@

# Shipwright Backlog: Quick Reference (20-Item Priority List)

## At-a-Glance Priority Matrix

| Priority | ID  | Feature                                              | Impact   | Effort   | ROI             | Category      |
| -------- | --- | ---------------------------------------------------- | -------- | -------- | --------------- | ------------- |
| 🔴 P0    | #1  | Semantic trajectory analysis + convergence detection | 🟢🟢🟢   | 🟡🟡     | **EXCEPTIONAL** | Loop Patterns |
| 🔴 P0    | #2  | Intent Specification Engine (business → outcomes)    | 🟢🟢🟢🟢 | 🔴🔴🔴   | **EXCEPTIONAL** | Dark Factory  |
| 🔴 P0    | #3  | Vulnerability Reward Model + online RL               | 🟢🟢🟢   | 🟡🟡     | **EXCEPTIONAL** | RL/Security   |
| 🔴 P0    | #5  | Speculative Cascade Model Routing                    | 🟢🟢🟢🟢 | 🟡🟡     | **VERY HIGH**   | Cost          |
| 🟡 P1    | #4  | Episodic Memory Layer                                | 🟢🟢🟢   | 🔴🔴🔴   | **HIGH**        | Memory        |
| 🟡 P1    | #6  | Mutation Testing Feedback Loop                       | 🟢🟢🟢   | 🟡🟡     | **HIGH**        | Testing       |
| 🟡 P1    | #7  | CI Repair Agent                                      | 🟢🟢🟢   | 🔴🔴🔴   | **HIGH**        | Self-Healing  |
| 🟡 P1    | #8  | LLM-as-a-Judge validation                            | 🟢🟢     | 🟡🟡     | **HIGH**        | Quality       |
| 🟢 P2    | #9  | Explicit Conflict Detection + DAG Scheduling         | 🟢🟢     | 🟡🟡     | **MEDIUM**      | Multi-Agent   |
| 🟢 P2    | #10 | Intelligent Reasoning Budget Allocation              | 🟢🟢     | 🟡🟡     | **MEDIUM**      | Reasoning     |
| 🟢 P2    | #11 | Formal Verification Integration (Dafny/Lean)         | 🟢🟢     | 🔴🔴🔴🔴 | **MEDIUM**      | Verification  |
| 🟢 P2    | #12 | Active Context Compression + Semantic Memory         | 🟢🟢🟢   | 🔴🔴🔴   | **MEDIUM**      | Memory        |
| 🟢 P2    | #13 | Multi-Pass Mutation Generation (LLM-based)           | 🟢🟢     | 🔴🔴🔴   | **MEDIUM**      | Testing       |
| 🟢 P2    | #14 | Anomaly Detection + Predictive Repair                | 🟢🟢     | 🔴🔴🔴   | **MEDIUM**      | Self-Healing  |
| 🟢 P2    | #15 | Cross-Repo Fleet Learning                            | 🟢🟢     | 🔴🔴🔴   | **MEDIUM**      | Memory/Fleet  |
| 🟢 P3    | #16 | Quorum-Based Merge Decisions                         | 🟢       | 🟡       | **LOW**         | Quality       |
| 🟢 P3    | #17 | Privacy-Hardening Mutations                          | 🟢       | 🔴🔴     | **LOW**         | Compliance    |
| 🟢 P3    | #18 | Dependency-Aware Task Scheduling (DAG)               | 🟢       | 🟡       | **LOW**         | Multi-Agent   |
| 🟢 P3    | #19 | Symbol Caching + Semantic Search                     | 🟢       | 🟡       | **LOW**         | Performance   |
| 🟢 P3    | #20 | WebSocket Real-Time Loop Monitoring                  | 🟢       | 🟡       | **LOW**         | Observability |

---

## PHASE 1 (Weeks 1-4): Convergence & Cost

### #1 Semantic Trajectory Analysis + Convergence Detection

**What it does:** Tracks embedding-space distance of consecutive agent outputs; detects stuck (contractive) vs wandering (exploratory) loops

**Why it matters:**

- Current: Hard iteration limit (5 iterations) wastes compute on stuck loops
- SOTA: Geometric Dynamics paper (arxiv 2512.10350) shows regime detection enables early exit
- Impact: 25-40% iteration waste reduction

**How to implement:**

1. On each loop iteration: encode agent output to embedding space (use Claude's embeddings)
2. Compute cosine distance to previous iteration's embedding
3. Track distance trend (contracting = converging, diverging = exploring)
4. Early exit if contracting + distance < threshold
5. Escalate to longer thinking if diverging unbounded

**Effort:** Medium (embedding integration, vector math, tracking state)
**Blocking:** Nothing (can implement in isolation)
**Files to modify:** `sw-loop.sh`, `sw-convergence-test.sh`
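
Steps 2-4 of item #1 can be sketched as follows. This is a minimal illustration, not Shipwright's implementation: the function names, the `0.05` exit threshold, and the rule that a trend needs at least three iterations are all assumptions, and the embeddings are taken as given (any embedding model could supply them).

```python
import math
from typing import List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def classify_regime(embeddings: List[Sequence[float]],
                    exit_threshold: float = 0.05) -> str:
    """Classify a loop from the distances between consecutive iterations.

    Returns "converged" (contracting and last step below the threshold),
    "contracting", "diverging", or "undetermined" (too few iterations or
    a mixed trend). The threshold value is a hypothetical default.
    """
    if len(embeddings) < 3:
        return "undetermined"
    distances = [cosine_distance(p, q) for p, q in zip(embeddings, embeddings[1:])]
    steps = list(zip(distances, distances[1:]))
    shrinking = all(later <= earlier for earlier, later in steps)
    growing = all(later > earlier for earlier, later in steps)
    if shrinking and distances[-1] < exit_threshold:
        return "converged"  # candidate for early exit
    if shrinking:
        return "contracting"
    if growing:
        return "diverging"  # candidate for escalation
    return "undetermined"
```

A loop driver could call `classify_regime` after each iteration and stop on `"converged"`; how "diverging unbounded" is detected in practice (step 5) is left open here.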

---

### #5 Speculative Cascade Model Routing

**What it does:** Try Haiku first (short timeout), escalate to Sonnet → Opus on failure

**Why it matters:**

- Current: Pick model upfront (per `--effort` flag), no escalation
- SOTA: Google Speculative Cascades paper; 30-60% cost reduction on median tasks
- Impact: 40-60% cost reduction while maintaining quality

**How to implement:**

1. Build failure prediction model: (query_type, difficulty) → success_rate on Haiku
2. For new query: estimate difficulty via embedding similarity
3. Route to Haiku with timeout (e.g., 30s)
4. If timeout/failure (tests fail), cascade to Sonnet, then Opus
5. Track cascade effectiveness per query type in memory

**Effort:** Medium (timeout management, cascade orchestration, tracking)
**Blocking:** Nothing
**Files to modify:** `sw-model-router.sh`, `sw-loop.sh`, new: `sw-cascade-router.sh`
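
The escalation in steps 3-4 of item #5 might look like the sketch below. The runner interface (a callable that takes a task and a timeout and raises `TimeoutError` when the budget is exceeded) is hypothetical, as is the `passes_checks` callback standing in for "tests pass"; no real model API is called.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical interface: (task, timeout_seconds) -> output, raising
# TimeoutError when the model does not answer within the budget.
ModelRunner = Callable[[str, float], str]


def run_cascade(task: str,
                tiers: List[Tuple[str, ModelRunner, float]],
                passes_checks: Callable[[str], bool],
                ) -> Tuple[Optional[str], Optional[str], List[str]]:
    """Try each (name, runner, timeout) tier in order, cheapest first.

    Returns (winning_tier, output, attempted_tiers). The first two are
    None when every tier times out or fails the checks.
    """
    attempted: List[str] = []
    for name, runner, timeout in tiers:
        attempted.append(name)
        try:
            output = runner(task, timeout)
        except TimeoutError:
            continue  # escalate to the next, more capable tier
        if passes_checks(output):
            return name, output, attempted
    return None, None, attempted
```

The returned `attempted_tiers` list is what step 5 would record per query type, so that later routing can skip tiers that rarely succeed.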

---

## PHASE 2 (Weeks 5-8): Security & Testing

### #3 Vulnerability Reward Model + Online RL Hardening

**What it does:** Add security signals (detected vulnerabilities, CWE patterns) to reward model; enable vulnerability-aware RL

**Why it matters:**

- Current: Reward signals are functional-only (test pass, coverage)
- SOTA: Meta's SecCoderX, Anthropic's security research
- Impact: 30-40% security issue reduction; compliance-ready code

**How to implement:**

1. Integrate lightweight SAST (e.g., Semgrep, Bandit, Trivy)
2. Run on generated code; extract (vulnerability_count, cwe_classes)
3. Add to reward signal as negative reward: `reward -= vulnerability_count * weight`
4. Store effective security fixes in episodic memory
5. Fine-tune on secure code examples

**Effort:** Medium (scanner integration, signal weighting, RL loop)
**Blocking:** Nothing
**Files to modify:** `sw-reward-aggregator.sh`, `sw-rl-optimizer.sh`, new: `sw-security-reward.sh`
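
Steps 2-3 of item #3 reduce to a small amount of arithmetic, sketched here. The finding format (a dict with an optional `"cwe"` key) and the default weight of `0.1` are assumptions for illustration; real scanner output would need to be mapped into this shape first.

```python
from typing import Dict, List


def cwe_classes(findings: List[Dict[str, str]]) -> Dict[str, int]:
    """Count findings per CWE identifier; findings without one go under 'unknown'."""
    counts: Dict[str, int] = {}
    for finding in findings:
        cwe = finding.get("cwe", "unknown")
        counts[cwe] = counts.get(cwe, 0) + 1
    return counts


def security_adjusted_reward(base_reward: float,
                             findings: List[Dict[str, str]],
                             weight: float = 0.1) -> float:
    """Apply reward -= vulnerability_count * weight to the functional reward."""
    vulnerability_count = len(findings)
    return base_reward - vulnerability_count * weight
```

A flat per-finding weight is the simplest reading of the formula; weighting by CWE class or severity would be a natural refinement once `cwe_classes` is available.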

---

### #6 Mutation Testing Feedback Loop

**What it does:** Validate test quality by checking % of mutants killed; regenerate tests if score low

**Why it matters:**

- Current: Coverage metrics inflated; 45% of LLM-generated tests are ineffective
- SOTA: Meta ACH, MutGen papers show mutation feedback improves test quality
- Impact: 30-40% better test effectiveness; catches subtle bugs

**How to implement:**

1. After test generation: run mutation tool (Major, PIT) on code
2. Run generated tests against mutants; compute mutation_score = killed / total
3. If score < threshold (e.g., 80%): add feedback to testgen prompt
4. Regenerate tests with mutation feedback
5. Store effective test patterns for reuse

**Effort:** Medium (mutation tool integration, feedback loop)
**Blocking:** Nothing
**Files to modify:** `sw-testgen.sh`, new: `sw-mutation-validator.sh`
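
The scoring and threshold check in steps 2-3 of item #6 can be written down directly. The input shape (a mapping from mutant ID to whether any test killed it) is an assumption; a real mutation tool's report would have to be parsed into it.

```python
from typing import Dict, List


def mutation_score(killed: int, total: int) -> float:
    """Fraction of mutants killed by the test suite (0.0 when there are no mutants)."""
    return killed / total if total else 0.0


def surviving_mutants_for_feedback(mutant_results: Dict[str, bool],
                                   threshold: float = 0.80) -> List[str]:
    """Return IDs of surviving mutants when the score is below the threshold.

    mutant_results maps a mutant ID to True if some test killed it.
    An empty list means the score met the threshold and no regeneration is needed.
    """
    killed = sum(1 for was_killed in mutant_results.values() if was_killed)
    if mutation_score(killed, len(mutant_results)) >= threshold:
        return []
    return sorted(m for m, was_killed in mutant_results.items() if not was_killed)
```

The surviving IDs are the natural payload for the testgen prompt in step 3, since they identify exactly which behavior changes the current tests fail to detect.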

---

### #13 Multi-Pass Mutation Generation (LLM-based)

**What it does:** Use LLM to generate diverse mutants (not just rule-based); Meta-style compliance

**Why it matters:**

- Current: Traditional mutation tools (Major) have limited operators
- SOTA: Mutants generated by GPT-4o/DeepSeek-R1 span 57 different AST node types vs 2 for rule-based tools
- Impact: Better mutation diversity; more confident test validation

**How to implement:**

1. Take source code + list of mutation types
2. Prompt LLM: "Generate N mutants that change behavior but keep syntax valid"
3. Validate mutants compile + are distinct from originals
4. Run tests; track mutation score
5. Feed back into testgen loop if coverage is low

**Effort:** High (prompt engineering, mutation validation)
**Blocking:** Nothing
**Files to modify:** new: `sw-llm-mutant-generator.sh`
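
Step 3 of item #13 (keep only mutants that compile and differ from the original) is sketched below for Python source, which is an assumption: the backlog does not name a target language, and the tools it mentions (Major, PIT) target Java. The check is structural, so it filters out whitespace-only or comment-only "mutants" but cannot prove that a mutant changes behavior.

```python
import ast
from typing import List


def valid_distinct_mutants(original: str, candidates: List[str]) -> List[str]:
    """Keep candidates that parse and whose AST differs from the original and from each other."""
    seen = {ast.dump(ast.parse(original))}
    kept: List[str] = []
    for source in candidates:
        try:
            fingerprint = ast.dump(ast.parse(source))
        except SyntaxError:
            continue  # mutant does not compile
        if fingerprint in seen:
            continue  # same structure as the original or an earlier mutant
        seen.add(fingerprint)
        kept.append(source)
    return kept
```

Comparing `ast.dump` output rather than raw text means formatting noise from the LLM does not inflate the mutant count.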

---

## PHASE 3 (Weeks 9-12): Memory & Self-Healing

### #4 Episodic Memory Layer

**What it does:** Store complete execution traces (inputs, actions, outcomes); enable case-based reasoning

**Why it matters:**

- Current: Memory is pattern-based ("when X fails, do Y")
- SOTA: Mem0, EM-LLM, MemRL papers show episodic learning 20-35% faster
- Impact: Case-based analogy; long-horizon self-improvement

**How to implement:**

1. On each pipeline run: capture episode JSON (inputs, agent_actions, outputs, duration, cost, test_results)
2. Store in episodic DB (SQLite + JSON or Postgres)
3. Query: "Find 3 similar past episodes" (via embedding similarity)
4. Inject the retrieved cases as few-shot examples into new agent prompts
5. Active compression: every 10 episodes, consolidate → semantic facts

**Effort:** High (episode storage, retrieval, compression)
**Blocking:** Nothing
**Files to modify:** `sw-memory.sh`, new: `sw-episodic-memory.sh`
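
Steps 2-3 of item #4 (store episodes, then fetch the three most similar) could be prototyped with the standard library alone. The table layout, function names, and brute-force similarity scan are assumptions made to keep the sketch self-contained; a production store would likely use a vector index instead of scanning every row.

```python
import json
import math
import sqlite3
from typing import Any, Dict, List, Sequence


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the episode table; ':memory:' keeps the sketch self-contained."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS episodes ("
        "id INTEGER PRIMARY KEY, embedding TEXT NOT NULL, episode TEXT NOT NULL)"
    )
    return conn


def add_episode(conn: sqlite3.Connection,
                embedding: Sequence[float],
                episode: Dict[str, Any]) -> None:
    """Store one episode as JSON next to its embedding vector."""
    conn.execute(
        "INSERT INTO episodes (embedding, episode) VALUES (?, ?)",
        (json.dumps(list(embedding)), json.dumps(episode)),
    )
    conn.commit()


def _cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def similar_episodes(conn: sqlite3.Connection,
                     query: Sequence[float],
                     k: int = 3) -> List[Dict[str, Any]]:
    """Return the k stored episodes whose embeddings are closest to the query."""
    rows = conn.execute("SELECT embedding, episode FROM episodes").fetchall()
    scored = [(_cosine_similarity(query, json.loads(emb)), json.loads(ep)) for emb, ep in rows]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [episode for _, episode in scored[:k]]
```

The episodes returned by `similar_episodes` are what step 4 would format into few-shot examples.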

---

### #7 CI Repair Agent

**What it does:** When test/check fails, spawn repair agent to diagnose & fix root cause

**Why it matters:**

- Current: Retries on failure; no diagnosis
- SOTA: Pipeline Doctor pattern (2026 AIOps trend); 67% MTTR drop
- Impact: 50% fewer retries; faster merge times

**How to implement:**

1. Detect test/check failure (via CI logs)
2. Classify failure: timeout, race condition, assertion, resource, flaky
3. Spawn repair agent with failure context (logs, git diff, error)
4. Agent proposes fix (increase timeout, add sync, skip flaky test, etc.)
5. Re-run test; if passes, commit repair
6. Track effective repairs in memory

**Effort:** High (log parsing, classification, repair proposals, commit management)
**Blocking:** Nothing
**Files to modify:** `sw-ci.sh`, new: `sw-repair-agent.sh`
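
Step 2 of item #7 (classifying a failure from its log) could start as a rule table like the one below. The regular expressions are illustrative guesses rather than patterns taken from Shipwright, and the `passed_on_rerun` flag is a hypothetical input: flakiness cannot be read from a single log, so it is modeled as "same commit passed when re-run".

```python
import re
from typing import List, Tuple

# Ordered (category, pattern) rules; the first match wins.
_RULES: List[Tuple[str, "re.Pattern[str]"]] = [
    ("timeout", re.compile(r"timed? ?out|deadline exceeded", re.IGNORECASE)),
    ("race condition", re.compile(r"data race|race condition|deadlock", re.IGNORECASE)),
    ("resource", re.compile(r"out of memory|no space left on device|too many open files",
                            re.IGNORECASE)),
    ("assertion", re.compile(r"\bassert", re.IGNORECASE)),
]


def classify_failure(log_text: str, passed_on_rerun: bool = False) -> str:
    """Map a CI failure log to one of the backlog's failure classes."""
    if passed_on_rerun:
        return "flaky"
    for category, pattern in _RULES:
        if pattern.search(log_text):
            return category
    return "unclassified"
```

The returned class is what the repair agent in step 3 would receive alongside the logs and diff, and `"unclassified"` is a reasonable signal to fall back to a plain retry.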

---

### #8 LLM-as-a-Judge Validation

**What it does:** Secondary model evaluates primary agent output; triggers repair if needed

**Why it matters:**

- Current: Quality gates are rule-based (coverage > X%, no ASan errors)
- SOTA: 2026 standard design pattern for agentic systems
- Impact: 10-15% fewer merge regressions; catches issues rules miss

**How to implement:**

1. After primary agent completes task: send code + acceptance criteria to Judge model
2. Judge evaluates: "Does this code meet requirements? Any issues?"
3. If Judge flags issues: auto-trigger repair agent or escalate
4. Log Judge decisions for learning
5. Track Judge accuracy (via post-merge bug rates)

**Effort:** Medium (prompt engineering, logic orchestration)
**Blocking:** Nothing
**Files to modify:** `sw-quality.sh`, new: `sw-judge.sh`
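
The control flow of steps 1-4 of item #8 is small enough to sketch with the Judge model abstracted away. The verdict shape (`meets_requirements` plus a list of `issues`) and the `judge` callable are hypothetical; how the Judge is prompted and which model backs it are outside this sketch.

```python
from typing import Callable, Dict, List

# Hypothetical verdict shape: {"meets_requirements": bool, "issues": [str, ...]}
Verdict = Dict[str, object]


def gate_on_judge(code: str,
                  acceptance_criteria: str,
                  judge: Callable[[str, str], Verdict],
                  decision_log: List[Verdict]) -> str:
    """Ask the judge for a verdict, log it, and return "accept" or "repair"."""
    verdict = judge(code, acceptance_criteria)
    decision_log.append(verdict)  # kept so accuracy can be checked against post-merge bugs
    if verdict["meets_requirements"] and not verdict["issues"]:
        return "accept"
    return "repair"
```

Keeping every verdict in `decision_log`, including accepted ones, is what makes step 5 possible: accuracy can only be estimated if approvals are recorded alongside later bug reports.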

---

## TIER 2 Items (Brief Summary)

| #   | Feature                               | Quick Implementation Path                                                      |
| --- | ------------------------------------- | ------------------------------------------------------------------------------ |
| #2  | Intent Specification Engine           | Research phase; build DSL for constraints; integrate formal spec generation    |
| #9  | Conflict Detection + DAG              | Track file locks per agent; build task DAG scheduler; merge conflict resolver  |
| #10 | Reasoning Budget Allocation           | Track thinking cost vs outcome; build (task_type, complexity) → tokens lookup  |
| #11 | Formal Verification (Dafny/Lean)      | Integrate theorem prover APIs; generate specs; gate merge on proof completion  |
| #12 | Active Context Compression            | EM-LLM approach: Bayesian surprise + graph refinement for episode boundaries   |
| #14 | Anomaly Detection + Predictive Repair | Time-series analysis on logs; ML model for failure prediction; repair triggers |
| #15 | Cross-Repo Fleet Learning             | Share patterns via fleet event bus; rank patterns by repo similarity           |

---

## Implementation Checklist

### PHASE 1 (Target: 2 weeks per item)

- [ ] #1 Semantic trajectory analysis
  - [ ] Embedding integration
  - [ ] Distance tracking + regime classification
  - [ ] Early exit logic
  - [ ] Tests + monitoring
- [ ] #5 Speculative cascade routing
  - [ ] Failure prediction model
  - [ ] Cascade orchestration
  - [ ] Timeout management
  - [ ] Tracking + learning

### PHASE 2 (Target: 1.5-2 weeks per item)

- [ ] #3 Vulnerability reward model
- [ ] #6 Mutation testing loop
- [ ] #13 LLM-based mutants

### PHASE 3 (Target: 2-3 weeks per item)

- [ ] #4 Episodic memory layer
- [ ] #7 CI repair agent
- [ ] #8 LLM-as-a-Judge

---

## Success Metrics (Post-Implementation)

| Feature             | Metric                              | Target       | Current  |
| ------------------- | ----------------------------------- | ------------ | -------- |
| #1 Loop convergence | Iteration waste reduction           | 25-40%       | Baseline |
| #5 Cascade routing  | Cost reduction on median tasks      | 40-60%       | Baseline |
| #3 Security rewards | Bug reduction                       | 30-40%       | Baseline |
| #6 Mutation testing | Test effectiveness (mutation score) | >80%         | ~60%     |
| #4 Episodic memory  | Solution time on similar tasks      | -20% to -35% | Baseline |
| #7 CI repair        | Retry cycles                        | -50%         | Baseline |
| Overall             | Pipeline success rate               | >85%         | ~77%     |

---

## Dependencies & Blocking Relationships

```
#1 (trajectory) ─────┐
                     ├──→ #5 (cascade) ──→ Cost optimization ✓
                     │
#2 (intent)          [research phase; no immediate blocks]

#3 (vulnerability) ──┐
#6 (mutations)       ├──→ Security + Testing quality
#13 (LLM mutants) ───┘

#4 (episodic) ───────┐
#12 (compression) ───┤
#15 (fleet learning) ┤  All feed each other; can implement in parallel
                     └──→ Long-horizon learning

#7 (CI repair) ──┐
#8 (judge)       └──→ Quality gates

No critical blocking path: all items can start immediately with risk.
Recommend: Start #1 + #5 in week 1, #3 + #6 in week 5, #4 + #7 in week 9.
```

---

## Cost-Benefit Analysis

### Immediate ROI (Phase 1-2, Weeks 1-8)

**Investment:**

- 2 engineers × 8 weeks @ $200K/year = ~$60K engineering cost
- Compute for research + prototyping = ~$5K

**Returns (Annual):**

- Cost reduction via cascade: 40-60% savings on compute (current $50K/month → $20-30K) = **$240-360K/year**
- Faster iteration: 30% speedup on 200 pipelines/month × $5/pipeline = **$30K/year**
- Security improvement: 30-40% fewer CVEs → reduced incident response = **$50K+ saved**

**Total Annual ROI: $320-440K on $65K investment = 5-7x**
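
The Phase 1-2 totals can be reproduced from the figures listed above. This sketch only sums the stated numbers; it treats "$50K+" as exactly $50K at both ends of the range, which is an assumption, and it does not re-derive any individual line item.

```python
from typing import List, Tuple

ENGINEERING_COST = 60_000  # ~2 engineers x 8 weeks at $200K/year
COMPUTE_COST = 5_000       # research + prototyping

# (low, high) annual returns in dollars, as listed in the document.
ANNUAL_RETURNS: List[Tuple[int, int]] = [
    (240_000, 360_000),  # cascade routing: $20-30K/month saved
    (30_000, 30_000),    # faster iteration
    (50_000, 50_000),    # security improvement ("$50K+", lower bound used)
]


def roi_range(returns: List[Tuple[int, int]], investment: int) -> Tuple[int, int, float, float]:
    """Return (low_total, high_total, low_multiple, high_multiple)."""
    low = sum(lo for lo, _ in returns)
    high = sum(hi for _, hi in returns)
    return low, high, low / investment, high / investment


low, high, low_multiple, high_multiple = roi_range(
    ANNUAL_RETURNS, ENGINEERING_COST + COMPUTE_COST
)
print(f"${low:,}-${high:,} per year, {low_multiple:.1f}x-{high_multiple:.1f}x")
```

The multiples come out at about 4.9x and 6.8x, which the summary line rounds to 5-7x.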

### Long-Term ROI (Phase 3 + Beyond, Weeks 9-26)

**Additional returns:**

- Episodic memory: 20-35% faster solutions × 200 pipelines = **$50-85K/year**
- Self-healing CI: 50% fewer retries = **$30K/year** (fewer human reviews)
- Fleet learning: 20% faster on new projects = **$40K/year**

**Total Long-Term ROI: $440-595K on $120K investment = roughly 3.7-5x**

---

## Next Steps

1. **This week:** Review [CUTTING_EDGE_RESEARCH_2026.md](./CUTTING_EDGE_RESEARCH_2026.md) for full details on each feature
2. **Next week:** Spike on #1 (trajectory analysis): prototype embedding-space distance tracking
3. **Following week:** Begin #5 (cascade routing) and #3 (vulnerability rewards) in parallel
4. **Week 4+:** Ramp up to PHASE 2 items as Phase 1 items ship

---

**Generated:** April 4, 2026
**Total research effort:** 50+ sources, 25+ papers, 8 research areas
**Full report:** See CUTTING_EDGE_RESEARCH_2026.md (comprehensive analysis)