shipwright-cli 3.2.0 → 3.3.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/.claude/agents/code-reviewer.md +2 -0
- package/.claude/agents/devops-engineer.md +2 -0
- package/.claude/agents/doc-fleet-agent.md +2 -0
- package/.claude/agents/pipeline-agent.md +2 -0
- package/.claude/agents/shell-script-specialist.md +2 -0
- package/.claude/agents/test-specialist.md +2 -0
- package/.claude/hooks/agent-crash-capture.sh +32 -0
- package/.claude/hooks/post-tool-use.sh +3 -2
- package/.claude/hooks/pre-tool-use.sh +35 -3
- package/README.md +4 -4
- package/claude-code/hooks/config-change.sh +18 -0
- package/claude-code/hooks/instructions-reloaded.sh +7 -0
- package/claude-code/hooks/worktree-create.sh +25 -0
- package/claude-code/hooks/worktree-remove.sh +20 -0
- package/config/code-constitution.json +130 -0
- package/dashboard/middleware/auth.ts +134 -0
- package/dashboard/middleware/constants.ts +21 -0
- package/dashboard/public/index.html +2 -6
- package/dashboard/public/styles.css +100 -97
- package/dashboard/routes/auth.ts +38 -0
- package/dashboard/server.ts +66 -25
- package/dashboard/services/config.ts +26 -0
- package/dashboard/services/db.ts +118 -0
- package/dashboard/src/canvas/pixel-agent.ts +298 -0
- package/dashboard/src/canvas/pixel-sprites.ts +440 -0
- package/dashboard/src/canvas/shipyard-effects.ts +367 -0
- package/dashboard/src/canvas/shipyard-scene.ts +616 -0
- package/dashboard/src/canvas/submarine-layout.ts +267 -0
- package/dashboard/src/components/header.ts +8 -7
- package/dashboard/src/core/router.ts +1 -0
- package/dashboard/src/design/submarine-theme.ts +253 -0
- package/dashboard/src/main.ts +2 -0
- package/dashboard/src/types/api.ts +2 -1
- package/dashboard/src/views/activity.ts +2 -1
- package/dashboard/src/views/shipyard.ts +39 -0
- package/dashboard/types/index.ts +166 -0
- package/docs/plans/2026-02-28-compound-audit-and-shipyard-design.md +186 -0
- package/docs/plans/2026-02-28-skipper-shipwright-implementation-plan.md +1182 -0
- package/docs/plans/2026-02-28-skipper-shipwright-integration-design.md +531 -0
- package/docs/plans/2026-03-01-ai-powered-skill-injection-design.md +298 -0
- package/docs/plans/2026-03-01-ai-powered-skill-injection-plan.md +1109 -0
- package/docs/plans/2026-03-01-capabilities-cleanup-plan.md +658 -0
- package/docs/plans/2026-03-01-clean-architecture-plan.md +924 -0
- package/docs/plans/2026-03-01-compound-audit-cascade-design.md +191 -0
- package/docs/plans/2026-03-01-compound-audit-cascade-plan.md +921 -0
- package/docs/plans/2026-03-01-deep-integration-plan.md +851 -0
- package/docs/plans/2026-03-01-pipeline-audit-trail-design.md +145 -0
- package/docs/plans/2026-03-01-pipeline-audit-trail-plan.md +770 -0
- package/docs/plans/2026-03-01-refined-depths-brand-design.md +382 -0
- package/docs/plans/2026-03-01-refined-depths-implementation.md +599 -0
- package/docs/plans/2026-03-01-skipper-kernel-integration-design.md +203 -0
- package/docs/plans/2026-03-01-unified-platform-design.md +272 -0
- package/docs/plans/2026-03-07-claude-code-feature-integration-design.md +189 -0
- package/docs/plans/2026-03-07-claude-code-feature-integration-plan.md +1165 -0
- package/docs/research/BACKLOG_QUICK_REFERENCE.md +352 -0
- package/docs/research/CUTTING_EDGE_RESEARCH_2026.md +546 -0
- package/docs/research/RESEARCH_INDEX.md +439 -0
- package/docs/research/RESEARCH_SOURCES.md +440 -0
- package/docs/research/RESEARCH_SUMMARY.txt +275 -0
- package/docs/superpowers/specs/2026-03-10-pipeline-quality-revolution-design.md +341 -0
- package/package.json +2 -2
- package/scripts/lib/adaptive-model.sh +427 -0
- package/scripts/lib/adaptive-timeout.sh +316 -0
- package/scripts/lib/audit-trail.sh +309 -0
- package/scripts/lib/auto-recovery.sh +471 -0
- package/scripts/lib/bandit-selector.sh +431 -0
- package/scripts/lib/bootstrap.sh +104 -2
- package/scripts/lib/causal-graph.sh +455 -0
- package/scripts/lib/compat.sh +126 -0
- package/scripts/lib/compound-audit.sh +337 -0
- package/scripts/lib/constitutional.sh +454 -0
- package/scripts/lib/context-budget.sh +359 -0
- package/scripts/lib/convergence.sh +594 -0
- package/scripts/lib/cost-optimizer.sh +634 -0
- package/scripts/lib/daemon-adaptive.sh +10 -0
- package/scripts/lib/daemon-dispatch.sh +106 -17
- package/scripts/lib/daemon-failure.sh +34 -4
- package/scripts/lib/daemon-patrol.sh +23 -2
- package/scripts/lib/daemon-poll-github.sh +361 -0
- package/scripts/lib/daemon-poll-health.sh +299 -0
- package/scripts/lib/daemon-poll.sh +27 -611
- package/scripts/lib/daemon-state.sh +112 -66
- package/scripts/lib/daemon-triage.sh +10 -0
- package/scripts/lib/dod-scorecard.sh +442 -0
- package/scripts/lib/error-actionability.sh +300 -0
- package/scripts/lib/formal-spec.sh +461 -0
- package/scripts/lib/helpers.sh +177 -4
- package/scripts/lib/intent-analysis.sh +409 -0
- package/scripts/lib/loop-convergence.sh +350 -0
- package/scripts/lib/loop-iteration.sh +682 -0
- package/scripts/lib/loop-progress.sh +48 -0
- package/scripts/lib/loop-restart.sh +185 -0
- package/scripts/lib/memory-effectiveness.sh +506 -0
- package/scripts/lib/mutation-executor.sh +352 -0
- package/scripts/lib/outcome-feedback.sh +521 -0
- package/scripts/lib/pipeline-cli.sh +336 -0
- package/scripts/lib/pipeline-commands.sh +1216 -0
- package/scripts/lib/pipeline-detection.sh +100 -2
- package/scripts/lib/pipeline-execution.sh +897 -0
- package/scripts/lib/pipeline-github.sh +28 -3
- package/scripts/lib/pipeline-intelligence-compound.sh +431 -0
- package/scripts/lib/pipeline-intelligence-scoring.sh +407 -0
- package/scripts/lib/pipeline-intelligence-skip.sh +181 -0
- package/scripts/lib/pipeline-intelligence.sh +100 -1136
- package/scripts/lib/pipeline-quality-bash-compat.sh +182 -0
- package/scripts/lib/pipeline-quality-checks.sh +17 -715
- package/scripts/lib/pipeline-quality-gates.sh +563 -0
- package/scripts/lib/pipeline-stages-build.sh +730 -0
- package/scripts/lib/pipeline-stages-delivery.sh +965 -0
- package/scripts/lib/pipeline-stages-intake.sh +1133 -0
- package/scripts/lib/pipeline-stages-monitor.sh +407 -0
- package/scripts/lib/pipeline-stages-review.sh +1022 -0
- package/scripts/lib/pipeline-stages.sh +59 -2929
- package/scripts/lib/pipeline-state.sh +36 -5
- package/scripts/lib/pipeline-util.sh +487 -0
- package/scripts/lib/policy-learner.sh +438 -0
- package/scripts/lib/process-reward.sh +493 -0
- package/scripts/lib/project-detect.sh +649 -0
- package/scripts/lib/quality-profile.sh +334 -0
- package/scripts/lib/recruit-commands.sh +885 -0
- package/scripts/lib/recruit-learning.sh +739 -0
- package/scripts/lib/recruit-roles.sh +648 -0
- package/scripts/lib/reward-aggregator.sh +458 -0
- package/scripts/lib/rl-optimizer.sh +362 -0
- package/scripts/lib/root-cause.sh +427 -0
- package/scripts/lib/scope-enforcement.sh +445 -0
- package/scripts/lib/session-restart.sh +493 -0
- package/scripts/lib/skill-memory.sh +300 -0
- package/scripts/lib/skill-registry.sh +775 -0
- package/scripts/lib/spec-driven.sh +476 -0
- package/scripts/lib/test-helpers.sh +18 -7
- package/scripts/lib/test-holdout.sh +429 -0
- package/scripts/lib/test-optimizer.sh +511 -0
- package/scripts/shipwright-file-suggest.sh +45 -0
- package/scripts/skills/adversarial-quality.md +61 -0
- package/scripts/skills/api-design.md +44 -0
- package/scripts/skills/architecture-design.md +50 -0
- package/scripts/skills/brainstorming.md +43 -0
- package/scripts/skills/data-pipeline.md +44 -0
- package/scripts/skills/deploy-safety.md +64 -0
- package/scripts/skills/documentation.md +38 -0
- package/scripts/skills/frontend-design.md +45 -0
- package/scripts/skills/generated/.gitkeep +0 -0
- package/scripts/skills/generated/_refinements/.gitkeep +0 -0
- package/scripts/skills/generated/_refinements/adversarial-quality.patch.md +3 -0
- package/scripts/skills/generated/_refinements/architecture-design.patch.md +3 -0
- package/scripts/skills/generated/_refinements/brainstorming.patch.md +3 -0
- package/scripts/skills/generated/cli-version-management.md +29 -0
- package/scripts/skills/generated/collection-system-validation.md +99 -0
- package/scripts/skills/generated/large-scale-c-refactoring-coordination.md +97 -0
- package/scripts/skills/generated/pattern-matching-similarity-scoring.md +195 -0
- package/scripts/skills/generated/test-parallelization-detection.md +65 -0
- package/scripts/skills/observability.md +79 -0
- package/scripts/skills/performance.md +48 -0
- package/scripts/skills/pr-quality.md +49 -0
- package/scripts/skills/product-thinking.md +43 -0
- package/scripts/skills/security-audit.md +49 -0
- package/scripts/skills/systematic-debugging.md +40 -0
- package/scripts/skills/testing-strategy.md +47 -0
- package/scripts/skills/two-stage-review.md +52 -0
- package/scripts/skills/validation-thoroughness.md +55 -0
- package/scripts/sw +9 -3
- package/scripts/sw-activity.sh +9 -2
- package/scripts/sw-adaptive.sh +2 -1
- package/scripts/sw-adversarial.sh +2 -1
- package/scripts/sw-architecture-enforcer.sh +3 -1
- package/scripts/sw-auth.sh +12 -2
- package/scripts/sw-autonomous.sh +5 -1
- package/scripts/sw-changelog.sh +4 -1
- package/scripts/sw-checkpoint.sh +2 -1
- package/scripts/sw-ci.sh +5 -1
- package/scripts/sw-cleanup.sh +4 -26
- package/scripts/sw-code-review.sh +10 -4
- package/scripts/sw-connect.sh +2 -1
- package/scripts/sw-context.sh +2 -1
- package/scripts/sw-cost.sh +48 -3
- package/scripts/sw-daemon.sh +66 -9
- package/scripts/sw-dashboard.sh +3 -1
- package/scripts/sw-db.sh +59 -16
- package/scripts/sw-decide.sh +8 -2
- package/scripts/sw-decompose.sh +360 -17
- package/scripts/sw-deps.sh +4 -1
- package/scripts/sw-developer-simulation.sh +4 -1
- package/scripts/sw-discovery.sh +325 -2
- package/scripts/sw-doc-fleet.sh +4 -1
- package/scripts/sw-docs-agent.sh +3 -1
- package/scripts/sw-docs.sh +2 -1
- package/scripts/sw-doctor.sh +453 -2
- package/scripts/sw-dora.sh +4 -1
- package/scripts/sw-durable.sh +4 -3
- package/scripts/sw-e2e-orchestrator.sh +17 -16
- package/scripts/sw-eventbus.sh +7 -1
- package/scripts/sw-evidence.sh +364 -12
- package/scripts/sw-feedback.sh +550 -9
- package/scripts/sw-fix.sh +20 -1
- package/scripts/sw-fleet-discover.sh +6 -2
- package/scripts/sw-fleet-viz.sh +4 -1
- package/scripts/sw-fleet.sh +5 -1
- package/scripts/sw-github-app.sh +16 -3
- package/scripts/sw-github-checks.sh +3 -2
- package/scripts/sw-github-deploy.sh +3 -2
- package/scripts/sw-github-graphql.sh +18 -7
- package/scripts/sw-guild.sh +5 -1
- package/scripts/sw-heartbeat.sh +5 -30
- package/scripts/sw-hello.sh +67 -0
- package/scripts/sw-hygiene.sh +6 -1
- package/scripts/sw-incident.sh +265 -1
- package/scripts/sw-init.sh +18 -2
- package/scripts/sw-instrument.sh +10 -2
- package/scripts/sw-intelligence.sh +42 -6
- package/scripts/sw-jira.sh +5 -1
- package/scripts/sw-launchd.sh +2 -1
- package/scripts/sw-linear.sh +4 -1
- package/scripts/sw-logs.sh +4 -1
- package/scripts/sw-loop.sh +432 -1128
- package/scripts/sw-memory.sh +356 -2
- package/scripts/sw-mission-control.sh +6 -1
- package/scripts/sw-model-router.sh +481 -26
- package/scripts/sw-otel.sh +13 -4
- package/scripts/sw-oversight.sh +14 -5
- package/scripts/sw-patrol-meta.sh +334 -0
- package/scripts/sw-pipeline-composer.sh +5 -1
- package/scripts/sw-pipeline-vitals.sh +2 -1
- package/scripts/sw-pipeline.sh +53 -2664
- package/scripts/sw-pm.sh +12 -5
- package/scripts/sw-pr-lifecycle.sh +2 -1
- package/scripts/sw-predictive.sh +7 -1
- package/scripts/sw-prep.sh +185 -2
- package/scripts/sw-ps.sh +5 -25
- package/scripts/sw-public-dashboard.sh +15 -3
- package/scripts/sw-quality.sh +2 -1
- package/scripts/sw-reaper.sh +8 -25
- package/scripts/sw-recruit.sh +156 -2303
- package/scripts/sw-regression.sh +19 -12
- package/scripts/sw-release-manager.sh +3 -1
- package/scripts/sw-release.sh +4 -1
- package/scripts/sw-remote.sh +3 -1
- package/scripts/sw-replay.sh +7 -1
- package/scripts/sw-retro.sh +158 -1
- package/scripts/sw-review-rerun.sh +3 -1
- package/scripts/sw-scale.sh +10 -3
- package/scripts/sw-security-audit.sh +6 -1
- package/scripts/sw-self-optimize.sh +6 -3
- package/scripts/sw-session.sh +9 -3
- package/scripts/sw-setup.sh +3 -1
- package/scripts/sw-stall-detector.sh +406 -0
- package/scripts/sw-standup.sh +15 -7
- package/scripts/sw-status.sh +3 -1
- package/scripts/sw-strategic.sh +4 -1
- package/scripts/sw-stream.sh +7 -1
- package/scripts/sw-swarm.sh +18 -6
- package/scripts/sw-team-stages.sh +13 -6
- package/scripts/sw-templates.sh +5 -29
- package/scripts/sw-testgen.sh +7 -1
- package/scripts/sw-tmux-pipeline.sh +4 -1
- package/scripts/sw-tmux-role-color.sh +2 -0
- package/scripts/sw-tmux-status.sh +1 -1
- package/scripts/sw-tmux.sh +3 -1
- package/scripts/sw-trace.sh +3 -1
- package/scripts/sw-tracker-github.sh +3 -0
- package/scripts/sw-tracker-jira.sh +3 -0
- package/scripts/sw-tracker-linear.sh +3 -0
- package/scripts/sw-tracker.sh +3 -1
- package/scripts/sw-triage.sh +2 -1
- package/scripts/sw-upgrade.sh +3 -1
- package/scripts/sw-ux.sh +5 -2
- package/scripts/sw-webhook.sh +3 -1
- package/scripts/sw-widgets.sh +3 -1
- package/scripts/sw-worktree.sh +15 -3
- package/scripts/test-skill-injection.sh +1233 -0
- package/templates/pipelines/autonomous.json +27 -3
- package/templates/pipelines/cost-aware.json +34 -8
- package/templates/pipelines/deployed.json +12 -0
- package/templates/pipelines/enterprise.json +12 -0
- package/templates/pipelines/fast.json +6 -0
- package/templates/pipelines/full.json +27 -3
- package/templates/pipelines/hotfix.json +6 -0
- package/templates/pipelines/standard.json +12 -0
- package/templates/pipelines/tdd.json +12 -0
@@ -0,0 +1,195 @@
# Pattern Matching & Failure Prevention Scoring

## Overview

This skill guides design and implementation of pattern-based proactive failure prevention: matching incoming issues against captured failure patterns, scoring similarity, injecting relevant context, and measuring whether patterns actually prevent repeat failures.

## Similarity Scoring Algorithm (0-100 scale)

For each incoming issue, compute a composite similarity score against each known failure pattern:

### Component 1: Title Similarity (40% weight)
- Fuzzy string matching using token overlap or Levenshtein distance normalized by string length
- Captures semantic closeness of the problem description
- Example: "API timeout on user endpoint" vs "Timeout in auth middleware" → ~0.7 similarity → 28 points

### Component 2: File Overlap (35% weight)
- Compare changed files in the original failure vs the incoming issue
- Score = (overlapping_files / max(original_files, incoming_files)) * 100
- Files touching the same components are more likely to have similar root causes
- Example: Both touched `scripts/sw-daemon.sh` and `scripts/lib/daemon-dispatch.sh` → 35 points if full overlap

### Component 3: Error Signature Match (25% weight)
- Check whether error message substrings or error codes appear in both
- Extract from error-summary.json or the stack trace (structured format preferred)
- Example: Both contain "pipefail" or "ENOENT" → 25 points

**Formula: score = (title_score * 0.4) + (file_score * 0.35) + (error_score * 0.25)**
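A minimal Python sketch of this composite score (the `similarity_score` helper, its dict field names, and the `difflib`-based title fuzzing are all assumptions for illustration, not the shipwright-cli implementation):

```python
from difflib import SequenceMatcher

def similarity_score(issue: dict, pattern: dict) -> float:
    """Composite 0-100 similarity per the weighted formula above."""
    # Title similarity (40%): fuzzy ratio scaled to 0-100
    title = SequenceMatcher(None, issue["title"].lower(),
                            pattern["title"].lower()).ratio() * 100
    # File overlap (35%): shared files over the larger changed-file set
    a, b = set(issue["files"]), set(pattern["files"])
    files = (len(a & b) / max(len(a), len(b)) * 100) if (a or b) else 0.0
    # Error signature (25%): all-or-nothing substring match
    errors = 100.0 if any(sig in issue["error"]
                          for sig in pattern["signatures"]) else 0.0
    return title * 0.4 + files * 0.35 + errors * 0.25
```

A full match on all three components yields 100; an issue with no file or signature overlap can score at most 40 from the title alone, which keeps it below the default injection threshold of 60.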
## Injection Thresholds

- **Below 60**: Pattern not relevant, no injection
- **60-80**: Inject with confidence tag ("medium confidence match")
- **80-100**: Inject with high confidence ("strong pattern match")
- **Configurable threshold**: daemon-config.json `memory_pattern_matching.similarity_threshold`

## Proactive Injection Strategy

When score > threshold (at pipeline spawn time, before the plan stage):

1. Extract relevant context from the memory pattern:
   - Root cause description
   - Applied fix(es)
   - Environment/version context if present
   - What worked vs what didn't

2. Inject into the pipeline prompt:

   ```
   Similar pattern found (confidence: 85%): Issue #123 "API timeout on user endpoint"
   Root cause: Unbounded goroutine creation in event loop
   Applied fix: Add semaphore to limit concurrent handlers
   Files affected: scripts/sw-daemon.sh, internal/loop.go
   ```

3. Tag injection metadata:
   - `pattern_id`: which pattern was injected
   - `injection_score`: 0-100 confidence
   - `timestamp`: when injected
   - `source_issue_id`: which failure this pattern came from

4. **Important**: No coercion. The agent can ignore the injected pattern if the context doesn't match.
## Outcome Tracking

After the pipeline completes, record:

```json
{
  "pattern_injected": true,
  "pattern_id": "mem_abc123",
  "injection_score": 85,
  "incoming_issue_id": "456",
  "source_issue_id": "123",
  "failure_occurred": false,
  "failure_type": null,
  "expected_failure_type": "timeout",
  "failure_type_matched": null,
  "failure_prevented": true,
  "confidence_in_prevention": 0.7,
  "notes": "Applied semaphore fix suggested by pattern; no timeout occurred."
}
```

**Caveat**: `failure_prevented` is **inference**, not proof.
- True positive: pattern injected, agent applied the fix, no failure occurred, and the failure type matched the expected type
- Possible false positive: the pattern coincidentally matched, but the agent's own skill prevented the failure
- Always include a confidence score; never claim 100% causation

## Effectiveness Metrics (Dashboard)

**Per-pattern metrics:**
- **Success Rate**: (times_injected AND failure_prevented) / times_injected
- **Usage Frequency**: times_injected in the last 30 days
- **Confidence Distribution**: histogram of injection_scores
- **False Positive Rate**: (times_injected AND failure_occurred) / times_injected

**Aggregate metrics:**
- **Overall Memory Injection ROI**: sum(successful_injections) / sum(total_injections)
- **Patterns Needing Refinement**: patterns with > 30% usage but < 40% success rate (candidates for root cause re-analysis)
- **Trending**: success rate over 7-day and 30-day windows; alert if trending down
- **Pattern Lifecycle**: which patterns are becoming obsolete (< 1% usage in 90 days)?
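The per-pattern dashboard numbers above reduce to simple ratios over the recorded outcomes. A sketch (field names follow the outcome-tracking JSON; the `pattern_metrics` function name is invented here):

```python
def pattern_metrics(outcomes: list) -> dict:
    """Success and false-positive rates for one pattern's injection outcomes."""
    injected = [o for o in outcomes if o["pattern_injected"]]
    if not injected:
        # No injections yet: rates are undefined, not zero
        return {"times_injected": 0, "success_rate": None, "false_positive_rate": None}
    prevented = sum(1 for o in injected if o["failure_prevented"])
    failed = sum(1 for o in injected if o["failure_occurred"])
    return {
        "times_injected": len(injected),
        "success_rate": prevented / len(injected),
        "false_positive_rate": failed / len(injected),
    }
```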
## Integration Points

1. **sw-memory.sh**
   - Call `memory_get_patterns()` to retrieve all patterns with timestamps, failure_type, root_cause
   - Call `memory_add_outcome_tracking()` to record the success/failure outcome

2. **sw-intelligence.sh**
   - Integrate pattern scoring into the `intake` stage
   - Score the issue at pipeline spawn time (before the plan stage)
   - Return the top 3 matching patterns sorted by score

3. **Pipeline prompt composition**
   - Add a `memory_pattern_context` section to the prompt if score > threshold
   - Include the confidence score so the agent is aware this is a suggestion, not a fact

4. **Pipeline state tracking**
   - Add a `memory_patterns` section to pipeline-state.md with injected pattern details
   - Track injection_score and outcome_recorded=true/false

5. **Loop iteration context**
   - If the issue re-runs in the build loop, re-score with the new error context
   - Emerging error signatures may match different patterns on retry

## Testing Strategy

**Unit tests:**
- Similarity scoring against known issue pairs with ground truth
- Threshold boundary behavior (59, 60, 61)
- Weight adjustment: verify 0.4 + 0.35 + 0.25 = 1.0

**Edge cases:**
- Empty pattern database → score undefined, no injection
- Identical issues with different outcomes → verify both outcomes are tracked
- Pattern with a malformed error_signature → graceful fallback
- Very high similarity (> 95%) → verify no over-confidence

**Integration tests:**
- Inject a pattern, verify it appears in the pipeline prompt
- Run a build, record the outcome, verify outcome_tracking fires
- Query the dashboard, verify metrics match recorded outcomes

**Effectiveness validation:**
- Mock a failure type, populate the patterns database, run scoring
- Verify the pattern was injected at the expected score
- Mock the outcome (failure_prevented=true/false), verify metrics compute correctly
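The threshold boundary checks might look like this (assuming a `should_inject` helper, a name invented here for illustration):

```python
def should_inject(score: float, threshold: float = 60) -> bool:
    """Inject only when the similarity score reaches the configured threshold."""
    return score >= threshold

# Boundary behavior at 59 / 60 / 61
assert should_inject(59) is False
assert should_inject(60) is True
assert should_inject(61) is True

# Weights must sum to 1.0 so the composite stays on the 0-100 scale
weights = {"title_similarity": 0.4, "file_overlap": 0.35, "error_signature": 0.25}
assert abs(sum(weights.values()) - 1.0) < 1e-9
```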
## Configuration Example

```json
{
  "memory_pattern_matching": {
    "enabled": true,
    "similarity_threshold": 60,
    "weights": {
      "title_similarity": 0.4,
      "file_overlap": 0.35,
      "error_signature": 0.25
    },
    "confidence_tiers": {
      "high": 80,
      "medium": 60,
      "low": 30
    },
    "max_patterns_to_inject": 3,
    "metrics_retention_days": 90,
    "anomaly_detection_enabled": true
  }
}
```

## Risk Mitigation

**Risk 1: False positive injection**
- Monitoring: alert if false_positive_rate > 15%
- Mitigation: raise the similarity threshold, disable injection for specific pattern types, or retire the pattern

**Risk 2: Outcome attribution confusion**
- Always show confidence_in_prevention as a float (0.0-1.0), never binary
- Document that "prevented" is inferred, not measured
- Quarterly review of patterns with low confidence

**Risk 3: Circular reasoning**
- Patterns must capture the ROOT CAUSE, not just the "solution"
- Red flag: if a pattern's root_cause is identical to another pattern's → merge them
- Quarterly audit of pattern root_cause quality

**Risk 4: Performance at scale**
- Scoring 100+ patterns should take < 500ms
- Use cached similarity scores where possible
- Parallelize scoring if the pattern database grows beyond 500 entries

**Risk 5: Stale patterns**
- Patterns from > 180 days ago with < 5 uses → mark for review
- The dashboard should surface "patterns never injected" for root cause analysis
@@ -0,0 +1,65 @@
## Test Parallelization Detection & Coordination

### Problem
Test parallelization is dangerous: undetected shared state (temp files, global state, database connections) causes race conditions and flaky failures. This skill provides a systematic approach to detecting parallelizable test suites and coordinating their execution safely.

### Shared State Detection Heuristics

**Static Analysis (file scanning):**
- Scan test file imports for singleton patterns (db connections, file handles, global state modules)
- Detect hardcoded file paths (temp dirs) and network ports — tests using fixed resources conflict
- Check for `beforeAll`/`afterAll` hooks that modify global state
- Identify test files importing shared fixtures/setup modules

**Dynamic Analysis (test execution):**
- Run the test suite with `--detectOpenHandles` (Node.js) or equivalent to catch file/port leaks
- Track temp directory usage per test file — any overlap = unsafe to parallelize
- Monitor for test isolation violations (tests passing in isolation but failing when run together)

**Safety Levels:**
- **Green (parallelizable)**: No shared state detected, no fixture conflicts, passes isolation tests
- **Yellow (conditional)**: Shared fixtures but isolated datasets; parallel execution with coordination (e.g., separate DB schemas)
- **Red (sequential)**: Database transaction rollback, process spawning, hardware resource contention — must run serially
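A rough sketch of the static pass (the regex heuristics here are illustrative assumptions, not a complete detector):

```python
import re

# Red flags for shared state: fixed ports, hardcoded temp paths, suite-level hooks
RED_FLAGS = [
    re.compile(r"localhost:\d{4,5}"),         # hardcoded network port
    re.compile(r"/tmp/[\w.-]+"),              # fixed temp path instead of mkdtemp
    re.compile(r"\b(beforeAll|afterAll)\b"),  # suite-level global setup/teardown
]

def classify_test_source(source: str) -> str:
    """'red' (run serially) if any heuristic fires, else 'green' (parallelizable)."""
    return "red" if any(p.search(source) for p in RED_FLAGS) else "green"
```

A "yellow" verdict would need fixture analysis beyond these single-file checks.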
### Affected-Test Detection via Git Diff

**Module Dependency Tracking:**
1. Build a module-to-test mapping (which tests exercise which modules)
2. On each commit, run `git diff --name-only HEAD~1` to identify changed modules
3. Find all tests that import/test those modules
4. Prioritize affected tests first in the execution order (fail-fast on functionality regression)
5. Cache the mapping per commit to avoid re-scanning on retries

**False Negatives to Handle:**
- Integration tests that cross module boundaries (require broader analysis)
- Tests that exercise shared utilities or base classes (conservative: mark as affected if any parent module changed)
- Dynamic imports and string-based test discovery (fallback: scan test code for patterns)
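Steps 1-5 above can be sketched as follows (the module-to-test mapping is a stand-in for one built by real import analysis):

```python
import subprocess

def affected_tests(module_to_tests: dict, changed_files=None) -> list:
    """Tests to prioritize, given files changed since the previous commit."""
    if changed_files is None:
        out = subprocess.run(["git", "diff", "--name-only", "HEAD~1"],
                             capture_output=True, text=True, check=True).stdout
        changed_files = out.splitlines()
    hit = set()
    for path in changed_files:
        hit.update(module_to_tests.get(path, ()))  # unknown files map to no tests
    return sorted(hit)
```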
### Parallel Execution Coordination

**Scheduler:**
- Detect CPU core count, default to `cores - 1` (reserve 1 for the OS)
- Group parallelizable tests into batches, run batches in parallel
- Within each batch, respect test file order (some test runners depend on execution order)
- Run non-parallelizable (red) tests serially, either before or after the parallel batches (configurable)

**Fast-Fail Policy:**
- Critical failures: assertion errors, uncaught exceptions → abort immediately
- Flaky failures: timeout, process exit, known-flaky markers → retry up to N times before aborting
- Aggregate results across parallel workers before reporting
- Time tracking: measure wall-clock time for each batch, report parallelization efficiency (theoretical vs actual speedup)
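The scheduler bullets might combine roughly like this (`run_test` is a placeholder for the real runner invocation):

```python
import os
from concurrent.futures import ThreadPoolExecutor

def run_suite(green: list, red: list, run_test) -> dict:
    """Fan green tests out over cores-1 workers; run red tests serially after."""
    workers = max(1, (os.cpu_count() or 2) - 1)  # reserve one core for the OS
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so names and outcomes stay aligned
        for name, ok in zip(green, pool.map(run_test, green)):
            results[name] = ok
    for name in red:  # sequential: these tests touch shared state
        results[name] = run_test(name)
    return results
```

A production version would wrap `run_test` with the retry-on-flaky and abort-on-critical policies described above.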
### Dashboard Integration

- Display a parallel execution summary: N tests in M workers, X% speedup
- Visualize the test dependency graph (which tests block which)
- Alert on shared-state violations (test passed alone, failed in parallel)
- Trend: parallelization efficiency over time (detect regressions where new tests add serial bottlenecks)

### Key Decisions for This Issue

1. **Minimum Parallelization Threshold**: What's the smallest safe granularity? (per file, per suite, per test?)
2. **Flaky Detection**: How many retries before marking a failure as critical? (recommend 3)
3. **Shared-State Confidence**: Are heuristics sufficient, or should parallelization require explicit opt-in per test file?
4. **Fast-Fail Behavior**: Abort globally on the first critical failure, or let all workers finish for more complete feedback per iteration?
5. **Fallback**: If parallelization detection is uncertain, run serially — safety over speed.
@@ -0,0 +1,79 @@
## Observability: Watch the Deploy Like a Hawk

Post-deploy monitoring catches what tests miss. Real traffic reveals real problems.

### What to Monitor (by Priority)

**P0 — Immediate (first 5 minutes):**
- Error rate: any increase over baseline?
- Health check: still returning 200?
- Latency: p50/p95/p99 within normal range?
- Memory/CPU: any sudden spikes?

**P1 — Short-term (5-30 minutes):**
- Business metrics: are users completing key flows?
- Queue depths: are background jobs processing normally?
- Connection pools: any exhaustion or leak patterns?
- Disk usage: any unexpected growth?

**P2 — Medium-term (1-24 hours):**
- Memory trends: gradual leak over time?
- Error rate trends: slowly increasing?
- User-reported issues: any new support tickets?
- Performance degradation under sustained load?

### Anomaly Detection Patterns
- **Spike detection**: >2x baseline error rate in any 1-minute window
- **Trend detection**: steadily increasing error rate over a 5-minute window
- **Absence detection**: expected periodic events stop occurring
- **Latency shift**: p95 latency increases >50% from baseline
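These four rules are cheap to implement. A sketch with the stated thresholds (2x baseline errors, 50% p95 shift); function names are invented for illustration:

```python
def is_error_spike(errors_per_min: float, baseline: float) -> bool:
    return errors_per_min > 2 * baseline  # spike: >2x baseline in a 1-minute window

def is_latency_shift(p95_now: float, p95_baseline: float) -> bool:
    return p95_now > 1.5 * p95_baseline   # shift: p95 up more than 50%

def is_upward_trend(samples: list) -> bool:
    """Strictly increasing values, e.g. five one-minute error-rate samples."""
    return all(b > a for a, b in zip(samples, samples[1:]))
```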
### Log Analysis
- Search for new ERROR/FATAL/PANIC entries not present before the deploy
- Check for stack traces — they indicate unhandled exceptions
- Look for retry storms — repeated failed attempts at the same operation
- Monitor for resource exhaustion messages (OOM, connection refused, disk full)

### Auto-Rollback Triggers
Automatically roll back if ANY of these occur:
- Health check fails 3 consecutive times
- Error rate exceeds threshold for 2+ minutes
- A critical service dependency becomes unreachable
- Memory usage exceeds 90% of limit
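Expressed as a single decision function (the field names on `health` are assumptions for illustration):

```python
def should_rollback(health: dict) -> bool:
    """True if any auto-rollback trigger above has fired."""
    return (
        health["consecutive_health_check_failures"] >= 3
        or health["error_rate_breach_minutes"] >= 2
        or not health["critical_deps_reachable"]
        or health["memory_used_fraction"] > 0.90
    )
```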
|
|
43
|
+
|
|
44
|
+
### Monitoring by Issue Type
|
|
45
|
+
|
|
46
|
+
**Frontend changes:**
|
|
47
|
+
- JavaScript error rates in browser (if client-side monitoring exists)
|
|
48
|
+
- Asset load failures (404s on new bundles)
|
|
49
|
+
- Core Web Vitals regression (LCP, FID, CLS)
|
|
50
|
+
|
|
51
|
+
**API changes:**
|
|
52
|
+
- Response status code distribution (2xx vs 4xx vs 5xx)
|
|
53
|
+
- Request throughput — drops indicate client-side breakage
|
|
54
|
+
- Authentication failures — spikes indicate auth regression
|
|
55
|
+
|
|
56
|
+
**Database changes:**
|
|
57
|
+
- Query latency per endpoint
|
|
58
|
+
- Connection pool utilization
|
|
59
|
+
- Slow query log entries
|
|
60
|
+
- Replication lag (if applicable)
|
|
61
|
+
|
|
62
|
+
### Incident Escalation
|
|
63
|
+
If monitoring detects issues:
|
|
64
|
+
1. Execute rollback (if auto-rollback enabled)
|
|
65
|
+
2. Create incident issue with monitoring data
|
|
66
|
+
3. Attach relevant logs and metrics
|
|
67
|
+
4. Tag the original issue with `incident` label
|
|
68
|
+
5. Do NOT silence alerts — let them fire
|
|
69
|
+
|
|
70
|
+
### Required Output (Mandatory)
|
|
71
|
+
|
|
72
|
+
Your output MUST include these sections when this skill is active:
|
|
73
|
+
|
|
74
|
+
1. **Monitoring Checklist**: P0/P1/P2 metrics to watch (error rate, latency, memory, health checks) with specific thresholds
|
|
75
|
+
2. **Anomaly Detection Triggers**: Explicit conditions that trigger alerts (spike detection >2x, trend detection over 5min, absence detection, latency shift >50%)
|
|
76
|
+
3. **Log Analysis**: Search strategy for new ERROR/FATAL entries, stack traces, retry storms, resource exhaustion patterns
|
|
77
|
+
4. **Auto-Rollback Decision Criteria**: Conditions that trigger automatic rollback (health check failures, error rate threshold, critical dependency unreachable, memory exhaustion)
|
|
78
|
+
|
|
79
|
+
If any section is not applicable, explicitly state why it's skipped.
|
|
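
The spike and latency-shift triggers above reduce to simple comparisons against a pre-deploy baseline. A minimal sketch; `MetricSample` and the metric plumbing are illustrative inventions, only the 2x and 50% thresholds come from the checklist:

```typescript
// Minimal anomaly triggers: spike detection (>2x baseline error rate) and
// latency shift (>50% above baseline p95). Metric collection is assumed.
interface MetricSample {
  errorRate: number;    // errors per minute
  p95LatencyMs: number; // 95th-percentile latency
}

function isErrorSpike(current: MetricSample, baseline: MetricSample): boolean {
  // Spike detection: current error rate is more than 2x the baseline.
  return current.errorRate > 2 * baseline.errorRate;
}

function isLatencyShift(current: MetricSample, baseline: MetricSample): boolean {
  // Latency shift: p95 latency is more than 50% above the baseline.
  return current.p95LatencyMs > 1.5 * baseline.p95LatencyMs;
}

const baseline: MetricSample = { errorRate: 4, p95LatencyMs: 120 };
console.log(isErrorSpike({ errorRate: 9, p95LatencyMs: 130 }, baseline));   // true
console.log(isLatencyShift({ errorRate: 4, p95LatencyMs: 200 }, baseline)); // true
```

Production systems would compare against a rolling window rather than a single sample, but the decision rule is the same.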
@@ -0,0 +1,48 @@
## Performance Expertise

Apply these optimization patterns:

### Profiling First
- Measure before optimizing — identify the actual bottleneck
- Use profiling tools appropriate to the language/runtime
- Focus on the critical path — optimize what users experience

### Caching Strategy
- Cache expensive computations and repeated queries
- Set appropriate TTLs — stale data vs freshness trade-off
- Invalidate caches on write operations
- Use cache layers: in-memory (L1) → distributed (L2) → database (L3)
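
The TTL and write-invalidation bullets can be sketched as a small in-memory L1 cache. A generic illustration, not an API from this package; the injectable clock is there to keep the sketch deterministic and testable:

```typescript
// Minimal in-memory (L1) cache with TTL expiry and explicit invalidation
// on writes, illustrating the caching bullets above.
class TtlCache<V> {
  private store = new Map<string, { value: V; expiresAt: number }>();
  constructor(private ttlMs: number, private now: () => number = Date.now) {}

  get(key: string): V | undefined {
    const entry = this.store.get(key);
    if (!entry) return undefined;
    if (this.now() >= entry.expiresAt) {
      this.store.delete(key); // expired entry: evict lazily on read
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V): void {
    this.store.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }

  invalidate(key: string): void {
    // Call on write operations so readers never see stale data.
    this.store.delete(key);
  }
}

let clock = 0;
const cache = new TtlCache<string>(1000, () => clock);
cache.set("user:1", "Ada");
console.log(cache.get("user:1")); // "Ada"
clock = 1000;                     // TTL elapsed
console.log(cache.get("user:1")); // undefined
```

A distributed L2 layer (e.g. a shared key-value store) would sit behind this with the same get/set/invalidate shape.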

### Database Performance
- Add indexes for frequently queried columns (check EXPLAIN plans)
- Avoid N+1 queries — use batch loading or JOINs
- Use connection pooling
- Consider read replicas for read-heavy workloads
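
The N+1 pattern is easiest to see as the batched alternative. A sketch with an in-memory stand-in for the database; `fetchAuthorsByIds` and the entity shapes are hypothetical:

```typescript
// N+1 shape: one query for the posts, then one more query PER post for its
// author. Batched shape: collect the author ids and make ONE "WHERE id IN"
// query. The array below stands in for a real authors table.
interface Post { id: number; authorId: number }
interface Author { id: number; name: string }

const authorTable: Author[] = [{ id: 1, name: "Ada" }, { id: 2, name: "Grace" }];

// Stand-in for: SELECT * FROM authors WHERE id IN (...) -- one round trip
function fetchAuthorsByIds(ids: number[]): Author[] {
  return authorTable.filter(a => ids.includes(a.id));
}

function authorsForPosts(posts: Post[]): Map<number, Author> {
  const ids = [...new Set(posts.map(p => p.authorId))]; // de-duplicate ids
  return new Map(fetchAuthorsByIds(ids).map(a => [a.id, a] as [number, Author]));
}

const posts: Post[] = [
  { id: 10, authorId: 1 }, { id: 11, authorId: 2 }, { id: 12, authorId: 1 },
];
const byId = authorsForPosts(posts); // one lookup query instead of three
console.log(byId.get(1)?.name); // "Ada"
```

In real code the fetches would be async DB calls; the structural point is that the number of queries stays constant as the post count grows.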

### Algorithm Complexity
- Prefer O(n log n) over O(n²) for sorting/searching
- Use appropriate data structures (hash maps for lookups, trees for ranges)
- Avoid unnecessary allocations in hot paths
- Pre-compute values that are used repeatedly

### Network Optimization
- Minimize round trips — batch API calls where possible
- Use compression for large payloads
- Implement pagination — never return unbounded result sets
- Use CDNs for static assets

### Benchmarking
- Include before/after benchmarks for performance changes
- Test with realistic data volumes (not just unit test fixtures)
- Measure p50, p95, p99 latencies — not just averages
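
Percentile latencies can be computed directly from raw samples. A minimal sketch using the nearest-rank method (benchmark harnesses provide this, but the math is small enough to show):

```typescript
// Nearest-rank percentile over raw latency samples: the smallest sample
// such that at least p% of all samples are at or below it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// One slow outlier barely moves p50 but dominates p95 and the average,
// which is why averages alone hide tail latency.
const latenciesMs = [12, 15, 11, 14, 250, 13, 16, 12, 14, 13];
console.log(percentile(latenciesMs, 50)); // 13
console.log(percentile(latenciesMs, 95)); // 250
```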

### Required Output (Mandatory)

Your output MUST include these sections when this skill is active:

1. **Baseline Metrics**: Current performance metrics before optimization (p50/p95/p99 latency, throughput, resource usage)
2. **Optimization Targets**: Specific targets (e.g., "reduce p95 latency from 250ms to <100ms") with rationale
3. **Profiling Strategy**: Tools and methodology to identify bottlenecks (CPU profiler, memory profiler, query analyzer, benchmarks)
4. **Benchmark Plan**: Before/after benchmarks with realistic data volume and success criteria for each optimization

If any section is not applicable, explicitly state why it's skipped.
@@ -0,0 +1,49 @@
## PR Quality: Ship a Reviewable Pull Request

Write a PR that a reviewer can understand in 5 minutes.

### PR Description Structure
1. **What** — One sentence: what does this PR do?
2. **Why** — Link to the issue. Why is this change needed?
3. **How** — Brief technical approach (2-3 sentences max)
4. **Testing** — What was tested? How to verify?
5. **Screenshots** — For UI changes, before/after screenshots

### Commit Hygiene
- Each commit should be a logical unit of work
- Commit messages: imperative mood, 50-char subject, blank line, body explains WHY
- No WIP/fixup/squash commits in the final PR
- No merge commits — rebase onto the base branch
- Separate refactoring commits from feature commits
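
A commit message following the rules above might look like this (hypothetical change; note the imperative subject under 50 characters and a body that explains why, not what):

```text
fix: debounce search-as-you-type requests

The search box fired one API request per keystroke, which
overloaded the backend during load testing. Debouncing keeps
the UI responsive while sharply reducing request volume.
```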

### Diff Quality
- Remove all debugging artifacts (console.log, print statements, commented-out code)
- No unrelated formatting changes mixed with logic changes
- Generated files should be committed separately or excluded
- File renames should be separate commits (so git tracks them)

### Reviewer Empathy
- If the diff is >500 lines, add a "Review guide" section explaining the reading order
- Call out non-obvious decisions with inline comments
- Flag areas where you're least confident and want careful review
- If you changed a pattern used elsewhere, note whether existing code needs updating

### Self-Review Checklist
Before marking as ready:
- [ ] PR description explains what, why, and how
- [ ] All CI checks pass
- [ ] No secrets, credentials, or API keys in the diff
- [ ] No TODO/FIXME comments without issue links
- [ ] Breaking changes documented in the description
- [ ] Migration steps documented if applicable

### Required Output (Mandatory)

Your output MUST include these sections when this skill is active:

1. **PR Description**: What (one sentence), Why (issue link), How (2-3 sentence technical approach), Testing (what was tested)
2. **Commit Hygiene Check**: Verification that each commit is a logical unit — no WIP/fixup/squash commits, no merge commits
3. **Diff Review**: Confirmation that all debugging artifacts are removed (console.log, commented-out code) and no unrelated formatting changes remain
4. **Self-Review Checklist Completion**: All checklist items checked (secrets scanned, CI green, breaking changes documented)

If any section is not applicable, explicitly state why it's skipped.
@@ -0,0 +1,43 @@
## Product Thinking Expertise

Consider the user perspective in your implementation:

### User Stories
- Who is the user for this feature?
- What problem does this solve for them?
- What is their workflow before and after this change?
- Define acceptance criteria from the user's perspective

### User Experience
- What is the simplest interaction that solves the problem?
- How does the user discover this feature?
- What happens when things go wrong? (error states, recovery)
- Is the feature accessible to users with disabilities?

### Edge Cases from User Perspective
- What if the user has no data yet? (empty state)
- What if the user has too much data? (pagination, filtering)
- What if the user makes a mistake? (undo, confirmation)
- What if the user is on a slow connection? (loading states)

### Progressive Disclosure
- Show the most important information first
- Hide complexity behind progressive interactions
- Don't overwhelm with options — provide sensible defaults
- Use contextual help instead of documentation

### Feedback & Communication
- Confirm successful actions immediately
- Explain errors in plain language — not error codes
- Show progress for long-running operations
- Preserve user context across navigation

### Required Output (Mandatory)

Your output MUST include these sections when this skill is active:

1. **User Stories**: In "As a..., I want..., So that..." format, with at least one primary and one secondary user story
2. **Acceptance Criteria**: Given/When/Then format for verifying the feature works from the user's perspective
3. **Edge Cases from User Perspective**: At least 3 specific scenarios (empty state, error state, overload state)

If any section is not applicable, explicitly state why it's skipped.
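
A user story and acceptance criterion in the formats required above might look like this (hypothetical feature, for illustration only):

```text
As a project maintainer,
I want failed pipeline runs surfaced on the dashboard,
So that I can spot regressions without reading raw logs.

Given a pipeline run has failed,
When the maintainer opens the dashboard,
Then the run appears at the top of the activity feed with a failure badge.
```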
@@ -0,0 +1,49 @@
## Security Audit Expertise

Apply OWASP Top 10 and security best practices:

### Injection Prevention
- Use parameterized queries for ALL database access
- Sanitize user input before rendering in HTML/templates
- Validate and sanitize file paths — prevent directory traversal
- Never execute user-supplied strings as code or commands
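
For the sanitize-before-rendering bullet, prefer your template engine's auto-escaping; as a minimal illustration of what that escaping does to untrusted text:

```typescript
// Minimal HTML escaping for untrusted text interpolated into HTML.
// This shows the idea; use your framework's escaping in real code.
function escapeHtml(untrusted: string): string {
  return untrusted
    .replace(/&/g, "&amp;")  // must run first, or it re-escapes the rest
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

console.log(escapeHtml('<script>alert("xss")</script>'));
// &lt;script&gt;alert(&quot;xss&quot;)&lt;/script&gt;
```

Escaping is context-specific: attribute values, URLs, and inline scripts each need their own rules, which is another reason to lean on the template engine.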

### Authentication
- Hash passwords with bcrypt/argon2 (never MD5/SHA1)
- Implement account lockout after failed attempts
- Use secure session management (HttpOnly, Secure, SameSite cookies)
- Require re-authentication for sensitive operations
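
bcrypt and argon2 come from third-party packages; Node's built-in scrypt is a reasonable stand-in to illustrate the two essentials, a per-user random salt and a constant-time comparison. A sketch with illustrative parameters:

```typescript
// Password hashing with a random per-user salt and constant-time verify,
// using Node's built-in scrypt from node:crypto.
import { randomBytes, scryptSync, timingSafeEqual } from "node:crypto";

function hashPassword(password: string): string {
  const salt = randomBytes(16); // unique per user: defeats rainbow tables
  const hash = scryptSync(password, salt, 64);
  return `${salt.toString("hex")}:${hash.toString("hex")}`;
}

function verifyPassword(password: string, stored: string): boolean {
  const [saltHex, hashHex] = stored.split(":");
  const candidate = scryptSync(password, Buffer.from(saltHex, "hex"), 64);
  // timingSafeEqual avoids leaking how many leading bytes matched.
  return timingSafeEqual(candidate, Buffer.from(hashHex, "hex"));
}

const stored = hashPassword("correct horse battery staple");
console.log(verifyPassword("correct horse battery staple", stored)); // true
console.log(verifyPassword("guess", stored));                        // false
```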

### Authorization
- Check permissions server-side on EVERY request
- Use deny-by-default — explicitly grant access
- Verify resource ownership (users can only access their own data)
- Log authorization failures for monitoring

### Data Protection
- Never log sensitive data (passwords, tokens, PII)
- Encrypt sensitive data at rest
- Use HTTPS for all communications
- Set appropriate CORS headers — never use a wildcard in production

### Secrets Management
- Never hardcode secrets in source code
- Use environment variables or secret managers
- Rotate secrets regularly
- Check for accidentally committed secrets (API keys, passwords, tokens)

### Dependency Security
- Check for known vulnerabilities in dependencies
- Pin dependency versions to prevent supply chain attacks
- Review new dependencies before adding them

### Required Output (Mandatory)

Your output MUST include these sections when this skill is active:

1. **Threat Model (STRIDE)**: Identify threats across Spoofing, Tampering, Repudiation, Information Disclosure, Denial of Service, Elevation of Privilege
2. **Auth Flow**: Step-by-step diagram of the authentication/authorization flow with session/token handling
3. **Input Validation Points**: List every place where user input enters the system and how each is validated/sanitized
4. **Security Checklist**: Items verified (no secrets in code, secrets rotated, HTTPS enforced, CORS configured, rate limiting applied)

If any section is not applicable, explicitly state why it's skipped.
@@ -0,0 +1,40 @@
## Systematic Debugging: Root Cause Analysis

A previous attempt at this stage FAILED. Do NOT blindly retry the same approach. Follow this 4-phase investigation:

### Phase 1: Evidence Collection
- Read the error output from the previous attempt carefully
- Identify the EXACT line/file where the failure occurred
- Check whether the error is a symptom or the root cause
- Look for patterns: is this a known error type?

### Phase 2: Hypothesis Formation
- List 3 possible root causes for this failure
- For each hypothesis, identify what evidence would confirm or deny it
- Rank hypotheses by likelihood
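
The hypothesis list can be captured in a small table before testing anything. A hypothetical example for a failing build stage (the scenarios are invented for illustration):

```text
| # | Hypothesis                          | Confirming evidence               | Likelihood |
|---|-------------------------------------|-----------------------------------|------------|
| 1 | New dependency broke the build      | Error references a missing module | High       |
| 2 | Lockfile drifted from the manifest  | Install step warns about drift    | Medium     |
| 3 | CI cache served outdated artifacts  | Passes locally with a clean cache | Low        |
```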

### Phase 3: Root Cause Verification
- Test the most likely hypothesis first
- Read the relevant source code — don't guess
- Check whether previous artifacts (plan.md, design.md) are correct or flawed
- If the plan was correct but execution failed, focus on execution
- If the plan was flawed, document what was wrong

### Phase 4: Targeted Fix
- Fix the ROOT CAUSE, not the symptom
- If the previous approach was fundamentally wrong, choose a different approach
- If it was a minor error, make the minimal fix
- Document what went wrong and why the new approach is better

IMPORTANT: If you find existing artifacts from a successful previous stage, USE them — don't regenerate from scratch.

### Required Output (Mandatory)

Your output MUST include these sections when this skill is active:

1. **Root Cause Hypothesis**: List 3 possible root causes ranked by likelihood, with specific evidence that would confirm/deny each
2. **Evidence Gathered**: Exact file:line location of the failure, error messages, logs, code examination results, artifact validation (plan.md, design.md correctness)
3. **Fix Strategy**: Description of the ROOT CAUSE fix (not the symptom), with rationale for why this approach differs from the previous failed attempt
4. **Verification Plan**: How to verify the fix works (test cases, specific checks, expected behavior confirmation)

If any section is not applicable, explicitly state why it's skipped.
@@ -0,0 +1,47 @@
## Testing Strategy Expertise

Apply these testing patterns:

### Test Pyramid
- **Unit tests** (70%): Test individual functions/methods in isolation
- **Integration tests** (20%): Test component interactions and boundaries
- **E2E tests** (10%): Test critical user flows end-to-end

### What to Test
- Happy path: the expected successful flow
- Error cases: what happens when things go wrong?
- Edge cases: empty inputs, maximum values, concurrent access
- Boundary conditions: off-by-one, empty collections, null/undefined

### Test Quality
- Each test should verify ONE behavior
- Test names should describe the expected behavior, not the implementation
- Tests should be independent — no shared mutable state between tests
- Tests should be deterministic — same result every run

### Coverage Strategy
- Aim for meaningful coverage, not 100% line coverage
- Focus coverage on business logic and error handling
- Don't test framework code or simple getters/setters
- Cover the branches, not just the lines

### Mocking Guidelines
- Mock external dependencies (APIs, databases, file system)
- Don't mock the code under test
- Use realistic test data — edge cases reveal bugs
- Verify mock interactions when the side effect IS the behavior

### Regression Testing
- Write a failing test FIRST that reproduces the bug
- Then fix the bug and verify the test passes
- Keep regression tests — they prevent the bug from recurring
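
The regression flow above in miniature; `pageCount` is a hypothetical helper with an invented off-by-one bug report:

```typescript
// Bug report: "101 items at 10 per page shows 10 pages; the last item is
// unreachable." The buggy version used Math.floor.
function pageCount(totalItems: number, pageSize: number): number {
  return Math.ceil(totalItems / pageSize); // fix: a partial page still counts
}

// Step 1: this test was written FIRST and failed against the floor version.
// Step 2: after the fix it passes, and it stays in the suite so the bug
// cannot quietly return.
function testLastPartialPageIsCounted(): void {
  const got = pageCount(101, 10);
  if (got !== 11) throw new Error(`expected 11 pages, got ${got}`);
}
testLastPartialPageIsCounted();
console.log("regression test passed");
```

Note the test name describes the behavior ("last partial page is counted"), not the implementation detail that changed.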

### Required Output (Mandatory)

Your output MUST include these sections when this skill is active:

1. **Test Pyramid Breakdown**: Explicit count of unit/integration/E2E tests and their coverage targets (e.g., "70 unit tests covering business logic, 12 integration tests for API boundaries, 3 E2E tests for critical paths")
2. **Coverage Targets**: Target coverage percentage per layer and which critical paths MUST be tested
3. **Critical Paths to Test**: Specific test cases for the happy path, 2+ error cases, and 2+ edge cases

If any section is not applicable, explicitly state why it's skipped.
@@ -0,0 +1,52 @@
## Two-Stage Code Review

This review runs in two passes. Complete Pass 1 fully before starting Pass 2.

### Pass 1: Spec Compliance

Compare the implementation against the plan and issue requirements:

1. **Task Checklist**: Does the code implement every task from plan.md?
2. **Files Modified**: Were all planned files actually modified?
3. **Requirements Coverage**: Does the implementation satisfy every requirement from the issue?
4. **Missing Features**: Is anything from the plan NOT implemented?
5. **Scope Creep**: Was anything added that WASN'T in the plan?

For each gap found:
- **[SPEC-GAP]** description — what was planned vs what was implemented

If all requirements are met, write: "Spec compliance: PASS — all planned tasks implemented."

---

### Pass 2: Code Quality

Now review the code for engineering quality:

1. **Logic bugs** — incorrect conditions, off-by-one errors, null handling
2. **Security** — injection, XSS, auth bypass, secret exposure
3. **Error handling** — missing catch blocks, silent failures, unclear error messages
4. **Performance** — unnecessary loops, missing indexes, N+1 queries
5. **Naming and clarity** — confusing names, missing context, magic numbers
6. **Test coverage** — are new code paths tested? Edge cases covered?

For each issue found, use the format:
- **[SEVERITY]** file:line — description

Severity levels: Critical, Bug, Security, Warning, Suggestion
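
Hypothetical findings in this format (files, line numbers, and issues are invented for illustration):

```text
- **[Security]** routes/auth.ts:27 — login error message reveals whether the username exists
- **[Warning]** services/db.ts:85 — query result is not checked for null before property access
- **[Suggestion]** core/router.ts:14 — extract the magic number 300 into a named constant
```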

### Required Output (Mandatory)

Your output MUST include these sections for EACH review pass:

**Pass 1 Output:**
- **Spec Compliance Verdict**: PASS or FAIL, with explicit gaps found (if any)
- **Unimplemented Tasks**: List any planned tasks NOT in the code
- **Unplanned Code**: List any code added that was NOT in the plan

**Pass 2 Output:**
- **Code Quality Verdict**: PASS, or list all findings by severity
- **Critical/Security Issues Found**: Explicit count with file:line references
- **Suggested Improvements**: Optional suggestions that don't block a PASS

If any finding is not applicable, explicitly state why it's skipped.