@massu/core 0.5.0 → 0.6.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (119)
  1. package/README.md +40 -0
  2. package/agents/massu-architecture-reviewer.md +104 -0
  3. package/agents/massu-blast-radius-analyzer.md +84 -0
  4. package/agents/massu-competitive-scorer.md +126 -0
  5. package/agents/massu-help-sync.md +73 -0
  6. package/agents/massu-migration-writer.md +94 -0
  7. package/agents/massu-output-scorer.md +87 -0
  8. package/agents/massu-pattern-reviewer.md +84 -0
  9. package/agents/massu-plan-auditor.md +170 -0
  10. package/agents/massu-schema-sync-verifier.md +70 -0
  11. package/agents/massu-security-reviewer.md +98 -0
  12. package/agents/massu-ux-reviewer.md +106 -0
  13. package/commands/_shared-preamble.md +53 -23
  14. package/commands/_shared-references/auto-learning-protocol.md +71 -0
  15. package/commands/_shared-references/blast-radius-protocol.md +76 -0
  16. package/commands/_shared-references/security-pre-screen.md +64 -0
  17. package/commands/_shared-references/test-first-protocol.md +87 -0
  18. package/commands/_shared-references/verification-table.md +55 -0
  19. package/commands/massu-article-review.md +343 -0
  20. package/commands/massu-autoresearch/references/eval-runner.md +84 -0
  21. package/commands/massu-autoresearch/references/safety-rails.md +125 -0
  22. package/commands/massu-autoresearch/references/scoring-protocol.md +151 -0
  23. package/commands/massu-autoresearch.md +258 -0
  24. package/commands/massu-batch.md +44 -12
  25. package/commands/massu-bearings.md +42 -8
  26. package/commands/massu-checkpoint.md +588 -0
  27. package/commands/massu-ci-fix.md +2 -2
  28. package/commands/massu-command-health.md +132 -0
  29. package/commands/massu-command-improve.md +232 -0
  30. package/commands/massu-commit.md +205 -44
  31. package/commands/massu-create-plan.md +239 -57
  32. package/commands/massu-data/references/common-queries.md +79 -0
  33. package/commands/massu-data/references/table-guide.md +50 -0
  34. package/commands/massu-data.md +66 -0
  35. package/commands/massu-dead-code.md +29 -34
  36. package/commands/massu-debug/references/auto-learning.md +61 -0
  37. package/commands/massu-debug/references/codegraph-tracing.md +80 -0
  38. package/commands/massu-debug/references/common-shortcuts.md +98 -0
  39. package/commands/massu-debug/references/investigation-phases.md +294 -0
  40. package/commands/massu-debug/references/report-format.md +107 -0
  41. package/commands/massu-debug.md +105 -386
  42. package/commands/massu-docs.md +1 -1
  43. package/commands/massu-full-audit.md +61 -0
  44. package/commands/massu-gap-enhancement-analyzer.md +276 -16
  45. package/commands/massu-golden-path/references/approval-points.md +216 -0
  46. package/commands/massu-golden-path/references/competitive-mode.md +273 -0
  47. package/commands/massu-golden-path/references/error-handling.md +121 -0
  48. package/commands/massu-golden-path/references/phase-0-requirements.md +53 -0
  49. package/commands/massu-golden-path/references/phase-1-plan-creation.md +168 -0
  50. package/commands/massu-golden-path/references/phase-2-implementation.md +403 -0
  51. package/commands/massu-golden-path/references/phase-2.5-gap-analyzer.md +170 -0
  52. package/commands/massu-golden-path/references/phase-3-simplify.md +40 -0
  53. package/commands/massu-golden-path/references/phase-3.5-security-audit.md +108 -0
  54. package/commands/massu-golden-path/references/phase-4-commit.md +94 -0
  55. package/commands/massu-golden-path/references/phase-5-push.md +116 -0
  56. package/commands/massu-golden-path/references/phase-5.5-production-verify.md +170 -0
  57. package/commands/massu-golden-path/references/phase-6-completion.md +113 -0
  58. package/commands/massu-golden-path/references/qa-evaluator-spec.md +137 -0
  59. package/commands/massu-golden-path/references/sprint-contract-protocol.md +117 -0
  60. package/commands/massu-golden-path/references/vr-visual-calibration.md +73 -0
  61. package/commands/massu-golden-path.md +121 -844
  62. package/commands/massu-guide.md +72 -69
  63. package/commands/massu-hooks.md +27 -12
  64. package/commands/massu-hotfix.md +221 -144
  65. package/commands/massu-incident.md +49 -20
  66. package/commands/massu-infra-audit.md +187 -0
  67. package/commands/massu-learning-audit.md +211 -0
  68. package/commands/massu-loop/references/auto-learning.md +49 -0
  69. package/commands/massu-loop/references/checkpoint-audit.md +40 -0
  70. package/commands/massu-loop/references/guardrails.md +17 -0
  71. package/commands/massu-loop/references/iteration-structure.md +115 -0
  72. package/commands/massu-loop/references/loop-controller.md +188 -0
  73. package/commands/massu-loop/references/plan-extraction.md +78 -0
  74. package/commands/massu-loop/references/vr-plan-spec.md +140 -0
  75. package/commands/massu-loop-playwright.md +9 -9
  76. package/commands/massu-loop.md +115 -670
  77. package/commands/massu-new-pattern.md +423 -0
  78. package/commands/massu-perf.md +422 -0
  79. package/commands/massu-plan-audit.md +1 -1
  80. package/commands/massu-plan.md +389 -122
  81. package/commands/massu-production-verify.md +433 -0
  82. package/commands/massu-push.md +62 -378
  83. package/commands/massu-recap.md +29 -3
  84. package/commands/massu-rollback.md +613 -0
  85. package/commands/massu-scaffold-hook.md +2 -4
  86. package/commands/massu-scaffold-page.md +2 -3
  87. package/commands/massu-scaffold-router.md +1 -2
  88. package/commands/massu-security.md +619 -0
  89. package/commands/massu-simplify.md +115 -85
  90. package/commands/massu-squirrels.md +2 -2
  91. package/commands/massu-tdd.md +38 -22
  92. package/commands/massu-test.md +3 -3
  93. package/commands/massu-type-mismatch-audit.md +469 -0
  94. package/commands/massu-ui-audit.md +587 -0
  95. package/commands/massu-verify-playwright.md +287 -32
  96. package/commands/massu-verify.md +150 -46
  97. package/dist/cli.js +146 -95
  98. package/package.json +6 -2
  99. package/patterns/build-patterns.md +302 -0
  100. package/patterns/component-patterns.md +246 -0
  101. package/patterns/display-patterns.md +185 -0
  102. package/patterns/form-patterns.md +890 -0
  103. package/patterns/integration-testing-checklist.md +445 -0
  104. package/patterns/security-patterns.md +219 -0
  105. package/patterns/testing-patterns.md +569 -0
  106. package/patterns/tool-routing.md +81 -0
  107. package/patterns/ui-patterns.md +371 -0
  108. package/protocols/plan-implementation.md +267 -0
  109. package/protocols/recovery.md +225 -0
  110. package/protocols/verification.md +404 -0
  111. package/reference/command-taxonomy.md +178 -0
  112. package/reference/cr-rules-reference.md +76 -0
  113. package/reference/hook-execution-order.md +148 -0
  114. package/reference/lessons-learned.md +175 -0
  115. package/reference/patterns-quickref.md +208 -0
  116. package/reference/standards.md +135 -0
  117. package/reference/subagents-reference.md +17 -0
  118. package/reference/vr-verification-reference.md +867 -0
  119. package/src/commands/install-commands.ts +149 -53
@@ -0,0 +1,84 @@ package/commands/massu-autoresearch/references/eval-runner.md
# Eval Runner Protocol

## Purpose

Defines how to spawn a subagent to simulate command execution and score the output against the eval checklist.

## Why Subagent-Based Eval?

Claude Code cannot programmatically invoke `/massu-*` commands. The `claude` CLI accepts prompts but doesn't have a `--skill` flag for direct skill invocation. A subagent with the command text injected as instructions is the most practical simulation.

## Eval Execution Steps

### Step 1: Read inputs

Read three files:

1. **Target command file** (current version): `.claude/commands/[command].md` (or `[command]/[command].md` for folder-based)
2. **Eval checklist** (immutable): `.claude/evals/[command].md`
3. **Fixture input**: `.claude/evals/fixtures/[command]/input-01.md`

### Step 2: Spawn eval subagent

Use the Agent tool with `subagent_type="general-purpose"` and a prompt structured as:

```
You are simulating the execution of a Claude Code command. Your job is to produce the output EXACTLY as the command specifies, including all required sections and formatting.

## Command Instructions

[PASTE FULL COMMAND FILE TEXT HERE]

## Input

[PASTE FIXTURE INPUT TEXT HERE]

## Your Task

Execute the command against the input above. Produce the complete output as if you were running the command for real. Include every section the command specifies. Do not skip optional sections — treat everything as required for this simulation.

Do NOT include any meta-commentary about the simulation. Just produce the command output.
```

### Step 3: Score the output

After the subagent returns its output, score each check from the eval checklist.

For each check in `.claude/evals/[command].md`:

1. Read the check's `pass_condition`
2. Evaluate the subagent output against that condition
3. Record `true` (pass) or `false` (fail)

Scoring rules:
- Checks are binary — no partial credit
- Section presence checks: the section header must exist AND contain substantive content (not just the header)
- Count checks (e.g., "at least 3 action items"): count actual items, not placeholders
- Conditional checks (e.g., "if score >= 40"): evaluate the condition first, then check

### Step 4: Calculate score

```
score = (passing_checks / total_checks) * 100
```

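The Step 3–4 arithmetic can be sketched in Python (a sketch only; the helper name `score_output` is hypothetical, and in practice the main agent performs this scoring in-context):

```python
def score_output(checks: dict[str, bool]) -> float:
    """Binary checks only: score = passing / total * 100."""
    if not checks:
        raise ValueError("eval checklist must contain at least one check")
    return sum(checks.values()) / len(checks) * 100

# Example: 3 of 4 checks pass
results = {
    "has_executive_summary": True,
    "has_score_table": False,
    "has_gap_analysis": True,
    "has_action_items": True,
}
print(score_output(results))  # → 75.0
```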
### Step 5: Return results

Return to the main loop:

1. `score`: percentage (e.g., 75.0)
2. `checks`: object mapping check_id to boolean (e.g., `{"has_executive_summary": true, "has_score_table": false, ...}`)
3. `failing_checks`: list of check_ids that failed
4. `status`: "success" or "crash" (if the subagent errored or its output was unparseable)

## Output Containment

The full subagent output is NOT retained in the main agent context. After scoring:

- Discard the full output text
- Keep only: score, per-check results, failing check names
- This prevents context window exhaustion during long runs (Karpathy insight #4)

## Crash Handling

If the subagent errors, times out, or produces output that cannot be scored, return `status: "crash"` to the main loop. See `safety-rails.md` for the crash retry protocol.
@@ -0,0 +1,125 @@ package/commands/massu-autoresearch/references/safety-rails.md
# Safety Rails

## Protected Line Patterns

Lines containing ANY of these patterns are READ-ONLY. The agent may add content adjacent to them but MUST NOT modify or delete these lines:

- `CR-` (Canonical Rule references)
- `VR-` (Verification Requirement references)
- `NON-NEGOTIABLE`
- `NEVER` — protected when: (a) at start of line, (b) after a bullet/dash (`- NEVER`, `* NEVER`), or (c) in ALL-CAPS imperative context (`NEVER use`, `NEVER edit`, `NEVER import`). NOT protected in lowercase prose (`should never`, `would never`, `can never`).
- `MANDATORY`
- `FORBIDDEN`
- `MUST NOT`

These patterns encode incident-learned safety rules. Modifying them to pass an eval check would optimize the metric while degrading real safety — a Goodhart's Law violation.

### Verification

Before applying any edit, grep the target file for protected patterns in the edit region. If the edit would modify a protected line, reject the edit and try a different approach.

---

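The protected-pattern check can be sketched as a grep-style predicate (a sketch under the rules above; the `NEVER` regex is an approximation of cases (a)–(c) and is not the normative definition):

```python
import re

# Substring patterns that always protect a line
ALWAYS_PROTECTED = ("CR-", "VR-", "NON-NEGOTIABLE", "MANDATORY", "FORBIDDEN", "MUST NOT")

def is_protected(line: str) -> bool:
    """True if the line is read-only under the Safety Rails rules."""
    if any(p in line for p in ALWAYS_PROTECTED):
        return True
    # NEVER: start of line, after a bullet/dash, or ALL-CAPS
    # imperative ("NEVER use"); lowercase prose never matches.
    return re.search(r"(?:^|[-*]\s+)NEVER\b|NEVER\s+(?=[a-z])", line.strip()) is not None
```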
## Backup Protocol

Before the first iteration of any autoresearch run:

1. Create the backups directory if needed: `.claude/metrics/backups/`
2. Copy the target file to: `.claude/metrics/backups/[command]-autoresearch-[YYYY-MM-DD-HHmmss].md.bak`
3. Verify the backup exists and is non-empty

The backup is the ultimate rollback if the entire autoresearch branch needs to be abandoned.

---

## Git Branch Protocol

All autoresearch commits go to a dedicated branch:

1. **Resolve target path**: Determine the actual target file path — `.claude/commands/[command].md` for flat commands, `.claude/commands/[command]/[command].md` for folder-based commands. Store as `TARGET_PATH` and use for ALL subsequent git add/checkout operations.
2. **Create branch**: `git checkout -b autoresearch/[command]-[YYYY-MM-DD]`. If the branch already exists (same command, same day), append a numeric suffix: `autoresearch/[command]-[YYYY-MM-DD]-2`, `-3`, etc.
3. **Commit on KEEP**: `git add [TARGET_PATH] && git commit -m "autoresearch([command]): [1-line summary]"`
4. **Revert on DISCARD**: `git checkout -- [TARGET_PATH]`
5. **After run**: User reviews the branch and decides to merge, cherry-pick, or discard

The branch naming convention allows multiple autoresearch runs on different commands without conflict.

---

## Cost Tracking

Track estimated token usage per iteration:

- **Eval subagent prompt**: ~target file tokens + eval file tokens + fixture tokens
- **Eval subagent response**: ~2x fixture tokens (simulated output)
- **Main agent overhead**: ~500 tokens per iteration (scoring, logging, state)

Rough estimate: each iteration costs ~3x the target file size in tokens.

**Cost cap**: If cumulative estimated token usage exceeds 500K tokens, **stop the loop** and produce the final report. This is a hard backstop regardless of `--no-limit` mode — it is NOT a pause, it is a termination.

---

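The cost arithmetic above can be sketched as (function names hypothetical; all token counts are rough estimates, not measurements):

```python
COST_CAP_TOKENS = 500_000  # hard backstop, even in --no-limit mode

def estimate_iteration_tokens(target_tokens: int, eval_tokens: int, fixture_tokens: int) -> int:
    """Rough per-iteration cost per the bullet formulas above."""
    prompt = target_tokens + eval_tokens + fixture_tokens  # eval subagent prompt
    response = 2 * fixture_tokens                          # simulated output
    overhead = 500                                         # scoring, logging, state
    return prompt + response + overhead

def must_terminate(cumulative_tokens: int) -> bool:
    """Cost cap check: termination, not a pause."""
    return cumulative_tokens >= COST_CAP_TOKENS
```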
## Tristate Crash Handling

When an eval produces a CRASH result:

1. **Log the crash**: Record in `autoresearch-runs.jsonl` with `"action":"crashed"`
2. **Do NOT increment** the consecutive reverts counter
3. **Retry once**: On the retry, use a SIMPLIFIED edit (smaller change, fewer lines modified)
4. **If the retry also crashes**: Log as DISCARD, increment the consecutive reverts counter
5. **Continue the loop** — a single crash does not warrant stopping

Crash accumulation: if `cumulative_crashed >= 5` in a single run, warn that there may be a systemic issue with the eval or fixture.

---

## "Think Harder" Graduated Escalation

When consecutive reverts trigger escalation, the edit strategy changes:

### Level 0 (default)
Normal operation. Target the weakest failing check. Try standard edit types.

### Level 1 (3 consecutive reverts)
**Switch edit type.** If previous attempts used "add rule", try in this order:
1. Add a worked example (before/after demonstration)
2. Restructure the section (reorder for emphasis)
3. Add a banned anti-pattern
4. Promote an existing requirement higher in the file

### Level 2 (5 consecutive reverts)
**Full context reload.** Before the next edit:
1. Re-read the eval checklist from disk (not from memory)
2. Re-read the target file from disk
3. Review the changelog of ALL kept changes so far (from `autoresearch-runs.jsonl`)
4. Identify which failing check has been attempted most without success
5. Look for TWO previous near-miss approaches and try COMBINING them

### Level 3 (8 consecutive reverts)
**Radical restructuring.** Try structural changes, not content changes:
1. Reorder major sections of the target file
2. Merge two related sections into one focused section
3. Split an overloaded section into two specific sections
4. Convert abstract rules into concrete worked examples
5. Add a decision tree or flowchart for complex logic

### BAIL (10 consecutive reverts)
**Stagnation exit.** Log: "STAGNATION: 10 consecutive reverts after Level 3 escalation."
- Revert any pending changes
- Produce the final report
- Exit the loop
- This is the only way CR-37 triggers — after Level 3 has been attempted

---

## Output Containment

Eval subagent output is captured and scored, NOT inlined into the main agent context.

The main agent context receives ONLY:
- Score percentage (e.g., 75.0%)
- Per-check pass/fail results (e.g., `has_executive_summary: true, has_score_table: false`)
- Names of failing checks

This is the Karpathy `> run.log 2>&1` pattern adapted for LLM context management.
@@ -0,0 +1,151 @@ package/commands/massu-autoresearch/references/scoring-protocol.md
# Scoring Protocol

## Score Calculation

```
score = (passing_checks / total_checks) * 100
```

Each eval check is binary: pass (true) or fail (false). No partial credit. No weighting.

---

## Baseline Measurement

Before any edits, establish the baseline:

1. Run the eval subagent 3 times with the same fixture and current command file
2. Score each run independently
3. Take the **median** score as the baseline
4. Log as iteration 0 in `autoresearch-runs.jsonl`

The median (not the mean) accounts for subagent output variability without being skewed by a single outlier run.

**Note**: Only the baseline uses 3 runs. Per-iteration evals use a single run as an intentional cost tradeoff — running 3 evals per iteration would triple token usage. The single-run approach works because the simplicity gate and convergence criterion (3 consecutive passes) provide noise resistance.

---

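The median-of-3 baseline can be sketched as (function name hypothetical):

```python
import statistics

def baseline_score(run_scores: list[float]) -> float:
    """Median of 3 independent eval runs (logged as iteration 0)."""
    assert len(run_scores) == 3, "baseline uses exactly 3 runs"
    return statistics.median(run_scores)

# One outlier run (87.5) does not skew the baseline
print(baseline_score([50.0, 62.5, 87.5]))  # → 62.5
```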
## Accept/Reject Threshold

After each iteration edit:

- `score_after > score_before` → candidate for KEEP (subject to simplicity gate)
- `score_after == score_before` → DISCARD (no improvement = no noise-driven drift) — **exception**: if `net_delta < 0` (deletion), proceed to simplicity gate instead
- `score_after < score_before` → DISCARD (regression)

**Strict greater-than** is required for non-deletion edits. Equal scores are rejected to prevent random-walk drift where the command changes without improving. The one exception is deletion edits (`net_delta < 0`) that maintain score — these proceed to the simplicity gate, where the "deletion bonus" rule applies.

---

## Simplicity Gate

After scoring, before accepting a KEEP candidate:

1. Count `checks_improved`: how many checks flipped from false to true
2. Calculate `net_delta`: lines_added - lines_removed in the edit
3. Apply the simplicity criterion:

| checks_improved | net_delta | Decision | Log action |
|----------------|-----------|----------|------------|
| >= 2 | any | KEEP | `"kept"` |
| == 1 | > 0 | **REJECT** | `"rejected_complexity"` |
| == 1 | <= 0 | KEEP | `"kept"` |
| == 0 | < 0 | KEEP (if score maintained) | `"kept"` (simplification) |
| == 0 | >= 0 | DISCARD | `"discarded"` |

**Deletion bonus**: If `net_delta < 0` (the change removes lines) AND `score_after >= score_before`, ALWAYS KEEP. Karpathy: "Improvement from deleting code = definitely keep."

---

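The decision table and the deletion bonus combine into a single predicate (a sketch; it assumes the accept/reject threshold has already run, and the function name is hypothetical):

```python
def simplicity_gate(checks_improved: int, net_delta: int,
                    score_before: float, score_after: float) -> str:
    """Return the log action for a KEEP candidate per the table above."""
    # Deletion bonus: removing lines while holding score always keeps.
    if net_delta < 0 and score_after >= score_before:
        return "kept"
    if checks_improved >= 2:
        return "kept"
    if checks_improved == 1:
        # One check improved at the cost of added lines: too complex.
        return "rejected_complexity" if net_delta > 0 else "kept"
    return "discarded"  # checks_improved == 0, net_delta >= 0
```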
## Convergence

The loop converges when the score is >= target (default 90%) for **3 consecutive iterations**, where each iteration runs an eval and produces a score. The 3-consecutive requirement applies to iterations that produce a score (KEEP or DISCARD), not CRASH iterations. Three consecutive readings at target rule out a single lucky run.

---

## Score Logging Format

Append one JSONL line per iteration to `.claude/metrics/autoresearch-runs.jsonl`:

```json
{
  "command": "[target command name]",
  "iteration": 1,
  "timestamp": "2026-03-20T10:30:00Z",
  "score_before": 62.5,
  "score_after": 75.0,
  "action": "kept",
  "edit_type": "add_worked_example",
  "edit_summary": "Added before/after example for gap analysis section",
  "checks": {
    "has_executive_summary": true,
    "has_score_table": true,
    "has_massu_comparison": true,
    "has_gap_analysis": false,
    "has_action_items": true,
    "has_source_credibility": true,
    "mentions_specific_commands": true,
    "has_implementation_path": false
  },
  "lines_added": 8,
  "lines_removed": 2,
  "net_delta": 6,
  "cumulative_kept": 1,
  "cumulative_discarded": 0,
  "cumulative_crashed": 0,
  "cumulative_rejected": 0,
  "escalation_level": 0,
  "consecutive_reverts": 0
}
```

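Appending a record can be sketched as (a sketch; in practice the agent writes the line directly, no script is required):

```python
import json

def log_iteration(path: str, record: dict) -> None:
    """Append one compact JSON object per line (JSONL), so the
    metrics file stays grep/jq-friendly across runs."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, separators=(",", ":")) + "\n")
```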
### Field Definitions

| Field | Type | Description |
|-------|------|-------------|
| `command` | string | Target command name (e.g., "massu-article-review") |
| `iteration` | number | 0 = baseline, 1+ = edit iterations |
| `timestamp` | string | ISO 8601 timestamp |
| `score_before` | number\|null | Score percentage before this iteration's edit. `null` for the baseline (iteration 0), since there is no prior score. |
| `score_after` | number | Score percentage after this iteration's edit |
| `action` | string | One of: `"baseline"`, `"kept"`, `"discarded"`, `"crashed"`, `"rejected_complexity"` |
| `edit_type` | string\|null | `null` for baseline/dry-run. Otherwise one of: `"add_rule"`, `"add_example"`, `"add_worked_example"`, `"restructure"`, `"promote"`, `"ban_pattern"`, `"merge_sections"`, `"split_section"`, `"reorder"`, `"simplify"` |
| `edit_summary` | string | One-line human-readable description of the change |
| `checks` | object | Map of check_id to boolean pass/fail |
| `lines_added` | number | Lines added by this edit |
| `lines_removed` | number | Lines removed by this edit |
| `net_delta` | number | `lines_added - lines_removed` |
| `cumulative_kept` | number | Running total of kept changes this run |
| `cumulative_discarded` | number | Running total of discarded changes this run |
| `cumulative_crashed` | number | Running total of crashed evals this run |
| `cumulative_rejected` | number | Running total of complexity-rejected edits this run |
| `escalation_level` | number | Current escalation level (0-3) |
| `consecutive_reverts` | number | Current consecutive revert count (resets on KEEP) |

---

## Final Summary Format

After the loop exits, produce a summary:

```
Score Trajectory:
  [iteration 0] 62.5% (baseline)
  [iteration 1] 75.0% KEPT "Added worked example for gap analysis"
  [iteration 2] 75.0% DISCARD "Tried adding source credibility template"
  [iteration 3] 87.5% KEPT "Promoted source credibility to mandatory section"
  ...

Per-Check Progression:
  Check                       Baseline  Final
  has_executive_summary       PASS      PASS
  has_score_table             PASS      PASS
  has_massu_comparison        PASS      PASS
  has_gap_analysis            FAIL      PASS   <-- improved
  has_action_items            PASS      PASS
  has_source_credibility      FAIL      PASS   <-- improved
  mentions_specific_commands  PASS      PASS
  has_implementation_path     FAIL      FAIL   (hardest check)
```

This summary shows the optimization journey and identifies which checks responded to edits and which are structurally difficult.
@@ -0,0 +1,258 @@ package/commands/massu-autoresearch.md
---
name: massu-autoresearch
description: "When user wants autonomous command optimization, says 'autoresearch', 'optimize this command', 'run overnight improvement loop', or wants Karpathy-style iterative prompt optimization"
allowed-tools: Bash(*), Read(*), Write(*), Edit(*), Grep(*), Glob(*), Agent(*), Task(*)
---

**Shared rules**: Read `.claude/commands/_shared-preamble.md` for the POST-COMPACTION (CR-12), ENTERPRISE-GRADE (CR-14), and AWS SECRETS (CR-5) rules.

# Massu Autoresearch: Autonomous Command Optimization Loop

## Objective

Iteratively improve a target command file by scoring output against an eval checklist, keeping improvements (git commit), reverting failures (git checkout), and repeating until convergence. Based on Karpathy's autoresearch pattern (42K GitHub stars).

---

## ARGUMENTS

```
/massu-autoresearch [command-name]                      # Default: 20 iterations, 90% target
/massu-autoresearch [command-name] --max-iterations N   # Custom iteration cap
/massu-autoresearch [command-name] --target-score N     # Custom target (default 90)
/massu-autoresearch [command-name] --no-limit           # NEVER STOP mode (cost cap only)
```

**Arguments from $ARGUMENTS**: {{ARGUMENTS}}

---

## THREE-FILE ARCHITECTURE

| Role | File | Mutable? |
|------|------|----------|
| **Target** (train.py) | `.claude/commands/[command].md` (or `[command]/[command].md` for folder-based) | YES — agent edits this |
| **Eval** (prepare.py) | `.claude/evals/[command].md` | **NO — IMMUTABLE** |
| **Instructions** (program.md) | This file | **NO — agent follows this** |

The target command is the ONLY file the agent may edit. The eval checklist and these instructions are READ-ONLY.

---

## NON-NEGOTIABLE RULES

1. **ONE change per iteration** — single-variable perturbation only. Never batch multiple edits.
2. **NEVER edit the eval file** — `.claude/evals/[command].md` is immutable during runs.
3. **NEVER edit lines containing CR-*/VR-* references** — these encode incident-learned safety rules. See `references/safety-rails.md` for protected line patterns.
4. **ALWAYS backup before the first iteration** — save the original to `.claude/metrics/backups/[command]-autoresearch-[timestamp].md.bak`
5. **git commit on improvement / git checkout on regression** — tristate handling (see below).
6. **Tristate results**: KEEP / DISCARD / CRASH — crashes do NOT count toward stagnation.
7. **NEVER STOP mode** (when `--no-limit` is passed): "Once the experiment loop has begun, do NOT pause to ask the human if you should continue." Run until convergence, the cost cap (500K tokens), or manual interrupt (Ctrl+C). Default mode uses the `--max-iterations` cap.

---

## TRISTATE RESULTS

Every iteration produces one of three outcomes:

| Outcome | Condition | Action |
|---------|-----------|--------|
| **KEEP** | `score_after > score_before` AND passes simplicity gate | `git add [target] && git commit -m "autoresearch([cmd]): [summary]"` — new baseline |
| **DISCARD** | `score_after <= score_before` OR fails simplicity gate | `git checkout -- [target]` — revert to baseline |
| **CRASH** | Eval subagent errored or produced unparseable output | Log the crash. Do NOT count toward stagnation. Retry once with a simplified edit. If still failing, count as DISCARD. |

Crash handling details: see `references/safety-rails.md`.

---

## SIMPLICITY CRITERION

Apply the simplicity gate per `references/scoring-protocol.md`. Key principle (Karpathy): "Marginal improvement + added complexity = reject. Improvement from deleting code = definitely keep."

---

## "THINK HARDER" GRADUATED ESCALATION

Replaces the CR-37 immediate bail. When stuck, escalate before giving up. The consecutive-revert counter resets to 0 after any KEEP.

| Level | Trigger |
|-------|---------|
| **Level 1** | 3 consecutive reverts |
| **Level 2** | 5 consecutive reverts |
| **Level 3** | 8 consecutive reverts |
| **BAIL** | 10 consecutive reverts |

Per-level strategies: see `references/safety-rails.md`.

---

## CONVERGENCE CRITERIA

The loop stops when ANY of these conditions is met:

| Condition | Default | Overridable? |
|-----------|---------|-------------|
| Score >= target for 3 consecutive iterations | 90% | `--target-score N` |
| Max iterations reached | 20 | `--max-iterations N` or `--no-limit` |
| Stagnation bail (10 consecutive reverts after Level 3) | Always active | Not overridable |
| Cost cap (estimated cumulative token usage) | 500K tokens | Not overridable |

---

## LOOP CONTROLLER

### Initialization (runs once)

1. **Parse arguments**: Extract the command name, max iterations, target score, and no-limit flag
2. **Validate files exist**:
   - Target: `.claude/commands/[command].md` (or `.claude/commands/[command]/[command].md` for folder-based)
   - Eval: `.claude/evals/[command].md`
   - Fixture: `.claude/evals/fixtures/[command]/input-01.md`
3. **Resolve target path**: Determine the actual file — `.claude/commands/[command].md` (flat) or `.claude/commands/[command]/[command].md` (folder-based). Store as `TARGET_PATH` for all git operations.
4. **Create git branch**: `git checkout -b autoresearch/[command]-[YYYY-MM-DD]` (if it exists, append `-2`, `-3`, etc.)
5. **Create backup**: Copy the target to `.claude/metrics/backups/[command]-autoresearch-[timestamp].md.bak`
6. **Measure baseline**: Run the eval 3 times, take the median score. Log as `"iteration":0` in `autoresearch-runs.jsonl`
7. **Initialize state**: `consecutive_reverts = 0`, `escalation_level = 0`, `cumulative_kept = 0`, `cumulative_discarded = 0`, `cumulative_crashed = 0`, `cumulative_rejected = 0`

### Per-Iteration Loop

For each iteration `i` from 1 to max_iterations (or unlimited if `--no-limit`):

**Step 1: Read current state**
- Read the target command file (current version after any prior keeps)
- Read the eval checklist (immutable)
- Read the fixture input
- Review failing checks from the previous iteration (or the baseline)

**Step 2: Hypothesize improvement**
Based on failing checks and the escalation level, choose ONE edit:
- What check are we targeting?
- What edit type? (add rule, add example, restructure, promote, ban pattern)
- What specific change? (exact text to add/modify)

**Step 3: Apply edit**
- Use the Edit tool to make ONE change to the target file
- Count lines added and removed

**Step 4: Run eval**
Read `references/eval-runner.md` for the eval subagent protocol.

Spawn an eval subagent with:
- The CURRENT target command text (after the edit)
- The eval checklist
- The fixture input

Extract: score percentage, per-check results, failing check names.

**Step 5: Score and decide**
- Compare `score_after` to `score_before` (baseline or last kept score)
- Apply the simplicity criterion
- Determine the outcome: KEEP, DISCARD, CRASH, or REJECTED_COMPLEXITY

**Step 6: Act on decision**

| Decision | Git Action | State Update |
|----------|-----------|--------------|
| KEEP | `git add [target] && git commit` | `score_before = score_after`, `consecutive_reverts = 0`, `cumulative_kept += 1` |
| DISCARD | `git checkout -- [target]` | `consecutive_reverts += 1`, `cumulative_discarded += 1` |
| CRASH | Log crash, retry once simplified | `cumulative_crashed += 1` (`consecutive_reverts` unchanged) |
| REJECTED_COMPLEXITY | `git checkout -- [target]` | `consecutive_reverts += 1`, `cumulative_rejected += 1` |

**Step 7: Log iteration**
Append to `.claude/metrics/autoresearch-runs.jsonl`. See `references/scoring-protocol.md` for the format.

**Step 8: Check convergence**
- If score >= target for 3 consecutive scored iterations (KEEP or DISCARD, not CRASH): CONVERGED — exit loop. If the baseline is already >= target, run 3 verification iterations to confirm before converging.
- If consecutive_reverts triggers an escalation level change: log the escalation
- If consecutive_reverts >= 10 (after Level 3): STAGNATION — exit loop
- If max iterations reached: MAX_ITERATIONS — exit loop

**Step 9: Update session state**
Update `session-state/CURRENT.md` with the current iteration count, score, and status.

**Step 10: Continue**
Return to Step 1.

---

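The Step 6 state transitions can be sketched as (a sketch; the git actions and JSONL logging happen alongside, and the function name is hypothetical):

```python
def apply_decision(state: dict, decision: str, score_after: float) -> None:
    """Mutate loop-controller state per the Step 6 table."""
    if decision == "KEEP":
        state["score_before"] = score_after   # new baseline
        state["consecutive_reverts"] = 0
        state["cumulative_kept"] += 1
    elif decision == "DISCARD":
        state["consecutive_reverts"] += 1
        state["cumulative_discarded"] += 1
    elif decision == "CRASH":
        # Crashes never advance the stagnation counter.
        state["cumulative_crashed"] += 1
    elif decision == "REJECTED_COMPLEXITY":
        state["consecutive_reverts"] += 1
        state["cumulative_rejected"] += 1
```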
## FINAL REPORT

After the loop exits (for any reason), produce:

```
AUTORESEARCH COMPLETE — [command]

Reason: [CONVERGED / MAX_ITERATIONS / STAGNATION / COST_CAP]
Iterations: [N]
Branch: autoresearch/[command]-[date]

Score Progression:
  Baseline: [X]% ([N/M] checks)
  Final:    [Y]% ([N/M] checks)
  Delta:    [+/-Z]%

Results:
  Kept:      [N] commits
  Discarded: [N] reverts
  Crashed:   [N] transient failures
  Rejected:  [N] complexity rejections

Net Lines: [+/-N] across all kept changes
Highest Escalation: Level [0-3]

Kept Changes:
  1. [iteration N] [edit summary] (+X checks)
  2. [iteration N] [edit summary] (+X checks)
  ...

Still Failing:
  - [check_name]: [brief description of why it's hard]
  ...

Next Steps:
  - Review changes: git log autoresearch/[command]-[date]
  - Merge if satisfied: git merge autoresearch/[command]-[date]
  - Or cherry-pick specific commits
```

---

## SESSION STATE

Update `session-state/CURRENT.md` after each iteration:

```
AUTHORIZED_COMMAND: massu-autoresearch
Target: [command]
Branch: autoresearch/[command]-[date]
Iteration: [N]/[max]
Current Score: [X]%
Consecutive Reverts: [N]
Escalation Level: [0-3]
```

---

## Skill Contents

This skill is a folder. The following files are available for reference:

| File | Purpose | Read When |
|------|---------|-----------|
| `references/eval-runner.md` | Eval subagent spawn protocol | Before running any eval |
| `references/safety-rails.md` | Protected lines, backup, branch, cost cap, crash handling, escalation | Before any edit or on error |
| `references/scoring-protocol.md` | Score calculation, accept/reject, JSONL format, final summary | Before scoring or logging |

---

## START NOW

1. Parse `{{ARGUMENTS}}` for the command name and flags
2. Validate that the target, eval, and fixture exist
3. Create branch `autoresearch/[command]-[date]`
4. Backup the target file
5. Measure the baseline (3 eval runs, median)
6. Enter the iteration loop
7. On exit: produce the final report
8. **DO NOT ask "should I continue?" between iterations — just continue.**