@massu/core 0.5.0 → 0.6.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +40 -0
- package/agents/massu-architecture-reviewer.md +104 -0
- package/agents/massu-blast-radius-analyzer.md +84 -0
- package/agents/massu-competitive-scorer.md +126 -0
- package/agents/massu-help-sync.md +73 -0
- package/agents/massu-migration-writer.md +94 -0
- package/agents/massu-output-scorer.md +87 -0
- package/agents/massu-pattern-reviewer.md +84 -0
- package/agents/massu-plan-auditor.md +170 -0
- package/agents/massu-schema-sync-verifier.md +70 -0
- package/agents/massu-security-reviewer.md +98 -0
- package/agents/massu-ux-reviewer.md +106 -0
- package/commands/_shared-preamble.md +53 -23
- package/commands/_shared-references/auto-learning-protocol.md +71 -0
- package/commands/_shared-references/blast-radius-protocol.md +76 -0
- package/commands/_shared-references/security-pre-screen.md +64 -0
- package/commands/_shared-references/test-first-protocol.md +87 -0
- package/commands/_shared-references/verification-table.md +52 -0
- package/commands/massu-article-review.md +343 -0
- package/commands/massu-autoresearch/references/eval-runner.md +84 -0
- package/commands/massu-autoresearch/references/safety-rails.md +125 -0
- package/commands/massu-autoresearch/references/scoring-protocol.md +151 -0
- package/commands/massu-autoresearch.md +258 -0
- package/commands/massu-batch.md +44 -12
- package/commands/massu-bearings.md +42 -8
- package/commands/massu-checkpoint.md +588 -0
- package/commands/massu-ci-fix.md +2 -2
- package/commands/massu-command-health.md +132 -0
- package/commands/massu-command-improve.md +232 -0
- package/commands/massu-commit.md +205 -44
- package/commands/massu-create-plan.md +239 -57
- package/commands/massu-data/references/common-queries.md +79 -0
- package/commands/massu-data/references/table-guide.md +50 -0
- package/commands/massu-data.md +66 -0
- package/commands/massu-dead-code.md +29 -34
- package/commands/massu-debug/references/auto-learning.md +61 -0
- package/commands/massu-debug/references/codegraph-tracing.md +80 -0
- package/commands/massu-debug/references/common-shortcuts.md +98 -0
- package/commands/massu-debug/references/investigation-phases.md +294 -0
- package/commands/massu-debug/references/report-format.md +107 -0
- package/commands/massu-debug.md +105 -386
- package/commands/massu-docs.md +1 -1
- package/commands/massu-full-audit.md +61 -0
- package/commands/massu-gap-enhancement-analyzer.md +276 -16
- package/commands/massu-golden-path/references/approval-points.md +216 -0
- package/commands/massu-golden-path/references/competitive-mode.md +273 -0
- package/commands/massu-golden-path/references/error-handling.md +121 -0
- package/commands/massu-golden-path/references/phase-0-requirements.md +53 -0
- package/commands/massu-golden-path/references/phase-1-plan-creation.md +168 -0
- package/commands/massu-golden-path/references/phase-2-implementation.md +397 -0
- package/commands/massu-golden-path/references/phase-2.5-gap-analyzer.md +156 -0
- package/commands/massu-golden-path/references/phase-3-simplify.md +40 -0
- package/commands/massu-golden-path/references/phase-4-commit.md +94 -0
- package/commands/massu-golden-path/references/phase-5-push.md +116 -0
- package/commands/massu-golden-path/references/phase-5.5-production-verify.md +170 -0
- package/commands/massu-golden-path/references/phase-6-completion.md +113 -0
- package/commands/massu-golden-path/references/qa-evaluator-spec.md +137 -0
- package/commands/massu-golden-path/references/sprint-contract-protocol.md +117 -0
- package/commands/massu-golden-path/references/vr-visual-calibration.md +73 -0
- package/commands/massu-golden-path.md +114 -848
- package/commands/massu-guide.md +72 -69
- package/commands/massu-hooks.md +27 -12
- package/commands/massu-hotfix.md +221 -144
- package/commands/massu-incident.md +49 -20
- package/commands/massu-infra-audit.md +187 -0
- package/commands/massu-learning-audit.md +211 -0
- package/commands/massu-loop/references/auto-learning.md +49 -0
- package/commands/massu-loop/references/checkpoint-audit.md +40 -0
- package/commands/massu-loop/references/guardrails.md +17 -0
- package/commands/massu-loop/references/iteration-structure.md +115 -0
- package/commands/massu-loop/references/loop-controller.md +188 -0
- package/commands/massu-loop/references/plan-extraction.md +78 -0
- package/commands/massu-loop/references/vr-plan-spec.md +140 -0
- package/commands/massu-loop-playwright.md +9 -9
- package/commands/massu-loop.md +115 -670
- package/commands/massu-new-pattern.md +423 -0
- package/commands/massu-perf.md +422 -0
- package/commands/massu-plan-audit.md +1 -1
- package/commands/massu-plan.md +389 -122
- package/commands/massu-production-verify.md +433 -0
- package/commands/massu-push.md +62 -378
- package/commands/massu-recap.md +29 -3
- package/commands/massu-rollback.md +613 -0
- package/commands/massu-scaffold-hook.md +2 -4
- package/commands/massu-scaffold-page.md +2 -3
- package/commands/massu-scaffold-router.md +1 -2
- package/commands/massu-security.md +619 -0
- package/commands/massu-simplify.md +115 -85
- package/commands/massu-squirrels.md +2 -2
- package/commands/massu-tdd.md +38 -22
- package/commands/massu-test.md +3 -3
- package/commands/massu-type-mismatch-audit.md +469 -0
- package/commands/massu-ui-audit.md +587 -0
- package/commands/massu-verify-playwright.md +287 -32
- package/commands/massu-verify.md +150 -46
- package/dist/cli.js +146 -95
- package/package.json +6 -2
- package/patterns/build-patterns.md +302 -0
- package/patterns/component-patterns.md +246 -0
- package/patterns/display-patterns.md +185 -0
- package/patterns/form-patterns.md +890 -0
- package/patterns/integration-testing-checklist.md +445 -0
- package/patterns/security-patterns.md +219 -0
- package/patterns/testing-patterns.md +569 -0
- package/patterns/tool-routing.md +81 -0
- package/patterns/ui-patterns.md +371 -0
- package/protocols/plan-implementation.md +267 -0
- package/protocols/recovery.md +225 -0
- package/protocols/verification.md +404 -0
- package/reference/command-taxonomy.md +178 -0
- package/reference/cr-rules-reference.md +76 -0
- package/reference/hook-execution-order.md +148 -0
- package/reference/lessons-learned.md +175 -0
- package/reference/patterns-quickref.md +208 -0
- package/reference/standards.md +135 -0
- package/reference/subagents-reference.md +17 -0
- package/reference/vr-verification-reference.md +867 -0
- package/src/commands/install-commands.ts +149 -53
@@ -0,0 +1,84 @@
# Eval Runner Protocol

## Purpose

Defines how to spawn a subagent to simulate command execution and score the output against the eval checklist.

## Why Subagent-Based Eval?

Claude Code cannot programmatically invoke `/massu-*` commands. The `claude` CLI accepts prompts but doesn't have a `--skill` flag for direct skill invocation. A subagent with the command text injected as instructions is the most practical simulation.

## Eval Execution Steps

### Step 1: Read inputs

Read three files:

1. **Target command file** (current version): `.claude/commands/[command].md` (or `[command]/[command].md` for folder-based)
2. **Eval checklist** (immutable): `.claude/evals/[command].md`
3. **Fixture input**: `.claude/evals/fixtures/[command]/input-01.md`

### Step 2: Spawn eval subagent

Use the Agent tool with `subagent_type="general-purpose"` and a prompt structured as:

```
You are simulating the execution of a Claude Code command. Your job is to produce the output EXACTLY as the command specifies, including all required sections and formatting.

## Command Instructions

[PASTE FULL COMMAND FILE TEXT HERE]

## Input

[PASTE FIXTURE INPUT TEXT HERE]

## Your Task

Execute the command against the input above. Produce the complete output as if you were running the command for real. Include every section the command specifies. Do not skip optional sections — treat everything as required for this simulation.

Do NOT include any meta-commentary about the simulation. Just produce the command output.
```

### Step 3: Score the output

After the subagent returns its output, score each check from the eval checklist.

For each check in `.claude/evals/[command].md`:

1. Read the check's `pass_condition`
2. Evaluate the subagent output against that condition
3. Record `true` (pass) or `false` (fail)

Scoring rules:

- Checks are binary — no partial credit
- Section presence checks: the section header must exist AND contain substantive content (not just the header)
- Count checks (e.g., "at least 3 action items"): count actual items, not placeholders
- Conditional checks (e.g., "if score >= 40"): evaluate the condition first, then check

### Step 4: Calculate score

```
score = (passing_checks / total_checks) * 100
```

### Step 5: Return results

Return to the main loop:

1. `score`: percentage (e.g., 75.0)
2. `checks`: object mapping check_id to boolean (e.g., `{"has_executive_summary": true, "has_score_table": false, ...}`)
3. `failing_checks`: list of check_ids that failed
4. `status`: "success" or "crash" (if subagent errored or output was unparseable)
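Steps 3 through 5 can be sketched in Python (illustrative only: in the real protocol the agent judges each `pass_condition` itself; the callable predicates here stand in for that judgment):

```python
def score_eval(checks: dict, output: str) -> dict:
    """Score subagent output against an eval checklist.

    `checks` maps check_id -> predicate(output) -> bool. Each check is
    binary: no partial credit, no weighting.
    """
    results = {cid: bool(pred(output)) for cid, pred in checks.items()}
    passing = sum(results.values())
    return {
        "score": round(passing / len(results) * 100, 1),
        "checks": results,
        "failing_checks": [cid for cid, ok in results.items() if not ok],
        "status": "success",
    }

# Two hypothetical checks against a tiny output sample
demo = score_eval(
    {"has_summary": lambda o: "## Summary" in o,
     "has_table": lambda o: "|" in o},
    "## Summary\nAll good.",
)
```

The returned shape mirrors the four fields above, so the main loop can consume it directly.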

## Output Containment

The full subagent output is NOT retained in the main agent context. After scoring:

- Discard the full output text
- Keep only: score, per-check results, failing check names
- This prevents context window exhaustion during long runs (Karpathy insight #4)

## Crash Handling

If the subagent errors, times out, or produces output that cannot be scored, return `status: "crash"` to the main loop. See `safety-rails.md` for crash retry protocol.

@@ -0,0 +1,125 @@
# Safety Rails

## Protected Line Patterns

Lines containing ANY of these patterns are READ-ONLY. The agent may add content adjacent to them but MUST NOT modify or delete these lines:

- `CR-` (Canonical Rule references)
- `VR-` (Verification Requirement references)
- `NON-NEGOTIABLE`
- `NEVER` — protected when: (a) at start of line, (b) after bullet/dash (`- NEVER`, `* NEVER`), or (c) in ALL-CAPS imperative context (`NEVER use`, `NEVER edit`, `NEVER import`). NOT protected in lowercase prose (`should never`, `would never`, `can never`).
- `MANDATORY`
- `FORBIDDEN`
- `MUST NOT`

These patterns encode incident-learned safety rules. Modifying them to pass an eval check would optimize the metric while degrading real safety — a Goodhart's Law violation.

### Verification

Before applying any edit, grep the target file for protected patterns in the edit region. If the edit would modify a protected line, reject the edit and try a different approach.
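A minimal sketch of this pre-edit check (assumptions: the `NEVER` context rules are reduced to the start-of-line, bullet, and ALL-CAPS imperative cases listed above; real verification may need the full rule set):

```python
import re

# Mirrors the protected pattern list above (simplified NEVER handling)
PROTECTED = [
    r"CR-", r"VR-", r"NON-NEGOTIABLE", r"MANDATORY",
    r"FORBIDDEN", r"MUST NOT",
    r"(?:^|[-*]\s)NEVER\b",          # start of line or after a bullet
    r"NEVER (?:use|edit|import)",    # ALL-CAPS imperative context
]

def edit_is_safe(edit_region: str) -> bool:
    """Reject any edit whose region touches a protected line."""
    for line in edit_region.splitlines():
        if any(re.search(p, line) for p in PROTECTED):
            return False
    return True
```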

---

## Backup Protocol

Before the first iteration of any autoresearch run:

1. Create the backups directory if needed: `.claude/metrics/backups/`
2. Copy the target file to: `.claude/metrics/backups/[command]-autoresearch-[YYYY-MM-DD-HHmmss].md.bak`
3. Verify the backup exists and is non-empty

The backup is the ultimate rollback if the entire autoresearch branch needs to be abandoned.
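The three steps above, sketched in Python (the function name is illustrative; the real run performs these steps with shell tools during initialization):

```python
from datetime import datetime
from pathlib import Path

def backup_target(cmd: str, target_path: Path) -> Path:
    """Copy the target file to the backups directory before iteration 1."""
    backups = Path(".claude/metrics/backups")
    backups.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y-%m-%d-%H%M%S")
    bak = backups / f"{cmd}-autoresearch-{stamp}.md.bak"
    bak.write_bytes(target_path.read_bytes())
    # Verify the backup exists and is non-empty
    assert bak.stat().st_size > 0, "backup failed"
    return bak
```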

---

## Git Branch Protocol

All autoresearch commits go to a dedicated branch:

1. **Resolve target path**: Determine the actual target file path — `.claude/commands/[command].md` for flat commands, `.claude/commands/[command]/[command].md` for folder-based commands. Store as `TARGET_PATH` and use for ALL subsequent git add/checkout operations.
2. **Create branch**: `git checkout -b autoresearch/[command]-[YYYY-MM-DD]`. If the branch already exists (same command, same day), append a numeric suffix: `autoresearch/[command]-[YYYY-MM-DD]-2`, `-3`, etc.
3. **Commit on KEEP**: `git add [TARGET_PATH] && git commit -m "autoresearch([command]): [1-line summary]"`
4. **Revert on DISCARD**: `git checkout -- [TARGET_PATH]`
5. **After run**: User reviews the branch and decides to merge, cherry-pick, or discard

The branch naming convention allows multiple autoresearch runs on different commands without conflict.
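The suffix logic from step 2 can be isolated as a pure function (a sketch; `exists` stands in for a real branch check such as `git rev-parse --verify --quiet <branch>`):

```python
def branch_name(cmd: str, day: str, exists) -> str:
    """Pick autoresearch/[command]-[YYYY-MM-DD], appending -2, -3, ...
    while a branch for the same command and day already exists.
    `exists` is a predicate: branch name -> bool.
    """
    name = f"autoresearch/{cmd}-{day}"
    n = 2
    while exists(name):
        name = f"autoresearch/{cmd}-{day}-{n}"
        n += 1
    return name
```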

---

## Cost Tracking

Track estimated token usage per iteration:

- **Eval subagent prompt**: ~target file tokens + eval file tokens + fixture tokens
- **Eval subagent response**: ~2x fixture tokens (simulated output)
- **Main agent overhead**: ~500 tokens per iteration (scoring, logging, state)

Rough estimate: each iteration costs ~3x the target file size in tokens.

**Cost cap**: If cumulative estimated token usage exceeds 500K tokens, **stop the loop** and produce the final report. This is a hard backstop regardless of `--no-limit` mode — it is NOT a pause, it is a termination.

---

## Tristate Crash Handling

When an eval produces a CRASH result:

1. **Log the crash**: Record in `autoresearch-runs.jsonl` with `"action":"crashed"`
2. **Do NOT increment** the consecutive reverts counter
3. **Retry once**: On the retry, use a SIMPLIFIED edit (smaller change, fewer lines modified)
4. **If the retry also crashes**: Log as DISCARD, increment the consecutive reverts counter
5. **Continue the loop** — a single crash does not warrant stopping

Crash accumulation: if `cumulative_crashed >= 5` in a single run, warn that there may be a systemic issue with the eval or fixture.

---

## "Think Harder" Graduated Escalation

When consecutive reverts trigger escalation, the edit strategy changes:

### Level 0 (default)
Normal operation. Target the weakest failing check. Try standard edit types.

### Level 1 (3 consecutive reverts)
**Switch edit type.** If previous attempts used "add rule", try in this order:
1. Add a worked example (before/after demonstration)
2. Restructure the section (reorder for emphasis)
3. Add a banned anti-pattern
4. Promote an existing requirement higher in the file

### Level 2 (5 consecutive reverts)
**Full context reload.** Before the next edit:
1. Re-read the eval checklist from disk (not from memory)
2. Re-read the target file from disk
3. Review the changelog of ALL kept changes so far (from `autoresearch-runs.jsonl`)
4. Identify which failing check has been attempted most without success
5. Look for TWO previous near-miss approaches and try COMBINING them

### Level 3 (8 consecutive reverts)
**Radical restructuring.** Try structural changes, not content changes:
1. Reorder major sections of the target file
2. Merge two related sections into one focused section
3. Split an overloaded section into two specific sections
4. Convert abstract rules into concrete worked examples
5. Add a decision tree or flowchart for complex logic

### BAIL (10 consecutive reverts)
**Stagnation exit.** Log: "STAGNATION: 10 consecutive reverts after Level 3 escalation."
- Revert any pending changes
- Produce the final report
- Exit the loop
- This is the only way CR-37 triggers — after Level 3 has been attempted
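The escalation thresholds reduce to one mapping from the consecutive-revert counter to a level (a sketch; level names are illustrative labels for the tiers above):

```python
def escalation_level(consecutive_reverts: int) -> str:
    """Map the consecutive-revert counter to an escalation tier.
    The counter resets to 0 on any KEEP; crashes do not increment it.
    """
    if consecutive_reverts >= 10:
        return "BAIL"
    if consecutive_reverts >= 8:
        return "level_3"
    if consecutive_reverts >= 5:
        return "level_2"
    if consecutive_reverts >= 3:
        return "level_1"
    return "level_0"
```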

---

## Output Containment

Eval subagent output is captured and scored, NOT inlined into the main agent context.

The main agent context receives ONLY:
- Score percentage (e.g., 75.0%)
- Per-check pass/fail results (e.g., `has_executive_summary: true, has_score_table: false`)
- Names of failing checks

This is the Karpathy `> run.log 2>&1` pattern adapted for LLM context management.

@@ -0,0 +1,151 @@
# Scoring Protocol

## Score Calculation

```
score = (passing_checks / total_checks) * 100
```

Each eval check is binary: pass (true) or fail (false). No partial credit. No weighting.

---

## Baseline Measurement

Before any edits, establish the baseline:

1. Run the eval subagent 3 times with the same fixture and current command file
2. Score each run independently
3. Take the **median** score as the baseline
4. Log as iteration 0 in `autoresearch-runs.jsonl`

The median (not mean) accounts for subagent output variability without being skewed by a single outlier run.

**Note**: Only the baseline uses 3 runs. Per-iteration evals use a single run as an intentional cost tradeoff — running 3 evals per iteration would triple token usage. The single-run approach works because the simplicity gate and convergence criterion (3 consecutive passes) provide noise resistance.
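The baseline measurement is a median over three runs, which can be stated in two lines (a sketch; `run_eval` stands in for one full eval-subagent round trip):

```python
from statistics import median

def baseline_score(run_eval) -> float:
    """Baseline: median of 3 independent eval runs on the same fixture."""
    return median(run_eval() for _ in range(3))
```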

---

## Accept/Reject Threshold

After each iteration edit:

- `score_after > score_before` → candidate for KEEP (subject to simplicity gate)
- `score_after == score_before` → DISCARD (no improvement = no noise-driven drift) — **exception**: if `net_delta < 0` (deletion), proceed to simplicity gate instead
- `score_after < score_before` → DISCARD (regression)

**Strict greater-than** is required for non-deletion edits. Equal scores are rejected to prevent random-walk drift where the command changes without improving. The one exception is deletion edits (`net_delta < 0`) that maintain the score — these proceed to the simplicity gate, where the "deletion bonus" rule applies.

---

## Simplicity Gate

After scoring, before accepting a KEEP candidate:

1. Count `checks_improved`: how many checks flipped from false to true
2. Calculate `net_delta`: lines_added - lines_removed in the edit
3. Apply the simplicity criterion:

| checks_improved | net_delta | Decision | Log action |
|----------------|-----------|----------|------------|
| >= 2 | any | KEEP | `"kept"` |
| == 1 | > 0 | **REJECT** | `"rejected_complexity"` |
| == 1 | <= 0 | KEEP | `"kept"` |
| == 0 | < 0 | KEEP (if score maintained) | `"kept"` (simplification) |
| == 0 | >= 0 | DISCARD | `"discarded"` |

**Deletion bonus**: If `net_delta < 0` (the change removes lines) AND `score_after >= score_before`, ALWAYS KEEP. Karpathy: "Improvement from deleting code = definitely keep."
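The table collapses to one function (a sketch returning the log action, with the deletion bonus applied first since it overrides the other rows):

```python
def simplicity_gate(checks_improved: int, net_delta: int,
                    score_before: float, score_after: float) -> str:
    """Return the log action for a KEEP candidate per the table above."""
    # Deletion bonus: removing lines while maintaining score is always kept
    if net_delta < 0 and score_after >= score_before:
        return "kept"
    if checks_improved >= 2:
        return "kept"
    if checks_improved == 1:
        return "rejected_complexity" if net_delta > 0 else "kept"
    return "discarded"
```

Note how the gate encodes the asymmetry: a one-check improvement that adds lines is rejected, while the same improvement at zero or negative line cost is kept.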

---

## Convergence

The loop converges when score >= target (default 90%) for **3 consecutive iterations**, where each iteration includes running an eval and scoring. The 3-consecutive requirement applies to iterations that produce a score (KEEP or DISCARD), not CRASH iterations. Three consecutive readings at target eliminate the possibility of a lucky single run.

---

## Score Logging Format

Append one JSONL line per iteration to `.claude/metrics/autoresearch-runs.jsonl`:

```json
{
  "command": "[target command name]",
  "iteration": 1,
  "timestamp": "2026-03-20T10:30:00Z",
  "score_before": 62.5,
  "score_after": 75.0,
  "action": "kept",
  "edit_type": "add_worked_example",
  "edit_summary": "Added before/after example for gap analysis section",
  "checks": {
    "has_executive_summary": true,
    "has_score_table": true,
    "has_massu_comparison": true,
    "has_gap_analysis": false,
    "has_action_items": true,
    "has_source_credibility": true,
    "mentions_specific_commands": true,
    "has_implementation_path": false
  },
  "lines_added": 8,
  "lines_removed": 2,
  "net_delta": 6,
  "cumulative_kept": 1,
  "cumulative_discarded": 0,
  "cumulative_crashed": 0,
  "cumulative_rejected": 0,
  "escalation_level": 0,
  "consecutive_reverts": 0
}
```

### Field Definitions

| Field | Type | Description |
|-------|------|-------------|
| `command` | string | Target command name (e.g., "massu-article-review") |
| `iteration` | number | 0 = baseline, 1+ = edit iterations |
| `timestamp` | string | ISO 8601 timestamp |
| `score_before` | number\|null | Score percentage before this iteration's edit. `null` for baseline (iteration 0) since there is no prior score. |
| `score_after` | number | Score percentage after this iteration's edit |
| `action` | string | One of: `"baseline"`, `"kept"`, `"discarded"`, `"crashed"`, `"rejected_complexity"` |
| `edit_type` | string\|null | `null` for baseline/dry-run. Otherwise one of: `"add_rule"`, `"add_example"`, `"add_worked_example"`, `"restructure"`, `"promote"`, `"ban_pattern"`, `"merge_sections"`, `"split_section"`, `"reorder"`, `"simplify"` |
| `edit_summary` | string | One-line human-readable description of the change |
| `checks` | object | Map of check_id to boolean pass/fail |
| `lines_added` | number | Lines added by this edit |
| `lines_removed` | number | Lines removed by this edit |
| `net_delta` | number | `lines_added - lines_removed` |
| `cumulative_kept` | number | Running total of kept changes this run |
| `cumulative_discarded` | number | Running total of discarded changes this run |
| `cumulative_crashed` | number | Running total of crashed evals this run |
| `cumulative_rejected` | number | Running total of complexity-rejected edits this run |
| `escalation_level` | number | Current escalation level (0-3) |
| `consecutive_reverts` | number | Current consecutive revert count (resets on KEEP) |
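Appending a record is one JSON line per iteration (a sketch; the real run writes via shell, but the shape is the same):

```python
import json
from pathlib import Path

def log_iteration(entry: dict,
                  path: str = ".claude/metrics/autoresearch-runs.jsonl") -> None:
    """Append one iteration record as a single JSONL line."""
    p = Path(path)
    p.parent.mkdir(parents=True, exist_ok=True)
    with p.open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
```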

---

## Final Summary Format

After the loop exits, produce a summary:

```
Score Trajectory:
[iteration 0] 62.5% (baseline)
[iteration 1] 75.0% KEPT "Added worked example for gap analysis"
[iteration 2] 75.0% DISCARD "Tried adding source credibility template"
[iteration 3] 87.5% KEPT "Promoted source credibility to mandatory section"
...

Per-Check Progression:
Check                        Baseline  Final
has_executive_summary        PASS      PASS
has_score_table              PASS      PASS
has_massu_comparison         PASS      PASS
has_gap_analysis             FAIL      PASS   <-- improved
has_action_items             PASS      PASS
has_source_credibility       FAIL      PASS   <-- improved
mentions_specific_commands   PASS      PASS
has_implementation_path      FAIL      FAIL   (hardest check)
```

This summary shows the optimization journey and identifies which checks responded to edits and which are structurally difficult.

@@ -0,0 +1,258 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: massu-autoresearch
|
|
3
|
+
description: "When user wants autonomous command optimization, says 'autoresearch', 'optimize this command', 'run overnight improvement loop', or wants Karpathy-style iterative prompt optimization"
|
|
4
|
+
allowed-tools: Bash(*), Read(*), Write(*), Edit(*), Grep(*), Glob(*), Agent(*), Task(*)
|
|
5
|
+
---
|
|
6
|
+
name: massu-autoresearch
|
|
7
|
+
|
|
8
|
+
**Shared rules**: Read `.claude/commands/_shared-preamble.md` for POST-COMPACTION (CR-12), ENTERPRISE-GRADE (CR-14), AWS SECRETS (CR-5) rules.
|
|
9
|
+
|
|
10
|
+
# Massu Autoresearch: Autonomous Command Optimization Loop
|
|
11
|
+
|
|
12
|
+
## Objective
|
|
13
|
+
|
|
14
|
+
Iteratively improve a target command file by scoring output against an eval checklist, keeping improvements (git commit), reverting failures (git checkout), and repeating until convergence. Based on Karpathy's autoresearch pattern (42K GitHub stars).
|
|
15
|
+
|
|
16
|
+
---
|
|
17
|
+
|
|
18
|
+
## ARGUMENTS
|
|
19
|
+
|
|
20
|
+
```
|
|
21
|
+
/massu-autoresearch [command-name] # Default: 20 iterations, 90% target
|
|
22
|
+
/massu-autoresearch [command-name] --max-iterations N # Custom iteration cap
|
|
23
|
+
/massu-autoresearch [command-name] --target-score N # Custom target (default 90)
|
|
24
|
+
/massu-autoresearch [command-name] --no-limit # NEVER STOP mode (cost cap only)
|
|
25
|
+
```
|
|
26
|
+
|
|
27
|
+
**Arguments from $ARGUMENTS**: {{ARGUMENTS}}
|
|
28
|
+
|
|
29
|
+
---
|
|
30
|
+
|
|
31
|
+
## THREE-FILE ARCHITECTURE
|
|
32
|
+
|
|
33
|
+
| Role | File | Mutable? |
|
|
34
|
+
|------|------|----------|
|
|
35
|
+
| **Target** (train.py) | `.claude/commands/[command].md` (or `[command]/[command].md` for folder-based) | YES — agent edits this |
|
|
36
|
+
| **Eval** (prepare.py) | `.claude/evals/[command].md` | **NO — IMMUTABLE** |
|
|
37
|
+
| **Instructions** (program.md) | This file | **NO — agent follows this** |
|
|
38
|
+
|
|
39
|
+
The target command is the ONLY file the agent may edit. The eval checklist and these instructions are READ-ONLY.
|
|
40
|
+
|
|
41
|
+
---
|
|
42
|
+
|
|
43
|
+
## NON-NEGOTIABLE RULES
|
|
44
|
+
|
|
45
|
+
1. **ONE change per iteration** — single-variable perturbation only. Never batch multiple edits.
|
|
46
|
+
2. **NEVER edit the eval file** — `.claude/evals/[command].md` is immutable during runs.
|
|
47
|
+
3. **NEVER edit lines containing CR-*/VR-* references** — these encode incident-learned safety rules. See `references/safety-rails.md` for protected line patterns.
|
|
48
|
+
4. **ALWAYS backup before first iteration** — save original to `.claude/metrics/backups/[command]-autoresearch-[timestamp].md.bak`
|
|
49
|
+
5. **git commit on improvement / git checkout on regression** — tristate handling (see below).
|
|
50
|
+
6. **Tristate results**: KEEP / DISCARD / CRASH — crashes do NOT count toward stagnation.
|
|
51
|
+
7. **NEVER STOP mode** (when `--no-limit` passed): "Once the experiment loop has begun, do NOT pause to ask the human if you should continue." Run until convergence, cost cap (500K tokens), or manual interrupt (Ctrl+C). Default mode uses `--max-iterations` cap.
|
|
52
|
+
|
|
53
|
+
---
|
|
54
|
+
|
|
55
|
+
## TRISTATE RESULTS
|
|
56
|
+
|
|
57
|
+
Every iteration produces one of three outcomes:
|
|
58
|
+
|
|
59
|
+
| Outcome | Condition | Action |
|
|
60
|
+
|---------|-----------|--------|
|
|
61
|
+
| **KEEP** | `score_after > score_before` AND passes simplicity gate | `git add [target] && git commit -m "autoresearch([cmd]): [summary]"` — new baseline |
|
|
62
|
+
| **DISCARD** | `score_after <= score_before` OR fails simplicity gate | `git checkout -- [target]` — revert to baseline |
|
|
63
|
+
| **CRASH** | Eval subagent errored or produced unparseable output | Log the crash. Do NOT count toward stagnation. Retry once with simplified edit. If still failing, count as DISCARD. |
|
|
64
|
+
|
|
65
|
+
Crash handling details: see `references/safety-rails.md`.
|
|
66
|
+
|
|
67
|
+
---
|
|
68
|
+
|
|
69
|
+
## SIMPLICITY CRITERION
|
|
70
|
+
|
|
71
|
+
Apply the simplicity gate per `references/scoring-protocol.md`. Key principle (Karpathy): "Marginal improvement + added complexity = reject. Improvement from deleting code = definitely keep."
|
|
72
|
+
|
|
73
|
+
---
|
|
74
|
+
|
|
75
|
+
## "THINK HARDER" GRADUATED ESCALATION
|
|
76
|
+
|
|
77
|
+
Replaces CR-37 immediate bail. When stuck, escalate before giving up. Consecutive revert counter resets to 0 after any KEEP.
|
|
78
|
+
|
|
79
|
+
| Level | Trigger |
|
|
80
|
+
|-------|---------|
|
|
81
|
+
| **Level 1** | 3 consecutive reverts |
|
|
82
|
+
| **Level 2** | 5 consecutive reverts |
|
|
83
|
+
| **Level 3** | 8 consecutive reverts |
|
|
84
|
+
| **BAIL** | 10 consecutive reverts |
|
|
85
|
+
|
|
86
|
+
Per-level strategies: see `references/safety-rails.md`.
|
|
87
|
+
|
|
88
|
+
---
|
|
89
|
+
|
|
90
|
+
## CONVERGENCE CRITERIA
|
|
91
|
+
|
|
92
|
+
The loop stops when ANY of these conditions is met:
|
|
93
|
+
|
|
94
|
+
| Condition | Default | Overridable? |
|
|
95
|
+
|-----------|---------|-------------|
|
|
96
|
+
| Score >= target for 3 consecutive iterations | 90% | `--target-score N` |
|
|
97
|
+
| Max iterations reached | 20 | `--max-iterations N` or `--no-limit` |
|
|
98
|
+
| Stagnation bail (10 consecutive reverts after Level 3) | Always active | Not overridable |
|
|
99
|
+
| Cost cap (estimated cumulative token usage) | 500K tokens | Not overridable |
|
|
100
|
+
|
|
101
|
+
---
|
|
102
|
+
|
|
103
|
+
## LOOP CONTROLLER
|
|
104
|
+
|
|
105
|
+
### Initialization (runs once)
|
|
106
|
+
|
|
107
|
+
1. **Parse arguments**: Extract command name, max iterations, target score, no-limit flag
|
|
108
|
+
2. **Validate files exist**:
|
|
109
|
+
- Target: `.claude/commands/[command].md` (or `.claude/commands/[command]/[command].md` for folder-based)
|
|
110
|
+
- Eval: `.claude/evals/[command].md`
|
|
111
|
+
- Fixture: `.claude/evals/fixtures/[command]/input-01.md`
|
|
112
|
+
3. **Resolve target path**: Determine actual file — `.claude/commands/[command].md` (flat) or `.claude/commands/[command]/[command].md` (folder-based). Store as `TARGET_PATH` for all git operations.
|
|
113
|
+
4. **Create git branch**: `git checkout -b autoresearch/[command]-[YYYY-MM-DD]` (if exists, append `-2`, `-3`, etc.)
|
|
114
|
+
5. **Create backup**: Copy target to `.claude/metrics/backups/[command]-autoresearch-[timestamp].md.bak`
|
|
115
|
+
6. **Measure baseline**: Run eval 3 times, take median score. Log as `"iteration":0` in `autoresearch-runs.jsonl`
|
|
116
|
+
7. **Initialize state**: `consecutive_reverts = 0`, `escalation_level = 0`, `cumulative_kept = 0`, `cumulative_discarded = 0`, `cumulative_crashed = 0`, `cumulative_rejected = 0`

### Per-Iteration Loop

For each iteration `i` from 1 to max_iterations (or unlimited if `--no-limit`):

**Step 1: Read current state**
- Read target command file (current version after any prior keeps)
- Read eval checklist (immutable)
- Read fixture input
- Review failing checks from previous iteration (or baseline)

**Step 2: Hypothesize improvement**
Based on failing checks and escalation level, choose ONE edit:
- What check are we targeting?
- What edit type? (add rule, add example, restructure, promote, ban pattern)
- What specific change? (exact text to add/modify)

**Step 3: Apply edit**
- Use Edit tool to make ONE change to the target file
- Count lines added and removed

**Step 4: Run eval**
Read `references/eval-runner.md` for the eval subagent protocol.

Spawn an eval subagent with:
- The CURRENT target command text (after edit)
- The eval checklist
- The fixture input

Extract: score percentage, per-check results, failing check names.

**Step 5: Score and decide**
- Compare `score_after` to `score_before` (baseline or last kept score)
- Apply simplicity criterion
- Determine outcome: KEEP, DISCARD, CRASH, or REJECTED_COMPLEXITY

**Step 6: Act on decision**

| Decision | Git Action | State Update |
|----------|-----------|--------------|
| KEEP | `git add [target] && git commit` | `score_before = score_after`, `consecutive_reverts = 0`, `cumulative_kept += 1` |
| DISCARD | `git checkout -- [target]` | `consecutive_reverts += 1`, `cumulative_discarded += 1` |
| CRASH | Log crash, retry once simplified | `cumulative_crashed += 1` (consecutive_reverts unchanged) |
| REJECTED_COMPLEXITY | `git checkout -- [target]` | `consecutive_reverts += 1`, `cumulative_rejected += 1` |

**Step 7: Log iteration**
Append to `.claude/metrics/autoresearch-runs.jsonl`. See `references/scoring-protocol.md` for format.

**Step 8: Check convergence**
- If score >= target for 3 consecutive scored iterations (KEEP or DISCARD, not CRASH): CONVERGED — exit loop. If baseline already >= target, run 3 verification iterations to confirm before converging.
- If consecutive_reverts triggers escalation level change: log escalation
- If consecutive_reverts >= 10 (after Level 3): STAGNATION — exit loop
- If max iterations reached: MAX_ITERATIONS — exit loop

**Step 9: Update session state**
Update `session-state/CURRENT.md` with current iteration count, score, and status.

**Step 10: Continue**
Return to Step 1.

---

## FINAL REPORT

After the loop exits (any reason), produce:

```
AUTORESEARCH COMPLETE — [command]

Reason: [CONVERGED / MAX_ITERATIONS / STAGNATION / COST_CAP]
Iterations: [N]
Branch: autoresearch/[command]-[date]

Score Progression:
Baseline: [X]% ([N/M] checks)
Final: [Y]% ([N/M] checks)
Delta: [+/-Z]%

Results:
Kept: [N] commits
Discarded: [N] reverts
Crashed: [N] transient failures
Rejected: [N] complexity rejections

Net Lines: [+/-N] across all kept changes
Highest Escalation: Level [0-3]

Kept Changes:
1. [iteration N] [edit summary] (+X checks)
2. [iteration N] [edit summary] (+X checks)
...

Still Failing:
- [check_name]: [brief description of why it's hard]
...

Next Steps:
- Review changes: git log autoresearch/[command]-[date]
- Merge if satisfied: git merge autoresearch/[command]-[date]
- Or cherry-pick specific commits
```

---

## SESSION STATE

Update `session-state/CURRENT.md` after each iteration:

```
AUTHORIZED_COMMAND: massu-autoresearch
Target: [command]
Branch: autoresearch/[command]-[date]
Iteration: [N]/[max]
Current Score: [X]%
Consecutive Reverts: [N]
Escalation Level: [0-3]
```
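
Writing that block can be as simple as the following sketch; the field values come from the loop state, and the function name is hypothetical:

```python
from pathlib import Path

def write_session_state(path: Path, command: str, branch: str,
                        iteration: int, max_iter: int, score: float,
                        reverts: int, level: int) -> None:
    """Overwrite session-state/CURRENT.md with the current loop status."""
    path.write_text(
        f"AUTHORIZED_COMMAND: massu-autoresearch\n"
        f"Target: {command}\n"
        f"Branch: {branch}\n"
        f"Iteration: {iteration}/{max_iter}\n"
        f"Current Score: {score:.0f}%\n"
        f"Consecutive Reverts: {reverts}\n"
        f"Escalation Level: {level}\n"
    )
```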

---

## Skill Contents

This skill is a folder. The following files are available for reference:

| File | Purpose | Read When |
|------|---------|-----------|
| `references/eval-runner.md` | Eval subagent spawn protocol | Before running any eval |
| `references/safety-rails.md` | Protected lines, backup, branch, cost cap, crash handling, escalation | Before any edit or on error |
| `references/scoring-protocol.md` | Score calculation, accept/reject, JSONL format, final summary | Before scoring or logging |

---

## START NOW

1. Parse `{{ARGUMENTS}}` for command name, flags
2. Validate target + eval + fixture exist
3. Create branch `autoresearch/[command]-[date]`
4. Backup target file
5. Measure baseline (3 eval runs, median)
6. Enter iteration loop
7. On exit: produce final report
8. **DO NOT ask "should I continue?" between iterations — just continue.**