maxsimcli 3.5.3 → 3.7.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/dist/.tsbuildinfo +1 -1
- package/dist/assets/CHANGELOG.md +21 -0
- package/dist/assets/dashboard/server.js +1 -1
- package/dist/assets/templates/agents/maxsim-code-reviewer.md +169 -0
- package/dist/assets/templates/agents/maxsim-debugger.md +47 -0
- package/dist/assets/templates/agents/maxsim-executor.md +113 -0
- package/dist/assets/templates/agents/maxsim-phase-researcher.md +46 -0
- package/dist/assets/templates/agents/maxsim-plan-checker.md +45 -0
- package/dist/assets/templates/agents/maxsim-planner.md +48 -0
- package/dist/assets/templates/agents/maxsim-spec-reviewer.md +150 -0
- package/dist/assets/templates/agents/maxsim-verifier.md +43 -0
- package/dist/assets/templates/commands/maxsim/init-existing.md +42 -0
- package/dist/assets/templates/skills/systematic-debugging/SKILL.md +118 -0
- package/dist/assets/templates/skills/tdd/SKILL.md +118 -0
- package/dist/assets/templates/skills/verification-before-completion/SKILL.md +102 -0
- package/dist/assets/templates/workflows/init-existing.md +1099 -0
- package/dist/cli.cjs +55 -4
- package/dist/cli.cjs.map +1 -1
- package/dist/cli.js +2 -1
- package/dist/cli.js.map +1 -1
- package/dist/core/core.js +1 -1
- package/dist/core/core.js.map +1 -1
- package/dist/core/index.d.ts +2 -2
- package/dist/core/index.d.ts.map +1 -1
- package/dist/core/index.js +2 -1
- package/dist/core/index.js.map +1 -1
- package/dist/core/init.d.ts +24 -2
- package/dist/core/init.d.ts.map +1 -1
- package/dist/core/init.js +61 -0
- package/dist/core/init.js.map +1 -1
- package/dist/core/roadmap.js +2 -2
- package/dist/core/roadmap.js.map +1 -1
- package/dist/install.cjs +38 -0
- package/dist/install.cjs.map +1 -1
- package/dist/install.js +49 -0
- package/dist/install.js.map +1 -1
- package/package.json +1 -1
@@ -34,6 +34,17 @@ Goal-backward verification starts from the outcome and works backwards:
 3. What must be WIRED for those artifacts to function?
 
 Then verify each level against the actual codebase.
+
+**Evidence Gate:** Every verification finding must be backed by evidence:
+
+```
+CLAIM: [what you are verifying]
+EVIDENCE: [exact command or file read performed]
+OUTPUT: [relevant excerpt of actual output]
+VERDICT: PASS | FAIL
+```
+
+Do NOT state "verified" without producing an evidence block. Do NOT trust SUMMARY.md claims — verify against actual code and command output.
 </core_principle>
 
 <verification_process>
@@ -588,6 +599,38 @@ return <div>No messages</div> // Always shows "no messages"
 
 </stub_detection_patterns>
 
+<anti_rationalization>
+
+## Iron Law
+
+<HARD-GATE>
+NO VERIFICATION PASS WITHOUT INDEPENDENT EVIDENCE FOR EVERY TRUTH.
+SUMMARY.md says it's done. CODE says otherwise. Trust the code.
+</HARD-GATE>
+
+## Common Rationalizations — REJECT THESE
+
+| Excuse | Why It Violates the Rule |
+|--------|--------------------------|
+| "SUMMARY says it's done" | SUMMARYs document what Claude SAID. You verify what EXISTS. |
+| "Task completed = goal achieved" | Task completion ≠ goal achievement. Verify the goal. |
+| "Tests pass = requirements met" | Tests can pass with incomplete implementation. Check requirements individually. |
+| "I trust the executor" | Trust is not verification. Check the code yourself. |
+| "The build succeeds" | A successful build does not prove functional correctness. |
+| "Most truths hold" | ALL truths must hold. Partial ≠ complete. |
+
+## Red Flags — STOP and reassess if you catch yourself:
+
+- About to mark a truth as "verified" without reading the actual code
+- Trusting SUMMARY.md claims without grep/read verification
+- Skipping a truth because "it was tested"
+- Writing "PASS" before checking every must_have individually
+- Feeling rushed to complete verification quickly
+
+**If any red flag triggers: STOP. Read the code. Run the command. Produce the evidence block. THEN make the claim.**
+
+</anti_rationalization>
+
 <success_criteria>
 
 - [ ] Previous VERIFICATION.md checked (Step 0)
@@ -0,0 +1,42 @@
+---
+name: maxsim:init-existing
+description: Initialize MAXSIM in an existing project with codebase scanning and smart defaults
+argument-hint: "[--auto]"
+allowed-tools:
+  - Read
+  - Bash
+  - Write
+  - Task
+  - AskUserQuestion
+---
+<context>
+**Flags:**
+- `--auto` — Automatic mode. Runs full codebase scan, infers everything from code, creates all docs without interaction. Review recommended after auto mode.
+</context>
+
+<objective>
+Initialize MAXSIM in an existing codebase through a scan-first flow: codebase analysis, conflict resolution, scan-informed questioning, and stage-aware document generation.
+
+**Creates:**
+- `.planning/codebase/` — full codebase analysis (4 mapper agents)
+- `.planning/PROJECT.md` — project context with current state summary
+- `.planning/config.json` — workflow preferences
+- `.planning/REQUIREMENTS.md` — stage-aware requirements
+- `.planning/ROADMAP.md` — milestone + suggested phases
+- `.planning/STATE.md` — pre-populated project memory
+
+**After this command:** Run `/maxsim:plan-phase 1` to start execution.
+</objective>
+
+<execution_context>
+@./workflows/init-existing.md
+@./references/questioning.md
+@./references/ui-brand.md
+@./templates/project.md
+@./templates/requirements.md
+</execution_context>
+
+<process>
+Execute the init-existing workflow from @./workflows/init-existing.md end-to-end.
+Preserve all workflow gates (conflict resolution, scan completion, validation, approvals, commits).
+</process>
@@ -0,0 +1,118 @@
+---
+name: systematic-debugging
+description: Use when encountering any bug, test failure, or unexpected behavior — requires root cause investigation before attempting any fix
+---
+
+# Systematic Debugging
+
+Random fixes waste time and create new bugs. Find the root cause first.
+
+**If you have not identified the root cause, you are guessing — not debugging.**
+
+## The Iron Law
+
+<HARD-GATE>
+NO FIX ATTEMPTS WITHOUT UNDERSTANDING ROOT CAUSE.
+If you have not completed the REPRODUCE and HYPOTHESIZE steps, you CANNOT propose a fix.
+"Let me just try this" is guessing, not debugging.
+Violating this rule is a violation — not a time-saving shortcut.
+</HARD-GATE>
+
+## The Gate Function
+
+Follow these steps IN ORDER for every bug, test failure, or unexpected behavior.
+
+### 1. REPRODUCE — Confirm the Problem
+
+- Run the failing command or test. Capture the EXACT error output.
+- Can you trigger it reliably? What are the exact steps?
+- If not reproducible: gather more data — do not guess.
+
+```bash
+# Example: reproduce a test failure
+npx vitest run path/to/failing.test.ts
+```
+
+### 2. HYPOTHESIZE — Form a Theory
+
+- Read the error message COMPLETELY (stack trace, line numbers, exit codes)
+- Check recent changes: `git diff`, recent commits, new dependencies
+- Trace data flow: where does the bad value originate?
+- State your hypothesis clearly: "I think X is the root cause because Y"
+
+### 3. ISOLATE — Narrow the Scope
+
+- Find the SMALLEST reproduction case
+- In multi-component systems, add diagnostic logging at each boundary
+- Identify which SPECIFIC layer or component is failing
+- Compare against working examples in the codebase
+
+### 4. VERIFY — Test Your Hypothesis
+
+- Make the SMALLEST possible change to test your hypothesis
+- Change ONE variable at a time — never multiple things simultaneously
+- If hypothesis is wrong: form a NEW hypothesis, do not stack fixes
+
+### 5. FIX — Address the Root Cause
+
+- Write a failing test that reproduces the bug (see TDD skill)
+- Implement a SINGLE fix that addresses the root cause
+- No "while I'm here" improvements — fix only the identified issue
+
+### 6. CONFIRM — Verify the Fix
+
+- Run the original failing test: it must now pass
+- Run the full test suite: no regressions
+- Verify the original error no longer occurs
+
+```bash
+# Confirm the specific fix
+npx vitest run path/to/fixed.test.ts
+# Confirm no regressions
+npx vitest run
+```
+
+## Common Rationalizations — REJECT THESE
+
+| Excuse | Why It Violates the Rule |
+|--------|--------------------------|
+| "I think I know what it is" | Thinking is not evidence. Reproduce first, then hypothesize. |
+| "Let me just try this fix" | "Just try" = guessing. You have skipped REPRODUCE and HYPOTHESIZE. |
+| "Quick patch for now, investigate later" | "Later" never comes. Patches mask the real problem. |
+| "Multiple changes at once saves time" | You cannot isolate what worked. You will create new bugs. |
+| "The issue is simple, I don't need the process" | Simple bugs have root causes too. The process is fast for simple bugs. |
+| "I'm under time pressure" | Systematic debugging IS faster than guess-and-check thrashing. |
+| "The reference is too long, I'll skim it" | Partial understanding guarantees partial fixes. Read it completely. |
+
+## Red Flags — STOP If You Catch Yourself:
+
+- Changing code before reproducing the error
+- Proposing a fix before reading the full error message and stack trace
+- Trying random fixes hoping one will work
+- Changing multiple things simultaneously
+- Saying "it's probably X" without evidence
+- Applying a fix that did not work, then adding another fix on top
+- On your 3rd failed fix attempt (this signals an architectural problem — escalate)
+
+**If any red flag triggers: STOP. Return to step 1 (REPRODUCE).**
+
+**If 3+ fix attempts have failed:** The issue is likely architectural, not a simple bug. Document what you have tried and escalate to the user for a design decision.
+
+## Verification Checklist
+
+Before claiming a bug is fixed, confirm:
+
+- [ ] The original error has been reproduced reliably
+- [ ] Root cause has been identified with evidence (not guessed)
+- [ ] A failing test reproduces the bug
+- [ ] A single, targeted fix addresses the root cause
+- [ ] The failing test now passes
+- [ ] The full test suite passes (no regressions)
+- [ ] The original error no longer occurs when running the original steps
+
+## Debugging in MAXSIM Context
+
+When debugging during plan execution, MAXSIM deviation rules apply:
+- **Rule 1 (Auto-fix bugs):** You may auto-fix bugs found during execution, but you must still follow this debugging process.
+- **Rule 4 (Architectural changes):** If 3+ fix attempts fail, STOP and return a checkpoint — this is an architectural decision for the user.
+- Track all debugging deviations for SUMMARY.md documentation.
@@ -0,0 +1,118 @@
+---
+name: tdd
+description: Use when implementing any feature or bug fix — requires writing a failing test before any implementation code
+---
+
+# Test-Driven Development (TDD)
+
+Write the test first. Watch it fail. Write minimal code to pass. Clean up.
+
+**If you did not watch the test fail, you do not know if it tests the right thing.**
+
+## The Iron Law
+
+<HARD-GATE>
+NO IMPLEMENTATION CODE WITHOUT A FAILING TEST FIRST.
+If you wrote production code before the test, DELETE IT. Start over.
+No exceptions. No "I'll add tests after." No "keep as reference."
+Violating this rule is a violation — not a judgment call.
+</HARD-GATE>
+
+## The Gate Function
+
+Follow this cycle for every behavior change, feature addition, or bug fix.
+
+### 1. RED — Write Failing Test
+
+- Write ONE minimal test that describes the desired behavior
+- Test name describes what SHOULD happen, not implementation details
+- Use real code paths — mocks only when unavoidable (external APIs, databases)
+
+### 2. VERIFY RED — Run the Test
+
+```bash
+# Run the test suite for this file
+npx vitest run path/to/test.test.ts
+```
+
+- Test MUST fail (not error — fail with an assertion)
+- Failure message must match the missing behavior
+- If test passes immediately: you are testing existing behavior — rewrite it
+
+### 3. GREEN — Write Minimal Code
+
+- Write the SIMPLEST code that makes the test pass
+- Do NOT add features the test does not require
+- Do NOT refactor yet — that comes next
+
+### 4. VERIFY GREEN — Run All Tests
+
+```bash
+npx vitest run
+```
+
+- The new test MUST pass
+- ALL existing tests MUST still pass
+- If any test fails: fix code, not tests
+
+### 5. REFACTOR — Clean Up (Tests Still Green)
+
+- Remove duplication, improve names, extract helpers
+- Run tests after every change — they must stay green
+- Do NOT add new behavior during refactor
+
+### 6. REPEAT — Next failing test for next behavior
+
+## Common Rationalizations — REJECT THESE
+
+| Excuse | Why It Violates the Rule |
+|--------|--------------------------|
+| "Too simple to test" | Simple code breaks. The test takes 30 seconds to write. |
+| "I'll add tests after" | Tests written after pass immediately — they prove nothing. |
+| "The test framework isn't set up yet" | Set it up. That is part of the task, not a reason to skip. |
+| "I know the code works" | Knowledge is not evidence. A passing test is evidence. |
+| "TDD is slower for this task" | TDD is faster than debugging. Every "quick skip" creates debt. |
+| "Let me keep the code as reference" | You will adapt it instead of writing test-first. Delete means delete. |
+| "I need to explore the design first" | Explore, then throw it away. Start implementation with TDD. |
+
+## Red Flags — STOP If You Catch Yourself:
+
+- Writing implementation code before writing a test
+- Writing a test that passes on the first run (you are testing existing behavior)
+- Skipping the VERIFY RED step ("I know it will fail")
+- Adding features beyond what the current test requires
+- Skipping the REFACTOR step to save time
+- Rationalizing "just this once" or "this is different"
+- Keeping pre-TDD code "as reference" while writing tests
+
+**If any red flag triggers: STOP. Delete the implementation. Write the test first.**
+
+## Verification Checklist
+
+Before claiming TDD compliance, confirm:
+
+- [ ] Every new function/method has a corresponding test
+- [ ] Each test was written BEFORE its implementation
+- [ ] Each test was observed to FAIL before implementation was written
+- [ ] Each test failed for the expected reason (missing behavior, not syntax error)
+- [ ] Minimal code was written to pass each test
+- [ ] All tests pass after implementation
+- [ ] Refactoring (if any) did not break any tests
+
+Cannot check all boxes? You skipped TDD. Start over.
+
+## When Stuck
+
+| Problem | Solution |
+|---------|----------|
+| Don't know how to test it | Write the assertion first. What should the output be? |
+| Test setup is too complex | The design is too complex. Simplify the interface. |
+| Must mock everything | Code is too coupled. Use dependency injection. |
+| Existing code has no tests | Add tests for the code you are changing. Start the cycle now. |
+
+## Integration with MAXSIM
+
+In MAXSIM plan execution, tasks marked `tdd="true"` follow this cycle with per-step commits:
+- **RED commit:** `test({phase}-{plan}): add failing test for [feature]`
+- **GREEN commit:** `feat({phase}-{plan}): implement [feature]`
+- **REFACTOR commit (if changes made):** `refactor({phase}-{plan}): clean up [feature]`
@@ -0,0 +1,102 @@
+---
+name: verification-before-completion
+description: Use before claiming any work is complete, fixed, or passing — requires running verification commands and reading output before making success claims
+---
+
+# Verification Before Completion
+
+Claiming work is complete without verification is dishonesty, not efficiency.
+
+**Evidence before claims, always.**
+
+## The Iron Law
+
+<HARD-GATE>
+NO COMPLETION CLAIMS WITHOUT FRESH VERIFICATION EVIDENCE.
+If you have not run the verification command in this turn, you CANNOT claim it passes.
+"Should work" is not evidence. "I'm confident" is not evidence.
+Violating this rule is a violation — not a special case.
+</HARD-GATE>
+
+## The Gate Function
+
+BEFORE claiming any status, expressing satisfaction, or marking a task done:
+
+1. **IDENTIFY:** What command proves this claim?
+2. **RUN:** Execute the FULL command (fresh, in this turn — not a previous run)
+3. **READ:** Read the FULL output. Check the exit code. Count failures.
+4. **VERIFY:** Does the output actually confirm the claim?
+   - If NO: State the actual status with evidence
+   - If YES: State the claim WITH the evidence
+5. **CLAIM:** Only now may you assert completion
+
+**Skip any step = lying, not verifying.**
+
+### Evidence Block Format
+
+When claiming task completion, build completion, or test passage, produce:
+
+```
+CLAIM: [what you are claiming]
+EVIDENCE: [exact command run in this turn]
+OUTPUT: [relevant excerpt of actual output]
+VERDICT: PASS | FAIL
+```
+
+This format is required for task completion claims in MAXSIM plan execution. It is NOT required for intermediate status updates like "I have read the file" or "here is the plan."
+
+## Common Rationalizations — REJECT THESE
+
+| Excuse | Why It Violates the Rule |
+|--------|--------------------------|
+| "Should work now" | "Should" is not evidence. RUN the command. |
+| "I'm confident in the logic" | Confidence is not evidence. Run it. |
+| "The linter passed" | Linter passing does not mean tests pass or build succeeds. |
+| "Just this once" | NO EXCEPTIONS. This is the rule, not a guideline. |
+| "I only changed one line" | One line can break everything. Verify. |
+| "The subagent reported success" | Trust test output and VCS diffs, not agent reports. |
+| "Partial check is enough" | Partial proves nothing about the unchecked parts. |
+
+## Red Flags — STOP If You Catch Yourself:
+
+- Using "should", "probably", "seems to", or "looks good" about unverified work
+- Expressing satisfaction ("Great!", "Perfect!", "Done!") before running verification
+- About to commit or push without running the test/build command in THIS turn
+- Trusting a subagent's completion report without independent verification
+- Thinking "the last run was clean, I only changed one line"
+- About to mark a MAXSIM task as done without running the `<verify>` block
+- Relying on a previous turn's test output as current evidence
+
+**If any red flag triggers: STOP. Run the command. Read the output. THEN make the claim.**
+
+## What Counts as Verification
+
+| Claim | Requires | NOT Sufficient |
+|-------|----------|----------------|
+| "Tests pass" | Test command output showing 0 failures | Previous run, "should pass", partial run |
+| "Build succeeds" | Build command with exit code 0 | Linter passing, "logs look clean" |
+| "Bug is fixed" | Original failing test now passes | "Code changed, assumed fixed" |
+| "Task is complete" | All done criteria checked with evidence | "I implemented everything in the plan" |
+| "No regressions" | Full test suite passing | "I only changed one file" |
+
+## Verification Checklist
+
+Before marking any work as complete:
+
+- [ ] Identified the verification command for every claim
+- [ ] Ran each verification command fresh in this turn
+- [ ] Read the full output (not just the summary line)
+- [ ] Checked exit codes (0 = success, non-zero = failure)
+- [ ] Evidence supports every completion claim
+- [ ] No "should", "probably", or "seems to" in your completion statement
+- [ ] Evidence block produced for the task completion claim
+
+## In MAXSIM Plan Execution
+
+The executor's task commit protocol requires verification BEFORE committing:
+1. Run the task's `<verify>` block (automated checks)
+2. Confirm the `<done>` criteria are met with evidence
+3. Produce an evidence block for the task completion
+4. Only then: stage files and commit
+
+The verifier agent independently re-checks all claims — do not assume the verifier will catch what you missed.